My crawler gets a 403, but the page loads fine in a browser. How do I disguise the crawler?

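I refreshed the page in my browser 3 times per second for a full minute and never got a 403, yet my crawler, which even calls sleep(3); between requests, still gets a 403.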

It could be a Cookie or Referer issue. It's best to enable verbose mode and post the logs here. Here is a sample of how to add logging:
$fp_err = fopen('verbose_log.txt', 'ab+');
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_STDERR, $fp_err);
Once done, post the contents of verbose_log.txt here.
//Ali

Here is my log. Can you read Chinese? I think I'd better translate my question here: I disguise my crawler as a browser, but it still gets a 403 error, while visiting the site with my browser works fine even when I refresh quite frequently. How can I handle it? Thanks a lot!
{
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
}

Above I only pasted the part where the problem occurs. Here is the whole log:
{
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044639/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 23684
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="mRyYY+c5VLs"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044639"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:01 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044640/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 25035
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="FVAMqO5XkaQ"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044640"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:02 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044641/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
/* same pattern as the '200 OK' responses above */
// here is where the problem occurs
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044907/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com

It seems to be caused by cookies, doesn't it? How can I handle it?

Yes, it is due to cookies. You can use something like the following to handle cookies in cURL:
....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
....
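As a rough illustration of the two options above, here is a minimal sketch that sets up one cURL handle with cookie handling and reuses it for several requests. The helper names `make_crawler_handle` and `fetch_page` and the cookie file path are made up for this example, and whether persisted cookies are enough for a particular site is not something the sketch can guarantee.

```php
<?php
// Sketch only: helper names and the cookie path are illustrative assumptions.

function make_crawler_handle($cookieFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Load any cookies saved earlier; this also switches on cookie handling.
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    // Save cookies to the same file when the handle is cleaned up.
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)');
    return $ch;
}

function fetch_page($ch, $url)
{
    // Reusing the same handle means cookies received from one response
    // are sent back with the next request.
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return array('status' => $status, 'body' => $body);
}
```

A caller would create the handle once with `make_crawler_handle('/tmp/cookies.txt')` and then pass it to `fetch_page` for each URL. With verbose logging on, a `Cookie:` line should then appear in the request headers from the second request onward.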

Well, that doesn't fix it. I still get error 403 when I use my crawler, but I can visit the site with my browser. (Doesn't that show my IP is still allowed, and that my crawler's disguise is just not good enough?) I have to admit that Douban is pretty powerful.

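Why would I log in? I just loop over the pages I need. Every item has a fixed ID, like "douban.com/subject/xxxxxxx", so doesn't this also save them some resources? I think Douban's information is well organized, which is why I gave up on scraping e-commerce sites. It looks like Douban is too tough; nothing I try solves it. Thanks to our friend from abroad!!!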

I just tried the code below on my local machine with the URL you were using (taken from the logs), and I'm getting the results as intended:
$url = 'http://book.douban.com/subject/1044915/';
$c = curl_init();
$curl_header = array(
    'Accept: */*',
    'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
    'Connection: Keep-Alive');
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_HTTPHEADER, $curl_header);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($c, CURLOPT_TIMEOUT, 30);
curl_setopt($c, CURLOPT_HEADER, 0);
$res = curl_exec($c);
echo "<H1>HERE ARE THE RESULTS</H1>";
echo $res;
I still believe some piece of information is missing from your code, and that is what causes the 403 response on your side.
Hope it helps.
//Ali

After using your code, I cannot get any data at all. I think the problem is now different from the one I first encountered. Here is my whole log (I set a limit so that it tries at most 10 times when it cannot get data). Thanks again for your help!
{
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
*   Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
}

If the same code works on one side and not on the other, something in the environment may be different. Let's analyze it another way. I'm using the versions below (actually I'm using WAMP):
PHP:
PHP 5.2.5 (cli) (built: Nov  8 2007 23:18:51)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
    with the ionCube PHP Loader v3.3.18, Copyright (c) 2002-2010, by ionCube Ltd., and
    with Xdebug v2.1.0, Copyright (c) 2002-2010, by Derick Rethans
curl:
curl version: 7.16.0 with features:
CURL_VERSION_SSL
CURL_VERSION_LIBZ
//Ali

My versions are HIGHER than yours, and I have enabled ALL the features you mentioned. I solved my problem another way: I collected 200+ proxy IPs and put them in an array, and once an IP is refused by the server, the crawler switches to another one. It seems to be running well now. Although that solved my problem, I'm still curious why my crawler works through proxy IPs but not from my local one. If the server refused my IP, why can I still visit the site with my browser? Is there some aspect I didn't disguise well in my crawler? Thank you very much! You really helped!
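The proxy-array approach described above could look roughly like the following. This is only a sketch under assumptions: the function names `next_proxy_index` and `fetch_via_proxies` are invented for illustration, the proxy list is whatever the caller supplies, and the poster's actual code is not shown in this thread.

```php
<?php
// Sketch only: function names are hypothetical, not the poster's code.

function next_proxy_index($current, $count)
{
    // Move to the next proxy, wrapping around after the last one.
    return ($current + 1) % $count;
}

function fetch_via_proxies($url, array $proxies, $maxAttempts)
{
    $index = 0;
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PROXY, $proxies[$index]); // "host:port" string
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        if ($body !== false && $status === 200) {
            return $body;
        }
        // Refused (for example with 403) or failed: switch to the next proxy.
        $index = next_proxy_index($index, count($proxies));
    }
    return false; // every attempt was refused or failed
}
```

A call such as `fetch_via_proxies($url, $proxies, 10)` would mirror the 10-attempt limit mentioned earlier in the thread. The function returns `false` once the attempts run out instead of looping forever.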

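The point is that the crawler gets a 403 while the browser works. And the crawler has sleep(3), whereas I refreshed the browser 3 times a second for a whole minute without any problem.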

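Visit the page with your browser and capture the traffic to take a look. It probably requires a cookie, and your crawler is not sending the cookie information!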

Glad you were able to solve your issue with a different technique. Are you sending multiple requests to the same server concurrently? That could be why the server blocks the IP: too many concurrent requests from one address do not look like human behavior (a person using a browser would not hit the same server simultaneously like that). Just a thought :)
//Ali
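If concurrency is indeed the trigger, one thing to try is fetching the subject pages strictly one at a time with a pause between them. The sketch below assumes the URL pattern seen in the logs. The function names and the delay bounds are illustrative choices, not values known to satisfy the server.

```php
<?php
// Sketch only: names and delay bounds are assumptions for illustration.

function subject_url($id)
{
    // URL pattern taken from the logs in this thread.
    return 'http://book.douban.com/subject/' . $id . '/';
}

function pause_seconds($min, $max)
{
    // Whole number of seconds in [$min, $max], so the gaps are not uniform.
    return mt_rand($min, $max);
}

function crawl_sequentially(array $ids, $min, $max)
{
    $results = array();
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    foreach ($ids as $id) {
        curl_setopt($ch, CURLOPT_URL, subject_url($id));
        $results[$id] = curl_exec($ch); // only one request in flight at a time
        sleep(pause_seconds($min, $max));
    }
    return $results;
}
```

For example, `crawl_sequentially(array(1044639, 1044640), 2, 5)` would request the two pages in order, waiting between 2 and 5 seconds after each.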
