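This could be a cookie or Referer issue. It is better to enable verbose mode and post the logs here. Here is a sample of how to add logging:

// $ch is your existing curl handle
$fp_err = fopen('verbose_log.txt', 'ab+');    // append verbose output to this file
curl_setopt($ch, CURLOPT_VERBOSE, 1);         // log request/response details
curl_setopt($ch, CURLOPT_FAILONERROR, true);  // treat HTTP codes >= 400 as errors
curl_setopt($ch, CURLOPT_STDERR, $fp_err);    // send verbose output to the file instead of stderr

Once done, post the contents of verbose_log.txt here.
//Ali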
Here is my log. Can you read Chinese? I think I had better translate my question here: I make my crawler pretend to be a browser, but it still gets a 403 error. When I visit the site with my browser it works fine, even if I refresh quite frequently. How can I handle it? Thanks a lot!

{
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
}
Above I only pasted the part where the problem occurs; here is the whole log:

{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044639/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 23684
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="mRyYY+c5VLs"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044639"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:01 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044640/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html; charset=utf-8
< Connection: keep-alive
< Keep-Alive: timeout=20
< Content-Length: 25035
< Expires: Sun, 1 Jan 2006 01:00:00 GMT
< Pragma: no-cache
< Cache-Control: must-revalidate, no-cache, private
< P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
< Set-Cookie: bid="FVAMqO5XkaQ"; path=/; domain=.douban.com; expires=Thu, 01-Jan-2012 00:00:00 GMT
< Set-Cookie: viewed="1044640"; path=/; domain=.douban.com; expires=Wed, 01-Jan-2012 00:00:00 GMT
< Date: Sun, 17 Apr 2011 18:34:02 GMT
<
* Connection #0 to host book.douban.com left intact
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044641/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
/* the same '200 OK' pattern as above repeats here */
// here is where the problem appears
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044907/ HTTP/1.1
User-Agent: MSIE 7.0 (compatible; Mozilla/4.0; Windows NT 6.1 )
Host: book.douban.com
Accept: */*
Referer: www.douban.com
It seems to be caused by cookies, doesn't it? How can I handle it?
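One way to check the cookie theory against the log itself: none of the requests in the log above carries a Cookie header, and the two 200 responses each issue a different bid value, which is consistent with the crawler never sending cookies back. The sketch below pulls those facts out of a verbose log. The function names cookie_values_in_log and log_sends_cookies are made up for illustration, and the parsing assumes the log layout shown in this thread.

```php
<?php
// Collect the value of one named cookie from every "< Set-Cookie:" line
// in a curl verbose log. Surrounding double quotes are stripped.
function cookie_values_in_log($log, $name)
{
    $pattern = '/^< Set-Cookie: ' . preg_quote($name, '/') . '=("?)([^";]*)\1;/m';
    preg_match_all($pattern, $log, $matches);
    return $matches[2];
}

// True if any request in the log sent a Cookie header.
function log_sends_cookies($log)
{
    return preg_match('/^Cookie:/m', $log) === 1;
}
```

For example, calling cookie_values_in_log(file_get_contents('verbose_log.txt'), 'bid') on a log like the one above lists a fresh bid for every successful response. If the values keep changing and log_sends_cookies() is false, the server sees each request as a brand-new visitor.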
Yes, it is due to cookies. You can use something like the following to handle cookies in curl:

....
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // file the cookies are saved to
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // file the cookies are read from
....
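A slightly fuller sketch of the same idea, assuming a single curl handle is reused for the whole crawl so that cookies set by one response are sent with the next request. The helper names make_crawler_options and crawl_with_cookies are hypothetical, the cookie path is only an example, and the PHP curl extension is assumed to be loaded. Nothing in this thread confirms that this alone satisfies the site.

```php
<?php
// Hypothetical helper: option set for a cookie-aware crawl.
function make_crawler_options($cookieFile)
{
    return array(
        CURLOPT_RETURNTRANSFER => true,         // return the body instead of printing it
        CURLOPT_COOKIEFILE     => $cookieFile,  // read cookies from here and enable cookie handling
        CURLOPT_COOKIEJAR      => $cookieFile,  // write cookies here when the handle is closed or freed
    );
}

// Fetch several URLs over one handle so cookies carry over between requests.
function crawl_with_cookies(array $urls, $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, make_crawler_options($cookieFile));
    $pages = array();
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);  // false on failure
    }
    curl_close($ch);
    return $pages;
}
```

Reusing the handle matters here: while the handle stays open, curl keeps the cookies in memory and sends them on later requests made through that handle; the jar file is only written once the handle is closed.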
Well, that doesn't fix it. I still get error 403 when I use my crawler, but I can still visit the site with my browser, which shows that my IP itself is fine. (Does this mean that my IP is still usable and my crawler just doesn't disguise itself well?) I have to admit that Douban is pretty powerful.
I just tried the code below on my local machine, using the URL you were using (as seen in the logs), and I'm getting the results as intended:

$url = 'http://book.douban.com/subject/1044915/';
$c = curl_init();
// Browser-like request headers
$curl_header = array(
    'Accept: */*',
    'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )',
    'Connection: Keep-Alive');
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
curl_setopt($c, CURLOPT_HTTPHEADER, $curl_header);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($c, CURLOPT_TIMEOUT, 30);
curl_setopt($c, CURLOPT_HEADER, 0);           // do not include response headers in the output
$res = curl_exec($c);
echo "<H1>HERE ARE THE RESULTS</H1>";
echo $res;

I still believe something is missing from your code that is causing the 403 response on your side.
Hope it helps.
//Ali
After using your code, I cannot get any data at all. I think the problem is now different from the one I first encountered. Here is my whole log (I set a limit so that it retries at most 10 times when it cannot get data). Thanks again for your help!

{
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
* About to connect() to book.douban.com port 80 (#0)
* Trying 211.147.4.31... * connected
* Connected to book.douban.com (211.147.4.31) port 80 (#0)
> GET /subject/1044915/ HTTP/1.1
Host: book.douban.com
Accept: */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1 )
Connection: Keep-Alive
* The requested URL returned error: 403
* Closing connection #0
}
If the same code works on one side and not on the other, something in the environment may be different. Let's analyze it a different way. I'm using the versions below (actually, I'm using WAMP):

PHP:
PHP 5.2.5 (cli) (built: Nov 8 2007 23:18:51)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies
    with the ionCube PHP Loader v3.3.18, Copyright (c) 2002-2010, by ionCube Ltd., and
    with Xdebug v2.1.0, Copyright (c) 2002-2010, by Derick Rethans

curl:
curl version: 7.16.0 with features:
CURL_VERSION_SSL
CURL_VERSION_LIBZ

//Ali
My versions are HIGHER than yours, and I have enabled ALL the features you mentioned. I solved my problem another way: I collected over 200 proxy IPs and put them in an array; once an IP is refused by the server, the crawler switches to another one. It seems to be running well now.

Although I solved the problem that way, I'm still curious why my crawler can run through a proxy IP but not from my local one. If the server refused my IP, why can I still visit the site with my browser? Is there some aspect that my crawler doesn't disguise well?

Thank you very much! You really helped!
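The poster's code is not shown, so the following is only a sketch of what "switch to another proxy once one is refused" could look like. The names fetch_with_rotation and fetch_via_curl are hypothetical, proxies are assumed to be "host:port" strings, and a response counts as success only when the status is 200 (a 403, or a status of 0 when the proxy itself is unreachable, moves on to the next proxy).

```php
<?php
// Try each proxy in turn and return the first 200 response.
// $fetch is any callable taking ($url, $proxy) and returning array($httpCode, $body).
function fetch_with_rotation($url, array $proxies, $fetch)
{
    foreach ($proxies as $proxy) {
        list($code, $body) = call_user_func($fetch, $url, $proxy);
        if ($code === 200) {
            return array('proxy' => $proxy, 'code' => $code, 'body' => $body);
        }
        // Refused or unreachable: fall through to the next proxy.
    }
    return null;  // every proxy failed
}

// A curl-based fetcher that can be passed as $fetch (assumes the curl extension).
function fetch_via_curl($url, $proxy)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array($code, $body);
}
```

A call such as fetch_with_rotation($url, $proxyList, 'fetch_via_curl') would then walk the list. Keeping the fetcher as a parameter makes the rotation logic easy to exercise without touching the network.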
Glad you were able to solve your issue with a different technique.

Are you sending multiple requests to the same server concurrently? That could be the reason: the server may block an IP that makes too many concurrent requests, since a person using a browser would not hit the same server that many times simultaneously.

Just a thought :)
//Ali
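If request rate is indeed the trigger (the thread does not confirm this), one thing to try is fetching pages one at a time with a pause between them, and pausing longer after each 403. The sketch below is only an illustration: next_delay and crawl_politely are hypothetical names, and the 2-second base and 60-second cap are arbitrary assumptions, not known limits of the site.

```php
<?php
// Pause (in seconds) before the next request: the base pause doubles for
// each consecutive 403 and is capped at $maxSeconds.
function next_delay($consecutive403, $baseSeconds = 2, $maxSeconds = 60)
{
    $delay = $baseSeconds * pow(2, $consecutive403);
    return (int) min($delay, $maxSeconds);
}

// Fetch URLs strictly one after another, never concurrently.
// $fetch is a callable taking ($url) and returning the HTTP status code;
// $sleep is injectable so the pauses can be observed without waiting.
function crawl_politely(array $urls, $fetch, $sleep = 'sleep')
{
    $consecutive403 = 0;
    $codes = array();
    foreach ($urls as $url) {
        $code = call_user_func($fetch, $url);
        $codes[$url] = $code;
        $consecutive403 = ($code === 403) ? $consecutive403 + 1 : 0;
        call_user_func($sleep, next_delay($consecutive403));
    }
    return $codes;
}
```

This version records a 403 and moves on rather than retrying the same URL; a retry loop could be layered on top if needed.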