PHP content scraping: Warning: file_get_contents(http://auto.sina.com.cn/news/2012-07-21/0930100

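Three months ago I came across a gem of a book, 《php典型模块与项目实战大全》, edited by 杨宇 et al. and published by Tsinghua University Press. Chapter 8 covers content scraping. It gives a spider program and, separately, a delay function (a script that runs for more than 30 seconds hits a fatal error, so the time limit has to be extended, either in php.ini or with the delay function). The book does not say how to combine the two. On top of that, however I debug it I get no results, only the same three messages:

Warning: mysql_free_result(): supplied argument is not a valid MySQL result resource in C:\spiders.php on line 69
Warning: mysql_close(): supplied argument is not a valid MySQL-Link resource in C:\spiders.php on line 71
Warning: file_get_contents(http://auto.sina.com.cn/news/2012-07-21/09301004324.shtml) [function.file-get-contents]: failed to open stream: A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.

I am at a loss. The database can be read and written, so the problem should lie in the third message. How do I fix it? The book's explanation is far too sketchy: how to use the code, where to put it, and what each line does are all missing.
I am stuck. If you have been through something similar or can solve this, please reply. I would be very grateful.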

Solutions

Set the following in php.ini, then run it again and see:
allow_url_fopen = On
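To confirm what the running PHP process actually sees, the setting can be checked from a script. This is a minimal sketch, assuming PHP 7 or later; `urlFopenEnabled` is a hypothetical helper name, not something from the book or the thread. Note that the warning in the question is a connection timeout, which suggests the http wrapper was already reached, so this setting may well be On already.

```php
<?php
// Hypothetical helper: reports whether this PHP process allows URLs in
// file_get_contents()/fopen(). ini_get() returns the raw setting as a string.
function urlFopenEnabled(): bool
{
    $raw = ini_get('allow_url_fopen');
    // Accepts "1", "On", "true", "yes"; anything else counts as disabled.
    return filter_var($raw, FILTER_VALIDATE_BOOLEAN);
}

echo 'allow_url_fopen is ', urlFopenEnabled() ? 'On' : 'Off', PHP_EOL;
```

Running this through the same Apache setup as spiders.php matters, because the command-line PHP may load a different php.ini.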
<?php
//============================
// File: spiders.php
// Version: 0.0.1
// Author:
// Updated:
// Description: web page scraper
//============================
// Get the category id
$Category=$_GET['cid'];
if($Category){
    $con = mysql_connect('localhost', 'root', '4321') or die('Could not connect: ' . mysql_error());
    mysql_query("set names gb2312");
    //echo 'Connected successfully';
    $db=mysql_select_db('get_content',$con);
    if (!$db){
        die ("Can\'t use download : " . mysql_error());
    }else{
        // Get the extraction rules as an array
        $sql = "SELECT * FROM `spiders` WHERE `Category`='".$Category."' ";
        $result = mysql_query($sql,$con);
        $row=mysql_fetch_row($result);
        //var_dump ($row);
        if (!$result) {
            // Free the result set
            mysql_free_result($result);
        }
    }
}else{
    exit("Something went wrong :(");
}
// Address of the page to fetch
$list_url = $row[5];
// Fetch the list of links
$list_content = file_get_contents($list_url);
// Link pattern observed on the page: <li> <a href="/news/2010-05-21/1705605024_4.shtml" title="自主的胜利？" target="_blank" class="fl">自主的胜利？</a><i>
// Get the regex rule for the list
$ch=curl_init();
$timeout=10;
curl_setopt($ch,CURLOPT_URL,$list_url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$list_content=curl_exec($ch);// execute the handle
curl_close($ch);// close the connection
$list_match = $row[2];
// Get the regex rule for the article content
$content_match=$row[3];
preg_match_all($list_match,$list_content,$list_data);
$i=0;
// We now have a list of links
//print_r($list_data[1]);
foreach($list_data[1] as $detail_url){
    // TODO: skip links that have already been fetched
    $detail_content = file_get_contents("http://auto.sina.com.cn" . $detail_url);
    //echo ($i++);
    // Get the title
    preg_match('/<h1 id="artibodyTitle".+?>(.+?)<\/h1>/U',$detail_content,$title_data);
    $title=$title_data[1];
    //die($title);
    // Get the content; the regex has to be written to suit each site
    preg_match_all($content_match,$detail_content,$body_data); //die($body_data[2][0]);
    $body = $body_data[2][0];
    $sql = "INSERT INTO `get_content`.`articles` (`ID` ,`Title` ,`Click` ,`Content` ,`Date` ,`Category` )VALUES (NULL , '".$title."', '', '".$body."', now(), '1');";
    $result1 =mysql_query($sql);
}
mysql_free_result($result1);
// Close the connection
mysql_close($db);
?>
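That is the spider program supplied in the book's source files. How can I pass along the database? Where is the file upload entry on CSDN, or is there one at all? It really only has two tables; anyone who needs them can email me and I will send them over.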
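I think most of this comes down to some of the parameters in your treasured book being out of date.
For the code above, there are a few things you need to get comfortable with:
1. Connecting to the database
2. Using curl
3. Extracting content with regular expressions
4. Writing the data to the database
There is no need to worry about whatever else this example is going on about. Study along the lines I listed, and come back with specific questions if something is unclear.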
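Point 3 in the list above, pulling links out of the list page, can be tried without touching the network or the database. This is a small sketch, assuming PHP 7 or later. The pattern and the name `extractArticleLinks` are illustrative assumptions (the real rule lives in the `spiders` table and is not shown in this thread), and the sample markup is the one quoted in the comment in spiders.php.

```php
<?php
// Hypothetical helper: returns the relative article URLs found in a list page.
function extractArticleLinks(string $html): array
{
    // Assumed pattern: anchors whose href starts with /news/ and ends in .shtml.
    $pattern = '#<a\s+href="(/news/[^"]+\.shtml)"#i';
    if (preg_match_all($pattern, $html, $matches) === false) {
        return []; // regex engine error
    }
    // Drop duplicates so the same article is not fetched twice.
    return array_values(array_unique($matches[1]));
}

$sample = '<li> <a href="/news/2010-05-21/1705605024_4.shtml" title="自主的胜利？" '
        . 'target="_blank" class="fl">自主的胜利？</a><i>';

foreach (extractArticleLinks($sample) as $path) {
    echo 'http://auto.sina.com.cn' . $path, PHP_EOL;
}
```

Testing the rule against a saved copy of the page like this makes it easier to tell a bad regex apart from a failed download.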
Change your first line to:
$Category=isset($_GET['cid']) ? $_GET['cid'] : '';
His scraper is fairly mediocre; here is one I would recommend:
http://topic.csdn.net/u/20080824/07/0125890f-9a98-4296-ad84-c5c748c17581.html
<?php
$ctx = stream_context_create(
    array(
        'http' => array(
            'timeout' => 1 // set a timeout, in seconds
        )
    )
);
// The second argument is use_include_path, so pass false rather than 0.
file_get_contents("http://example.com/", false, $ctx);
?>
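file_get_contents() returns false when the request fails, so the timeout only helps if the caller checks for that and moves on. This is a hedged sketch, assuming PHP 7.1 or later; `fetchWithTimeout` is a made-up name, and the `@` operator is used here only to keep the warning out of the page output.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and return null on failure
// instead of letting a warning and a false value leak into later code.
function fetchWithTimeout(string $url, float $seconds): ?string
{
    $ctx = stream_context_create([
        'http' => ['timeout' => $seconds], // timeout in seconds
    ]);
    $body = @file_get_contents($url, false, $ctx);
    if ($body === false) {
        $err = error_get_last();
        error_log('fetch failed: ' . $url . ' (' . ($err['message'] ?? 'unknown error') . ')');
        return null;
    }
    return $body;
}
```

Inside the `foreach` over article links, a null result would then mean `continue` to the next link rather than running the regexes on an empty page.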
Last edited by xuzuning on 2012-10-02 14:42:59
Honestly, isn't file_get_contents outdated for scraping? Prefer curl first, then fsockopen or pfsockopen, and file_get_contents only as a last resort.
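For the curl route, note that the spiders.php posted above sets only CURLOPT_CONNECTTIMEOUT on the list page and still calls file_get_contents() for every article. Below is a sketch of a curl fetch that bounds both the connect phase and the whole transfer, assuming the curl extension is loaded and PHP 7.1 or later. The name `curlFetch` and the timeout values are illustrative choices, not recommendations from the thread.

```php
<?php
// Hypothetical helper: fetch a page with curl, or return null on any failure.
function curlFetch(string $url, int $connectTimeout = 5, int $totalTimeout = 15): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole request
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released when $ch goes out of scope.

    if ($body === false || $status >= 400) {
        return null; // network error, timeout, or HTTP error status
    }
    return $body;
}
```

With both limits set, a single unresponsive article page costs at most `$totalTimeout` seconds instead of stalling the loop.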
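Thanks again to everyone for the friendly help. Leaving the MySQL problem aside for now, I suspect the Apache server configuration. I downloaded the full installer from the official site (it is all in English and I do not fully understand it), then made a few simple changes to its configuration file. The goal at the time was modest: get PHP to run at all. That worked. But with deeper or more complex programs it may well cause problems, such as the scraping functions I am using now, especially options related to file_get_contents, as well as options related to the time limit. Whether the configuration file needs changes or additions for these, I have no way of knowing.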
After commenting out lines 69 and 71 and running the program, I get these errors:
Warning: file_get_contents(http://auto.sina.com.cn/news/2012-09-25/07451037889.shtml) [function.file-get-contents]: failed to open stream: A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond. in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\spiders.php on line 47
Fatal error: Maximum execution time of 240 seconds exceeded in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\spiders.php on line 47
$row[5] holds the value stored in the "spiders" table in the database: http://auto.sina.com.cn/news/t
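1. Add this at the top of the script:
set_time_limit(0);
2. In php.ini, change
allow_url_fopen = Off
to
allow_url_fopen = On
3. Fetching pages with file_get_contents is not very reliable; curl is recommended.
4. MySQL connections are generally closed automatically when the script finishes.
5. An INSERT does not need a buffered result, so mysql_unbuffered_query can be used.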
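On the MySQL points: the first two warnings in the question come from what is passed to the cleanup calls. mysql_free_result() expects a result set from a SELECT, but `$result1` holds the boolean returned for an INSERT, and mysql_close() is given `$db`, the boolean from mysql_select_db(), rather than the link `$con`. The old mysql_* extension was also removed in PHP 7. Below is a hedged sketch of the insert step using PDO with a prepared statement instead, which also avoids breaking the SQL when a title contains a quote. The function name `saveArticle` is hypothetical, and the column names are taken from the INSERT in spiders.php.

```php
<?php
// Hypothetical helper: store one scraped article. No free_result call is
// needed because an INSERT produces no result set.
function saveArticle(PDO $pdo, string $title, string $body, string $category): bool
{
    $stmt = $pdo->prepare(
        'INSERT INTO articles (Title, Click, Content, Date, Category)
         VALUES (:title, :click, :content, :date, :category)'
    );
    return $stmt->execute([
        ':title'    => $title,
        ':click'    => '',                  // the original script also inserts an empty value
        ':content'  => $body,
        ':date'     => date('Y-m-d H:i:s'), // stands in for MySQL's now()
        ':category' => $category,
    ]);
}
```

With MySQL this might be called with a connection such as `new PDO('mysql:host=localhost;dbname=get_content;charset=gb2312', 'user', 'password')`, where the user name and password are placeholders.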
Fatal error: Maximum execution time of 240 seconds exceeded in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\spiders.php on line 47
What does this mean, and how do I fix it?
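Fatal error: Maximum execution time of 240 seconds exceeded
The remote host did not answer properly or did not respond at all, so the connection failed, and the script then ran past its maximum execution time of 240 seconds.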
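That limit is reached because each stalled request blocks until its own timeout, and on Windows PHP measures real elapsed time, so network waits count toward max_execution_time. Besides raising the limit with set_time_limit(), the loop can skip failed pages and stop before the budget runs out. This is a sketch under those assumptions, for PHP 7.1 or later; `crawlWithBudget` and the injected `$fetch` callable (any function returning the page body or null) are hypothetical names.

```php
<?php
// Hypothetical helper: fetch each URL with $fetch, skipping failures and
// stopping once $budgetSeconds of wall-clock time have been used.
function crawlWithBudget(array $urls, callable $fetch, float $budgetSeconds): array
{
    $deadline = microtime(true) + $budgetSeconds;
    $pages = [];
    foreach ($urls as $url) {
        if (microtime(true) >= $deadline) {
            break; // leave the remaining URLs for the next run
        }
        $html = $fetch($url);
        if ($html === null) {
            continue; // one dead link should not abort the whole crawl
        }
        $pages[$url] = $html;
    }
    return $pages;
}

// Demonstration with a fake fetcher so the example runs offline.
$fake = function (string $url): ?string {
    return $url === '/news/b.shtml' ? null : 'page for ' . $url;
};
print_r(array_keys(crawlWithBudget(['/news/a.shtml', '/news/b.shtml', '/news/c.shtml'], $fake, 200.0)));
```

A budget somewhat below the configured limit (200 seconds against 240 here) leaves room for the database writes that follow each fetch.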