抓取了一页面中的信息，如何再抓取这一页中的一链接

本帖最后由 qqsd126 于 2009-12-26 13:45:51 编辑

解决方案 »

免费领取超大流量手机卡，每月29元包185G流量+100分钟通话, 中国电信官方发货

#\<(a)\>[^>]*\<\/(a)\># 再去掉html标签就可以了
$file=fopen("http://excample.com","r");
$data="";
while(!feof($file))
{
$data.=fgets($file,1024);
}
//获取网站的链接
preg_match_all("/<a\s+?href=.+?>.+?<\/a>/",$data,$arr);
//echo $arr[0][0];

foreach($arr[0] as $a)
{
echo $a." ";
}

fclose($file);
*/
echo file_get_contents("http://example.com");这只是简单的例子，好一点的可以继续修改
抓取到连接地址   在使用file_get_contents  得到页面
使用正则取你想要的
把得到的链接存在url数组里。
继续用file_get_contents读。可以用递归实现。
不过你最好设置一个出口
要不然就死循环，没完没了了。
是要抓取链接中的内容吗？
使用snoopy比较方便了，可以自动过滤掉标签。
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$url="http://www.baidu.com";//测试用的百度，你给的页面里的连接太多了。
$snoopy->fetchlinks($url);
//print_r ($snoopy->results);foreach($snoopy->results as $results=>$urls)
{
 $show=$snoopy->fetch($urls);//提取每个连接中的内容。
 $FinalPage.=$show;
}
//$FinalPage=$snoopy->fetch($snoopy->results[0]);
echo $FinalPage;?>不过里面的链接特多，可能你还是需要正则进行筛选。不知道对你有帮助没。
先用正则获取页面里的信息，提取出所有的连接，去除重复的，排列成数组，然后循环着打开这些连接，再获取信息。前提还是要用正则匹配连接，抓取连接这个是我写的一个分析连接的代码，有点简陋（在ThinkPHP下的代码）public function _getAllLinks($site)
{
import('ORG.Net.XmlHTTP');
$http = new XmlHttp();
if (false == $http->openUrl('http://'.$site))
{
return false;
}
$html = $http->getHtml(); if ('' == $html) return false;
$charcode = mb_detect_encoding($html, array('ASCII', 'GB2312', 'GBK', 'UTF-8'));
if ('UTF-8' != $charcode)
{
$html = iconv($charcode, 'UTF-8', $html);
}
//$html = h($html);
$html = preg_replace('/[\n\r]/i', '', $html);
$html = preg_replace('/http:\/\/'.$site.'/i', '', $html);
//echo $html; preg_match_all('/<a.*?href\s*=\s*[\'"](.*?)[\'"].*?>(.*?)<\/a>/i', $html, $ms);
//print_r($ms);
$link = $ms[1];
$text = $ms[2]; $outs = 0; $anchors = 0; $javas = 0; $ins = 0;
for ($i=0; $i<count($link); $i++)
{
$clink = '';
$clink = $link[$i];
//$clink = str_replace('"', '', $clink);
$clink = strtolower($clink);
$ctext = '';
$ctext = $text[$i];
$cimage = '';
if (preg_match('/<img.*?src[ ]*=[ ]*[\'|"](.*?)[\'|"].*?\/?>/', $ctext, $ms))
{
// 链接包含图片
$cimage = $ms[1];
}
$ctext = preg_replace('/<.*?>|\s/', '', $ctext);
if ('http://' == substr($clink, 0, 7))
{
// 外链
$type = 'out';
$incres = $outs++;
//echo $cimage.' - '.$ctext.'link out : '.$clink.' ';
}
elseif ('#' == substr($clink, 0, 1))
{
// 锚链
$type = 'anchor';
$incres = $anchors++;
//echo $cimage.' - '.$ctext.'anchor : '.$clink.' ';
}
elseif ('javascript:' == substr($clink, 0, 11))
{
// JAVA脚本
$type = 'java';
$incres = $javas++;
//echo $cimage.' - '.$ctext.'java : '.$clink.' ';
}
else
{
// 内链
$type = 'in';
$incres = $ins++;
//echo $cimage.' - '.$ctext.'link : '.$clink.' ';
}
$result[$type][$incres]['image'] = $cimage;
$result[$type][$incres]['text'] = $ctext;
$result[$type][$incres]['link'] = $clink;
}
return $result;
}