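I'm scraping the first 5 pages of a listing, but the page number never appears in the address bar and I don't know how it is passed. How do I send the page number via POST?
It looks like I should use stream_context_create(), but I can't quite make sense of it. The page is http://www.bjxw.gov.cn/XWzhwgk/XWnews/XWnewsjrxw.ycs
The pagination markup: <FORM METHOD=POST ACTION="" name = "us_fun5991">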
<TD align = right>
<INPUT TYPE="hidden" NAME="US_POST">总<FONT style = "COLOR:#CC0033">351</FONT>页 当前第
<FONT style = "COLOR:#CC0033">2</FONT>页
<A style = "CURSOR: hand;" onclick="us_fun5991.page_num.value='1';submit();">上一页</A> |
<A style = "CURSOR: hand;" onclick="us_fun5991.page_num.value='3';submit();">下一页</A>
转到:<INPUT TYPE="hidden" NAME="page_num"><SELECT id=page_num1 name=page_num1 style="width:80px" onchange = "us_fun5991.page_num.value=us_fun5991.page_num1.options[selectedIndex].value;submit();">
<option value=1>第1页</option>
<option selected value=2>第2页</option>
<option value=3>第3页</option>
<option value=4>第4页</option>
<option value=10>第10页</option>
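For what it's worth, here is a minimal sketch of how the page number could be sent with stream_context_create(). It assumes the form fields are exactly the three named in the markup above (US_POST, page_num, page_num1) and that the server needs nothing else, such as cookies. The helper names build_page_context_options() and fetch_page() are made up for illustration.

```php
<?php
// Sketch only: field names come from the form markup quoted above;
// the function names are hypothetical helpers, not part of any library.

/**
 * Build the options array for stream_context_create() so that
 * file_get_contents() sends an HTTP POST carrying the page number.
 */
function build_page_context_options($pageNum)
{
    // The form submits US_POST, page_num and page_num1.
    $body = http_build_query(array(
        'US_POST'   => '',
        'page_num'  => $pageNum,
        'page_num1' => $pageNum,
    ));

    return array('http' => array(
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => $body, // the POST body; without it no form data is sent
    ));
}

/** Fetch one page of the list; returns the HTML or false on failure. */
function fetch_page($url, $pageNum)
{
    $context = stream_context_create(build_page_context_options($pageNum));
    return file_get_contents($url, false, $context);
}
```

Calling fetch_page() with the list URL in a loop for page numbers 1 to 5 would then request the first five pages, one POST per page.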
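That way paging works, which makes scraping the pages quite easy.
This one, http://www.investbjxw.gov.cn/indexcu.ycs, is similar to the one above, but that approach doesn't work, and sending it via POST gives no result either.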
<?php
$pageurl = "http://www.investbjxw.gov.cn/indexcu.ycs?page_num=";
// Escape the "?" so it matches a literal question mark before GUID.
$pattern = "/(\/indexzxxxs\.ycs\?GUID=\d{6,})/i";
$out_pre = "http://www.investbjxw.gov.cn";
$opts = array('http' => array(
    'method' => "POST",
    'header' => "Content-type: application/x-www-form-urlencoded",
));
$cxContext = stream_context_create($opts);
$nums = array(); // URLs already printed, used to skip duplicates
for ($i = 1; $i <= 5; $i++) {
    $contents = @file_get_contents($pageurl . $i, false, $cxContext);
    preg_match_all($pattern, $contents, $out);
    foreach ($out[1] as $url) {
        if (in_array($url, $nums)) {
            continue;
        }
        $nums[] = $url;
        $allurl = $out_pre . $url;
        echo '<a href="' . $allurl . '">' . $allurl . "</a><br>";
    }
}
?>
I can open page 3 directly with http://www.investbjxw.gov.cn/indexcu.ycs?page_num=3
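One detail that may matter when extracting the links: in a PCRE pattern an unescaped `?` makes the preceding character optional, so `ycs?GUID` does not match a literal `ycs?GUID`. Below is a small sketch with the question mark escaped, run against a made-up HTML fragment. The sample links and the helper name extract_detail_links() are assumptions for illustration, not real output from the site.

```php
<?php
// Sketch only: $sampleHtml is invented test data, not real output from the site.

/** Return the unique detail-page paths found in a chunk of HTML. */
function extract_detail_links($html)
{
    // \? matches a literal question mark; \d{6,} requires at least six digits.
    preg_match_all('/(\/indexzxxxs\.ycs\?GUID=\d{6,})/i', $html, $matches);
    // array_unique drops repeats; array_values reindexes the result from 0.
    return array_values(array_unique($matches[1]));
}

$sampleHtml = '<a href="/indexzxxxs.ycs?GUID=123456">a</a>'
            . '<a href="/indexzxxxs.ycs?GUID=123456">a again</a>'
            . '<a href="/indexzxxxs.ycs?GUID=7654321">b</a>';

print_r(extract_detail_links($sampleHtml));
```

Deduplicating with array_unique() replaces the manual in_array() check, which is a matter of taste rather than a required change.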
URL: http://tjj.bjxw.gov.cn/XWTJJsjcx/XWTJJsjcxjdsj.ycs
<TD align = right>
<FORM METHOD=POST ACTION="" NAME = 'NE_fun11304'>
<input type = 'hidden' NAME="GUID" value=''>
<input type = 'hidden' NAME="MAINKEY" value=''>
<td></td><td></td><TD align = "right" >
<FONT style="COLOR:#333333;">总<FONT style = "COLOR:#CC0033">7</FONT>页,当前第<FONT style = "COLOR:#CC0033">2</FONT>页,查看:</font>
<SELECT id=page_num name=11304page_num style="width:80px;COLOR:#333333;BACKGROUND-COLOR:#FFFFFF" onchange = "submit();">
<option value='11304$$1'>第1页</option>
<option selected value='11304$$2'>第2页</option>
<option value='11304$$3'>第3页</option>
<option value='11304$$4'>第4页</option>
<option value='11304$$5'>第5页</option>
<option value='11304$$6'>第6页</option>
<option value='11304$$7'>第7页</option>
Neither http://tjj.bjxw.gov.cn/XWTJJsjcx/XWTJJsjcxjdsj.ycs?page_num=11304$$3
nor http://tjj.bjxw.gov.cn/XWTJJsjcx/XWTJJsjcxjdsj.ycs?page_num=3
gets the page number through. What is the reason?
$pageNum = 1; // page to fetch
$POST = 'US_POST=&page_num='.$pageNum.'&page_num1='.$pageNum;
// Fallback for PHP builds that lack curl_setopt_array()
if (function_exists('curl_setopt_array') == false)
{
    function curl_setopt_array($curlObj, $paramsArray)
    {
        if (!!$paramsArray)
        {
            foreach ($paramsArray as $k => $v)
            {
                curl_setopt($curlObj, $k, $v);
            }
        }
    }
}
$url = 'http://www.bjxw.gov.cn/XWzhwgk/XWnews/XWnewsjrxw.ycs';
$curl = curl_init($url);
curl_setopt_array($curl, array(
    CURLOPT_HEADER => false
    ,CURLOPT_POST => true
    ,CURLOPT_POSTFIELDS => $POST
    ,CURLOPT_RETURNTRANSFER => true
));
$content = curl_exec($curl);
curl_close($curl);
// Capture the link text of every list entry
preg_match_all('#<td valign\s*=\s*top><a[^>]*>(.*?)</a></td>#is', $content, $m);
echo "List contents of page {$pageNum}:<pre>";
print_r($m[1]);
Set $pageNum to 1 and you get the list for the first page. Does that make sense now?
'http://www.bjxw.gov.cn/XWzhwgk/XWnews/XWnewsjrxw.ycs' works with the method above. The URL I'm asking about
is: http://tjj.bjxw.gov.cn/XWTJJsjcx/XWTJJsjcxjdsj.ycs
Can't you just adapt the code above? To find out which parameters have to be sent to the target URL and how they are sent, use Firefox + Firebug to inspect the request and response HTTP headers + body.
$pageNum = 1; // This is the first page. To get the first 5 pages, write a for loop yourself and send 5 HTTP POST requests.
// $POST = 'US_POST=&page_num='.$pageNum.'&page_num1='.$pageNum;
$POST = 'GUID=&MAINKEY=&11304page_num='.urlencode('11304$$'.$pageNum);
// Fallback for PHP builds that lack curl_setopt_array()
if (function_exists('curl_setopt_array') == false)
{
    function curl_setopt_array($curlObj, $paramsArray)
    {
        if (!!$paramsArray)
        {
            foreach ($paramsArray as $k => $v)
            {
                curl_setopt($curlObj, $k, $v);
            }
        }
    }
}
$url = 'http://tjj.bjxw.gov.cn/XWTJJsjcx/XWTJJsjcxjdsj.ycs';
$curl = curl_init($url);
curl_setopt_array($curl, array(
    CURLOPT_HEADER => false
    ,CURLOPT_POST => true
    ,CURLOPT_POSTFIELDS => $POST
    ,CURLOPT_RETURNTRANSFER => true
    ,CURLOPT_SSL_VERIFYPEER => false
));
$content = curl_exec($curl);
curl_close($curl);
// echo $content;
// Group 2 holds the link text of each list entry
preg_match_all('#<td><a target\s*=\s*(["\'])_blank\1[^>]*?>(.[^<>]*?)</a><table#is', $content, $m);
echo "List contents of page {$pageNum}:<pre>";
print_r($m[2]);
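The "write a for loop yourself" step could look roughly like the sketch below. It assumes the field names and the `11304$$N` value format shown in the form markup and the code above, and the helper names (build_tjj_post_body(), fetch_tjj_page(), fetch_first_pages()) are hypothetical. Whether the site also needs cookies or other headers is not something this sketch checks.

```php
<?php
// Sketch only: helper names are hypothetical; the field names and the
// "11304$$N" value format are taken from the form markup and code above.

/** Build the urlencoded POST body that selects page $pageNum. */
function build_tjj_post_body($pageNum)
{
    return http_build_query(array(
        'GUID'          => '',
        'MAINKEY'       => '',
        // Single quotes, so PHP does not try to interpolate "$$".
        '11304page_num' => '11304$$' . $pageNum,
    ));
}

/** POST one page request with cURL; returns the HTML or false on failure. */
function fetch_tjj_page($url, $pageNum)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => build_tjj_post_body($pageNum),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}

/** Fetch pages 1..$lastPage and return their HTML keyed by page number. */
function fetch_first_pages($url, $lastPage = 5)
{
    $pages = array();
    for ($pageNum = 1; $pageNum <= $lastPage; $pageNum++) {
        $pages[$pageNum] = fetch_tjj_page($url, $pageNum);
    }
    return $pages;
}
```

Keeping the body builder separate from the request makes it easy to print the exact POST string and compare it with what Firebug shows for a real click.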
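Scraping the URL below is very similar to the one above, but what I wrote is wrong and I don't know where. It looks like I picked the wrong parameters; I can never work out the right ones.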
http://www.investsjs.gov.cn/EnvironMent/EconomyLaw/EconomyLaw.asp?type_id=20051102091002109173

// Fallback for PHP builds that lack curl_setopt_array()
if (function_exists('curl_setopt_array') == false)
{
    function curl_setopt_array($curlObj, $paramsArray)
    {
        if (!!$paramsArray)
        {
            foreach ($paramsArray as $k => $v)
            {
                curl_setopt($curlObj, $k, $v);
            }
        }
    }
}

// POST to $pageurl for pages 1 to 5 and print every link that matches $pattern
function page6($pageurl, $pattern, $out_pre)
{
    for ($pageNum = 1; $pageNum <= 5; $pageNum++) {
        $POST = 'intPageNo=&intPageNo='.$pageNum;
        $url = $pageurl;
        $curl = curl_init($url);
        curl_setopt_array($curl, array(
            CURLOPT_HEADER => false
            ,CURLOPT_POST => true
            ,CURLOPT_POSTFIELDS => $POST
            ,CURLOPT_RETURNTRANSFER => true
        ));
        $content = curl_exec($curl);
        curl_close($curl);
        //$pattern="/(EconomyStateDetail\.asp\?ID=[\d]{15,})/i";
        //$out_pre="http://www.investsjs.gov.cn/EconomyAbout/EconomyState/";
        preg_match_all($pattern, $content, $m);
        foreach ($m[1] as $u) {
            $allurl = $out_pre.$u;
            echo '<a href="'.$allurl.'">'.$allurl."</a><br>";
        }
    }
}

page6("http://www.investsjs.gov.cn/EnvironMent/EconomyData/DistrictSituation.asp?type_id=20070914135237843780","/(situation_Detail\.asp\?ID=[\d]{15,}\&type_id=[\d]{15,})/i","http://www.investsjs.gov.cn/EnvironMent/EconomyData/");
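Open Firebug, click a pagination button, then look at the Firebug toolbar at the bottom of the page and click 'Net'. Check the header information and POST parameters of the first POST request. How to write the regex depends on the page source.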
======================================
POST EconomyLaw.asp
http://www.investsjs.gov.cn/EnvironMent/EconomyLaw/EconomyLaw.asp
200 OK
.....
...
..
Parameters application/x-www-form-urlencoded
INTPAGENO 2
type_id 20051102091002109173
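Going by the capture above, the POST body would carry INTPAGENO and type_id rather than intPageNo alone. A minimal sketch of building that body follows; build_law_post_body() is a hypothetical helper, and whether the server treats the parameter name case-sensitively is not something the capture tells us.

```php
<?php
// Sketch only: parameter names and the type_id value are copied from the
// Firebug capture above; build_law_post_body() is a hypothetical helper.

/** Build the urlencoded POST body for page $pageNum of EconomyLaw.asp. */
function build_law_post_body($pageNum)
{
    return http_build_query(array(
        'INTPAGENO' => $pageNum,
        // Kept as a string: 20 digits do not fit in a PHP integer.
        'type_id'   => '20051102091002109173',
    ));
}

echo build_law_post_body(2), "\n"; // INTPAGENO=2&type_id=20051102091002109173
```

The resulting string can be passed as CURLOPT_POSTFIELDS, with the request sent to http://www.investsjs.gov.cn/EnvironMent/EconomyLaw/EconomyLaw.asp as shown in the capture.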
I installed it, but when I click 'Net' it says the panel is disabled. God.