Question for the experts: how should a web crawler handle delayed redirects?

I only know how to fetch the HTML of a JSP page from a given URL in Java. Not sure whether that helps, but I'll bump the thread for you.


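I already know that part. What I mean is: suppose the HTML you get back is not the page you wanted, because there is an intermediate redirect page in between. What do you do then?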
Has nobody run into this problem before? Here is my code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public String foreignSearch(String key, Integer pageNo) {
    Document document = DocumentHelper.createDocument();
    Element root = document.addElement("records");
    String[] keyList = key.split(" ");
    try {
        // Join the search terms with "+" for the query string.
        int len = keyList.length;
        String temp = keyList[0];
        for (int i = 1; i < len; i++) {
            temp += "+" + keyList[i];
        }
        System.out.println(temp);
        // HtmlProcessor pros = new HtmlProcessor();
        String urlString = "http://ccc.calis.edu.cn/result.php?op=&date_from=&date_to=&order=&max=1000&dbid=&at[]="+temp+"&from[]=0&pos[]=&op=complex&changepage=0&p="+pageNo+"&pyear=&jtitle=&authorfacet=&dbfacet=&max=200&pos_sec=0&at_sec=&from_sec=0&searchSec=0";
        System.out.println(urlString);
        URL url = new URL(urlString);
        // Markers that precede a title and an author in the result page.
        String s = "class=\"pen1\" style=\"font-size: 14px; font-weight: normal\">";
        String str = "; ";
        int count = 0;
        // try-with-resources closes the stream even if parsing fails.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                //System.out.println(line);
                if (line.contains(s)) {
                    System.out.println(line);
                    int begin = line.indexOf(s) + s.length();
                    // Search after the prefix so that end is never before begin.
                    int end = line.indexOf("</a>", begin);
                    if (end >= 0) {
                        root.addElement("title").addText(line.substring(begin, end));
                        count++;
                    }
                }
                if (line.contains("href=\"detail.php?op=read&cccid=")) {
                    System.out.println(line);
                    int begin = line.indexOf("detail.php?op=read&cccid=");
                    int end = line.indexOf("\" class=\"bbb\"", begin);
                    if (end >= 0) {
                        root.addElement("url").addText("http://ccc.calis.edu.cn/" + line.substring(begin, end));
                    }
                }
                if (line.contains(str)) {
                    System.out.println(line);
                    int begin = line.indexOf(str) + str.length();
                    // The author ends at the next space, or at the end of the line.
                    int end = line.indexOf(" ", begin);
                    if (end < 0) {
                        end = line.length();
                    }
                    root.addElement("author").addText(line.substring(begin, end));
                }
            }
        }
        root.addElement("resultCount").addText(String.valueOf(count));
        return document.asXML();
    } catch (Exception e) {
        // On any failure, report zero results rather than propagating the error.
        e.printStackTrace();
        root.addElement("resultCount").addText("0");
        return document.asXML();
    }
}
Could everyone take a look and tell me whether anything is wrong with it?
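To be honest, there is no particularly good solution. A crude approach: first check the page length. If the page is fairly short and contains the phrase "自动跳转" ("automatic redirect"), assume it is an interstitial page. Then search it for the key JS redirect code, for example:
location.href=
window.location=
window.navigate(
Then try to fetch the target URL, and follow the second hop from there.
Another approach is to use HtmlUnit to execute the page (it ships with a JS engine) and capture the page redirect event.
See:
http://blog.csdn.net/strawbingo/article/details/5989879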
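
The crude heuristic above could look roughly like the following. This is only a sketch: the class name JsRedirectSniffer, the 2048-character cutoff, and the regular expression are assumptions made for illustration, and a regex will miss targets that the script builds up dynamically.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsRedirectSniffer {
    // Assumed cutoff: only pages shorter than this are treated as candidates.
    static final int MAX_INTERSTITIAL_LENGTH = 2048;

    // The Chinese phrase for "automatic redirect", written with Unicode escapes.
    static final String AUTO_REDIRECT_HINT = "\u81ea\u52a8\u8df3\u8f6c";

    // Matches location.href = "...", window.location = '...' and window.navigate("...").
    // Group 1 is the quoted target URL.
    static final Pattern JS_REDIRECT = Pattern.compile(
        "(?:location\\.href\\s*=|window\\.location\\s*=|window\\.navigate\\s*\\()"
        + "\\s*[\"']([^\"']+)[\"']");

    /** True if the page is short and mentions an automatic redirect. */
    static boolean looksLikeInterstitial(String html) {
        return html.length() < MAX_INTERSTITIAL_LENGTH && html.contains(AUTO_REDIRECT_HINT);
    }

    /** Returns the first JS redirect target in the page, or null if none is found. */
    static String findRedirectTarget(String html) {
        Matcher m = JS_REDIRECT.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

If looksLikeInterstitial returns true and findRedirectTarget returns a URL, the crawler would fetch that URL instead of parsing the current page.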
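Isn't your crawler recursive? With a JS redirect like this, the URL has to be written in the front-end code, so once you have the first-level HTML, just keep recursing.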
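
A minimal sketch of "keep recursing", written as a bounded loop so that a redirect cycle cannot run forever. The names RedirectFollower and MAX_HOPS, and the two function parameters (one that downloads a URL, one that pulls the next target out of the HTML), are hypothetical and are not taken from the code posted above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

class RedirectFollower {
    // Assumed upper bound on the number of interstitial pages to follow.
    static final int MAX_HOPS = 5;

    /**
     * Fetches startUrl and keeps following extracted redirect targets until
     * a page yields no target, a URL repeats, or MAX_HOPS is reached.
     * Returns the HTML of the last page fetched.
     */
    static String fetchFinalHtml(String startUrl,
                                 Function<String, String> fetcher,
                                 Function<String, String> targetExtractor) {
        Set<String> seen = new HashSet<>();
        String url = startUrl;
        String html = fetcher.apply(url);
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            seen.add(url);
            String next = targetExtractor.apply(html);
            if (next == null || seen.contains(next)) {
                break; // no further redirect, or we are going in circles
            }
            url = next;
            html = fetcher.apply(url);
        }
        return html;
    }
}
```

Passing the fetcher and the extractor in as functions keeps the loop independent of how pages are downloaded and of which redirect patterns are recognized.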
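If you handle it on the back end, you can use a regular expression to check whether the page is an interstitial. If it is, parse out the URL of the "click here if you are not redirected automatically" link and request that.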
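
One way this might be done is sketched below. The class name FallbackLinkExtractor and the hint parameter (whatever wording the site uses in its "click here" link) are assumptions; a real page may need a proper HTML parser rather than a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FallbackLinkExtractor {
    // Matches <a ... href="URL" ...>text</a>.
    // Group 1 is the URL and group 2 is the link text.
    static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*\\bhref\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the href of the first link whose text contains the hint,
     * or null if the page has no such link.
     */
    static String findFallbackUrl(String html, String hint) {
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            if (m.group(2).contains(hint)) {
                return m.group(1);
            }
        }
        return null;
    }
}
```

The returned href may be relative, in which case it still has to be resolved against the URL of the interstitial page before it is requested.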
You can check the Location address in the head; that is the final destination.
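
If "head" here means the HTTP response headers, the check might look like the sketch below. It covers only redirects sent at the HTTP level (3xx status plus a Location header). A delayed jump done in JavaScript will not show up this way. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

class LocationHeaderCheck {
    /**
     * Returns the absolute redirect target announced for urlString,
     * or null if the server does not answer with an HTTP redirect.
     */
    static String redirectTarget(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 3xx response visible
        conn.setRequestMethod("HEAD");          // headers are enough, skip the body
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (status >= 300 && status < 400 && location != null) {
                return resolve(urlString, location);
            }
            return null;
        } finally {
            conn.disconnect();
        }
    }

    /** Resolves a possibly relative Location value against the requested URL. */
    static String resolve(String baseUrl, String location) {
        return URI.create(baseUrl).resolve(location).toString();
    }
}
```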