Beginner question: how do I scrape the data I want from a web page according to certain criteria?

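For example, on the page http://www.appannie.com/top/iphone/united-states/games/ I want to extract the names of all games in the FREE column whose rank has risen by more than 30. How should I go about it? It looks like jsoup could do this, but I have gone through a lot of examples and can't really follow them.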

Solutions »


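jsoup connects and then disconnects right away, and a server can easily mistake that pattern for an attack, so the request comes back with a 503 error. The site the OP mentions cannot be fetched directly with jsoup. What you can do is save the page to local disk with HttpClient first, and then parse the saved file with jsoup.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Step 1: save the page to the local disk (Commons HttpClient 3.x API)
HttpClient client = new HttpClient();
String htmlurl = "http://www.appannie.com/top/iphone/united-states/games/";
System.out.println(htmlurl);
HttpMethod method = new GetMethod(htmlurl);
try
{
    client.executeMethod(method);
    System.out.println(method.getStatusLine());
    String html = method.getResponseBodyAsString();
    // Write as UTF-8 explicitly so the file matches the charset passed to Jsoup.parse below
    Writer fw = new OutputStreamWriter(new FileOutputStream("C:\\download\\Top Charts - iPhone - United States - Games  App Annie.htm"), "UTF-8");
    fw.write(html);
    fw.close();
} catch (HttpException e)
{
    e.printStackTrace();
} catch (IOException e)
{
    e.printStackTrace();
} finally
{
    // Always return the connection to the client
    method.releaseConnection();
}

// Step 2: parse the saved file with jsoup
try
{
    //URL url = new URL("http://www.appannie.com/top/iphone/united-states/games/");
    //Document doc = Jsoup.parse(url, 3000000);
    File f = new File("C:\\download\\Top Charts - iPhone - United States - Games  App Annie.htm");
    Document doc = Jsoup.parse(f, "UTF-8");
    Elements tables = doc.select("table");
    Element table = tables.get(1); // the second table on the page
    Elements trs = table.getElementsByTag("tr");
    for (Element tr : trs)
    {
        Elements tds = tr.children();
        if (tds.size() < 3)
        {
            continue; // row has no Free column
        }
        Element td = tds.get(2); // the Free column
        Elements span = td.getElementsByTag("span");
        if (span.isEmpty())
        {
            continue; // header row, or a cell without a rank-change marker
        }
        String content = span.get(0).html();
        if (content.contains("\u25b2")) // \u25b2 is the up triangle; the down triangle is \u25bc
        {
            String up = content.replace("\u25b2", "").trim();
            int upnum = Integer.parseInt(up);
            if (upnum > 30) // the question asks for a rise of more than 30
            {
                Elements a = td.getElementsByTag("a");
                System.out.println(a.get(0).html());
            }
        }
    }
} catch (IOException e)
{
    e.printStackTrace();
}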
What if I want to save the data to a local txt file instead of printing it? Would a BufferedWriter (bw) do?
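A BufferedWriter should work for that. Below is a minimal sketch rather than a definitive answer: the class name RisingGamesWriter, the helpers riseOf and writeNames, the output file risers.txt and the sample rows are all made up for illustration. In the real program you would collect the names inside the jsoup loop instead of from a hard-coded array, then write them out once at the end.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class RisingGamesWriter {

    // Hypothetical helper: parses a rank-change cell such as "\u25b235" (up triangle + number)
    // and returns the rise, or 0 when the cell does not show a rise.
    static int riseOf(String cell) {
        if (cell == null || !cell.contains("\u25b2")) {
            return 0;
        }
        try {
            return Integer.parseInt(cell.replace("\u25b2", "").trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Writes one name per line as UTF-8; try-with-resources closes the writer.
    static void writeNames(List<String> names, Path out) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String name : names) {
                bw.write(name);
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample rows: {game name, rank-change cell text}
        String[][] rows = {
            {"Game A", "\u25b235"},
            {"Game B", "\u25bc4"},
            {"Game C", "\u25b230"}
        };
        List<String> risers = new ArrayList<>();
        for (String[] row : rows) {
            if (riseOf(row[1]) > 30) { // keep only rises of more than 30
                risers.add(row[0]);
            }
        }
        writeNames(risers, Paths.get("risers.txt"));
    }
}
```

To plug this into the code above, replace the System.out.println call with risers.add(a.get(0).html()) and call writeNames after the loop. Collecting first and writing once keeps the file handling in one place, so the writer is always closed even if parsing fails halfway.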
