How to scrape part of a web page in Java

As the title says: for example, I need to extract the content between <table class='mystyle'> and </table>. A page contains many such tables, and I need to pull out the content of all of them. Which approach is best for reuse?

Solutions »

Use HTMLParser to parse the HTML.

    Parser parser = xxx; // xxx is the placeholder used in the original reply
    parser.reset();
    NodeFilter filter1 = new TagNameFilter("table");
    NodeFilter filter2 = new HasAttributeFilter("class", "mystyle");
    NodeFilter filter = new AndFilter(filter1, filter2);
    NodeList nodeList = parser.extractAllNodesThatMatch(filter);

Otherwise, write a regular expression, although it will probably be hard to get right if a table has other tables nested inside it.

Yes, a regular expression works well. You could also cut the string up with substring, indexOf and so on. That has to go inside a loop, of course, so it is probably more tedious.

    Pattern pattern = Pattern.compile("<.+?>", Pattern.DOTALL);
    Matcher matcher = pattern.matcher("<table class=\"mystyle\">the content to extract</table>");
    String string = matcher.replaceAll("");
    System.out.println(string);

A regular expression will do. You can also use htmlparser, or use nekohtml to convert the page to XML and then query it with XPath. There is also a Java project, jager, that extracts content from HTML according to a template.

    <div class="class_right_list">
      <div class="title1">Dalian welded pipe market prices fall on the 10th</div>
      <div class="author1">Source: Published: 2012-08-10 11:34:31</div>
      <div style="background:url(images/line_dot1.gif) center center no-repeat; height:10px; margin:8px 0px; "></div>
      <div class="big" style="width:620px; margin:0px auto;"><p> Welded pipe prices in the Dalian market fell on the 10th. Current quotes for Tangshan-made pipe: 27*2.5 at 4030 yuan/ton; 48*3.25 at 3930 yuan/ton; 114*3.75 at 3880 yuan/ton; 219*5.0 at 4080 yuan/ton. These quotes are down 70 yuan/ton overall from yesterday. Trading this week has been average, end-user purchasing is still not very active, and prices have little support, hence the weak decline. (中规钢铁网)</p></div>
      <div>Keywords:
        <a href="#"></a>
      </div>
    </div>

How do I collect the content inside <div class="class_right_list">? (How should it be written?) My code:

    // Excerpt as posted: the enclosing method's signature and the declaration of content are not shown.
    try {
        URL url = new URL(startUrl);
        URLConnection urlConnection = url.openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        StringBuffer is = new StringBuffer();
        StringBuilder builder = new StringBuilder();
        String date;
        while ((date = reader.readLine()) != null) {
            is.append(date);
        }
        String htmlcode = is.toString();
        // Create a Parser from the given string and the specified encoding
        Parser parser = Parser.createParser(htmlcode, "GBK");
        // Create an HtmlPage object: HtmlPage(Parser parser)
        HtmlPage page = new HtmlPage(parser);
        try {
            parser.visitAllNodesWith(page);
        } catch (ParserException e1) {
            e1 = null;
        }
        NodeList nodeList = page.getBody();
        NodeFilter filter1 = new TagNameFilter("div");
        NodeFilter filter2 = new HasAttributeFilter("class", "");
        NodeFilter filter = new AndFilter(filter1, filter2);
        nodeList = parser.extractAllNodesThatMatch(filter);
        for (int m = 0; m < nodeList.size(); m++) {
            Div newsContenTag = (Div) nodeList.elementAt(m);
            builder = builder.append(newsContenTag.getText());
        }
        content = builder.toString();
        if (content != null) {
            parser.reset();
            parser = Parser.createParser(content, "gbk");
            StringBean sb = new StringBean();
            sb.setCollapse(true);
            parser.visitAllNodesWith(sb);
            content = sb.getStrings();
        }
        System.out.println(content);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content;
    }
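For the regular-expression route mentioned in the replies, the following is a minimal sketch using only java.util.regex. The class name MyStyleTableRegex and the method extractText are made up for illustration. It assumes the class attribute is exactly mystyle (a multi-class value would not match) and that the matching tables do not contain nested tables: the reluctant `(.*?)` stops at the first `</table>`, which is precisely the nesting problem the reply warns about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MyStyleTableRegex {
    // Opening <table> tag whose class attribute is exactly "mystyle" (single or
    // double quotes), then the shortest possible body up to the next </table>.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*\\bclass\\s*=\\s*(['\"])mystyle\\1[^>]*>(.*?)</table\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Any single tag; used to strip markup from the captured body.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the tag-stripped text of every matching table, in document order. */
    static List<String> extractText(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TABLE.matcher(html);
        while (m.find()) {
            String inner = m.group(2); // group 1 is the quote character
            result.add(TAG.matcher(inner).replaceAll("").trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table class='mystyle'><tr><td>first</td></tr></table>"
                + "<table class=\"other\"><tr><td>skip</td></tr></table>"
                + "<table class=\"mystyle\"><tr><td>second</td></tr></table>";
        System.out.println(extractText(html)); // prints [first, second]
    }
}
```

Because the pattern is compiled once and the loop uses `find()`, the same method can be reused on any page string. For pages with nested tables, a real HTML parser is likely the safer choice.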
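The substring/indexOf loop suggested in the replies can also cope with nested tables if it counts how many tables are currently open. The sketch below is one possible way to do that, not a method taken from the thread. NestedTableScanner, extract and the classMarker parameter are hypothetical names, and the code assumes lowercase tag names and that the caller passes the class attribute exactly as it is written in the page (for example `class='mystyle'`).

```java
import java.util.ArrayList;
import java.util.List;

final class NestedTableScanner {
    private static final String OPEN = "<table";
    private static final String CLOSE = "</table";

    /**
     * Returns the inner HTML of every outermost <table> whose opening tag
     * contains classMarker. Matching tables nested inside an already
     * extracted table are not reported separately.
     */
    static List<String> extract(String html, String classMarker) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf(OPEN, pos);
            if (open < 0) break;
            int openEnd = html.indexOf('>', open);
            if (openEnd < 0) break;
            String openTag = html.substring(open, openEnd + 1);
            if (!openTag.contains(classMarker)) {
                pos = openEnd + 1; // not the table we want, keep scanning
                continue;
            }
            int depth = 1;            // the table we just entered
            int cursor = openEnd + 1;
            int closeStart = -1;
            while (depth > 0) {
                int nextOpen = html.indexOf(OPEN, cursor);
                int nextClose = html.indexOf(CLOSE, cursor);
                if (nextClose < 0) break; // unbalanced markup
                if (nextOpen >= 0 && nextOpen < nextClose) {
                    depth++;              // a nested table starts first
                    cursor = nextOpen + OPEN.length();
                } else {
                    depth--;              // a table closes first
                    closeStart = nextClose;
                    cursor = nextClose + CLOSE.length();
                }
            }
            if (depth != 0) break;        // give up on unbalanced input
            out.add(html.substring(openEnd + 1, closeStart));
            pos = cursor;
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<table class='mystyle'>a<table>b</table>c</table>"
                + "<table class='x'>n</table>"
                + "<table class='mystyle'>d</table>";
        System.out.println(extract(html, "class='mystyle'"));
        // prints [a<table>b</table>c, d]
    }
}
```

The depth counter is what a plain reluctant regex lacks: the outer table is only closed once every nested `<table` has seen its own `</table`.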
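One reply suggests converting the page to XML and then querying it with XPath. The sketch below shows only the XPath half, using the JDK's javax.xml APIs. It assumes the markup has already been turned into well-formed XML (for instance by nekohtml, which is not shown here), and that element names in the XPath expression match the case the converter emits. XPathDivExtractor and textOf are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathDivExtractor {
    /** Evaluates xpathExpr against well-formed XML and returns each node's text. */
    static List<String> textOf(String xml, String xpathExpr) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            // Malformed XML or a bad XPath expression ends up here
            throw new IllegalArgumentException("Could not evaluate " + xpathExpr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><div class=\"class_right_list\">"
                + "<div class=\"title1\">Title</div>"
                + "<div class=\"big\"><p>Body text</p></div>"
                + "</div></body></html>";
        System.out.println(textOf(xml, "//div[@class='class_right_list']/div[@class='title1']"));
        System.out.println(textOf(xml, "//div[@class='class_right_list']//p"));
    }
}
```

The selection rule lives entirely in the XPath string, so the same method can be reused for the `<table class='mystyle'>` case by passing `//table[@class='mystyle']`.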
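Two things in the posted code look suspicious, offered as observations rather than a confirmed diagnosis. First, HasAttributeFilter is given an empty class value rather than class_right_list. Second, the stream is decoded with the platform default charset, and "GBK" is only named afterwards, once the bytes have already been turned into a String. The sketch below addresses only the second point with the standard library. PageReader and readAll are hypothetical names, and the charset is left as a parameter because the page's real encoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageReader {
    /** Reads the whole stream as text, decoding the bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the stream) when done
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n'); // readLine() drops the line break
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // For a real page one would pass urlConnection.getInputStream() and,
        // if the page really is GBK-encoded, Charset.forName("GBK").
        byte[] bytes = "<div>\u4ef7\u683c</div>\n<p>ok</p>"
                .getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.print(html);
    }
}
```

Re-appending the line break keeps text on adjacent lines from running together, which the original `is.append(date)` loop does not do.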