Looking for an htmlparser expert: I want to get the text inside <p> tags, but remove the text inside <a href="..">.

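I want to get the text inside <p> tags, but remove the text inside <a href="..">. I am trying to scrape the body of news pages, and I found that on about 90% of news pages the body text is written inside <p> tags, so I extract the text under <p>. However, on some sites the <p> also contains <a href="..">Ad</a>, and I want to drop the "Ad" text. How can I do that? I have been trying for a long time. Thanks.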
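import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

// My attempt so far: collect every <p>, drop the ones that contain a link, then extract the text.
// Note that this discards the whole paragraph, not just the link text.
private String extractPText(String content) {
    String ptext = "";
    try {
        Parser parser = new Parser(content);
        parser.setEncoding("GB2312");
        TagNameFilter filter = new TagNameFilter("p");
        TagNameFilter filterA = new TagNameFilter("a");
        NodeList nodelist = parser.extractAllNodesThatMatch(filter);
        System.out.println("Number of <p> tags: " + nodelist.size());
        // Iterate backwards so that remove(i) does not skip the following element.
        for (int i = nodelist.size() - 1; i >= 0; i--) {
            NodeList children = nodelist.elementAt(i).getChildren();
            // Search the children recursively for <a> tags instead of looking for the letter "a".
            if (children != null && children.extractAllNodesThatMatch(filterA, true).size() > 0) {
                nodelist.remove(i);
            }
        }
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        nodelist.visitAllNodesWith(visitor);
        ptext = visitor.getExtractedText();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return ptext.trim();
}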

Solutions »


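HTTPParser: parsing the rows and columns of a Table in HTML, manually specifying which table to parse
http://blog.csdn.net/eqxu/archive/2007/06/06/1640699.aspx
Using html parser to get the attribute name/value pairs of a Form in an HTML page
http://blog.csdn.net/eqxu/archive/2007/05/29/1629820.aspx
You are better off doing this with regular expressions: first match <p>, then match <a> inside it. Look up the details yourself.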
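
The regex suggestion above could be sketched roughly as follows. This is only a minimal illustration using java.util.regex from the standard library, not htmlparser; the class name ParagraphTextExtractor is hypothetical. It assumes every <p> and <a> is properly closed, does not decode entities such as &nbsp;, and, like any regex-on-HTML approach, can break on unusual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser: a regex-only sketch.
final class ParagraphTextExtractor {
    // Body of each <p ...>...</p>; DOTALL lets a paragraph span several lines.
    private static final Pattern P_BLOCK = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // A whole link, from <a ...> up to the next </a>, including its text.
    private static final Pattern A_BLOCK = Pattern.compile(
            "<a\\b[^>]*>.*?</a\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag such as <b> or <br/>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String extractPText(String html) {
        StringBuilder text = new StringBuilder();
        Matcher p = P_BLOCK.matcher(html);
        while (p.find()) {
            String body = p.group(1);
            body = A_BLOCK.matcher(body).replaceAll(""); // drop links together with their text
            body = ANY_TAG.matcher(body).replaceAll(""); // strip other markup, keep its text
            body = body.trim();
            if (!body.isEmpty()) {
                if (text.length() > 0) {
                    text.append('\n'); // one line per non-empty paragraph
                }
                text.append(body);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>News body<a href=\"..\">Ad</a> continues.</p>"
                + "<p><a href=\"..\">Ad</a></p>";
        System.out.println(extractPText(html));
    }
}
```

Paragraphs that contain nothing but a link come out empty and are skipped, while the rest of a paragraph around a link is kept. If you would rather stay with htmlparser, the same idea applies: walk the children of each <p> and skip the link subtrees when collecting text.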

