How do I analyze a web page's HTML source?

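I need to write a program that analyzes the source of multiple web pages, for example extracting the title text from <title>Title</title>, or counting how many times a given tag appears and reading that tag's values.
What would be a convenient way to do this? Thanks.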

Solutions »

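Use htmlparser to convert the page's HTML source into XML; after that the analysis is very simple.
Convert it to XML first, then parse the XML. That is easier.
If you handle it purely with hand-written string logic, you will probably end up writing a lot of conditional checks.
Here are two snippets of code for you to look at: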
    /**
     * Extracts the main content of a page.
     *
     * @param input the HTML text of the page
     * @throws Exception
     */
    private String getLinkInfo(String input) throws Exception {
        String needText = "";
        int beginIndex = input.indexOf("<body");
        int endIndex = input.indexOf("</body>");
        // Only slice when both tags exist and appear in the expected order;
        // otherwise substring() would throw.
        if (beginIndex != -1 && endIndex > beginIndex) {
            needText = input.substring(beginIndex, endIndex);
            // Skip past the end of the opening <body ...> tag.
            beginIndex = needText.indexOf(">");
            needText = needText.substring(beginIndex + 1);
        }
        String bodyText = this.parseContent(needText);
        // removeLinks and processImages are helper methods not shown in this post.
        bodyText = this.removeLinks(bodyText);
        bodyText = this.processImages(bodyText);
        return bodyText;
    }
    ////////////////////////////////////////////////////////////////
    private String parseContent(String contentText) throws Exception {
        final String endTag = "</div>";
        String content = "";
        int beginIndex = contentText.indexOf("<div class=\"Con\">");
        if (beginIndex != -1) {
            content = contentText.substring(beginIndex);
            // Cuts at the first </div>, so a <div> nested inside the
            // content block will truncate the result early.
            int endIndex = content.indexOf(endTag);
            if (endIndex != -1) {
                content = content.substring(0, endIndex + endTag.length());
            }
        }
        return content;
    }
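If the page's HTML structure is simple and you only need to extract a small amount of information, plain regular expressions are the most convenient option.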
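For the simple case, something like the following might be enough. It is only a rough sketch using java.util.regex: the class and method names (SimpleHtmlRegex, extractTitle, countTag) are made up for illustration, and it assumes the tags of interest do not also appear inside comments or scripts, where a regex would count them too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from any library.
class SimpleHtmlRegex {
    // Matches <title ...>text</title>, ignoring case and allowing line breaks inside.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or null when no <title> element is found. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Counts opening tags with the given name, e.g. countTag(html, "a"). */
    static int countTag(String html, String tagName) {
        // The lookahead requires whitespace, '/' or '>' right after the name,
        // so searching for "a" does not also count "<abbr>".
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(?=[\\s/>])[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Usage would be along the lines of SimpleHtmlRegex.extractTitle(pageSource) and SimpleHtmlRegex.countTag(pageSource, "a"), where pageSource is the downloaded HTML as a String.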
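If you need to analyze and extract a larger amount of information and the page structure is complex, I'd suggest combining HTMLParser + Dom4J, or HtmlCleaner + Dom4J. Use the former to turn the HTML into a well-formed XML document, then use Dom4J (its XPath support is quite powerful) to extract the information quickly and conveniently.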
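To give an idea of the XPath step, here is a small sketch. Note the swap: it uses the JDK's built-in javax.xml.xpath instead of Dom4J, just so it runs without extra jars. It assumes the cleaning step has already produced well-formed XML without a namespace declaration (the cleaning itself is not shown), and the class and method names are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is already well-formed XML.
class XhtmlXPathSketch {
    /** Parses well-formed XML text into a DOM document. */
    static Document parse(String wellFormedXml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Returns the text of /html/head/title (empty string when absent). */
    static String title(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/html/head/title", doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts all elements with the given name anywhere in the document. */
    static int count(Document doc, String tag) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // tag is concatenated directly; fine for a sketch with trusted input only.
            Double n = (Double) xpath.evaluate(
                    "count(//" + tag + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Collects the values of one attribute across all elements named tag. */
    static List<String> attributeValues(Document doc, String tag, String attr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//" + tag + "/@" + attr, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Dom4J the queries would look much the same; only the parsing and node types differ. The point is that once the page is well-formed XML, "get the title", "count a tag" and "read a tag's values" each become a one-line XPath expression.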
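Use htmlparser to convert the page's HTML source into XML and then parse that. I'd also suggest the OP find a book on the subject, for example
《java网络机器人》 (roughly, "Java Web Robots").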