Help: parsing HTML in Java

Help: which Java class can parse the tags in an HTML document? For example, if I want to parse <a href=""></a> out of a page, which class should I use?

Solutions »


For example, to get the URL behind the 社区 ("community") link on the page http://csbbs.soufun.com/2710156784~-2~683/5236858_5236858.htm, you can do this:
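import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlCodeRegex
{
  public static void main(String[] args)
  {
    try
    {
      // Fetch the remote page
      String ur = "http://csbbs.soufun.com/2710156784~-2~683/5236858_5236858.htm";
      URL myUrl = new URL(ur);
      URLConnection con = myUrl.openConnection();
      StringBuilder sb = new StringBuilder();
      // try-with-resources closes the reader even if reading fails.
      // InputStreamReader decodes with the platform default charset here; if the
      // page uses a different encoding, the Chinese link text will not match.
      try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream())))
      {
        String str;
        while ((str = in.readLine()) != null)
        {
          sb.append(str);
        }
      }

      // [^"]* keeps the capture inside the href value; with find() the
      // leading and trailing .* are not needed
      Pattern p = Pattern.compile("<a href=\"([^\"]*)\">社区</a>");
      Matcher m = p.matcher(sb.toString());
      // group(1) throws IllegalStateException unless a match succeeded, so check first
      if (m.find())
      {
        System.out.println("The community URL is " + m.group(1));
      }
      else
      {
        System.out.println("No matching link found");
      }
    }
    catch (MalformedURLException mfURLe) {
      System.out.println("MalformedURLException: " + mfURLe);
    }
    catch (IOException ioe) {
      System.out.println("IOException: " + ioe);
    }
  }
}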
The JDK ships with the javax.swing.text.html.parser package, which can do this.
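To make that suggestion concrete, here is a minimal sketch of pulling every href out of the <a> tags with the JDK's Swing HTML parser. The class name SwingLinkExtractor, the method extractHrefs, and the sample markup are all made up for illustration; only the javax.swing.text.html classes come from the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK or of the original posts
final class SwingLinkExtractor {

    /** Collects the href attribute of every <a> start tag, in document order. */
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations
            // in the markup, since the input is already decoded text
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch
        String html = "<html><body>"
                + "<a href=\"http://example.com/news\">News</a>"
                + "<p>text <a href=\"/bbs\">Community</a>"
                + "</body></html>";
        System.out.println(extractHrefs(html));
    }
}
```

Unlike a regex, this does not care about attribute order or extra attributes on the tag. To read from a live page, the StringReader could be swapped for a reader over the URLConnection's input stream.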
Or this version of the regex code is better:
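/**
* This program gets the URL that a given keyword links to in a page's source,
* e.g. <a href="http://www.sina.com" target="blank">新浪</a>
*/
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlCodeRegex
{
  public static void main(String[] args)
  {
    try
    {
      // Fetch the remote page
      String ur = "http://csbbs.soufun.com/2710156784~-2~683/5236858_5236858.htm";
      URL myUrl = new URL(ur);
      URLConnection con = myUrl.openConnection();
      StringBuilder sb = new StringBuilder();
      // try-with-resources closes the reader even if reading fails
      try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream())))
      {
        String str;
        while ((str = in.readLine()) != null)
        {
          sb.append(str);
        }
      }

      // Group 1 is an absolute http:// URL. [^>]* allows extra attributes such as
      // target="blank" but cannot run past the end of the tag. The hyphen sits
      // last in the character class so it is read as a literal.
      Pattern p = Pattern.compile("<a href=\"(http://([\\w-]+\\.)+[\\w-]+(/[\\w ./?%&=-]*)?)\"[^>]*>社区</a>");
      Matcher m = p.matcher(sb.toString());
      // group(1) throws IllegalStateException unless a match succeeded, so check first
      if (m.find())
      {
        System.out.println("The community URL is " + m.group(1));
      }
      else
      {
        System.out.println("No matching link found");
      }
    }
    catch (MalformedURLException mfURLe) {
      System.out.println("MalformedURLException: " + mfURLe);
    }
    catch (IOException ioe) {
      System.out.println("IOException: " + ioe);
    }
  }
}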
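The regex idea in the two programs above can be tried offline against an in-memory string. The following is only a sketch: the class LinkByText, the method hrefFor, and the sample markup (with "Community" standing in for 社区) are invented for illustration, and a regex like this still assumes double-quoted href values.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the original posts
final class LinkByText {

    /** Returns the href of the first <a> tag whose text equals linkText. */
    static Optional<String> hrefFor(String html, String linkText) {
        // [^>]* and [^"]* cannot cross a tag or quote boundary, so the capture
        // stays inside one tag; Pattern.quote treats linkText as a literal
        Pattern p = Pattern.compile(
                "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>\\s*" + Pattern.quote(linkText) + "\\s*</a>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        // find() searches anywhere in the input; matches() would require the
        // pattern to cover the whole page
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch
        String html = "<p><a href=\"http://example.com/news\">News</a> "
                + "<a href=\"http://example.com/bbs\" target=\"_blank\">Community</a></p>";
        System.out.println(hrefFor(html, "Community").orElse("(not found)"));
        // prints http://example.com/bbs
    }
}
```

Returning an Optional makes the "no such link" case explicit instead of letting group(1) throw.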
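Use javax.xml.parsers.SAXParser and just write your own handler, like this:
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
// SaxLinks is a placeholder class name; the original post showed only main()
public class SaxLinks {
  public static void main(String[] args) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    SAXParser parser = factory.newSAXParser();
    URL url = new URL("xxxx"); // xxxx is your URL
    URLConnection con = url.openConnection();
    MyHandler myhandler = new MyHandler();
    parser.parse(con.getInputStream(), myhandler);
    System.out.println(myhandler.list);
  }
}

class MyHandler extends DefaultHandler {
  public List<String> list = new ArrayList<>();
  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) throws SAXException {
    // The first argument is the namespace URI, so compare the element name instead
    if ("a".equals(localName) || "a".equals(qName)) {
      list.add(attributes.getValue("href"));
    }
  }
}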
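I just tested this.
Parsing this way has a weakness: the HTML has to be well-formed XML, and most web pages are not. I think the approach in post #7 is probably better.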
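The limitation described here is easy to reproduce without a network connection. The sketch below feeds two invented snippets to the JDK's SAX parser: one well-formed, and one with an unquoted attribute and an unclosed <br>, as found on many real pages. The class SaxStrictnessDemo and the method tryExtractHrefs are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from the original posts
final class SaxStrictnessDemo {

    /** Returns the hrefs of all <a> elements, or empty if the markup is not well-formed XML. */
    static Optional<List<String>> tryExtractHrefs(String markup) {
        List<String> hrefs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The factory below is not namespace-aware, so qName holds the tag name
                String href = attrs.getValue("href");
                if ("a".equals(qName) && href != null) {
                    hrefs.add(href);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
            return Optional.of(hrefs);
        } catch (SAXException e) {
            // Raised when the input is not well-formed XML
            return Optional.empty();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wellFormed = "<html><body><a href=\"http://example.com/\">x</a></body></html>";
        String typicalHtml = "<html><body><a href=http://example.com/>x</a><br></body></html>";
        System.out.println(tryExtractHrefs(wellFormed));
        System.out.println(tryExtractHrefs(typicalHtml));
    }
}
```

The first call yields the link; the second comes back empty because the parser rejects the unquoted attribute value before it reaches any handler code. That is why a tolerant HTML parser tends to be a better fit for pages found in the wild.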
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;

public class ExtractTable {
  // The HTML content to parse
  static String html = "<html><head></head><body>"
   + "<table><tr><td>hello table</td></tr></table> <table><tr><td>hello table<a href=http://www.baidu.com>tt</a></td></tr></table>  "
   + "</body></html>";
  // static String html = "http://www.baidu.com";

  public static void test5(String resource) throws Exception {
    Parser myParser = new Parser(resource);
    // Parser parser = new Parser(content);
    // Set the encoding
    myParser.setEncoding("utf-8");
    // Extract the elements whose tag name is "table"
    String filterStr = "table";
    NodeFilter filter = new TagNameFilter(filterStr); // filter on this tag
    NodeList nodeList = myParser.extractAllNodesThatMatch(filter); // extract all the tables
    for (int i = 0; i < nodeList.size(); i++) {
      TableTag tabletag = (TableTag) nodeList.elementAt(i);
      System.out.println(tabletag.toHtml()); // print it
    }
  }

  /**
  * @param args
  * @throws Exception
  */
  public static void main(String[] args) throws Exception {
    // A URL also works here, e.g. pass "http://www.baidu.com" instead of html
    test5(html);
  }
}
At the line String filterStr="table"; I changed it to filterStr="a"; and now it throws an error. Why??