How do I scrape articles with Java?

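I'm using the code below to fetch the HTML source of a web novel's index page, and now I want to extract the title of every chapter from it. How should I write that? Any help would be appreciated!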
package cn.test;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaiJi01 {

    public static void main(String[] args) {
        CaiJi01 cj = new CaiJi01();
        String s = cj.getOneHtml();
        System.out.println(s);
    }

    public String getOneHtml() {
        URL url;
        String temp;
        final StringBuffer sb = new StringBuffer();
        try {
            url = new URL("http://www.luoqiu.com/html/38/38320/index.html");
            final BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            while ((temp = in.readLine()) != null) {
                sb.append(temp);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}

Solutions »


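This is really just HTML parsing. Get jsoup-1.6.1.jar,
use Parser.parse(sb.toString(), "") to build a Document object, then call select on that object (not on the Document class):
Document doc = Parser.parse(sb.toString(), "");
Elements trs = doc.select("tr[class=smallText]");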
Could you give a small example? Say the HTML source contains these:
<a href="4015239.html">第一章星空中的青铜巨棺</a>
<a href="4026955.html">第二章素问</a>
<a href="4052138.html">第三章今昔</a>
I want to get all the titles like "第一章星空中的青铜巨棺". How do I write that?
Document doc = Parser.parse(html, "");
Elements links = doc.select("a"); // the titles are in <a> tags, not <span>
for (Element link : links) { // Elements is a List<Element>, not an array
 System.out.println(link.text());
}
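If adding the jsoup jar isn't an option, the Pattern and Matcher classes that the question's code already imports can also pull out the link text. Below is a minimal sketch, assuming the chapter links look like the three sample <a> tags above and that every chapter title starts with "第". The class name ChapterTitleRegex and the method extractTitles are made up for illustration. Regexes are brittle on real-world HTML, so treat this as a quick hack rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChapterTitleRegex {

    // Matches <a ... href="...">text</a>; group 1 is the href, group 2 the link text.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a>", Pattern.CASE_INSENSITIVE);

    // The character "第", written as an escape so the source file encoding does not matter.
    private static final String CHAPTER_PREFIX = "\u7b2c";

    // Hypothetical helper: returns the text of every link whose text starts with "第".
    public static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String text = m.group(2).trim();
            if (text.startsWith(CHAPTER_PREFIX)) {
                titles.add(text);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Sample input copied from the question, plus one non-chapter link that is skipped.
        String html = "<a href=\"4015239.html\">第一章星空中的青铜巨棺</a>"
                + "<a href=\"4026955.html\">第二章素问</a>"
                + "<a href=\"4052138.html\">第三章今昔</a>"
                + "<a href=\"/index.html\">Home</a>";
        for (String title : extractTitles(html)) {
            System.out.println(title);
        }
    }
}
```

To use it with the question's code, pass the string returned by getOneHtml() to extractTitles. If you also need the chapter URLs, m.group(1) holds the href value.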
I had some spare time at work today, so I put this together for you:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * @author Sugar.Tan
 * 2011-08-05
 */
public class ReadNetXml {
    public static void main(String[] args) throws Exception {
        List<String> lstTitle = new ArrayList<String>();
        URL url = new URL("http://www.luoqiu.com/html/38/38320/index.html");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuffer sb = new StringBuffer();
        String temp;
        while ((temp = in.readLine()) != null) {
            sb.append(temp);
        }
        in.close();
        while (true) {
            int start = sb.indexOf("<a");
            if (start == -1) break; // no more links, done
            sb.delete(0, start); // drop everything before the next <a
            int end = sb.indexOf("</a>");
            if (end == -1) break; // unclosed <a>: nothing more to extract
            start = sb.indexOf("第");
            if (start > 0 && start < end) {
                lstTitle.add(sb.substring(start, end));
            }
            // Otherwise "第" is missing or comes after "</a>", so this link is not a chapter title.
            sb.delete(0, end + "</a>".length()); // drop this link, including its </a>
        }

        for (String str : lstTitle) {
            System.out.println(str);
        }
    }
}
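One thing to watch for with the fetching code in this thread: new InputStreamReader(url.openStream()) decodes the bytes with the platform default charset, so Chinese titles can come out garbled if the page uses a different encoding. Here is a rough sketch of reading with an explicit charset. PageReader and readAll are hypothetical names, and which charset the site actually uses is an assumption you would need to check (look at the Content-Type header or the page's <meta> tag). The demo feeds bytes from memory instead of hitting the network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class PageReader {

    // Hypothetical helper: reads the whole stream as text using the given charset name.
    public static String readAll(InputStream in, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)));
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep a line break so text from adjacent lines does not run together.
                sb.append(line).append('\n');
            }
        } finally {
            reader.close(); // closed even if reading fails
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): one sample link encoded as UTF-8 bytes.
        byte[] bytes = "<a href=\"4026955.html\">\u7b2c\u4e8c\u7ae0</a>".getBytes("UTF-8");
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8"));
    }
}
```

With a real page you would call readAll(url.openStream(), "GBK") or whatever charset name the page declares, then hand the result to the title extraction code.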