I wrote a web crawler that stores URLs in three database tables: tocrawl holds URLs that have been discovered but not yet crawled, crawled holds URLs that have already been crawled, and result holds the URLs of pages that contain the specified string searchString. The program runs, and I can see results in the database, but after a few minutes it throws this error:
java.lang.NullPointerException
at com.microsoft.jdbc.base.BaseResultSet.cancelPendingUpdates(Unknown Source)
at com.microsoft.jdbc.base.BaseResultSet.next(Unknown Source)
at crawler.SearchCrawler.crawl(SearchCrawler.java:369)
at crawler.SearchCrawler.run(SearchCrawler.java:87)
Line 369 is the line `while (rs.next())`, which my IDE marks with a red underline. I really cannot see what is wrong with it; I would appreciate any advice.
The core of the program:

public void crawl(String startUrl, String searchString, boolean limithost, boolean caseSensitive) throws SQLException {
    URL verifiedUrl;
    String pageContents;
    ArrayList<String> links;
    String sql;
    String s;
    String str;
    ResultSet rs;
    Statement st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);

    if (searchString.length() < 1) {
        System.out.println("Missing search String");
        System.exit(0);
    }

    // Strip www from the start URL
    startUrl = removeWwwFromUrl(startUrl);

    // Save the initial URL into the tocrawl table
    sql = "insert into tocrawl(tocrawl) values('" + startUrl + "')";
    st.executeUpdate(sql);

    rs = st.executeQuery("select * from tocrawl");
    while (rs.next()) {
        // Remove the URL from the tocrawl table.
        s = rs.getString(1);
        verifiedUrl = verifyUrl(s);
        rs.deleteRow();
        rs.close();
        st.close();
        if (!isRobotAllowed(verifiedUrl)) {
            continue;
        }

        // Add the processed URL to the crawled table
        st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
        sql = "insert into crawled(crawled) values('" + s + "')";
        st.executeUpdate(sql);
        st.close();

        pageContents = downloadPage(verifiedUrl);
        if (pageContents != null && pageContents.length() > 0) {
            links = retrieveLinks(verifiedUrl, pageContents, limitHost);
            Iterator<String> iter = links.iterator();
            while (iter.hasNext()) {
                st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
                str = iter.next();
                sql = "insert into tocrawl(tocrawl) values('" + str + "')";
                st.executeUpdate(sql);
                st.close();
            }
            if (searchStringMatches(pageContents, searchString, caseSensitive)) {
                st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
                sql = "insert into result(result) values('" + s + "')";
                st.executeUpdate(sql);
                st.close();
            }
        }
        st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_UPDATABLE);
        rs = st.executeQuery("select * from tocrawl");
    }
    rs.close();
    st.close();
}
Bump.
st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
rs=st.executeQuery("select * from tocrawl");
I close them because the records in the tocrawl table have changed, so the query needs to be re-run.
I did close them, but I open a new connection right afterwards.
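The close-then-reopen path can be traced with a small, self-contained analogy. This sketch uses java.util.Scanner in place of the JDBC ResultSet (a closed Scanner also refuses further iteration, though it throws IllegalStateException rather than the driver-internal NullPointerException seen in the stack trace): the `continue` in the robots.txt branch jumps back to the loop condition after the cursor has been closed, but before the reopening code at the bottom of the loop body has run.

```java
import java.util.Scanner;

public class ClosedCursorDemo {
    // Mirrors the crawl() control flow: the cursor is closed inside the loop,
    // and a `continue` jumps back to the loop test before it is reopened.
    static String demonstrate() {
        Scanner sc = new Scanner("a b c"); // stands in for the tocrawl ResultSet
        try {
            while (sc.hasNext()) {          // stands in for while (rs.next())
                String token = sc.next();
                sc.close();                 // like rs.close()/st.close() in the loop body
                if (token.equals("a")) {
                    continue;               // skips the reopen below, like the robots check
                }
                sc = new Scanner("b c");    // cursor is reopened only on the normal path
            }
            return "no error";
        } catch (IllegalStateException e) {
            // A JDBC driver in the same situation may surface this differently,
            // e.g. as a NullPointerException inside ResultSet.next().
            return "failed on closed cursor";
        }
    }

    public static void main(String[] args) {
        System.out.println(demonstrate()); // prints "failed on closed cursor"
    }
}
```

If this is indeed the path being taken, moving the reopen before the `continue` (or restructuring the loop so the cursor is never closed mid-iteration) should avoid the crash.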
st = getConnection().createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
rs=st.executeQuery("select * from tocrawl");
I close them because the records in the tocrawl table have changed, so the query needs to be re-run.
-------------------------------------------------------------
Opening database connections this frequently is a sign that the business logic itself has gone wrong. Rethink the overall design first, then act.
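One possible restructuring along the lines of the advice above is to keep the crawl frontier in memory instead of deleting from and re-querying the tocrawl table on every row, so no cursor ever has to be closed mid-loop. The sketch below is an illustration of that control flow only: link extraction is stubbed out as a plain Map, and the database writes are omitted, so it is not a drop-in replacement for crawl().

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FrontierSketch {
    // An in-memory frontier replaces the tocrawl table cursor; a Set replaces
    // the crawled table. `links` stubs out retrieveLinks() for this sketch.
    static List<String> crawl(String startUrl, Map<String, List<String>> links) {
        Deque<String> toCrawl = new ArrayDeque<>();   // pending URLs
        Set<String> crawled = new LinkedHashSet<>();  // already-visited URLs
        toCrawl.add(startUrl);
        while (!toCrawl.isEmpty()) {
            String url = toCrawl.poll();              // no deleteRow(), no closed cursor
            if (!crawled.add(url)) {
                continue;                             // safe: nothing to reopen
            }
            for (String next : links.getOrDefault(url, List.of())) {
                toCrawl.add(next);
            }
        }
        return List.copyOf(crawled);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"));
        System.out.println(crawl("a", links)); // prints [a, b, c]
    }
}
```

With this shape, a single connection (ideally with a PreparedStatement for the inserts into crawled and result) can be held for the whole crawl, rather than being created and torn down several times per page.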