Problems with garbled characters when crawling web pages

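This program is meant to save a web page to a local file without garbled characters, and to extract the page's title and keywords successfully.
If the lines under "注释一" (Comment 1) stay commented out, the while loop under "注释二" (Comment 2) runs normally and the rest of the page can be parsed from it. If the lines under Comment 1 are not commented out, the while loop under Comment 2 does not run properly and no URLs can be parsed from the page.

public void getWebByUrl(String strUrl, String charset, String fileIndex) {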
try {
// if(charset==null||"".equals(charset))charset="utf-8";
System.out.println("Getting web by url: " + strUrl);
addReport("Getting web by url: " + strUrl + "\n");
URL url = new URL(strUrl);

byte bytes[] = new byte[1024 * 1000];
int index = 0;

URLConnection conn = url.openConnection();
conn.setDoOutput(true);
InputStream is = null;
is = url.openStream();
// 注释一 (Comment 1)
// int count = is.read(bytes, index, 1024 * 100);//
// while (count != -1) {//
// index += count;
// count = is.read(bytes, index, 1);
// }
String filePath = fPath + "/web" + fileIndex + ".htm";
//PrintWriter pw = null;
FileOutputStream fos = new FileOutputStream(filePath);

//OutputStreamWriter writer = new OutputStreamWriter(fos);
//pw = new PrintWriter(writer);
BufferedReader bReader = new BufferedReader(new InputStreamReader(
is));
StringBuffer sb = new StringBuffer();
String rLine = null;
String tmp_rLine = null;
// 注释二 (Comment 2)
while ((rLine = bReader.readLine()) != null) {
tmp_rLine = rLine;
int str_len = tmp_rLine.length();
if (str_len > 0) {
sb.append("\n" + tmp_rLine);
//pw.println(tmp_rLine);
//pw.flush();
if (deepUrls.get(strUrl) < webDepth)
getUrlByString(tmp_rLine, strUrl);
}

}

tmp_rLine = null;
fos.write(bytes, 0, index);//
is.close();
//pw.close();
fos.close();

String context = sb.toString();
String tt = getTitle(context);
String t = getKeywords(context);

System.out.println("Get web successfully! " + strUrl);

System.out.println("title:" + tt);
System.out.println("keywords:" + t);

addReport("Get web successfully! " + strUrl + "\n");
addWebSuccessed();
} catch (Exception e) {
System.out.println("Get web failed!       " + strUrl);
addReport("Get web failed!       " + strUrl + "\n");
addWebFailed();
}
}
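
A likely explanation for the behaviour described in the question: an InputStream can be consumed only once. When the Comment 1 loop is active it reads `is` to the end, so the BufferedReader created on the same stream afterwards gets null from its first readLine() and the Comment 2 loop never runs. When the Comment 1 loop is commented out, `index` stays 0 and `fos.write(bytes, 0, index)` writes nothing to the file. In addition, `new InputStreamReader(is)` without a charset falls back to the JVM's default charset, which may not match the page's encoding. The following is only a minimal sketch, assuming the page fits in memory: read the bytes once, save those bytes unchanged, then decode them with an explicit charset for parsing. The class `PageFetchSketch` and the helpers `readAll` and `decode` are hypothetical names, not part of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageFetchSketch {

    // Drain the stream exactly once into a growable in-memory buffer.
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode with an explicit charset; assume UTF-8 when none is given.
    static String decode(byte[] raw, String charset) {
        Charset cs = (charset == null || charset.isEmpty())
                ? StandardCharsets.UTF_8
                : Charset.forName(charset);
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): a small page held in memory.
        byte[] page = "<title>\u6d4b\u8bd5</title>\n<a href=\"http://example.com/a\">a</a>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] raw = readAll(new ByteArrayInputStream(page));
        // In getWebByUrl this is where fos.write(raw) would save the page unchanged.
        String html = decode(raw, "UTF-8");
        for (String line : html.split("\n")) {
            // Each line could be passed to getUrlByString(line, strUrl) here.
            System.out.println(line);
        }
    }
}
```

Writing the raw bytes keeps the saved file identical to what the server sent, while the decoded string is used only for title, keyword and link extraction. For many Chinese pages the charset passed to `decode` would be something like "GBK" or "GB2312" rather than UTF-8.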
public void getUrlByString(String inputArgs, String strUrl) {
String tmpStr = inputArgs;
String regUrl = "(?<=(href=)[\"]?[\']?)[http://][^\\s\"\'\\?]*("
+ myDomain + ")[^\\s\"\'>]*";
Pattern p = Pattern.compile(regUrl, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(tmpStr);
boolean blnp = m.find();
// int i = 0;
while (blnp == true) {
if (!allUrls.containsKey(m.group(0))) {
System.out.println("Find a new url,depth:"
+ (deepUrls.get(strUrl) + 1) + " " + m.group(0));
addReport("Find a new url,depth:" + (deepUrls.get(strUrl) + 1)
+ " " + m.group(0) + "\n");
arrUrls.add(m.group(0));
arrUrl.add(m.group(0));
allUrls.put(m.group(0), getIntWebIndex());
deepUrls.put(m.group(0), (deepUrls.get(strUrl) + 1));
}
tmpStr = tmpStr.substring(m.end(), tmpStr.length());
m = p.matcher(tmpStr);
blnp = m.find();
}
}
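
One detail in the pattern above may be worth checking: `[http://]` is a character class, so it matches a single character out of h, t, p, : and /, not the literal prefix "http://". The sketch below shows one possible alternative, under the assumption that only absolute http/https links are wanted: it captures the href value in a group and filters by domain afterwards, using a single `while (m.find())` loop instead of re-creating the matcher on a substring. `LinkExtractSketch`, `extractLinks` and the `domain` parameter (standing in for `myDomain`) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractSketch {

    // Matches href="...", href='...' or an unquoted href; group 1 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']?(https?://[^\\s\"'>]+)",
            Pattern.CASE_INSENSITIVE);

    // Return every absolute link in the text whose URL contains the domain.
    static List<String> extractLinks(String html, String domain) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url.toLowerCase().contains(domain.toLowerCase())) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://www.example.com/a.html\">a</a>"
                + "<a href='http://other.org/b'>b</a>";
        System.out.println(extractLinks(line, "example.com"));
        // prints [http://www.example.com/a.html]
    }
}
```

The `contains` check is a simplification: it also accepts a URL that merely mentions the domain in its path or query string, which the original pattern tries to avoid by excluding `?` before the domain.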

Solutions »
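
On the garbled characters themselves: the `charset` parameter of getWebByUrl is never used, so the text is decoded with whatever the JVM default happens to be. One common approach is to take the charset from the HTTP Content-Type header (for example via `URLConnection.getContentType()`) or from the page's own meta tag. The sketch below covers only the meta-tag case and assumes the raw bytes of the page are already in memory; `CharsetSniffSketch` and `sniffCharset` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffSketch {

    // Matches charset=gbk, charset="utf-8", charset='gb2312' and similar.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Look for a charset declaration near the start of the page;
    // return the fallback when none is found or the name is unsupported.
    static String sniffCharset(byte[] raw, String fallback) {
        int len = Math.min(raw.length, 4096);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String head = new String(raw, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1);
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = ("<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\">")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffCharset(page, "ISO-8859-1")); // prints utf-8
    }
}
```

The detected name can then be used to decode the same bytes, for example with `new String(raw, Charset.forName(name))`, before the title and keywords are extracted.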
