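First, here is the code.
The Spider class:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;

public class Spider implements Runnable{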
HttpURLConnection huc;
InputStream is;
BufferedReader reader;
String url;
public Spider(String str){
try {
url=str;
} catch (Exception e) {
e.printStackTrace();
}
try {
huc=(HttpURLConnection)new URL(url).openConnection();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
new Thread(this).start();
}
public void run() {
try {
huc.setRequestMethod("GET");
huc.setRequestProperty("user-agent","mozilla/4.0 (compatible; msie 6.0; windows 2000)");
} catch (ProtocolException e) {
e.printStackTrace();
}
try {
huc.setUseCaches(true);
huc.connect();
} catch (IOException e) {
e.printStackTrace();
}
try {
is=huc.getInputStream();
reader=new BufferedReader(new InputStreamReader(is,huc.getContentType().equals("text-html; charset=gb2312")?"gb2312":"UTF-8"));
String str;
System.out.flush();
while((str=reader.readLine())!=null){
System.out.println(str);
System.out.flush();
}
} catch (IOException e) {
e.printStackTrace();
}finally{
try {
reader.close();
is.close();
huc.disconnect();
} catch (IOException e) {
e.printStackTrace();
}
}
return;
}
}
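When I compare the program's output with what the browser shows under right-click > View Source, they differ.
The program's output looks like this:
<html>
<title>XXXXXXXXXXXXX</title>
</head>
<body > <head>
The <head> that should be on the second line ends up further down, so the lines come out in the wrong order. I don't know why. The position of the misplaced text is not fixed either: every run gives a different result. Can anyone explain this?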
public class Spider implements Runnable{
HttpURLConnection huc;
InputStream is;
BufferedReader reader;
String url;
public Spider(String str){
url=str;
huc=(HttpURLConnection)new URL(url).openConnection();
new Thread(this).start();
}
public void run() {
huc.setRequestMethod("GET");
huc.setRequestProperty("user-agent","mozilla/4.0 (compatible; msie 6.0; windows 2000)");
huc.setUseCaches(true);
huc.connect();
is=huc.getInputStream();
reader=new BufferedReader(new InputStreamReader(is,huc.getContentType().equals("text-html; charset=gb2312")?"gb2312":"UTF-8"));
String str;
System.out.flush();
while((str=reader.readLine())!=null){
System.out.println(str);
System.out.flush();
}
reader.close();
is.close();
huc.disconnect();
return;
}
}
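With right-click > View Source, the result is the same every time.
Other lines get misplaced too, though not many. It is not always <head> either. It looks random, though maybe I just haven't found the pattern; the position of the misplaced text also looks random.
Remove this,
The page is rendered output, so reading it this way is not right. Read it as bytes: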
int read = 0;
while ((read = is.read(buffer)) != -1) {
.............................
}
is.close();
is = null;
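The loop above leaves out what happens to each chunk. A minimal sketch of one way to fill it in, assuming a hypothetical helper class ByteReader and a caller that already knows the page encoding (neither is from the thread): every chunk is copied into a ByteArrayOutputStream and the bytes are decoded once at the end. Only the first `read` bytes of the buffer are valid after each call; ignoring that count is a common way to end up with repeated stale data in the result.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original posts.
final class ByteReader {
    /** Reads the whole stream into memory and decodes it once at the end. */
    static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            // Copy only the bytes actually read; the rest of the buffer is stale.
            out.write(buffer, 0, read);
        }
        // Decoding after the loop avoids splitting a multi-byte character
        // across two chunks.
        return new String(out.toByteArray(), Charset.forName(charsetName));
    }
}
```

With the connection from the question this could be called as `ByteReader.readAll(huc.getInputStream(), "gb2312")`, and the returned string printed in a single call.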
StringBuffer temp = new StringBuffer();
try {
System.out.println(leibie);
System.out.println(num);
String url = "http://www.yb983.com/jiaojing/ser.php";
HttpURLConnection uc = (HttpURLConnection)new URL(url).
openConnection();
uc.setConnectTimeout(10000);
uc.setDoOutput(true);
uc.setRequestMethod("GET");
uc.setUseCaches(false);
DataOutputStream out = new DataOutputStream(uc.getOutputStream()); // the parameters to send
String s = URLEncoder.encode("ra", "GB2312") + "=" +
URLEncoder.encode(leibie, "GB2312");
s += "&" + URLEncoder.encode("keyword", "GB2312") + "=" +
URLEncoder.encode(num, "GB2312");
// DataOutputStream.writeBytes writes each 16-bit Unicode character of the string to the stream as a single 8-bit byte (the high byte is discarded)
out.writeBytes(s);
out.flush();
out.close();
InputStream in = new BufferedInputStream(uc.getInputStream());
Reader rd = new InputStreamReader(in, "Gb2312");
int c = 0;
while ((c = rd.read()) != -1) {
temp.append((char) c);
}
System.out.println(temp.toString());
in.close();
} catch (Exception e) {
e.printStackTrace();
}
return temp.toString();
}
public static void main(String[] a){
test.cc("1","吉H");
}
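Copy and paste it, run it, and look at the console output. Replace the URL with the address of the page you want to fetch and pass in the matching parameters. You can use either POST or GET. I'm not sure whether this is what you are after.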
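On the POST-or-GET point: the snippet above sends its parameters in the request body, which is the POST style. For a GET request the same pairs normally go into the URL's query string instead. A minimal sketch, assuming a hypothetical helper QueryStrings.encode (not from the thread) that URL-encodes each name and value with the given charset:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper, not part of the original posts.
final class QueryStrings {
    /** Joins the parameters as name=value pairs separated by '&'. */
    static String encode(Map<String, String> params, String charsetName)
            throws UnsupportedEncodingException {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(URLEncoder.encode(entry.getKey(), charsetName))
                 .append('=')
                 .append(URLEncoder.encode(entry.getValue(), charsetName));
        }
        return query.toString();
    }
}
```

The result can be appended to the address, for example `new URL(url + "?" + QueryStrings.encode(params, "GB2312"))`, in which case nothing needs to be written to the connection's output stream. Use a LinkedHashMap if the parameter order matters.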
while((str=reader.readLine())!=null){
System.out.println(str);
System.out.flush();
}
(str=reader.readLine())!=null
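When reading page data, the page sometimes contains large stretches of whitespace that are not actually empty. I don't know whether that has any effect here.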
java.io.IOException: Stream closed
at java.io.BufferedReader.ensureOpen(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at Spider.run(Spider.java:44)
at java.lang.Thread.run(Unknown Source)
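It turns out the code I used before is the same as yours. Could you send me the URL so I can try it with mine:
/**
* Fetches a page and returns its source.
* @param tempurl address of the page
* @param code name of the character encoding used to decode the page
* @return the page content
*/
public static String getHtml(String tempurl, String code) {
try {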
URL url = new URL(tempurl);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.connect();
InputStream is = conn.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is, code));
String line = "";
StringBuffer resultBuffer = new StringBuffer();
while ((line = br.readLine()) != null) {
resultBuffer.append(line);
}
br.close();
is.close();
conn.disconnect();
return resultBuffer.toString();
} catch (Exception e) {
e.printStackTrace(); // report the failure instead of silently returning null
}
return null;
}
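getHtml expects the caller to pass the encoding name in code. A minimal sketch, assuming the server names the encoding in its Content-Type header (the value conn.getContentType() returns), of picking it out of that header rather than comparing the whole header against one fixed string. ContentTypes.charsetOf is a hypothetical helper, not something from the thread, and pages that declare their encoding only in a meta tag are not covered.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of the original posts.
final class ContentTypes {
    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback; // header present but without a charset parameter
    }
}
```

A call such as `ContentTypes.charsetOf(conn.getContentType(), StandardCharsets.UTF_8).name()` would then supply the value for code.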
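Thank you so much!
Perfect! That fixed it, the lines no longer come out misplaced!
But why? My earlier code looks much the same, so why did it produce misplaced lines?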
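The thread never settles the cause. One visible difference is that getHtml collects the whole page in a buffer and returns it, while the original Spider prints every line from its own thread as it arrives. If more than one thread writes to the console at the same time, their lines can interleave; whether that is what happened here is an assumption, not something confirmed in these posts. A minimal sketch of the buffered approach, using a hypothetical class PageCollector:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, not part of the original posts.
final class PageCollector {
    /** Reads every line from the source, keeping the line breaks. */
    static String collect(Reader source) throws IOException {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    /** Prints one page in a single call so other threads cannot cut into it. */
    static void printWhole(String page) {
        synchronized (System.out) {
            System.out.print(page);
            System.out.flush();
        }
    }
}
```

Inside run() this would look like `PageCollector.printWhole(PageCollector.collect(new InputStreamReader(is, "gb2312")))`. Keeping the reader in a local variable also means no other code can close it while the loop is still reading.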
I tried this one, and the result contained more than 600 extra duplicated lines. I don't know why.
Anyway, thanks~