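http://www.worldmetals.com.cn/search/metsearch.jsp?search=(铁矿石)%20and%20docchannel=(36)
This URL returns the expected results in IE, but when I fetch it from Java, whether with java.net or with a raw Socket, I only ever get a page with zero results. The response headers are: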
HTTP/1.1 200 OK
Date: Thu, 24 May 2007 03:42:59 GMT
Server: IBM_HTTP_SERVER/1.3.19.3 Apache/1.3.20 (Win32)
Set-Cookie: JSESSIONID=0000OQ1AS0ENLEHAPX2IXG4VCQY:vdebn6i3;Path=/
Cache-Control: no-cache="set-cookie,set-cookie2"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Transfer-Encoding: chunked
Content-Type: text/html;charset=gb2312
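Content-Language: zh
At first I suspected the chunked transfer encoding, but http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36) can be fetched correctly, so the problem seems to be in how the Chinese keyword is passed. I have tried re-encoding the URL in several ways and still cannot get the expected result set.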
public void getHtml(String url)
{
    try
    {
        String sCurrentLine = "";
        String sTotalString = "";
        java.io.InputStream l_urlStream;
        java.net.URL l_url = new java.net.URL(url);
        java.net.HttpURLConnection l_connection = (java.net.HttpURLConnection) l_url.openConnection();
        l_connection.connect();
        l_urlStream = l_connection.getInputStream();
        java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(l_urlStream));
        while ((sCurrentLine = l_reader.readLine()) != null)
        {
            sTotalString += sCurrentLine + "\n";
        }
        System.out.println(sTotalString);
    }
    catch (Exception ex)
    {
        System.out.println(ex.toString());
    }
}
// Needs java.io.BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter
// and java.net.InetAddress, Socket to be imported.
public static void main(String args[])
{
    try
    {
        System.out.println("begin");
        String strServer = "www.worldmetals.com.cn";
        // ASCII keyword: this URL is fetched correctly
        String strPage = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36)";
        // Chinese keyword: this URL comes back with zero results
        String keyword = "铁矿石";
        strPage = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + keyword + ")%20and%20docchannel=(36)";
        try
        {
            String hostname = strServer;
            int port = 80;
            InetAddress addr = InetAddress.getByName(hostname);
            Socket socket = new Socket(addr, port); // open a Socket
            // send the request
            BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF8"));
            wr.write("GET " + strPage + " HTTP/1.1\r\n");
            wr.write("HOST:" + strServer + "\r\n");
            wr.write("\r\n");
            wr.flush();
            // read the response
            BufferedReader rd = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = rd.readLine()) != null) {
                System.out.println(line);
            }
            wr.close();
            rd.close();
        }
        catch (Exception e)
        {
            System.out.println(e.toString());
        }
        System.out.println("end");
    }
    catch (Exception ex)
    {
        System.out.println(ex.toString());
    }
}
String url = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + condition + ")%20and%20docchannel=(36)";
getHtml(url);
I tried this for now and it seems to work.
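The snippet above does not show how `condition` is produced. One reading that is consistent with the URL captured elsewhere in this thread (`%CC%FA%BF%F3%CA%AF`) is that the keyword is percent-encoded as GB2312 before being spliced into the query string. A minimal sketch under that assumption; the class name `SearchUrlBuilder` and the method `buildSearchUrl` are made up for illustration and do not come from the original post:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {

    // Hypothetical helper (not from the original post): percent-encodes only the
    // keyword in the given charset and leaves the rest of the query as posted.
    static String buildSearchUrl(String keyword, String charsetName) {
        try {
            String condition = URLEncoder.encode(keyword, charsetName);
            return "http://www.worldmetals.com.cn/search/metsearch.jsp?search=("
                    + condition + ")%20and%20docchannel=(36)";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The keyword from the question, written with Unicode escapes so the
        // result does not depend on the encoding of the source file.
        String keyword = "\u94C1\u77FF\u77F3";
        System.out.println(buildSearchUrl(keyword, "GB2312")); // bytes CC FA BF F3 CA AF
        System.out.println(buildSearchUrl(keyword, "UTF-8"));  // a different byte sequence
    }
}
```

With GB2312 the keyword becomes `%CC%FA%BF%F3%CA%AF`, the same bytes as in the captured request. The UTF-8 variant encodes the same characters to different bytes, which would plausibly explain why a server that serves gb2312 pages finds no matches for it.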
http://www.worldmetals.com.cn/search/metsearch.jsp?search=(%CC%FA%BF%F3%CA%AF)%20and%20docchannel=(36)

GET /search/metsearch.jsp?search=(%CC%FA%BF%F3%CA%AF)%20and%20docchannel=(36) HTTP/1.1
Host: www.worldmetals.com.cn
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: zh-cn,zh;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: gb2312,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 200 OK
Date: Wed, 30 Apr 2008 02:09:20 GMT
Server: IBM_HTTP_SERVER/1.3.19.3 Apache/1.3.20 (Win32)
Cache-Control: no-cache="set-cookie,set-cookie2"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Transfer-Encoding: chunked
Content-Type: text/html;charset=gb2312
Content-Language: zh
Set-Cookie: JSESSIONID=00002G2YLZSV4J4U4V4YV3LWEHA:vdebn6i3;Path=/
Connection: Keep-Alive
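
The response above declares `Content-Type: text/html;charset=gb2312`, whereas the `getHtml` method in the question creates its `InputStreamReader` without a charset and therefore decodes the body with the platform default. A small sketch of taking the charset from the header instead; `ResponseCharset` and `charsetFromContentType` are illustrative names, and the header parsing is deliberately simplistic:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseCharset {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the fallback when the parameter is missing or not supported.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String prefix = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" parameter.
            if (p.regionMatches(true, 0, prefix, 0, prefix.length())) {
                String name = p.substring(prefix.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFromContentType("text/html;charset=gb2312",
                StandardCharsets.ISO_8859_1);
        // The six GB2312 bytes of the keyword as they appear in the captured URL.
        byte[] raw = {(byte) 0xCC, (byte) 0xFA, (byte) 0xBF,
                      (byte) 0xF3, (byte) 0xCA, (byte) 0xAF};
        String decoded = new String(raw, cs);
        System.out.println(cs.name() + " decodes to " + decoded.length() + " characters");
    }
}
```

In `getHtml`, the header value could come from `l_connection.getContentType()`, and the resulting charset could be passed as the second argument to `new java.io.InputStreamReader(l_urlStream, charset)`.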