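http://www.worldmetals.com.cn/search/metsearch.jsp?search=(铁矿石)%20and%20docchannel=(36)
This URL returns the expected results in IE, but whether I fetch it with java.net or with a raw Socket, I only ever get a page with an empty result set. The response headers are: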
HTTP/1.1 200 OK
Date: Thu, 24 May 2007 03:42:59 GMT
Server: IBM_HTTP_SERVER/1.3.19.3  Apache/1.3.20 (Win32)
Set-Cookie: JSESSIONID=0000OQ1AS0ENLEHAPX2IXG4VCQY:vdebn6i3;Path=/
Cache-Control: no-cache="set-cookie,set-cookie2"
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Transfer-Encoding: chunked
Content-Type: text/html;charset=gb2312
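Content-Language: zh

At first I suspected the chunked transfer encoding, but http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36) can be fetched normally. So I now suspect the problem lies in how the Chinese characters are transmitted. I have re-encoded the URL in several different ways, but I still cannot get the result set I want.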

Solutions »

  1.   

    The program that fetches the page with the java.net package:
    public void getHtml(String url)
    {
        try
        {
            java.net.URL l_url = new java.net.URL(url);
            java.net.HttpURLConnection l_connection = (java.net.HttpURLConnection) l_url.openConnection();
            l_connection.connect();
            java.io.InputStream l_urlStream = l_connection.getInputStream();
            // The server declares charset=gb2312 in Content-Type, so decode with it explicitly
            java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(l_urlStream, "gb2312"));
            StringBuilder sTotalString = new StringBuilder();
            String sCurrentLine;
            while ((sCurrentLine = l_reader.readLine()) != null)
            {
                sTotalString.append(sCurrentLine).append("\n");
            }
            l_reader.close();
            System.out.println(sTotalString);
        }
        catch (Exception ex)
        {
            System.out.println(ex.toString());
        }
    }
      

  2.   

    The program that fetches the page with a Socket:
    import java.io.*;
    import java.net.*;

    public static void main(String args[])
    {
        System.out.println("begin");
        String strServer = "www.worldmetals.com.cn";
        // ASCII-only keyword: this URL can be fetched normally
        String strPage = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(china)%20and%20docchannel=(36)";
        // Chinese keyword inserted without percent-encoding: this is the request that returns an empty result set
        String keyword = "铁矿石";
        strPage = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + keyword + ")%20and%20docchannel=(36)";

        try
        {
            int port = 80;
            InetAddress addr = InetAddress.getByName(strServer);
            Socket socket = new Socket(addr, port); // open the socket

            // Send the request
            BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF8"));
            wr.write("GET " + strPage + " HTTP/1.1\r\n");
            wr.write("Host: " + strServer + "\r\n");
            wr.write("Connection: close\r\n"); // ask the server to close the connection so the read loop below terminates
            wr.write("\r\n");
            wr.flush();

            // Read the response
            BufferedReader rd = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = rd.readLine()) != null)
            {
                System.out.println(line);
            }

            wr.close();
            rd.close();
            socket.close();
        }
        catch (Exception e)
        {
            System.out.println(e.toString());
        }
        System.out.println("end");
    }
      

  3.   

    Try fetching a URL that consists entirely of English characters with no escaping, and see what comes back first.
      

  4.   

            String condition = java.net.URLEncoder.encode("铁矿石", "UTF-8");
            String url = "http://www.worldmetals.com.cn/search/metsearch.jsp?search=(" + condition + ")%20and%20docchannel=(36)";
            getHtml(url);

    I just gave this a quick try, and it seems to work.
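
To see what the encoding step actually changes, here is a minimal sketch that percent-encodes the keyword under two charsets and prints the results side by side. The class name `KeywordEncoding` and the helpers `encode` and `searchUrl` are hypothetical names introduced only for this illustration; the sketch also assumes the JDK in use ships the GB2312 charset. The browser capture further down this thread shows the keyword sent as GB2312 bytes, so it may be worth comparing both forms against what the server accepts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class KeywordEncoding {
    // Percent-encode a keyword using the named charset.
    static String encode(String keyword, String charset) {
        try {
            return URLEncoder.encode(keyword, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    // Wrap an already-encoded keyword in the search URL used in this thread.
    static String searchUrl(String encodedKeyword) {
        return "http://www.worldmetals.com.cn/search/metsearch.jsp?search=("
                + encodedKeyword + ")%20and%20docchannel=(36)";
    }

    public static void main(String[] args) {
        // Unicode escapes for the keyword, so the source file encoding does not matter.
        String keyword = "\u94c1\u77ff\u77f3";
        System.out.println("UTF-8 : " + encode(keyword, "UTF-8"));
        System.out.println("GB2312: " + encode(keyword, "GB2312"));
        System.out.println(searchUrl(encode(keyword, "GB2312")));
    }
}
```

Note that `URLEncoder` targets form encoding, so it would turn a space into `+`; that is why only the keyword is passed through it here and the rest of the URL is left as written.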
      

  5.   

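    Here is the packet capture I took. When you send the GET line over a socket, you will probably need to convert the encoding, and it would be best to also mimic the other headers that browsers normally send.
    http://www.worldmetals.com.cn/search/metsearch.jsp?search=(%CC%FA%BF%F3%CA%AF)%20and%20docchannel=(36)

    GET /search/metsearch.jsp?search=(%CC%FA%BF%F3%CA%AF)%20and%20docchannel=(36) HTTP/1.1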
    Host: www.worldmetals.com.cn
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: zh-cn,zh;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: gb2312,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

    HTTP/1.x 200 OK
    Date: Wed, 30 Apr 2008 02:09:20 GMT
    Server: IBM_HTTP_SERVER/1.3.19.3  Apache/1.3.20 (Win32)
    Cache-Control: no-cache="set-cookie,set-cookie2"
    Expires: Thu, 01 Dec 1994 16:00:00 GMT
    Transfer-Encoding: chunked
    Content-Type: text/html;charset=gb2312
    Content-Language: zh
    Set-Cookie: JSESSIONID=00002G2YLZSV4J4U4V4YV3LWEHA:vdebn6i3;Path=/
    Connection: Keep-Alive
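
Following the capture above, a socket-based request could be sketched roughly as below. This is an illustration under assumptions, not a verified fix: the class `RawSearchRequest` and its methods `buildRequest` and `fetch` are hypothetical names, the header set is a trimmed-down version of what the browser sent, and `Accept-Encoding: gzip` is deliberately left out so the body arrives uncompressed. `Connection: close` replaces `keep-alive` so that the read loop ends when the server closes the connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.Socket;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class RawSearchRequest {
    // Build the request text with the keyword percent-encoded as GB2312,
    // matching the bytes seen in the browser capture.
    static String buildRequest(String host, String keyword) {
        String encoded;
        try {
            encoded = URLEncoder.encode(keyword, "GB2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("GB2312 is not available in this JDK", e);
        }
        return "GET /search/metsearch.jsp?search=(" + encoded + ")%20and%20docchannel=(36) HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "User-Agent: Mozilla/5.0\r\n"
                + "Accept: text/html\r\n"
                + "Accept-Language: zh-cn,zh;q=0.5\r\n"
                + "Accept-Charset: gb2312,utf-8;q=0.7,*;q=0.7\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Send the request and print the raw response (status line, headers, body).
    static void fetch(String host, String keyword) throws IOException {
        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            // After percent-encoding, the whole request is plain ASCII.
            out.write(buildRequest(host, keyword).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // The server labels its pages charset=gb2312, so decode accordingly.
            BufferedReader rd = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "GB2312"));
            String line;
            while ((line = rd.readLine()) != null) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        fetch("www.worldmetals.com.cn", "\u94c1\u77ff\u77f3");
    }
}
```

Because the server answers with `Transfer-Encoding: chunked`, the printed body will still contain the chunk-size lines; reading through `java.net.HttpURLConnection` instead would decode those automatically.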