The page URL is http://sc.hiapk.com/apps_0_1_1
However, all the Chinese text in the downloaded page comes out garbled. I searched earlier threads on similar problems and confirmed:
1. It is not a character-set problem: InputStreamReader(input,"utf-8")
2. The page is not compressed
3. IE opens the page normally
Could anyone help me figure out what is going on? Here is the code:
import java.net.*;
import java.awt.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class getDataByURL {
    public static void main(String[] args) {
        (new getDataByURL()).doWrite();
    }

    private void doStore(String file_name, InputStream input) {
        try {
            OutputStreamWriter ow = new OutputStreamWriter(new FileOutputStream(file_name, true));
            BufferedReader io = new BufferedReader(new InputStreamReader(input, "utf-8"));
            String s;
            while ((s = io.readLine()) != null) {
                ow.write(s);
            }
            ow.flush();
            ow.close();
        } catch (IOException e) {
        }
    }

    public void doWrite() {
        String url_str = "http://sc.hiapk.com/apps_0_1_1";
        try {
            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(url_str);
            httpget.addHeader("Accept-Language", "en-us");
            httpget.addHeader("Accept-Encoding", "gzip,deflate");
            HttpResponse response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            doStore("1.html", entity.getContent());
        } catch (MalformedURLException e1) {
            System.out.println("exception");
        } catch (IOException e2) {
            System.out.println("exception");
        }
    }
}
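One detail worth noting in doStore above: the OutputStreamWriter is constructed without a charset, so it uses the platform default encoding rather than UTF-8. Below is a sketch of a doStore variant that pins the output charset explicitly (class name and the round-trip in main are illustrative only; it uses Java 7 try-with-resources):

```java
import java.io.*;

public class StoreUtf8 {
    // Like the original doStore, but the writer's charset is set to UTF-8
    // explicitly instead of falling back to the platform default.
    static void doStore(String fileName, InputStream input) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(input, "utf-8"));
             Writer out = new OutputStreamWriter(new FileOutputStream(fileName, true), "utf-8")) {
            String s;
            while ((s = in.readLine()) != null) {
                out.write(s);
                out.write('\n'); // readLine() strips the line terminator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip check with known UTF-8 input instead of a live HTTP fetch
        byte[] utf8 = "浏览器".getBytes("utf-8");
        doStore("out.html", new ByteArrayInputStream(utf8));
    }
}
```

With this change the saved file should contain the same UTF-8 bytes that arrived on the wire. Note also that readLine() drops newlines, so the original code writes the whole page as one long line; the write('\n') above restores line breaks.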
The page contains the character "浏".
Its UTF-8 encoding is E6 B5 8F.
The bytes on the wire are also E6 B5 8F.
But what ends up in the saved file is E6 B5 3F, which can no longer be decoded correctly. The next step is to track down exactly which stage introduces this corruption. I read up on character-set issues and this appears to be a common phenomenon; below are some write-ups on character sets worth reading:
http://tech.163.com/06/0518/09/2HD6OPIV0009159T.html
http://www.tot.name/show/3/7/20051201213200.htm
http://technic.txwm.com/webpage/v35508.html
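The trailing 3F byte is itself a strong clue: 0x3F is ASCII '?', which is exactly the replacement byte a Java charset encoder substitutes for a character it cannot represent. A small demonstration (charset names here are just the standard Java aliases):

```java
import java.io.UnsupportedEncodingException;

public class ReplacementDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "浏"; // U+6D4F

        // UTF-8 can represent the character: prints E6 B5 8F
        for (byte b : s.getBytes("UTF-8")) System.out.printf("%02X ", b);
        System.out.println();

        // windows-1252 (Cp1252) cannot encode U+6D4F, so the encoder
        // substitutes '?' (0x3F) -- the same byte seen in the saved file
        for (byte b : s.getBytes("windows-1252")) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```

So wherever a 3F appears in place of a valid UTF-8 byte, some step in the pipeline is encoding with a charset that cannot represent the character.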
[Reply] Some characters just wouldn't convert for me either and all came out as ??? — even common characters. I didn't dig into it at the time; rewriting in C++ against the Win32 API made the problem go away. I suggest you try saving the file as UTF-8, UNICODE, GB2312, and so on, and compare the results. It may be an issue with Java's character-set support.
1. My machine runs an English-language system, so the default encoding for source files is detected as Cp1252.
2. When the page data is downloaded, declared as UTF-8, and read into the BufferedReader, everything is still intact.
3. When writing to the file, the system treats the contents of the BufferedReader as Cp1252, and that is where the corruption happens.
Step 3 is still only a guess; I plan to write a small test program to verify it.
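The step-3 guess can be checked with a short probe like the one below (file names are illustrative). It writes the same character through an OutputStreamWriter constructed without a charset and one constructed with "utf-8", then dumps the resulting bytes alongside the JVM's default encoding:

```java
import java.io.*;
import java.nio.file.*;

public class EncodingProbe {
    public static void main(String[] args) throws IOException {
        String s = "浏"; // UTF-8 bytes: E6 B5 8F

        // No charset given: uses the platform default (file.encoding),
        // i.e. Cp1252 on an English Windows system
        Writer dflt = new OutputStreamWriter(new FileOutputStream("default.bin"));
        dflt.write(s);
        dflt.close();

        // Charset pinned to UTF-8: should preserve E6 B5 8F
        Writer utf8 = new OutputStreamWriter(new FileOutputStream("utf8.bin"), "utf-8");
        utf8.write(s);
        utf8.close();

        System.out.println("default encoding: " + System.getProperty("file.encoding"));
        dump("default.bin");
        dump("utf8.bin");
    }

    static void dump(String name) throws IOException {
        StringBuilder sb = new StringBuilder(name + ":");
        for (byte b : Files.readAllBytes(Paths.get(name))) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}
```

On a Cp1252 system, default.bin should contain the single replacement byte 3F while utf8.bin contains E6 B5 8F, which would confirm that the unparameterized OutputStreamWriter in doStore is the culprit.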