Unicode, when stored as fixed two-byte code units (as in Java's char type), is a relatively inefficient encoding when most of your text consists of ASCII characters. Every character requires two bytes, even though some characters are used much more frequently than others. A more efficient encoding would use fewer bits for the more common characters. This is what UTF-8 does. In UTF-8 the ASCII characters are encoded using a single byte, just as in ASCII. The next 1,920 characters (U+0080 through U+07FF) are encoded in two bytes. The remaining characters of the Basic Multilingual Plane are encoded in three bytes, and characters beyond it take four bytes in standard UTF-8. However, since these three- and four-byte characters are relatively uncommon,[1] especially in English text, the savings achieved by encoding ASCII in a single byte more than make up for it. Java's .class files use a modified form of UTF-8 internally to store string literals. Data input streams and data output streams also read and write strings in this modified UTF-8, through readUTF() and writeUTF(). However, this is all hidden from direct view of the programmer, unless perhaps you're trying to write a Java compiler or parse the output of a data stream without using the DataInputStream class.
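The byte widths described above can be checked directly. The following is a minimal sketch, not taken from the book: the class name Utf8Widths and its helper methods are made up for illustration, and it assumes only the standard String.getBytes(Charset) and DataOutputStream.writeUTF(String) calls. It also shows where the modified UTF-8 written by writeUTF differs from standard UTF-8: a two-byte length prefix, a two-byte encoding of U+0000, and six bytes (two encoded surrogates) for a character outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Widths {

    // Number of bytes the string occupies in standard UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes produced by DataOutputStream.writeUTF:
    // a 2-byte length prefix followed by the modified UTF-8 payload.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("U+0041 (ASCII letter A): " + utf8Length("A"));
        System.out.println("U+00E9 (e with acute):   " + utf8Length("\u00e9"));
        System.out.println("U+4E2D (CJK ideograph):  " + utf8Length("\u4e2d"));
        System.out.println("U+1F600 (outside BMP):   " + utf8Length("\uD83D\uDE00"));

        // writeUTF output includes the 2-byte length prefix.
        System.out.println("writeUTF(\"A\"):       " + writeUtf("A").length);
        System.out.println("writeUTF(U+0000):    " + writeUtf("\u0000").length);
        System.out.println("writeUTF(U+1F600):   " + writeUtf("\uD83D\uDE00").length);
    }
}
```

Unicode escapes are used in the string literals so that the result does not depend on the encoding of the source file.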
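This design favors English and similar languages. For one thing, Java was created by English speakers, who naturally considered their own needs first and others' second. For another, most documents on the web are written in English, so this choice saves space.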
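For Chinese, Japanese, and Korean text, however, it wastes space: a CJK character stored in UTF-8 takes three bytes, on average 1.5 times the two bytes it takes in UTF-16.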
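The trade-off can be seen by encoding an English string and a Chinese string both ways. This is only a rough sketch: the class name EncodingSizeComparison and the two sample strings are chosen here for illustration. UTF_16BE is used instead of UTF_16 so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSizeComparison {

    // Size of the string in standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of the string in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String english = "hello";
        // Four CJK ideographs, written as escapes so the source file
        // encoding does not matter.
        String chinese = "\u4e2d\u6587\u5b57\u7b26";

        System.out.printf("English: UTF-8=%d bytes, UTF-16=%d bytes%n",
                utf8Bytes(english), utf16Bytes(english));
        System.out.printf("Chinese: UTF-8=%d bytes, UTF-16=%d bytes%n",
                utf8Bytes(chinese), utf16Bytes(chinese));
    }
}
```

For the ASCII string UTF-8 needs half the bytes of UTF-16, while for the four ideographs it needs 12 bytes against 8, which is the 1.5 ratio mentioned above.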
You can call java.nio.charset.Charset.defaultCharset() to see which encoding is used by default. To get a string's bytes in another encoding, call the String class's byte[] getBytes(String charsetName) method.
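A short sketch of both calls follows. The class name CharsetDemo and the encode helper are hypothetical, and the charset names are just examples. Note that getBytes(String charsetName) throws the checked UnsupportedEncodingException for an unknown name, which the helper wraps here. When the charset is one of the standard ones, the getBytes(Charset) overload with a StandardCharsets constant avoids that exception entirely.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CharsetDemo {

    // Encodes text with the named charset, turning the checked
    // exception for an unknown name into an unchecked one.
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The default depends on the JVM version and platform settings.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String text = "\u4e2d"; // one CJK ideograph
        System.out.println("UTF-8 bytes:    " + encode(text, "UTF-8").length);
        System.out.println("UTF-16BE bytes: " + encode(text, "UTF-16BE").length);

        // Same result as the by-name lookup, without the checked exception.
        System.out.println("UTF-8 via StandardCharsets: "
                + text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```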