While checking something today I ran into behavior I don't quite understand:

    String test = "a";
    byte[] tests = test.getBytes();
    for (int i = 0; i < tests.length; i++) {
        System.out.println(tests[i]);
    }

getBytes() with no argument uses the platform default charset (GBK on my machine), and the output is 97.
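Since the no-argument getBytes() depends on the platform, it helps to print what your JVM is actually defaulting to. A minimal sketch (the printed charset will differ by machine: GBK on a Chinese-locale Windows system, UTF-8 on most modern Linux/macOS setups and on JDK 18+):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // getBytes() with no argument encodes with this charset
        System.out.println(Charset.defaultCharset());
        // "a" is one byte in any ASCII-compatible default (GBK, UTF-8, ...)
        System.out.println("a".getBytes().length);
    }
}
```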
Now look at this code:

    String test = "a";
    byte[] tests = test.getBytes("unicode");
    for (int i = 0; i < tests.length; i++) {
        System.out.println(tests[i]);
    }

With "unicode" the output is -1 -2 97 0. And with a two-character string:

    String test = "ab";
    byte[] tests = test.getBytes("unicode");
    for (int i = 0; i < tests.length; i++) {
        System.out.println(tests[i]);
    }

the output is -1 -2 97 0 98 0.

Surely the Unicode encoding of a one-character string should be only two bytes! What are the leading -1 -2?
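One way to see where the extra bytes come from is to request the three UTF-16 schemes explicitly. A sketch (in Java, "unicode" is an alias for UTF-16; which byte order the plain UTF-16 encoder picks for its marker has varied across JDK versions and platforms, so the third line may print -2 -1 first where the output above shows -1 -2):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Demo {
    public static void main(String[] args) {
        String s = "ab";
        // UTF-16BE: most significant byte first, no marker: [0, 97, 0, 98]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));
        // UTF-16LE: least significant byte first, no marker: [97, 0, 98, 0]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));
        // UTF-16 (alias "unicode"): the encoder prepends a byte order mark
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The marker-free variants make it clear that the payload for "ab" is always four bytes; only the plain UTF-16 scheme adds the two extra leading bytes.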
Those leading bytes are a byte order mark (BOM). In hex, -1 -2 is FF FE, and FFFE is a noncharacter that must never appear in encoded text, which is exactly what makes it usable as a marker. Because the two bytes of each UTF-16 code unit can be transmitted in either order, the order has to be signalled: receiving FE FF means Big-Endian, receiving FF FE means Little-Endian. Big-Endian means the high-order byte of each code unit comes first and the low-order byte second; Little-Endian is the reverse, low-order byte first. So when you receive FF FE, it is the FEFF marker with its two bytes swapped. According to what I've read online, PCs generally transmit in Little-Endian order, while Macs use Big-Endian.
So on a Mac the output would presumably be -2 -1 0 97 0 98 (FE FF followed by big-endian code units), right?
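The flip side is decoding: the UTF-16 decoder reads the BOM, uses it to pick the byte order, and drops it from the resulting string, so both byte streams above round-trip to the same text. A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    public static void main(String[] args) {
        // FF FE BOM, then "ab" as little-endian code units
        byte[] le = { (byte) 0xFF, (byte) 0xFE, 97, 0, 98, 0 };
        // FE FF BOM, then "ab" as big-endian code units
        byte[] be = { (byte) 0xFE, (byte) 0xFF, 0, 97, 0, 98 };
        // The UTF-16 decoder consumes the BOM in both cases
        System.out.println(new String(le, StandardCharsets.UTF_16)); // ab
        System.out.println(new String(be, StandardCharsets.UTF_16)); // ab
    }
}
```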
This is explained on page 18 of http://www.unicode.org/notes/tn23/Muller-Slides+Narr.pdf:

Character encoding schemes
* mapping of code units to bytes
* UTF-8: obvious
* UTF-16LE
  little endian
  initial FF FE (if present) is a character
* UTF-16BE
  big endian
  initial FE FF (if present) is a character
* UTF-16
  either endianness
  may have a BOM: FF FE or FE FF, not part of text
  if no BOM, then must be BE
* UTF-32: similarly, UTF-32LE, UTF-32BE and UTF-32

The final layer of the character model deals with the serialization in bytes of the code units. For UTF-8, where the code units are already bytes, this step is trivial, and there is a single encoding scheme. For UTF-16, there are three encoding schemes: in UTF-16LE, the least significant byte of each code unit comes first; if the string starts with the bytes FF FE, those two bytes should be interpreted as the FEFF code unit, i.e. as the character U+FEFF ZERO WIDTH NO-BREAK SPACE. In UTF-16BE, the most significant byte of each code unit comes first; if the string starts with the bytes FE FF, those two bytes should be interpreted as the FEFF code unit, i.e. as the character U+FEFF ZERO WIDTH NO-BREAK SPACE. In UTF-16, either endianness is possible. The endianness may be indicated by starting the byte stream with FF FE (little endian) or FE FF (big endian), and those bytes are not part of the string. If no endianness is specified, then the byte order must be big endian. UTF-32 also has three encoding schemes, defined in a similar way.
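Java's UTF-16 decoder follows the "if no BOM, then must be BE" rule from the slide: a BOM-less byte stream is assumed to be big-endian. A quick check:

```java
import java.nio.charset.StandardCharsets;

public class NoBomDemo {
    public static void main(String[] args) {
        // "ab" with no BOM: the UTF-16 decoder must assume big endian
        byte[] noBom = { 0, 97, 0, 98 };
        System.out.println(new String(noBom, StandardCharsets.UTF_16)); // ab
    }
}
```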