Question about code points, code units, and supplementary characters

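"A Java string is made up of a sequence of char values." Question: does this statement hold?
Commonly used Unicode characters can be represented with a single code unit, while supplementary characters need a pair of code units.
What is a code unit? What is a code point?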
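public static void main(String[] args) {
    String str = "中文aA";
    // Walk the string one code point at a time, not one char at a time.
    for (int i = 0; i < str.length(); ) {
        int cp = str.codePointAt(i);
        // Character.toChars also handles supplementary code points, unlike a (char) cast.
        System.out.println(new String(Character.toChars(cp)) + "--" + cp);
        // A supplementary code point occupies two chars (a surrogate pair).
        i += Character.isSupplementaryCodePoint(cp) ? 2 : 1;
    }
}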
Output:
中--20013
文--25991
a--97
A--65

In what circumstances does 41 represent A?

Answers

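"In what circumstances does 41 represent A?" 0x41 -> 65 -> A: 41 is simply the hexadecimal form of 65, the code point of A.
0x41 does not stand for A in every situation, though. It depends on the character set and on the preceding byte: 0x41 may be just the trailing part of some two-byte or three-byte character (this can happen in GBK, for example, but not in UTF-8). As for "a Java string is made up of a sequence of char values", I think that is basically correct.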
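To make the point about context concrete, here is a small sketch (the class name ByteContextDemo and its helper are made up for illustration). It decodes the same byte 0x41 twice: alone as US-ASCII, and after 0x4E as UTF-16BE, where it is only the low half of one 16-bit code unit.

```java
import java.nio.charset.StandardCharsets;

class ByteContextDemo {
    // Hypothetical helper: decodes raw bytes as UTF-16BE (big-endian, no BOM).
    static String decodeUtf16Be(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // On its own, in US-ASCII, the byte 0x41 is the letter A.
        String ascii = new String(new byte[] {0x41}, StandardCharsets.US_ASCII);
        System.out.println(ascii);

        // After 0x4E in UTF-16BE, the same byte is part of the code unit 0x4E41,
        // which is a single CJK character and has nothing to do with A.
        String utf16 = decodeUtf16Be(new byte[] {0x4E, 0x41});
        System.out.println(Integer.toHexString(utf16.codePointAt(0)));
    }
}
```

So whether 0x41 "is" A can only be answered once the encoding and the byte's position are known.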
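A code point and a code unit are both, first of all, codes, so neither of them is a character. A code point, also called a code position, is as the name suggests one of the numeric values that make up a code space. ASCII, for example, consists of 128 code points (0 to 7F).
Code unit: a specific bit sequence. For details, see: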
http://en.wikipedia.org/wiki/Code_point
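Unicode defines code points from 0 to 0x10FFFF, which is 1,114,112 code points in total. If every code point were assigned to one character, there would be room for over a million characters. So far only part of them are in use, that is, only some of the positions (code points) are occupied.
The numbers you saw:
中--20013
文--25991
a--97
A--65
are exactly these occupied positions (their numbers).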
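A quick way to check these numbers from Java, as a sketch (CodePointRangeDemo and codespaceSize are names I made up; the Character constants and methods are standard):

```java
class CodePointRangeDemo {
    // Size of the Unicode codespace: every integer from 0 through 0x10FFFF.
    static int codespaceSize() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
        System.out.println(codespaceSize());                               // 1114112

        // The number printed next to each character in the question is its code point.
        System.out.println("中".codePointAt(0));                            // 20013
        // isDefined reports whether a code point is assigned in the JDK's Unicode version.
        System.out.println(Character.isDefined(0x4E2D));                   // true
    }
}
```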
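A code unit is a concept tied to a specific encoding. UTF-32, for example, uses a fixed 32 bits (3 bytes) as one code unit, while UTF-8 uses 1 to 6 bytes as one code unit. Because the length varies, there have to be some conventions: a 3-byte sequence is always 1110 XXXX  10XX XXXX  10XX XXXX, which leaves 4+6+6=16 usable bits, equivalent to 2 bytes.
A Chinese character takes 3 bytes in UTF-8. Take "中": its code point is 20013, which is 0x4E2D in hexadecimal and 100111000101101 in binary (15 bits, or 16 with a leading 0).
Put these 16 bits, in order, into the X positions above and you have the UTF-8 encoding of "中".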
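The bit-filling step can be written out by hand. This is only a sketch for the 3-byte case (the class and method names are invented for illustration), compared against what the JDK itself produces:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ThreeByteDemo {
    // Fills the 1110xxxx 10xxxxxx 10xxxxxx pattern for a code point in
    // 0x0800..0xFFFF. Surrogates (0xD800..0xDFFF) are not valid input;
    // this sketch does not check for them.
    static byte[] encodeThreeBytes(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not a 3-byte code point: " + cp);
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x4E2D);
        byte[] library = "中".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.printf("%02x ", b & 0xff); // e4 b8 ad
        }
        System.out.println();
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```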
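Thanks for the pointers, but:
"UTF-32, for example, uses a fixed 32 bits (3 bytes) as one code unit." Isn't there something wrong with this sentence?
Finally, thanks again.
You are right about the code unit. The definition I gave earlier was incomplete: it did not say how the bit sequence of a code unit is delimited.
As for code points, a code point is not necessarily hexadecimal; it can merely be written in hexadecimal.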
Code Point: Any value in the Unicode codespace; that is, the range of integers from 0x0 to 0x10FFFF.
Code Unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D28a in Section 3.9, Unicode Encoding Forms.)
http://blogs.msdn.com/b/michkap/archive/2005/08/12/451043.aspx
With US-ASCII, code unit is 7 bits.
With UTF-8, code unit is 8 bits.
With EBCDIC, code unit is 8 bits.
With UTF-16, code unit is 16 bits.
With UTF-32, code unit is 32 bits.
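"With UTF-8, code unit is 8 bits."
Take UTF-8 as an example: the code unit is 8 bits, i.e. one byte, but the binary form of "中" is 100111000101101 (15 bits; pad a 0 in front and it is 16 bits, two bytes). Does that mean "中" is made up of two code units? Is that the idea?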
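That should be the idea, since it is the "minimal bit combination": the smallest unit in UTF-8 is 1 byte, and a character is made up of one or more of them. (Strictly speaking, "中" takes three code units in UTF-8, E4 B8 AD, because the 16 payload bits are spread over the 1110xxxx 10xxxxxx 10xxxxxx pattern.)
And "With US-ASCII, code unit is 7 bits" is correct too: it really is 7 bits.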
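If it helps, the number of code units can be counted directly. A rough sketch, assuming that the encoded byte length divided by the code unit size gives the count (the class and helper names are mine, and the BE variants are used so that no byte order mark is added):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodeUnitCountDemo {
    // Code units needed for s: encoded byte length / code unit size in bytes.
    static int codeUnits(String s, Charset charset, int bytesPerUnit) {
        return s.getBytes(charset).length / bytesPerUnit;
    }

    public static void main(String[] args) {
        String zhong = "中";
        // UTF-8: 8-bit code units
        System.out.println(codeUnits(zhong, StandardCharsets.UTF_8, 1));      // 3
        // UTF-16: 16-bit code units
        System.out.println(codeUnits(zhong, StandardCharsets.UTF_16BE, 2));   // 1
        // UTF-32: 32-bit code units
        System.out.println(codeUnits(zhong, Charset.forName("UTF-32BE"), 4)); // 1
    }
}
```

So the same code point 0x4E2D is three code units in UTF-8 but only one in UTF-16 and in UTF-32.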
I wrote an example for you that should make character sets and encodings clear:
[Java code]
package com.catmiw.csdn;

import java.io.UnsupportedEncodingException;

public class CharacterSetTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "中";
        int codepoint = str.codePointAt(0);

        System.out.println("Unicode code point of '" + str + "' = " + codepoint + " [0x" + Integer.toHexString(codepoint) + "]");
        System.out.println();

        // Note: bytes.length counts bytes, not code units.
        byte[] bytes = str.getBytes("UTF-32");
        System.out.println("'中' in UTF-32: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16"); // plain UTF-16 output starts with a byte order mark (0xFEFF)
        System.out.println("'中' in UTF-16: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16BE");
        System.out.println("'中' in UTF-16BE: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16LE");
        System.out.println("'中' in UTF-16LE: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-8");
        System.out.println("'中' in UTF-8: " + bytes.length + " bytes, content: " + toHexString(bytes));
        byte b = bytes[0];
        System.out.println("Byte 1 of the 3-byte UTF-8 sequence: " + toBinaryString(b) + " [1110xxxx]");
        b = bytes[1];
        System.out.println("Byte 2 of the 3-byte UTF-8 sequence: " + toBinaryString(b) + " [10xxxxxx]");
        b = bytes[2];
        System.out.println("Byte 3 of the 3-byte UTF-8 sequence: " + toBinaryString(b) + " [10xxxxxx]");
        bytes = str.getBytes("GBK");
        System.out.println("'中' in GBK: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("GB2312");
        System.out.println("'中' in GB2312: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("US-ASCII");
        System.out.println("'中' in US-ASCII: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("ISO-8859-1");
        // '中' cannot be represented in ISO-8859-1, so the encoder substitutes '?' (0x3f).
        System.out.println("'中' in ISO-8859-1: " + bytes.length + " bytes, content: " + toHexString(bytes) + " (unmappable, replaced by '?')");
        System.out.println();

        str = "A";
        bytes = str.getBytes("UTF-32");
        System.out.println("'A' in UTF-32: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16");
        System.out.println("'A' in UTF-16: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16BE");
        System.out.println("'A' in UTF-16BE: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-16LE");
        System.out.println("'A' in UTF-16LE: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("UTF-8");
        System.out.println("'A' in UTF-8: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("GBK");
        System.out.println("'A' in GBK: " + bytes.length + " bytes, content: " + toHexString(bytes));
        bytes = str.getBytes("GB2312");
        System.out.println("'A' in GB2312: " + bytes.length + " bytes, content: " + toHexString(bytes));

System.out.println();

        // Decode raw bytes back into strings.
        byte[] bytes2 = {0x4E, 0x2D};
        String str2 = new String(bytes2, "UTF-16"); // no BOM present: Java assumes big-endian
        System.out.println("Decoded from 0x4E2D as UTF-16: " + str2);

        byte[] bytes4 = {(byte) 0xfe, (byte) 0xff, 0x4E, 0x2D};
        String str4 = new String(bytes4, "UTF-16");
        System.out.println("Decoded from 0xFEFF4E2D as UTF-16: " + str4);

        byte[] bytes3 = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        String str3 = new String(bytes3, "UTF-8");
        System.out.println("Decoded from 0xE4B8AD as UTF-8: " + str3);
    }

    // Formats one byte as an 8-digit binary string, left-padded with zeros.
    public static String toBinaryString(byte b) {
        StringBuilder sb = new StringBuilder("");
        String temp = Integer.toBinaryString(b & 0xff);
        sb.append("00000000".substring(temp.length())).append(temp);
        return sb.toString();
    }

    // Formats a byte array as one hexadecimal string with a "0x" prefix.
    public static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("0x");
        for (int i = 0; i < bytes.length; i++) {
            String temp = Integer.toHexString(bytes[i] & 0xff);
            sb.append((temp.length() == 1) ? "0" + temp : temp);
        }
        return sb.toString();
    }
}
[/Java code]
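Corrections
1. "UTF-32 uses a fixed 32 bits (3 bytes) as one code unit": already pointed out above (32 bits is 4 bytes, not 3).
2. "UTF-8 uses 1 to 6 bytes as one code unit":
In UTF-8 a character takes 1 to 4 code units, depending on the value of its code point, and each code unit is 8 bits.
In UTF-32 every character, whatever its code point value, takes exactly one code unit, and each code unit is 32 bits.
The same reasoning applies to the other encodings.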
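To illustrate correction 2, here is a minimal sketch of how the UTF-8 length follows from the code point value (Utf8LengthDemo and utf8Length are illustrative names of mine; the range boundaries are the standard UTF-8 ones), checked against the JDK encoder:

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of 8-bit code units UTF-8 uses for one code point.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;     // 0xxxxxxx
        if (cp < 0x800) return 2;    // 110xxxxx 10xxxxxx
        if (cp < 0x10000) return 3;  // 1110xxxx 10xxxxxx 10xxxxxx
        return 4;                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x20000};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + ": predicted " + utf8Length(cp) + ", actual " + actual);
        }
    }
}
```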
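Here is a test example:
public class Test
{
    public static void main(String[] args)
    {
        // "hello" followed by one supplementary character (a surrogate pair)
        String s = "hello\uD840\uDC00";
        int b = s.length();                      // 7 code units (chars)
        int n = s.codePointCount(0, s.length()); // 6 code points
        char f = s.charAt(4);                    // 'o'
        char l = s.charAt(6);                    // the low surrogate, not a full character
        int index = s.offsetByCodePoints(0, 4);  // char index of the 5th code point: 4
        int cp = s.codePointAt(index);           // 111, the code point of 'o'
        System.out.println(cp);
        System.out.println(f);
        System.out.println(l);
        System.out.println(b);
        System.out.println(n);
    }
}
For reference, see Core Java, Volume I, 3.35.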
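In case it is unclear where \uD840\uDC00 in the test string comes from, here is a sketch of the UTF-16 surrogate pair arithmetic for the code point 0x20000 (the class and method names are mine; Character.toChars is the standard library equivalent):

```java
class SurrogatePairDemo {
    // High (leading) surrogate for a supplementary code point (0x10000..0x10FFFF).
    static char highSurrogate(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Low (trailing) surrogate for a supplementary code point.
    static char lowSurrogate(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x20000;
        // Prints the two 16-bit code units of the pair.
        System.out.printf("%04X %04X%n", (int) highSurrogate(cp), (int) lowSurrogate(cp)); // D840 DC00

        // The library produces the same pair.
        char[] pair = Character.toChars(cp);
        System.out.println(pair[0] == highSurrogate(cp) && pair[1] == lowSurrogate(cp)); // true

        // Reading the pair back at char index 5 yields the original code point.
        System.out.println("hello\uD840\uDC00".codePointAt(5) == cp); // true
    }
}
```

One code point, two code units: that is why length() is 7 while codePointCount is 6 in the test above.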