A commonly used algorithm for detecting UTF-8 encoding looks like this:
public static boolean isUTF8(byte[] data) {
    int count_good_utf = 0;
    int count_bad_utf = 0;
    byte current_byte = 0x00;
    byte previous_byte = 0x00;
    // Examine every adjacent pair of bytes (previous, current).
    for (int i = 1; i < data.length; i++) {
        current_byte = data[i];
        previous_byte = data[i - 1];
        if ((current_byte & 0xC0) == 0x80) { // current is 10xxxxxx
            if ((previous_byte & 0xC0) == 0xC0) { // previous is 11xxxxxx
                count_good_utf++;
            } else if ((previous_byte & 0x80) == 0x00) { // previous is 0xxxxxxx
                count_bad_utf++;
            }
        } else if ((previous_byte & 0xC0) == 0xC0) { // previous is 11xxxxxx
            count_bad_utf++;
        }
    }
    // System.out.println(count_good_utf);
    // System.out.println(count_bad_utf);
    if (count_good_utf > count_bad_utf) {
        return true;
    } else {
        return false;
    }
}

A quick analysis shows that the algorithm roughly does the following (previous byte, then current byte):
11xxxxxx 10xxxxxx  good++;
0xxxxxxx 10xxxxxx  bad++;
11xxxxxx 0xxxxxxx  bad++;
11xxxxxx 11xxxxxx  bad++;

There are a few things about this algorithm that I don't quite understand:
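1) If a count_bad_utf case occurs, why not return false immediately?
2) Why does it only check the first two bits of each byte?
3) Since the result of this algorithm is not accurate, why use it at all? Is there a better algorithm?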
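For reference, the UTF-8 byte layouts by code point range (hex):
0000 - 007F  0xxxxxxx
0080 - 07FF  110xxxxx 10xxxxxx
0800 - FFFF  1110xxxx 10xxxxxx 10xxxxxx

Could an expert please explain in detail the questions I raised above?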
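
As a possible starting point for question 3, the byte-layout table above can be read as a stricter check: decode each sequence fully and reject the input at the first malformed byte instead of counting good and bad pairs. The sketch below is only an illustration, not the algorithm quoted in this post. The class name StrictUtf8 and the method name isValidUtf8 are made up for this example. It also assumes the standard four-byte form (11110xxx plus three continuation bytes, U+10000 to U+10FFFF), which the table does not list, and it rejects overlong encodings and surrogate code points.

```java
final class StrictUtf8 {

    /** Returns true only if every byte of data belongs to a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int lead = data[i] & 0xFF;            // read the byte as an unsigned value
            int len;                              // total length of this sequence in bytes
            int cp;                               // code point accumulated so far
            if (lead < 0x80) {                    // 0xxxxxxx: single-byte (ASCII)
                i++;
                continue;
            } else if ((lead & 0xE0) == 0xC0) {   // 110xxxxx: two-byte sequence
                len = 2;
                cp = lead & 0x1F;
            } else if ((lead & 0xF0) == 0xE0) {   // 1110xxxx: three-byte sequence
                len = 3;
                cp = lead & 0x0F;
            } else if ((lead & 0xF8) == 0xF0) {   // 11110xxx: four-byte sequence (assumed, not in the table)
                len = 4;
                cp = lead & 0x07;
            } else {
                return false;                     // stray 10xxxxxx byte or invalid lead byte
            }
            if (i + len > data.length) {
                return false;                     // sequence is cut off at the end of the input
            }
            for (int k = 1; k < len; k++) {
                int next = data[i + k] & 0xFF;
                if ((next & 0xC0) != 0x80) {
                    return false;                 // every trailing byte must be 10xxxxxx
                }
                cp = (cp << 6) | (next & 0x3F);
            }
            // Reject overlong forms: the code point must need exactly this many bytes.
            if (len == 2 && cp < 0x80) return false;
            if (len == 3 && cp < 0x800) return false;
            if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
            // Surrogate code points are not valid in UTF-8.
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zhong = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}; // U+4E2D encoded as UTF-8
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0};                // the same character encoded as GBK
        System.out.println(isValidUtf8(zhong)); // prints true
        System.out.println(isValidUtf8(gbk));   // prints false
    }
}
```

Note that a strict check like this accepts pure ASCII input, whereas the counting version returns false for it because both counters stay at 0. The trade-off is that a single damaged byte makes the whole input fail, which may be why the counting version tolerates some bad pairs. The JDK's CharsetDecoder with CodingErrorAction.REPORT is another way to get a strict check without hand-written bit masks.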