A commonly used UTF-8 encoding detection algorithm at present looks like this:
    public static boolean isUTF8(byte[] data) {
        int count_good_utf = 0;
        int count_bad_utf = 0;
        byte current_byte = 0x00;
        byte previous_byte = 0x00;
        for (int i = 1; i < data.length; i++) {
            current_byte = data[i];
            previous_byte = data[i - 1];
            if ((current_byte & 0xC0) == 0x80) { // 10xxxxxx
                if ((previous_byte & 0xC0) == 0xC0) { // 11xxxxxx
                    count_good_utf++;
                } else if ((previous_byte & 0x80) == 0x00) { // 0xxxxxxx
                    count_bad_utf++;
                }
            } else if ((previous_byte & 0xC0) == 0xC0) { // 11xxxxxx
                count_bad_utf++;
            }
        }
        //    System.out.println(count_good_utf);
        //    System.out.println(count_bad_utf);
        if (count_good_utf > count_bad_utf) {
            return true;
        } else {
            return false;
        }
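    }

A quick analysis shows that the algorithm classifies each pair of adjacent bytes (previous, current) roughly as follows: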
11xxxxxx 10xxxxxx good++;
0xxxxxxx 10xxxxxx bad++;
11xxxxxx 0xxxxxxx bad++;
11xxxxxx 11xxxxxx bad++;

A few things about this algorithm are unclear to me:
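1) If count_bad_utf is ever incremented, why not return false right away?
2) Why are only the top two bits of each byte examined?
3) Since the result of this algorithm is not accurate, why is it still used? Is there a better algorithm?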
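
Regarding the last question, one stricter alternative is to let the JDK's own UTF-8 decoder validate the whole array and reject it on the first malformed sequence. The following is only a minimal sketch under that assumption: the class name StrictUtf8Check and the method name isStrictUTF8 are made up for illustration, and it answers "is this well-formed UTF-8?" rather than guessing the most likely encoding the way the counting heuristic does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    // Hypothetical helper: returns true only if the entire array decodes as UTF-8.
    static boolean isStrictUTF8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE4 0xB8 0xAD is a complete three-byte sequence (U+4E2D).
        byte[] valid = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        // 0xC3 is a lead byte, but 0x41 is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, 0x41};
        System.out.println(isStrictUTF8(valid));   // true
        System.out.println(isStrictUTF8(invalid)); // false
    }
}
```

Note that a strict check like this also rejects input that is truncated in the middle of a multi-byte sequence, which may be one reason a tolerant good/bad count is preferred when only a prefix of a file is sampled.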