http://www.wu-ftpd.org/rfc/rfc2640.html
The link above (RFC 2640) covers the UTF-8 rules; see section B.1, "Valid UTF-8 check". It shows how to tell whether a string is valid UTF-8 :)

The following routine checks if a byte sequence is valid UTF-8. This is done by checking for the proper tagging of the first and following bytes to make sure they conform to the UTF-8 format. It then checks to assure that the data part of the UTF-8 sequence conforms to the proper range allowed by the encoding. Note: This routine will not detect characters that have not been assigned and therefore do not exist.
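From what you describe, it sounds like UTF-8. In that case, write your own truncation function: iterate over the string's byte array, look at the value of the current byte, and from the context you should be able to tell whether that byte stands alone or is part of a multi-byte character.
I haven't read the UTF-8 rules in detail, but as I recall, for characters that are 1, 2, or 3 bytes long, the value of each byte follows a fixed pattern.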
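A minimal sketch of the approach described in the reply above, assuming the text is UTF-8. The class and method names (`Utf8Truncate`, `truncate`, `isContinuation`) are made up for illustration and do not come from the thread. It relies on one fixed rule: every continuation byte has the bit pattern 10xxxxxx, so a cut that lands on such a byte falls inside a character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the thread: cuts a string to at most
// maxBytes UTF-8 bytes without splitting a multi-byte character.
class Utf8Truncate {
    // A continuation byte always has the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Assumes maxBytes >= 0.
    static String truncate(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. If it is a
        // continuation byte, the cut falls inside a character, so back up
        // to that character's lead byte and drop the whole character.
        while (end > 0 && isContinuation(bytes[end])) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

For example, with a 1-byte letter followed by a 3-byte CJK character, a limit of 2 or 3 bytes keeps only the letter, because the CJK character does not fit whole.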
int utf8_valid(const unsigned char *buf, unsigned int len)
{
    const unsigned char *endbuf = buf + len;
    unsigned char byte2mask = 0x00, c;
    int trailing = 0;  // trailing (continuation) bytes to follow

    while (buf != endbuf)
    {
        c = *buf++;
        if (trailing)
        {
            if ((c & 0xC0) == 0x80)      // Does trailing byte follow UTF-8 format?
            {
                if (byte2mask)           // Need to check 2nd byte for proper range?
                {
                    if (c & byte2mask)   // Are appropriate bits set?
                        byte2mask = 0x00;
                    else
                        return 0;
                }
                trailing--;
            }
            else
                return 0;
        }
        else if ((c & 0x80) == 0x00)     // valid 1 byte UTF-8
            continue;
        else if ((c & 0xE0) == 0xC0)     // valid 2 byte UTF-8
        {
            if (c & 0x1E)                // Is UTF-8 byte in proper range?
                trailing = 1;
            else
                return 0;
        }
        else if ((c & 0xF0) == 0xE0)     // valid 3 byte UTF-8
        {
            if (!(c & 0x0F))             // Is UTF-8 byte in proper range?
                byte2mask = 0x20;        // If not, set mask to check next byte
            trailing = 2;
        }
        else if ((c & 0xF8) == 0xF0)     // valid 4 byte UTF-8
        {
            if (!(c & 0x07))             // Is UTF-8 byte in proper range?
                byte2mask = 0x30;        // If not, set mask to check next byte
            trailing = 3;
        }
        else if ((c & 0xFC) == 0xF8)     // valid 5 byte UTF-8 (RFC 3629 later limited UTF-8 to 4 bytes)
        {
            if (!(c & 0x03))             // Is UTF-8 byte in proper range?
                byte2mask = 0x38;        // If not, set mask to check next byte
            trailing = 4;
        }
        else if ((c & 0xFE) == 0xFC)     // valid 6 byte UTF-8 (RFC 3629 later limited UTF-8 to 4 bytes)
        {
            if (!(c & 0x01))             // Is UTF-8 byte in proper range?
                byte2mask = 0x3C;        // If not, set mask to check next byte
            trailing = 5;
        }
        else
            return 0;
    }
    return trailing == 0;
}
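Since the question is about Java, here is a rough Java counterpart to the C routine above, offered as a sketch rather than a port. It does not reimplement the bit checks; it asks the JDK's own UTF-8 decoder to report malformed input instead of silently replacing it. The class and method names (`Utf8Check`, `isValidUtf8`) are hypothetical. As far as I know, the JDK decoder follows the current UTF-8 definition, so it is stricter than the RFC 2640 routine and should also reject the 5-byte and 6-byte forms.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the thread.
class Utf8Check {
    // Returns true if buf decodes as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] buf) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of
                    // substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A truncated sequence, a stray continuation byte, and an overlong encoding all come back as invalid, while plain ASCII and complete multi-byte characters pass.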
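// Truncates attrValue so that its UTF-8 encoding is at most 4000 bytes,
// without cutting a multi-byte character in half.
// Requires: import java.nio.charset.StandardCharsets;
private String subAttrbuteValue(String attrValue) {
    String attrStr = attrValue;
    int maxLenth = 4000;
    if (attrStr != null && attrStr.length() > 0) {
        // Name the charset explicitly; getBytes() with no argument uses the
        // platform default, which may not be UTF-8.
        byte[] attrBytes = attrValue.getBytes(StandardCharsets.UTF_8);
        if (attrBytes.length > maxLenth) {
            // Decode only the first maxLenth bytes. A character cut at the
            // boundary decodes to a replacement character.
            String subStr = new String(attrBytes, 0, maxLenth, StandardCharsets.UTF_8);
            int subStrLen = subStr.length();
            // If the same number of chars taken from the original is over the
            // limit, the last one is the character that was cut, so drop it.
            if (attrStr.substring(0, subStrLen).getBytes(StandardCharsets.UTF_8).length > maxLenth) {
                attrStr = attrValue.substring(0, subStrLen - 1);
            } else {
                attrStr = attrValue.substring(0, subStrLen);
            }
            // Do not leave half of a surrogate pair at the end.
            if (!attrStr.isEmpty() && Character.isHighSurrogate(attrStr.charAt(attrStr.length() - 1))) {
                attrStr = attrStr.substring(0, attrStr.length() - 1);
            }
        }
    }
    return attrStr;
}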
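An alternative sketch for the same job as the method above, in case it is useful: let a `CharsetEncoder` write into a `ByteBuffer` of exactly the allowed size. The encoder stops when the next character no longer fits, and the position of the input buffer then says how many chars were consumed. The class and method names (`Utf8EncoderTruncate`, `truncate`) are hypothetical, and the sketch assumes the input is well-formed UTF-16 (no unpaired surrogates) and that `maxBytes` is not negative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the thread.
class Utf8EncoderTruncate {
    static String truncate(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        // The output buffer is the byte budget; its contents are discarded.
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // Encoding stops with an overflow result once the next character
        // does not fit, leaving in.position() at the first unencoded char.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }
}
```

This avoids the decode-and-compare step, at the cost of encoding the kept prefix once just to count bytes.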