Extracting data from a byte array

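As the title says: I have a byte array holding data, and I want to extract the data between two strings.
For example, an XML message is stored as a byte array. I want to extract the data between a start tag and an end tag, and the result should also be a byte array.
I currently have two approaches: 1. Convert the byte array to a String, take the substring, then convert it back to bytes. 2. Scan the array with two loops, then copy the matched range into a new byte array.
The first approach feels quick to write but costly at runtime, and the second needs more and more loop iterations as the message grows. I don't know which to choose, so advice from the experts is welcome. A better method would be best of all.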

Solutions »

Go with the first one: use a regular expression to get the content between the two tags, then call getBytes.
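The regex suggestion above could look roughly like the sketch below. The class name RegexExtract and the helper between are made up for illustration, and the code assumes the message is UTF-8 and that the labels should be matched literally (hence Pattern.quote). It returns only the content between the labels, not the labels themselves.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {
    // Hypothetical helper: bytes between the first startLabel and the
    // next endLabel, or null when either label is missing.
    static byte[] between(byte[] data, String startLabel, String endLabel) {
        // Assumption: the message is encoded as UTF-8.
        String text = new String(data, StandardCharsets.UTF_8);
        // quote() makes the labels literal; (.*?) is a reluctant match so the
        // first end label wins; DOTALL lets '.' cross line breaks.
        Pattern p = Pattern.compile(
                Pattern.quote(startLabel) + "(.*?)" + Pattern.quote(endLabel),
                Pattern.DOTALL);
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1).getBytes(StandardCharsets.UTF_8) : null;
    }

    public static void main(String[] args) {
        byte[] xml = "<root><books><book name=\"n1\"/></books></root>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] inner = between(xml, "<books>", "</books>");
        System.out.println(new String(inner, StandardCharsets.UTF_8));
    }
}
```

This still pays for one decode and one encode of the message, which is the cost the original poster was worried about.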
Approach 1:
// Requires: import java.nio.charset.StandardCharsets;
public static byte[] interceptByte(byte[] data, String startLabel,
        String endLabel)
{
    if (data == null || startLabel == null || endLabel == null)
    {
        return null;
    }
    // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
    String str = new String(data, StandardCharsets.UTF_8);
    int startIndex = str.indexOf(startLabel);
    if (startIndex == -1)
    {
        return null;
    }
    // Look for the end label only after the start label
    int endIndex = str.indexOf(endLabel, startIndex + startLabel.length());
    if (endIndex == -1)
    {
        return null;
    }
    // The result includes both labels
    return str.substring(startIndex, endIndex + endLabel.length())
            .getBytes(StandardCharsets.UTF_8);
}
Approach 2:
public static byte[] interceptByte(byte[] data, String startLabel,
        String endLabel)
{
    // Position in data where the extracted range starts
    int start = -1;
    // Position in data where the end label starts
    int end = -1;
    // Holds the extracted data
    byte[] newData = null;
    if (data != null && data.length > 0 && startLabel != null
            && endLabel != null)
    {
        // Convert the labels to char arrays
        char[] charsLabel = startLabel.toCharArray();
        char[] chareLabel = endLabel.toCharArray();
        // Length of the byte array
        int dataLen = data.length;
        // Loop index
        int i = 0;
        // Find the start label
        while (i + charsLabel.length <= dataLen)
        {
            if (isLabel(data, i, charsLabel))
            {
                start = i;
                i += charsLabel.length;
                break;
            }
            i++;
        }
        // Find the end label, bounded by the end label's own length
        while (start != -1 && i + chareLabel.length <= dataLen)
        {
            if (isLabel(data, i, chareLabel))
            {
                end = i;
                break;
            }
            i++;
        }
        // If both the start and the end position exist
        if (start != -1 && end != -1)
        {
            // Size is end - start plus the length of the end label
            int shouldLen = end - start + chareLabel.length;
            newData = new byte[shouldLen];
            // Copy the range, labels included
            System.arraycopy(data, start, newData, 0, shouldLen);
        }
    }
    return newData;
}
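private static boolean isLabel(byte[] data, int start, char[] charLabel)
{
    // Guard against an empty label and against reading past the end of data
    if (charLabel.length == 0 || start + charLabel.length > data.length)
    {
        return false;
    }
    for (int i = 0; i < charLabel.length; ++i)
    {
        // Comparing a byte with a char is only valid for ASCII labels
        if (data[start + i] != charLabel[i])
        {
            return false;
        }
    }
    return true;
}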
Not only because it is concise, but I also think the algorithms in the String class have probably (this is only a guess) been well optimized.
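One way to lean on String.indexOf and still end up with byte offsets is to decode with ISO-8859-1, which maps every byte to exactly one char, so a String index equals a byte index. This is not something the thread proposes; it is a sketch under that assumption, and the names Latin1Search and slice are invented for the example. Like approach 1, the result includes both labels.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Search {
    // Hypothetical helper: returns the range from startLabel through
    // endLabel (inclusive), or null when either label is missing.
    static byte[] slice(byte[] data, byte[] startLabel, byte[] endLabel) {
        // ISO-8859-1 decodes one byte to one char, so indices line up.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        String startKey = new String(startLabel, StandardCharsets.ISO_8859_1);
        String endKey = new String(endLabel, StandardCharsets.ISO_8859_1);
        int start = text.indexOf(startKey);
        if (start < 0) {
            return null;
        }
        int end = text.indexOf(endKey, start + startKey.length());
        if (end < 0) {
            return null;
        }
        // Copy straight from the original bytes; nothing is re-encoded.
        return Arrays.copyOfRange(data, start, end + endLabel.length);
    }

    public static void main(String[] args) {
        byte[] data = "<a>x<b>\u4E2D\u6587</b>y</a>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] part = slice(data,
                "<b>".getBytes(StandardCharsets.UTF_8),
                "</b>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(part, StandardCharsets.UTF_8));
    }
}
```

The labels are passed as bytes in the message's own encoding, so multi-byte content between them is copied untouched.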
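OP, comparing a byte with a char for equality isn't really appropriate, is it? It will break on Chinese text.
That's fine here, because only the label content is compared and it won't contain Chinese. The first approach has a try/catch and converts back and forth, so I'm worried it will use more memory than the loops. With the second approach, a very long message means a lot of iterations, sometimes more than 10,000.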
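The concern about comparing byte with char can be checked with a small experiment. The sketch below uses invented names (ByteVsChar, matchesChars, matchesBytes) and assumes UTF-8. It compares an encoded label first char by char, the way isLabel does, and then byte by byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteVsChar {
    // Mirrors the isLabel idea: compare each byte with a char.
    static boolean matchesChars(byte[] data, char[] label) {
        if (data.length < label.length) {
            return false;
        }
        for (int i = 0; i < label.length; i++) {
            if (data[i] != label[i]) {
                return false;
            }
        }
        return true;
    }

    // Compare against the label's encoded bytes instead.
    static boolean matchesBytes(byte[] data, byte[] label) {
        return data.length >= label.length
                && Arrays.equals(Arrays.copyOf(data, label.length), label);
    }

    public static void main(String[] args) {
        String ascii = "<name>";
        String chinese = "<\u540D\u79F0>"; // a label with two Chinese characters
        for (String label : new String[] {ascii, chinese}) {
            byte[] encoded = label.getBytes(StandardCharsets.UTF_8);
            System.out.println(matchesChars(encoded, label.toCharArray())
                    + " " + matchesBytes(encoded, encoded));
        }
    }
}
```

For the ASCII label both checks agree. For the Chinese label the char comparison fails at the first multi-byte character, because a UTF-8 byte such as 0xE5 is a negative Java byte and can never equal a char like U+540D. So the reply holds for ASCII-only labels, and the byte comparison is the safer default.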
byte[] data = "aabbccddeeffgghhiijj".getBytes();
byte[] target = null;
byte[] startLabel = "bbc".getBytes();
byte[] endLabel = "gghh".getBytes();
if (data != null && data.length > 0 && startLabel != null) {
    int match = 0;   // number of label bytes matched so far
    int start = -1;  // index of the first byte after the start label
    int end = -1;    // index of the first byte of the end label
    boolean matchingStart = true;
    for (int index = 0; index < data.length; index++) {
        byte[] label = matchingStart ? startLabel : endLabel;
        // Caveat: a mismatch only resets the counter and never re-checks
        // earlier bytes, so a label that follows a partial match is missed.
        if (data[index] == label[match]) {
            match++;
        } else {
            match = 0;
        }
        if (matchingStart && match == startLabel.length) {
            // index is the last byte of the start label
            start = index + 1;
            match = 0;
            matchingStart = false;
        } else if (!matchingStart && match == endLabel.length) {
            end = index - (endLabel.length - 1);
            break;
        }
    }
    // Copy only when both labels were found
    if (start != -1 && end != -1) {
        target = new byte[end - start];
        System.arraycopy(data, start, target, 0, target.length);
        System.out.println(new String(target));
    }
}
A single for loop should be enough, right?
byte[] data = "aabbccddeeffgghhiijj".getBytes();
byte[] target = null;
byte[] startLabel = "bc".getBytes();
byte[] endLabel = "gghh".getBytes();
if (data != null && data.length > 0 && startLabel != null) {
    int match = 0;   // number of label bytes matched so far
    int start = -1;  // index of the first byte after the start label
    int end = -1;    // index of the first byte of the end label
    boolean matchingStart = true;
    for (int index = 0; index < data.length; index++) {
        byte[] label = matchingStart ? startLabel : endLabel;
        if (data[index] == label[match]) {
            match++;
        } else {
            // Step back so the search resumes one byte after the position
            // where the failed partial match began
            index -= match;
            match = 0;
        }
        if (matchingStart && match == startLabel.length) {
            // index is the last byte of the start label
            start = index + 1;
            match = 0;
            matchingStart = false;
        } else if (!matchingStart && match == endLabel.length) {
            end = index - (endLabel.length - 1);
            break;
        }
    }
    // Copy only when both labels were found
    if (start != -1 && end != -1) {
        target = new byte[end - start];
        System.arraycopy(data, start, target, 0, target.length);
        System.out.println(new String(target));
    }
}
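I changed it a bit. This should be OK now, right?
I usually convert to a String and work on that~
It's just habit~
Using bytes is better... if there happens to be Chinese text, this can handle it.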
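OP, why convert the byte array to a String first, match and locate, take the substring, and then convert back to a byte array?
We all know that you are processing characters.
But we should also know that characters are just encoded binary data, that is, a byte array.
What does that mean?
It means you can be more direct:
convert the two labels you want to cut between into byte arrays first,
then run pattern matching over the original byte array and extract the binary data between the byte sequences of the two labels.
What you extract is the data you want. Overall, the bottleneck in what you are doing is the conversion between byte arrays and Strings,
so you can bypass the conversion entirely.
Of course, some pattern-matching methods can locate the data very efficiently. It depends on whether you can implement them.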
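The steps described above (encode the labels once, search the byte array, copy the range in between) can be sketched with a plain brute-force search. The names ByteSlice, indexOf and between are illustrative only, not from the thread, and a faster matcher could replace indexOf without changing between.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteSlice {
    // Brute-force search: first index >= from where pattern occurs, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Bytes strictly between the labels, or null when a label is missing.
    static byte[] between(byte[] data, byte[] startLabel, byte[] endLabel) {
        int s = indexOf(data, startLabel, 0);
        if (s < 0) {
            return null;
        }
        int contentStart = s + startLabel.length;
        int e = indexOf(data, endLabel, contentStart);
        return e < 0 ? null : Arrays.copyOfRange(data, contentStart, e);
    }

    public static void main(String[] args) {
        byte[] xml = "<root><books><book name=\"n1\"/></books></root>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] inner = between(xml,
                "<books>".getBytes(StandardCharsets.UTF_8),
                "</books>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inner, StandardCharsets.UTF_8));
    }
}
```

No String is created for the message itself; only the two short labels are encoded.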
If the tags cannot be nested, don't bother converting to String or char at all. Use the KMP algorithm to search directly on the bytes.
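I recommend the KMP algorithm to the OP. The code is as follows:
/**
 * Searches bytes for the byte sequence sub, starting at offset, and returns
 * the position of the first match. Returns -1 if there is no match.
 */
public static int index(byte[] bytes, byte[] sub, int offset) {
    // An empty pattern is treated as "not found"
    if (sub.length == 0) {
        return -1;
    }
    int[] next = getNext(sub);
    int j = 0; // number of bytes of sub matched so far
    for (int i = Math.max(offset, 0); i < bytes.length; i++) {
        // On a mismatch, fall back to the longest border of the matched prefix
        while (j > 0 && bytes[i] != sub[j]) {
            j = next[j - 1];
        }
        if (bytes[i] == sub[j]) {
            j++;
        }
        if (j == sub.length) {
            return i - sub.length + 1;
        }
    }
    return -1;
}

/**
 * Builds the failure table: next[i] is the length of the longest proper
 * prefix of sub[0..i] that is also a suffix of it.
 */
private static int[] getNext(byte[] sub) {
    int[] next = new int[sub.length];
    int k = 0;
    for (int i = 1; i < sub.length; i++) {
        while (k > 0 && sub[i] != sub[k]) {
            k = next[k - 1];
        }
        if (sub[i] == sub[k]) {
            k++;
        }
        next[i] = k;
    }
    return next;
}
/** Test the result */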
public static void main(String[] args) {
    String xmlStr = "<root><books><book name=\"n1\"/><book name=\"n2\"/></books></root>";
    String startStr = "<books>";
    String endStr = "</books>";
    // Build the arguments
    byte[] xml = xmlStr.getBytes();
    byte[] start = startStr.getBytes();
    byte[] end = endStr.getBytes();
    // Start extracting
    int startPos = index(xml, start, 0); // match start from position 0; returns the position of start
    if (startPos == -1) {
        System.out.println("start label not found");
        return;
    }
    int endPos = index(xml, end, startPos + start.length); // match end from just after start; returns the position of end
    if (endPos == -1) {
        System.out.println("end label not found");
        return;
    }
    byte[] sub = new byte[endPos - startPos - start.length];
    System.arraycopy(xml, startPos + start.length, sub, 0, sub.length);
    // Show what was extracted
    System.out.println(new String(sub));
}
Output: <book name="n1"/><book name="n2"/>
Heh, the post above already gives a KMP search on bytes. For putting the result together at the end, this constructor would be a better fit: String(byte[] bytes, int offset, int length). Heh.
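A short sketch of that last suggestion: once the two offsets are known, the text can be decoded straight from the original array without copying into an intermediate byte[]. The class OffsetString and the method decode are invented for the example. It uses the four-argument overload String(byte[], int, int, Charset) so the encoding is explicit, and UTF-8 is an assumption here.

```java
import java.nio.charset.StandardCharsets;

class OffsetString {
    // Decodes data[from, to) directly, with no intermediate array copy.
    static String decode(byte[] data, int from, int to) {
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xmlStr = "<root><books><book name=\"n1\"/></books></root>";
        byte[] xml = xmlStr.getBytes(StandardCharsets.UTF_8);
        // The sample is pure ASCII, so String indices equal byte offsets;
        // in real use the offsets would come from a byte-level search.
        int from = xmlStr.indexOf("<books>") + "<books>".length();
        int to = xmlStr.indexOf("</books>");
        System.out.println(decode(xml, from, to));
    }
}
```

This only helps when a String is wanted at the end. If the result has to stay a byte[], the System.arraycopy call from the earlier posts is still needed.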