Problem reading a MARC file! Looking for expert advice!

I'm working on a program that needs to read book information from a MARC file. It is an ISO file in the following format:
00906nam 2200265 450 001001700000010005600017010002800073016002100101100004100122101001300163102001500176105001800191106000600209200006700215210004000282215002500322312002100347320002100368330013400389510002500523606002400548690001400572701002800586801002600614CAL 010090028423 a978-7-5062-9043-2dCNY16.80 (含光盘)z7-5062-9043-X a978-7-88765-122-8b光盘 aCN-M46-07-0131-0 a20080307d2007 em y0chiy50 ea0 achiaeng aCNb610000 ay a 000yy ar1 a奥运英语大家说Aao yun ying yu da jia shuoi提高篇f主编李林波 a西安c世界图书出版西安公司d2007.09 a284页d21cme光盘1片 a英文题名取自封面 a有书目 (第284页) a本书中英文对照, 共分为6个单元, 每个单元都分为文章和对话两部分, 相关的重难点词汇及释义可帮助读者快速、有效地掌握文章和对话的内容。1 aOlympic Englishzeng0 a英语Aying yux口语 aH319.9v4 0a李林波Ali lin bo4主编 0aCNbRENTIANc20080307
I don't really know where to start. If anyone has written a similar program, please give me some pointers. Thanks.

00906nam  2200265  450 001001700000010005600017010002800073016002100101100004100122101001300163102001500176105001800191106000600209200006700215210004000282215002500322312002100347320002100368330013400389510002500523606002400548690001400572701002800586801002600614CAL 010090028423  a978-7-5062-9043-2dCNY16.80 (含光盘)z7-5062-9043-X  a978-7-88765-122-8b光盘  aCN-M46-07-0131-0  a20080307d2007    em y0chiy50      ea0 achiaeng  aCNb610000  ay  a  000yy  ar1 a奥运英语大家说Aao yun ying yu da jia shuoi提高篇f主编李林波  a西安c世界图书出版西安公司d2007.09  a284页d21cme光盘1片  a英文题名取自封面  a有书目 (第284页)  a本书中英文对照, 共分为6个单元, 每个单元都分为文章和对话两部分, 相关的重难点词汇及释义可帮助读者快速、有效地掌握文章和对话的内容。1 aOlympic Englishzeng0 a英语Aying yux口语  aH319.9v4 0a李林波Ali lin bo4主编 0aCNbRENTIANc20080307
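How to break the file down (assuming GBK/GB2312 encoding; all positions and lengths are counted in bytes):
Leader: bytes 0~23. Bytes 0~4 give the total length of the record in bytes (00906 in this example); bytes 12~16 give the base address of data, i.e. the position where the data area starts (00265 in this example).
Directory: bytes 24~00264 (up to base address - 1). It lists the fields contained in this MARC record. The last position (00264) holds a '\x1E', so the directory length including that terminator is base address - 24.
      The directory is made up of 12-character entries, and each entry is split into parts of 3, 4 and 5 characters: the first part is the field tag, the second is the field length, and the third is the field's starting position in the record (counted from the base address, which is position 0).
Data area: bytes 00265~00904, starting at the base address.
Record terminator: byte 00905, the character '\x1D'.
Fields:
    Within each field, '\x1F' separates the "subfields", and the field begins with two characters called "indicators".
    '\x1F' is usually read as "dollar" and is generally shown as '$' or '@' in human-readable data. The '$' is followed by one letter or digit, which is the subfield code.
    Every field ends with a field terminator ('\x1E').
    Note that fields whose tag is "00x" have neither subfields nor indicators; the whole field is a single value (these are called control fields).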
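The layout described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name MarcLayoutSketch, the Entry type, and the helper names are made up for this example, and sampleRecord() builds a tiny ASCII-only record with invented field contents instead of reading the poster's file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class MarcLayoutSketch {
    /** One 12-byte directory entry: tag (3), field length (4), start offset (5). */
    static class Entry {
        final String tag;
        final int length;
        final int start;
        Entry(String tag, int length, int start) {
            this.tag = tag;
            this.length = length;
            this.start = start;
        }
    }

    /** Parses an ASCII decimal number stored in bytes off .. off+len-1. */
    static int asciiInt(byte[] b, int off, int len) {
        return Integer.parseInt(new String(b, off, len, StandardCharsets.US_ASCII));
    }

    /** Leader bytes 0~4: total record length in bytes. */
    static int recordLength(byte[] rec) { return asciiInt(rec, 0, 5); }

    /** Leader bytes 12~16: base address of the data area. */
    static int baseAddress(byte[] rec) { return asciiInt(rec, 12, 5); }

    /** Reads the 12-byte entries between the leader and the 0x1E at base - 1. */
    static List<Entry> directory(byte[] rec) {
        int base = baseAddress(rec);
        List<Entry> entries = new ArrayList<>();
        for (int p = 24; p + 12 <= base - 1; p += 12) {
            String tag = new String(rec, p, 3, StandardCharsets.US_ASCII);
            entries.add(new Entry(tag, asciiInt(rec, p + 3, 4), asciiInt(rec, p + 7, 5)));
        }
        return entries;
    }

    /** Builds a tiny ASCII-only record; the field contents are made up. */
    static byte[] sampleRecord() {
        String f001 = "ABC123\u001E";
        String f200 = "1 \u001FaTitle\u001E";
        String dir = String.format("001%04d%05d200%04d%05d\u001E",
                f001.length(), 0, f200.length(), f001.length());
        int base = 24 + dir.length();
        int total = base + f001.length() + f200.length() + 1; // +1 for 0x1D
        String leader = String.format("%05dnam  22%05d   450 ", total, base);
        return (leader + dir + f001 + f200 + "\u001D").getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] rec = sampleRecord();
        System.out.println("length=" + recordLength(rec) + " base=" + baseAddress(rec));
        for (Entry e : directory(rec)) {
            System.out.println(e.tag + " length=" + e.length + " start=" + e.start);
        }
    }
}
```

Everything here works on bytes; only the numeric parts of the leader and directory are turned into strings, and those are always ASCII.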
Broken down, your example looks like this:
00906nam  2200265  450
001 0017 00000  CAL 010090028423
010 0056 00017   $a978-7-5062-9043-2$dCNY16.80 (含光盘)$z7-5062-9043-X
010 0028 00073    $a978-7-88765-122-8$b光盘
016 0021 00101   $aCN-M46-07-0131-0
100 0041 00122   $a20080307d2007    em y0chiy50      ea
101 0013 00163  0 $achi$aeng
102 0015 00176    $aCN$b610000
105 0018 00191   $ay  a  000yy
106 0006 00209   $ar
200 0067 00215  1 $a奥运英语大家说$Aao yun ying yu da jia shuo$i提高篇$f主编李林波
210 0040 00282   $a西安$c世界图书出版西安公司$d2007.09
215 0025 00322   $a284页$d21cm$e光盘1片
312 0021 00347   $a英文题名取自封面
320 0021 00368    $a有书目 (第284页)
330 0134 00389  $a本书中英文对照, 共分为6个单元, 每个单元都分为文章和对话两部分, 相关的重难点词汇及释义可帮助读者快速、有效地掌握文章和对话的内容。
510 0025 00523  1 $aOlympic English$zeng
606 0024 00548  0 $a英语$Aying yu$x口语
690 0014 00572   $aH319.9$v4
701 0028 00586  0$a李林波$Ali lin bo$4主编
801 0026 00614  0$aCN$bRENTIAN$c20080307
Commonly used fields:
    Title: 200$a
    Author: 200$f
    Publisher: 210$c
    ISBN: 010$a
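To pull a value such as 200$a out of one field, the field can be split on the subfield delimiter. The following is a minimal sketch, not the poster's code: the class name SubfieldSketch and its helpers are made up for illustration. The sample is field 510 of the record above, with the delimiters written as escapes; it is ASCII-only, so US_ASCII is enough here, whereas for fields containing Chinese text one would pass Charset.forName("GBK"), given the GBK assumption stated earlier.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SubfieldSketch {
    /** Returns the two indicator characters at the start of a data field. */
    static String indicators(byte[] field, Charset cs) {
        return new String(field, cs).substring(0, 2);
    }

    /** Returns the values of all subfields with the given code, in order. */
    static List<String> subfields(byte[] field, char code, Charset cs) {
        String text = new String(field, cs);
        // Drop the trailing field terminator (0x1E) if it was included in the slice
        if (text.endsWith("\u001E")) {
            text = text.substring(0, text.length() - 1);
        }
        List<String> values = new ArrayList<>();
        // parts[0] holds the indicators; each later part is code + value
        String[] parts = text.split("\u001F");
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty() && parts[i].charAt(0) == code) {
                values.add(parts[i].substring(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.US_ASCII;
        byte[] f510 = "1 \u001FaOlympic English\u001Fzeng\u001E".getBytes(cs);
        System.out.println("bytes: " + f510.length); // the directory lists 0025 for this field
        System.out.println("indicators: [" + indicators(f510, cs) + "]");
        System.out.println("$a: " + subfields(f510, 'a', cs));
        System.out.println("$z: " + subfields(f510, 'z', cs));
    }
}
```

Decoding one whole field at a time is safe because the field was already cut out by byte offsets; control fields (00x) have no indicators or subfields and should not go through this helper.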
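Because every field ends with a field terminator, the content you can actually see is always one character shorter than the length given in the directory. For example:
001 0017 00000  CAL 010090028423
The string "CAL 010090028423" is 16 characters long, while the directory gives the length of field 001 as 0017 with starting position 00000.
Below is the data area. Data is taken from it according to the directory. For example, in "001, 0017, 00000", 001 is effectively the entry's name, 0017 is the data length, and 00000 is the starting address.
But why does the extracted data come out somewhat garbled? Question...
CAL 010090028423  a978-7-5062-9043-2dCNY16.80 (含光盘)z7-5062-9043-X  a978-7-88765-122-8b光盘  aCN-M46-07-0131-0  a20080307d2007    em y0chiy50      ea0 achiaeng  aCNb610000  ay   a   000yy  ar1 a奥运英语大家说Aao yun ying yu da jia shuoi提高篇f主编李林波  a西安c世界图书出版西安公司d2007.09  a284页d21cme光盘1片  a英文题名取自封面  a有书目 (第284页)  a本书中英文对照, 共分为6个单元, 每个单元都分为文章和对话两部分, 相关的重难点词汇及释义可帮助读者快速、有效地掌握文章和对话的内容。1 aOlympic Englishzeng0 a英语Aying yux口语  aH319.9v4 0a李林波Ali lin bo4主编 0aCNbRENTIANc20080307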
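When reading a MARC record, you should read it as bytes.
As I said earlier, the breakdown assumes GBK/GB2312 encoding. You can convert the bytes you read into a string, but strings in .NET are Unicode: a Chinese character has length 1 in the string while it occupies 2 bytes in GBK, so the offsets and lengths in the directory no longer match the actual data. (Java strings behave the same way.)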
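The mismatch is easy to reproduce. Here is a small sketch, assuming the JDK in use ships the GBK charset; the class name ByteVsCharSketch and the helper sliceThenDecode are made up for illustration. It compares cutting by byte offset before decoding with applying the same offset to an already decoded string.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class ByteVsCharSketch {
    /** Cuts length bytes starting at start out of raw, then decodes only that slice. */
    static String sliceThenDecode(byte[] raw, int start, int length, Charset cs) {
        return new String(Arrays.copyOfRange(raw, start, start + length), cs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // U+5965 U+8FD0 are the first two characters of the sample title, followed by " 2008"
        String text = "\u5965\u8FD0 2008";
        byte[] raw = text.getBytes(gbk);
        // 7 characters, but 9 bytes: each Chinese character takes 2 bytes in GBK
        System.out.println("chars=" + text.length() + " bytes=" + raw.length);
        // Byte offset 5 is where "2008" starts in the GBK data
        System.out.println(sliceThenDecode(raw, 5, 4, gbk));
        // The same offset applied to the decoded string lands two characters too late
        System.out.println(text.substring(5));
    }
}
```

This is why the directory offsets should only ever be applied to the byte array, with decoding done per field afterwards.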
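On the other hand, in the vast majority of MARC data files each record is followed by a carriage return/line feed, which has to be removed. So the way to read an entire MARC file is:
Open the file as a stream.
Read the first 5 bytes, convert them to a string, then to an int variable, e.g. marcLen.
Read the next marcLen-5 bytes.
Extract the base address of data and convert it to an int variable, e.g. dataStart.
Split the directory (the correct Chinese term is 目次区, not 目录区) into an n*3 array, e.g. aEntrys[n,3].
Starting at position dataStart, extract the data portion separately.
Iterate over aEntrys[], extract each field's content in turn, convert it to a string, and put it into another array aFields[n,2], where [n,0] holds the field tag and [n,1] holds the field data.
The main work of breaking down the MARC record is now done; extract whatever specific information you need (see the earlier posts).
-> Prepare to read the next record:
Read on, ignoring carriage returns and line feeds, until you encounter 5 consecutive digits (the first 5 bytes of a MARC record are always digits).
Go back to the first step and read the next MARC record.
In addition, not every MARC file you are given conforms to the standard. Some software claims to support MARC but actually outputs invalid MARC data; common problems are errors in the directory, a wrong total length, the record terminator being counted as part of the last field, and so on.
In such cases you have to analyze the specific situation and add fault tolerance. For example, a wrong total length can be handled by reading a little more data (usually it is off by just 1 byte), and directory errors can be handled by using the field terminators to locate each field.
This is a basic MARC data reading procedure; there is still plenty of room for optimization.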
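The record-by-record loop described above might look roughly like this in Java. It is a sketch under assumptions, not a complete reader: the class name RecordReaderSketch and its helpers are made up, it only skips non-digit bytes between records instead of checking for five consecutive digits, and it does none of the fault tolerance mentioned above (a bad total length simply raises an exception).

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RecordReaderSketch {
    /** Fills buf[off .. off+len-1], failing if the stream ends first. */
    static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n < 0) throw new EOFException("truncated record");
            off += n;
            len -= n;
        }
    }

    /** Returns the raw bytes of the next record, or null at end of input. */
    static byte[] nextRecord(InputStream in) throws IOException {
        int b;
        // Skip CR/LF (or any other non-digit) left over between records
        do {
            b = in.read();
        } while (b != -1 && (b < '0' || b > '9'));
        if (b == -1) return null;
        byte[] head = new byte[5];
        head[0] = (byte) b;
        readFully(in, head, 1, 4);
        int marcLen = Integer.parseInt(new String(head, StandardCharsets.US_ASCII));
        if (marcLen < 24) throw new IOException("implausible record length " + marcLen);
        byte[] rec = new byte[marcLen];
        System.arraycopy(head, 0, rec, 0, 5);
        readFully(in, rec, 5, marcLen - 5);
        return rec;
    }

    /** Reads every record in the stream. */
    static List<byte[]> readAll(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        for (byte[] rec = nextRecord(in); rec != null; rec = nextRecord(in)) {
            records.add(rec);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // A made-up record with an empty directory: leader + 0x1E + 0x1D = 26 bytes
        String empty = "00026nam  2200025   450 \u001E\u001D";
        byte[] file = (empty + "\r\n" + empty + "\r\n").getBytes(StandardCharsets.US_ASCII);
        List<byte[]> records = readAll(new ByteArrayInputStream(file));
        System.out.println(records.size() + " records read");
    }
}
```

For a real file, readAll would be given a buffered FileInputStream, and each returned byte array would then be split using the leader and directory.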
Here is my implementation in Java:
/**
 * Linpz: a program that reads data from a MARC file
 */
import java.io.*;

class ReadMarc {
    public static void main(String args[]) throws Exception {
        String file = "data.ISO";
        FileInputStream fin = new FileInputStream(file);
        // the bytes array holds all the bytes read from the file
        int fileSize = fin.available();
        byte bytes[] = new byte[fileSize];
        fin.read(bytes);
        // read the total length of this record
        byte marcB[] = new byte[5];
        for(int i = 0; i < 5; i++) {
            marcB[i]= bytes[i];
        }
        String marcS = new String(marcB);
        int marcLen = Integer.parseInt(marcS);
        //System.out.println(marcLen);
        // read the base address of data
        byte marcB2[] = new byte[5];
        for(int i = 0; i < 5; i++) {
            marcB2[i] = bytes[i+12];
        }
        String marcS2 = new String(marcB2);
        int dataStart = Integer.parseInt(marcS2);
        //System.out.println(dataStart);
        // read the directory
        int cmLength = dataStart-24-1;
        byte marcB3[] = new byte[cmLength];
        for(int i = 0; i < cmLength; i++) {
            marcB3[i] = bytes[i+24];
        }
        // split the directory into its 12-character entries
        String marcS3 = new String(marcB3);
        int n = cmLength/12;
        String controls[] = new String[n];
        for(int i = 0; i < n; i++) {
            controls[i] = marcS3.substring(i*12, (i+1)*12);
            //System.out.println(controls[i]);
        }
        // read the data area
        int dataLength = marcLen - dataStart -1;
        byte data[] = new byte[dataLength];
        for(int i = 0; i < dataLength; i++) {
            data[i] = bytes[i+dataStart];
        }
        String OKData[][] = new String[n][2];
        for(int i = 0; i < n; i++) {
            OKData[i][0]=controls[i].substring(0, 3);
            int length = Integer.parseInt(controls[i].substring(3,7));
            int start = Integer.parseInt(controls[i].substring(7));
            byte temp[] = new byte[length];
            for(int j = start; j < length; j++) {
                temp[j] = data[j];
            }
            OKData[i][1]=new String(temp);
            System.out.println(OKData[i][0]+"  "+OKData[i][1]);
        }

        //System.out.println(new String(data));
        fin.close();
    }
}
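Unfortunately, with this approach part of the data in the data area cannot be displayed. If I instead write
String sData = new String(data);
OKData[i][1] = sData.substring(start,start+length);
it fails because sData.length() != data.length. Could an expert please help me solve this!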
To track down the problem, I changed the program as follows:
package bin;

/**
 * Linpz: a program that reads data from a MARC file
 */
import java.io.*;

class ReadMarc {
    public static void main(String args[]) throws Exception {
        String file = "data.ISO";
        FileInputStream fin = new FileInputStream(file);
        // the bytes array holds all the bytes read from the file
        int fileSize = fin.available();
        byte bytes[] = new byte[fileSize];
        fin.read(bytes);
        // read the total length of this record
        byte marcB[] = new byte[5];
        for(int i = 0; i < 5; i++) {
            marcB[i]= bytes[i];
        }
        String marcS = new String(marcB);
        int marcLen = Integer.parseInt(marcS);
        //System.out.println(marcLen);
        // read the base address of data
        byte marcB2[] = new byte[5];
        for(int i = 0; i < 5; i++) {
            marcB2[i] = bytes[i+12];
        }
        String marcS2 = new String(marcB2);
        int dataStart = Integer.parseInt(marcS2);
        //System.out.println(dataStart);
        // read the directory
        int cmLength = dataStart-24-1;
        byte marcB3[] = new byte[cmLength];
        for(int i = 0; i < cmLength; i++) {
            marcB3[i] = bytes[i+24];
        }
        // split the directory into its 12-character entries
        String marcS3 = new String(marcB3);
        int n = cmLength/12;
        String controls[] = new String[n];
        for(int i = 0; i < n; i++) {
            controls[i] = marcS3.substring(i*12, (i+1)*12);
            //System.out.println(controls[i]);
        }
        // read the data area
        int dataLength = marcLen - dataStart -1;
        byte data[] = new byte[dataLength];
        for(int i = 0; i < dataLength; i++) {
            data[i] = bytes[i+dataStart];
        }
        //System.out.println(dataLength);
        String OKData[][] = new String[n][2];
        for(int i = 0; i < n; i++) {
            OKData[i][0]=controls[i].substring(0, 3);
            int length = Integer.parseInt(controls[i].substring(3,7));
            int start = Integer.parseInt(controls[i].substring(7));
            byte temp[] = new byte[length];
            for(int j = start; j < length; j++) {
                temp[j] = data[j];
            }
            //OKData[i][1]=new String(temp,"utf-8");
            // print the raw byte values of the field instead of decoding them
            System.out.print(OKData[i][0]+"  ");
            for(int t=0; t<length; t++) {
                System.out.print(temp[t]);
            }
            System.out.println();
        }

        //System.out.println(new String(data));
        fin.close();
    }
}
Output:
802
001  48495048485548484848484930
005  000000000000050484852
010  0000000000000000000000000000
100  00000000000000000000000000000000000000000
101  00000000
102  000000000000000
105  000000000000000000
106  000000
200  0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
210  00000000000000000000000000000000000000000
215  00000000000000000000
330  000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
606  00000000000000000000000
690  00000000000
701  000000000000000000000000000000000
702  000000000000000000000000000000
801  0000000000000000000000
905  000000000000000000000000000000000000000000000
906  0000000000000000000000000000000000000000000
999  000000000000000000
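The zeros in this output are what the copy loop `for(int j = start; j < length; j++) temp[j] = data[j];` would produce: it uses the same index for both arrays, so it only copies anything while start is smaller than length, and otherwise leaves temp at its initial zeros. A small sketch of that behavior follows, with made-up class and method names (CopyLoopSketch, copyAsPosted, copyByRange) and a made-up ten-byte data array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CopyLoopSketch {
    /** The loop as posted: index j is used for both the source and the target. */
    static byte[] copyAsPosted(byte[] data, int start, int length) {
        byte[] temp = new byte[length];
        for (int j = start; j < length; j++) {
            temp[j] = data[j];
        }
        return temp;
    }

    /** Copies length bytes beginning at start, which is what the directory entry means. */
    static byte[] copyByRange(byte[] data, int start, int length) {
        return Arrays.copyOfRange(data, start, start + length);
    }

    public static void main(String[] args) {
        byte[] data = "ABCDEFGHIJ".getBytes(StandardCharsets.US_ASCII);
        // start = 0: the posted loop happens to work
        System.out.println(Arrays.toString(copyAsPosted(data, 0, 4))); // [65, 66, 67, 68]
        // 0 < start < length: only the tail of temp is filled
        System.out.println(Arrays.toString(copyAsPosted(data, 2, 4))); // [0, 0, 67, 68]
        // start >= length: the loop body never runs
        System.out.println(Arrays.toString(copyAsPosted(data, 4, 3))); // [0, 0, 0]
        // Copying by range gives the intended slice
        System.out.println(Arrays.toString(copyByRange(data, 4, 3)));  // [69, 70, 71]
    }
}
```

This matches the pattern above: field 001 (start 0) prints real byte values, field 005 prints leading zeros followed by a few bytes, and every later field is all zeros.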
Modify the program above slightly, so that the copy loop runs from start to start + length and writes to temp[j - start].
The test data is:
00976nam0 2200301   450 001001100000005001700011010002800028100004100056101000800097102001500105105001800120106000600138200004200144210002700186215002000213330013300233510002900366606001700395606002400412690001300436701003100449801002200480801002400502856005700526905003500583992001500618996004100633010200002920050613171212.0  a7-03-010025-5dCNY75.00  a20030121d2003    em y0chiy0110    ea0 achi  aCNb110000  ay   z   001yy  ar1 a统计手册9tong ji shou cef茆诗松主编  a北京c科学出版社d2003  a31,1091页d21cm  a本书共18章，包括了各领域使用的共同的统计方法，对社会经济统计、生物统计、可靠性统计等领域内的特殊方法还专门列出一章予以特别介绍。1 aStatistics Handbookzeng0 a统计学j手册0 2CT3S075223a统计学  aC8-62v4 0a茆诗松9mao shi song4主编 0aCNbNLCc20031111 2aCNbSWUFEc20060908  ubook://192.168.30.128/02/diskmaf/maf47/14/!00001.pdg  a251430b1153451-6dC8-62e4434  a02-600-029  a成都世云书店b02-600-029n科技新书目
The program is as follows:
package com.swufe.module;

import java.io.FileInputStream;

/**
 * Southwestern University of Finance and Economics, School of Online Education
 * @author xianhai
 *
 */
class ReadMarc {
    public static void main(String args[]) throws Exception {
        String file = "data.ISO";
        FileInputStream fin = new FileInputStream(file);
        // the bytes array holds all the bytes read from the file
        int fileSize = fin.available();
        byte bytes[] = new byte[fileSize];
        fin.read(bytes);
        // read the total length of this record
        byte marcB[] = new byte[5];
        for (int i = 0; i < 5; i++) {
            marcB[i] = bytes[i];
        }
        String marcS = new String(marcB);
        int marcLen = Integer.parseInt(marcS);
        // System.out.println(marcLen);
        // read the base address of data
        byte marcB2[] = new byte[5];
        for (int i = 0; i < 5; i++) {
            marcB2[i] = bytes[i + 12];
        }
        String marcS2 = new String(marcB2);
        int dataStart = Integer.parseInt(marcS2);
        // System.out.println(dataStart);
        // read the directory
        int cmLength = dataStart - 24 - 1;
        byte marcB3[] = new byte[cmLength];
        for (int i = 0; i < cmLength; i++) {
            marcB3[i] = bytes[i + 24];
        }
        // split the directory into its 12-character entries
        String marcS3 = new String(marcB3);
        int n = cmLength / 12;
        String controls[] = new String[n];
        for (int i = 0; i < n; i++) {
            controls[i] = marcS3.substring(i * 12, (i + 1) * 12);
            // System.out.println(controls[i]);
        }
        // read the data area
        int dataLength = marcLen - dataStart - 1;
        byte data[] = new byte[dataLength];
        for (int i = 0; i < dataLength; i++) {
            data[i] = bytes[i + dataStart];
        }
        // System.out.println(dataLength);
        String OKData[][] = new String[n][2];
        for (int i = 0; i < n; i++) {
            OKData[i][0] = controls[i].substring(0, 3);
            int length = Integer.parseInt(controls[i].substring(3, 7));
            int start = Integer.parseInt(controls[i].substring(7));
            // copy the field's bytes first, then decode only that slice
            byte temp[] = new byte[length];
            for (int j = start; j < start + length; j++) {
                temp[j - start] = data[j];
            }
            OKData[i][1] = new String(temp);
            // print: tag, start position, length, field content
            System.out.print(OKData[i][0] + "  ");
            System.out.print(controls[i].substring(7) + " " + controls[i].substring(3, 7) + " ");
            System.out.print(OKData[i][1] + "  ");
            System.out.println();
        }
        System.out.println(new String(data));
        fin.close();
    }
}
The output is as follows:
001  00000 0011 0102000029
005  00011 0017 20050613171212.0
010  00028 0028   a7-03-010025-5dCNY75.00
100  00056 0041   a20030121d2003    em y0chiy0110    ea
101  00097 0008 0 achi
102  00105 0015   aCNb110000
105  00120 0018   ay   z   001yy
106  00138 0006   ar
200  00144 0042 1 a统计手册9tong ji shou cef茆诗松主编
210  00186 0027   a北京c科学出版社d2003
215  00213 0020   a31,1091页d21cm
330  00233 0133   a本书共18章，包括了各领域使用的共同的统计方法，对社会经济统计、生物统计、可靠性统计等领域内的特殊方法还专门列出一章予以特别介绍。
510  00366 0029 1 aStatistics Handbookzeng
606  00395 0017 0 a统计学j手册
606  00412 0024 0 2CT3S075223a统计学
690  00436 0013   aC8-62v4
701  00449 0031  0a茆诗松9mao shi song4主编
801  00480 0022  0aCNbNLCc20031111
801  00502 0024  2aCNbSWUFEc20060908
856  00526 0057   ubook://192.168.30.128/02/diskmaf/maf47/14/!00001.pdg
905  00583 0035   a251430b1153451-6dC8-62e4434
992  00618 0015   a02-600-029
996  00633 0041   a成都世云书店b02-600-029n科技新书目
010200002920050613171212.0  a7-03-010025-5dCNY75.00  a20030121d2003    em y0chiy0110    ea0 achi  aCNb110000  ay   z   001yy  ar1 a统计手册9tong ji shou cef茆诗松主编  a北京c科学出版社d2003  a31,1091页d21cm  a本书共18章，包括了各领域使用的共同的统计方法，对社会经济统计、生物统计、可靠性统计等领域内的特殊方法还专门列出一章予以特别介绍。1 aStatistics Handbookzeng0 a统计学j手册0 2CT3S075223a统计学  aC8-62v4 0a茆诗松9mao shi song4主编 0aCNbNLCc20031111 2aCNbSWUFEc20060908  ubook://192.168.30.128/02/diskmaf/maf47/14/!00001.pdg  a251430b1153451-6dC8-62e4434  a02-600-029  a成都世云书店b02-600-029n科技新书目