Microsoft's HTML Help (.chm) format
http://www.speakeasy.org/~russotto/chm
Preface
This is documentation on the .chm format used by Microsoft HTML Help. This format has been reverse engineered in the past, but as far as I know this is the first freely available documentation on it. One Usenet message indicates that these .chm files are actually IStorage files documented in the Microsoft Platform SDK. However, I have not been able to locate such documentation.

Note
The word "section" is badly overloaded in this document. Sorry about that.

All numbers are in hexadecimal unless otherwise indicated in the text. Except in tabular listings, this will be indicated by $ or 0x as appropriate. All values within the file are Intel byte order (little endian) unless indicated otherwise.

The overall format of a .chm file
The .chm file begins with a short ($38 byte) initial header. This is followed by the header section table, the offset to the content, and a number of bytes of information of unknown use. Collectively, this is the "header". The header is followed by the header sections. There are two header sections. One header section is the file directory, the other contains the file length and some unknown data. Immediately following the header sections is the content.

The Header
The header starts with the initial header, which has the following format:

0000: char[4]  'ITSF'
0004: DWORD    3 (Version number)
0008: DWORD    Total header length, including header section table and
               following data.
000C: DWORD    1 (unknown)
0010: DWORD    a timestamp.
0014: DWORD    Windows Language ID. The two I've seen:
               $0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US
               $0407 = LANG_GERMAN/SUBLANG_GERMAN
0018: GUID     {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC}
0028: GUID     {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC}

Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.

It is followed by the header section table, which is 2 entries, where each entry is $10 bytes long and has this format:

0000: QWORD    Offset of section from beginning of file
0008: QWORD    Length of section

Following the header section table is 8 bytes of additional header data. In Version 2 files, this data is not there and the content section starts immediately after the directory.

0000: QWORD    Offset within file of content section 0

The Header Sections
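Locating the header sections means reading the header laid out above first. The following is a minimal sketch in Python, not a reference implementation: the function name `parse_itsf_header` and the dictionary keys are made up for illustration, and the Version 2 branch assumes that "immediately after the directory" means the end of header section 1.

```python
import struct
import uuid


def parse_itsf_header(data: bytes) -> dict:
    """Unpack the initial header, the header section table and the content offset."""
    magic, version, header_len, _unknown, timestamp, lang_id = struct.unpack_from(
        "<4s5I", data, 0x00
    )
    if magic != b"ITSF":
        raise ValueError("file does not start with 'ITSF'")

    # A GUID is 1 DWORD, 2 WORDs and 8 BYTEs; uuid's bytes_le matches that layout.
    guids = (
        uuid.UUID(bytes_le=data[0x18:0x28]),
        uuid.UUID(bytes_le=data[0x28:0x38]),
    )

    # Header section table: two entries of (offset, length), both QWORDs.
    sections = [struct.unpack_from("<QQ", data, 0x38 + 0x10 * i) for i in range(2)]

    if version == 2:
        # Assumption: no trailing QWORD, content starts right after the directory.
        dir_offset, dir_length = sections[1]
        content_offset = dir_offset + dir_length
    else:
        (content_offset,) = struct.unpack_from("<Q", data, 0x58)

    return {
        "version": version,
        "header_len": header_len,
        "timestamp": timestamp,
        "lang_id": lang_id,
        "guids": guids,
        "sections": sections,
        "content_offset": content_offset,
    }
```

With the result in hand, `sections[0]` locates header section 0 and `sections[1]` locates the directory.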
Header Section 0
This section contains the total size of the file, and not much else.

0000: DWORD    $01FE (unknown)
0004: DWORD    0 (unknown)
0008: QWORD    File Size
0010: DWORD    0 (unknown)
0014: DWORD    0 (unknown)

Header Section 1: The Directory Listing
The central part of the .chm file: a directory of the files and information it contains.

Directory header
The directory starts with a header; its format is as follows:

0000: char[4]  'ITSP'
0004: DWORD    Version number 1
0008: DWORD    Length of the directory header
000C: DWORD    $0a (unknown)
0010: DWORD    $1000 Directory chunk size
0014: DWORD    "Density" of quickref section, usually 2.
0018: DWORD    Depth of the index tree:
               1 if there is no index, 2 if there is one level of PMGI
               chunks.
001C: DWORD    Chunk number of root index chunk, -1 if there is none
               (though at least one file has 0 despite there being no
               index chunk, probably a bug.)
0020: DWORD    Chunk number of first PMGL (listing) chunk
0024: DWORD    Chunk number of last PMGL (listing) chunk
0028: DWORD    -1 (unknown)
002C: DWORD    Number of directory chunks (total)
0030: DWORD    Windows language ID
0034: GUID     {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD    $54 (This is the length again)
0048: DWORD    -1 (unknown)
004C: DWORD    -1 (unknown)
0050: DWORD    -1 (unknown)

The Listing Chunks
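To find an individual chunk, the directory header described above has to be read first. Here is a rough sketch in Python. The names `parse_itsp_header` and `chunk_offset` are hypothetical, and the offset calculation assumes that chunks are numbered from 0 and stored back to back right after the directory header.

```python
import struct

# Fields from 0000 up to (not including) the GUID at 0034.
ITSP_FORMAT = "<4s6I4i2I"


def parse_itsp_header(data: bytes, dir_offset: int) -> dict:
    """Unpack the directory header found at dir_offset (start of header section 1)."""
    (magic, version, header_len, _unknown_0c, chunk_size, density, depth,
     root_chunk, first_pmgl, last_pmgl, _unknown_28, num_chunks,
     lang_id) = struct.unpack_from(ITSP_FORMAT, data, dir_offset)
    if magic != b"ITSP":
        raise ValueError("directory header does not start with 'ITSP'")
    return {
        "version": version,
        "header_len": header_len,
        "chunk_size": chunk_size,
        "density": density,
        # One quickref entry per n directory entries, n = 1 + (1 << density).
        "quickref_interval": 1 + (1 << density),
        "depth": depth,
        "root_chunk": root_chunk,    # -1 if there is no index chunk
        "first_pmgl": first_pmgl,
        "last_pmgl": last_pmgl,
        "num_chunks": num_chunks,
        "lang_id": lang_id,
    }


def chunk_offset(header: dict, dir_offset: int, chunk_number: int) -> int:
    """File offset of a directory chunk (assumes 0-based, contiguous chunks)."""
    return dir_offset + header["header_len"] + chunk_number * header["chunk_size"]
```

The chunk-number fields are read as signed values so that the -1 markers come out as -1 rather than $FFFFFFFF.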
The header is directly followed by the directory chunks. There are two types of directory chunks: index chunks and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:

0000: char[4]  'PMGL'
0004: DWORD    Length of free space and/or quickref area at end of
               directory chunk
0008: DWORD    Always 0.
000C: DWORD    Chunk number of previous listing chunk when reading
               directory in sequence (-1 if this is the first listing chunk)
0010: DWORD    Chunk number of next listing chunk when reading
               directory in sequence (-1 if this is the last listing chunk)
0014:          Directory listing entries (to quickref area). Sorted by
               filename; the sort is case-insensitive.

The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.

Chunklen-0002: WORD  Number of entries in the chunk
Chunklen-0004: WORD  Offset of entry n from entry 0
Chunklen-0008: WORD  Offset of entry 2n from entry 0
Chunklen-000C: WORD  Offset of entry 3n from entry 0
...

The format of a directory listing entry is as follows:

BYTE:   length of name
BYTEs:  name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length

The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate). The length also refers to the length of the file in the section after decompression.

There are two kinds of file represented in the directory: user data and format-related files. The format-related files have names which begin with "::"; the user data files have names which begin with "/".

The Index Chunk
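Before moving on to index chunks, here is a rough Python sketch of walking a listing chunk as described above. The names `read_encint` and `parse_pmgl` are made up for this example. It assumes the entries run from offset $14 up to the start of the free/quickref area, and it ignores the quickref table entirely.

```python
import struct


def read_encint(buf: bytes, pos: int):
    """Read one ENCINT at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:        # high bit clear: this was the last byte
            return value, pos


def parse_pmgl(chunk: bytes):
    """Return (prev_chunk, next_chunk, entries) for one listing chunk.

    entries is a list of (name, content_section, offset, length) tuples.
    """
    magic, free_len, _zero, prev_chunk, next_chunk = struct.unpack_from(
        "<4s2I2i", chunk, 0
    )
    if magic != b"PMGL":
        raise ValueError("not a listing chunk")

    end = len(chunk) - free_len    # entries stop where free space/quickref begins
    pos = 0x14
    entries = []
    while pos < end:
        name_len = chunk[pos]
        pos += 1
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        section, pos = read_encint(chunk, pos)
        offset, pos = read_encint(chunk, pos)
        length, pos = read_encint(chunk, pos)
        entries.append((name, section, offset, length))
    return prev_chunk, next_chunk, entries
```

Following `next_chunk` from the first PMGL chunk until it is -1 visits the whole directory in sorted order.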
An index chunk has the following format:

0000: char[4]  'PMGI'
0004: DWORD    Length of quickref/free area at end of directory chunk
0008:          Directory index entries (to quickref/free area)

The quickref area in a PMGI is the same as in a PMGL.

The format of a directory index entry is as follows:

BYTE:   length of name
BYTEs:  name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name

When higher-level indexes exist (when the depth of the index tree is 3 or higher), presumably the upper-level indexes will contain the numbers of lower-level index chunks rather than listing chunks.

Encoded Integers
An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, $EA $15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.

The Content
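Since every directory offset into the content is stored as an ENCINT, a small decoder is worth sketching before going further. This is only an illustration of the rule given in the previous section (high bit means "continued", most significant group first); the name `decode_encint` is made up, and raising an error on truncated input is a choice made for the example.

```python
def decode_encint(data: bytes, pos: int = 0):
    """Decode one ENCINT starting at pos.

    Returns (value, next_pos), where next_pos is the index of the first
    byte after the encoded integer.
    """
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("ENCINT runs past the end of the data")
        byte = data[pos]
        pos += 1
        # Each byte contributes its low 7 bits, most significant group first.
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, pos


# The worked example from the text: $EA $15 decodes to 0x3515.
value, next_pos = decode_encint(bytes([0xEA, 0x15]))
print(hex(value), next_pos)  # prints: 0x3515 2
```

Returning the next position alongside the value makes it easy to read the three ENCINTs of a directory entry one after another.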
The content typically immediately follows the header sections, and is at the location indicated by the QWORD following the header section table. All content section 0 locations in the directory are relative to that point. The other content sections are stored WITHIN content section 0.

The Namelist file
There exists in content section 0 and in the directory a file called "::DataSpace/NameList". This file contains the names of all the content sections. The format is as follows:

0000: WORD     Length of file, in words
0002: WORD     Number of entries in file

Each entry:

0000: WORD     Length of name in words, excluding terminating null
0002: WORD     Double-byte characters
xxxx: WORD     0

Yes, the names have a length word AND are null terminated; sort of a belt-and-suspenders approach. The coding system is likely UTF-16 (little endian). The section names seen so far are:

Uncompressed
MSCompressed

"Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with Microsoft's LZX algorithm.

The Section Data
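The section names used in the storage path come from the NameList file described above. Here is a small Python sketch of reading it. The function name `parse_namelist` is hypothetical, the names are decoded as UTF-16 little endian as the text guesses, and the leading "length of file" word is read but not relied upon.

```python
import struct


def parse_namelist(data: bytes) -> list:
    """Return the content section names stored in ::DataSpace/NameList."""
    _file_len_words, entry_count = struct.unpack_from("<2H", data, 0)
    pos = 4
    names = []
    for _ in range(entry_count):
        (name_len_words,) = struct.unpack_from("<H", data, pos)
        pos += 2
        raw = data[pos:pos + 2 * name_len_words]
        names.append(raw.decode("utf-16-le"))
        # Skip the characters plus the terminating null WORD.
        pos += 2 * name_len_words + 2
    return names
```

Whether the position of a name in this list equals its content section number is not stated in the text; treating it that way is an assumption.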
For each section other than 0, there exists a file called '::DataSpace/Storage/<Section Name>/Content'. This file contains the compressed data for the section. So, conceptually, getting a file from a nonzero section is a multi-step process. First you must get the content file from section 0. Then you decompress (if appropriate) the section. Then you get the desired file from your decompressed section.
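The steps above can be sketched roughly as follows in Python. Everything here is illustrative: `read_chm_file` is a made-up name, `directory` is assumed to be a mapping built from the listing chunks, `section_names` is assumed to be indexed by content section number, and `decompress` is a placeholder callback because no LZX decoder is implemented here.

```python
from typing import Callable, Dict, Sequence, Tuple

Entry = Tuple[int, int, int]  # (content section, offset, length)


def read_chm_file(
    name: str,
    directory: Dict[str, Entry],
    section_names: Sequence[str],
    content0: bytes,
    decompress: Callable[[str, bytes], bytes],
) -> bytes:
    """Fetch one file, following the multi-step process for nonzero sections.

    content0 holds the bytes of content section 0, i.e. the file contents
    starting at the content offset given in the header.
    """
    section, offset, length = directory[name]
    if section == 0:
        # Section 0 locations are relative to the start of the content.
        return content0[offset:offset + length]

    # Step 1: get the section's content file out of section 0.
    section_name = section_names[section]
    storage_name = "::DataSpace/Storage/%s/Content" % section_name
    _storage_section, s_offset, s_length = directory[storage_name]
    raw = content0[s_offset:s_offset + s_length]

    # Step 2: decompress the section (placeholder; LZX for "MSCompressed").
    section_data = decompress(section_name, raw)

    # Step 3: take the desired file from the decompressed section.
    return section_data[offset:offset + length]
```

A real reader would likely cache the decompressed section rather than decompress it again for every file.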