ITSF internal file formats

Preface

In this section, where the description of a file says that an item is an offset into another file, that file may be located in the same CHM, or it may be located in an accompanying CHI file.

The different types of ITSF files contain different internal files. The list below indicates which file types contain which internal files:

CHI
/#ITBITS, /#STRINGS, /#SYSTEM, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#WINDOWS, /$OBJINST, /#IDXHDR, /$WWKeywordLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property
CHM
/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#IVB, /#SUBSETS, /#WINDOWS, /$FIftiMain, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
CHQ
/$FIftiMain, /$OBJINST, /$TitleMap
CHW
/$OBJINST, /$HHTitleMap, /$WWKeywordLinks/BTREE, /$WWKeywordLinks/DATA, /$WWKeywordLinks/MAP, /$WWKeywordLinks/PROPERTY, /$WWAssociativeLinks/BTREE, /$WWAssociativeLinks/DATA, /$WWAssociativeLinks/MAP, /$WWAssociativeLinks/PROPERTY
ITS
These only contain format & author files.
hh.dat
/Path/file.chm/windowtype, /Path/file.chm/AdvSearchUI/Keywords, /Path/file.chm/AdvSearchUI/Properties, /Path/file.chm/Bookmarks/v1/Count, /Path/file.chm/Bookmarks/v1/n/Topic, /Path/file.chm/Bookmarks/v1/n/Url
Seen in HHA.dll, but not seen in any ITSF files
/#GRPINF, /#INFOTYPES, /#URLS

Internal file formats

/#ITBITS

The files I have seen so far have been empty or filled with zero BYTEs so who knows. My guess is that it has something to do with information types. The file where it had a non-zero size (12 zero BYTEs in VOICESDK.CHI) also had a non-zero /#SYSTEM code 15 (Information type checksum) entry of 0xffffffff.

/#SYSTEM

Offset  Type   Comment/Value
0       DWORD  3 (Version number)
4              /#SYSTEM entries to the EOF

/#SYSTEM entries have the following format:

Offset  Type   Comment/Value
0       WORD   code - see below for values & meanings
2       WORD   length of data
4       BYTEs  data
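
The layout above can be walked with a few lines of Python. This is only a sketch: the function name parse_system is made up for illustration, and it assumes that all integers are little-endian, which the description does not state explicitly.

```python
import struct

def parse_system(data: bytes):
    """Return (version, [(code, payload), ...]) for a raw /#SYSTEM file."""
    # Assumption: all integers are little-endian.
    (version,) = struct.unpack_from("<I", data, 0)
    entries = []
    pos = 4
    while pos + 4 <= len(data):
        code, length = struct.unpack_from("<HH", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError("entry runs past the end of the file")
        entries.append((code, data[pos:pos + length]))
        pos += length
    return version, entries

# Synthetic input: version 3, then a code 3 (Title) entry holding "Demo\0".
sample = struct.pack("<I", 3) + struct.pack("<HH", 3, 5) + b"Demo\x00"
version, entries = parse_system(sample)
```

The payloads are returned as raw bytes because their meaning depends on the code.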

In the below list of the different codes the order of the codes in the /#SYSTEM file is 10, 9, 4, 2, 3, 16, 6, (5,0,1 or 0,1,5 - haven't seen files with all three), 7, 11, 12, 13, 14, 8 and lastly 15.

An explanation of each of the /#SYSTEM codes

Code      Explanation
0         Value of Contents file in [OPTIONS] section of the hhp file. Present when binary TOC is off, but not always. NT
1         Value of Index file in [OPTIONS] section of the hhp file. Present when binary index is off, but not always. NT
2         Value of Default topic in [OPTIONS] section of the hhp file. NT
3         Value of Title in [OPTIONS] section of the hhp file. NT
4         28 or 36 byte structure:
            Offset  Type     Comment/Value
            0       DWORD    LCID from the HHP file.
            4       DWORD    One if DBCS is in use.
            8       DWORD    One if full-text search is on.
            0xC     DWORD    One if the file has KLinks.
            0x10    DWORD    One if the file has ALinks.
            0x14    QWORD    timestamp - Win32 FILETIME structure. May be from Jan 1, 1970 instead of 1601.
            0x1C    BYTE[8]  0 (unknown)
5         Value of Default Window in [OPTIONS] section of the hhp file. NT
6         Value of Compiled file in [OPTIONS] section of the hhp file. This is the lowercase of the stem of the CHM file name. If the name of the CHM is "..\bar\foo\ FOO-Bar . chm jimmy is a poo-bum" then this will be " foo-bar ". NT
7         DWORD present in files with binary Index
8         Rare. VOICESDK.CHM & CHI from the MSDN has one. Each entry is 16 BYTEs:
            Offset  Type     Comment/Value
            0       DWORD    0, 4 in some (unknown)
            4       DWORD    Offset in /#STRINGS file. An abbreviation.
            8       DWORD    3 where 1st DWORD is 0, 5 where it is 4 (unknown)
            0xC     DWORD    Offset in /#STRINGS file. An explanation of the abbreviation.
9         The version/program that the CHM was compiled by - shown in the version dialog as "Compiled with %s" where %s is what is in the /#SYSTEM file. If compiled with the MS HTML Help Author dll then it will be something like "HHA Version 4.74.8702". I'm speculating that it comes directly from the resource strings of HHA.dll (I saw it there in Unicode, but haven't yet tried altering it >;-). NT
10        time_t timestamp (DWORD)
11        DWORD present in files with binary TOC
12        Number of information types (DWORD).
13        The /#IDXHDR file contains exactly the same bytes. See below for more info
14        Rare. The ones I saw were from MS Word 2000. My guess is that it is an MSOffice extension (or maybe not) that overrides the names & window types of the navigation tabs. DWORD number of windows to override, 2 ANSI NT strings for each window. The first is the text for the tab & the second is probably the name of the window type to use. (eg 2, "&Answer Wizard\0MsoHelpAWDlg\0&Index\0MsoHelpKeyDlg\0") These are from the Custom tab variables of the [OPTIONS] section of the hhp file.
15        Information type checksum (DWORD). Unknown algorithm & data source.
16        Value of Default Font in [OPTIONS] section of the hhp file. NT
17-65535  Not yet seen. Please let us know if you see these.
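
As an example of decoding one payload, the code 4 structure can be unpacked as sketched below. The names Code4 and parse_code4 are hypothetical, little-endian byte order is an assumption, and the timestamp is returned as a raw integer because the epoch is uncertain.

```python
import struct
from collections import namedtuple

Code4 = namedtuple("Code4", "lcid dbcs full_text_search klinks alinks timestamp")

def parse_code4(payload: bytes) -> Code4:
    """Decode the 28 or 36 byte payload of a /#SYSTEM code 4 entry."""
    if len(payload) not in (28, 36):
        raise ValueError("expected a 28 or 36 byte structure")
    # Five DWORDs followed by one QWORD (assumed little-endian, no padding).
    lcid, dbcs, fts, klinks, alinks, timestamp = struct.unpack_from("<5IQ", payload, 0)
    # The flags are documented as "one if ...", so compare against 1.
    return Code4(lcid, dbcs == 1, fts == 1, klinks == 1, alinks == 1, timestamp)
```

The trailing 8 unknown bytes of the 36 byte variant are ignored.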

/#IDXHDR

This has exactly the same bytes as the code 13 entry in the /#SYSTEM file

Offset  Type      Comment/Value
0       char[4]   T#SM
4       DWORD     Unknown checksum
8       DWORD     Unknown
0xC     DWORD     Number of topic nodes including the contents & index files
0x10    BYTE[56]  Unknown
0x48    DWORD     Number of files in the [MERGE FILES] list
0x4C    DWORD     Unknown
0x50    DWORDs    List of offsets in the /#STRINGS file that are the [MERGE FILES] list
+0                0 BYTEs to the EOF (unknown)
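
A rough sketch of pulling the known fields out of this header follows. The function name parse_idxhdr is invented for illustration and little-endian byte order is assumed.

```python
import struct

def parse_idxhdr(data: bytes):
    """Return (number of topic nodes, [MERGE FILES] offsets into /#STRINGS)."""
    if data[:4] != b"T#SM":
        raise ValueError("missing T#SM signature")
    (topic_nodes,) = struct.unpack_from("<I", data, 0x0C)
    (merge_count,) = struct.unpack_from("<I", data, 0x48)
    # The offsets list starts at 0x50, one DWORD per merged file.
    offsets = struct.unpack_from("<%dI" % merge_count, data, 0x50)
    return topic_nodes, list(offsets)
```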

/#WINDOWS

This file contains information on the window types in the CHM. It has the following format:

Offset  Type   Comment/Value
0       DWORD  Number of entries in the file
4       DWORD  Size of each of the entries in the file (188 or 196)
8              /#WINDOWS entries to the EOF

/#WINDOWS entries are basically HH_WINTYPE structures as specified in htmlhelp.h. Note the first DWORD can be used to specify different versions of this structure. Also note that the HHW docs show a different structure to htmlhelp.h. Therefore many CHM files need to be surveyed to find structures with sizes other than 188 or 196. In the description of /#WINDOWS entries below, Arg n means that that item is argument n of the window definition in the hhp file, either converted to a DWORD or to an offset in the indicated file:

The format of each /#WINDOWS entry.

Offset  Type      Comment/Value
0       DWORD     Size of the entry (188 in CHMs compiled with "Compatibility=1.0", 196 in CHMs compiled with "Compatibility=1.1 or later")
4       DWORD     0 (unknown) - but htmlhelp.h indicates that this is "BOOL fUniCodeStrings; // IN/OUT: TRUE if all strings are in UNICODE"
8       DWORD     Arg 0. Offset in /#STRINGS file.
0xC     DWORD     Which window properties are valid & are to be used for this window. See the table below.
0x10    DWORD     Arg 10.
0x14    DWORD     Arg 1. Offset in /#STRINGS file.
0x18    DWORD     Arg 14.
0x1C    DWORD     Arg 15.
0x20    RECT      Arg 13. Order left, top, right & bottom.
0x30    DWORD     Arg 16.
0x34    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHelp; // OUT: window handle"
0x38    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndCaller; // OUT: who called this window"
0x3C    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HH_INFOTYPE* paInfoTypes; // IN: Pointer to an array of Information Types"
0x40    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndToolBar; // OUT: toolbar window in tri-pane window"
0x44    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndNavigation; // OUT: navigation window in tri-pane window"
0x48    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHTML; // OUT: window displaying HTML in tri-pane window"
0x4C    DWORD     Arg 11.
0x50    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcHTML; // OUT: HTML window coordinates" & the HHW docs say "Specifies the coordinates of the Topic pane."
0x60    DWORD     Arg 2. Offset in /#STRINGS file.
0x64    DWORD     Arg 3. Offset in /#STRINGS file.
0x68    DWORD     Arg 4. Offset in /#STRINGS file.
0x6C    DWORD     Arg 5. Offset in /#STRINGS file.
0x70    DWORD     Arg 12.
0x74    DWORD     Arg 17.
0x78    DWORD     Arg 18.
0x7C    DWORD     Arg 19 (unknown) - but htmlhelp.h indicates that this is "int tabpos; // IN/OUT: HHWIN_NAVTAB_TOP, HHWIN_NAVTAB_LEFT, or HHWIN_NAVTAB_BOTTOM".
0x80    DWORD     Arg 20 (unknown) - but htmlhelp.h indicates that this is "int idNotify; // IN: ID to use for WM_NOTIFY messages"
0x84    BYTE[20]  0 (unknown) - but htmlhelp.h indicates that this is "BYTE tabOrder[HH_MAX_TABS + 1]; // IN/OUT: tab order: Contents, Index, Search, History, Favorites, Reserved 1-5, Custom tabs"
0x98    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cHistory; // IN/OUT: number of history items to keep (default is 30)"
0x9C    DWORD     Arg 7. Offset in /#STRINGS file.
0xA0    DWORD     Arg 9. Offset in /#STRINGS file.
0xA4    DWORD     Arg 6. Offset in /#STRINGS file.
0xA8    DWORD     Arg 8. Offset in /#STRINGS file.
0xAC    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcMinSize; // Minimum size for window (ignored in version 1)"
Everything after here is only present in CHMs compiled with "Compatibility=1.1 or later".
0xBC    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cbInfoTypes; // size of paInfoTypes;"
0xC0    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "LPCTSTR pszCustomTabs; // multiple zero-terminated strings"

Flags used to specify which values are valid.

Value       Valid property
0x00000002  Navigation Pane Style.
0x00000004  Style Flags.
0x00000008  Extended Style Flags.
0x00000010  Initial Position.
0x00000020  Navigation Pane Width.
0x00000040  Show state.
0x00000080  Info types.
0x00000100  Buttons.
0x00000200  Navigation Pane initially closed state.
0x00000400  Tab pos.
0x00000800  Tab order.
0x00001000  History count.
0x00002000  Default Pane.
0x?????000  The rest of the values either do nothing or are unknown. Please let us know if you find out what the rest are.
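
The flag values above lend themselves to a small lookup table. The sketch below (the names VALID_FLAGS and describe_valid_flags are made up) lists the known properties set in the DWORD at offset 0xC and reports any leftover bits separately, since their meaning is unknown.

```python
# Known bits of the "valid properties" DWORD, taken from the table above.
VALID_FLAGS = {
    0x00000002: "Navigation Pane Style",
    0x00000004: "Style Flags",
    0x00000008: "Extended Style Flags",
    0x00000010: "Initial Position",
    0x00000020: "Navigation Pane Width",
    0x00000040: "Show state",
    0x00000080: "Info types",
    0x00000100: "Buttons",
    0x00000200: "Navigation Pane initially closed state",
    0x00000400: "Tab pos",
    0x00000800: "Tab order",
    0x00001000: "History count",
    0x00002000: "Default Pane",
}

def describe_valid_flags(value: int):
    """Return (names of known flags that are set, bits with no known meaning)."""
    names = [name for bit, name in VALID_FLAGS.items() if value & bit]
    known_mask = sum(VALID_FLAGS)  # every key is a distinct single bit
    return names, value & ~known_mask
```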

/#STRINGS

This file is a list of ANSI NT strings. The first is just a NIL character so that offsets to that file can specify zero & get a valid string. The strings are in this order; "\0", [WINDOWS] (Arg 0, Arg 1, Arg 7, Arg 9, Arg 2, Arg 3, Arg 4, Arg 5, Arg 6, Arg 8) #n..., Contents_0_Entry_title, Index_0_Keyword, Contents_Image_file, Contents_Font, Contents_Default_frame, Contents_Default_window, [MERGE FILES] #n...
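
Since most other files refer to /#STRINGS by byte offset, a lookup helper is the natural building block. A minimal sketch follows; the name read_string is made up, and decoding as cp1252 is an assumption, because "ANSI" really means whatever code page the CHM was authored in.

```python
def read_string(strings: bytes, offset: int, encoding: str = "cp1252") -> str:
    """Return the NT string that starts at the given offset in /#STRINGS."""
    end = strings.index(b"\x00", offset)  # raises ValueError if unterminated
    return strings[offset:end].decode(encoding)

# Synthetic /#STRINGS contents: the leading NIL, then two strings.
blob = b"\x00Main\x00Title\x00"
```

Offset zero yields the empty string, which is why the file starts with a NIL character.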

/#TOCIDX

Present in files with a binary TOC.

/#TOPICS

This file contains information on the topics present. It is sorted by URL.

Each entry has the following format.

Offset  Type   Comment/Value
0       DWORD  Unknown
4       DWORD  Offset in /#STRINGS file of title. -1 = no title.
8       DWORD  Offset in /#URLTBL of entry containing offset to /#URLSTR entry containing the URL.
0xC     DWORD  2 indicates not in contents, 6 indicates that it is in the contents (unknown)
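
Each entry is therefore 16 bytes, so the file can be read as a flat array. The sketch below uses a hypothetical function name, assumes little-endian DWORDs, and assumes the file has no header before the first entry.

```python
import struct

NO_TITLE = 0xFFFFFFFF  # -1 stored as an unsigned DWORD

def parse_topics(data: bytes):
    """Return a list of (title offset or None, /#URLTBL offset, in_contents)."""
    topics = []
    # iter_unpack raises struct.error if the size is not a multiple of 16.
    for _unknown, title_off, urltbl_off, flags in struct.iter_unpack("<4I", data):
        title = None if title_off == NO_TITLE else title_off
        topics.append((title, urltbl_off, flags == 6))
    return topics
```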

/#URLSTR

Before all the entries is an unknown byte.

Each entry has the following format.

Offset  Type   Comment/Value
0       QWORD  0 (unknown)
8              NT string

/#URLTBL

Each entry has the following format.

Offset  Type   Comment/Value
0       DWORD  Unknown
4       DWORD  Offset in /#URLSTR file of URL entry containing URL.
8       DWORD  Unknown
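
Putting /#URLTBL and /#URLSTR together, a topic's URL can be resolved roughly as follows. This is a sketch under several assumptions: the name resolve_url is made up, DWORDs are little-endian, the offset stored in /#URLTBL points at the start of the /#URLSTR entry (so the string begins 8 bytes later), and the URL is decoded as cp1252.

```python
import struct

def resolve_url(urltbl: bytes, urlstr: bytes, urltbl_offset: int,
                encoding: str = "cp1252") -> str:
    """Follow a /#URLTBL entry to its NT URL string in /#URLSTR."""
    _unknown1, urlstr_offset, _unknown2 = struct.unpack_from("<3I", urltbl, urltbl_offset)
    start = urlstr_offset + 8           # skip the unknown QWORD
    end = urlstr.index(b"\x00", start)  # raises ValueError if unterminated
    return urlstr[start:end].decode(encoding)
```

The urltbl_offset argument is the value found at offset 8 of a /#TOPICS entry.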

/#IVB

This is basically the [ALIAS] section of the HHP file.

Offset  Type   Comment/Value
0       DWORD  Size of the file minus 4 (num entries = (filelen-4)/8)
4              /#IVB entries to the EOF

/#IVB entries have the following format.

Offset  Type   Comment/Value
0       DWORD  The value of the alias
4       DWORD  Offset in /#STRINGS file of the file to show
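
A short sketch of reading the alias table, assuming little-endian DWORDs (parse_ivb is a hypothetical name):

```python
import struct

def parse_ivb(data: bytes) -> dict:
    """Map each alias value to the /#STRINGS offset of the file to show."""
    (size,) = struct.unpack_from("<I", data, 0)  # file size minus 4
    aliases = {}
    for i in range(size // 8):
        value, strings_offset = struct.unpack_from("<II", data, 4 + 8 * i)
        aliases[value] = strings_offset
    return aliases
```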

/#SUBSETS

This file is present when the [SUBSETS] section is present in the HHP file.

Offset  Type   Comment/Value
0       WORD   0 (unknown)
2       WORD   Number of bytes taken up by the subset entries.
4              Subset entries.

The subset entries currently seem to be garbage left over from previous usage of the same memory locations. Based on the number of bytes per non-whitespace line in the [SUBSETS] section each subset entry is 12 BYTEs in length.

/$FIftiMain

Offset  Type   Comment/Value
0x7E    DWORD  LCID from the HHP file.

If you have a word longer than 99 characters in a HTML file then it seems the indexing routines will die during indexing of that file and then skip on to the next one.

Empty when "Full-text search=No" or when no HTML files have been indexed. Holds the full-text search information.

All word sorting, processing and storage is done case-insensitively and is not case-preserving.

Note that files with an extension other than htm/html will not contribute keywords to this fast-search index.

The file is made up of blocks of bytes of several types, and I think it works like a tree. My first guess was that at each letter of a word you can either terminate (if there are no duplicates at that letter) and specify the rest of the word, or branch out to each variant of that letter. The advantage of this is that you don't need to search a huge list every time you do a search.
That guess turned out to be wrong: first you have the whole word at level zero, then the rest of the words that begin with the same character are sorted. I think there are different blocks for terminal nodes (leaves) & branching nodes (branches).
The words are set to lowercase.
There would also be some sort of table to point to the URLs that contain the words.

Offset  Type   Comment/Value
0       WORD   Unknown
2              Entries to the end of the block

Entries
Offset  Type   Comment/Value
0       BYTE   Length of the word in this entry including the NT. Maximum of 100.
1       BYTE   Position in the word where characters are placed.
2       BYTEs  Length bytes make up the word or part of the word. NT
+0      BYTE   Unknown. Some kind of block number?
+1      BYTE   Index number
+2      BYTE   Unknown. Some kind of block number?
+3      DWORD  Unknown. Some kind of block number?
+7      BYTE   How much to increase the index number by for the next entry.

/$OBJINST

More information is needed on this file.

From the name and the number of GUIDs present I guess it has something to do with ActiveX objects.

Offset  Type   Comment/Value
0       DWORD  0x04000000 (unknown)
4       DWORD  Number of entries

This is followed by a listing. Each listing entry has the following format:

Offset  Type   Comment/Value
0       DWORD  Offset of the entry in this file
4       DWORD  Length of the entry

The listing is followed by the entries one after another at offsets specified in the listing.
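
Because the listing gives an absolute offset and a length for every entry, the file can be split without understanding the entries themselves. A sketch follows, with a made-up function name and assuming little-endian DWORDs:

```python
import struct

def split_objinst(data: bytes):
    """Return the raw bytes of each /$OBJINST entry named in the listing."""
    _marker, count = struct.unpack_from("<II", data, 0)
    entries = []
    for i in range(count):
        # Listing records start right after the 8 byte header.
        offset, length = struct.unpack_from("<II", data, 8 + 8 * i)
        entries.append(data[offset:offset + length])
    return entries
```

The first 16 bytes of each returned entry can then be compared against the GUIDs described below to tell the entry types apart.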

There are 2 known types of entries. The first seems to be made up of up to 3 different sub entries. The second is a 36 BYTE structure.

The first entry

Offset  Type   Comment/Value
0       GUID   {4662DAAF-D393-11D0-9A56-00C04FB68BF7}
0x10    DWORD  0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14    DWORD  Unknown. Methinks bitflags that somehow affect the size of entries that have the 0x04000000 DWORD, like each bit specifies the presence/absence of a specific subentry.
0x18    DWORD  0x4e4 (unknown)
0x1C    DWORD  LCID from the HHP file.
0x20    BYTEs  unknown
+0             Entries

I haven't been able to find any files without the data for bits 0 & 1, so I can't say exactly how big the header is, or which bytes belong to the bit 0 block and which to the bit 1 block. Together, though, bits 0 & 1 account for a large run of repeating 10-byte blocks with increasing values, plus something else at the end. I suspect that the repeats belong to bit 0 and the data at the end to bit 1. As to the function of these two blocks, there are no GUIDs and no other clues, so who knows.

bit 2. Only present when Full text search stop list file has been specified in the HHP.

Offset  Type          Comment/Value
0       DWORD         unknown
4       DWORD         Length in bytes of the entries not including the last zero word.
8       BYTE[32]      unknown
0x28                  Entries. The last entry has a zero length word.

bit 2 entries

Offset  Type          Comment/Value
0       WORD          Length of the word
2       char[length]  ANSI string from the stop list file. Not NT.
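
The bit 2 entries are simple length-prefixed strings, so the stop list can be recovered along these lines. This is a sketch only: the function name is invented, the length WORD is assumed to be little-endian, and cp1252 stands in for whatever ANSI code page the stop list file used. The input is the entries area that starts at offset 0x28.

```python
import struct

def parse_stop_words(entries: bytes, encoding: str = "cp1252"):
    """Return the stop list words stored as WORD-length-prefixed strings."""
    words = []
    pos = 0
    while pos + 2 <= len(entries):
        (length,) = struct.unpack_from("<H", entries, pos)
        pos += 2
        if length == 0:  # the last entry has a zero length word
            break
        words.append(entries[pos:pos + length].decode(encoding))
        pos += length
    return words
```
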
bit 3

Offset  Type   Comment/Value
0       GUID   {8FA0D5A8-DEDF-11D0-9A61-00C04FB68BF7}
0x10    DWORD  0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14    DWORD  1 (unknown)
0x18    DWORD  0x4e4 (unknown)
0x1C    DWORD  LCID from the HHP file.
0x20    DWORD  0 (unknown)

The second entry

Offset  Type   Comment/Value
0       GUID   {4662DAB0-D393-11D0-9A56-00C04FB68B66}
0x10    DWORD  666 I think this represents the version of the class that the GUID refers to.
0x14    DWORD  0x4e4 (unknown)
0x18    DWORD  LCID from the HHP file.
0x1C    DWORD  Almost always 10031. Have seen 66631 once in accessib.chm (unknown)
0x20    DWORD  0 (unknown)

/$HHTitleMap

The file begins with a WORD indicating the number of entries.

Each entry has the following format:

Offset  Type   Comment/Value
0       WORD   Length of the file stem.
2       BYTEs  File stem. ANSI string. Not NT.
+0      DWORD  Unknown.
+4      DWORD  Unknown. Same value as previous DWORD.
+8      DWORD  LCID of the specified file.
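
Reading the variable-length entries could look like the sketch below. The function name is hypothetical, integers are assumed to be little-endian, and cp1252 is an assumed stand-in for the ANSI code page.

```python
import struct

def parse_hhtitlemap(data: bytes, encoding: str = "cp1252"):
    """Return a list of (file stem, LCID) pairs from a /$HHTitleMap file."""
    (count,) = struct.unpack_from("<H", data, 0)
    pos = 2
    titles = []
    for _ in range(count):
        (stem_len,) = struct.unpack_from("<H", data, pos)
        pos += 2
        stem = data[pos:pos + stem_len].decode(encoding)  # not NT
        pos += stem_len
        _unknown1, _unknown2, lcid = struct.unpack_from("<3I", data, pos)
        pos += 12
        titles.append((stem, lcid))
    return titles
```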

/$TitleMap

The file begins with a WORD indicating the number of entries.

Each entry is 68 BYTEs in length and has the following format:

Offset  Type   Comment/Value
0       BYTEs  File stem. ANSI NT fixed length string.
+0      BYTEs  The following items, in order:
               Unknown.
               Unknown.
               Unknown.
               If there are no ALinks in the CHM then this will be a zero DWORD.
               Unknown.
               Unknown.
               Unknown.
               If there are no KLinks in the CHM then this will be a zero DWORD.

/Path/file.chm/windowtype

A cache of the windowtype entry from the /#WINDOWS file of the \Path\file.chm CHM file. Some of the data, such as window size & position, will be different if the user has customized it.

/Path/file.chm/AdvSearchUI/Keywords

UTF-16 NT string. Each search item is separated by a UTF-16 Line Feed character. The string is followed by an unknown WORD.
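
A sketch of splitting this value into its search items follows. The function name is made up and little-endian UTF-16 is an assumption. The terminator is searched for on 2-byte boundaries so that the trailing unknown WORD is never decoded.

```python
def parse_keywords(data: bytes):
    """Return the search items stored in an AdvSearchUI/Keywords value."""
    end = len(data)
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == b"\x00\x00":  # UTF-16 NIL terminator
            end = i
            break
    text = data[:end].decode("utf-16-le")
    return text.split("\n") if text else []
```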

/Path/file.chm/AdvSearchUI/Properties

DWORD. Only the lowest 3 bits are used. "Match similar words" is controlled by bit 0. "Search titles only" is controlled by bit 1. "Search previous results" is controlled by bit 2. Note that since previous search results are not stored anywhere as yet HH will uncheck the "Search previous results" checkbox even if its bit is on. IMHO this is a bug: HH should automatically search the whole file if there are no previous results and the checkbox is checked.
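
The three bits can be decoded as in this sketch (the function and key names are made up for illustration):

```python
def decode_search_properties(value: int) -> dict:
    """Decode the lowest 3 bits of the AdvSearchUI/Properties DWORD."""
    return {
        "match_similar_words": bool(value & 0x1),      # bit 0
        "search_titles_only": bool(value & 0x2),       # bit 1
        "search_previous_results": bool(value & 0x4),  # bit 2
    }
```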

/Path/file.chm/Bookmarks/v1/Count

A DWORD indicating the number of favourites stored for the \Path\file.chm CHM file.

/Path/file.chm/Bookmarks/v1/n/Topic

An NT UTF-16 string showing the topic name of bookmark number n (n is zero based).

/Path/file.chm/Bookmarks/v1/n/Url

An NT UTF-16 string showing the URL of bookmark number n (n is zero based). It is a fully qualified path into the \Path\file.chm CHM file.


Please let us know if you find any other internal files or figure out formats of any internal files.