doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables – extract printable text from Microsoft documents

doc2txt [ file.doc ]
doc2ps [ file.doc ]
wdoc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings mtpt/WordDocument
aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ] [ -w worksheet-range ] mtpt/Workbook

Doc2txt is an rc(1) script that uses olefs and mswordstrings to extract the printable text from the body of a Microsoft Word document and write it on the standard output. Doc2ps is similar, but emits PostScript corresponding to the document. Wdoc2txt is similar to doc2txt, but uses plumb(1) to send the output to a new acme(1) window instead. Xls2txt performs a similar function for Microsoft Excel documents.

Microsoft Office documents are stored in OLE (Object Linking and Embedding) format, which is a scaled-down version of Microsoft's FAT file system. Olefs presents the contents of an MS Office document as a file system on mtpt, which defaults to /mnt/doc. Mswordstrings or msexceltables may then be used to parse the files inside, extracting a text stream. Msexceltables accepts the following options to control the formatting of its output.
-a         Attempt conversion of non-tabular sheets in the workbook (charts).
-d delim   Set the inter-field delimiter to the string delim; the default is a single space.
-D         Enable debugging output.
-c range   Limit processing to the named columns; by default all columns are output. Range is a comma-separated list of column numbers and ranges, where a range is two numbers separated by a dash.
-n         Disable padding of fields to the column width.
-q         Disable quoting of textual fields (see quote(2)).
-t         Truncate fields to the column width.
-w range   Limit output to the named worksheets. Range is a comma-separated list of worksheet numbers and ranges, using the same syntax as the -c option above. Suppressed chart pages are always included in the sheet count.
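
The range syntax accepted by the column and worksheet options can be illustrated with a short Python sketch. It is not the code msexceltables uses; the function name parse_ranges is hypothetical, and rejecting a reversed range such as 14-9 is an assumption, since this page does not say how such input is handled or whether numbering starts at 0 or 1.

```python
def parse_ranges(spec):
    """Expand a list such as '1,7,9-14' into a sorted list of integers.

    Illustrative only: a range is taken to be two numbers joined by a
    dash, and both ends are included.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray comma (an assumption)
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                # Assumption: treat a reversed range as an error.
                raise ValueError("reversed range: " + part)
            selected.update(range(lo, hi + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


print(parse_ranges("1,7,9-14"))  # prints [1, 7, 9, 10, 11, 12, 13, 14]
```

Overlapping entries collapse into one, so '3-5,4' selects the same items as '3-5'.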

Extract pieces of an MS Excel spreadsheet.
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
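
Olefs can only serve files that are OLE containers, so it may be useful to check a file before mounting it as in the example above. The Python sketch below tests for the eight-byte signature that begins an OLE compound file. The signature value comes from the published compound file format rather than from this page, the function names are hypothetical, and nothing here reflects how olefs itself validates its input.

```python
# Signature found at the start of an OLE compound file.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole(data):
    """Report whether a byte string begins with the OLE signature."""
    return data[:8] == OLE_MAGIC


def file_looks_like_ole(path):
    """Read the first eight bytes of path and test them."""
    with open(path, "rb") as f:
        return looks_like_ole(f.read(8))
```

A file that fails this test, such as a newer zip-based .docx or .xlsx document, is not in the format described here.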

/rc/bin              doc2txt, doc2ps, wdoc2txt, and xls2txt
the others

``Microsoft Word 97 Binary File Format'', at Microsoft's developer (MSDN) home page.
``LAOLA Binary Structures'', http://user.cs.tu–
``OpenOffice.Org's Excel Documentation'',