Convert Microsoft Word documents to HTML
ModelText's Doc to HTML converter is a utility that converts Microsoft Word documents to clean XHTML.
The conversion discards all style and font information, so you can republish a Word document as a web page by adding your own CSS.
The output from the converter is in XHTML format. The following elements of the document are preserved:
The following is an example of the output from the converter, from a simple document:
The converter works with the Word 2003 XML Document format. Documents that already have this format can be converted without Microsoft Word being present. The converter knows the Word 2003 XML schema (which is called "wordml"), opens the file using a generic XML reader, extracts elements and text, and saves them as XHTML.
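The approach described above (read the Word 2003 XML with a generic XML reader, extract elements and text, write XHTML) can be illustrated with a minimal sketch. This is not the converter's own code, which is closed source. It assumes only the published Word 2003 XML namespace and the w:p, w:r and w:t elements, and it handles nothing except plain paragraphs. The function name wordml_to_xhtml and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace of the Word 2003 XML ("wordml") schema.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def wordml_to_xhtml(wordml: str) -> str:
    """Return a minimal XHTML page with one <p> per non-empty Word paragraph.

    Illustrative only: run properties (fonts, bold, and so on) are ignored,
    and tables, lists, headings and images are not handled.
    """
    root = ET.fromstring(wordml)
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is spread over runs (w:r), each holding w:t elements.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text.strip():
            paragraphs.append("<p>" + escape(text) + "</p>")
    body = "\n".join(paragraphs)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><title>Converted</title></head>\n"
        "<body>\n" + body + "\n</body>\n</html>"
    )


# Hypothetical sample input in the Word 2003 XML format.
SAMPLE = """<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>
    <w:p><w:r><w:t>Fish &amp; chips</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(wordml_to_xhtml(SAMPLE))
```

In this sketch the bold run property (w:rPr) is simply dropped, which mirrors the idea of discarding style and font information while keeping the text.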
The converter can also work with any of the other Word document formats. To do so, it uses Microsoft Word 2007 to open the document and save it again as a temporary file in the Word 2003 XML Document format. Working with these formats therefore requires Microsoft Office 2007 to be installed on the machine; Office is not required if the documents are already in the Word 2003 XML Document format.
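The utility is a console application named tidywordcmd.exe.
It requires two parameters: the name of the input file and the name of the output file.
The following is an example of the command line, which converts an input file named test.docx to an output file named test.html: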
tidywordcmd.exe test.docx test.html
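If you need to convert several files, the command line above can be driven from a script. The following is a minimal sketch, assuming tidywordcmd.exe is on the PATH or in the working directory. The helper names build_command and convert, and the default of reusing the input name with an .html extension, are hypothetical conveniences and not part of the utility. The meaning of the utility's exit code is not documented here, so the sketch only returns it.

```python
import subprocess
from pathlib import Path

# Assumed to be on PATH or in the current working directory.
CONVERTER = "tidywordcmd.exe"


def build_command(input_path, output_path=None):
    """Return the argument list: executable, input file, output file."""
    src = Path(input_path)
    # Hypothetical default: same file name with an .html extension.
    dst = Path(output_path) if output_path else src.with_suffix(".html")
    return [CONVERTER, str(src), str(dst)]


def convert(input_path, output_path=None):
    """Run the converter once and return its exit code (meaning undocumented)."""
    return subprocess.run(build_command(input_path, output_path)).returncode
```

For example, build_command("test.docx", "test.html") produces the same three arguments as the command line shown above, and calling convert on each file in a folder converts them one at a time.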
License to use the ModelText Doc to HTML Converter (version 1.1) is as follows.
Copyright 2008-2012, Christopher Wells <firstname.lastname@example.org> ("Licensor")
Permission to use, copy, and/or distribute this software for any purpose with or without fee is hereby granted to you, provided that you accept all the terms of this license.
You may copy and distribute this software to other parties ("third parties"), provided that the above copyright notice and this permission notice appear in all copies, and that third parties are bound by the terms of this license.
This is closed source, proprietary software. The software's source code (except for some sample code) has not been released. Although permission is hereby granted to write software which uses this software component, and to use this software as a component within other software, permission is not granted to modify this software component, nor to use nor to distribute modified copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE LICENSOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE LICENSOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
After you download the zip file which contains the release, you can simply run the tidywordcmd.exe executable using the command-line parameters specified above.
The tidywordcmd.exe executable depends on version 2.0 (or greater) of the .NET Framework, which must be installed before you run the utility (it is probably installed on your machine already).
If you want to convert Word documents that are not already in the Word 2003 XML Document format, then you should also:
Run PrimaryInteropAssemblies.exe (included in the download) to install the interop assemblies, which allow the Doc to HTML Converter to invoke Microsoft Word.
Please post suggestions, and any bug reports and support issues, to the ModelText discussion group.
You can also contact the author by sending email to email@example.com.
© 2009-2012, Christopher Wells. All rights reserved.