Sorry for the crass
commercialism, but I noticed that you had not updated the information on this
page to reflect the latest version of Microsoft Office. Unlike the previous two (I must admit,
half-hearted) attempts at saving documents in HTML format, the (teeming
millions of) Office developers went all out to make HTML a full-fledged document
format and not just a poor cousin to the old binary formats.
If you just want really
simple HTML, you might be better off using FrontPage or Visual Notepad, but if
you have existing documents in Word, Excel or PowerPoint, I think you'll find
surprisingly good fidelity to the look of the original document and a much
better use of HTML tags to represent that look.
By the way, the HTML for this comment was created by Word2000.
Every time I create an HTML file using the built-in HTML "converters" in MS Office97, I need to open the .htm file (another annoyance courtesy of Microsoft) in NoteTab Light and strip out all of the extra junk in the file, e.g. the STYLE="vnd.ms-excel.numberformat:$#,##0" attribute added for every special number/date format produced by Excel or Access, not to mention all of the Font Face tags. By manually stripping the extra labels, I am able to get rid of several thousand bytes of extra ASCII garbage that does not add anything to my HTML. It's an extra step, but since I know that most people who look at my files are using dial-up connections from home, it is the least that I can do.
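The manual stripping described above could be scripted. The following is only a minimal Python sketch, assuming the junk takes the two forms mentioned in the comment (a double-quoted STYLE attribute starting with "vnd.ms-excel." and FONT tags); the function name and sample markup are made up for illustration, and real Office97 output may need more patterns.

```python
import re

# Assumption: the Office-specific attribute is double-quoted and its value
# starts with "vnd.ms-excel." (as in the example quoted in the comment).
MS_STYLE_ATTR = re.compile(r'\s+style\s*=\s*"vnd\.ms-excel\.[^"]*"', re.IGNORECASE)

# Opening and closing FONT tags; the text between them is kept.
FONT_TAG = re.compile(r'</?font\b[^>]*>', re.IGNORECASE)


def strip_office_junk(html):
    """Remove Excel number-format STYLE attributes and FONT tags."""
    html = MS_STYLE_ATTR.sub('', html)
    return FONT_TAG.sub('', html)


# Illustrative input, not taken from a real export.
sample = '<TD STYLE="vnd.ms-excel.numberformat:$#,##0"><FONT FACE="Arial">1,234</FONT></TD>'
cleaned = strip_office_junk(sample)
saved = len(sample) - len(cleaned)  # characters removed from this fragment
```

Regular expressions are a blunt tool for HTML, so it is worth spot-checking the output on a few files before trusting it on a whole site.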
-- John Fracisco, October 29, 1999
Regarding the "Try Office 2000" comment above: take a look at the HTML source this thing produced. First, an embedded stylesheet about 30 lines long, completely specifying margins, fonts, etc. for each paragraph class. Each style used several non-standard attributes and values. Then, for each paragraph, it added a <span style=''> wrapper to override the styles in the class. Finally, it added some scripting, apparently for the hell of it.
So the end answer is no, the output of Office 2000 is no better than any previous MS effort, and is probably worse. (The explanation is that they're using HTML as an actual complete file format -- a replacement for .doc. They're using stylesheets as the equivalent of templates. Because there are lots of things in MS formatting that HTML doesn't support, they have to use lots of non-standard extensions. As usual, MS has totally missed the point: HTML is *not* a formatting language, it's a semantic markup language.)
-- Steve Greenland, January 24, 2000
For mass find-and-replace, I enjoy the Allaire products HomeSite and ColdFusion Studio. They let you run find-and-replace on whole folders, which makes changing MS HTML into legible/legal HTML a little easier. I've noticed MS HTML does sneaky stuff like incorrect nesting that works with IE but breaks Netscape (subtle browser war tactic???). The Allaire products also have a "code sweeper" function which you can set to do things like "strip the font tag" or "strip ending P tags". You can customize the handling of all tags with code sweeper. There is also a validate function that will point out the nesting errors. It's handy and so far my favorite editor. It has a WYSIWYG thing too, but it's not that great and I never use it. The 4.0 versions of these programs run great on Windows98/NT, but I would be careful with CF Studio 4.5. I had strange memory problems with it.
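For anyone without those editors, the folder-wide idea can be approximated in a few lines of Python. This is a sketch only, not a reimplementation of Allaire's code sweeper: the rule list and the sweep_folder name are hypothetical, and the two rules simply mirror the "strip the font tag" and "strip ending P tags" examples above.

```python
import re
from pathlib import Path

# Hypothetical cleanup rules, applied in order. Byte patterns are used so the
# files are rewritten without any assumption about their character encoding.
RULES = [
    (re.compile(rb'</?font\b[^>]*>', re.IGNORECASE), b''),  # "strip the font tag"
    (re.compile(rb'</p\s*>', re.IGNORECASE), b''),          # "strip ending P tags"
]


def sweep_folder(folder, pattern='*.htm*'):
    """Apply RULES to every matching file under folder, recursively.

    Returns the number of files that were actually modified.
    """
    changed = 0
    for path in Path(folder).rglob(pattern):
        if not path.is_file():
            continue
        original = path.read_bytes()
        cleaned = original
        for regex, replacement in RULES:
            cleaned = regex.sub(replacement, cleaned)
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Because this rewrites files in place, run it on a copy of the folder first.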
-- Phillip Harrington, January 25, 2000
I have recently started converting books for the Web, and luckily these books were produced in PageMaker 6.5. I had never used the HTML export from PageMaker, because generally I start a project in Dreamweaver. The HTML export from PageMaker essentially takes a PageMaker style and lets you map that style to H1, P, etc. There are some funky problems with font colors, but a search and replace in Dreamweaver is fairly quick. {shameless plug} you can see how it finally turned out at my Georgia Coast book {/shameless plug} If you have any specific questions about how to use PageMaker you can mail me directly.
-- John Lenz, April 25, 2000
I would just like to add to the Word 2000 "thread". I am maintaining a site where the principal content producer uses Word. I get emailed the docs, then have to convert them to HTML. At first, I was just cutting and pasting into GoLive 5, and manually editing for lists and breaks etc. I was getting tired of this, and thought I would try the save as HTML option. I was completely amazed at the amount of rubbish in the resulting file, xml this, namespace that. I've gone back to cut'n'paste!
-- Mark Horrocks, November 23, 2000
MSWordView is now wvWare.
Having tried Office 2000 and Office 2001 (Mac) conversion to HTML and seen the awful results, the choice of a un*x (including Mac OS X) converter is really cool.
-- Bob Kerstetter, January 11, 2001
Hi, regarding Microsoft 2000 and its inability to understand what the words "tidy code" mean. I used MS Word 2000 when our site needed to create an intranet and it was a total nightmare. It kept creating all these files and folders on the server that took up space and loading time. It created a file folder for every single page of HTML, which was just plain ignorant if you ask me. I persisted with this until my manager let me buy FrontPage 2000, which is still a little bit messy but better value. I feel that Word is okay for quick, simple pages that aren't going to need much maintenance. FrontPage is very well integrated with all the other MS packages but does tend to spit out some garbage in the form of FPDB includes etc. and doesn't tend to like files created in other Web design applications, having its own tilted view on the world. If you want a dynamic database-driven website, then FrontPage is great for the novice who hasn't got time to learn ASP, JavaScript etc. It's a good training ground to pick all that up. Thanks
-- Hazera Bibi, January 18, 2001
Try downloading this utility from the Microsoft web site. It's saved me hours of re-entering line breaks after I've copied and pasted Word text into Dreamweaver. It's a real life-saver for webmasters.
Yes - it really does work!!
Word 2000 'crud' HTML filter: http://office.microsoft.com/downloads/2000/Msohtmf2.aspx
ruth arnold
www.spacehoppa.com
-- Ruth Arnold, May 30, 2001
I can't believe no one mentioned the fact that Dreamweaver (4.0 at least) has a function that will import Word-generated HTML and clean it up for you. It does a fabulous job and allows you to pick and choose how severe you want the cleanup to be.
-- kim simms, June 21, 2001
Another useful tool is Dave Raggett's Tidy program. It can be found at http://www.w3.org/People/Raggett/tidy/. It will clean up your HTML, and has numerous options so you can customize how it formats (or cleans) the HTML. It's been ported to most OSs, and the source code is available if you want to modify how it works. Since it's a command-line program, it can be hooked into any decent editor -- that is, ones that allow you to run programs and capture the output.
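As a rough illustration of hooking Tidy into another program, here is a Python sketch that builds a Tidy command line and runs it. The function names are made up, and the option choices (-quiet, -clean, --word-2000 yes, -output) are an assumption about what suits Word output; check them against the documentation for your Tidy build, since option names have varied between releases.

```python
import shutil
import subprocess


def tidy_command(source, dest):
    """Build a Tidy command line for cleaning Word-generated HTML."""
    return [
        'tidy',
        '-quiet',              # suppress non-essential output
        '-clean',              # replace presentational markup with style rules
        '--word-2000', 'yes',  # ask Tidy to strip surplus Word 2000 markup
        '-output', dest,       # write the result here instead of stdout
        source,
    ]


def run_tidy(source, dest):
    """Run Tidy and return its exit status."""
    if shutil.which('tidy') is None:
        raise FileNotFoundError('tidy executable not found on PATH')
    # Tidy conventionally exits with 0 for clean input, 1 if there were
    # warnings and 2 if there were errors, so a non-zero status is not
    # necessarily a failure.
    return subprocess.run(tidy_command(source, dest)).returncode
```

An editor macro or build script can call run_tidy("report.htm", "report-clean.htm") (file names here are placeholders) and then open the cleaned file.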
-- David Wall, August 15, 2001
Well, I was searching the web for conversion projects from MS-generated HTML files to pure HTML tagged files... and I landed on this page and found a lot of useful info. I have developed a Java/JSP/JavaScript/HTML-based web-enabled application to do the job of converting .txt files to .htm files. It gives the end user a choice of operations paragraph by paragraph, and the processed paragraphs are then written by the JSP with tags to the .htm file. I found the speed of conversion to be about 45 to 50 files an hour! For this you have to save every MS-HTML file as .txt and then give it to my program as input. I recently converted about 1200 content files for a German website.
Anyone interested in offloading projects or additional info? Please do write to me at emmanuel_chris@rediffmail.com
-- Benjamin Christopher, November 5, 2001
Yes, Dreamweaver has a nice utility for cleaning the HTML code generated by Microsoft Word. It also lets you choose how aggressive the cleaning should be, and it works satisfactorily for the Word 97 HTML code. Unfortunately, even with the strongest cleaning, it is not able to get rid of the <span style = ...> definition which Word 2000 puts at the beginning of every sentence. If you import a Word 2000 HTML file, you will not be able to change the font of the document except by editing the HTML source, line by line... I'll give the Office 2000 HTML filter 2.0 (by Microsoft) a try, hoping it works!
-- Luca Bonci, February 21, 2002
I have found that eWebEditPro from Ektron (www.ektron.com) does a pretty good job of cleaning up Word 2000. It will produce xhtml output. It's not perfect, though - maybe about the same as Dreamweaver 4 but I haven't tested the difference. I'd like to find something that strips off all the font styles and leaves layout structure in place.
-- Andy Harrison, May 15, 2002
After struggling with the crud you get out of Word, even if you copy and paste into an HTML-friendly editor, I came up with this approach using Ant 1.5's very nice ReplaceRegExp task:

    <target name="strip-test">
      <!-- Remove closing FONT tags -->
      <replaceregexp flags="g" match="&lt;/FONT&gt;" replace="">
        <fileset dir="${publish.dir}"/>
      </replaceregexp>
      <!-- Remove opening FONT tags; [^&gt;]* stops at the first closing bracket -->
      <replaceregexp flags="g" match="&lt;FONT[^&gt;]*&gt;" replace="">
        <fileset dir="${publish.dir}"/>
      </replaceregexp>
      <!-- Reduce P tags carrying a class attribute to a bare P -->
      <replaceregexp flags="g" match="&lt;P class=[^&gt;]*&gt;" replace="&lt;P&gt;">
        <fileset dir="${publish.dir}"/>
      </replaceregexp>
    </target>
This effectively strips out the offensive font, style and non-standard <o:p> tag. There may be a way to optimize this by combining the replace expressions into a set of nested expressions, and it could easily be extended to strip out other junk. It's nicely speedy and easy to add to an Ant script for processing directories recursively leveraging Ant's <fileset> tag.
Ant is really wonderful if you're not familiar with it. Hope this is helpful to someone out there trying to clean their MS Word junk.
-- Daniel Seltzer, July 23, 2002
Lots of wonderful information here guys, thanks. Another utility to try is:
http://www.textism.com/resources/cleanwordhtml/
Using the *Office 2000 HTML Filter 2.0* from Microsoft and the page above does a good job of cleaning out Microsoft's, um, inaccuracies.
Now, if I can only teach my users not to use all caps . . .
-- Grey Gremlin, July 25, 2002
I have a vaguely db-backed personal web site to which I had to add some MS Word documents. The easiest solution I found is to open the Word file in OpenOffice Writer (http://www.openoffice.org), save as HTML, then edit the file in Emacs. OpenOffice does generate a load of crap in the HTML file, but it's nowhere near as bad as Word 2000. Most of my work is done by a rather ugly Emacs macro (which should really be a Lisp procedure, but that will have to wait until I actually learn Lisp) that uses replace-regexp on a couple of tags, namely:
- delete SPAN ("</?SPAN[^>]*>" -> "")
- remove attributes from p and h tags ("<H1[^>]*>" -> "<h1>")
The macro also adds calls to my header and footer scripts and edits the header to use external CSS stylesheets.
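For readers who don't use Emacs, the two replacements listed above translate fairly directly to Python. This is a sketch under assumptions, not the author's macro: the clean_export name is made up, the heading rule is extended from H1 to H1 through H6 plus P, closing tags are lowercased too, and the header/footer and stylesheet steps are left out.

```python
import re

# Delete opening and closing SPAN tags (the first rule in the list above).
SPAN_TAG = re.compile(r'</?span\b[^>]*>', re.IGNORECASE)

# Opening or closing P and H1-H6 tags, with or without attributes.
# Group 1 is the optional slash, group 2 the tag name.
BLOCK_TAG = re.compile(r'<(/?)(p|h[1-6])\b[^>]*>', re.IGNORECASE)


def clean_export(html):
    """Drop SPAN wrappers and strip attributes from P and heading tags."""
    html = SPAN_TAG.sub('', html)
    return BLOCK_TAG.sub(
        lambda m: '<' + m.group(1) + m.group(2).lower() + '>', html
    )


# Illustrative input in the style of an office-suite export.
sample = '<H1 CLASS="western" STYLE="margin-top: 0in"><SPAN LANG="en-US">Title</SPAN></H1>'
result = clean_export(sample)
```

The \b after the tag name keeps the pattern from touching tags such as PRE or PARAM that merely start with the same letter.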
Overall it works rather well and I can get .doc files up really fast. I still have to correct a few things by hand but with something more involved than my macro (by a better programmer) I think that wouldn't even be necessary.
By the way, OpenOffice (the free version of StarOffice 6) is quite good. It does almost everything MS Office does and it's free. And the equation editor is *way* better.
-- Serge Boucher, December 30, 2002
I am a farmer, not a computer expert, but here we are in 2003 and it seems even farms need web pages. I can accept that. I traded some beets for some web development work, and even got a crash course on using Dreamweaver on a Mac. Wow, could it be, a computer that actually works! Looks complicated though. Next I needed to change something and thought it would be a simple matter to open an HTML document in MS Word (the latest and greatest, in a university PC lab), make my changes and save as HTML. Word changed everything around so the images wouldn't display correctly on a Mac, and now I am wasting my time wading through code I don't understand. Sorry Mr. Gates, it looks like you still don't get it.
-- mac burgess, February 6, 2003
I'm sorry if someone has said this already, but there were a lot of responses and someone may have missed it. UK legislation to be introduced in about a year's time will be very harsh on company websites that do not offer adequate accessibility options (i.e. allowing users to change font, font size and background colour) to help people with learning or reading difficulties. When Word creates HTML it seems to put so many tags in that a lot of these facilities will not work. This is perhaps a consideration if you are using Word for a vaguely commercial site, as the UK gov have claimed they will aggressively enforce this law.
-- Nathan Mcilree, July 1, 2003
We have found that the HTML exported from an MS Word document by OpenOffice 1.1 is much cleaner than the Microsoft version. In particular it uses relative font sizes rather than the idiotic point-sized fonts, so the user's screen preferences are honoured. The output file is also significantly smaller than the equivalent Word export (or indeed the original Word file).
-- Andrew Macpherson, March 18, 2004
Also there is a useful plain-vanilla utility called antiword (which does a nice job of just grabbing the text), useful for creating indexes and the like (I personally use it for Plone, an open source CMS). The other utility around is wvWare, but this has a lot of cascaded dependencies so it is difficult to compile (build from scratch, as there is no off-the-shelf version) on some systems.
Good luck, and I wish you well extracting your intellectual property from M$ proprietary format !
-- stu hannay, August 9, 2004
Hi All, If you are in education and wish to convert your MS Word material into "clean" "section 508" compliant HTML, then look no further than a rather useful plugin for Word. "CourseGenie" from http://www.coursegenie.com is an excellent tool and is growing in its use by academics who do not want to be webmasters. A 1 minute demo of CourseGenie is at: http://www.coursegenie.com/demos. It's been tipped as the next generation tool. All the Best, Michael Bailey.
-- Mike Bailey, October 12, 2004
There is a CMS called PHPWebSite which is open source. In a settings file, you can specify which HTML tags to strip from input (for example, when pasting from Word into a textarea for creating content). I disallow the (P) tag. I have customised this functionality to do the following: before stripping HTML tags, replace the (/P) and (/p) tags with (BR /).
Then there is a posting on www.php.net for the strip_tags function, in which a comment talks about a function which can strip attributes from specific tags. (The function allows you to specify an array of attributes not to strip - but that part doesn't seem to work - it will, however, strip all attributes).
Before stripping the tags, I then strip all the attributes for all the allowed tags except for anchor tags (A) and any tags to do with tables (table), (tr), (td), (th), etc.
The result is that only the tags I allow are kept, all paragraphs are converted to (BR) tags (and because Word seems to insert an empty paragraph between each 'real' paragraph, this works out ok), formatting for tables is retained, things like bold and italic text are retained, but all the garbage is thrown out.
Actually, on top of that we use a wysiwyg editor in the CMS, so if there is anything else that needs fixing (eg centering text) - it's pretty quick.
The flipside is that you don't have (P) tags.
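The pipeline described above can be sketched in Python rather than PHP. This is not the PHPWebSite code or the php.net function mentioned in the comment; it is an illustration built on the standard html.parser module, and the ALLOWED tag list and all names are assumptions chosen to match the description (P dropped, attributes kept only on anchors and table tags).

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tags that survive the cleanup. P is deliberately absent.
ALLOWED = {'a', 'b', 'i', 'strong', 'em', 'br', 'ul', 'ol', 'li',
           'table', 'tr', 'td', 'th'}
# Tags whose attributes are kept; every other allowed tag loses its attributes.
KEEP_ATTRS = {'a', 'table', 'tr', 'td', 'th'}
# Tags that never take a closing tag.
VOID = {'br'}


class WordPasteFilter(HTMLParser):
    """Re-emit only allowed tags, stripping attributes where required."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return
        kept = attrs if tag in KEEP_ATTRS else []
        rendered = ''.join(
            ' {}="{}"'.format(name, escape(value or '', quote=True))
            for name, value in kept
        )
        suffix = ' />' if tag in VOID else '>'
        self.out.append('<' + tag + rendered + suffix)

    def handle_endtag(self, tag):
        if tag in ALLOWED and tag not in VOID:
            self.out.append('</' + tag + '>')

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape text.
        self.out.append(escape(data, quote=False))


def clean_paste(html):
    # Step 1: turn paragraph ends into line breaks before P tags are dropped.
    html = re.sub(r'</p\s*>', '<br />', html, flags=re.IGNORECASE)
    # Steps 2 and 3: strip attributes and discard tags outside the allowlist.
    parser = WordPasteFilter()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

For instance, a pasted paragraph with a class attribute, a styled bold run, a link and an empty o:p element comes out with only the bold tags, the link with its href, and a trailing line break.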
If anyone wants info on the exact code we're using, youcantryreachingme AT NOSPAM hotmail DOT com .
-- Chris Notdisclosing, April 22, 2005
I don't want to turn this site into a single-topic forum, but the trial Malkue product doesn't seem to work very well in IE. It passes W3C validation but there's an unmatched <title /> tag in the code which seems to cause IE to go blank. When removed, the translated text appears in a single unbroken line... All day spent trying to find a Word to HTML converter which both passes W3C validation and is also legible *sigh* (tried YAWC too, it crashes my Word97)
-- Gaylord Lussac, March 26, 2006
For a few years now I've been developing http://Docvert.org which lets you convert Microsoft Word into standards compliant (x)HTML. It lets you control every tag and attribute, or to just use several of the inbuilt templates for conversion to clean HTML. It's free and open source software, and you can install it for shared use by a whole office of people. It's even got a Microsoft Office toolbar that'll let you convert documents.
-- Matthew Holloway, May 2, 2008
- Use Gmail to Convert Word Docs to HTML
If you have a MS Word doc that you want to convert to HTML, the last thing you'd ever use is the "Save as Web Page..." command in Word. Instead, you can send the attachment to your Gmail account and use the "View as HTML" link. Once the page is displayed in your browser, go to "View Source" and copy the code. Most of it is very clean and quite useable.
FROM: oreillynet.com 2006: use_gmail_to_convert_word_docs
I tried GMail on one document, it said: The attachment cannot be viewed as HTML. Download the attachment to view it in its original format. :)
- I also tried
HTMLTIDY -raw -clean -omit infile.html > outfile.html
It decreased the file size from 690 kb to 580 kb, but did not clean useless tags.
- I did not try them, but a friend suggested three doc2html-style services, which do not convert images:
http://gdsland.com/excel2html/home.php -- excel to html
http://www.zamzar.com/
http://media-convert.com/konvertieren/
-- Evgenii Philippov, October 23, 2008