Office Formats
Published by marco on
Microsoft recently released documentation for their binary Office formats in both PDF and their own XPS format. The PDF for Word weighs in at 2.8MB and has 210 pages. “Why are the Microsoft Office file formats so complicated?” by Joel Spolsky provides a lot of good reasons for that complexity (most rooted in history), like speed, the difficulty of the task, formats that were purely internal (until now), etc.
Where Spolsky veers off the path (and he almost always does) is in reaching a bit too far with his “workarounds”. Instead of trying to load the binary formats yourself, he suggests simply launching Word or Excel as a COM object “directly, even from ASP or ASP.NET code running under IIS”. The caveat comes only later and tells only of a “few gotchas”, like it “not [being] officially supported by Microsoft”. He includes a link to a knowledge base article which uses a lot of words to say the equivalent of “for the love of the sweet baby Jesus, don’t do this.” Clearly, Spolsky was so enchanted by his prose and clever examples that he didn’t think Microsoft explicitly countermanding his idea was enough reason not to publish it to the world[1].
His advice to hide this type of solution behind a web service for Linux servers actually goes for ASP.NET servers as well. If you need to read the Office format (or generate it), there are libraries that do this without using Office itself. The POI Java library from Apache works quite well for generating Excel and Word documents. If you’re using .NET, you can hide the POI library behind a web service and call that instead. Even a Tomcat server to run a little web service won’t weigh more than running Office on a Windows 2003 Server. If you do have to run a Windows 2003 Server in a Linux environment, consider running it in a virtual machine under Xen or some other virtualization solution.
Some of the other suggestions also indicate that Spolsky was just trying to fill out his bullet lists, like “[o]pening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells”—that sounds like the kind of stuff you could just write in .NET or Java directly, no?[2] Or what about “[u]sing Excel to generate charts in GIF format”—there are libraries for that, aren’t there? Do you really have to consider automation in a server process (including a likely bottlenecking nightmare) just to generate a chart?
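To make the “just write it directly” point concrete, here is a minimal sketch in Python (the same idea carries over to .NET or Java). It assumes the workbook in question does nothing more exotic than a fixed-rate loan payment; the scenario and the name `monthly_payment` are made up for illustration and do not come from Spolsky's article.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment on a loan, computed with the standard annuity formula.

    Hypothetical stand-in for "put inputs in cells, recalculate, read the output cell".
    """
    if months <= 0:
        raise ValueError("months must be positive")
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # No interest: the principal is simply split evenly across the payments.
        return principal / months
    # payment = P * r / (1 - (1 + r)^-n)
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)


if __name__ == "__main__":
    # A 100,000 loan at 6% per year over 30 years.
    print(round(monthly_payment(100_000, 0.06, 360), 2))
```

No spreadsheet, no COM automation, and nothing to bottleneck on in a server process. If the real workbook has hundreds of interdependent formulas, the trade-off obviously looks different.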
Happily, he closes strongly with good suggestions for generating the least complex format that fulfills the task, such as using RTF for formatted documents (it’s a text format, reasonably legible and well documented) or CSV for simple Excel data.
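As a rough illustration of how little code the “least complex format” route needs, here is a sketch in Python using only the standard library. The function names (`rows_to_csv`, `escape_rtf`, `paragraphs_to_rtf`) are invented for this example, and the RTF part is deliberately minimal: it assumes plain ASCII text and supports nothing beyond bold paragraphs.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows (lists of values) to CSV text that Excel can open."""
    buffer = io.StringIO(newline="")
    # The default "excel" dialect is comma-separated with \r\n line endings
    # and quotes only the fields that need it.
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


def escape_rtf(text):
    """Escape the three characters that have special meaning in RTF."""
    # Backslashes go first so the escapes added for braces are not doubled.
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def paragraphs_to_rtf(paragraphs):
    """Build a minimal RTF document from (text, is_bold) pairs.

    Assumes ASCII text; non-ASCII characters would need RTF's Unicode escapes.
    """
    parts = [r"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}"]
    for text, is_bold in paragraphs:
        body = escape_rtf(text)
        if is_bold:
            body = r"{\b " + body + "}"  # a group limits bold to this run
        parts.append(body + r"\par")
    parts.append("}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(rows_to_csv([["name", "note"], ["Ann", 'says "hi", twice']]))
    print(paragraphs_to_rtf([("Report", True), ("Nothing to see here.", False)]))
```

Saving the first result with a `.csv` extension and the second with `.rtf` should be enough for Excel and Word to open them, which is the whole appeal: the output stays legible in a text editor and no copy of Office is needed on the server.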
In the end, the formats for the Office applications are published. This is what Microsoft deals with in their Office products, so there’s no use complaining that they’re too complicated. They are what they are, and most people should be able to avoid having to deal with them, unless they do something silly like joining the Office development team in Redmond.