
Office Formats

Published by marco on

Microsoft recently released documentation for their binary Office formats in both PDF and their own XPS format. The PDF for Word weighs in at 2.8MB and has 210 pages. Joel Spolsky’s article “Why are the Microsoft Office file formats so complicated?” gives a lot of good reasons for the complexity, most of them rooted in history: speed, the sheer difficulty of the task, formats that were purely internal (until now), and so on.

Where Spolsky veers off the path (and he almost always does) is in reaching a bit too far with his “workarounds”. Instead of trying to load the binary formats yourself, he suggests simply launching Word or Excel as a COM object “directly, even from ASP or ASP.NET code running under IIS”. The caveat comes only later and tells only of a “few gotchas”, like it “not [being] officially supported by Microsoft”. He includes a link to a knowledge base article which uses a lot of words to say the equivalent of “for the love of the sweet baby Jesus, don’t do this.” Clearly, Spolsky was so enchanted by his prose and clever examples that he didn’t think Microsoft explicitly countermanding his idea was enough reason not to publish it to the world[1].

His advice to hide this type of solution behind a web service for Linux servers actually goes for ASP.NET servers as well. If you need to read the Office formats (or generate them), there are libraries that do this without using Office itself. The POI Java library from Apache works quite well for generating Excel and Word documents. If you’re using .NET, you can hide the POI library behind a web service and call that instead. Even a Tomcat server to run a little web service won’t weigh more than running Office on a Windows 2003 Server. If you do have to run a Windows 2003 Server in a Linux environment, consider running it in a virtual machine under Xen or some other virtualization solution.

Some of the other suggestions also indicate that Spolsky was just trying to fill out his bullet lists, like “[o]pening an Excel workbook, storing some data in input cells, recalculating, and pulling some results out of output cells”—that sounds like the kind of stuff you could just write in .NET or Java directly, no?[2] Or what about “[u]sing Excel to generate charts in GIF format”—there are libraries for that, aren’t there? Do you really have to consider automation in a server process (including a likely bottlenecking nightmare) just to generate a chart?
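To make that concrete: the “input cells, recalculate, output cells” workflow usually boils down to a formula you could write yourself. Here is a minimal sketch in Python, assuming the workbook in question computes a fixed-rate loan payment. The scenario, the function name monthly_payment and its parameters are all made up for illustration; nothing here comes from Spolsky’s article.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed-rate loan payment, i.e. what a small spreadsheet model might compute.

    Hypothetical stand-in for a workbook with three input cells
    (principal, annual rate, term in years) and one output cell.
    """
    periods = years * 12
    rate = annual_rate / 12  # periodic (monthly) interest rate
    if rate == 0:
        # No interest: just spread the principal evenly over the term.
        return principal / periods
    # Standard annuity formula: P * r / (1 - (1 + r)^-n)
    return principal * rate / (1 - (1 + rate) ** -periods)


if __name__ == "__main__":
    # "Store inputs, recalculate, read the output" without any spreadsheet.
    print(f"{monthly_payment(200_000, 0.06, 30):.2f}")
```

Whether this is practical obviously depends on how hairy the real workbook is, but for a handful of formulas it avoids the server-side automation problem entirely.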

Happily, he closes strongly with good suggestions for generating the least complex format that fulfills the task, such as using RTF for formatted documents (it’s a reasonably legible, well-documented text format) or CSV for simple Excel data.
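Both of those simple formats can be produced with a few lines of code and no Office installation. The following is a rough sketch in Python using only the standard library. The helper names (make_csv, rtf_escape, make_rtf), the font and the sizes are my own choices for illustration, and the RTF part covers only the bare minimum (one font, a bold title, plain paragraphs), not the full specification.

```python
import csv
import io
import struct


def make_csv(rows) -> str:
    """Serialize rows (lists of values) to CSV text that Excel can open."""
    buffer = io.StringIO()
    # The csv module quotes fields containing commas, quotes or newlines.
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def rtf_escape(text: str) -> str:
    """Escape text for inclusion in an RTF document body."""
    pieces = []
    for ch in text:
        if ch in "\\{}":
            pieces.append("\\" + ch)       # RTF control characters
        elif ch == "\n":
            pieces.append("\\par\n")       # newline becomes a paragraph break
        elif ord(ch) < 128:
            pieces.append(ch)
        else:
            # Non-ASCII: \uN with N as a signed 16-bit value per UTF-16
            # code unit, followed by "?" as the fallback character.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                pieces.append("\\u%d?" % unit)
    return "".join(pieces)


def make_rtf(title: str, body: str) -> str:
    """Build a minimal RTF document: a bold title followed by body text."""
    return (
        "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\n"
        "\\f0\\fs32\\b " + rtf_escape(title) + "\\b0\\par\n"
        "\\fs24 " + rtf_escape(body) + "\\par\n"
        "}"
    )


if __name__ == "__main__":
    print(make_csv([["Name", "Amount"], ["Smith, J.", 12.5]]))
    print(make_rtf("Report", "First line\nSecond line with {braces}"))
```

Saving the first result with a .csv extension and the second with .rtf should give files that Excel and Word, respectively, open directly.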

In the end, the formats for the Office applications are published. This is what Microsoft deals with in their Office products, and there’s no use complaining that the formats are too complicated. They are what they are, and most people should be able to avoid having to deal with them, unless they do something silly like joining the Office development team in Redmond.


[1] Their exact words are “Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when run in this environment.” That’s not even typically vague. Even when a vendor (not just Microsoft) is effusive about their solution, you should be careful. When they tell you not to do something, you should really just walk away. Did Spolsky even read his linked article?
[2] There are also rumors out there that at least one open source project is working on a way of compiling Excel formulae directly to Java bytecode for execution in server environments. Can’t find a web page for it, though.

Comments

#1 − wow

Marc

Surprised that he really recommends using COM automation at all. It’s shaky even for interactive WinForms apps. But using COM automation in an ASP.NET app?!?!? … Don’t know how they get FogBugz stable at all using these techniques…