I'm converting papers for the next JSAWS issue, and instead of converting from Word to HTML (which involves regexping the resulting file to death to change font encoding, formatting, footnotes, etc.), this time I will try to use the OpenOffice XML file format and elementtidy.
A good starting point seems to be Uche Ogbuji's The open office file format on IBM developerWorks. More on this in the next few days, after I work on the conversion.
Update: I have started work on one of the files, and I will keep appending my commentary on the conversion below.
The first thing you will notice when parsing content.xml is that you need the OpenOffice DTDs. Uche Ogbuji's article explains how to get them and how to configure your XML parser. If you're impatient, you can download the source code for OOoDocExplorer, a Java program by DannyB for browsing OO files, which has all of OOo's DTDs combined into a single DTD file (mentioned on OOo's forums).
Focusing on content.xml, we notice a few (uhm, a lot of) namespace declarations:
xmlns:office="http://openoffice.org/2000/office"
xmlns:style="http://openoffice.org/2000/style"
xmlns:text="http://openoffice.org/2000/text"
xmlns:table="http://openoffice.org/2000/table"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:form="http://openoffice.org/2000/form"
xmlns:script="http://openoffice.org/2000/script"
I am only interested in the body text, so I will need the office namespace to locate the body element, and the text namespace for the actual content:
>>> from elementtree import ElementTree
>>> NS_OFFICE = '{http://openoffice.org/2000/office}'
>>> NS_TEXT = '{http://openoffice.org/2000/text}'
>>> tree = ElementTree.parse('content.xml')
>>> root = tree.getroot()
>>> body = root.find(NS_OFFICE + 'body')
>>> body
<Element {http://openoffice.org/2000/office}body at 403146ac>
>>>
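For readers on a current Python, the same lookup can be sketched with the standard library's xml.etree.ElementTree, which descends from the elementtree package used above. This is only an illustration under assumptions: the SAMPLE string is a made-up, heavily trimmed stand-in for a real content.xml, and find_body is a hypothetical helper name, not something from the original session.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in the "{uri}" form that ElementTree uses for qualified names.
NS_OFFICE = '{http://openoffice.org/2000/office}'
NS_TEXT = '{http://openoffice.org/2000/text}'

# Assumption: a tiny hand-written stand-in for a real content.xml.
SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:style-name="P3" text:level="1">Title</text:h>
    <text:p text:style-name="P1">First paragraph.</text:p>
  </office:body>
</office:document-content>"""


def find_body(root):
    """Return the office:body child of the document root, or None if absent."""
    return root.find(NS_OFFICE + 'body')


root = ET.fromstring(SAMPLE)
body = find_body(root)

# Qualified tag of the body element, then the local names of its children.
print(body.tag)
print([child.tag[len(NS_TEXT):] for child in body])
```

With a real file, ET.parse('content.xml').getroot() would take the place of ET.fromstring(SAMPLE).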
Now that we have the body, let's explore it a bit by looking at the actual elements used, and their attributes:
>>> ns = len(NS_TEXT)
>>> tags = {}
>>> for t in body.getiterator()[1:]:
...     tag = t.tag[ns:]
...     attrs = tags.setdefault(tag, {})
...     for k, v in t.attrib.items():
...         attr = attrs.setdefault(k[ns:], {})
...         num = attr.setdefault(v, [0, ])
...         num[0] += 1
...
>>> for tag, attrs in tags.items():
...     print "tag %s" % tag
...     for k, v in attrs.items():
...         print "  %s" % '\n  '.join(['%s="%s"(%s)' % (k, v1, v2[0])
...                                     for (v1, v2) in v.items()])
...     print
...
This produces a list of tags, each followed by its attribute="value" pairs, with the number of times each pair has been used in parentheses, like this (only an excerpt from the list I got, since it's a bit long):
tag h
  style-name="P3"(1)
  style-name="P14"(1)
  level="1"(2)

tag sequence-decl
  display-outline-level="0"(4)
  name="Table"(1)
  name="Drawing"(1)
  name="Illustration"(1)
  name="Text"(1)
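The same survey can be sketched in current Python with collections.Counter. This is a rough equivalent under assumptions, not the code I ran: SAMPLE is an invented fragment, and local_name and count_usage are hypothetical helper names. One deliberate difference is that names are split at the closing brace instead of sliced by the length of the text namespace, so elements from namespaces with a URI of a different length (table, draw, and so on) would not be cut in the wrong place.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

NS_OFFICE = '{http://openoffice.org/2000/office}'

# Assumption: a tiny hand-written stand-in for a real content.xml.
SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:style-name="P3" text:level="1">One</text:h>
    <text:h text:style-name="P14" text:level="1">Two</text:h>
    <text:p>No attributes here.</text:p>
  </office:body>
</office:document-content>"""


def local_name(qname):
    """Strip a leading '{namespace}' from a tag or attribute name."""
    return qname.rsplit('}', 1)[-1]


def count_usage(body):
    """Map each tag name to a Counter of (attribute, value) pairs."""
    usage = defaultdict(Counter)
    for element in body.iter():
        if element is body:
            continue  # skip the body element itself, like the [1:] slice above
        tag = local_name(element.tag)
        usage[tag]  # touch the key so tags without attributes are still listed
        for name, value in element.attrib.items():
            usage[tag][(local_name(name), value)] += 1
    return usage


body = ET.fromstring(SAMPLE).find(NS_OFFICE + 'body')
usage = count_usage(body)

# Print in the same "tag / attribute=value(count)" layout as the listing above.
for tag, pairs in usage.items():
    print("tag %s" % tag)
    for (name, value), count in pairs.items():
        print('  %s="%s"(%d)' % (name, value, count))
    print()
```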