The Python Library is a continual source of amazement: I just discovered the very useful unicodedata module, which pairs nicely with the u'\N{LETTER NAME}' escape sequence.
The \N{} escape sequence works like this:
>>> u'\N{LATIN SMALL LETTER M WITH DOT BELOW}'
u'\u1e43'
Among other things, the unicodedata module lets you look up the Unicode character associated with a name, so you can build mapping tables using character names:
>>> import unicodedata
>>> unicodedata.lookup('LATIN SMALL LETTER M WITH DOT BELOW')
u'\u1e43'
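As a rough sketch of the mapping-table idea: the block below is written for Python 3, where every str is Unicode. The digraph keys and the `transliterate` helper are invented for illustration; only `unicodedata.lookup()` comes from the library.

```python
import unicodedata

# Hypothetical transliteration scheme: ASCII digraph -> Unicode character name.
NAMES = {
    '.m': 'LATIN SMALL LETTER M WITH DOT BELOW',
    '.s': 'LATIN SMALL LETTER S WITH DOT BELOW',
    '.t': 'LATIN SMALL LETTER T WITH DOT BELOW',
}

# Resolve each name once; lookup() raises KeyError for an unknown name,
# so a typo in the table fails loudly at import time.
TABLE = {key: unicodedata.lookup(name) for key, name in NAMES.items()}


def transliterate(text):
    """Replace every digraph in TABLE with its Unicode character."""
    for key, char in TABLE.items():
        text = text.replace(key, char)
    return text


print(transliterate('sa.msaara'))
```

Keeping the names rather than the code points in the source makes the table readable without a character chart at hand.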
The reverse of lookup() is name():
>>> unicodedata.name(unicodedata.lookup('LATIN SMALL LETTER M WITH DOT BELOW'))
'LATIN SMALL LETTER M WITH DOT BELOW'
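One thing to watch with name(): some characters, control characters for instance, have no name, and name() raises ValueError for them unless you pass a default as the second argument. A small Python 3 sketch; the `describe` helper and the '<unnamed>' placeholder are made up for the example:

```python
import unicodedata


def describe(text):
    """Return (character, name) pairs; unnamed characters get a placeholder."""
    return [(ch, unicodedata.name(ch, '<unnamed>')) for ch in text]


for ch, name in describe('m\u1e43\n'):
    print(repr(ch), name)
```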
If you want to check Unicode names, a very useful site is the Letter Database at the Institute of the Estonian Language. An example is the search for LATIN SMALL LETTER S WITH DOT BELOW, which yields this page.
The latest Joel article is an explanation of Unicode and character encodings for programmers. As with many other things, the Python community has produced wonderful tools, libraries, and documentation on Unicode too.
While browsing the logs for my RSS feeds tonight, I noticed a Pears user agent. It's a multiplatform news aggregator written in Python and wxPython by Project5. Other projects hosted on the same site include Kiki, a handy regexp tester that, unlike Kodos, does not depend on PyQt.
I'm converting papers for the next JSAWS issue. Instead of converting from Word to HTML (which involves regexping the resulting file to death to change font encoding, formatting, footnotes, etc.), this time I will try to use the OpenOffice XML file format and elementtidy.
A good starting point seems to be Uche Ogbuji's The open office file format on IBM developerWorks. More on this in the next few days, after I work on the conversion.
Update: I have started work on one of the files, and I will keep appending my commentary on the conversion below.
The first thing you will notice when parsing content.xml is that you need the OpenOffice DTDs. Uche Ogbuji's article explains how to get them and how to configure your XML parser. If you're impatient, you can download the source code for OOoDocExplorer, a Java program by DannyB for browsing OO files, which has all of OOo's DTDs combined into a single DTD file (mentioned on the OOo forums).
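For reference, content.xml is not a loose file: an OpenOffice document is a zip archive, and content.xml is one of its members. A minimal Python 3 sketch for pulling it out follows; the `read_content_xml` name is hypothetical, and the in-memory archive is only a stand-in for a real document.

```python
import io
import zipfile


def read_content_xml(source):
    """Return the raw bytes of content.xml from an OpenOffice document.

    `source` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read('content.xml')


# Stand-in for a real document: a tiny in-memory archive with one member.
fake_document = io.BytesIO()
with zipfile.ZipFile(fake_document, 'w') as archive:
    archive.writestr('content.xml', '<placeholder/>')
fake_document.seek(0)

content = read_content_xml(fake_document)
print(content)
```

With a real file you would pass its path instead of the BytesIO object.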
Focusing on content.xml, we notice a few (uhm, a lot of) namespace declarations:
xmlns:office="http://openoffice.org/2000/office"
xmlns:style="http://openoffice.org/2000/style"
xmlns:text="http://openoffice.org/2000/text"
xmlns:table="http://openoffice.org/2000/table"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:form="http://openoffice.org/2000/form"
xmlns:script="http://openoffice.org/2000/script"
I am only interested in the body text, so I will need the office namespace to locate the body element, and the text namespace for the actual content:
>>> from elementtree import ElementTree
>>> NS_OFFICE = '{http://openoffice.org/2000/office}'
>>> NS_TEXT = '{http://openoffice.org/2000/text}'
>>> tree = ElementTree.parse('content.xml')
>>> root = tree.getroot()
>>> body = root.find(NS_OFFICE + 'body')
>>> body
<Element {http://openoffice.org/2000/office}body at 403146ac>
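The same lookup can be sketched in Python 3, where ElementTree ships in the standard library as xml.etree.ElementTree. The SAMPLE string below is an invented, heavily trimmed stand-in for a real content.xml, so the block runs without any file on disk.

```python
import xml.etree.ElementTree as ET

NS_OFFICE = '{http://openoffice.org/2000/office}'
NS_TEXT = '{http://openoffice.org/2000/text}'

# Invented miniature of content.xml; a real file declares many more namespaces.
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="http://openoffice.org/2000/office"'
    ' xmlns:text="http://openoffice.org/2000/text">'
    '<office:body>'
    '<text:h text:style-name="P3" text:level="1">Title</text:h>'
    '<text:p text:style-name="P1">First paragraph.</text:p>'
    '</office:body>'
    '</office:document-content>'
)

root = ET.fromstring(SAMPLE)

# ElementTree spells qualified names as '{namespace-uri}localname'.
body = root.find(NS_OFFICE + 'body')

headings = [h.text for h in body.findall(NS_TEXT + 'h')]
paragraphs = [p.text for p in body.iter(NS_TEXT + 'p')]
print(headings, paragraphs)
```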
Now that we have the body, let's explore it a bit by looking at the actual elements used, and their attributes:
>>> ns = len(NS_TEXT)
>>> tags = {}
>>> for t in body.getiterator()[1:]:
...     tag = t.tag[ns:]
...     attrs = tags.setdefault(tag, {})
...     for k, v in t.attrib.items():
...         attr = attrs.setdefault(k[ns:], {})
...         num = attr.setdefault(v, [0, ])
...         num[0] += 1
...
>>> for tag, attrs in tags.items():
...     print "tag %s" % tag
...     for k, v in attrs.items():
...         print " %s" % '\n '.join(['%s="%s"(%s)' % (k, v1, v2[0])
...                                   for (v1, v2) in v.items()])
...     print
...
This produces a list of tags, each followed by its attribute="value" pairs, with the number of times each pair has been used in parentheses, like this (only an excerpt from the list I got, since it's a bit long):
tag h
 style-name="P3"(1)
 style-name="P14"(1)
 level="1"(2)

tag sequence-decl
 display-outline-level="0"(4)
 name="Table"(1)
 name="Drawing"(1)
 name="Illustration"(1)
 name="Text"(1)
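The tally above slices off a fixed number of characters, which quietly assumes every tag and attribute lives in the text namespace. Here is a hedged Python 3 variant that strips whatever namespace is present and counts with collections.Counter. The `local` and `tally` helpers and the SAMPLE fragment are invented for the example, and the output order may differ from the excerpt.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def local(name):
    """Strip a leading '{namespace}' from a tag or attribute name."""
    return name.rsplit('}', 1)[-1]


def tally(body):
    """Map each descendant tag to a Counter of its (attribute, value) pairs."""
    tags = defaultdict(Counter)
    for elem in body.iter():
        if elem is body:
            continue  # skip the body element itself, like the [1:] slice
        counter = tags[local(elem.tag)]  # registers tags without attributes too
        for key, value in elem.attrib.items():
            counter[(local(key), value)] += 1
    return tags


# Invented fragment standing in for the real office:body element.
SAMPLE = (
    '<office:body xmlns:office="http://openoffice.org/2000/office"'
    ' xmlns:text="http://openoffice.org/2000/text">'
    '<text:h text:style-name="P3" text:level="1">A</text:h>'
    '<text:h text:style-name="P14" text:level="1">B</text:h>'
    '</office:body>'
)

tags = tally(ET.fromstring(SAMPLE))
for tag, attrs in tags.items():
    print('tag %s' % tag)
    for (attr, value), count in attrs.items():
        print(' %s="%s"(%d)' % (attr, value, count))
```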