Ludoo - Python - Sep 2003

HTTP proxies

Alan Kennedy has just announced a list of Python-based HTTP proxies, a few of which may come in handy when developing web applications.

And if you're wondering what I'm doing sitting in front of my computer at this hour, well, this is one of those nights when I just can't get to sleep.

ludo ~ Sep 24, 2003 03:27:00 ~ upd Sep 24, 2003 04:17:00 ~ category Python

Combining Docbook-generated chunked HTML

The book Creating Applications With Mozilla is freely available at mozdev, but unfortunately it only comes as a set of HTML pages (or at least that's what I was able to find).

Having some time to waste, I set out to combine all the HTML pages into a single file, trying to improve my understanding of the wonderful elementtree and elementtidy packages along the way.

The resulting script parses the files in the (hopefully) correct order, combines their HTML body elements into a single file, and fixes the internal references to point to the correct places in the new file.

The script takes about 19 seconds to run on my crappy Celeron 600 machine, and the resulting file is 1.4 MB. Given that the book seems to be written in Docbook, and produced with the chunked HTML Docbook XSL stylesheet, this script may serve as a starting point to reverse-engineer Docbook-produced HTML, if you ever need to do it.
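
The original script is not reproduced in this entry. The three steps it describes (read the pages in order, merge their body elements, fix the internal references) could be sketched roughly as follows. This is only an illustration: it uses xml.etree.ElementTree, the standard-library descendant of the elementtree package, and it assumes the pages are already well-formed, namespace-free XHTML with ids that are unique across files. The function name combine_pages and the choice to mark each page start with a named anchor are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def combine_pages(pages):
    """Merge the <body> contents of several XHTML pages into one tree.

    pages is a list of (filename, xhtml_source) pairs, already in
    reading order. Text placed directly inside <body>, outside any
    element, is ignored in this sketch.
    """
    names = {name for name, _ in pages}
    html = ET.Element('html')
    body = ET.SubElement(html, 'body')
    for name, source in pages:
        page_body = ET.fromstring(source).find('body')
        # An empty named anchor marks where each original page starts,
        # so links to the bare file name still have a target.
        ET.SubElement(body, 'a', {'name': name})
        body.extend(list(page_body))
    # Rewrite links that pointed at one of the merged files.
    for link in body.iter('a'):
        href = link.get('href')
        if href is None:
            continue
        target, _, fragment = href.partition('#')
        if target in names:
            link.set('href', '#' + (fragment or target))
    return html
```

Calling ET.tostring(combine_pages(pages)) would then give the single-file markup. Links to external sites and links that are already local fragments are left alone.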

ludo ~ Sep 19, 2003 16:27:00 ~ upd Sep 19, 2003 16:40:15 ~ category Python

Removing unused css classes

One of the things I like best about Python is the interactive console. I often use it to do quick manipulations on text files, and every time I wonder how I managed before learning Python, when I wrote Perl or PHP scripts for similar things (yes, I know that Perl has useful command line options for stuff like that, but with the Python console you can poke around and see the data you're manipulating interactively, get help on commands, etc.).

So today I set about removing unneeded CSS classes from a huge HTML file I did not produce myself.

Getting the data from a file is a one-liner:

/-(ludo@pippozzo)-(27/pts)-(15:44:06:Sat Sep 06)--
\-($:~)-- python
Python 2.3 (#1, Aug  5 2003, 15:11:52)
[GCC 3.2.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> all = file('xxxxxxxxxx.html', 'r').read()

Then we import the re module and obtain a list of all css classes used in the file, removing duplicates:

>>> import re
>>> r = re.compile('class="([^"]+)"')
>>> styles = [s for s in r.findall(all) if s not in locals()['_[1]'].__self__]

The duplicate removal comprehension looks like a clever Perlish hack, but I'm in the console and it's nice to be able to do everything in one line. I copied it from the Python Cookbook entry The Secret Name of List Comprehensions by Chris Perkins, who warns that "It should come as no surprise to anyone that this is a totally undocumented, seat-of-your-pants exploitation of an implementation detail of the Python interpreter. There is absolutely no guarantee that this will continue to work in any future Python release."

Now that we have the list of classes in use, we can remove the unneeded ones and save the processed content back to the original file:

>>> r = re.compile(r"""(?ms)^(\S*\.(\S+)\s+\{[^\}]+\}\n)""")
>>>
>>> for style in r.findall(all):
...     if not style[1] in styles:
...             all = all.replace(style[0], '')
...
>>> file('xxxxxxxxxx.html', 'w').write(all)

Storing the styles in a dictionary is more efficient (not that it matters in this quick console run), and eliminates the need for the duplicate removal hack. Here is a version using a dictionary:

>>> import re
>>> all = file('xxxxxxxxxx.html', 'r').read()
>>> r = re.compile('class="([^"]+)"')
>>> styles = {}
>>> for s in r.findall(all):
...     styles[s] = None
...
>>> r = re.compile(r"""(?ms)^(\S*\.(\S+)\s+\{[^\}]+\}\n)""")
>>> for style in r.findall(all):
...     if not style[1] in styles:
...             all = all.replace(style[0], '')
...
>>> file('xxxxxxxxxx.html', 'w').write(all)
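
The dictionary can also be built in a single call with dict.fromkeys(), which keeps things as console-friendly as the one-line hack. A minimal sketch, assuming Python 3 and the same two regular expressions as above; the function name strip_unused_rules is made up for illustration, and it works on a string rather than reading and writing the file:

```python
import re

CLASS_ATTR = re.compile('class="([^"]+)"')
RULE = re.compile(r"""(?ms)^(\S*\.(\S+)\s+\{[^\}]+\}\n)""")


def strip_unused_rules(html):
    """Return html with the CSS rules for unused classes removed."""
    # dict.fromkeys() builds the lookup dictionary in one call and
    # drops duplicate class names along the way.
    styles = dict.fromkeys(CLASS_ATTR.findall(html))
    for rule, name in RULE.findall(html):
        if name not in styles:
            html = html.replace(rule, '')
    return html
```

The whole console session then shrinks to reading the file into a string, passing it through strip_unused_rules(), and writing the result back.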

update: the lines above worked for the particular file I was editing; a general solution would probably need a few changes:

In the second regexp, used to identify style class declarations, I assume the class identifier is anchored at the beginning of a line, which is not always the case
I look for style class declarations in the whole file, which is pretty pointless (but saves typing a few lines of code in the console and works for the specific HTML file in question). Style class declarations are obviously inside a <style></style> block, so it's better to limit the scope of the second findall() to the contents of the style block(s); this speeds up the search/replace and reduces the chance of false regexp matches (for example, blocks of C/Java/PHP source in the body of the document)
maybe use a scanner object instead of the second findall, and replace using the match position, as Fredrik Lundh explains in Using Regular Expressions for Lexical Analysis
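
The three changes listed above could be combined along these lines. This is a sketch under stated assumptions, not a tested general solution: it is written for Python 3, the names remove_unused_classes, STYLE_BLOCK and CLASS_RULE are made up for illustration, and re.sub() with a callback stands in for the scanner object (it also replaces at the match position instead of searching the string again). It only understands simple selectors of the form name.class or .class, and it additionally splits class="a b" into two names, which the console session did not do.

```python
import re

STYLE_BLOCK = re.compile(r'(?is)(<style\b[^>]*>)(.*?)(</style>)')
CLASS_ATTR = re.compile(r'class="([^"]+)"')
# Optional element name, a dot, the class name, then a brace-delimited
# body. Not anchored to the start of a line. Compound selectors such as
# ".a .b" or ".a, .b" are NOT handled correctly by this pattern.
CLASS_RULE = re.compile(r'[\w-]*\.([\w-]+)\s*\{[^}]*\}\s*')


def remove_unused_classes(html):
    """Drop simple class rules that no class attribute refers to."""
    used = set()
    for value in CLASS_ATTR.findall(html):
        used.update(value.split())  # class="a b" uses two classes

    def filter_rule(match):
        # Keep the rule text untouched if its class is in use.
        return match.group(0) if match.group(1) in used else ''

    def filter_block(match):
        # Only the contents of <style> blocks are searched for rules.
        open_tag, css, close_tag = match.groups()
        return open_tag + CLASS_RULE.sub(filter_rule, css) + close_tag

    return STYLE_BLOCK.sub(filter_block, html)
```

Because the rule pattern is applied only inside style blocks, text in the body that happens to look like a rule is left untouched.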

ludo ~ Sep 06, 2003 16:10:39 ~ upd Sep 06, 2003 21:23:33 ~ category Python

Text Processing in Python

My current technical reading is the excellent book Text Processing in Python by David Mertz. In chapter 1 David expresses with his usual clarity a couple of concepts I usually follow unconsciously in my Python programming, but which are important enough to be repeated here as a reminder to myself.

[...] an important principle of Python programming makes types less important than programmers coming from other languages tend to expect. According to Python's "principle of pervasive polymorphism" (my own coinage), it is more important what an object does than what it is.

David then proceeds to describe a few practical cases where pervasive polymorphism is useful, like working on file-like objects.

One common way to test a capability in Python is to try to do something, and catch any exceptions that occur (then try something else).

One of the main reasons why I like Python so much compared to other languages is the incredible usefulness of its exception mechanism. After all, many of the things we try or learn in life we do by trial and error, and using this same method in programming just fits your brain.
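
As a small illustration of both quoted ideas (this example is not taken from the book), here is a sketch in Python 3 of a function that accepts either a file-like object or a file name. The name read_all is made up; the function never checks the type of its argument, it just tries read() and falls back when that capability is missing.

```python
import io


def read_all(source):
    """Return the text of source, a file-like object or a file name."""
    try:
        # Anything with a read() method counts as a file here,
        # whatever its actual type is.
        return source.read()
    except AttributeError:
        # No read() method: fall back to treating source as a file name.
        with open(source) as handle:
            return handle.read()


# An in-memory buffer works just as well as a real file would.
print(read_all(io.StringIO('some text')))
```

One caveat of this style is that an AttributeError raised inside a buggy read() method would also trigger the fallback, so the try block is best kept as small as it is here.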

ludo ~ Sep 05, 2003 10:58:00 ~ upd Sep 05, 2003 11:26:41 ~ category Python