A couple of days ago while working on this site's generator, I had to solve the (small) problem of generating abstracts of possibly arbitrary length from HTML snippets.
I wanted my code not only to trim a snippet down to a certain word length, but also to count how many words were left out of the abstract, and to preserve HTML tags (without counting them as words, of course).
A brief look at Text Processing in Python by David Mertz pointed me in the right direction. It took a few minutes, and the abstracts appear to be good so far.
Last night I was browsing through Fredrik Lundh's blog and I stumbled upon his solution to the same problem, which I had only skimmed without much interest when it appeared in my aggregator. I did not (and still do not) fully understand his code, mainly because I have never used formatter classes.
Ever the curious person, I decided to benchmark the two solutions against each other, mentally prepared to take a beating from a programmer far superior to me. I was surprised when my code turned out about 35% faster (of course, this may be because it is not really my code, but a variation on David Mertz's).
```
/-(ludo@pippozzo)-(84/pts)-(00:50:25:Mon Aug 18)--
-($:~/pystuff/staticblog)-- ./test.py
effbot code, 100 runs 2.05247092247
my code, 100 runs 1.30467903614
```
Not satisfied, I thought the difference was in my reusing the same instance vs. effbot's code creating a new instance at each call (correct me if I'm wrong), so I also timed single runs.
```
/-(ludo@pippozzo)-(102/pts)-(00:59:40:Mon Aug 18)--
-($:~/pystuff/staticblog)-- ./test.py
effbot code, 100 runs 2.05521595478
effbot code, 100 * 1 run min 0.011666 max 0.043925 avg 0.019288
my code, 100 runs 1.25740003586
my code, 100 * 1 run min 0.007895 max 0.037792 avg 0.013004
```
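The test.py harness itself isn't shown in this post, so here is a rough sketch of how such a timing loop could look, using the standard `timeit` module (the `bench` helper and the stand-in workload are hypothetical, not the original script):

```python
import timeit

def bench(label, func, runs=100):
    """Print total time for `runs` calls, plus per-call min/max/avg."""
    # total wall time for `runs` back-to-back calls
    total = timeit.timeit(func, number=runs)
    # each call timed on its own, to expose per-call variance
    singles = [timeit.timeit(func, number=1) for _ in range(runs)]
    print("%s, %d runs %s" % (label, runs, total))
    print("%s, %d * 1 run min %f max %f avg %f"
          % (label, runs, min(singles), max(singles), sum(singles) / runs))
    return total

# stand-in workload; the real test presumably fed an HTML snippet to each parser
bench("my code", lambda: sum(range(1000)))
```

Timing each call individually, as in the second run above, is what separates per-call setup cost (like creating a fresh parser instance) from the steady-state parsing cost.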
Still faster, though not by much. Here's my code:
```python
import re
from HTMLParser import HTMLParser

class abstractParser(HTMLParser):
    """inspired by a simpler parser described in Text Processing in Python,
    chap 5: http://gnosis.cx/TPiP/chap5.txt"""

    # runs of whitespace (or &nbsp; entities) separate words
    space_re = re.compile(r'(?:\s|&nbsp;)+', re.S)

    def __init__(self, abstract_length):
        HTMLParser.__init__(self)
        self.abstract_length = abstract_length

    def reset(self):
        HTMLParser.reset(self)
        self.tagstack = []
        self.abstract = []
        self.wordcount = 0
        self.morewords = 0
        self.completed = False

    def handle_starttag(self, tag, attrs):
        if not self.completed:
            self.tagstack.append('</%s>' % tag)
            self.abstract.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.completed:
            self.abstract.append(self.tagstack.pop())

    def handle_data(self, data):
        if self.completed:
            # past the cutoff: approximate the left-out words by
            # counting separator runs
            self.morewords += len(self.space_re.findall(data))
        elif data:
            words = []
            for word in self.space_re.split(data):
                if self.wordcount == self.abstract_length:
                    self.completed = True
                if self.completed:
                    if word != '':
                        self.morewords += 1
                    continue
                if word != '':
                    self.wordcount += 1
                # empty strings are kept (but not counted) so that
                # leading/trailing spacing around tags survives the join
                words.append(word)
            self.abstract.append(' '.join(words))

    def feed(self, content):
        self.reset()
        # TODO: split feeding in reasonable chunks until self.completed
        HTMLParser.feed(self, content)
        HTMLParser.close(self)
        if self.morewords > 0:
            self.abstract.append("... (%s more words)" % self.morewords)
        # close any tags still open at the cutoff, innermost first
        self.tagstack.reverse()
        for t in self.tagstack:
            self.abstract.append(t)
        return ''.join(self.abstract)

if __name__ == '__main__':
    snippet1 = "<p>Lorem ipsum dolor sit <b>amet</b> ipso facto.</p>"
    snippet2 = "<p>Lorem ipsum <i>dolor sit <b>amet</b> ipso facto</i>.</p>"
    p = abstractParser(5)
    print p.feed(snippet1)
    print p.feed(snippet2)
    # gives
    # <p>Lorem ipsum dolor sit <b>amet</b>... (2 more words)</p>
    # <p>Lorem ipsum <i>dolor sit <b>amet</b>... (2 more words)</i></p>
```
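A quick sanity check of the word-splitting regex (assuming the intended pattern is `r'(?:\s|&nbsp;)+'`, since HTML escaping easily mangles the backslash and the entity): it treats literal `&nbsp;` entities exactly like whitespace, so they separate words without ever being counted as words.

```python
import re

# the pattern as intended: runs of whitespace and/or &nbsp; entities
space_re = re.compile(r'(?:\s|&nbsp;)+', re.S)

# &nbsp;, double spaces, and newlines all act as a single separator
words = [w for w in space_re.split('Lorem&nbsp;ipsum  dolor\nsit') if w]
print(words)  # ['Lorem', 'ipsum', 'dolor', 'sit']
```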
Hmmm, I even caught a bug while writing this blog entry. Grrr, when will I learn to write tests before coding, even for small things?