Ludoo » Generating abstracts from HTML snippets

Generating abstracts from HTML snippets

ludo, August 18, 2003 at 00:39:00 CEST

A couple of days ago, while working on this site's generator, I had to solve the (small) problem of generating abstracts of arbitrary length from HTML snippets.

I wanted my code not only to trim a snippet down to a given number of words, but also to count how many words were left out of the abstract, and to preserve HTML tags (without counting them as words, of course).
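Leaving the tags aside for a moment, the word-counting half of the problem can be sketched on plain text. The function name `truncate_words` is made up for this illustration and is not part of the site generator; the sketch assumes Python 3.

```python
def truncate_words(text, limit):
    """Return (abstract, leftover): the first `limit` words of `text`
    and the number of words that were left out."""
    words = text.split()                     # split on runs of whitespace
    abstract = ' '.join(words[:limit])       # keep at most `limit` words
    leftover = max(len(words) - limit, 0)    # words that did not fit
    return abstract, leftover

print(truncate_words("Lorem ipsum dolor sit amet ipso facto.", 5))
# ('Lorem ipsum dolor sit amet', 2)
```

The HTML case is harder only because tags have to pass through untouched and be closed again after the cut.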

A brief look at Text Processing in Python by David Mertz pointed me in the right direction. It took a few minutes, and the abstracts appear to be good so far.

Last night I was browsing through Fredrik Lundh's blog and stumbled upon his solution to the same problem, which I had only skimmed without much interest when it first appeared in my aggregator. I did not (and still do not) fully understand his code, mainly because I have never used the formatter classes.

Ever the curious person, I decided to benchmark the two solutions against each other, mentally prepared to take a beating from a far better programmer than me. I was surprised when my code took roughly 35% less time (of course, this may be because it is not really my code, but a variation on David Mertz's).

/-(ludo@pippozzo)-(84/pts)-(00:50:25:Mon Aug 18)--
-($:~/pystuff/staticblog)-- ./test.py
effbot code, 100 runs
2.05247092247
my code, 100 runs
1.30467903614
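The `test.py` script itself is not shown here. As a rough idea of how such a comparison can be timed, here is a minimal sketch using the standard `timeit` module under Python 3. The two functions `abstract_a` and `abstract_b` are hypothetical stand-ins, not the implementations being compared, and `time_runs` is a name invented for this example.

```python
import timeit

def abstract_a(snippet):
    # hypothetical stand-in for one implementation
    return ' '.join(snippet.split()[:5])

def abstract_b(snippet):
    # hypothetical stand-in for the other implementation
    return ' '.join(snippet.split(None, 5)[:5])

SNIPPET = "Lorem ipsum dolor sit amet ipso facto. " * 50

def time_runs(func, runs=100):
    """Total seconds taken by `runs` calls of func(SNIPPET)."""
    return timeit.timeit(lambda: func(SNIPPET), number=runs)

for name, func in (("abstract_a", abstract_a), ("abstract_b", abstract_b)):
    print("%s, 100 runs" % name)
    print(time_runs(func))
```

The absolute numbers depend on the machine and the snippet, so only the ratio between the two is worth reading.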

Not satisfied, I suspected the difference came from my code reusing the same instance, while effbot's code creates a new instance on each call (correct me if I'm wrong).

/-(ludo@pippozzo)-(102/pts)-(00:59:40:Mon Aug 18)--
-($:~/pystuff/staticblog)-- ./test.py
effbot code, 100 runs
2.05521595478
effbot code, 100 * 1 run min 0.011666 max 0.043925 avg 0.019288
my code, 100 runs
1.25740003586
my code, 100 * 1 run min 0.007895 max 0.037792 avg 0.013004
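The min, max and avg figures come from timing each call on its own. A sketch of how such numbers might be collected, again assuming Python 3: `Worker` is a hypothetical placeholder class (neither of the two real implementations) used only to contrast reusing one instance with creating a fresh one on every call, and `per_call_stats` is an invented helper name.

```python
import time

class Worker:
    """Hypothetical stand-in for a parser class."""
    def __init__(self, limit):
        self.limit = limit

    def run(self, snippet):
        return ' '.join(snippet.split()[:self.limit])

def per_call_stats(call, runs=100):
    """Time `runs` individual calls; return (min, max, avg) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return min(timings), max(timings), sum(timings) / len(timings)

SNIPPET = "Lorem ipsum dolor sit amet ipso facto. " * 50
shared = Worker(5)

# one instance reused for every call
reused = per_call_stats(lambda: shared.run(SNIPPET))
# a new instance created on every call
fresh = per_call_stats(lambda: Worker(5).run(SNIPPET))

print("reused min %f max %f avg %f" % reused)
print("fresh  min %f max %f avg %f" % fresh)
```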

Still faster, though not by much. Here's my code:

import re
from HTMLParser import HTMLParser
class abstractParser(HTMLParser):
    """inspired by a simpler parser described in Text Processing in Python chap 5
    http://gnosis.cx/TPiP/chap5.txt"""
    # words are separated by runs of whitespace
    space_re = re.compile(r'(?:\s| )+', re.S)
    def __init__(self, abstract_length):
        HTMLParser.__init__(self)
        self.abstract_length = abstract_length
    def reset(self):
        HTMLParser.reset(self)
        self.tagstack = []
        self.abstract = []
        self.wordcount = 0
        self.morewords = 0
        self.completed = False
    def handle_starttag(self, tag, attrs):
        if not self.completed:
            self.tagstack.append('</%s>' % tag)
            self.abstract.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        if not self.completed:
            self.abstract.append(self.tagstack.pop())
    def handle_data(self, data):
        if self.completed:
            self.morewords += len(self.space_re.findall(data))
        else:
            if data:
                words = []
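                for word in self.space_re.split(data):
                    # check the limit before accepting a word, so the
                    # abstract never holds more than abstract_length words
                    if word != '' and self.wordcount == self.abstract_length:
                        self.completed = True
                    if self.completed:
                        if word != '':
                            self.morewords += 1
                        continue
                    if word != '':
                        self.wordcount += 1
                    words.append(word)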
                self.abstract.append(' '.join(words))
    def feed(self, content):
        self.reset()
        # TODO: split feeding in reasonable chunks until self.completed
        HTMLParser.feed(self, content)
        HTMLParser.close(self)
        if self.morewords > 0:
            self.abstract.append("... (%s more words)" % self.morewords)
        self.tagstack.reverse()
        for t in self.tagstack:
            self.abstract.append(t)
        return ''.join(self.abstract)
if __name__ == '__main__':
    snippet1 = "<p>Lorem ipsum dolor sit <b>amet</b> ipso facto.</p>"
    snippet2 = "<p>Lorem ipsum <i>dolor sit <b>amet</b> ipso facto</i>.</p>"
    p = abstractParser(5)
    print p.feed(snippet1)
    print p.feed(snippet2)
    # gives
    # <p>Lorem ipsum dolor sit <b>amet</b>... (2 more words)</p>
    # <p>Lorem ipsum <i>dolor sit <b>amet</b>... (2 more words)</i></p>

Hmmm, I even caught a bug while writing this blog entry. Grrr, when will I learn to write tests before coding, even for small things?
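For what it's worth, a few assertions are enough to pin down the expected output. What follows is only a sketch: the listing above targets Python 2, while this is an adaptation for Python 3's `html.parser`, renamed `AbstractParser`. It checks the word limit before accepting a word and guards against stray end tags; both are choices made for this example. The function `run_tests` is likewise invented here.

```python
import re
from html.parser import HTMLParser

class AbstractParser(HTMLParser):
    """Python 3 adaptation of the listing above (a sketch, not the original)."""
    space_re = re.compile(r'\s+')

    def __init__(self, abstract_length):
        self.abstract_length = abstract_length
        HTMLParser.__init__(self)          # calls reset()

    def reset(self):
        HTMLParser.reset(self)
        self.tagstack = []                 # closing tags still owed
        self.abstract = []
        self.wordcount = 0
        self.morewords = 0
        self.completed = False

    def handle_starttag(self, tag, attrs):
        if not self.completed:
            self.tagstack.append('</%s>' % tag)
            self.abstract.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # the tagstack check is an addition: it ignores stray end tags
        if not self.completed and self.tagstack:
            self.abstract.append(self.tagstack.pop())

    def handle_data(self, data):
        if self.completed:
            # same approximation as the listing: count whitespace runs
            self.morewords += len(self.space_re.findall(data))
            return
        words = []
        for word in self.space_re.split(data):
            if word != '' and self.wordcount == self.abstract_length:
                self.completed = True
            if self.completed:
                if word != '':
                    self.morewords += 1
                continue
            if word != '':
                self.wordcount += 1
            words.append(word)
        self.abstract.append(' '.join(words))

    def feed(self, content):
        self.reset()
        HTMLParser.feed(self, content)
        HTMLParser.close(self)
        if self.morewords > 0:
            self.abstract.append("... (%s more words)" % self.morewords)
        self.abstract.extend(reversed(self.tagstack))
        return ''.join(self.abstract)

def run_tests():
    p = AbstractParser(5)
    # the first snippet from the listing above
    assert p.feed("<p>Lorem ipsum dolor sit <b>amet</b> ipso facto.</p>") == \
        "<p>Lorem ipsum dolor sit <b>amet</b>... (2 more words)</p>"
    # the limit is reached in the middle of a single text node
    assert p.feed("<p>a b c d e f g</p>") == \
        "<p>a b c d e... (2 more words)</p>"
    # a snippet shorter than the limit comes back unchanged
    assert p.feed("<p>Lorem ipsum</p>") == "<p>Lorem ipsum</p>"
    print("all tests passed")

run_tests()
```

The second assertion is the interesting one: it covers a cut that falls inside a run of text rather than at a tag boundary, a case the two snippets in the listing do not exercise.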

in: Python
