October 02, 2003
New category, first entry. I have been thinking for a while of writing about music, but the difficulty of writing on non-technical topics in English has always held me back. A few minutes ago, listening on blackark.com to the splendid Melting Pot by Boris Gardner, I decided it's the music that counts, not the words. If somebody comes to love Reggae Music by reading my recommendations, and if I discover new artists and new songs, it's worth cutting a poor figure with my English.
Not much to say about Boris Gardner, apart from knowing his name from having listened countless times in the past to Elizabethan Reggae, his 70s classic. I tried Google, but did not come up with much apart from the label Jamaican Funk, which appears to suit him well. The undisputed master of Jamaican Funk is Toots Hibbert of Toots & the Maytals fame, one of the greats of Reggae Music.
Melting Pot appears to be included in the compilation 200% Dynamite by Soul Jazz Records, a small London-based record label with a very interesting catalogue and a retro look. The compilation explores the links between Reggae, Jazz, Funk and Soul, and includes some of the greatest reggae musicians of the 70s, like Augustus Pablo, Toots, Jackie Mittoo and Tommy McCook. A single CD is a very limited space for the wealth of talent Jamaica produced in the 70s, but IMHO it's impossible to explore the Jazz roots of Reggae without including Ernest Ranglin, Jamaica's greatest guitarist.
October 01, 2003
While browsing the logs for my RSS feeds tonight, I noticed a Pears user agent. It's a multiplatform news aggregator written in Python and wxPython by Project5. Other projects hosted on the same site include Kiki, a handy regexp tester that, unlike Kodos, does not depend on PyQt.
I'm converting papers for the next JSAWS issue, and instead of converting from Word to HTML (which involves regexping the resulting file to death to change font encoding, formatting, footnotes, etc.), this time I will try to use the OpenOffice XML file format and elementtidy.
A good starting point seems to be Uche Ogbuji's The open office file format on IBM developerWorks. More on this in the next few days, after I work on the conversion.
update: I have started work on one of the files, and I will keep appending my commentary on the conversion below.
The first thing you will notice when parsing content.xml is that you need the OpenOffice DTDs. Uche Ogbuji's article explains how to get them and how to configure your XML parser. If you're impatient, you can download the source code for OOoDocExplorer, a Java program by DannyB for browsing OO files, which has all of OOo's DTDs combined into a single DTD file (mentioned on OOo's forums).
Focusing on content.xml, we notice a few (uhm, a lot of) namespace declarations:
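As a side note, an OpenOffice 1.x document is a zip archive and content.xml is one of its members, so it can be pulled out without unzipping by hand. What follows is only a minimal sketch using Python's standard zipfile module: the helper name read_content_xml is made up for illustration, and the in-memory archive is a stand-in for a real document so that the snippet runs anywhere.

```python
import io
import zipfile


def read_content_xml(source):
    """Return the raw bytes of content.xml from an OpenOffice archive.

    `source` may be a file path or a file-like object opened in binary mode.
    (Hypothetical helper, not part of any OpenOffice tooling.)
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml")


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc/>")
buffer.seek(0)

content = read_content_xml(buffer)
```

With a real document you would pass its path instead of the buffer.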
xmlns:office="http://openoffice.org/2000/office"
xmlns:style="http://openoffice.org/2000/style"
xmlns:text="http://openoffice.org/2000/text"
xmlns:table="http://openoffice.org/2000/table"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:form="http://openoffice.org/2000/form"
xmlns:script="http://openoffice.org/2000/script"
I am only interested in the body text, so I will need the office namespace to locate the body element, and the text namespace for the actual content:
>>> from elementtree import ElementTree
>>> NS_OFFICE = '{http://openoffice.org/2000/office}'
>>> NS_TEXT = '{http://openoffice.org/2000/text}'
>>> tree = ElementTree.parse('content.xml')
>>> root = tree.getroot()
>>> body = root.find(NS_OFFICE + 'body')
>>> body
<Element {http://openoffice.org/2000/office}body at 403146ac>
>>>
Now that we have the body, let's explore it a bit by looking at the actual elements used, and their attributes:
>>> ns = len(NS_TEXT)
>>> tags = {}
>>> for t in body.getiterator()[1:]:
...     tag = t.tag[ns:]
...     attrs = tags.setdefault(tag, {})
...     for k, v in t.attrib.items():
...         attr = attrs.setdefault(k[ns:], {})
...         num = attr.setdefault(v, [0, ])
...         num[0] += 1
...
>>> for tag, attrs in tags.items():
...     print "tag %s" % tag
...     for k, v in attrs.items():
...         print " %s" % '\n '.join(['%s="%s"(%s)' % (k, v1, v2[0])
...             for (v1, v2) in v.items()])
...     print
...
This produces a list of tags, each followed by its attribute=value declarations, with the number of times each one has been used in parentheses, like this (only an excerpt from the list I got, since it's a bit long):
tag h
style-name="P3"(1)
style-name="P14"(1)
level="1"(2)
tag sequence-decl
display-outline-level="0"(4)
name="Table"(1)
name="Drawing"(1)
name="Illustration"(1)
name="Text"(1)
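For readers on a current Python, the same tally can be sketched with the standard library's xml.etree.ElementTree and collections.Counter. This is an illustration under assumptions, not the session I actually ran: the SAMPLE document is invented, the helper names local_name and tally are hypothetical, and names are stripped by splitting on the closing brace rather than by the length of the text namespace, so elements from other namespaces are handled too.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

# Invented sample standing in for a real content.xml.
SAMPLE = """\
<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:style-name="P3" text:level="1">Title</text:h>
    <text:h text:style-name="P14" text:level="1">Another title</text:h>
    <text:p text:style-name="P1">Some text</text:p>
  </office:body>
</office:document-content>
"""

NS_OFFICE = "{http://openoffice.org/2000/office}"


def local_name(qualified):
    """Strip a leading '{namespace}' from an ElementTree tag or attribute name."""
    return qualified.rsplit("}", 1)[-1]


def tally(body):
    """Map tag -> attribute -> Counter of values, for all descendants of body."""
    tags = defaultdict(lambda: defaultdict(Counter))
    for element in body.iter():
        if element is body:
            continue  # skip the body element itself, like [1:] in the session
        attrs = tags[local_name(element.tag)]
        for name, value in element.attrib.items():
            attrs[local_name(name)][value] += 1
    return tags


root = ET.fromstring(SAMPLE)
body = root.find(NS_OFFICE + "body")
tags = tally(body)

for tag, attrs in tags.items():
    print("tag %s" % tag)
    for name, values in attrs.items():
        for value, count in values.items():
            print('  %s="%s"(%d)' % (name, value, count))
```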
Every time I boot my laptop in Windows (which happens once per week or less), I am greeted by the familiar Windows Update popup. Every time, I feel lucky to have switched full time to Linux.
What kept my desktop environment tied to Windows for a long time were mainly Word (now I use a combination of ReST and LaTeX), Internet Explorer+OE (now Firebird+Thunderbird), a decent editor (now J), and good font quality (now TTF Web Fonts+freetype with the TTF bytecode interpreter enabled).
There still are a few things that Windows does better, mainly desktop and apps integration, managing files graphically, scanner support and scan quality. And there still are a few things where Linux needs improvement, but it is definitely getting there. I can't wait to install Sun's Mad Hatter beta on my office desktop next week.
BTW, if you're curious about the MSVDM toolbar you can see in the picture, it's the free Virtual Desktop Manager from the Microsoft PowerToys for Windows XP.
September 28, 2003
Last night Italy suffered a massive blackout, so this morning I came back from Lake Maggiore to find my server dead. The server, which runs all our mail and web sites, was (yes, was) a very old Compaq Deskpro with a Celeron 300 and twin 4Gb disks configured as a software RAID-1 array. I liked it because it was a very quiet, low-consumption machine with no CPU fan, and I don't need more power than that to manage a few mailboxes, a couple of web sites, and my home LAN.
I knew it was in bad shape, but I had always managed to reboot it the very few times I needed to. Today I tried everything, but the BIOS screen refuses to come up, and no beeps escape from the speaker at boot time.
So I wasted all day trying to get my desktop's twin machine (an Athlon 750 PC I built three years ago) to see my server's two disks and remount the RAID array. The new machine's BIOS screws up the disk geometry, and if I attach the two disks to the two IDE channels, I see only the first one. Weird.
So I burned a mini-CD with tomsrtbt, transferred everything onto a 20Gb disk, rebuilt the array, hotraidadded one of the two 4Gb disks, reconfigured LILO, rebooted, and LILO started spitting out error messages. I reconfigured and double checked everything, and LILO came up with a checksum error. Hmmm, I'm getting too old for this. I remember when it was fun, but it's not anymore.
In the end I installed Grub, took the 4Gb disk off the array, fscked to death the 20Gb disk, which got corrupted in the meantime, and the server is back up. Tomorrow I will buy a second 20Gb or 40Gb disk and add it to the RAID array.
The only good thing to come out of this is that I learned something more about Grub, and I really like it. If things get tough, Grub is your friend. Here are a few useful links if you want to switch from LILO to Grub, boot a software RAID partition from Grub, or convert a running system to software RAID (mainly geared towards Debian users; Slackware users may find this more useful).
The fiber optic link this site is usually served from is still down. It has something to do with the Catalyst in the basement that serves the whole building. I have called a few times, but all I'm getting is that it will be fixed RSN. If I knew how to pick the lock on the rack, and I had not spent the whole day fighting disks and LILO, I would be tempted to connect my laptop and try to bring it up myself (but maybe it's not so easy anymore to reboot a Cisco and get enable permissions). Luckily I still have my ADSL link, which promptly resumed service as soon as power came back. I have switched the DNS to the ADSL address, so this site will slowly resurface on the Web, but it will be pretty slow until the fiber optic link is working again.
September 27, 2003
I admit it, I'm an architect. Not only as in IT Architect or whatever my job title of the moment is, but as in builder of houses. I graduated in 1995, and though I was totally in love with architecture, through a series of coincidences and some luck I soon found myself working in IT, which until then had been little more than a hobby.
I never got back to architecture, and soon I more or less stopped studying and researching it, but in the deep recesses of my mind things continued to work, though at a different pace. So lately I'm thinking about a research project on a period of architecture I've always found very intriguing, and usually overlooked by historians of architecture. The project will involve heavy researching in the field, which happens to be the city where I live. So it will hopefully soon be time to pick up a camera and start wandering the streets in my free time.
Photography has been the third of my passions (read: obsessions) in the past, alongside IT and architecture, so while I wait for my ever-procrastinating self to start working on this new project, I'm digging out my old pictures to scan them and put them online as a sort of mental training.
Since they will use up bandwidth, and very few people will be interested in them, my pictures category will stay confined to a row in the right menu, without making it to my blog pages. If you're interested, point your browser from time to time to my pictures area.
September 25, 2003
Sometimes being reminded of one's ignorance is not only instructive, but funny too. In a recent message on the armedbear-j-devel mailing list, in reply to a non-bug I recently submitted, Peter Graves (the J developer) used an acronym I had never seen before, DWIM (Paste would retain its current DWIMish behavior...).
After replying to the message, I made a quick search on DWIM, expecting to find a reference to some arcane editor of days past, but what I found was something completely different, rooted in the world of LISP gurus:
/dwim/ [acronym, "Do What I Mean" (not what I say)]
- Able to guess, sometimes even correctly, the result intended when bogus input was provided.
- The BBNLISP/INTERLISP function that attempted to accomplish this feat by correcting many of the more common errors. See hairy.
- Occasionally, an interjection hurled at a balky computer, especially when one senses one might be tripping over legalisms (see legalese).
Foldoc goes on to relate a notorious incident involving DWIM, which is worth reading:
Warren Teitelman originally wrote DWIM to fix his typos and spelling errors, so it was somewhat idiosyncratic to his style, and would often make hash of anyone else's typos if they were stylistically different. Some victims of DWIM thus claimed that the acronym stood for "Damn Warren's Infernal Machine!".
In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed "delete *$" to free up some disk space. (The editor there named backup files by appending "$" to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported "*$ not found, assuming you meant 'delete *'". It then started to delete all the files on the disk! The hacker managed to stop it with a Vulcan nerve pinch after only a half dozen or so files were lost.
The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type "delete *$" twice.
Sometimes it pays to be ignorant.....
A few days ago I finally got fed up with spam, so I decided to install a spam filter on my server. After reading around a bit, I settled on Spambayes, which (apart from being written in Python) looks like a very solid and well-maintained project.
I don't use Outlook (I run Linux both on my server and on my desktops), so using the Spambayes Outlook plugin was not an option. Since I'm using Maildir as the storage format for both SMTP and IMAP, I initially tried the Spambayes IMAP filter. Unfortunately, the filter is still at an early stage of development, and the IMAP protocol varies significantly among different server implementations. The main problems I had with the IMAP filter were that it marked all new messages as read after processing (this is apparently due to my IMAP server's lack of support for an obscure IMAP command), and that it crashed frequently.
So after a few hours of monitoring the IMAP filter's activity, I decided to change my approach. Reading around a bit, I discovered that the venerable procmail (which I used a lot until five or six years ago) now natively supports Maildirs.
A .qmail forward file, a .procmailrc recipe and a cron job later, I had a flawless Spambayes setup. In the past 3 days, Spambayes has worked admirably with minimal training, intercepting 99% of all spam and generating zero false positives. Definitely recommended.
My .qmail file simply passes everything along to procmail for delivery:
| preline /usr/bin/procmail
My .procmailrc recipe looks like this:
PATH=$HOME/bin:/usr/bin:/bin:/usr/local/bin:.
MAILDIR=$HOME/Maildir
DEFAULT=$MAILDIR/
LOGFILE=$HOME/procmail.log
LOCKFILE=$HOME/.lockmail
:0 fw
| /usr/bin/sb_filter.py
:0
* ^X-SpamBayes-Classification: spam
.INBOX.spambayes.spam/
:0
* ^X-SpamBayes-Classification: unsure
.INBOX.spambayes.unsure/
Notice how the trailing slash in the DEFAULT delivery identifies a Maildir storage. The rest is pretty self-explanatory, apart maybe from the folder namespace, which is the one used by default by my IMAP server:
- the first directive instructs procmail to feed the message to sb_filter.py
- sb_filter processes the message and adds the X-SpamBayes-Classification header with two values, the first one marking the message as either spam/ham/unsure, the second one displaying the exact numeric spam rating (from 0 to 1)
- the second and third directives match the header on its first value for spam and unsure, and deliver the message to the appropriate Maildir
- if a message is not matched by the second or third directive, it "falls off" the chain and gets delivered to DEFAULT, which in this case is my inbox
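The routing performed by the second and third directives can be sketched in Python, purely to illustrate the logic; procmail is what does the real work. The folder names come from my recipe, while the function name destination, the DEFAULT stand-in, and the "; " separator between the two header values are assumptions made for the example.

```python
from email import message_from_string

# Folder names copied from the .procmailrc recipe. The ";" separator between
# the two header values is an assumption about what sb_filter.py writes.
FOLDERS = {
    "spam": ".INBOX.spambayes.spam/",
    "unsure": ".INBOX.spambayes.unsure/",
}
DEFAULT = "./"  # hypothetical stand-in for $MAILDIR/, i.e. the inbox


def destination(raw_message):
    """Return the folder an already classified message would be delivered to."""
    message = message_from_string(raw_message)
    header = message.get("X-SpamBayes-Classification", "")
    # Only the first value (spam/ham/unsure) matters for routing.
    classification = header.split(";", 1)[0].strip().lower()
    # Anything that is neither spam nor unsure "falls off" to DEFAULT.
    return FOLDERS.get(classification, DEFAULT)


example = "X-SpamBayes-Classification: spam; 0.99\nSubject: test\n\nbody\n"
print(destination(example))
```

Unlike the procmail conditions, which are regular expressions anchored at the start of the header line, this sketch compares the first value exactly.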
To train the filter, I run a cron job every half hour that looks into two Maildir folders for spam and ham messages (the following lines are of course a single crontab line):
0,30 * * * * /usr/bin/sb_mboxtrain.py -d /home/ludo/.spambayes_hammie.db
-g /home/ludo/Maildir/.INBOX.spambayes.train_ham/
-s /home/ludo/Maildir/.INBOX.spambayes.train_spam/
-n >/dev/null 2>&1
Meaning, every half hour cron runs sb_mboxtrain, instructing it to use the .spambayes_hammie.db (previously created with sb_filter.py -n), and to fetch ham messages from the .INBOX.spambayes.train_ham Maildir, and spam messages from the .INBOX.spambayes.train_spam Maildir.
The Maildir directories where spam/unsure messages get delivered, and where you deposit messages to train SpamBayes, can be created either from your mail client or with the command-line utility maildirmake, provided with qmail and courier.
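If neither a mail client nor maildirmake is at hand, Python's standard mailbox module can create the same directory layout. This is a rough sketch, not what I actually ran: the folder names are the ones used in my recipe and crontab line, the helper make_maildirs is hypothetical, and a temporary directory stands in for ~/Maildir so that the snippet is safe to try.

```python
import mailbox
import os
import tempfile

# Folder names taken from the recipe and the crontab line; the temporary
# base directory is a stand-in for ~/Maildir.
FOLDER_NAMES = [
    ".INBOX.spambayes.spam",
    ".INBOX.spambayes.unsure",
    ".INBOX.spambayes.train_ham",
    ".INBOX.spambayes.train_spam",
]


def make_maildirs(base, names):
    """Create each named Maildir (with cur/, new/ and tmp/) under base."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        # create=True makes the directory and its three subdirectories.
        mailbox.Maildir(path, create=True)
        created.append(path)
    return created


base = tempfile.mkdtemp()
created = make_maildirs(base, FOLDER_NAMES)
```

Your IMAP server may expect additional marker files in subfolders, so check the folders show up in your mail client before relying on them.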
The last piece of information you need before running this setup is a .spambayesrc file in your home directory. Mine contains the following lines:
[Storage]
persistent_use_database = True
persistent_storage_file = ~/.spambayes_hammie.db
That's all, efficient and reliable spam protection in 5 minutes or so.