One of the things I like best about Python is the interactive console. I often use it for quick manipulations on text files, and every time I wonder how I managed before learning Python, when I wrote Perl or PHP scripts for similar things (yes, I know that Perl has useful command-line options for stuff like that, but with the Python console you can poke around and see the data you're manipulating interactively, get help on commands, etc.).
So today I set about removing unneeded CSS classes from a huge HTML file I did not produce myself.
Getting the data from a file is a one-liner:
/-(ludo@pippozzo)-(27/pts)-(15:44:06:Sat Sep 06)--
-($:~)-- python
Python 2.3 (#1, Aug 5 2003, 15:11:52)
[GCC 3.2.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> all = file('xxxxxxxxxx.html', 'r').read()
Then we import the re module and obtain a list of all CSS classes used in the file, removing duplicates:
>>> import re
>>> r = re.compile('class="([^"]+)"')
>>> styles = [s for s in r.findall(all) if s not in locals()['_[1]'].__self__]
The duplicate-removal comprehension looks like a clever Perlish hack, but I'm in the console and it's nice to be able to do everything in one line. I copied it from the Python Cookbook entry "The Secret Name of List Comprehensions" by Chris Perkins, who warns: "It should come as no surprise to anyone that this is a totally undocumented, seat-of-your-pants exploitation of an implementation detail of the Python interpreter. There is absolutely no guarantee that this will continue to work in any future Python release."
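For what it's worth, the same order-preserving duplicate removal can be done with documented features only. A minimal sketch, assuming Python 3.7 or later (where dictionaries are guaranteed to keep insertion order); the `sample` string is made up and stands in for the contents of the real file:

```python
import re

# Made-up sample standing in for the contents of the HTML file.
sample = '<p class="note">a</p><p class="warn">b</p><p class="note">c</p>'

class_attr = re.compile(r'class="([^"]+)"')

# dict.fromkeys keeps only the first occurrence of each key, in order,
# so turning the keys back into a list drops the duplicates.
styles = list(dict.fromkeys(class_attr.findall(sample)))
print(styles)
```

It is a bit more typing than the one-liner hack, but it does not depend on an interpreter implementation detail.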
Now that we have the list of classes in use, we can remove the unneeded ones and save the processed content back to the original file:
>>> r = re.compile(r"""(?ms)^(\S*\.(\S+)\s+{[^}]+}\n)""")
>>>
>>> for style in r.findall(all):
...     if not style[1] in styles:
...         all = all.replace(style[0], '')
...
>>> file('xxxxxxxxxx.html', 'w').write(all)
Storing the styles in a dictionary is more efficient (not that it matters in this quick console run), and eliminates the need for the duplicate-removal hack. Here is a version using a dictionary:
>>> import re
>>> all = file('xxxxxxxxxx.html', 'r').read()
>>> r = re.compile('class="([^"]+)"')
>>> styles = {}
>>> for s in r.findall(all):
...     styles[s] = None
...
>>> r = re.compile(r"""(?ms)^(\S*\.(\S+)\s+{[^}]+}\n)""")
>>> for style in r.findall(all):
...     if not style[1] in styles:
...         all = all.replace(style[0], '')
...
>>> file('xxxxxxxxxx.html', 'w').write(all)
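The session above relies on the `file()` builtin, which Python 3 no longer has. Here is a rough port of the same steps as a function, assuming Python 3; a set stands in for the dictionary, the braces in the pattern are escaped for clarity, and the name `strip_unused_rules` and the file name `page.html` are placeholders of mine:

```python
import re

CLASS_ATTR = re.compile(r'class="([^"]+)"')
# Same idea as the pattern above: a selector ending in ".name" at the
# start of a line, whitespace, a {...} body, and the trailing newline.
CSS_RULE = re.compile(r"(?ms)^(\S*\.(\S+)\s+\{[^}]+\}\n)")


def strip_unused_rules(text):
    """Return text without the class rules that no class="..." refers to."""
    used = set(CLASS_ATTR.findall(text))
    for rule, name in CSS_RULE.findall(text):
        if name not in used:
            text = text.replace(rule, '')
    return text


# Hypothetical usage; 'page.html' is a placeholder file name.
# with open('page.html') as f:
#     content = f.read()
# with open('page.html', 'w') as f:
#     f.write(strip_unused_rules(content))
```

Like the console version, it treats the whole value of a class attribute as a single name and scans the entire file.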
Update: the lines above worked for the particular file I was editing; a general solution would probably need a few changes:
- In the second regexp, used to identify style class declarations, I assume the class identifier is anchored at the beginning of a line, which is not always the case
- I look for style class declarations in the whole file, which is pretty pointless (but saves typing a few lines of code in the console and works for the specific HTML file in question); style class declarations are obviously inside a <style></style> block, so it's better to limit the scope of the second findall() to the contents of the style block(s), speeding up the search/replace and reducing the chance of errors in the regexp match (for example, matching blocks of C/Java/PHP source in the body of the document)
- maybe use a scanner object instead of the second findall, and replace using the match position, as Fredrik Lundh explains in Using Regular Expressions for Lexical Analysis
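The changes listed above could be sketched roughly as follows. This is only an illustration under my own assumptions, not a tested general solution: it assumes Python 3, uses `re.sub` with a callback in place of a scanner object (so each replacement happens at the match position), and all the names (`prune_style_blocks`, `STYLE_BLOCK`, and so on) are made up for the example. It also splits `class="a b"` into two names, which the console session did not do.

```python
import re

STYLE_BLOCK = re.compile(r"(?is)(<style\b[^>]*>)(.*?)(</style>)")
CLASS_ATTR = re.compile(r'(?i)class="([^"]*)"')
# One CSS rule: selector text, then a {...} body (no nesting handled).
RULE = re.compile(r"([^{}]+)\{[^}]*\}")
# A single simple selector such as ".name" or "p.name".
SIMPLE_CLASS_SELECTOR = re.compile(r"[^\s,.]*\.([\w-]+)")


def prune_style_blocks(html):
    """Drop unused simple class rules, looking only inside <style> blocks."""
    used = set()
    for value in CLASS_ATTR.findall(html):
        used.update(value.split())  # class="a b" counts as two names

    def keep_or_drop(match):
        selector = SIMPLE_CLASS_SELECTOR.fullmatch(match.group(1).strip())
        if selector and selector.group(1) not in used:
            return ''  # unused class rule: remove it in place
        return match.group(0)  # anything else is left untouched

    def clean_block(match):
        opening, css, closing = match.groups()
        return opening + RULE.sub(keep_or_drop, css) + closing

    return STYLE_BLOCK.sub(clean_block, html)
```

The sketch is deliberately conservative: rules with grouped or compound selectors, pseudo-classes, or anything else it does not recognize are kept, and text outside the style blocks is never modified.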