Sat, 01 Sep 2007

Recovering a Pyblosxom blog using liferea's RSS cache

My buddy who used to host lewk.org didn't pay his bills, so his server got taken down last week. What sucks is I that never backed up my Pyblosxom data. What doesn't suck is that thankfully Liferea, my RSS reader, did for me.

Grepping through ~/.liferea_1.2/cache/feeds, I was able to find my blog cached in some XML format. Then I wrote a little bit of code to re-create my Pyblosxom entry structure with the proper filenames and timestamps.

#!/usr/bin/python -tt
"""
 Turns XML into pyblosxom blog entries.

 It parses BLOG_XML pulling out blog entires in the form of:

     <feed version="1.1">
       <item>
         <title></title>
         <description></description>
         <source>http://foo.com/blog/2007/08/20/bar.html</source>
         <time>1187621268</time>
       </item>
     </feed>

 The file '2007/08/20/bar.txt' will be created in pyblosxom format with
 the appropriate timestamp.  The #mdate is used by the pyblosxom.vim plugin.

     title
     #mdate Aug 20 10:47:48 2007
     <p>description</p>
"""

import os
import time

try: from xml.etree import cElementTree
except ImportError: import cElementTree
iterparse = cElementTree.iterparse

entries = {} # { 'title' : <Element> }

BLOG_XML = 'blog.xml'
BLOG_ROOT = 'http://foo.com/blog/'

def getField(elem, field):
    for child in elem:
        if child.tag == field:
            return child.text

## Pull out all feed items, removing older duplicates
for event, elem in iterparse(BLOG_XML):
    if elem.tag == 'feed':
        for child in elem:
            if child.tag == 'item':
                title = getField(child, 'title')
                if entries.has_key(title):
                    if int(getField(child, 'time')) > \
                       int(getField(entries[title], 'time')):
                        entries[title] = child
                else:
                    entries[title] = child

for title, entry in entries.items():
    source = getField(entry, 'source').replace(BLOG_ROOT, '')
    source = source.replace('.html', '.txt')
    if not os.path.isdir(os.path.dirname(source)):
        os.makedirs(os.path.dirname(source))
    output = file(source, 'w')
    output.write(title + '\n')
    mtime = time.localtime(int(getField(entry, 'time')))
    mdate = time.strftime("%b %e %H:%M:%S %Y", mtime)
    output.write("#mdate %s\n" % mdate)
    output.write("<p>%s</p>\n" % getField(entry, 'description'))
    output.close()
    timestamp = time.strftime("%y%m%d%H%M", mtime)
    os.system("touch -t %s %s" % (timestamp, source))

It also adds an #mdate tag into each entry, which read by the spiffy pyblosxom mdate vim hack that Jordan Sissel wrote to restore each entries original timestamp after editing. His code only works on FreeBSD at the moment, so I started a pyblosxom.vim plugin that works on Linux (hopefully it will eventually support both, along with a bunch of other handy functions). You can find all of this code in my mercurial repo: hg.lewk.org/xml2pyblosxom



posted at: 11:44 | link | Tags: , , | 0 comments


Name:


E-mail:


URL:


Comment: