2012-04-16

Using Python to Extract Data From a Smereka Created SQLite Database

I've been migrating my snippets of stuff to thinkery.me, but for more private/personal data, I've been using Smereka: It's open source, with a (for me) intuitive GUI, fast --- and uses sqlite for its backend database.

Unfortunately, its export absolutely sucks, and it looks like Smereka's author is trying to direct folks to his commercial version, and probably won't be adding many features/fixes to the free version.

So, having become set on doing something with all my random files (dot configs, equipment tracking, cake recipes (all the more important when navigating portals), contacts, TMF Yearmix track listings, pppd dial scripts, etc) beyond dumping them all in a backup directory in an encrypted container (and backing up to dropbox and Ubuntu One), I wanted to make sure I could liberate all the data, if I needed to, from whichever program I used. (Siderant: I looked at Evernote, Treeline, etc... why are there so many programs that do such a simple thing so wrongly?)  Smereka was nearly perfect, except for putting all items into blobs, which my limited abilities don't allow me to retrieve directly with just an interactive sqlite3 session.

Enter Python. (And ni saying knights, spam, laden swallows and show tunes sung from crosses.)

I've been meaning to 'take a look' at python for the better part of a decade, but never had a big enough incentive to push me into doing so.  In one of those weird instances of perfect timing, a good friend linked me to Alex Munroe's rant about PHP, unaware I was bracing myself for some soul destroying C++ spelunking. I read through it, nodding my head and agreeing with most of the points, and was ready to drink the koolaid by the time I got to the end.

What follows is my first ever python program; it's the result of (literally!) a couple days of learning the language.  I'm definitely sold on Python; php was always, "Yeah, it's flawed, but what else is there, for quick scripts and web backends?" --- now, I know the answer is "Python."

This code is likely incredibly naive.  There are almost certainly better/more elegant ways of accomplishing what I'm doing.  However, for my purposes, all I needed was a way of being certain the stuff I was putting into Smereka could be liberated in the future, should I ever need to.

Thus, I'm simply extracting all the data (which, in my case, is text. Smereka also allows images, rich text, etc) and dumping it into a single file.  It would be trivial to dump each record to its own file, or use the parent folder as a containing file, or whatever.  Since the searching inside of Smereka works just fine, the only point for me was to be able to have something greppable, and zipped, 'just in case'.  So, I'm not even bothering to grab the tags or descriptions, again which would be trivial. (Technical: I ignore the DataType: File, as well as the tags in StringsTbl.) I'm skipping error checking, or even politeness, like adding 'global var' to defs. (Obviously, I'm only reading...).  I don't worry about memory management, or if it'd be better to create a table with folder names, rather than walk backwards each time... and, finally, I half-assed dealing with the unicode 'issues'. (Windows directory listings from over a decade ago have the obligatory broken bytes to make python kick up a UnicodeDecodeError.)

Note: This code was tested with Python 2.7.1+. YMMV.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sql
import sys
import zlib
import codecs

DATABASE = "db.sqlite"

DELIM = "\\"
folders = {}
titems = {}

FQUERY = "SELECT ItemId, ParentId, KeyText FROM Items WHERE \
    DataType = 'Folder|24A1B35E-B391-184F-12C3-FBB2A40B9B37'"
DQUERY = "SELECT ItemId, ParentId, KeyText FROM Items WHERE \
    DataType = 'Plain Text|49F238EA-38B0-1744-CD64-1947630FA367'"

def traverse_tree(node, p):
    """traverse_tree -
    : Walks backward with the parentID until no
    : more parents are found.
    """
    while node['parent'] != 0:
        p.append(node['keytext']+DELIM)
        node = folders[node['parent']]
    p.append(node['keytext']+DELIM)

def debug_folderwalk():
    """Debug function to verify tree iteration."""
    for v in folders.values():
        if v['parent'] == 0:
            print "DEBUG: Path: %s" % v['keytext'] + DELIM
        else:
            pathexpand = []
            traverse_tree(v, pathexpand)
            print "DEBUG: Path: "+ "".join(reversed(pathexpand))

def debug_rawdata(d):
    """Debug function to show raw data from database"""
    for key, value in d.iteritems():
        print "DEBUG: %s-%s" % (key, value)
    
def path_parse(id):
    """path_parse - 
    : Returns the folder tree in reverse
    : order of depth for the given item.
    """
    if titems[id]['parent'] == 0: 
        return DELIM
    else:
        pathexpand = []
        traverse_tree(folders[titems[id]['parent']], pathexpand)
        return "".join(reversed(pathexpand))

def decorate(p, i, v):
    """decorate - 
    : Decorates output for each text file
    : to make it clear which tree node this is,
    : as well as establing the start and end of item.
    """
    t = "/* "+"*"*63+"\n" # top
    r = " "*3+"* " # right
    b = " "*3+"*"*63+" */\n" # bottom
    s = "\n/* "+"-"*63+" */\n" # eof marker
    return t+r+"Path: %s\n" % (p)+r+"Item: %s\n" % (i)+b+"%s" % (v)+s

def populate_dict(c, q, d):
    """populate_dict -
    : Takes a cursor, query and dictionary, and populates 
    : the dictionary with the results of the query.
    """
    c.execute(q)
    while True:
        row = c.fetchone()
        if row == None: break
        d[row[0]] = dict({"parent":row[1], "keytext":row[2]})

byteswritten = 0

f = codecs.open("data", "wb", 'utf-8')

with sql.connect(DATABASE) as con:
        cur = con.cursor()
        populate_dict(cur, FQUERY, folders)
        populate_dict(cur, DQUERY, titems)
        for item in titems:
            cur.execute("SELECT TheData FROM Blobs WHERE ItemId = ?", (item,))
            cblob = cur.fetchone()
            zobj = zlib.decompressobj()
            dblob = unicode(zobj.decompress(cblob[0])[6:], 'utf-8')
            buf = decorate(path_parse(item), titems[item]['keytext'], dblob)
            f.write(buf)
            byteswritten += len(buf)

f.close()

print "Wrote {:,d} bytes from {:,d} item records.\n"\
    .format(byteswritten, len(titems))

#debug_rawdata(folders)
#debug_rawdata(titems)
#debug_folderwalk()

Example output (ignore the cobwebs...):

/* ***************************************************************
   * Path: Equipment\APC Technical Information\
   * Item: APC Backups BX1500G Original
   *************************************************************** */
Model: Back-UPS BX1500G
Serial number: ------------
Firmware revision: 866.L4 .D
USB firmware revision: L4 
Result of last manual self-test: Not Recorded
Last manual self-test date: Not Recorded
Last battery replacement: 9/12/2010

/* --------------------------------------------------------------- */
/* ***************************************************************
   * Path: Equipment\APC Technical Information\
   * Item: APC Backups BX1500G Replacement
   *************************************************************** */
Model: Back-UPS BX1500G
Serial number: -----------
Firmware revision: 866.L5 .D
USB firmware revision: L5 
Result of last manual self-test: Not Recorded
Last manual self-test date: Not Recorded
Last battery replacement: 12/4/2010

/* --------------------------------------------------------------- */
/* ***************************************************************
   * Path: Scripts\
   * Item: uudecode
   *************************************************************** */
grep -v ":\ " Alt.binaries.multimedia.repost | grep -v "From\ " | grep -v '^$' | uudecode -c

/* --------------------------------------------------------------- */
/* ***************************************************************
   * Path: deprecated\linux_varg\rc\
   * Item: linux_winmx_forward
   *************************************************************** */
ipmasqadm portfw -l
prot localaddr            rediraddr               lport    rport  pcnt  pref  
TCP  xxxxxxxxxx.xxxxxxxx.dyndns.org xxxxxxxx.xxxxxxxx.dyndns.org     6699     6699    10    10
UDP  xxxxxxxxxx.xxxxxxxx.dyndns.org xxxxxxxx.xxxxxxxx.dyndns.org     6257     6257    10    10
ipmasqadm portfw -a -P tcp -L 10.0.0.1 4000 -R 10.0.0.3 4000
iptables -t nat -A PREROUTING -p tcp -i ppp0 --dport 6699 -j DNAT --to 10.0.0.3:6699
iptables -t nat -A PREROUTING -p udp -i ppp0 --dport 6257 -j DNAT --to 10.0.0.3:6257

/* --------------------------------------------------------------- */

General References:

Smereka - "Extensible Personal Freeform Database and Personal Information Manager"
SQLite Database Browser - Very useful app to play with sqlite databases graphically.

Python References:

PHP - A Fractal of Bad Design - Wonderful php hate manifesto.
Why Python? - Advocacy post by Eric S. Raymond, from all the way back in 2000!
The Python Tutorial - Start Here.
Learn Python The Hard Way - More Beginner Python
SQLite Python Tutorial - Useful for the database bits.
All About Python and Unicode - A staggeringly helpful document.

No comments: