A poorman’s Excel using sqlite, apsw, and csv2sql

3 comments    Posted on April 15th, 2011

Excel has always been my go-to tool for exploring the abundance of csv-formatted data that I often find on the internet. However, every once in awhile large datasets like these H1B salary figures show up1, where the shear amount of records exceeds Excel’s 65k row limit.

In these scenarios I turn to sqlite and a recently discovered project, apsw for my data analysis-fu.

Importing data into sqlite3

To start, I load up the data using a script I created a while back called csv2sql.py 2. Although you can use sqlite to directly load in csv files, I often find the functionality chokes on slightly malformed csv files. So here’s my short script that reinvents the wheel a little better using Python:

Usage: csv2sql CSVFILE

Options:
  -h, --help  show this help message and exit
  -o DBNAME   output sqlite3 database file
  -t TABLE    default table name for data import
#!/usr/bin/env python
import sys
import csv
import re
import codecs
import optparse as op
import sqlite3 as db
from itertools import islice

parser = op.OptionParser()
parser.usage = "%prog CSVFILE"
parser.add_option('-o', action="store", dest="dbname",default="data.db", help="output sqlite3 database file")
parser.add_option('-t', action="store", dest="table", default="Records", help="default table name for data import")

def main(fpath, table, dbname):

    with codecs.open(fpath, encoding='utf-8', errors='ignore') as fh:
        rows    = csv.reader(fh)
        fields  = next(rows)
        slugify = lambda s: re.sub(r'\W+',"_", s).upper()
        schema  = "\n\t\t" + ",\n\t\t".join("%s text" % k for k in map(slugify,fields))

        print("Creating table: %s\n" % dbname)
        conn   = db.connect(dbname)
        c      = conn.cursor()

        # create table with schema
        stmt   = "CREATE TABLE %s (%s);" % (table, schema)
        c.execute(stmt)

        print("Schema")
        print("======")
        print("\n%s\n" % stmt)

        # insert
        print("Adding records to database")
        stmt   = "INSERT INTO %s (%s) VALUES (%s);"
        stmt   = stmt % (table, ", ".join(fields), ", ".join(['?'] * len(fields)))
        to_add = list(islice(rows, 2000))
        while to_add:
            sys.stdout.write('.')
            c.executemany(stmt, to_add)
            to_add = list(islice(rows, 2000))
            conn.commit()

if __name__ == '__main__':
    opts, args = parser.parse_args()
    main(fpath=args[0], table=opts.table, dbname=opts.dbname)

Exploring the data with apsw

After importing the csv file, I typically run SQL queries against the database in a sqlite shell to explore the data. Apsw is a Python wrapper for sqlite3, which also includes an enhanced shell with features like tab completions and output modes for json and Python tuples. These two things combined make interactive data exploration extremely pleasant.

Once you have apsw installed 3, create a short alias in your .bashrc file so that you can invoke the enhanced sqlite shell from the commandline:

alias sqlite='python -c "import apsw;apsw.main()"'

With this in place, you can invoke the shell by typing sqlite data.db on your terminal prompt. Those familiar with the sqlite shell know that you can have it output the results of a query in several different formats (i.e. column, csv, line etc) . And to set the output mode you simply type .mode <MODENAME> into the prompt before executing your queries.

The two most useful output formats that apsw provides is the “json” and “python” modes. Here’s what the following SQL query looks like after setting up “json” and “python” modes respectively.

SELECT job_title, wage_from, avg(wage_from)
  FROM main
  WHERE job_title like "%SOFTWARE%" COLLATE NOCASE
  GROUP BY job_title
  ORDER by wage_from
  LIMIT 10;

json output

tuple output

As you can see in the above example, this data is readily consumable in any standard python shell. Cut and paste the output into a python interpreter for further analysis, and use matplotlib to further visualize the data.


  1. Mentioned on Hacker News
     

  2. This script assumes that the csv file’s first row contains the field names for the data. This is how it automatically creates the table schema for you
     

  3. To install apsw, you’ll need to make sure that the sqlite development libraries are in your library and include paths. I finally was able to install apsw after using brew install sqlite and setting the libraries variable in setup.py to ["/usr/local/lib"] See Roger’s comment below for a way to automatically retrieve all of the dependencies:
     

3 responses so far

  1. Roger says:

    I’m the author of APSW. For footnote 3 that is the hardest way to install APSW! If you follow the recommended instructions then APSW will download the latest SQLite 3 source for you automatically and use it.

    http://apidoc.apsw.googlecode.com/hg/build.html#recommended

    Your link is the Microsoft JET database version of the data. The CSV link is

    http://www.flcdatacenter.com/download/H1B_2010_TEXT.zip

    APSW has a .import command and it isn’t as broken as the standard SQLite shell version. Behind the scenes it uses the standard Python CSV module which is generally very good.

    The .import command retains compatibility with the SQLite shell one which means you have set separators and create the table yourself. That is far too much boilerplate so I’ve created a ticket for an autoimport command that does all the hard work.

    http://code.google.com/p/apsw/issues/detail?id=116

  2. Huy says:

    Hey Roger,

    That makes it a lot easier to install now. I’ve changed the H1B links above and made a reference about your installation comments in the footnotes.

    Also, the .autoimport functionality would be super useful. I’ll be your first user if you end up incorporating it into apsw. Thanks for making such a great tool!

  3. Rajesh says:

    This is pretty good. One bug is that you slugify the column names when you create the table, but do not when you add to the database. If you have a bad character, it will throw an error.

    I fixed the error by replacing .join(fields)
    with .join(“%s” % k for k in map(slugify,fields))

Leave a reply