Excel has always been my go-to tool for exploring the abundance of csv-formatted data I find on the internet. Every once in a while, however, a large dataset like these H1B salary figures shows up [1], where the sheer number of records exceeds Excel's 65k row limit.

In these scenarios I turn to sqlite and a recently discovered project, apsw, for my data analysis-fu.

Importing data into sqlite3

To start, I load the data using a script I wrote a while back called csv2sql.py [2]. Sqlite can import csv files directly; in the stock sqlite3 shell, that looks something like this (data.csv and records are placeholder names):
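
.mode csv
.import data.csv records

In practice, though, I often find that this built-in import chokes on slightly malformed csv files. So here's my short script that reinvents the wheel a little better using Python: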

Usage: csv2sql CSVFILE

Options:
  -h, --help  show this help message and exit
  -o DBNAME   output sqlite3 database file
  -t TABLE    default table name for data import
#!/usr/bin/env python
import sys
import csv
import re
import codecs
import optparse as op
import sqlite3 as db
from itertools import islice

parser = op.OptionParser()
parser.usage = "%prog CSVFILE"
parser.add_option('-o', action="store", dest="dbname", default="data.db", help="output sqlite3 database file")
parser.add_option('-t', action="store", dest="table", default="Records", help="default table name for data import")

def slugify(s):
    # normalize a csv header into a safe SQL column name
    return re.sub(r'\W+', "_", s).upper()

def main(fpath, table, dbname):

    with codecs.open(fpath, encoding='utf-8', errors='ignore') as fh:
        rows   = csv.reader(fh)
        # the first row holds the field names (see note 2)
        fields = [slugify(f) for f in next(rows)]
        schema = "\n\t\t" + ",\n\t\t".join("%s text" % f for f in fields)

        print("Creating database: %s\n" % dbname)
        conn = db.connect(dbname)
        c    = conn.cursor()

        # create the table using the slugified headers as columns
        stmt = "CREATE TABLE %s (%s);" % (table, schema)
        c.execute(stmt)

        print("Schema")
        print("======")
        print("\n%s\n" % stmt)

        # insert records in batches of 2000, committing after each batch
        print("Adding records to database")
        stmt   = "INSERT INTO %s (%s) VALUES (%s);"
        stmt   = stmt % (table, ", ".join(fields), ", ".join(['?'] * len(fields)))
        to_add = list(islice(rows, 2000))
        while to_add:
            sys.stdout.write('.')
            sys.stdout.flush()
            c.executemany(stmt, to_add)
            conn.commit()
            to_add = list(islice(rows, 2000))
        conn.close()

if __name__ == '__main__':
    opts, args = parser.parse_args()
    main(fpath=args[0], table=opts.table, dbname=opts.dbname)
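
To import the H1B csv, run the script like so (the filenames here are placeholders; -t main matches the table name the queries below assume):

python csv2sql.py -o data.db -t main h1b_records.csv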

Exploring the data with apsw

After importing the csv file, I typically run SQL queries against the database in a sqlite shell to explore the data. Apsw is a Python wrapper for sqlite3 that also includes an enhanced shell with features like tab completion and output modes for json and Python tuples. Together, these make interactive data exploration extremely pleasant.

Once you have apsw installed [3], create a short alias in your .bashrc file so that you can invoke the enhanced sqlite shell from the command line:

alias sqlite='python -c "import apsw;apsw.main()"'

With this in place, you can invoke the shell by typing sqlite data.db at your terminal prompt. Those familiar with the sqlite shell know that you can have it output the results of a query in several different formats (e.g. column, csv, line, etc.). To set the output mode, simply type .mode <MODENAME> into the prompt before executing your queries.
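
For example, to switch the shell into json output before firing off a query:

.mode json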

The two most useful output modes that apsw provides are "json" and "python". Here's what the following SQL query looks like after setting the "json" and "python" modes respectively.

SELECT job_title, wage_from, avg(wage_from)
  FROM main
  WHERE job_title LIKE '%SOFTWARE%' COLLATE NOCASE
  GROUP BY job_title
  ORDER BY wage_from
  LIMIT 10;

[Screenshot: json output]

[Screenshot: tuple output]

As you can see in the above example, this data is readily consumable in any standard python shell. Cut and paste the output into a python interpreter for further analysis, and use matplotlib to visualize the results.
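
Here's a minimal sketch of that workflow. The rows tuple stands in for whatever you paste out of the shell's "python" mode, and the values in it are placeholders rather than real H1B figures:

import matplotlib.pyplot as plt

# Placeholder rows standing in for output pasted from the apsw shell's
# "python" mode. These values are illustrative, not real H1B data.
rows = (
    ("SOFTWARE DEVELOPER", "50000", 50000.0),
    ("SOFTWARE ENGINEER", "60000", 60000.0),
)

titles = [r[0] for r in rows]
wages  = [r[2] for r in rows]

# quick horizontal bar chart of avg(wage_from) per job title
plt.barh(titles, wages)
plt.xlabel("avg(wage_from)")
plt.tight_layout()
plt.show()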


  1. Mentioned on Hacker News  

  2. This script assumes that the csv file's first row contains the field names for the data. This is how it automatically creates the table schema for you.  

  3. To install apsw, you'll need to make sure that the sqlite development libraries are in your library and include paths. I was finally able to install apsw after using brew install sqlite and setting the libraries variable in setup.py to ["/usr/local/lib"]. See Roger's comment below for a way to automatically retrieve all of the dependencies.