Excel has always been my go-to tool for exploring the abundance of csv-formatted data that I often find on the internet. However, every once in a while large datasets like these H1B salary figures show up [1], where the sheer number of records exceeds Excel's 65k row limit.
In these scenarios I turn to sqlite and a recently discovered project, apsw, for my data analysis-fu.
Importing data into sqlite3
To start, I load up the data using a script I created a while back called csv2sql.py [2]. Although you can use sqlite to load csv files directly, I often find that functionality chokes on slightly malformed csv files. So here's my short script that reinvents the wheel a little better using Python:
```
Usage: csv2sql CSVFILE

Options:
  -h, --help  show this help message and exit
  -o DBNAME   output sqlite3 database file
  -t TABLE    default table name for data import
```
```python
#!/usr/bin/env python
import sys
import csv
import re
import codecs
import optparse as op
import sqlite3 as db
from itertools import islice

parser = op.OptionParser()
parser.usage = "%prog CSVFILE"
parser.add_option('-o', action="store", dest="dbname", default="data.db",
                  help="output sqlite3 database file")
parser.add_option('-t', action="store", dest="table", default="Records",
                  help="default table name for data import")

def main(fpath, table, dbname):
    with codecs.open(fpath, encoding='utf-8', errors='ignore') as fh:
        rows = csv.reader(fh)
        # the first row holds the field names; sanitize them into
        # valid SQL column names
        slugify = lambda s: re.sub(r'\W+', "_", s).upper()
        fields = [slugify(f) for f in next(rows)]
        schema = "\n\t\t" + ",\n\t\t".join("%s text" % k for k in fields)

        print("Creating database: %s\n" % dbname)
        conn = db.connect(dbname)
        c = conn.cursor()

        # create table with schema
        stmt = "CREATE TABLE %s (%s);" % (table, schema)
        c.execute(stmt)
        print("Schema")
        print("======")
        print("\n%s\n" % stmt)

        # insert records in batches of 2000
        print("Adding records to database")
        stmt = "INSERT INTO %s (%s) VALUES (%s);"
        stmt = stmt % (table, ", ".join(fields), ", ".join(['?'] * len(fields)))
        to_add = list(islice(rows, 2000))
        while to_add:
            sys.stdout.write('.')
            c.executemany(stmt, to_add)
            to_add = list(islice(rows, 2000))
        conn.commit()

if __name__ == '__main__':
    opts, args = parser.parse_args()
    main(fpath=args[0], table=opts.table, dbname=opts.dbname)
```
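To see what the sanitizing step does, here's a quick sketch of the slugify helper applied to a messy header row (the column names below are made up for illustration):

```python
import re

# same helper as in csv2sql.py: collapse runs of non-word
# characters into underscores and uppercase the result
slugify = lambda s: re.sub(r'\W+', "_", s).upper()

# hypothetical header row from a csv file
fields = ["Job Title", "Wage From", "Employer's Name"]
columns = [slugify(f) for f in fields]
print(columns)  # → ['JOB_TITLE', 'WAGE_FROM', 'EMPLOYER_S_NAME']
```

Running the slugified names through both the CREATE TABLE and INSERT statements keeps the schema and the inserts consistent, even when the csv headers contain spaces or punctuation.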
Exploring the data with apsw
After importing the csv file, I typically run SQL queries against the database in a sqlite shell to explore the data. apsw is a Python wrapper for sqlite3 that also includes an enhanced shell with features like tab completion and output modes for JSON and Python tuples. These two things combined make interactive data exploration extremely pleasant.
Once you have apsw installed [3], create a short alias in your .bashrc file so that you can invoke the enhanced sqlite shell from the command line:
alias sqlite='python -c "import apsw;apsw.main()"'
With this in place, you can invoke the shell by typing `sqlite data.db` at your terminal prompt. Those familiar with the sqlite shell know that it can output the results of a query in several different formats (e.g. column, csv, line, etc.). To set the output mode, simply type `.mode <MODENAME>` at the prompt before executing your queries.
The two most useful output formats that apsw provides are the "json" and "python" modes. Here's what the output of the following SQL query looks like in "json" and "python" modes respectively.
```sql
SELECT job_title, wage_from, avg(wage_from)
FROM main
WHERE job_title LIKE "%SOFTWARE%" COLLATE NOCASE
GROUP BY job_title
ORDER BY wage_from
LIMIT 10;
```
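The same kind of query can of course also be run programmatically. Here's a minimal, simplified sketch against an in-memory database with a few made-up rows standing in for the real H1B records (the table shape mirrors what csv2sql.py would create, with all columns typed as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE main (JOB_TITLE text, WAGE_FROM text)")

# hypothetical rows for illustration only
c.executemany("INSERT INTO main VALUES (?, ?)", [
    ("SOFTWARE ENGINEER", "85000"),
    ("SOFTWARE ENGINEER", "95000"),
    ("SENIOR SOFTWARE DEVELOPER", "120000"),
    ("ACCOUNTANT", "60000"),
])

# sqlite coerces the text wages to numbers inside avg()
rows = c.execute("""
    SELECT JOB_TITLE, AVG(WAGE_FROM) FROM main
    WHERE JOB_TITLE LIKE '%SOFTWARE%'
    GROUP BY JOB_TITLE
    ORDER BY AVG(WAGE_FROM)
    LIMIT 10;
""").fetchall()
print(rows)
# → [('SOFTWARE ENGINEER', 90000.0), ('SENIOR SOFTWARE DEVELOPER', 120000.0)]
```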
As you can see in the above example, this data is readily consumable in any standard Python shell. Cut and paste the output into a Python interpreter for further analysis, and use matplotlib to visualize the data.
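For instance, the tuples emitted by the "python" output mode can be pasted straight into an interpreter session and worked on directly. A small sketch, with made-up rows in place of the real query output:

```python
import statistics

# hypothetical rows pasted from the shell's "python" output mode
rows = [
    ("SOFTWARE ENGINEER", 85000.0),
    ("SOFTWARE DEVELOPER", 78000.0),
    ("SENIOR SOFTWARE ENGINEER", 110000.0),
]

wages = [wage for _title, wage in rows]
mean_wage = statistics.mean(wages)
print("mean wage:", mean_wage)  # → mean wage: 91000.0

# from here the data can go straight into matplotlib, e.g.:
#   import matplotlib.pyplot as plt
#   plt.bar(range(len(rows)), wages)
#   plt.show()
```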
[2] This script assumes that the csv file's first row contains the field names for the data; this is how it automatically creates the table schema for you. ↩
[3] To install apsw, you'll need to make sure that the sqlite development libraries are in your library and include paths. I was finally able to install apsw after running `brew install sqlite` and setting the libraries variable in setup.py to ["/usr/local/lib"]. See Roger's comment below for a way to automatically retrieve all of the dependencies. ↩