I’m inventing a new data format (as blasphemous as it sounds). It’s flexible, human readable, easy to produce, and best of all, nearly impossible to screw up parsing.

I’m doing it to replace CSV files, because you shouldn’t have to worry about quoting, escaping, or whether your “comma” separated values are really semicolon delimited or, even worse, tab delimited.

The new data format is called JSON Streaming Record, or JSRec for short. While I say it’s “new”, I’m sure many of you have already produced or consumed this type of data at some point in your career.

Here’s how it’s defined:

  1. Files of this format have .jsrec as their file extension
  2. Each data line in the file is a single JSON object (a hash map)
  3. Empty lines and lines beginning with ‘#’ are considered comments and ignored during parsing
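
Rule 2 is what makes quoting and escaping a non-issue. Here’s a small sketch of why: Python’s json.dumps escapes line breaks and quotes inside strings, so even an awkward value ends up on one physical line. The tricky record below is made up for illustration.

```python
import json

# A made-up record whose value would need careful quoting in a CSV file.
tricky = {"foo": 1, "bar": 'commas, "quotes", and a\nline break'}

# With default arguments, json.dumps emits no raw line breaks: the newline
# inside the string is written as the two characters \ and n.
line = json.dumps(tricky)

single_line = "\n" not in line
round_trips = json.loads(line) == tricky

print(line)
print(single_line, round_trips)  # True True
```

The encoded record stays on one line, and decoding it gives back exactly the value that went in.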

Here’s an example file, foobar.jsrec:

{"foo":1, "bar": "marry"}
{"foo":11, "bar": "had a"}
{"foo":21, "bar": "little lamb"}

# some comments
{"foo":33, "bar": "more data"}

Here’s the code to parse and encode this data format:

import json

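def load_jsrec(filename):
    """Yield one record per data line of a .jsrec file."""
    # The with block closes the file when the generator finishes or is closed.
    with open(filename) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            yield json.loads(line)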

def dump_jsrec(filename, records):
    """writes a .jsrec file"""
    with open(filename, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

A new use case: building data processing pipelines with JSRec

Because each line in a JSRec file contains all the information necessary to parse a record, you can use it to pipe output from one program to another:

cat foobar.jsrec | progA | progB

As long as the programs you’re using understand JSRec, you can chain them together. This is HUGE because it makes building data processing pipelines on the command line modular and simple.
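
To make that concrete, here’s a rough sketch of what one stage like progA might look like. It’s only an illustration under my own assumptions: transform_lines, keep_large_foo and main are names I made up, and the “foo greater than 10” rule is an arbitrary example filter.

```python
import json
import sys


def transform_lines(lines, transform):
    """Yield JSRec output lines for the records that transform keeps.

    transform is a hypothetical per-record callback: it returns the record
    to emit, or None to drop that record from the stream.
    """
    for line in lines:
        line = line.strip()
        # Blank lines and comments carry no data, so they are not forwarded.
        if not line or line.startswith("#"):
            continue
        result = transform(json.loads(line))
        if result is not None:
            yield json.dumps(result) + "\n"


def keep_large_foo(rec):
    """Example transform: keep records whose "foo" value exceeds 10."""
    return rec if rec.get("foo", 0) > 10 else None


def main():
    # Read records from stdin and write the surviving ones to stdout,
    # one line at a time, so the stage works in the middle of a pipe.
    for out_line in transform_lines(sys.stdin, keep_large_foo):
        sys.stdout.write(out_line)
```

Saved as a script that calls main() at the bottom, this could sit anywhere in a chain, since its input and its output are both plain JSRec lines.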

When to use it

JSON Streaming Record is an ideal replacement for CSV files. Use it when you want a data format that can store “streams” of records that are human readable yet easily parsed by a machine.

With each line of the format being a completely self-contained JSON object, JSRec allows you to produce and consume data incrementally. I encourage you to start using it as the format you pass around on the command line when you’re building those data processing pipelines.
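
As one illustration of producing data incrementally, here’s a minimal sketch of adding a record to an existing file without rewriting what’s already there. append_jsrec is a hypothetical helper of mine, not part of the definition above.

```python
import json


def append_jsrec(filename, rec):
    """Append a single record to a .jsrec file (hypothetical helper)."""
    # Mode "a" writes at the end of the file and creates it if it is missing.
    with open(filename, "a") as fh:
        fh.write(json.dumps(rec) + "\n")
```

Because every record is its own line, nothing written earlier has to be read or touched to add one more.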