At my last job, I worked with data from the Brazilian educational system in several situations. The details aren’t the important part, but the format – a giant denormalized CSV with an accompanying PDF detailing its fields. It is very nice after you work with it for some time, but there are some things that could be better.
In that format, enumerations (fields with a fixed, finite set of values) are encoded as some arbitrary integer range, boolean values as 0 or 1, and other implementation details that are explained in the PDF. Thus far we have a cute CSV with documented fields. Nice, right?
I was quite happy with it for months, doing analyses and maintaining the internal libraries used to work with it. However, as soon as we started working with data from earlier years, things went awry. Not so obviously wrong, but the code started getting lots of little conditionals, including things like:
if year == 2009:
# Some fields didn't exist.
elif student.extra_language == 'english':
# The answers were configured with a certain offset.
else # student.extra_language == 'spanish':
And this is freaking ugly.
I thought I could improve the situation. For example, keeping a directory for each year with the dataset (the CSV file) and a JSON describing the schema of the fields. The gains aren’t pronounced in this case, basically it is a translation of the PDF documentation to a computer-readable format.
We could also create a default schema such that each year’s data is mapped to it. This would move the complexity of the application to data pre-processing, which I prefer – that is one of the ugliest and most troublesome steps of data analysis anyway.
The Data Package Format
Today I was organizing the output files of some internal tools I developed at my current job:
So a bunch of directories, each with various CSVs representing data for a country. For reasons that I can’t write here, I started thinking about how it would be awesome if I could write some sort of metadata file for those IDs.
This opens up some possibilities. The format of those CSV files changed some times in the past 2 months, and some of my recent scripts can’t work with earlier versions of them. If I had a metadata file describing the precise schema of those files, I could abort any incompatible operation instead of receiving an error or, much worse, failing silently.
Thankfully, I found something that filled that niche: what is known as a Data Package. It is a bundle of data (which can be in any format, CSV, XLS, etc) and a “descriptor” file called datapackage.json . Quite simple. The specification can be found here.
For the case I’m working with, i.e. lots of CSV files, there is an extension to the format called Tabular Data Package. Its specification can be found here.
These formats are defined using what is called JSON schema, which I hadn’t heard about before. The json-schema.org website shows some interesting examples.