The state of open data is primitive. Most datasets are not found in repositories, but instead rattle around on FTP servers or hard-to-find web pages. Datasets are often undocumented, undated, and schemaless. Even those datasets that are stored within repositories have no change histories, adhere to no schemas, and are identified in no consistent way. It’s chaos.
Imagine that you want a list of all corporations registered in New York and a list of all corporations registered in Tennessee, so that you can learn which businesses are registered in both states. How do you find them? You fire up Google and search for "new york" corporate register, "new york" corporations download, and so on, until you either find the data or give up. Repeat for Tennessee. After buying the data from Tennessee (which will run you $1,000), you discover that New York and Tennessee use completely different file formats, and that neither documents its schema. So now you’re writing software to transform both into a common format. This is not a good process, but it is the standard process, and that’s assuming that things go well.
In short, the data may be open, but the related practices, systems, relationships, and metadata are not.
These major shortcomings prevent open data from advancing beyond its messy state. We have all of the downsides of a distributed system with none of the upsides.
However, a series of small, wholly achievable changes would facilitate innovative applications of open data that are currently too impractical to consider.
The future of open data is a decentralized, synchronized, automated system that brings to data what Docker and EC2 have brought to compute. That future must be built on its own technology (and practices) stack.
There are seven layers to the open data stack of the near future: open repository inventories, an inventory of all data repositories, common schemas, dataset metadata, dataset portability, dataset segmentation, and data synchronization. Following is a review of each of those layers, and a look at existing efforts to bring their promise to fruition.
Dataset of Repositories’ Inventories
Each data repository must include a dataset that is a list of all datasets in the repository, along with associated metadata about each of them. This makes it possible for a client to determine, with a single request, what the repository’s holdings are.
This exists, and is widely deployed, generally using Project Open Data’s data.json standard. It’s a simple JSON file, so named for its standardized location at http://data.example.gov/data.json. For an example, see Seattle’s inventory file.
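As a rough illustration of what a single request buys a client, here is a minimal sketch that reads an inventory and lists its holdings. The sample below is a hypothetical, heavily abbreviated file in the general shape of a data.json catalog (a top-level "dataset" list whose entries carry a title, a modified date, and a list of distributions); the identifiers and URLs are invented, and a real inventory carries many more fields.

```python
import json

# Hypothetical, abbreviated inventory in the general shape of a data.json catalog.
SAMPLE_INVENTORY = """
{
  "dataset": [
    {
      "identifier": "example-001",
      "title": "Corporate Registrations",
      "modified": "2015-01-15",
      "distribution": [
        {"downloadURL": "https://data.example.gov/corporations.csv",
         "mediaType": "text/csv"}
      ]
    },
    {
      "identifier": "example-002",
      "title": "Streamflow Readings",
      "modified": "2015-02-01",
      "distribution": []
    }
  ]
}
"""


def list_holdings(inventory_text):
    """Return (title, modified, download URLs) for each dataset listed."""
    catalog = json.loads(inventory_text)
    holdings = []
    for entry in catalog.get("dataset", []):
        # A dataset may have zero or more distributions; keep only direct downloads.
        urls = [d["downloadURL"] for d in entry.get("distribution", [])
                if "downloadURL" in d]
        holdings.append((entry.get("title"), entry.get("modified"), urls))
    return holdings


for title, modified, urls in list_holdings(SAMPLE_INVENTORY):
    print(title, modified, urls)
```

In practice the text would come from fetching the repository's inventory URL rather than from an inline string.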
Dataset of all Repositories
Having a list of the URLs of all data repositories makes it possible to index not just all repositories, but also all datasets (provided that each repository publishes its inventory as a dataset). Some additional metadata about each repository (name, scope of holdings, topics of datasets, last updated, frequency of update, etc.) will allow subsets of this dataset to be used, such as allowing historians studying the American presidency to look only at the inventories of presidential libraries.
There are a handful of existing efforts working toward this goal—collectively, if not individually—including Data Portals, Data.gov, Docker for Data, OpenGeoCode, and many more. Stitching together dozens of such meta-sources and using Common Crawl to automatically identify new repositories could yield a near-comprehensive dataset of all data repositories.
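To make the idea of using subsets concrete, here is a small sketch of filtering such a registry by topic. The record layout (name, url, topics) and the two entries are hypothetical; no existing registry is assumed to use these field names.

```python
# Hypothetical registry records; the field names are illustrative only.
REPOSITORIES = [
    {
        "name": "Example Presidential Library",
        "url": "https://library.example.gov/data.json",
        "topics": ["presidency", "history"],
    },
    {
        "name": "Example City Portal",
        "url": "https://data.example.gov/data.json",
        "topics": ["transportation", "budget"],
    },
]


def repositories_about(registry, topic):
    """Return the inventory URLs of repositories tagged with the given topic."""
    return [repo["url"] for repo in registry if topic in repo["topics"]]


print(repositories_about(REPOSITORIES, "presidency"))
```

Each URL returned points at a repository inventory, so the two layers compose: filter the registry, then read each matching inventory.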
Common Schemas
For each common type of dataset, a schema (a common set of fields and acceptable values) will have to be devised and employed. This is the hardest part of the open data stack. Creating a robust schema, and then getting people to use it, is a multi-year effort, and it needs to be done for hundreds of types of data. But it is also inevitable. Standards emerge in all professions, and open file schemas are no exception.
There does not need to be One True Schema for each type of data: proper dataset metadata (see below) can often make it possible to transmute data between schemas.
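A minimal sketch of that kind of transmutation, assuming each publisher's metadata tells us how its columns correspond to a shared vocabulary. The source column names and the common field names here are invented for illustration and do not describe any state's actual files.

```python
# Hypothetical mappings from each publisher's column names to a shared schema.
FIELD_MAPS = {
    "ny": {"corp_name": "name", "dos_id": "registration_id"},
    "tn": {"BusinessName": "name", "ControlNumber": "registration_id"},
}


def to_common_schema(record, source):
    """Rename a record's fields according to the mapping for its source."""
    mapping = FIELD_MAPS[source]
    return {common: record[original] for original, common in mapping.items()}


ny_record = {"corp_name": "Acme Widgets", "dos_id": "123"}
tn_record = {"BusinessName": "Acme Widgets", "ControlNumber": "A-9"}
print(to_common_schema(ny_record, "ny"))
print(to_common_schema(tn_record, "tn"))
```

Renaming fields is the easy part; reconciling differing value formats (dates, identifiers, controlled vocabularies) would need per-field conversion rules as well.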
Dataset Metadata
At present, most datasets lack any metadata to allow humans or software to understand their contents. For example, consider a CSV file like this:
Janet Moyer,202-844-1212,vice president
Julio Menendez Muñoz,(203) 777-4647 x700,treasurer
Amy Smith Barbe,616-892-1212,Vice President
Although the data may appear useful at first blush, it probably raises more questions than it answers. Are these their complete names, or just the names that they go by? Are these phone numbers for work, home, mobile, or is that unknown? Are the numbers in any kind of a standard format (it appears not), or have they been validated in any way? Are the positions in any standard format (given the differing case for “vice president,” probably not), and have they been validated? When was this data last updated? Could one person appear more than once, or should multiple similar records be assumed to be coincidence?
A schema and some additional metadata allow us to answer all of these questions, in ways meant for humans to understand, as well as in ways that software can understand. When data includes a schema, it becomes comparatively easy to validate data, inventory it, track modifications to it, or automatically convert it into a different format.
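As a sketch of what schema-driven validation might look like, the following checks the three sample rows above against a made-up schema. The field names, the phone pattern, and the list of allowed positions are all assumptions chosen for illustration; nothing in the sample data actually tells us, for instance, that these are work numbers.

```python
import csv
import io
import re

# Hypothetical schema: field names and rules are assumptions, not part of the data.
SCHEMA = [
    {"name": "full_name", "required": True},
    {"name": "work_phone", "pattern": r"^\d{3}-\d{3}-\d{4}$"},
    {"name": "position", "allowed": {"vice president", "treasurer"}},
]

SAMPLE = """Janet Moyer,202-844-1212,vice president
Julio Menendez Muñoz,(203) 777-4647 x700,treasurer
Amy Smith Barbe,616-892-1212,Vice President
"""


def validate(text, schema):
    """Return (row number, field name, value) for every value that fails a rule."""
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        for field, value in zip(schema, row):
            if field.get("required") and not value:
                problems.append((row_number, field["name"], value))
            elif "pattern" in field and not re.match(field["pattern"], value):
                problems.append((row_number, field["name"], value))
            elif "allowed" in field and value not in field["allowed"]:
                problems.append((row_number, field["name"], value))
    return problems


for problem in validate(SAMPLE, SCHEMA):
    print(problem)
```

Under these assumed rules, the second row's phone number and the third row's capitalized position are flagged, which are the same inconsistencies a human reader notices by eye.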
The most promising work towards this end is the Data Packages standard, created by the Open Knowledge Foundation. It requires only that the data file be accompanied by a datapackage.json file, which can include the data’s title, a description, version number, keywords, license data, authors, schema, or any of a few dozen other types of data. As many or as few fields can be included as is desirable.
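For a sense of scale, here is a sketch that builds a small descriptor of this kind and prints it as JSON. The package name, file path, license, and field names are hypothetical, and only a handful of properties are shown; the Data Package specification remains the authority on which properties exist and how they are spelled.

```python
import json

# Hypothetical descriptor for a file named officers.csv; all values are illustrative.
descriptor = {
    "name": "officers",
    "title": "Corporate Officers",
    "version": "0.1.0",
    "keywords": ["corporations", "officers"],
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "officers",
            "path": "officers.csv",
            "schema": {
                "fields": [
                    {"name": "full_name", "type": "string"},
                    {"name": "work_phone", "type": "string"},
                    {"name": "position", "type": "string"},
                ]
            },
        }
    ],
}

# Serialize the descriptor; this text would be saved alongside the data file.
descriptor_text = json.dumps(descriptor, indent=2)
print(descriptor_text)
```

The descriptor is just a sibling file, so a publisher can adopt it without changing how the data itself is produced.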
Dataset Portability
Because we transport data in oversimplified forms, divorced from all metadata, a non-trivial amount of work is required to reconstitute that data into a usable form. For instance, if a state transportation agency is going to publish traffic data, they’re going to extract the data from their own systems and transform it into a common, open format (e.g., CSV) for the public to download. For somebody to then use that data after downloading it, she’s likely to need it to be in a database, necessitating that she create an import script to clean up the data and load it. Without a schema and metadata for that CSV file, that could easily be a multi-hour process.
Data portability requires the automation of the transfer and transformation process, so that all metadata is generated at the time that the data is exported, travels with the data, and is applied at the time that the data is imported by a third party. Transferring data in this way becomes a single command (e.g., import https://example.gov/dataset.json), instead of a multi-hour process.
Much of this functionality is possible with dataset metadata and some scripting (e.g., Data Packages, cURL, and 50 lines of Bash may do the trick), but it can be done more robustly using dat, U.S. Open Data’s version-controlled, decentralized data synchronization tool.
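The scripting route might look something like the following sketch, which uses a list of field definitions (in the style of a Data Package schema) to create a SQLite table and load CSV rows into it. The import_dataset helper, the type mapping, and the traffic sample are assumptions made for illustration; this is not how dat works, and a real importer would also fetch the files, handle header rows, and validate values.

```python
import csv
import io
import sqlite3

# Assumed mapping from schema field types to SQLite column types.
TYPE_MAP = {"string": "TEXT", "integer": "INTEGER", "number": "REAL"}


def import_dataset(connection, table, fields, csv_text):
    """Create a table from the schema's fields, then load the CSV rows into it."""
    columns = ", ".join(
        f'"{field["name"]}" {TYPE_MAP.get(field["type"], "TEXT")}'
        for field in fields
    )
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in fields)
    rows = csv.reader(io.StringIO(csv_text))
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    connection.commit()


# Hypothetical traffic counts with a two-field schema and no header row.
FIELDS = [
    {"name": "station", "type": "string"},
    {"name": "vehicles", "type": "integer"},
]
TRAFFIC_CSV = "I-64 East,412\nRoute 29,87\n"

connection = sqlite3.connect(":memory:")
import_dataset(connection, "traffic", FIELDS, TRAFFIC_CSV)
total = connection.execute('SELECT SUM("vehicles") FROM "traffic"').fetchone()[0]
print(total)
```

The point is that the table definition comes entirely from the metadata, so nobody has to inspect the file by hand before loading it.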
Dataset Segmentation
Some enormously valuable datasets are large enough to be problematic. The Centers for Medicare and Medicaid Services publishes the database of all physicians in the United States, a core dataset in the health data ecosystem. It is distributed as a 550 MB ZIP file that is over 4 GB when uncompressed. Most applications of this data do not require the full dataset: the end user just requires data about every doctor in a given state, or physicians with a particular specialty. But there is no way to get those slices of data without first downloading all of the data. When dealing with multi-terabyte datasets, we rapidly approach the point where transferring them is not merely impractical, but actually impossible.
This is why another feature necessary in our open data stack is dataset segmentation: the ability to retrieve only the portions of a dataset that are needed. Retrieving only those records that have a column that matches a given value is computationally demanding, in comparison to transferring an existing file, but it’s also markedly more useful. (For perspective, it’s on par with an equivalent API call.)
Again, this is something that dat does.
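The core operation is simple to sketch, whatever tool performs it: stream through the records and pass along only those whose column matches the requested value. The column names and sample rows below are hypothetical, and this generic filter says nothing about how dat implements segmentation. The benefit described above only materializes when the filter runs on the publisher's side, so that only the matching rows cross the network.

```python
import csv
import io


def select_records(lines, column, value):
    """Yield only the rows whose `column` equals `value`, one row at a time."""
    for row in csv.DictReader(lines):
        if row[column] == value:
            yield row


# Hypothetical physician records; a real file would have many more columns.
PHYSICIANS_CSV = """name,state,specialty
A. Example,VA,Cardiology
B. Example,TN,Pediatrics
C. Example,VA,Pediatrics
"""

virginia = list(select_records(io.StringIO(PHYSICIANS_CSV), "state", "VA"))
print([row["name"] for row in virginia])
```

Because the function is a generator over a file-like object, it never holds the whole dataset in memory, which matters once the source is measured in gigabytes.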
Data Synchronization
The final layer in this system is a method of synchronizing data automatically, so that updates to remote datasets can, when desirable, be applied to local copies. This is akin to how computers or phones automatically download software updates. For frequently-updated datasets, where it’s desirable to perform rolling analysis (e.g., corporate registrations, legislation, or streamflow), a universal sync layer would be of enormous benefit.
Again, this is something that dat does, while versioning the changes. However, this could be done (though without versioning) by any system that periodically retrieves data from the URL defined for a dataset within its metadata.
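The unversioned approach can be sketched in a few lines: retrieve the dataset from the URL given in its metadata, compare a hash of the content against the hash of the local copy, and replace the local copy only when they differ. The sync_once function and the in-memory store are hypothetical names for illustration, and the fetch function is passed in so that the sketch runs without a network connection.

```python
import hashlib


def sync_once(fetch, local_store, url):
    """Fetch `url`; update the local copy only if the content has changed.

    Returns True when the local copy was updated, False when it was current.
    """
    content = fetch(url)
    digest = hashlib.sha256(content).hexdigest()
    if local_store.get(url, {}).get("digest") == digest:
        return False
    local_store[url] = {"digest": digest, "content": content}
    return True


# Stand-in for a remote server: a dict that we change to simulate an update.
remote = {"https://example.gov/dataset.csv": b"id,name\n1,Acme\n"}
store = {}
url = "https://example.gov/dataset.csv"

first = sync_once(remote.get, store, url)    # local copy created
second = sync_once(remote.get, store, url)   # nothing changed
remote[url] = b"id,name\n1,Acme\n2,Bolt\n"   # publisher adds a record
third = sync_once(remote.get, store, url)    # local copy replaced
print(first, second, third)
```

A real client might pass something like `lambda u: urllib.request.urlopen(u).read()` as the fetch function and call sync_once on a schedule. Note that this downloads the full file on every check and keeps no history, which is exactly the gap that a versioned tool fills.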
Individually, each of those seven layers is useful. Together, some of them are very powerful. Collectively, they will unlock the value of open data in a way that is presently impossible, and that will make the current open data ecosystem look pretty terrible in retrospect.
It’s plausible that this entire open data stack could be in place within 18–24 months. Portions of it, in the form of co-useful layers, could be in use next year.
To this end, at U.S. Open Data, we’re going to continue to develop dat, increase our promotion of and support for Open Knowledge’s Data Packages standard, and help to foment the creation of new schemas for core datasets. Whenever possible, we’ll also promote other ways to create the layers of this stack (even ones in ostensible competition with dat), because a diversity of approaches will allow the best one to come out on top.
Our opening example looks a lot rosier with this future stack. Building our environment then looks something like this:
data-get https://data.ny.gov/corporations https://data.tn.gov/corporations --schema https://schemas.gov/commerce/corporations.json
If you’ve worked with data for long, that should give you a little thrill of excitement. Let’s make it happen.