U.S. Open Data Blog

Here’s Our Open Data Future

11 December 2015 Waldo Jaquith

The state of open data is primitive. Most datasets are not found in repositories, but instead rattle around on FTP servers or hard-to-find web pages. Datasets are often undocumented, undated, and schemaless. Even those datasets that are stored within repositories have no change histories, adhere to no schemas, and are identified in no consistent way. It’s chaos.

Imagine that you want a list of all corporations registered in New York and a list of all corporations registered in Tennessee, to know which businesses are registered in both states. How do you find it? You fire up Google and search for "new york" corporate register, "new york" corporations download, etc. until you either find the data or give up. Repeat for Tennessee. After buying the data from Tennessee (which will run you $1,000), you discover that New York and Tennessee use completely different file formats, and that neither state documents its schema. So now you’re writing software to convert them into a common format. This is not a good process, but it is the standard process, and that’s assuming that things go well.

In short, the data may be open, but the related practices, systems, relationships, and metadata are not.

These major shortcomings prevent open data from advancing beyond its messy state. We have all of the downsides of a distributed system with none of the upsides.

However, a series of small, wholly-achievable changes would facilitate innovative applications of open data that are currently too impractical to consider.

The future of open data is a decentralized, synchronized, automated system that brings to data what Docker and EC2 have brought to compute. That future must be built on its own technology (and practices) stack.

There are seven layers to the open data stack of the near future: open repository inventories, an inventory of all data repositories, common schemas, dataset metadata, dataset portability, dataset segmentation, and data synchronization. Following is a review of each of those layers, and a look at existing efforts to bring their promise to fruition.

Dataset of Repositories’ Inventories

Each data repository must include a dataset that is a list of all datasets in the repository, along with associated metadata about each of them. This makes it possible for a client to determine, with a single request, what the repository’s holdings are.

This exists, and is widely deployed, generally using Project Open Data’s data.json standard. It’s a simple JSON file, so named for its standardized location at http://data.example.gov/data.json. For an example, see Seattle’s inventory file.
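
As a rough illustration of what "a single request" buys a client, here is a minimal sketch that reads an inventory and lists its holdings. The field names follow the Project Open Data catalog layout; the sample inventory, its values, and the `list_holdings` helper are invented for this example, and a real client would fetch the file from the repository instead of using an inline string.

```python
import json

# A tiny, made-up inventory in the shape of a Project Open Data catalog.
# The field names ("dataset", "title", "identifier", "modified",
# "distribution", "downloadURL", "mediaType") come from that layout;
# every value below is invented.
SAMPLE_INVENTORY = """
{
  "dataset": [
    {
      "title": "Building Permits",
      "identifier": "permits-001",
      "modified": "2015-11-30",
      "distribution": [
        {"downloadURL": "https://data.example.gov/permits.csv",
         "mediaType": "text/csv"}
      ]
    },
    {
      "title": "Street Trees",
      "identifier": "trees-002",
      "modified": "2015-06-01",
      "distribution": []
    }
  ]
}
"""


def list_holdings(inventory_text):
    """Return one (title, modified, download URLs) tuple per dataset."""
    catalog = json.loads(inventory_text)
    holdings = []
    for dataset in catalog.get("dataset", []):
        urls = [
            dist["downloadURL"]
            for dist in dataset.get("distribution", [])
            if "downloadURL" in dist
        ]
        holdings.append((dataset.get("title"), dataset.get("modified"), urls))
    return holdings


for title, modified, urls in list_holdings(SAMPLE_INVENTORY):
    print(title, modified, urls)
```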

Dataset of all Repositories

Having a list of the URL for every data repository makes it possible to index not just all repositories, but also all datasets (if they are publishing their inventory as a dataset). Some additional metadata about each repository (name, scope of holdings, topics of datasets, last updated, frequency of update, etc.) will allow subsets of this dataset to be used, such as allowing historians studying the American presidency to look only at the inventories of presidential libraries.

There are a handful of existing efforts working toward this goal—collectively, if not individually—including Data Portals, Data.gov, Docker for Data, OpenGeoCode, and many more. Stitching together dozens of such meta-sources and using Common Crawl to automatically identify new repositories could yield a near-comprehensive dataset of all data repositories.

Common Schemas

For each common type of dataset, a schema—a common set of fields and acceptable values—will have to be devised and employed. This is the hardest part of the open data stack. Creating a robust schema, and then getting people to use it, is a multi-year effort, and it needs to be done for hundreds of types of data. But it is also inevitable. Standards emerge in all professions, and open file schemas are no exception.

Examples of existing schemas include OpenTrails, General Transit Feed Specification, Local Inspector Value-entry Specification, and United States Legislative Markup, among many others.

There does not need to be One True Schema for each type of data: proper dataset metadata (see below) can often make it possible to transmute data between schemas.

Dataset Metadata

At present, most datasets lack any metadata to allow humans or software to understand their contents. For example, consider a CSV file like this:

name	phone	position
Janet Moyer	202-844-1212	vice president
Julio Menendez Muñoz	(203) 777-4647 x700	treasurer
Amy Smith Barbe	616-892-1212	Vice President

Although the data may appear useful at first blush, it probably raises more questions than it answers. Are these their complete names, or just the names that they go by? Are these phone numbers for work, home, mobile, or is that unknown? Are the numbers in any kind of a standard format (it appears not), or have they been validated in any way? Are the positions in any standard format (given the differing case for “vice president,” probably not), and have they been validated? When was this data last updated? Could one person appear more than once, or should multiple similar records be assumed to be coincidence?

A schema and some additional metadata allow us to answer all of these questions, in ways meant for humans to understand, as well as in ways that software can understand. When data includes a schema, it becomes comparatively easy to validate data, inventory it, track modifications to it, or automatically convert it into a different format.

The most promising work towards this end is the Data Packages standard, created by the Open Knowledge Foundation. It requires only that the data file be accompanied by a datapackage.json file, which can include the data’s title, a description, version number, keywords, license data, authors, schema, or any of a few dozen other types of data. As many or as few fields can be included as is desirable.
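
To make that concrete, here is a minimal sketch of a descriptor for the contact list above, plus a few lines that check the rows against it. The descriptor is shaped like a Data Package (a `resources` list whose entries carry a `schema` with typed `fields` and `constraints`), but the package name, the phone pattern, the list of allowed positions, and the hand-rolled `validate` function are all assumptions made for illustration, not part of the standard's tooling.

```python
import re

# Hypothetical descriptor for the contact list, shaped like a Data Package.
# Every value here (names, pattern, enum) is an assumption for illustration.
DESCRIPTOR = {
    "name": "officers",
    "title": "Corporate officers",
    "version": "1.0.0",
    "resources": [
        {
            "path": "officers.csv",
            "schema": {
                "fields": [
                    {"name": "name", "type": "string",
                     "constraints": {"required": True}},
                    {"name": "phone", "type": "string",
                     "constraints": {"pattern": r"\d{3}-\d{3}-\d{4}"}},
                    {"name": "position", "type": "string",
                     "constraints": {"enum": ["vice president", "treasurer"]}},
                ]
            },
        }
    ],
}

ROWS = [
    {"name": "Janet Moyer", "phone": "202-844-1212",
     "position": "vice president"},
    {"name": "Julio Menendez Muñoz", "phone": "(203) 777-4647 x700",
     "position": "treasurer"},
    {"name": "Amy Smith Barbe", "phone": "616-892-1212",
     "position": "Vice President"},
]


def validate(rows, schema):
    """Return (row number, field name, problem) per constraint violation."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in schema["fields"]:
            value = row.get(field["name"], "")
            rules = field.get("constraints", {})
            if rules.get("required") and not value:
                problems.append((number, field["name"], "missing"))
            if "pattern" in rules and not re.fullmatch(rules["pattern"], value):
                problems.append((number, field["name"], "bad format"))
            if "enum" in rules and value not in rules["enum"]:
                problems.append((number, field["name"], "not an allowed value"))
    return problems


for problem in validate(ROWS, DESCRIPTOR["resources"][0]["schema"]):
    print(problem)
```

With these assumed rules, the check flags the second row's phone number and the third row's capitalized position, which are exactly the inconsistencies a human reader has to guess at when the file arrives without a schema.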

Data Portability

Because we transport data in oversimplified forms, divorced from all metadata, a non-trivial amount of work is required to reconstitute that data into a usable form. For instance, if a state transportation agency is going to publish traffic data, they’re going to extract the data from their own systems and transform it into a common, open format (e.g., CSV) for the public to download. For somebody to then use that data after downloading it, she’s likely to need it to be in a database, necessitating that she create an import script to clean up the data and load it. Without a schema and metadata for that CSV file, that could easily be a multi-hour process.

Data portability requires the automation of the transfer and transformation process, so that all metadata is generated at the time that the data is exported, travels with the data, and is applied at the time that the data is imported by a third party. Transferring data in this way becomes a single command (e.g., import https://example.gov/dataset.json), instead of a multi-hour process.

Much of this functionality is made possible with dataset metadata and some scripting (e.g., Data Packages, cURL, and 50 lines of Bash may do the trick), but this is something that is made possible in a more robust fashion using dat, U.S. Open Data’s version-controlled, decentralized data synchronization tool.
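
For a sense of how little scripting is needed once a schema travels with the data, here is a minimal sketch that builds a database table from a schema and loads a CSV into it. The traffic data, the schema, the type mapping, and the `import_table` helper are all assumptions for this example; it uses Python's built-in SQLite support in place of whatever database the importer actually runs.

```python
import csv
import io
import sqlite3

# Assumed stand-ins for a downloaded file and the schema from its metadata.
CSV_TEXT = (
    "station,day,vehicles\n"
    "I-64 MM 120,2015-12-01,48211\n"
    "I-81 MM 7,2015-12-01,30977\n"
)
SCHEMA = {
    "fields": [
        {"name": "station", "type": "string"},
        {"name": "day", "type": "date"},
        {"name": "vehicles", "type": "integer"},
    ]
}

# Assumed mapping from schema types to SQLite column types.
SQL_TYPES = {"string": "TEXT", "date": "TEXT",
             "integer": "INTEGER", "number": "REAL"}


def import_table(connection, table, csv_text, schema):
    """Create `table` from the schema and load every CSV row into it.

    Table and column names are interpolated into SQL, so they must come
    from a trusted schema.
    """
    names = [field["name"] for field in schema["fields"]]
    columns = ", ".join(
        '"{}" {}'.format(field["name"], SQL_TYPES.get(field["type"], "TEXT"))
        for field in schema["fields"]
    )
    connection.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in names)
    rows = csv.DictReader(io.StringIO(csv_text))
    connection.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
        ([row[name] for name in names] for row in rows),
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
import_table(connection, "traffic", CSV_TEXT, SCHEMA)
print(connection.execute("SELECT SUM(vehicles) FROM traffic").fetchone()[0])
```

Because the column types come from the schema, the vehicle counts arrive as integers that can be summed immediately, with no hand-written cleanup step.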

Dataset Segmentation

Some enormously valuable datasets are large enough to be problematic. The Centers for Medicare and Medicaid Services publishes the database of all physicians in the United States, a core dataset in the health data ecosystem. It is distributed as a 550 MB ZIP file that is over 4 GB when uncompressed. Most applications of this data do not require the full dataset: the end user just requires data about every doctor in a given state, or physicians with a particular specialty. But there is no way to get those slices of data without first downloading all of the data. When dealing with multi-terabyte datasets, we rapidly approach the point where transferring them is not merely impractical, but actually impossible.

This is why another feature necessary in our open data stack is dataset segmentation: the ability to retrieve only the portions of a dataset that are needed. Retrieving only those records that have a column that matches a given value is computationally demanding, in comparison to transferring an existing file, but it’s also markedly more useful. (For perspective, it’s on par with an equivalent API call.)

Again, this is something that dat does.
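
The filtering itself is simple; what matters is where it runs. Here is a minimal sketch of the idea, assuming a small invented physician file and a hypothetical `segment` helper. This is not how dat implements it. In a real deployment this logic would run on the publisher's side, so that only the matching rows cross the network.

```python
import csv
import io

# Invented sample; the real file has far more columns and rows.
PHYSICIANS_CSV = (
    "npi,last_name,state,specialty\n"
    "1000000001,Nguyen,VA,CARDIOLOGY\n"
    "1000000002,Okafor,TN,FAMILY PRACTICE\n"
    "1000000003,Ramirez,VA,FAMILY PRACTICE\n"
)


def segment(lines, column, value):
    """Yield only the rows whose `column` equals `value`, one at a time."""
    for row in csv.DictReader(lines):
        if row[column] == value:
            yield row


virginia = list(segment(io.StringIO(PHYSICIANS_CSV), "state", "VA"))
print([row["last_name"] for row in virginia])
```

Because `segment` is a generator, it never holds the full dataset in memory, which is the property that matters when the source is several gigabytes.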

Data Synchronization

The final layer in this system is a method of synchronizing data automatically, so that updates to remote datasets can—when desirable—be applied to local copies. This is akin to how computers or phones automatically download software updates. For frequently-updated datasets, where it’s desirable to perform rolling analysis (e.g., corporate registrations, legislation, or streamflow), a universal sync layer would be of enormous benefit.

Again, this is something that dat does, while versioning the changes. However, this could be done (though without versioning) by any system that periodically retrieves data from the URL defined for a dataset within its metadata.
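
The unversioned approach can be sketched in a few lines: fetch the dataset from its URL, hash it, and replace the local copy only when the hash has changed. Everything here is an assumption for illustration, including the `sync` helper, the example URL, and the fake in-memory "remote" that lets the sketch run without a network connection.

```python
import hashlib


def sync(url, fetch, local_copies):
    """Refresh the local copy of `url` if the remote bytes changed.

    `fetch` is any callable that takes a URL and returns bytes.
    Returns True when the local copy was replaced.
    """
    remote = fetch(url)
    digest = hashlib.sha256(remote).hexdigest()
    previous = local_copies.get(url)
    if previous is not None and previous["sha256"] == digest:
        return False
    local_copies[url] = {"sha256": digest, "data": remote}
    return True


# A fake remote so the sketch runs without a network connection.
url = "https://data.example.gov/streamflow.csv"
remote_files = {url: b"gauge,cfs\n01,120\n"}
local_copies = {}

print(sync(url, remote_files.get, local_copies))  # first run stores a copy
print(sync(url, remote_files.get, local_copies))  # nothing changed upstream
remote_files[url] += b"02,95\n"
print(sync(url, remote_files.get, local_copies))  # upstream changed
```

In practice `fetch` would wrap an HTTP client and the function would run on a schedule. Note that this still downloads the whole file on every check, which is part of what a purpose-built sync tool avoids.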

* * *

Individually, each of those seven layers is useful. Together, some of them are very powerful. Collectively, they will unlock the value of open data in a way that is presently impossible, and that will make the current open data ecosystem look pretty terrible in retrospect.

It’s plausible that this entire open data stack could be in place within 18–24 months. Portions of it, in the form of co-useful layers, could be in use next year.

To this end, at U.S. Open Data, we’re going to continue to develop dat, increase our promotion of and support for Open Knowledge’s Data Packages standard, and help to foment the creation of new schemas for core datasets. Whenever possible, we’ll also promote other ways to create the layers of this stack (even ones in ostensible competition with dat), because a diversity of approaches will allow the best one to come out on top.

Our opening example looks a lot rosier with this future stack. Building our environment then looks something like this:

data-get https://data.ny.gov/corporations https://data.tn.gov/corporations --schema https://schemas.gov/commerce/corporations.json

If you’ve worked with data for long, that should give you a little thrill of excitement. Let’s make it happen.

Open Data Must be Distributed via HTTPS

18 November 2015 Waldo Jaquith

HTTP is being replaced by HTTPS. The default has changed. The newly-released HTTP/2 standard is being implemented almost exclusively over HTTPS, which will leave HTTP as a legacy standard. Firefox and Chrome are both deprecating HTTP.

Government is following suit. With M-15-13, the White House has declared that “all browsing activity should be considered private and sensitive” (their emphasis), including APIs. Federal agencies have a deadline of December 31, 2016 to introduce HTTPS and eliminate HTTP.

Following the lead of the federal government, it’s time to make HTTPS the new standard for data published by state and local governments, too. This is for two main reasons:

Using HTTPS protects the privacy of people accessing that data, e.g., a woman downloading a list of shelters for abused women in her area.
HTTPS ensures that data cannot be altered in transit. This is not a hypothetical concern: proxies, including ISPs and public WiFi hotspots, sometimes change data before delivering it to the client. Protecting data with HTTPS prevents those intermediaries from altering it.

Also, data integrity is an important issue to some folks in government, who worry that people wind up with altered, inaccurate copies of data, and so they object to publishing data at all if there’s no verification method. HTTPS addresses this by ensuring that the data that is published on the internet is the data that people receive when they download it.

Vendors of data repository software, especially those who provide hosted solutions, can make a strong impact here. They’re in a position to add this feature for their clients, at a trivial cost. Data outside of a repository, scattered around various websites, poses more of a challenge, but that is also a good incentive for governments to consolidate those scattered datasets in a centralized, HTTPS-enabled data repository. The inexorable move to HTTPS for all internet traffic means that all straggling data will eventually be protected by HTTPS anyway; there is no scenario in which it’s forever transmitted without encryption.

Let’s not wait for open data to be dragged into the future by default. Instead, the sector should take the lead, assuring data publishers and consumers alike that data is confidential and unaltered.

U.S. States Open Data Census

16 November 2015 Waldo Jaquith

Today we’re taking the wraps off our long-in-development U.S. States Open Data Census, the first census of the open data holdings of U.S. states.

Screenshot of the Census

This census is notable for being the first data census of its kind, using interesting new software, and advancing some new best practices for publishing open data. We’re launching with comprehensive data for the 13 most populous U.S. states (covering over 60% of the population). This will allow for public review before we begin the push to survey the remaining 39 states and territories.

Why States

U.S. states are powerful, chronically undervalued holders of crucial datasets, including legislation, spending, incarceration, corporate registries, and more. State CTOs, CIOs, and CDOs find it difficult to measure their successes in publishing open data, and some have expressed enthusiastic support for a way to compare their work against other states, something that was previously impossible because there wasn’t a state data census. We want to promote standards for what data states should publish and how they should publish it.

The Software

Difficulty deploying Open Knowledge’s popular Open Data Census software led to a search for alternatives. That’s when we discovered that Code for America’s Indianapolis brigade had built an Open Data Census clone in order to launch their great Police Open Data Census. (The Open Indy Brigade is doing signal work in the field of data about policing.) Their code was bespoke for police data, so we forked it and set about abstracting it to serve our purposes. The result isn’t something that’s ready to be deployed for general use, but it is an improved system, and we hope to work with the Indy Brigade to turn it into a general-use data census application.

The software uses Gulp, Bower, and Node.js to generate static files that can be deployed to GitHub Pages (or, really, any web host). The data lives in Google Sheets, and the software uses Tabletop.js to retrieve and display that data in the client.

We took the additional step of using CloudFlare to serve the site over HTTPS (albeit incomplete HTTPS, since GitHub Pages doesn’t support HTTPS for custom domain names).

How Datasets were Selected

The dataset selection process was done in collaboration with Emily Shaw, who was until recently the Deputy Policy Director at the Sunlight Foundation. Starting with the G8 National Action Plan’s definition of “High-Value Datasets,” we selected representative datasets for each category, making allowances for the differences between datasets appropriate for a nation and those appropriate for a state. Where necessary and possible, we consulted with experts in the field to help us to select datasets.

The Scoring Criteria

Also in collaboration with the Sunlight Foundation, we came up with the criteria that we’d use to evaluate the openness of each dataset. We used Open Knowledge’s global data census scoring criteria, and added three criteria of our own:

Is there a mechanism by which somebody with a copy of a dataset can ensure that it was not altered in transit? For example, is the data served over HTTPS, is an MD5 hash provided, or does the state provide a validation service, like Data Seal?
Is the dataset available in the state’s data repository? (See “Incomplete Repositories Considered Harmful.”)
Is the dataset complete?
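
For the hash option in the first criterion, the check on the consumer's side is short. This is a minimal sketch with an invented file and a hypothetical `matches_published_hash` helper; the "published" value is computed locally here only to stand in for the digest a state would post alongside its download.

```python
import hashlib


def matches_published_hash(data, published_md5):
    """Return True if the MD5 digest of `data` equals the published hex digest."""
    return hashlib.md5(data).hexdigest() == published_md5.strip().lower()


downloaded = b"name,state\nAcme Corp,NY\n"
# Stands in for the value the publisher would post next to the file.
published = hashlib.md5(downloaded).hexdigest()

print(matches_published_hash(downloaded, published))
print(matches_published_hash(downloaded + b"tampered", published))
```

A published hash only helps if the hash itself arrives intact, so it works best when the page listing it is served over HTTPS. MD5 is also no longer collision-resistant, so it is better suited to catching accidental corruption than deliberate tampering.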

The question of “completeness” requires considering both whether all of the data is there (does a list of state corporations include all of them?) and whether the data published is adequate to be useful (does a list of state corporations include their addresses?). For this, we again consulted with experts in the field to help us define each dataset for the purpose of completeness. In particular, we’re grateful to the Sunlight Foundation’s Becca James, OpenAddresses’ Ian Dees, HDScores’ Matthew Eierman, and U.S. Department of Transportation CDO Daniel S. Morgan for their crucial help.

We decided to award 5 points for every criterion, with the exception of whether the data is machine-readable, for which we award 50 points, since that’s really the essence of open data. That yields a 100-point scale, which we use to calculate A–F grades.
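
The arithmetic can be sketched as follows. Two things here are assumptions rather than the census's actual code: the ten placeholder criterion names (ten 5-point criteria plus the 50-point one is what it takes to reach the 100 points mentioned above) and the letter-grade cut-offs, which this post does not state.

```python
MACHINE_READABLE = "machine_readable"

# Placeholder names: ten stand-ins for the 5-point criteria.
OTHER_CRITERIA = ["criterion_{}".format(n) for n in range(1, 11)]

# Assumed grade cut-offs as (minimum score, letter); not stated in the post.
GRADE_FLOORS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def score(met):
    """Sum 50 points for machine readability and 5 per other criterion met."""
    total = 50 if MACHINE_READABLE in met else 0
    total += 5 * sum(1 for name in OTHER_CRITERIA if name in met)
    return total


def grade(points):
    """Map a 0-100 score to a letter using the assumed cut-offs."""
    for floor, letter in GRADE_FLOORS:
        if points >= floor:
            return letter
    return "F"


everything = set(OTHER_CRITERIA) | {MACHINE_READABLE}
pdf_only = set(OTHER_CRITERIA)  # meets everything except machine readability

print(score(everything), grade(score(everything)))
print(score(pdf_only), grade(score(pdf_only)))
```

The second case shows the effect of the weighting: a dataset that satisfies every other criterion but is not machine-readable tops out at half the available points.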

The methodology is described in more detail on the census website.

Work Remaining

After pausing to process public feedback about the datasets and methodologies (which is welcome in the form of a GitHub Issue, tweets @opendata, or email), the biggest work to be done is clearly to survey the remaining 39 states and territories. Also:

Simplify the site home page—it’s confusing and too wide.
Move data off of Google Sheets and into a GitHub-hosted CSV, so that people can file pull requests to improve the data.
Include a few more datasets in the census, in particular air, water, and waste permits and corporate grants, which have been accidentally omitted thus far.
Grade each state, instead of just grading each dataset. (We’ve held off on this until the community can review the grading criteria.)

What You Can Do

The most useful thing that you can do is let us know about state datasets that we don’t have listed now. So if you’re an expert in Nebraska’s published incarceration data, we’d love to receive a pull request telling us about it, and we’ll include it in the census. If you just want to provide us with a list of the URLs for each surveyed dataset in your state, that would be an enormous help. And, of course, we’ve got open issues on the project repository—help is welcomed!

CKAN Multisite Now Available

14 September 2015 Waldo Jaquith

Today, U.S. Open Data is happy to announce the immediate availability of CKAN Multisite, a powerful new tool to launch, host, and manage dozens or even hundreds of data repository websites on a single server. Built by DataCats, CKAN Multisite consists of a Dockerized CKAN environment and an administrative interface to manage those Docker containers. Using CKAN Multisite, creating a new data repository takes literal seconds instead of hours or days (or weeks or months).

This is the culmination of a ten-month process that began with hiring Open Knowledge to create a draft proposal for making it easier to host multiple CKAN sites on a single server. We put that draft RFP up on GitHub and, with essential guidance from dozens of people from both the public and private sectors, we whittled down that draft RFP to the parts with which the community told us we could have the most impact. We published a pair of RFPs in January and awarded both contracts to DataCats (née boxkite). They’ve been working on this ever since, and we couldn’t be happier with their work.

What It’s For

There are two benefits that we expect to see come of CKAN Multisite:

Creating more competition in the commercial data hosting space. There aren’t nearly enough vendors selling this service, in part because so much technical debt is accrued with each new client. This makes it far simpler to offer this service.
Making it trivial to establish regional data centers. Governments, non-profit organizations, and universities throughout the country are creating shared data repositories for regional governments and NGOs, but they’ve found the technical and financial obstacles to be daunting. CKAN Multisite was created explicitly with them in mind.

Fostering Competition

Not wanting to create vendor lock-in, since there are many other popular data repository packages out there, we simultaneously funded the creation of CKAN Export, which DataCats executed as an expansion of CKAN’s API. This makes it possible to “dehydrate” a CKAN website, so that its contents can be imported into another CKAN server, or into commercial services like Junar, Socrata, ArcGIS Open Data, etc. (contingent on those services supporting its open format). Some sites are going to outgrow CKAN Multisite, or even CKAN, and we want to make it easy for them to change their environment to match their needs.

Funding

CKAN Multisite was funded generously by the John S. and James L. Knight Foundation as a part of their founding support of U.S. Open Data. They provided the $22,400 in contracting costs, as well as the non-trivial amount of staff time that went into the process. We’re grateful to them for their essential support.

Dat Goes Beta

29 July 2015 Karissa McKelvey

After a long year of alpha testing, which started on August 19th, we are excited to announce our launch of a new, even-more-stable phase of dat. Beta starts now.

Let us know what you’re working on and how dat might work for you. We’ll even come to your lab to help you get up and running, implement features, and fix bugs in real time. That’s first-class service!

Get started and break things (but don’t forget to let us know if things break):

npm install -g dat

Dat is a data collaboration tool. We think most people will use it to simplify the process of downloading and updating datasets, but we are also very excited about how people will use it to fork, collaborate on, and publish new datasets for others to consume.

We’ve been testing dat with the generous help of physicists, data hackers, machine learning experts, biostatisticians, government developers, sociologists, and informaticians. We brought our work to Mozilla Science Lab to integrate with Federated Wiki, the Sanger Institute to show off dat and Bionode, Berkeley Institute for Data Science where one of us co-works, ROpenSci to talk about visual diffing, JSConf to get our nerd on, and to SRCCON to demo with Flatsheet, just to name a few.

Finally, we’ve landed on an internal and external design that we believe covers the key use cases in the collaboration of data science and open data. With dat beta, you now can:

1) Put multiple data tables in a single repository. With dat alpha, the entire dat was one tabular entity—now, dat supports complex, heterogeneous datasets composed of different schemas or types of files. Datasets are kind of like SQL tables, except these datasets don’t support robust querying or joins.

2) Trust versions with cryptographic accuracy. Versions are now represented as uniquely identifiable SHA-256 hashes, and you can attach a message and timestamp for a coarse-grained view of the data’s history. You can also easily (O(1)) revert to a historical version to view and edit the data.

3) Fork, diff and merge. Although forks could be represented as conflicts to be merged immediately (as one might expect in a version control system such as Git), dat’s philosophy is the opposite. We think that data tools should embrace forks as key support for experimentation during the scientific process. Now dat is a decentralized, versioned, directed acyclic graph, so fork away!

4) Deploy over HTTP or SSH. Dat is transport-agnostic and easy to deploy. This means your IT team is happy because your existing authentication schemes will work the same as ever. Feel free to give read-only access through HTTP, too.

5) Integrate with other backend engines. Now you can use dat on top of any database that implements the AbstractLevelDOWN API, which means PostgreSQL, Redis, Mongo, Google Drive, etc. This feature is one of the many that still need extensive testing, but we are excited about the prospect.

6) Focus on internals. We’ve decided to scale back our initial offerings of the dat editor and server in favor of relying on a community of modules across languages that are built on top of dat, using dat as a data storage and collaboration tool instead of a jack-of-all-trades-master-of-none. Check out flatsheet for a dat editor replacement and karissa/dat-rest-server for a REST server replacement.

7) It’s faster. Yeah, it’s faster. And it’ll only get faster as time goes by.
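
To illustrate the ideas in points 2 and 3 (hash-identified versions, and forks as ordinary branches of a directed acyclic graph), here is a toy sketch. It is not dat's implementation or API: the `commit` and `checkout` helpers, the in-memory `store`, and the streamflow rows are all invented for this example.

```python
import hashlib
import json


def commit(store, rows, parents, message):
    """Store a version and return its SHA-256 identifier.

    The identifier is derived from the rows, the parent identifiers, and
    the message, so any change to history yields a different identifier.
    """
    record = {"rows": rows, "parents": sorted(parents), "message": message}
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()
    store[version] = record
    return version


def checkout(store, version):
    """Look up a historical version directly by its identifier."""
    return store[version]["rows"]


store = {}
root = commit(store, [{"gauge": "01", "cfs": 120}], [], "initial import")

# Two people build on the same version: a fork, not a conflict.
fork_a = commit(store, [{"gauge": "01", "cfs": 125}], [root],
                "recalibrate gauge")
fork_b = commit(store, [{"gauge": "01", "cfs": 120},
                        {"gauge": "02", "cfs": 95}], [root], "add gauge 02")

# A merge is simply a version with two parents.
merged = commit(store, [{"gauge": "01", "cfs": 125},
                        {"gauge": "02", "cfs": 95}], [fork_a, fork_b],
                "merge forks")

print(len(store))
print(checkout(store, root))
```

Both forks stay in the store alongside the merge, and any version can be retrieved with a single dictionary lookup by its hash.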

The Future

After being granted the security to continue working on dat for the next two years, courtesy of the Alfred P. Sloan Foundation, we are doubling down on reliability as we approach a full-featured 1.0 release in the coming months. We want to make it easier to:

track forks (with names, for starters);
verify tabular schemas through test suites, with inspiration from sensorQC and testdat;
add a robust set of complex and heterogeneous data security, privacy, and access settings;
make BitTorrent support easy for fast data transfer (for cases like all of the stars in the sky); and,
integrate into user-friendly editors like flatsheet and the Jupyter Project.

But we can’t do it without your help (really). Every little piece of feedback counts. We’re particularly looking for bug testing, feature requests, and integration with existing data sharing systems. We can’t wait to see your feedback, comments, questions, and pull requests.

Chat with Us

We’re almost always available in #dat on freenode, or datproject/discussions on gitter.im, so join us!
