We Need Data Schemas—So Let’s Create Them

29 July 2016

Here’s a conversation I’ve had a half-dozen times in the past year:

Agency: We’ve got a handful of datasets ready to be published, and we need advice on how to publish each of them.

Waldo: Great! Why don’t you run down the list?

A: We want to publish our corporate register, but we don’t know how to provide it. What file format do we use, and what schema?

W: Er…yeah, about that. There is no file format or schema for corporate registers. Sorry.

A: Huh. That leaves us in a tight spot. But, OK, moving on: How about our checkbook—our spending? What’s the format and schema for that?

W: Jeez, I’m sorry, but again…there isn’t one.

A: Wow. OK, well, what about our agency’s inventory of buildings and land?

W: Same.

A: Population forecasts?

W: Nope.

A: Address coordinates?

W: Nuh-uh.

A: Cadastral records?

W: Sorry.

A: …

W: Yeah.

A: How are we supposed to publish data if there’s no standard?

W: Just…kinda…do it? I guess?

A: …

You can’t have this conversation very many times before it becomes obvious that our lack of data standards is a real problem.

Johns Hopkins’ Center for Government Excellence has put together a list of civic data standards. It is extremely short. There are very few schemas for sharing government data with the public.

Creating a standard is hard. The right way to create a standard involves engaging a broad range of stakeholders in the public and private sectors, including producers and consumers of data in that format, to create something that will be broadly useful and stand the test of time.

This approach has yielded exactly zero standards in this space.

General Transit Feed Specification (GTFS) is the huge success story here, and that resulted from some Google engineers working with a single transit agency. There was no series of roundtables, no acceptance testing, no RFC. They just did it, building something lightweight and extensible that solved the problems at hand. It’s changed a lot in the 11 years since, adapting to the needs of its growing user base and becoming subject to the normal standards-creation processes, but for almost that entire time, GTFS has been the standard for transit data.

We don’t have enough data points to know whether GTFS is an outlier or a model, but I posit that it’s a model. (Consider that Open311 emerged in the same way.) There’s no movement to create schemas for the many dozens of core datasets that are being published by governments (or, rather, not being published). The effort required to convene a standards group is apparently not worth the trouble, what with it not happening. The effort required to do this for all of these core datasets is implausibly large. So let’s not.

What we need is for tiny groups of stakeholders—maybe mere pairings of stakeholders—to just go ahead and create standards within their area of expertise. And don’t call it a “standard,” if that sounds too scary. Call it an “implementation” or “our schema,” or whatever. Develop it in the open, document it, set up a validator, put it to work, and get out the word that it exists.

As writers like to say, you can’t edit nothing. And, as both Twitter and Wikipedia have demonstrated, being wrong in public is sure to attract people to explain precisely why you are wrong. Once there’s a v1.0 standard, stakeholder-critics will materialize, demand a say, and then they can begin the multi-year process of producing a v2.0 standard. In the meantime, that v1.0 standard will exist, and people can start using it. That initial version starts a conversation, it doesn’t end one.

Rough consensus and running code.

Is this the right way to create data standards? Nope. Will it work? Maybe. But is our current approach doing any good? Nuh-uh.

Now go create some standards.