U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    We Need Data Schemas—So Let’s Create Them

    29 July 2016 Waldo Jaquith

    Here’s a conversation I’ve had a half-dozen times in the past year:

    Agency: We’ve got a handful of datasets ready to be published, and we need advice on how to publish each of them.

    Waldo: Great! Why don’t you run down the list?

    A: We want to publish our corporate register, but we don’t know how to provide it. What file format do we use, and what schema?

    W: Er…yeah, about that. There is no file format or schema for corporate registers. Sorry.

    A: Huh. That leaves us in a tight spot. But, OK, moving on: How about our checkbook—our spending? What’s the format and schema for that?

    W: Jeez, I’m sorry, but again…there isn’t one.

    A: Wow. OK, well, what about our agency’s inventory of buildings and land?

    W: Same.

    A: Population forecasts?

    W: Nope.

    A: Address coordinates?

    W: Nuh-uh.

    A: Cadastral records?

    W: Sorry.

    A: …

    W: Yeah.

    A: How are we supposed to publish data if there’s no standard?

    W: Just…kinda…do it? I guess?

    A: …

    You can’t have this conversation very many times before it becomes obvious that our lack of data standards is a real problem.

    Johns Hopkins’ Center for Government Excellence has put together a list of civic data standards. It is extremely short. There are very few schemas for sharing government data with the public.

    Creating a standard is hard. The right way to create a standard involves engaging a broad range of stakeholders in the public and private sectors, including producers and consumers of data in that format, to create something that will be broadly useful and stand the test of time.

    This approach has yielded exactly zero standards in this space.

    General Transit Feed Specification (GTFS) is the huge success story here, and that resulted from some Google engineers working with a single transit agency. There was no series of roundtables, no acceptance testing, no RFC. They just did it, building something lightweight and extensible that solved the problems at hand. It’s changed a lot in the 11 years since, adapting to the needs of its growing user base and becoming subject to the normal standards-creation processes, but for almost that entire time, GTFS has been the standard for transit data.

    We don’t have enough data points to know whether GTFS is an outlier or a model, but I posit that it’s a model. (Consider that Open311 emerged in the same way.) There’s no movement to create schemas for the many dozens of core datasets that are being published by governments (or, rather, not being published). The effort required to convene a standards group is apparently not worth the trouble, what with it not happening. The effort required to do this for all of these core datasets is implausibly large. So let’s not.

    What we need is for tiny groups of stakeholders—maybe mere pairings of stakeholders—to just go ahead and create standards within their area of expertise. And don’t call it a “standard,” if that sounds too scary. Call it an “implementation” or “our schema,” or whatever. Develop it in the open, document it, set up a validator, put it to work, and get out the word that it exists.
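
    To make that concrete, here’s a minimal sketch, in Python, of what “our schema” plus a validator can look like for a hypothetical building-and-land inventory published as CSV. Every field name below is invented purely for illustration; the point is that a useful v1.0 can be this small.

        # Minimal sketch of a "schema plus validator" for a hypothetical
        # building-and-land inventory published as CSV. Field names are
        # invented for illustration; a real schema would be documented in
        # the open and versioned.
        import csv
        import sys

        SCHEMA = {
            "parcel_id": str,     # agency's unique identifier for the parcel
            "name": str,          # e.g., "Main Street Fire Station"
            "latitude": float,
            "longitude": float,
            "square_feet": float,
        }

        def validate(path):
            """Return a list of (row_number, message) problems found in the CSV."""
            problems = []
            with open(path, newline="") as f:
                reader = csv.DictReader(f)
                missing = set(SCHEMA) - set(reader.fieldnames or [])
                if missing:
                    problems.append((1, "missing columns: " + ", ".join(sorted(missing))))
                    return problems
                for i, row in enumerate(reader, start=2):  # row 1 is the header
                    for field, cast in SCHEMA.items():
                        try:
                            cast(row[field])
                        except (TypeError, ValueError):
                            problems.append((i, "bad value for %s: %r" % (field, row[field])))
            return problems

        if __name__ == "__main__":
            for row_number, message in validate(sys.argv[1]):
                print("row %d: %s" % (row_number, message))

    Run it against a file, fix what it complains about, publish both, and there’s now a v1.0 for critics to improve on.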

    As writers like to say, you can’t edit nothing. And, as both Twitter and Wikipedia have demonstrated, being wrong in public is sure to attract people to explain precisely why you are wrong. Once there’s a v1.0 standard, stakeholder-critics will materialize, demand a say, and then they can begin the multi-year process of producing a v2.0 standard. In the meantime, that v1.0 standard will exist, and people can start using it. That initial version starts a conversation, it doesn’t end one.

    Rough consensus and running code.

    Is this the right way to create data standards? Nope. Will it work? Maybe. But is our current approach doing any good? Nuh-uh.

    Now go create some standards.

    Concluding U.S. Open Data

    16 June 2016 Waldo Jaquith

    Three years ago, when planning U.S. Open Data at the Aspen Institute’s Forum on Communications and Society, we established an unusual requirement: that the organization be shut down within four years, at the most. Ideally sooner.

    On July 31, we’re going to do just that.

    * * *

    U.S. Open Data was created to serve as a partner to both government and the private sector, to improve the state of open data and to help ensure that government wouldn’t let open data become a passing fad. Our mission has been to build up the open data ecosystem, both inside and outside of government, to the point that an organization like U.S. Open Data is no longer necessary. We’ve done that by providing no-cost consulting services, building software to close gaps in the ecosystem, propagating best practices, and promoting the people doing important work in this space.

    Our four-year term limit has been a gift, every day. Knowing that our organization is going away, there’s been no need to chase grants, no sense in building ourselves up, nothing to be gained by building a network that places us at the center. Even ignoring the daily benefits, the value of a term limit is this: there is some point of time by which we either have accomplished our mission (in which case we should stop, because we’re done), or we have not accomplished our mission (in which case we should stop, because we’re not up to the task). It limits the potential damage that a well-intentioned organization can do, and prevents it from becoming a zombie organization that exists because it exists.

    A lot has changed since 2013. 18F and the U.S. Digital Service exist now, giving the federal government technical capacity that it simply didn’t have three years ago, and they’re baking open data into their systems and processes. Bloomberg Philanthropies launched What Works Cities last year, bringing standardized open data practices to 100 cities across the U.S. The DATA Act is now law. Cities and states throughout the U.S. have open data laws and policies.

    Better practices are in place, leaders have emerged, gaps have been bridged, laws have been passed, regulations have been written, businesses have been started. Open data has enmeshed itself into government, business, and society, in ways that would make it awfully difficult to eliminate.

    Open data can’t go away anymore. (If, indeed, it ever could have.) Our mission has been accomplished, although whether we deserve any credit for that is impossible to know.

    * * *

    We’re going to spend the next month and a half wrapping up projects (even starting one new project), finding new homes for other projects, and generally ensuring that nobody will notice when U.S. Open Data ceases to exist. As it should be.

    Our major project, Dat, long ago became substantially larger than the rest of U.S. Open Data, by any metric. Designed to outlast U.S. Open Data, Dat will continue to grow and thrive, unaffected.

    We are enormously grateful to our founding funder, the John S. and James L. Knight Foundation, without whom U.S. Open Data would never have existed. And we are likewise grateful to our general support funder since 2015, the Shuttleworth Foundation, who made it possible for us to complete our work.

    Here’s hoping we did some good.

    Accidentally Collecting 30 Million Land Records

    07 June 2016 Waldo Jaquith

    We’ve just finished an interesting, experimental project to open up cadastral data, and the results are too delightful to keep quiet (even though we have nothing to show for the project right now except a GitHub repository).

    U.S. Open Data is a long-time supporter of the OpenAddresses project, a volunteer-run project that aggregates government-published address datasets to create a global repository of the coordinates of street addresses. Anecdotally, project volunteers had noticed that a fair number of the data sources contained not just the latitude and longitude of an address, but the boundaries of the parcel. That raised the question of how many of the indexed 257 million addresses might include boundary data that was going unused. Could we have accidentally collected millions of cadastral records?

    So we hired Postlight to figure it out for us. Developer Bryan Bickford spent a little over a week creating a Python-based tool to find and extract parcel data from OpenAddresses’ records.

    Bryan’s work gave us a hard number: of the 1,511 data sources ingested by OpenAddresses, 383 (25%) include parcel boundaries, for a total of 30,461,769 parcels. Although OpenAddresses has address data for dozens of countries, it turns out that only five of them commingle addresses with parcel boundaries: Brazil, Canada, Singapore, Ukraine, and the United States. 95% of the included parcels (29,033,215 in total) are in the United States. That’s a lot of land parcels to have collected accidentally.
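
    For a sense of what that kind of scan involves, here’s a rough sketch in Python, assuming each source has been downloaded locally as GeoJSON (the actual tool lives in the GitHub repository mentioned above and may well work differently): it simply counts features whose geometry is a polygon rather than a point.

        # Rough sketch of detecting parcel boundaries in address sources,
        # assuming each source has been downloaded locally as GeoJSON.
        # (The actual tool may differ; this only illustrates counting
        # polygon geometries versus point geometries.)
        import glob
        import json

        sources_with_parcels = 0
        total_parcels = 0

        for path in glob.glob("sources/*.geojson"):  # hypothetical local layout
            with open(path) as f:
                collection = json.load(f)
            polygons = [
                feat for feat in collection.get("features", [])
                if feat.get("geometry", {}).get("type") in ("Polygon", "MultiPolygon")
            ]
            if polygons:
                sources_with_parcels += 1
                total_parcels += len(polygons)

        print(sources_with_parcels, "sources contain parcel boundaries,",
              total_parcels, "parcels in total")
        # For the numbers in this post: 383 / 1,511 sources is about 25%,
        # and 29,033,215 / 30,461,769 parcels is about 95% in the U.S.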

    If the idea of a comprehensive cadastral map sounds familiar, that’s because this is exactly what Loveland is doing. They’ve put together a collection of over 100 million parcels. We’ll figure out the extent to which their 100 million parcels overlap with our 30 million, so that our data can be used to fill any gaps in Loveland’s records. The first 10% of what Loveland does is a lot like what OpenAddresses does, but they go much further in building atop the data that they collect, instead of just collecting and publishing metadata.

    The OpenAddresses project is working on figuring out what to do with parcel records, now that it knows it has so many of them. That might mean doing nothing (beyond sharing this data), or it might mean creating an ongoing offshoot of the project to publish this data. It remains to be seen.

    Our thanks to the Shuttleworth Foundation for funding this project.

    Opening Philadelphia’s Parking Data

    07 April 2016 Waldo Jaquith

    Last month, Lauren Ancona created an entirely new type of website: a mapped, open-data-powered parking website for Philadelphia, named “Parkadelphia.”

    [Screenshot of Parkadelphia]

    Ancona is the Senior Data Scientist for the City of Philadelphia, but she built this website on her own time, as a passion project. I asked her a few questions about this novel form of open data.

    Where did you get the idea for Parkadelphia?

    It wasn’t as much of an idea as it was just a hole that needed filling. I just couldn’t understand why, in an age where I can poke at my phone and a person in a car shows up to drive me wherever I want, I couldn’t also use it to figure out where it was legal to park, for how long, and how much it would cost me. It turned out to be an excellent vehicle for learning to code, as an autodidact. You have to find some way to limit yourself to a defined set of goals, otherwise you never end up producing something from start to finish.

    What datasets are required for this?

    In Philadelphia, we’re using:

    • Streets Centerlines (maintained by Streets Department)
    • Residential Permit Parking Blocks (joined from text-only file)
    • Residential Permit Districts (created manually from text of the City code)
    • Metered Blocks (joined from text-only file)
    • Scooter/Motorcycle Corrals (created manually from list on Philadelphia Parking Authority website)
    • Loading/Delivery Zones (created from text file)
    • Snow Emergency Routes (maintained by Streets Department)
    • Center City Parking Meter locations (created manually from Google Street View)

    Some cities have still other sets available that may be useful, such as:

    • Fire Hydrant locations
    • Parking Meter locations
    • Handicapped Parking Spot locations
    • Car Share locations

    Is it unusual that Philadelphia publishes the data necessary to support this?

    Most of the data that has been released in Philadelphia isn’t actually geospatial…so, no? In fact, many other cities, including Washington, DC and San Francisco, already publish most of the data you’d need to build something similar via their open data portals. The trick was finding some way to tie the regulations together and refer to existing geometry. Most large cities keep their own canonical data set of streets and their exact geometry (relative to one another), called a centerlines file (sometimes “streets centerlines”). Because hundred blocks often intersect other, smaller segments, they become subdivided into smaller sections—called “street segments.”

    I decided to try to use that resolution and join the regulation data I had to each street segment, each of which has a unique ID. While it’s true that this model doesn’t take into account regulations like the minimum legal distance you can park from an intersection, it was enough to get started. Philadelphia has over 40,000 street segments; if I needed more detailed information, I figured the model could always be tightened down later.
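
    A rough sketch of that kind of join, in Python, assuming a centerlines file exported as GeoJSON with a seg_id property and a regulations CSV keyed by the same ID (both names are hypothetical, and Parkadelphia’s own implementation may differ), is essentially a lookup by segment ID:

        # Sketch of attaching parking regulations to street segments by ID.
        # Assumes a centerlines GeoJSON whose features carry a "seg_id"
        # property, and a CSV of regulations keyed by the same ID; the file
        # layouts and field names here are hypothetical.
        import csv
        import json

        with open("centerlines.geojson") as f:
            centerlines = json.load(f)

        # Build a lookup of regulations per segment ID.
        regulations = {}
        with open("regulations.csv", newline="") as f:
            for row in csv.DictReader(f):
                regulations.setdefault(row["seg_id"], []).append({
                    "type": row["type"],      # e.g., "residential permit", "metered"
                    "detail": row["detail"],  # e.g., permit district, meter rate
                })

        # Copy each segment's regulations onto the feature's properties,
        # so a map can style and display them directly.
        for feature in centerlines["features"]:
            seg_id = feature["properties"].get("seg_id")
            feature["properties"]["regulations"] = regulations.get(seg_id, [])

        with open("centerlines-with-regulations.geojson", "w") as f:
            json.dump(centerlines, f)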

    I’m looking at using something like Koop to take a shot at loosely coupling sets directly from data portals that have the minimum required sets, to see if I could handle any necessary joins on the fly with something like Turf.js, cache the regulations for a period of time, and serve them via API to any apps that want to reference them. Then, depending on how frequently they might change, just go back to the portal at a set interval to refresh the data automatically. But I need to finish a data model for that first.
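
    The shape of that idea (fetch from the portal, cache for a while, refresh on an interval) can be sketched roughly as follows. Koop and Turf.js are JavaScript tools; this Python version, with a hypothetical portal URL and refresh interval, only illustrates the caching pattern.

        # Language-agnostic sketch of "fetch, cache, refresh on an interval."
        # The portal URL and the refresh interval are hypothetical.
        import json
        import time
        from urllib.request import urlopen

        PORTAL_URL = "https://example.org/regulations.geojson"  # hypothetical
        REFRESH_SECONDS = 6 * 60 * 60  # refresh a few times a day

        _cache = {"fetched_at": 0.0, "data": None}

        def get_regulations():
            """Return cached regulation data, refetching after the interval."""
            if time.time() - _cache["fetched_at"] > REFRESH_SECONDS:
                with urlopen(PORTAL_URL, timeout=30) as response:
                    _cache["data"] = json.load(response)
                _cache["fetched_at"] = time.time()
            return _cache["data"]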

    One exciting development since we launched is the beginning of a conversation between cities to establish a spec for open parking data. It’s my nerdiest dream yet, but what if we could build something like GTFS, but for parking data? If you can look up the subway schedule via a search engine, why can’t you look up public parking regulations?

    What would it look like to deploy this in other cities?

    If there’s a centerlines file available, I’d start there. Then find a way to start mapping regulation data to the smallest linear segment you have available—most likely the street segment.

    I’m working to document the data model—it was constantly changing right up until launch—in a way that is general enough to be easy to reuse.

    Maintainability is a huge concern—it’d be best to plan your model around the most convenient way to update some (but not all) of the data on a rolling basis, since that’s been the most likely scenario, in my experience.

    * * *

    To contribute to Parkadelphia, or to fork it to deploy in your own city, see the project’s GitHub repository.

    The New Trend of Decentralized Government Data Aggregation

    06 April 2016 Waldo Jaquith

    A new approach to government data sharing is emerging, and it has the potential to reshape how and why data is published openly: decentralized government data aggregation.

    For years, governments were encouraged to publish open data for the benefit of the private sector. This is a terrible incentive for government and its employees, and consequently doesn’t work. Gradually, governments are starting to use their open data programs as a vector for sharing data between government agencies (e.g., a state agency sharing corporate records with localities to enable business tax audits), instead of as a mechanism to provide public data to the private sector for uncertain benefits (with associated handwaving about “apps”). This makes the data available to the public just the same, but uses a different incentive model, one based on how government actually works.

    An early example of this was the White House’s 2013 requirement that federal agencies publish public, machine-readable inventories of their data holdings. The inventories must be published as a file named data.json, at the root of federal domains (e.g., usda.gov/data.json, interior.gov/data.json, commerce.gov/data.json, etc.), providing metadata about every dataset that they hold. The purpose of this was to allow Data.gov to harvest the data for the federal government’s central data repository, but the happy byproduct is that, by requiring that these inventories be published publicly, anybody can have them. (Disclaimer: I was one of the creators of this standard.)
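
    Because the inventories live at a predictable path, harvesting them takes only a few lines. Here’s a minimal sketch using the domains named above; depending on the schema version, data.json is either a bare list of datasets or a catalog object with a “dataset” key, so the sketch handles both.

        # Minimal sketch of harvesting federal data inventories from their
        # predictable location. Depending on the schema version, data.json
        # is either a bare list of datasets or a catalog object with a
        # "dataset" key; this handles both.
        import json
        from urllib.request import urlopen

        DOMAINS = ["usda.gov", "interior.gov", "commerce.gov"]

        for domain in DOMAINS:
            url = "https://%s/data.json" % domain
            try:
                with urlopen(url, timeout=30) as response:
                    catalog = json.load(response)
            except Exception as error:
                print(domain, "could not be harvested:", error)
                continue
            datasets = catalog if isinstance(catalog, list) else catalog.get("dataset", [])
            print(domain, "lists", len(datasets), "datasets")
            for entry in datasets[:3]:  # show a few titles as a sanity check
                print("  -", entry.get("title", "(untitled)"))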

    Another prominent example of this is in the OpenAddresses project—a private effort, but one that is powered by open geodata published by local and state GIS departments. By gathering up data from over 1,300 sources, the project has collected the coordinates of over 240 million addresses. Now the U.S. Department of Transportation is likely to emulate this approach, as they plan their National Address Database. Although this is far from a done deal, decentralized aggregation is probably the only viable path to a national address database, so the odds are good. (Disclaimer: I am a volunteer for OpenAddresses. Apparently I’m a fan of decentralized government data aggregation.)

    Just last month came a particularly exciting development in this emerging method of exchanging government data—U.S. Secretary of Transportation Anthony Foxx asked every local and state transit agency to publish their route data openly. Specifically, he’s called on them to publish it in the General Transit Feed Specification, on their website where anybody can download it, to be placed into the public domain:

    About half of the transit agencies in the United States, including almost all of the largest agencies, already collect this information in a common format, GTFS, and make it available either publicly through their Web site or directly to private companies. Each transit agency sets a variety of restrictive terms on the use of their data. This information can be accessed for analytical purposes by the public, planning agencies, researchers, or Government agencies, but it must be requested on a case-by-case basis.

    The solution is straightforward: a national repository of voluntarily provided, public domain GTFS feed data that is compiled into a common format with data from fixed route systems.

    This will form the basis for a National Transit Map. […] It would be a service to your community and the Nation if your agency would permit DOT to collect your GTFS data from your Web site on a periodic basis so that we can incorporate your agency’s routing and schedule into the National Transit Map. […] We will be placing the compiled information in the public domain as open data.
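
    Part of what makes that request so practical is that a GTFS feed is just a zip archive of CSV files, which is easy for an agency to post and easy for anyone to read. Here’s a minimal sketch of reading one in Python, assuming a feed that has already been downloaded to disk (each agency publishes its feed at its own URL):

        # Minimal sketch of reading a GTFS feed, which is a zip archive of
        # CSV text files. Assumes the feed has already been downloaded to
        # "feed.zip"; agencies publish feeds at their own URLs.
        import csv
        import io
        import zipfile

        with zipfile.ZipFile("feed.zip") as feed:
            with feed.open("routes.txt") as f:
                routes = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))
            with feed.open("stops.txt") as f:
                stops = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

        print(len(routes), "routes and", len(stops), "stops in this feed")
        for route in routes[:5]:
            print(route.get("route_short_name") or route.get("route_long_name"))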

    This is an important new vector for open data. Open data is sometimes better for government than closed data, because open data infrastructure is sometimes more advanced than closed data infrastructure.

    18F, the federal government’s digital services team, summed it up in a tweet last week:

    In a bureaucracy the size of the US government, our experience is the most effective way to make information travel is to make it public.

    — 18F (@18F) March 31, 2016

    Government and its data are too big to share through dedicated systems. But open data scales up to the size of government and its data. Decentralized open data is a promising new vector to make that happen.

Copyright Notices

Text of this site is Copyright © U.S. Open Data under the Creative Commons Attribution 4.0 International (CC-BY) license.

HTML, CSS, images, and design are based on a design forked from PeoplesOpen.Net which was, in turn, forked from Martini’s template, reused here under the MIT License.

Aerial photo of Boston is by Kevin Tostado, reused under a CC-A-NC 2.0 license