U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    A Lightweight Open Data Catalog (For Real)

    28 March 2016 Waldo Jaquith

    A screenshot of JKAN

    There’s a healthy market for open data repository software. There’s ArcGIS Open Data, CKAN, DKAN, Junar, OpenDataSoft, and Socrata. It might not seem like there’s space in the market for a new contender, and yet a new entrant makes a good argument for its own necessity.

    Philadelphia CDO Tim Wisniewski has created JKAN, a data catalog modeled on CKAN, but built in Jekyll. The GitHub-friendly, Markdown-based website generation system has made static sites popular again, especially when coupled with the power of JavaScript, and Wisniewski has used it to make a simple, lightweight data catalog.

    Via email, I asked Wisniewski a few questions about JKAN.

    * * *

    There are a bunch of data repository options out there. Why create something new?

    Just about every government website has data on it already. The first version of OpenDataPhilly was basically a card catalog of links to data that the city government published. What that did was connect the dots between disparate datasets, so people could see the picture that it formed, and what the benefit was, resulting in enough momentum for an open data program to form inside the government, with resources behind it.

    That step is really important for getting open data off the ground in cities. Robust APIs, data standards, and in-depth analysis tools come later.

    The idea behind JKAN is to make this step really easy, and enable a champion for open data in a government agency to do this themselves, without needing tens-to-hundreds of thousands of dollars allocated first, or having to set up servers at the command line.

    Architecturally, JKAN is quite different from other repository software. But its modularity, simplicity, and static nature fit into a larger trend in software. Could you tell me about the decisions that went into designing and creating it?

    The core functionality of a data catalog is making it easy to find datasets and the links to download them. JKAN accomplishes this without needing a database. Under the hood, there are a bunch of text files, one representing each dataset, containing its title, description, download links, etc. When you modify one of those files, JKAN regenerates the website from the text files. The website can then be served by any simple web server, because the pages are just HTML files. JKAN goes a step further and provides a user-friendly interface for modifying the text files. Because of this simple architecture, you can run a JKAN site for free on GitHub.
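    Under that architecture, each dataset is just a small text file of metadata. A JKAN-style dataset file might look something like the following sketch; the `_datasets/` location, field names, and the dataset itself are illustrative assumptions, not taken from the project:

```yaml
# _datasets/bike-lanes.md — one text file per dataset (hypothetical example)
---
title: Bike Lanes
organization: Department of Streets
notes: Centerlines of all bike lanes in the city.
resources:
  - name: GeoJSON
    url: https://example.gov/data/bike-lanes.geojson
    format: geojson
  - name: CSV
    url: https://example.gov/data/bike-lanes.csv
    format: csv
---
```

    Editing a file like this (or adding a new one) is what triggers the static site to regenerate, which is why no database or application server is needed.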

    What’s missing is a “data store.” Data stores allow users to actually upload the data rather than just link to it. These are helpful, but account for the majority of the cost and headache. For now, why not just put your data on Dropbox, Google Drive, an FTP site, the city’s website, or even a GitHub repo? Sure, those options won’t give you robust APIs, but that probably shouldn’t be your top priority if you’re trying to get an open data program off the ground anyway. Philly hosted most of its data on GitHub for the first year of its open data program, before purchasing a data store, and even today, the data store and data catalog are two distinct products.

    Have folks from other governments expressed interest in JKAN yet?

    Well, to be honest, I started it as a proof of concept. The recent addition of user-friendly interfaces for editing the datasets, though, has got me thinking that it could actually be a thing. I’ve been developing it in the open on GitHub, but there are still a few features I’m working on before it’s ready for production use. Now would be a great time for collaborators to jump in, though, if anyone’s interested.

    What does success look like for JKAN?

    • Standing up a simple data catalog is no longer a barrier for a government’s entry into open data, so they can focus on the hard parts like getting buy-in.
    • A few more folks get behind it and help make improvements and bug fixes, so that people are comfortable enough to use it.

    Of the work remaining to be done, are there folks with any particular skill sets that you need help from, or individual issues that you’d like to see people tackle that are perhaps outside of your skills or time capacity?

    I’d love for someone to think about the user experience from the ground up. The layout, terminology, and functionality are almost entirely based on (a slimmed-down version of) CKAN. And for folks familiar with CKAN, this is comfortable. But how would we design a data portal that also caters to people who are new to open data? If the purpose of the product is to demonstrate the value of open data, how can its design further that? JKAN is also sorely lacking aesthetically and would greatly benefit from some web design.

    * * *

    For more information about JKAN, or to contribute to it, see the project’s GitHub repository, or follow the one-step guide to have your own repository running in literal seconds.

    Opening Data While Prohibiting Using It

    15 March 2016 Waldo Jaquith

    An essential part of publishing data openly is doing so under a license that allows people to use that data. Copyrighted data isn’t open data; the two are mutually exclusive. Governments that copyright their open data are saying “here is our data, please come use it” while also saying “you may not use this data.” Doing so means that serious users of the data will be scared off before they can put it to work.

    Findings

    Our U.S. states open data census shows that about half of all surveyed datasets are encumbered by copyright restrictions.

    Of the 356 extant, surveyed datasets, 171 (or 48%) are restricted by claims of copyright. 174 (or 49%) have no restrictions. For 11 (3%) of them, it’s unclear whether they’re copyrighted or not (e.g., the link to the copyright page is broken).

    The prior paragraph, graphed.

    Datasets without restrictions

    For those datasets without restrictions, that’s overwhelmingly because the site is silent on the matter of copyright. We tally silence as placing the data in the public domain. Note that this is the opposite of the rule for private works, which enjoy copyright protection even without a statement of copyright; works of government, by contrast, are generally in the public domain. (Some states hold that they may claim copyright over their works, and some hold that they may not, with significant nuance between those two extremes.)

    In rare cases, there is a direct disclaimer of copyright. For example, the datasets in Washington D.C.’s repository link to a copyright page that puts all data in the public domain by default:

    Unless otherwise noted, the data on the Sites is public domain and made available with a Creative Commons CC0 1.0 Universal dedication. In short, the District waives all rights to the data worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute, and perform the data, even for commercial purposes, all without asking permission. The District makes no warranties about the data, and disclaims liability for all uses of the data, to the fullest extent permitted by applicable law.

    D.C.’s copyright policy is imperfect, but it’s also the best effort made by any U.S. state.

    Copyrighted datasets

    There are various ways in which surveyed datasets are restricted by copyright.

    The most common restriction comes from agencies putting a copyright statement in their website footer, presumably placed there reflexively by website developers. For example, Puerto Rico’s population projections include “Junta de Planificación de Puerto Rico © 2015” (“Puerto Rico Planning Board © 2015”) at the bottom of the page. Agencies use that template for pages where data is published, without disclaiming that copyright for the data on the page.

    Also common is the use of statewide website design standards and templates, in the same manner as agency templates. For example, West Virginia has “© 2016 State of West Virginia” at the bottom of all of their pages, including pages where they publish open data (e.g., corporations, vehicle crashes, and state real estate).

    Finally, some vendors claim copyright over state data that they host. For example, restaurant inspection record host HealthSpace claims copyright in the footer of their hosted client sites with “Copyright © 2014-2016 - HealthSpace Informatics, Inc.” (e.g., Wisconsin, Virginia). They probably don’t mean to claim that they own these states’ restaurant inspection records, but that’s exactly what they’re doing. Likewise, Socrata claims copyright over all of their Open Checkbook sites (e.g., Iowa) with “© Socrata” in the footer of every page.

    There is a subset of states that seem to claim copyright over data deliberately. For example, Vermont’s geodata site has a “Warranty and Copyright Notice” (published as a PDF, for maximum irony) that says that “certain information contained within is the intellectual property of the State of Vermont,” but doesn’t specify what information. Vermont also prohibits the resale of state data and requires that its copyright claims travel with the data.

    Next steps

    Here are three things that governments (state and otherwise) can do to address this problem:

    1. Claw back claims of copyright on data pages. If the site’s footer claims copyright, explicitly disclaim that copyright, affirmatively placing the data in the public domain (or, better, publishing it under a Creative Commons Zero dedication—“public domain” is not meaningful to users of that data outside of the United States).
    2. When publishing data within repository software, use its built-in metadata functionality to specify that the data is public domain.
    3. Remove that troublesome copyright footer entirely. Just because a government can claim copyright doesn’t mean that it needs to, over data or otherwise.
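    For step 2, many repositories publish dataset metadata in the Project Open Data / DCAT style, which includes a license field. A sketch of what a public-domain declaration might look like in such a record follows; the field names follow the federal data.json schema, but the dataset itself is invented for illustration:

```json
{
  "title": "Vehicle Crashes, 2015",
  "description": "Statewide crash records, one row per crash.",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "distribution": [
    {
      "downloadURL": "https://data.example.gov/crashes-2015.csv",
      "mediaType": "text/csv"
    }
  ]
}
```

    A machine-readable license URL like this one lets harvesters and reusers determine the data’s status without hunting for a copyright page.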

    Governments have to end the practice of publishing data while simultaneously prohibiting people from doing anything with it, or risk fatally hobbling their open data efforts.

    2016 Open Data Pioneer Award

    11 March 2016 Waldo Jaquith

    We’re pleased to announce the 2016 recipient of our Open Data Pioneer award: Connecticut Chief Data Officer Tyler Kleykamp.

    ···

    In the fall of 2014, I stood in an auditorium at Yale University, addressing the Connecticut Data Collaborative’s conference about the state of open data in Connecticut. Reading from prepared remarks, I said that I would award the state a C grade for its data repository. I hadn’t regarded this as an inflammatory claim, but the crowd actually gasped in response.

    In retrospect, this probably had something to do with the fact that Tyler Kleykamp was in the audience, and that people are unaccustomed to such candor. Kleykamp had been on the job for a scant 8 months. Etiquette had been breached. A tiny gauntlet, perhaps, had been thrown.

    Tyler took it in good spirits. He conceded that the grade was probably accurate, and that the provided evidence was correct. But there’s no way that he liked it.

    ···

    One of the people I’ve talked to about Tyler is Andrew Ba Tran, a data journalist with The Connecticut Mirror and Trend CT. Andrew’s work involves a lot of data produced by Connecticut, and that means working with Tyler. Such a relationship could reasonably be adversarial, but theirs is far from it. Andrew gave Tyler the title of “the oracle of open data in Connecticut,” summarizing his work as “coaxing data out of various departments’ clutches.” He described Tyler as a connector—knowing how to get data, even if the state doesn’t have it—and as doing a great job of explaining the benefits of open data. Tyler doesn’t just pass along lousy data: when possible, he’ll improve it, knowing how that can make it more useful to others.

    Scott Gaul, of the Hartford Foundation for Public Giving, also sang Tyler’s praises in a recent conversation. Scott puts on quarterly gatherings of state data folks (in fact, Tyler gave a presentation at the most recent one). He cited Tyler’s work fostering collaboration as one of his great merits—instead of protecting his turf, or trying to take over others’, he connects people to each other, providing them with the tools that they need to succeed. Scott sees Tyler as realistic about how much the state can reasonably accomplish with a data repository, while simultaneously seeking ways to succeed outside of those limits. Tyler’s ability to navigate bureaucracy and politics is matched by his technical ability—the perfect combination of skills.

    I also checked in with Michelle Riordan-Nold, director of the Connecticut Data Collaborative (which we profiled in 2014). Michelle emphasized that Tyler is fighting an uphill battle in Connecticut, with some agencies reluctant to adopt standard data practices, making his successes harder-won than might be publicly obvious. He’s been happy to collaborate with her organization, going so far as jointly submitting a project proposal to the Knight News Challenge. This kind of collaboration is fantastically rare in state governments.

    ···

    One can’t just go around grading states arbitrarily. My giving Connecticut’s repository a C was one of the moments that led to the creation of the U.S. States Open Data Census, a comprehensive survey and grading of every state in the nation. When U.S. Open Data started surveying the holdings of states, Tyler tagged the datasets in question for Connecticut, to make them easy to inventory. Connecticut emerged with a score of 78% (a C, I hasten to brag), which was the top score of the couple of dozen states inventoried at that point. Tyler immediately set about working to raise his state’s score. He made sure that all inventoried data was published under an open license, complete, verifiable, up-to-date, etc. One major obstacle to a higher score: Connecticut doesn’t aggregate localities’ restaurant inspections, netting them a score of zero on an entire inventoried dataset. No problem for Tyler—he created a whole new site to aggregate localities’ inspection records.

    Now that every state has been graded, Connecticut has come out on top, with a score of 84%. Tyler is, naturally, still looking for ways to increase his state’s grade by improving Connecticut’s offerings.

    ···

    The role of the state CDO is still being defined. The limits of a state data repository are being plumbed. The interplay of powers between agency CDOs and the government’s leading CDO is up in the air. Tyler Kleykamp is doing a great deal to help resolve each of these questions, in the course of working to maintain Connecticut’s position as a leading state in the practice of open data. He’s doing so collaboratively, openly, and innovatively. It is for these reasons and more that we’re pleased to name Tyler Kleykamp our 2016 Open Data Pioneer.

    A Hard Number on Incomplete Repositories

    19 February 2016 Waldo Jaquith

    Last year, I wrote that incomplete data repositories should be considered harmful:

    A reasonable person looking for government data who looks at a government’s repository and doesn’t find that data would conclude that it does not exist. As a rule, this is not true. I am not aware of any government data repository that is complete. Most of them are not even close to being complete. In my experience, the majority of a government’s existing, published, online data holdings are not included in their data repository.

    Having completed our census of U.S. states’ open data holdings, we’ve actually got some numbers to confirm this. Using the raw census data, I tallied up the datasets published by states that have data repositories, based on whether they are found in the repository. (This, of course, relies only on the nine types of datasets that we inventoried: legislation, spending, address points, etc.)

    It turns out that 73% of datasets published by states with data repositories are not found in the repository. Those datasets neither exist in the repository nor are linked to from within it.
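    The tally behind that figure can be sketched as follows. This is an illustrative reconstruction, not the actual analysis: the column names and the toy sample are hypothetical, and the real raw census data is shaped differently.

```python
# Sketch of the tally described above (field names are hypothetical;
# the real raw census data uses its own schema).
def pct_missing_from_repository(datasets):
    """Of datasets published by states that have repositories, return
    the percentage not found in (or linked from) the repository."""
    relevant = [d for d in datasets if d["state_has_repository"]]
    missing = [d for d in relevant if not d["in_repository"]]
    return round(100 * len(missing) / len(relevant))

# Toy example: 8 of 11 published datasets are absent from the repository.
sample = (
    [{"state_has_repository": True, "in_repository": False}] * 8
    + [{"state_has_repository": True, "in_repository": True}] * 3
    + [{"state_has_repository": False, "in_repository": False}] * 2  # excluded
)
print(pct_missing_from_repository(sample))  # → 73
```

    Note that states without repositories are excluded from the denominator, since the question is what a repository fails to contain.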

    When only 27% of extant core data sets are found on the site where the public is directed to go to find data, something has gone terribly wrong. Operators of state data repositories have got to inventory key state data holdings to ensure that they’re listed within the repository, and they also have to audit the search behavior on their sites, to identify what data people want, but cannot find. We’ve got to do better.

    Best-of-Breed State Datasets

    11 February 2016 Waldo Jaquith

    Having completed our census of U.S. states’ open data holdings, we can start studying the aggregate data and drawing some conclusions. (As, indeed, can anybody, using the raw census data.) First up: highlighting the very best examples of each type of dataset. We include nine types of datasets in this census, and following are the states that came out on top.

    Incarceration

    Prison population data—how many people are held in each facility relative to capacity, and how that’s changed over time. Connecticut does this best, earning a score of 95%. It falls short by being incomplete, because the dataset does not include the capacity of each facility.

    Legislative

    Legislators and legislation—what bills are filed and who is filing them. Again Connecticut does this best, earning a score of 95%. Its shortcoming is that the legislature claims copyright over the data (with “© 2016 Connecticut General Assembly” printed in the footer), preventing people from actually doing anything with the data. (This is very common.)

    Population Projections

    How the population of municipalities is forecast to change in the decades ahead. Colorado does a perfect job at this, earning a score of 100%.

    Real Estate

    Buildings and land that are owned or leased by the government. Oklahoma earns a perfect score, providing detailed information about all 13,792 state holdings as CSV and via an API.

    Restaurant Inspections

    The outcome of inspections of food services facilities. New York comes out on top here, with a score of 95%, but with a significant caveat: their licensing terms. All data that they publish on their repository is bound by a confusing, restrictive license, ironically published as a PDF.

    Vehicle Crashes

    Car accidents and detailed data about the circumstances under which they occurred. South Dakota does an excellent job here, earning a score of 90% with their detailed, pipe-delimited bulk data. They lose points for not providing any mechanism by which the public can verify that their copy of the data is accurate (e.g., serving it over HTTPS), and by not including the data in their data repository (because South Dakota does not have a repository).

    Companies

    The state register of all companies in the state. The state of Washington does a perfect job at this, earning a score of 100%, providing data as both tab-delimited text and as XML.

    Address Points

    A list of every address in the state and its coordinates. Washington DC does the best job here, earning a score of 95%. Admirably, they affirmatively place the data in the public domain. They don’t have a perfect score because it could not be determined whether the data is up-to-date.

    Checkbook

    Every expenditure by the state, broken down by municipality. Again, Connecticut does the best job, with a score of 95% for this dataset. The only shortcoming is that there is no description on each transaction indicating what the expenditure was for, although multiple expense categories give a good idea of the purpose.


    A pair of themes emerges here.

    One is that Connecticut seems to do everything well. There were several categories in which Connecticut did very nearly as well as the best state. So, when in doubt, emulate Connecticut.

    Another is that a disproportionate share of these datasets are hosted in a data repository. Perhaps it’s just that states with solid data holdings tend to invest in repositories, but hosting a dataset in a repository often has the effect of increasing its score. That’s because repository software tends to encourage—or can even force—better practices. For instance, Socrata’s repository software uses HTTPS by default, guaranteeing that the data is verifiable (worth 5% of the score), and ensures that machine-readable data is also available in bulk (worth another 5%). Because repository software offers no mechanism for the sale of data, the presence of a dataset within it ensures that it is available without cost (another 5%). And, of course, a dataset being in a repository is itself worth 5%. So the act of putting a dataset into modern repository software yields a 20-point score increase. The lesson here is to establish a repository, because doing so encourages good practices.
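    As a quick check, the 20-point figure is just the sum of the four 5% criteria named above:

```python
# The four census criteria that repository hosting tends to satisfy,
# each worth 5% of a dataset's score (as described in the text above).
repository_benefits = {
    "served over HTTPS (verifiable)": 5,
    "machine-readable data available in bulk": 5,
    "available without cost": 5,
    "listed in a repository": 5,
}
print(sum(repository_benefits.values()))  # → 20
```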



Copyright Notices

Text of this site is Copyright © U.S. Open Data under the Creative Commons Attribution 4.0 International (CC-BY) license.

HTML, CSS, images, and design are based on a design forked from PeoplesOpen.Net which was, in turn, forked from Martini’s template, reused here under the MIT License.

Aerial photo of Boston is by Kevin Tostado, reused under a CC BY-NC 2.0 license.