A Hard Number on Incomplete Repositories

Last year, I wrote that incomplete data repositories should be considered harmful:

A reasonable person looking for government data who looks at a government’s repository and doesn’t find that data would conclude that it does not exist. As a rule, this is not true. I am not aware of any government data repository that is complete. Most of them are not even close to being complete. In my experience, the majority of a government’s existing, published, online data holdings are not included in their data repository.

Having completed our census of U.S. states’ open data holdings, we now have numbers to confirm this. Using the raw census data, I tallied the datasets published by states that have data repositories, sorting them by whether each one is found in that state’s repository. (This, of course, covers only the nine types of datasets that we inventoried: legislation, spending, address points, etc.)

It turns out that 73% of datasets published by states with data repositories are not found in the repository. Those datasets are neither hosted in the repository nor linked to from within it.
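For anyone who wants to run a similar tally, here is a minimal sketch of the calculation in Python. It is not the script used for the census: the `Dataset` record, its field names, and the sample rows are all hypothetical, invented only to show the filtering and counting, and the real census data is laid out differently.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """One census row. All field names here are hypothetical."""
    state: str
    kind: str            # e.g. "legislation", "spending", "address points"
    published: bool      # the state publishes this dataset online somewhere
    in_repository: bool  # hosted in, or linked from, the state's repository


def share_missing_from_repository(records, states_with_repository):
    """Return the fraction of published datasets, in states that have a
    repository, that are not found in that repository."""
    eligible = [
        r for r in records
        if r.published and r.state in states_with_repository
    ]
    if not eligible:
        return None  # nothing to measure
    missing = sum(1 for r in eligible if not r.in_repository)
    return missing / len(eligible)


# Made-up rows for illustration only; these are not census results.
sample = [
    Dataset("State A", "legislation", True, False),
    Dataset("State A", "spending", True, True),
    Dataset("State A", "address points", True, False),
    Dataset("State B", "legislation", True, False),
    Dataset("State B", "spending", False, False),  # not published: excluded
    Dataset("State C", "spending", True, False),   # no repository: excluded
]

share = share_missing_from_repository(sample, {"State A", "State B"})
print(f"{share:.0%} of published datasets are missing from the repository")
```

The two exclusions matter: a dataset that a state does not publish at all cannot be missing from its repository, and a state with no repository has nothing to be missing from.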

When only 27% of extant core datasets are found on the site where the public is directed to go to find data, something has gone terribly wrong. Operators of state data repositories have got to inventory key state data holdings to ensure that they’re listed within the repository, and they also have to audit the search behavior on their sites to identify what data people want but cannot find. We’ve got to do better.