Incomplete Repositories Considered Harmful

An open data repository has become a totem of the open government data movement. No open data effort is considered to be complete—or perhaps even really begun—without a data catalog. Generally at an address like data.placename.gov, it functions as a central site that inventories the data holdings of that government, making it easy for people to find data from a given government.

A reasonable person looking for government data who looks at a government’s repository and doesn’t find that data would conclude that it does not exist.

As a rule, this is not true. I am not aware of any government data repository that is complete. Most of them are not even close to being complete. In my experience, the majority of a government’s existing, published, online data holdings are not included in their data repository.

Publishing a data repository while omitting data should be considered harmful. An important part of launching a data repository is inventorying existing published data holdings, such as with “Let Me Get That Data For You.” Existing data holdings should serve as the basis for a data repository, and all of them should be included.

Note that this does not require actually hosting all data files in the repository. Internal politics can make it infeasible to copy one agency’s data files over to another agency’s repository, no matter how relevant that those files are. But there is no reason why a repository cannot contain an entry about and a link to data, wherever it may live. (In some government agencies, it is claimed that there is a “rule” against so much as linking to a page on another agency’s website. If you are told this, demand to see a copy of that rule. That rule does not exist.)

In fairness to governments with incomplete data repositories, the process of inventorying data holdings is non-trivial. It’s a lot easier now that “Let Me Get That Data For You” exists, but it’s nearly as simple as it should be. As a standard part of the setup for any repository software, it should offer to inventory all holdings at a given domain name or set of doman names, filtering out useless files, highlighting great ones, gathering metadata from each file, and auto-populating the repository with those files, after an approval step. Creators of repository software should add that functionality, and soon. In the meantime, the operators of government data repositories need to take the steps necessary to ensure that their catalog is complete.