The Irrationality of Publishing Open Data

12 November 2014

About a year ago, I was approached by a state agency that needed advice about how to open a dataset. This agency is charged with assembling voter information collected by localities throughout the state, and they were interested in opening up an anonymized version of that dataset. I talked with a couple of their developers about potential applications of that data, how to normalize it, and other technical matters. Then one of the developers interjected.

What happens, he asked, when a legislator discovers that some of our data is inaccurate? He explained that his agency’s job was just to collect and disseminate the data—they had no way of knowing whether localities were collecting it correctly. Inevitably, some of it was wrong. If I put together a system to publish precinct-level voter registration data, and we say that there are more voters in a precinct than there really are, couldn’t I wind up getting hauled in front of a committee as Exhibit A in a voter fraud hearing?

Yes, of course, that was absolutely possible, I agreed, after some stammering.

So best case, he said, we publish this data and nobody gets angry about the mistakes that are bound to be in it. But “open data” isn’t in my job description. My annual evaluation isn’t based on this work. I don’t get a raise or a promotion if this goes well. So it does nothing for me.

I had to concede the point.

And worst case, he said, I get fired after being publicly humiliated in front of a legislative committee.

Again, I had to concede the point.

So…why would I do this?

He was right—there was no reason why he should open this data. That was the end of the conversation, and, it turned out, the end of the project.


That conversation has stuck with me. I relate it to somebody every week or two. It illustrates perfectly a significant conundrum of open data: there is no incentive model that makes it rational for government employees to publish open data. The safe thing to do is to publish bland, unobjectionable, low-value data.

One solution to this is probably the hardest possible solution: culture change. So long as there are legislators willing to seek out small mistakes to pounce on for political points, and so long as agencies (clear up to the head of the agency) don’t support agile development and its iterative approach, it will remain irrational for government employees to publish datasets that could be used to make them look bad.

There needs to be somebody at the top of the organizational chart who will put in writing a commitment to agile, iterative development, who will go to the mat in defense of that process, and who can make a strong case for the need to get data in the open, even if it has mistakes (maybe especially if it has mistakes), so that those mistakes can be aired, identified, and corrected.

Ideally, the legislature should be brought into this process, or at least aware of it. This is, confessedly, wildly unlikely to happen.

And will this culture change solve our problem, making it rational to publish open data, or at least not irrational? I don’t have any idea. Probably not. It might make it better. It probably won’t make things worse. But I suppose it could.

This is a hard problem. There’s no five-step plan, new software, or clever trick that will solve it. I’m optimistic that solutions will bubble up from local governments over the next few years, as long as we all stay cognizant of the problem, and on the lookout for solutions. The viability and impact of open data depend on it.