U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    The Financial Value of Open Corporate Data to Government

    11 December 2014 Waldo Jaquith

    At the U.S. Open Data Institute, we recently used government data to audit a small city’s business licenses, and we found that 30% of the city’s businesses were operating without a license. The city is losing out on somewhere between hundreds of thousands and millions of dollars of business license revenue—as much as 2% of the city’s annual budget. It’s a compelling case for the significant economic value of open data to government. This is the story of how it happened.

    Opening Virginia’s Data

    The state of Virginia does not provide open data about corporations. They have a website where people can search, one business at a time, and read some of the data that the State Corporation Commission stores about it. The only way to get a list of all 786,308 registered corporations is to pay the SCC $450 for a three-month data subscription and sign a five-page contract.

    In April, I signed that contract and wrote that check, making me only the seventh paying customer for this data. (Within minutes of getting access to the data, I began giving it away for free, thus preventing anybody else from having to pay for it.)

    The data was a mess. It’s in no particular format, other than fixed-width ASCII, and it’s thick with inconsistencies, errors, and character encoding problems. Here’s a sample:

    Turning this into useful data required creating an entire software project, which I’ve developed over the course of nights and weekends throughout the spring, summer, and fall (and is by no means finished). This package cleans up the data and transforms it into CSV and JSON.
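
    The cleanup package itself is too long to reproduce, but the core transformation is simple to sketch: slice each fixed-width record into named fields and write the result back out as CSV. Here is a minimal sketch in Python; the field names, offsets, file names, and encoding are invented for illustration, since the real SCC files define many record types, each with its own layout.

        import csv

        # Hypothetical field layout; the real SCC files define many record
        # types, each with its own widths.
        FIELDS = [
            ("corp_id", 0, 7),
            ("name", 7, 57),
            ("status", 57, 59),
            ("city", 59, 89),
        ]

        def parse_line(line):
            """Slice one fixed-width record into a dict of trimmed fields."""
            return {name: line[start:end].strip() for name, start, end in FIELDS}

        # "latin-1" is a guess at the encoding; the data's real encoding
        # problems are part of what the cleanup project has to handle.
        with open("scc_corporations.txt", encoding="latin-1") as src, \
             open("scc_corporations.csv", "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=[f[0] for f in FIELDS])
            writer.writeheader()
            for line in src:
                writer.writerow(parse_line(line))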

    I had Elasticsearch ingest the data (in lieu of a database), and hacked together a website to search the data.
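
    For the curious, that ingestion step can be as simple as bulk-loading the cleaned JSON records and running a match query against them. A sketch using a recent version of the official Elasticsearch Python client; the index name, field names, and file name are assumptions:

        import json

        from elasticsearch import Elasticsearch, helpers

        es = Elasticsearch("http://localhost:9200")

        # Records produced by the cleanup step (assumed to be a list of dicts).
        with open("scc_corporations.json") as f:
            records = json.load(f)

        # Bulk-index every record into a "corporations" index.
        helpers.bulk(es, ({"_index": "corporations", "_source": rec} for rec in records))

        # A simple full-text search on an assumed "name" field.
        results = es.search(index="corporations", query={"match": {"name": "evergreen"}})
        for hit in results["hits"]["hits"]:
            print(hit["_source"]["name"])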

    This experimental, unadvertised site hadn’t been up for long when I got an e-mail from an employee of a municipal government in Virginia. She’d stumbled across the site, and hoped I could provide her with a list of every business located in her town. The reason, she explained, was that they had no way of knowing what businesses existed within their boundaries. The state doesn’t just withhold data from the public—they withhold it from localities, too. They had no way of knowing which businesses were failing to pay their business license fees or business property taxes.

    Providing her with that data took a couple of months. That’s because the addresses in the data were wildly inconsistent, sometimes wrong, and of course not geocoded. Paying to geocode all of the records would have cost thousands of dollars, so I had to figure out how to get access to the state’s geocoder. The state’s geodata department was extremely helpful, and soon I had a script geocoding all of the addresses. I was able to send several thousand records to that municipal employee, as a spreadsheet. It remains to be seen what they’ll do with those records, if anything.

    Using Virginia’s Data in Charlottesville

    Remembering that I’m acquainted with the commissioner of the revenue—the tax collector—in my home of Charlottesville, I generated the same spreadsheet for him. But his office is limited by the proprietary software that it uses, and the only way for its staff to figure out which businesses were registered with the state but unlicensed with the city was to have somebody look up each of them, one at a time, an impractical task. So they sent me a spreadsheet of all 4,253 licensed businesses in Charlottesville (data that should be open, but that is not). Their records don’t use corporate identifiers, and their street addresses aren’t normalized, so I used the Census geocoder to standardize the addresses and wrote some code to match up state records with city records.
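
    That matching code isn’t reproduced here, but its general shape can be sketched: run each address through the Census Bureau’s geocoder, use the standardized address it returns as a join key, and flag state-registered businesses with no corresponding city license. The endpoint and response fields below reflect my understanding of the public Census geocoding API, and the file names and column names are assumptions.

        import csv

        import requests

        CENSUS_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

        def standardize(address):
            """Return the Census geocoder's standardized form of an address, or None."""
            resp = requests.get(CENSUS_URL, params={
                "address": address,
                "benchmark": "Public_AR_Current",
                "format": "json",
            })
            matches = resp.json()["result"]["addressMatches"]
            return matches[0]["matchedAddress"] if matches else None

        def load_addresses(path, column):
            """Map standardized addresses to their source rows."""
            with open(path, newline="") as f:
                return {standardize(row[column]): row for row in csv.DictReader(f)}

        state_records = load_addresses("state_corporations.csv", "address")  # hypothetical
        city_licenses = load_addresses("city_licenses.csv", "address")       # file and column names

        unmatched = [row for addr, row in state_records.items()
                     if addr is not None and addr not in city_licenses]
        print(len(unmatched), "state-registered businesses with no matching city license")

    This glosses over the messy parts, such as several businesses sharing one address and names that need fuzzy matching, but it shows the shape of the join.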

    The result was a spreadsheet of 1,900 businesses that list an address in Charlottesville as their place of business with the state, but that do not have a business license. That’s 30% of all Charlottesville businesses.

    That does not mean that there are 1,900 businesses that are breaking the law. Many of these corporations do not need to be licensed, because they don’t transact any business. Others are licensed by the state (e.g., insurance agencies), and are exempt from municipal license requirements. Some of them are businesses that are domiciled in Charlottesville, but that do all of their business in a different locality (e.g., a restaurant), and are presumably licensed in that locality. I even have a corporation that’s among those 1,900, but it’s just a shell corporation—it doesn’t engage in any kind of business—and as best I know, it doesn’t require a license.

    However, there are some names on the list that I know should be licensed, but are not. How many such corporations there are remains to be seen—determining that will require a proper audit by the city’s financial staff, which they may or may not ultimately do.

    The average Charlottesville business pays $1,588 in annual license fees. If all 1,900 businesses were supposed to be licensed, that would be an extra $3 million in annual income for the city, or about 2% of the city’s annual budget. If only 20% of those 1,900 are wrongly unlicensed, that’s an additional $600 thousand in annual revenue. If that estimate is right, and Charlottesville is a representative sample, then Virginia localities are collectively missing out on over $100 million each year. But these are wild estimates—again, a proper audit is required to have any real data to work with.
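
    For transparency, a quick check of the arithmetic behind those figures:

        avg_license_fee = 1_588    # average annual Charlottesville business license fee
        unlicensed_count = 1_900   # state-registered businesses with no city license

        print(avg_license_fee * unlicensed_count)               # 3,017,200: "about $3 million"
        print(avg_license_fee * round(unlicensed_count * 0.2))  # 603,440: the "$600 thousand" figure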

    Conclusions

    The value of corporate registration data to localities makes it a poster child for the value of open data within government and between governments.

    States should publish corporate registration data as open data, both as bulk data and as an API. By working with a representative sample of localities in the development phase, they can maximize its utility. For example, they should provide an API-based search by name, address, etc. with fuzzy matching, since many localities don’t store corporate identifiers. They should also provide a bulk service, to which a CSV file can be uploaded, which will identify the state corporate ID for each record, add the ID, and return the enriched file. And, finally, they should periodically send a spreadsheet of all corporate record changes to every municipality.
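
    To make that bulk service concrete, here is a hypothetical sketch of what it could look like: a locality uploads a CSV of its records, the service appends a state corporate ID column, and the enriched file comes back. The endpoint, column names, and the lookup function are all invented for illustration; no state offers this today.

        import csv
        import io

        from flask import Flask, Response, request

        app = Flask(__name__)

        def lookup_state_id(name, address):
            """Placeholder for a fuzzy match against the state's corporate registry."""
            return ""  # a real service would return the state's identifier, or blank if no match

        @app.route("/bulk-enrich", methods=["POST"])
        def bulk_enrich():
            uploaded = io.StringIO(request.files["file"].read().decode("utf-8"))
            rows = list(csv.DictReader(uploaded))

            out = io.StringIO()
            writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()) + ["state_corporate_id"])
            writer.writeheader()
            for row in rows:
                row["state_corporate_id"] = lookup_state_id(row.get("name", ""), row.get("address", ""))
                writer.writerow(row)

            return Response(out.getvalue(), mimetype="text/csv")

        if __name__ == "__main__":
            app.run()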

    Where possible, localities should use software that stores state corporate IDs, and that provides robust import/export functionality. There is little competition in this space at present, but given the economic value of sharing this data, the market should address this on its own.

    Secretaries of state should establish corporation data schemas for states to adhere to, so that vendors don’t need to accommodate fifty different standards. The National Association of Secretaries of State (NASS) is a good candidate for this. Realistically, several states should collaborate to put together a common practice first, adopt that as an informal schema, and while NASS chews on that for a couple of years, it can become the de facto standard that is eventually ratified.
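
    As a strawman for what such a schema might cover, a single registration record could look something like the following; every field name here is hypothetical, not drawn from any existing standard.

        # A strawman corporate registration record for a hypothetical common schema.
        record = {
            "state": "VA",
            "corporate_id": "07264381",          # the state's own identifier
            "name": "Evergreen Terrace Consulting LLC",
            "entity_type": "LLC",
            "status": "active",
            "formed": "2012-03-14",
            "registered_agent": "Jane Q. Public",
            "principal_office": {
                "street": "742 Evergreen Terrace",
                "city": "Springfield",
                "state": "IL",
                "zip": "62704",
            },
        }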

    I speculate that the value of this data goes well beyond local license revenue. Is Virginia’s State Corporation Commission sharing corporate registration data with the Department of Taxation? Or the Department of Professional and Occupational Regulation? Or the Department of Health Professions, the Virginia Employment Commission, or the Department of Business Assistance? There are clear business cases for each of those examples, allowing the state to increase revenue and strengthen its regulatory framework without increasing taxes or passing any new laws.

    It’s time to open up corporate registration data, to help government to operate more efficiently and effectively.

    CKAN Multisite Draft Proposal

    08 December 2014 Waldo Jaquith

    We want to make it much easier for governments to create and host open data repositories. To that end we’re proposing a “CKAN Multisite” project, to make it easy to centralize dozens or hundreds of CKAN sites on a single server, driving the cost of an individual repository towards free. We hired the Open Knowledge Foundation—the creators of the CKAN data repository software—to consider our half-baked concept and turn it into a series of actionable proposals. Their draft proposal is now available as Markdown and as a PDF.

    We’d like feedback about this throughout the week, before we break up these tasks into components, issue RFPs, and award contracts to begin the work. We’d like to hear from CKAN experts about which components of this should be bid out collectively versus broken up, and about which of these enhancements best advance other goals, even if those goals are unrelated to our project. We’d like to hear from operators of open data repositories about how we can improve on this proposal to better serve their needs. We’d like to hear from people in government about whether this would help make it easier for them to publish data. And, of course, from anybody who has ideas about how to improve on this.

    Naturally, we have a GitHub repository for this project, and feedback is best provided there. (Or, if you prefer privacy, it can be sent to me via e-mail.)

    • • •

    Here’s the motivation behind this project.

    When somebody in government wants to publish data, there’s a yawning chasm between that desire and their ability to do so. That chasm can only be bridged with time or money. (There are a few free hosting options, so this is improving.) I posit that if a government employee could set up a data repository with no money, a minimal amount of time, and no contract to be signed, we’d see a lot more data being opened up, especially at the state and municipal level.

    Back in July, I wrote that we need to make it trivial to deploy a new instance of CKAN, the popular open-source data repository software. CKAN is famously difficult to install and configure under anything other than ideal circumstances. It seemed like it would make sense to create a master CKAN hosting service, where many CKAN instances could live. State governments could use it to provide free hosting to their agencies or municipalities, or metropolitan planning organizations could use it to provide hosting of planning data. Hosting companies could use it to provide low-cost data repository hosting. There are dozens of applications.

    Right now, running dozens of CKAN instances would be laborious and expensive. Each would need its own server (if virtualized), and require security updates and maintenance.

    But, with some improvements to CKAN, the required server resources could be made quite minimal, costing only a few dollars a month per site. Standing up a new site could be automated entirely. That gap between desire and ability could be reduced to the time it takes to fill out a registration form and press “Submit.”
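
    To make the automation idea concrete, here is a purely hypothetical sketch of what that “Submit” handler could boil down to: create a per-site database, create a per-site configuration, and route a hostname to the shared application server. The ckan-multisite command shown here does not exist; building something like it is the point of the proposal.

        import subprocess

        def provision_ckan_site(site_name, admin_email):
            """Stand up a new CKAN instance under a shared multisite deployment."""
            # A per-site PostgreSQL database (createdb is the standard PostgreSQL tool).
            subprocess.run(["createdb", f"ckan_{site_name}"], check=True)
            # A per-site configuration and hostname, via a hypothetical multisite CLI.
            subprocess.run(
                ["ckan-multisite", "create",
                 "--site", site_name,
                 "--admin-email", admin_email],
                check=True,
            )
            return f"https://{site_name}.example.org"

        print(provision_ckan_site("springfield", "opendata@springfield.example.org"))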

    Why do this with CKAN? Because, of all of the open source data repository packages, CKAN is the most robust and the most popular. Commercial software is out of the question, of course, because without access to the source code, we can’t make improvements to it.

    We’re breaking these improvements up into pieces, rather than making a monolithic project out of it. That’s because it may well fail. There may be a reason why this won’t work in CKAN, something that we haven’t seen yet. Maybe we’ll get hung up on technical problems that we can’t resolve. Maybe it’ll cost way more than we thought, and we’ll run out of money. So rather than make a monolithic improvement to CKAN, we want to make a series of small ones, each one of which will make it easier to start a new open data repository. The whole will be greater than the sum of its parts, but the parts won’t be too shabby either.

    Please take a few minutes to read the draft proposal, and let me know if you have any thoughts about how to improve it before we get to work.

    The Irrationality of Publishing Open Data

    12 November 2014 Waldo Jaquith

    About a year ago, I was approached by a state agency that needed advice about how to open a dataset. This agency is charged with assembling voter information collected by localities throughout the state, and they were interested in opening up an anonymized version of that dataset. I talked with a couple of their developers about potential applications of that data, how to normalize it, and other technical matters. Then one of the developers interjected.

    What happens, he asked, when a legislator discovers that some of our data is inaccurate? He explained that his agency’s job was just to collect and disseminate the data—they had no way of knowing whether localities were collecting it correctly. Inevitably, some of it was wrong. If I put together a system to publish precinct-level voter registration data, and we say that there are more voters in a precinct than there really are, couldn’t I wind up getting hauled in front of a committee as Exhibit A in a voter fraud hearing?

    Yes, of course, that was absolutely possible, I agreed, after some stammering.

    So best case, he said, we publish this data and nobody gets angry about the mistakes that are bound to be in it. But “open data” isn’t in my job description. My annual evaluation isn’t based on this work. I don’t get a raise or a promotion if this goes well. So it does nothing for me.

    I had to concede the point.

    And worst case, he said, I get fired after being publicly humiliated in front of a legislative committee.

    Again, I had to concede the point.

    So…why would I do this?

    He was right—there was no reason why he should open this data. That was the end of the conversation, and, it turned out, the end of the project.

    ···

    That conversation has stuck with me. I relate it to somebody every week or two. It illustrates perfectly a significant conundrum of open data: there is no incentive model that makes it rational for government employees to publish open data. The safe thing to do is to publish bland, unobjectionable, low-value data.

    One solution to this is probably the hardest possible solution: culture change. So long as there are legislators who are willing to seek out small mistakes to pounce on to score political points, and so long as agencies don’t support agile development and its iterative approach (clear up to the head of the agency), then it will be irrational for government employees to publish datasets that could be used to make them look bad.

    There needs to be somebody at the top of the organizational chart who will put in writing a commitment to agile, iterative development, who will go to the mat in defense of that process, and who can make a strong case for the need to get data in the open, even if it has mistakes (maybe especially if it has mistakes), so that those mistakes can be aired, identified, and corrected.

    Ideally, the legislature should be brought into this process, or at least aware of it. This is, confessedly, wildly unlikely to happen.

    And will this culture change solve our problem, making it rational to publish open data, or at least not irrational? I don’t have any idea. Probably not. It might make it better. It probably won’t make things worse. But I suppose it could.

    This is a hard problem. There’s no five-step plan, new software, or clever trick that will solve it. I’m optimistic that solutions will bubble up from local governments over the next few years, as long as we all stay cognizant of the problem, and on the lookout for solutions. The viability and impact of open data depends on it.

    How to Run a Hackathon (If You Must)

    10 November 2014 Waldo Jaquith

    When somebody asks me for advice on how their government agency should run a hackathon, my response is simple: “please don’t.” That’s because it’s very easy to put on a bad hackathon, and it’s very hard to put on a good one.

    Deciding whether to hold a hackathon is about aligning expectations with reality. A hackathon will not result in new software. It will not result in new businesses. It will not be a source of free consulting time for your agency. However, a hackathon may be a way to build a community around your agency’s data, or introduce that data to an existing community. It may be a way to teach people about your data. It may be a way for you to observe how developers attempt to understand and employ your data.

    Any hackathon built on the assumption that it will conclude with the creation of success stories is a hackathon that will fail.

    But I will not tell you how to run a successful hackathon. That’s because Josh Tauberer recently published “How to Run a Successful Hackathon,” rendering redundant any practical or procedural advice that I might offer here. And Laurenellen McCann also published “So You Think You Want to Run a Hackathon? Think Again,” which is all about the big-picture issues behind a successful hackathon.

    However, I do want to describe the unusual hackathon that one government recently held, because I think it presents a compelling new model. In August, Virginia’s Secretary of Technology put on a two-day “Datathon” in Richmond, which included six agencies, with five people representing each agency. They were challenged to build tools using only open data provided by state agencies, using at least two datasets from two different agencies.

    Although there was the usual structure of a competition with winners named, the point of this wasn’t really to create software or websites, and it wasn’t to get the participants to quit their jobs and launch new businesses. It was to get state software developers to drink their own champagne (or “eat their own dog food,” if you prefer, but I doubt that you do). By being immersed in their own data, locked into using only public datasets, they could see where the data lacked key fields, where poor assumptions had been made, where far too much was assumed of third-party developers. They were transported from the government side of the API to the public side of the API, a shift in perspective that is crucial to good API UX.

    If you must run a hackathon for your government agency, take a cue from Virginia, and limit participants to employees of that government. And if you want to open it up to a broader audience, set internal expectations and goals based not on a hyped-up, .com-era concept of what success looks like, but instead based on the more modest goals of welcoming people to your agency, sharing knowledge, and building community.

    Building the Open Addresses Stack

    24 October 2014 Waldo Jaquith

    One type of data that’s particularly in need of opening up is geodata. A lot of people have been working for a long time to improve the state of open geodata, but this is a time of particularly exciting advances for address data.

    An essential component of the geodata ecosystem is “geocoding.” This is the process of turning address data into spatial coordinates. There is nothing about “742 Evergreen Terrace, Springfield IL” that actually tells us where on Earth that address is located, not without some external data to call upon.

    First we have to locate Illinois, and there’s plenty of open data to do that. Then we have to locate Springfield, for which the US Census Bureau’s National Places Gazetteer provides the necessary data. Then we have to find Evergreen Terrace, which is a taller order, but do-able, thanks to OpenStreetMap’s data. But knowing exactly where 742 is? That’s really quite difficult.

    What we need is a database that gives us the latitude and longitude of addresses, including that one, something that looks like this:

    741 | Evergreen Terrace | Springfield | IL | 39.6981 | -89.6194
    742 | Evergreen Terrace | Springfield | IL | 39.6983 | -89.6197
    743 | Evergreen Terrace | Springfield | IL | 39.6985 | -89.6200
    

    This is known as an “address points file.” Put simply, geocoding is the act of using one of these files to automatically convert an address (742 Evergreen Terrace) into coordinates (39.6983, -89.6197).
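
    As a toy illustration of that lookup, here is a sketch that reads a pipe-delimited address points file like the one above into a dictionary and resolves an address exactly. Real geocoders add address parsing, fuzzy matching, and spatial indexing; the file name here is an assumption.

        import csv

        def load_points(path):
            """Map (number, street, city, state) to (latitude, longitude)."""
            points = {}
            with open(path, newline="") as f:
                for row in csv.reader(f, delimiter="|"):
                    if len(row) != 6:
                        continue  # skip blank or malformed lines
                    number, street, city, state, lat, lon = row
                    key = (number.strip(), street.strip().lower(),
                           city.strip().lower(), state.strip().lower())
                    points[key] = (float(lat), float(lon))
            return points

        points = load_points("springfield_il_points.csv")
        print(points[("742", "evergreen terrace", "springfield", "il")])
        # (39.6983, -89.6197)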

    Every geocoding service has to assemble its own data stack—it has to find its own sources of information to know where each address is physically located, and write its own software to do that work. That’s an enormous amount of work, and anybody doing that work needs to charge for access to their data in order to justify the investment of creating and maintaining it.

    There are a handful of private vendors who have assembled and maintain their own address points files and geocoding services. Much of their data is built on top of the US Census Bureau’s TIGER (Topologically Integrated Geographic Encoding and Referencing) data, which isn’t very good, but it’s long served as a good start. From there, vendors draw on a variety of sources of data in order to improve their offerings. Using data from these vendors, dozens of web-based geocoding services exist. All of them cost money, beyond some small threshold. At a small scale—geocoding dozens or hundreds of addresses—it can be free or inexpensive, but at the scale of thousands of addresses, geocoding quickly becomes an expensive proposition.

    Enter open data.

    Imagine if, instead of laboriously assembling collections of address geodata, it was available as open data, provided by each municipality and state. Then it would only need to be harvested, a trivial process. And imagine if, instead of writing geocoding software to use that data to match addresses to coordinates, there was open source software that could be used. Then it would be simple to create a bespoke geocoder for a project, for a government to create a geocoder for their jurisdiction, or to start a commercial geocoding service.

    That’s not theoretical. That’s happening. Municipal and state GIS agencies are moving away from the model of selling access to their own geodata, having found that it’s cheaper to give it away. (Plus, if localities sell that data, it becomes hard for states to aggregate it and use it for their purposes—the for-pay data pollutes the openness of the rest of the data.)

    Forward-thinking GIS officers around the country are publishing address points data. Shelby Johnson, Arkansas’ State Geographic Information Officer, maintains a blog where his staff announces each update to their statewide address points file. Johnson is the president of the National States Geographic Information Council, and has established a model for other state GIS officers.

    The OpenAddresses project is at the fore of the open address points movement, collecting address files from government sources all around the world. They’re building up an open collection of links and data about every address points file, available for anybody to do anything with.

    Things are going even better on the software front, with two great open source geocoders having been released recently. Mapzen’s Pelias is an Elasticsearch-powered geocoder and reverse geocoder (that is, it can provide the address closest to a given latitude and longitude), and it even bootstraps itself with TIGER, OpenStreetMap, and OpenAddresses data. And then there’s Code for America’s Streetscope, which likewise relies on Elasticsearch. Instead of bootstrapping a whole nation’s worth of data, Streetscope was created for Lexington, KY—so you only import the OpenAddresses data for the geographic region that you want to include in the geocoder.

    There is a non-trivial amount of work to be done to create a completely open geocoding stack, and the remaining work is almost entirely in the realm of data collection. The OpenAddresses team is making rapid progress, with the growing open geodata movement within government providing the fuel for that progress.

    With a great deal of work by a great deal of people over the next couple of years, we should expect that most of the United States’ addresses will be available within open address points files. The cost of geocoding will rapidly approach zero, making it possible to geocode and thus make discoverable vast quantities of data that are now just structured text.

