U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    Here’s How We Saved State-Level Open Legal Data

    28 July 2015 Waldo Jaquith

    An open data armageddon is upon us for state legal data, in the form of the Uniform Electronic Legal Materials Act (UELMA, pronounced “you-EL-mah”). The model legislation is being adopted by states throughout the nation, requiring them to make published legal data verifiable, so that people know that it is official, authoritative, and unchanged. On its current course, the law will effectively force states to publish legal data as PDFs, bringing to an end an era of expanding the availability of legal data.

    So U.S. Open Data stepped in, creating a simple, free, open source software product that won’t just prevent this from happening, but will accelerate the trend towards open state legal data. This is the story of how that happened, and the lessons to be learned from it.


    Early last year, Washington D.C. Council attorney David Zvenyach approached us with both a problem and a solution. A member of the Uniform Law Commission, Dave was familiar with UELMA, and worried that its implementation could set back open data.

    UELMA requires that states publish legal data in a manner that is verifiable, which is to say that a person with a copy of a regulation (for example) must have some way of knowing that it is an actual regulation and not a fabrication. It’s this bit that does that:

    [Screenshot of Data Seal]

    To comply, many states are simply looking at publishing all laws as PDFs, using Adobe Acrobat to digitally sign each PDF. That would be a cheap, easy way to comply with the law, with the downside that it could be the end of open legal data within states. That’s because an authentication system based on Acrobat cannot be applied to anything other than a PDF: not to JSON, XML, YAML, or any other format. That would be catastrophic for open data. As states adopt UELMA, they have a year or two to figure out how to implement it, providing a brief window in which this disaster can be headed off.

    Dave’s idea was to create a simple, lightweight file authentication system. After all, this is a problem solved many times over in many other domains. Every software update downloaded by Windows, Mac OS, Android, and the like uses a basic authentication process to ensure that the update is official and unchanged. (For the technically inclined, imagine verifying an MD5 hash over an HTTPS connection.) We knew this was both plausible and palatable to states, because Minnesota’s Office of the Revisor of Statutes had built a similar system in-house, and published a crucial, detailed whitepaper about how they did so. (Minnesota initially bid out the project, and was so appalled by the quotes that they received that they decided to just build the system themselves. Brava to the Minnesota Revisor of Statutes for a bold decision.) Such a system would allow any file to be verifiable under UELMA—not just PDFs, but also JSON, XML, YAML, Word, etc. This would simultaneously solve the problem of government agencies that refuse to provide data because “somebody could change it,” by providing them with a verification mechanism. This would allow us to turn UELMA from an accidental cudgel against open data into a powerful tool in favor of it.
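
    For the technically inclined, that hash-over-HTTPS idea can be pictured in a few lines of code. The sketch below is purely illustrative and is not part of Data Seal: the digest URL and filename are hypothetical placeholders, and it uses SHA-256 in place of the MD5 of the example above.

```python
# A minimal sketch of hash-over-HTTPS verification.
# The URL and filename here are hypothetical placeholders.
import hashlib
import urllib.request

DIGEST_URL = "https://example.gov/regulations/chapter-12.pdf.sha256"  # hypothetical
LOCAL_FILE = "chapter-12.pdf"  # the copy whose authenticity we want to check

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a local file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Fetch the officially published digest over HTTPS (TLS vouches for the publisher).
with urllib.request.urlopen(DIGEST_URL) as resp:
    official_digest = resp.read().decode().strip().split()[0]

if sha256_of(LOCAL_FILE) == official_digest:
    print("Document matches the officially published digest.")
else:
    print("Document does NOT match; it may have been altered.")
```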

    A real problem, a sensible solution. U.S. Open Data was on it.


    Silicon Valley Software Group won the contract. Mike Tigas and Dan Schultz started committing code at the end of April and they finished by June. The whole project cost just $7,500. Throughout the process, Washington D.C. served as the client stand-in, courtesy of Dave. All work was done on GitHub, hosted in the collaboratively maintained /unitedstates/ GitHub organization as a sign of our commitment to a larger cooperative goal. We named the product “Data Seal.”

    [Screenshot of Data Seal]

    The way it works is easy. Data Seal is a self-hosted website that governments can add documents to. Data Seal generates a PGP signature for each file that’s added. Then, when a user wants to verify the validity of a legal material (e.g., a copy of a law), they can visit the agency’s Data Seal website, drag the file onto their browser, and find out whether the document is official or if it’s been altered. (The document-adding and document-verification processes can be performed via the browser or through an API.)
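
    For a sense of what the signing and verification steps look like underneath a system like this, here is a minimal sketch built on GnuPG and the python-gnupg wrapper (the same building blocks mentioned below). It is not Data Seal’s actual code; the keyring path, key ID, passphrase, and filenames are hypothetical placeholders.

```python
# Illustrative sketch only: signing a legal document with a detached PGP
# signature and verifying a copy of it later, using the python-gnupg wrapper.
# The keyring path, key ID, passphrase, and filenames are hypothetical.
import gnupg

gpg = gnupg.GPG(gnupghome="/path/to/agency-keyring")  # hypothetical keyring location

# 1. When a document is added, generate a detached signature for it.
with open("title-8-regulations.pdf", "rb") as doc:
    gpg.sign_file(
        doc,
        keyid="AGENCYKEYID",          # hypothetical signing key
        passphrase="correct horse",   # hypothetical passphrase
        detach=True,
        output="title-8-regulations.pdf.sig",
    )

# 2. When someone wants to check a copy, verify the signature against it.
with open("title-8-regulations.pdf.sig", "rb") as sig:
    verified = gpg.verify_file(sig, data_filename="title-8-regulations.pdf")

print("Official and unaltered" if verified.valid else "Signature check FAILED")
```

    In Data Seal itself, the signing happens when a document is added and the verification happens when a visitor drags a file onto the site or calls the API; the sketch only shows the underlying cryptographic step.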

    In July I presented the finished product to the Administrative Codes and Registers conference, exhorting members to use Data Seal as UELMA passed into law in their states, or to follow Minnesota’s example and write the code themselves. My message was simple: file verification is a trivially-solved technological problem.

    But we knew there remained a major obstacle: that free software is basically worthless to government agencies.


    Governments acquire technology by publishing RFPs and awarding contracts. Developers of open source software aren’t in the habit of responding to RFPs with bids of $0 that say “just go ahead and download my software and use it.” Many agencies lack the in-house knowledge about how to install software, or don’t have the ability to use the services required. So instructions like “throw this up on a Heroku instance, set up a Mailgun account and plug your API key into the config file, which should live on S3” are somewhere between baffling and logistically impossible.

    Although Data Seal is simple to install by many developers’ standards—it’s a basic Django application that uses GnuPG and a Python GnuPG wrapper—it falls into the category of “Free As In Helicopter”: it costs nothing, but requires so much knowledge to set up and operate that deploying it is implausible for most agencies.

    Solution: Make it possible for governments to pay for our free software.

    So late last year we went back to Silicon Valley Software Group, the vendor that built Data Seal, and explained that—congratulations!—they were now the world’s leading experts in Data Seal, and consequently in a great position to sell hosting services. (Of course, anybody could sell Data Seal hosting services—it’s free software.) They thought that was a fine idea, so they went away for a while and worked on what to do about that.


    As of today, Silicon Valley Software Group sells Data Seal hosting, targeted at states that have adopted UELMA. That’s only a dozen today, but will surely be close to 50 within a few years.

    [Screenshot of SVSG’s website]

    They’ve settled on a pricing structure that’s cheap enough to make it the obvious decision for states, and likely below many states’ threshold for putting a contract up to bid.

    Now states can use Data Seal.

    Silicon Valley Software Group makes money, states will save money, the public retains access to legal data. Everybody wins.


    Open data wins rarely come quickly. They require demand by government, accepted standards to rely on, government buy-in to the proposed solution, and market forces to support that solution. In the case of UELMA, this win will come slower still, over the years that it will take for states to adopt and then implement UELMA. In the meantime, we’ll keep giving talks to state agencies, spreading the message that file verification is a solved problem, that they don’t need to fall back on PDFs, and that they don’t need to pay millions of dollars to a vendor to build a custom system. But they also don’t have to pay nothing. They can do something better—pay a little bit.

    Sloan Redoubles Dat Funding

    03 April 2015 Waldo Jaquith

    We’re thrilled to announce that the Alfred P. Sloan Foundation has provided a $640,000 grant to support the development of Dat. Created by Max Ogden, and housed at U.S. Open Data, Dat makes it easy to create automated, reproducible data pipelines that sync. Sloan’s support will allow Dat to bring to scientific data the same automated, distributed workflows that Git brings to source code sharing. They’re funding three full-time positions for the next two years.

    Dat is about more than scientific data—Sloan’s core support is about building its capacity in the sciences, but it will also improve Dat in ways that will serve open data generally.

    One year ago, Sloan provided $260,000 in funding for Dat development, also focused on scientific data use cases. That grant enabled the team to release an alpha version of Dat in August; a beta version is due out imminently.

    We’re all grateful to the Sloan Foundation for their support, and to Sloan’s Josh Greenberg for his guidance and forbearance, and look forward to two years of fruitful work.

    Incomplete Repositories Considered Harmful

    30 March 2015 Waldo Jaquith

    An open data repository has become a totem of the open government data movement. No open data effort is considered to be complete—or perhaps even really begun—without a data catalog. Generally at an address like data.placename.gov, it functions as a central site that inventories the data holdings of that government, making it easy for people to find data from a given government.

    A reasonable person who looks for data in a government’s repository and doesn’t find it would conclude that the data does not exist.

    As a rule, this is not true. I am not aware of any government data repository that is complete. Most of them are not even close to being complete. In my experience, the majority of a government’s existing, published, online data holdings are not included in their data repository.

    Publishing a data repository while omitting data should be considered harmful. An important part of launching a data repository is inventorying existing published data holdings, such as with “Let Me Get That Data For You.” Existing data holdings should serve as the basis for a data repository, and all of them should be included.

    Note that this does not require actually hosting all data files in the repository. Internal politics can make it infeasible to copy one agency’s data files over to another agency’s repository, no matter how relevant those files are. But there is no reason why a repository cannot contain an entry about and a link to data, wherever it may live. (In some government agencies, it is claimed that there is a “rule” against so much as linking to a page on another agency’s website. If you are told this, demand to see a copy of that rule. That rule does not exist.)

    In fairness to governments with incomplete data repositories, the process of inventorying data holdings is non-trivial. It’s a lot easier now that “Let Me Get That Data For You” exists, but it’s not nearly as simple as it should be. As a standard part of the setup for any repository software, it should offer to inventory all holdings at a given domain name or set of domain names, filtering out useless files, highlighting great ones, gathering metadata from each file, and auto-populating the repository with those files, after an approval step. Creators of repository software should add that functionality, and soon. In the meantime, the operators of government data repositories need to take the steps necessary to ensure that their catalog is complete.
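
    As a rough illustration of what the first pass of such an inventory step could look like, the following sketch scans a single page for links that look like data files. It is illustrative only (the starting URL is a hypothetical placeholder), and it is not how “Let Me Get That Data For You” actually works; a real inventory tool would crawl the whole domain and gather metadata for each file.

```python
# A rough sketch of inventorying data files linked from a government page.
# The starting URL is a hypothetical placeholder; a real tool would crawl
# the entire domain, filter and rank files, and collect per-file metadata.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://data.example.gov/"  # hypothetical starting page
DATA_EXTENSIONS = (".csv", ".json", ".xml", ".xls", ".xlsx", ".geojson")

html = requests.get(START_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

inventory = []
for link in soup.find_all("a", href=True):
    url = urljoin(START_URL, link["href"])
    if urlparse(url).path.lower().endswith(DATA_EXTENSIONS):
        inventory.append(url)

for url in sorted(set(inventory)):
    print(url)
```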

    The New Trend of Regional Data Centers

    25 March 2015 Waldo Jaquith

    Across the U.S., there are a half dozen efforts underway to establish regional open data centers for municipal governments. The hurdles to publishing open data are too high and too numerous for the vast majority of municipal governments. By banding together, municipalities can simplify the minimum viable product for municipal open data substantially.

    The Problem

    It’s a practical impossibility for most municipal governments to publish open data. There are 39,044 local governments in the U.S., but those with open data programs number in the dozens. To pick a quasi-arbitrary cap, for governments with fewer than perhaps 100,000 citizens—or 98.7% of subcounty municipal governments—it’s beyond their capabilities. They lack in-house technical expertise. They lack a budget for specialized staff, a data repository, ETL solutions, etc. They’re saddled with lousy, specialized software that has no ability to export data in an open format. Worst of all, they lack clear business cases for why they should open their data holdings.

    By banding together, municipalities can work around these obstacles. A central, coordinating entity can determine what data it would be mutually beneficial for them to share, establish norms for that data, and provide shared infrastructure at a viable per-municipality price point.

    Normally when a group of geographically proximate local governments band together to establish and share common standards, we call it a “state.” But states haven’t shown leadership in the open data space, despite the extensive amounts of data that they collect from localities. Perhaps their priorities lie elsewhere, perhaps they too lack the technical expertise, or perhaps they find the challenge of working with every municipality in their state too daunting. Whatever the reason, states are sitting this out.

    The Solution

    Governmental, non-profit, and educational organizations are stepping into this leadership vacuum. (Governmental organizations like community service boards and metropolitan planning organizations are in a particularly good position to fill this role.) Of the half-dozen regional data centers under development, all are in various stages of planning, none are launched yet, and most of them aren’t public knowledge.

    The Pittsburgh Regional Data Center is the farthest along. Under new mayor Bill Peduto, the city of Pittsburgh has made a strong push for open data in the past year. True to form, they’ve teamed up with the surrounding Allegheny County, the University of Pittsburgh, and Carnegie Mellon University to create a regional data center, supported by a $1.8M grant from the Richard King Mellon Foundation.

    The University of Pittsburgh’s Center for Social and Urban Research (UCSUR) is spearheading the project, after spending the better part of the past decade as a data intermediary in the region. I talked with UCSUR’s Bob Gradeck, who is heading up the project for the organization. He explained that with 120-odd municipalities in the greater Allegheny County area, the challenge is substantial, but so is the potential payoff. They hope to have 10 local governments as a part of their program in a year, working with those governments to find ways that sharing data can help them. The idea isn’t to lead government, but instead to provide the infrastructure and resources to support existing government initiatives with better data. There are people already doing good work who just need support.

    The Regional Data Center will provide the technical and administrative support to establish a data repository for each participating municipality. They’ll identify the best data practices that exist in the region, and promote them to other municipal governments. Perhaps most important, they’ll establish a regional data center that can serve as a model for other regions throughout the country.

    Looking Ahead

    Each of the planned regional data centers is a little different. They serve larger or smaller areas, they have different goals and methods, and they’re run by different kinds of organizations. Although it’s possible that one approach will turn out to be the right one, it’s more likely that different types of data centers will be better in different regions. If the regional data centers are open about their methods, and frank about their successes and failures, their approaches may be a good short-term solution to the seemingly insurmountable obstacles to publishing open data that are faced by municipalities.

    In the medium term, states should take over this role, as the cost/technical challenge of running a repository and ETL is driven down. It is the proper role of states to establish the standards for the exchange of data between localities and between the state and localities, and the sooner that they get involved in that, the more control that they’ll have in establishing standards and practices that are to their liking.

    We Must Identify Sustainable Government Data Sources

    12 March 2015 Waldo Jaquith

    To create sustainable sources of open government data, it’s essential that we find, employ, and promote data that one part of government holds, that another part of government needs, and that has measurable financial value to government. This ensures that the data will continue to be shared, making it sustainable in a way that would otherwise be very difficult.

    The response to our December blog entry about open corporate data has been revelatory. Like all states, Virginia registers corporations at a state level. And like many states, most Virginia municipalities charge a business license tax—a tiny percentage of businesses’ revenue, usually with a floor of a nominal fee, like $50/year. Also like many states, Virginia’s list of registered businesses isn’t shared with municipalities, so the municipalities have no way to audit their records, to find out what businesses aren’t registered. We devised a system to provide this data to municipalities, and demonstrated its substantial financial value to localities throughout Virginia. In the three months since, tax collectors, local elected officials, members of the state legislature, political groups, and citizen activists have responded extremely enthusiastically. Now that localities know that this data exists, and that it can generate millions of dollars in income, there is no stopping this data. It will be published so long as Virginia registers businesses and localities charge business license taxes, and will serve as a model for other states.
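
    The details of the Virginia system aren’t described here, but the kind of audit it enables can be sketched in a few lines: compare the state’s registry against a locality’s license rolls and list the businesses that appear in the former but not the latter. The file and column names below are hypothetical placeholders.

```python
# Sketch of the audit described above: find businesses registered with the
# state that do not appear in a locality's business license rolls.
# File names and column names are hypothetical placeholders.
import csv

def load_names(path: str, column: str) -> set[str]:
    """Read one column from a CSV and normalize names for rough matching."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column].strip().upper() for row in csv.DictReader(f)}

state_registry = load_names("state_corporations.csv", "entity_name")
local_licenses = load_names("city_business_licenses.csv", "business_name")

unlicensed = sorted(state_registry - local_licenses)
print(f"{len(unlicensed)} registered businesses have no local business license:")
for name in unlicensed:
    print(" -", name)
```

    A real matching system would need fuzzier name handling (trade names, punctuation, suffixes like “LLC”), but the basic join is this simple.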

    There exists throughout government, at all levels, data that one part of government has, that another part of government needs, but where nobody has yet connected those dots. Identifying these datasets is the most powerful lever to opening government data. It’s important to recognize that government employees are rational actors—this is the rare basis for publishing open data that makes sense for individual government employees to participate in. It’s crucial that we find more of these datasets, study how to make them useful, write whatever minimal code is necessary to transform the data to make it possible for government to consume it, study its resulting value, and then tell that story.

    Finding these datasets is not easy. It requires a deep familiarity with minutiae of government processes combined with a lot of experience working with data. In practice, this necessitates discussion and collaboration with a great many people at all levels of government—empathic, questioning discussion. Over time, it becomes possible to pull together threads from many conversations, to identify how one agency’s little-noted data source can be of crucial benefit to another agency’s mission.

    Of course, a significant benefit of this process is that everybody gets this data, not just the government that needs it. It’s not rational for government employees to publish open data merely because it could be of benefit to the private sector, but by leading with a government-first approach, the private sector gets the data just the same, and sustainably so.

    Such use cases do not exist for every type of data, nor should they have to. But it’s important that there be a large number of use cases like this, applicable at the local, state, and federal levels, to help establish open data as a sensible norm. The infrastructure that governments create to support these use cases, and the experience that they gain from that work, will later serve to facilitate the publication of data that doesn’t have an immediate financial value to government.

    Open data practitioners should work to identify, use, and tell others about government data sources that are of value to other parts of government, so that we can ensure that open data will advance under its own, unstoppable momentum.



Copyright Notices

Text of this site is Copyright © U.S. Open Data under the Creative Commons Attribution 4.0 International (CC-BY) license.

HTML, CSS, images, and design are based on a design forked from PeoplesOpen.Net which was, in turn, forked from Martini’s template, reused here under the MIT License.

Aerial photo of Boston is by Kevin Tostado, reused under a CC BY-NC 2.0 license.