U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    In Praise of CSV

    10 March 2015 Waldo Jaquith

    Comma Separated Values is the file format that open data advocates love to hate. Compared to JSON, CSV is clumsy; compared to XML, CSV is simplistic. Its reputation is as a tired, limited, it’s-better-than-nothing format. Not only is that reputation undeserved, but CSV should often be your first choice when publishing data.

    It’s true—CSV is tired and limited, though superior to not having data, but there’s another side to those coins. One man’s tired is another man’s established. One man’s limited is another man’s focused. And “better than nothing” is, in fact, better than nothing, which is frequently the alternative to producing CSV.

    There’s a lot about CSV that makes it a great format:

    • It’s easy to produce. Many closed datasets exist as a spreadsheet on a government file server somewhere. The existence of File → Save as CSV makes it trivial for that data to be opened as CSV. Rendering it in a more advanced format, on the other hand, is far more difficult.
    • It’s easy to consume. 99% of people have no idea what to do with an XML or JSON file. But CSV files can be read by any spreadsheet software, including dozens of free programs and even browser plugins that render them in place, all of which offer niceties that make it easy to browse, search, and manipulate the data.
    • It’s easy for developers to produce and consume. Any programming language can handle CSV, since it’s just an array of rows. There are no data types to worry about, validation is easy, and there are plenty of libraries that make the work trivial. (A brief example follows this list.)
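
    As a quick illustration of that last point, here’s a minimal sketch using nothing but Python’s standard library; the file name and column names are invented for the example.

    import csv

    # Write a small table to disk; every value is plain text.
    rows = [
        {"agency": "Parks", "dataset": "trails", "records": "412"},
        {"agency": "Transit", "dataset": "stops", "records": "1380"},
    ]
    with open("inventory.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["agency", "dataset", "records"])
        writer.writeheader()
        writer.writerows(rows)

    # Read it back; each row comes out as an ordinary dict of strings.
    with open("inventory.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["agency"], row["dataset"], row["records"])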

    Again, some of these strengths are weaknesses. The simplicity of the file format makes it terrible for rendering complex data, especially nested data. The lack of typing makes schemas generally impractical, and as a result validating field contents is generally impractical too. There are vast swaths of data that can’t reasonably be represented as CSV, because they’re too complex.
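
    To make the nesting problem concrete, here’s a contrived example (the field names are made up): the nested list of contacts has no natural home in a flat row, so the usual workaround is to serialize it into a single cell, which keeps the CSV valid but hides the real structure from the schema.

    import csv
    import json

    record = {
        "agency": "Parks",
        "dataset": "trails",
        "contacts": [  # nested data with no flat-row equivalent
            {"name": "A. Smith", "role": "steward"},
            {"name": "B. Jones", "role": "publisher"},
        ],
    }

    # Workaround: serialize the nested part into one cell as JSON. The file is
    # still valid CSV, but the real structure is now invisible to the columns.
    with open("datasets.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["agency", "dataset", "contacts"])
        writer.writeheader()
        writer.writerow({**record, "contacts": json.dumps(record["contacts"])})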

    Many datasets that can be represented as CSV should be represented as CSV. CSV lowers the barriers to both producing and consuming open data, and it’s crucial that we continue to drive down the minimum viable product for open data. So knock CSV if you must, but please also produce CSV, to make sure that your data can be used widely and easily.

    Dataset Inventorying Tool

    18 February 2015 Waldo Jaquith

    Today we’re releasing Let Me Get That Data For You (LMGTDFY), a free, open source tool that quickly and automatically creates a machine-readable inventory of all the data files found on a given website.

    When government agencies create an open data repository, they need to start by inventorying the data that they’re already publishing on their website. This is a laborious process. It means searching their own site with a query like this:

    site:example.gov filetype:csv OR filetype:xls OR filetype:json
    

    Then they have to read through all of the results, download all of the files, and create a spreadsheet that they can load into their repository. It’s a lot of work, and as a result it too often goes undone, resulting in a data repository that doesn’t actually contain all of that government’s data.

    Realizing that this was a common problem, we hired Silicon Valley Software Group to create a tool to automate the inventorying process. We worked with Dan Schultz and Ted Han, who created a system built on Django and Celery, using Microsoft’s great Bing Search API as its data source. The result is a free, installable tool, which produces a CSV file that lists all CSV, XML, JSON, XLS, XLSX, and Shapefiles found on a given domain name.
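
    LMGTDFY itself is a Django and Celery application; the sketch below is not its code, just a rough illustration of the core idea, written against the current Bing Web Search API using the requests library. The endpoint, key header, and file-type list are assumptions for the example, not details taken from the project.

    import csv
    import requests

    BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"  # assumed v7 endpoint
    API_KEY = "YOUR_BING_KEY"  # placeholder
    FILETYPES = ["csv", "xml", "json", "xls", "xlsx", "shp"]

    def find_files(domain, per_type=50):
        """Ask the search API for files of each type on `domain` and collect the URLs."""
        found = []
        headers = {"Ocp-Apim-Subscription-Key": API_KEY}
        for ext in FILETYPES:
            params = {"q": f"site:{domain} filetype:{ext}", "count": per_type}
            resp = requests.get(BING_ENDPOINT, params=params, headers=headers, timeout=30)
            resp.raise_for_status()
            for page in resp.json().get("webPages", {}).get("value", []):
                found.append({"format": ext, "name": page["name"], "url": page["url"]})
        return found

    def write_inventory(domain, path="inventory.csv"):
        """Write the machine-readable inventory as a CSV file."""
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["format", "name", "url"])
            writer.writeheader()
            writer.writerows(find_files(domain))

    # write_inventory("example.gov")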

    Screenshot of query results

    We use this tool to power our new Let Me Get That Data For You website. We’re trying to keep our site within Bing’s free usage tier, so we’re limiting results to 300 datasets per site. At the moment, for demonstration purposes, we’re permitting searches of .gov, .com, .net, and .org sites, but we’ll reduce that to only government domain names in a few days—again, just to minimize our own API costs.

    You’re welcome to try it out! Pull requests and issues are welcome, of course.

    Announcing the CKAN Multisite Project

    18 February 2015 Waldo Jaquith

    Today, we’re pleased to announce CKAN Multisite, a project that will make it trivial to create a new CKAN-based data repository by making it easier to host CKAN sites. We’re going to drive down the cost of hosting open data, allowing states to provide open data hosting for their cities and counties, enabling commercial hosting companies to sell low-priced data hosting, and facilitating innovation in a space that is now out of reach of many organizations and governments. The minimum viable product for an open repository is not minimal enough—we think CKAN Multisite will get it there.

    To make CKAN Multisite a reality, last month we published a pair of RFPs to improve CKAN—one to add full data export to CKAN, one to create a Docker-based, multisite CKAN management tool—and last week we selected the winning bids from a series of great proposals. We’ve awarded both contracts to boxkite, an Ottawa-based software development shop, so now they’re taking on the entire project.

    boxkite is run by Denis Zgonjanian and Ian Ward. Denis was on the team that extended CKAN to serve as the data repository for Canada, and Ian serves as the technical lead on CKAN. So they’re deeply qualified to do the work. But the biggest reason that we awarded the contracts to them is that they’re already doing some great work in this area. They’ve done a lot of work to Dockerize CKAN—that is, to make it trivial to deploy and run many sites on a single server. So now boxkite will build a management and deployment tool atop the great work they’ve already done, and add site export functionality (allowing people to leave their small CKAN site for a larger one, or to transition to other platforms, such as ArcGIS Open Data, DKAN, Junar, OpenDataSoft, Socrata, etc.).

    The plan for this project was put together by the Open Knowledge Foundation, the indispensable organization that’s behind CKAN and so many other brilliant open data projects. (Their plan is far-reaching, beyond what we can do now—so anybody looking to make improvements to CKAN would do well to start there.)

    Our development timeline has this scheduled for a late-spring v1.0 release. Of course, all work will be done in the open, via GitHub, and all code will be published under an open source license. We’re excited about the work, grateful to have boxkite as partners in this effort, and extremely enthusiastic about what people are going to do with the resulting software.

    CKAN Multisite RFPs

    07 January 2015 Waldo Jaquith

    We’ve published a pair of RFPs for our CKAN Multisite project, and we’re looking for bids on them. The goal of this project is to make it trivial to launch and manage data repositories. We’re doing that by containerizing CKAN and creating a deployment/management platform for those containers. Or, rather, we want to pay somebody (you?) to do that.

    First, there’s the CKAN Content Export RFP. This is to address CKAN’s lack of a comprehensive export function, making it difficult to move data out of CKAN or between CKAN repositories. The solution to this will involve marrying CKAN’s metadata export functionality with its inherent data-hosting functionality to create an all-encompassing export function. Experience with CKAN is not required to accomplish this. Bids are due by Thursday, January 15.
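
    To give a sense of what that export function needs to do, here’s a rough sketch built on the ckanapi Python client; whether a winning bid would use that library is an open question, and the directory layout is just illustrative. It pulls each dataset’s metadata and downloads the resources the metadata links to.

    import json
    import pathlib

    import requests
    from ckanapi import RemoteCKAN  # pip install ckanapi

    def export_site(site_url, out_dir="ckan-export"):
        """Dump every dataset's metadata as JSON and download its linked resources."""
        out = pathlib.Path(out_dir)
        out.mkdir(exist_ok=True)
        ckan = RemoteCKAN(site_url)
        for name in ckan.action.package_list():
            pkg = ckan.action.package_show(id=name)
            pkg_dir = out / name
            pkg_dir.mkdir(exist_ok=True)
            (pkg_dir / "metadata.json").write_text(json.dumps(pkg, indent=2))
            for res in pkg.get("resources", []):
                url = res.get("url")
                if not url:
                    continue
                filename = url.rsplit("/", 1)[-1] or res["id"]
                r = requests.get(url, timeout=60)
                if r.ok:
                    (pkg_dir / filename).write_bytes(r.content)

    # export_site("https://demo.ckan.org")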

    Second, there’s the CKAN Docker Management Platform. This is to make it possible to deploy and manage CKAN instances, where each instance runs within its own Docker container. Realistically, this will probably involve wrapping a Docker management tool (Ansible, Project Atomic, Citadel, Panamax, Decking, Shipyard, etc.) in a site development framework (e.g., Django) to provide authentication, user management, etc. Some CKAN experience will probably help, but it isn’t strictly required. Bids are due by January 23.
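
    As a rough illustration of the container-management half (a sketch, not a proposal), the Python Docker SDK can launch one CKAN container per site; the image tag, internal port, and environment variable below are assumptions, and a real platform would wrap something like this in the Django layer described above to handle authentication and user management.

    import docker  # pip install docker

    client = docker.from_env()

    def launch_ckan_site(site_name, host_port, image="ckan/ckan"):
        """Start one CKAN instance in its own container, mapped to a unique host port.

        The image tag, internal port (5000), and CKAN_SITE_URL variable are
        illustrative assumptions, not details from the RFP.
        """
        return client.containers.run(
            image,
            detach=True,
            name=f"ckan-{site_name}",
            ports={"5000/tcp": host_port},
            environment={"CKAN_SITE_URL": f"http://localhost:{host_port}"},
            restart_policy={"Name": "unless-stopped"},
        )

    def list_ckan_sites():
        """List the CKAN containers started by this sketch."""
        return [c.name for c in client.containers.list() if c.name.startswith("ckan-")]

    # launch_ckan_site("cityofexample", 8081)
    # launch_ckan_site("countyofexample", 8082)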

    Bids need not be terribly formal—you just need to convince us that you are the right person or business for the job, that your work is worth your price, and that you can see a path to completion for the project. Contact us if you have any questions, need any clarification, or have any ideas.

    2014 Open Data Pioneer Award

    17 December 2014 Waldo Jaquith

    We’re pleased to announce that the inaugural recipient of our Open Data Pioneer award is V. David Zvenyach.

    Portrait of V. David Zvenyach

    The general counsel for the Washington D.C. Council, Dave is the very ideal of an open data advocate within government. An attorney by training, Dave has learned how to code over the past couple of years, and has become a key figure in the world of open legal data and in Washington D.C.’s open data and civic data communities.

    The story of how Dave came to open data is illustrative. It was just last year that Tom MacWright was frustrated that Washington D.C. made it impossible to get or create an electronic copy of its laws, and a bunch of us made a big, public fuss about how totally unreasonable it was for Washington D.C. to claim copyright over its laws (thus making it illegal for anybody to reproduce them). It was clear that some civil disobedience was in order, and perhaps a lawsuit, but that in the end we’d prevail and the laws would be opened. But instead of having a public brawl, one V. David Zvenyach stepped in. Tom put on a D.C. Code hackathon one Sunday morning, and Dave showed up (again, on a Sunday morning) to announce that a) the D.C. Code was now released under a Creative Commons Zero license and b) he’d be sticking around for the hackathon and maybe learning to write a little code.

    Twenty months later, “learning to write a little code” isn’t really how it turned out. He learned Python. And Node. And Ruby. And Dave dabbles fearlessly in anything else he might need to know. He started creating new services just for fun, becoming comfortable with Heroku and AWS. All while continuing his normal workload (judging by his Git commits, he’s done a lot of that work on nights and weekends).

    • To address the problem of the Supreme Court occasionally editing decisions post-publication, he put together a Node project called SCOTUS Servo, which watches for changes and, when it finds them, tweets out an alert with the changes highlighted in before-and-after graphics.
    • To deal with the D.C. Council’s confusing data-transmission legal obligations under the Home Rule Act, he created Effdate, which uses Congressional meeting data to calculate the effective date of D.C. legislation.
    • To make it easier for attorneys to produce open, structured legal text, he created LegalMarkdownJS, a browser-based editor for legal text.
    • Not content with just educating himself, he wrote a book called Coding for Lawyers, to teach other attorneys what he’s learned.
    • To expand the capabilities of his own organization, he established the D.C. Free Law Innovation Fellowship to hire a developer to work hand-in-hand with D.C. government attorneys to make government data more accessible (personally raising the money to fund the position).

    That’s just a sample—a list of his projects is available on his website. All of these software projects are done right—they’re committed to GitHub from the first line of code, they consume their own APIs, they’re published under open licenses.

    Beyond his coding, Dave does a great deal of behind-the-scenes work to advance the cause of open data, both within D.C. government and federal government. He understands how government works, he understands the needs and desires of the private-sector open data world, and his work as a bridge between the two is vital.

    There are a lot of people, all across the country, doing brilliant, vital work in the realm of open data. But, in 2014, nobody has done more pioneering work, or established a better model for a bureaucrat-coder, than Dave Zvenyach.

    In recognition of Dave’s work, we’re sending him a handsomely engraved Jefferson cup and a $250 credit on Amazon Web Services, his platform of choice. (We tried to buy that credit from Amazon, but it turns out that it’s impossible. We’re grateful to Ariel Gold of the AWS Open Data team and Amazon for solving this problem by simply giving Dave a $250 AWS credit!)


