U.S. Open Data concluded in July 2016. This is an archival website.

U.S. Open Data

Building the capacity of open data.

    The Legal Synonyms Project: A Decent Start

    10 October 2014 Casey Kuhlman

    When Waldo first asked me to look into making a synonyms file based on the context of how words are situated within a large legal corpus, I thought, “Well, that should not be a problem. It seems a common situation to overcome, so I am sure that there is an open source library out there. I’ll just take the library, run some legal codes against it, and Bob’s your uncle.” It turns out it was not that easy. The idea we were aiming for was fairly simple: take a large body of text, break it down into words, and compare those words to find where they have been used in similar contexts, because words that appear in similar contexts are probably synonyms of one another.

    I began my research into the preparation of synonyms from a large corpus of documents where I would normally begin such a project: by searching GitHub. Unlike on some other occasions, this research proved surprisingly difficult, for a couple of reasons. The first was that not much open source code existed that was meant to deal with our problem set. The second was that researching “synonyms” or “legal synonyms” is a bit tricky, because such searches have a distinctly low signal-to-noise ratio. It eventually became clear that there was not much publicly available code to build on.

    Once I hit a wall finding libraries that would assist my building process, I turned to academia. There is a lot of information on computational linguistics, synonym theory, and the like in journals. The challenge for me was that I’m a lawyer, not a linguist, and certainly not a linguist at the level required to fully comprehend the academic papers. That said, I was mostly able to piece together the process for creating a synonyms file. In general, from the academic research I read, there appear to be three phases to developing a synonyms file.

    The first phase seeks to break the corpus into pieces and categorize the words in each of those pieces. It was important, based on our design idea, that the synonyms file be built from a large corpus (e.g., a state code or the U.S. Code, which was the basis for my testing), so this required a large amount of natural language processing capability: first to break a large text file down into its constituent pieces, and then to categorize those pieces. There was a difference of opinion as to how to break the pieces down, but categorizing them was fairly straightforward and just required a natural language processor (which is sort of like a computerized sentence diagrammer) to determine what part of speech each word was being used as.
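    To make that first phase concrete, here is a minimal sketch of breaking a corpus into sentences and tagging each word with its part of speech. This is not the project's actual code; it assumes Python and the NLTK toolkit (the post does not name the tools used), purely for illustration.

        # Phase one sketch: split a corpus into sentences and words, then tag
        # each word with its part of speech. NLTK is an assumed toolkit here.
        import nltk

        nltk.download("punkt", quiet=True)
        nltk.download("averaged_perceptron_tagger", quiet=True)

        def tag_corpus(text):
            """Return a list of sentences, each a list of (word, POS) pairs."""
            tagged = []
            for sentence in nltk.sent_tokenize(text):
                tokens = nltk.word_tokenize(sentence)
                tagged.append(nltk.pos_tag(tokens))
            return tagged

        # A tiny stand-in for a state code or the U.S. Code.
        corpus = "The tenant shall remit payment. The lessee shall remit rent."
        for sentence in tag_corpus(corpus):
            print(sentence)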

    The second phase in developing a synonyms file is to compare each of the categorized pieces of the overall text to find a similarity index between different words. This step in the process is, mathematically, where there appears to be the most debate within the academic research that I read. There appear to be many different algorithms for deriving the similarity index, which is simply a number that compares two words and the contexts in which they appear. When two words appear in similar contexts numerous times, they will have a high similarity index, and vice versa.
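    As an illustration of what a similarity index might look like, the following toy sketch builds a context vector for each word from co-occurrence counts within a small window and scores pairs of words with cosine similarity. The literature describes many more sophisticated measures; this is only one simple choice, continuing the hypothetical Python example above.

        # Phase two sketch: context vectors from windowed co-occurrence counts,
        # compared with cosine similarity. Illustrative only.
        import math
        from collections import Counter, defaultdict

        def context_vectors(sentences, window=2):
            """Map each word to a Counter of the words that appear near it."""
            vectors = defaultdict(Counter)
            for tokens in sentences:
                for i, word in enumerate(tokens):
                    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                    for j in range(lo, hi):
                        if j != i:
                            vectors[word][tokens[j]] += 1
            return vectors

        def similarity_index(a, b, vectors):
            """Cosine similarity between two words' context vectors (0.0 to 1.0)."""
            va, vb = vectors[a], vectors[b]
            dot = sum(va[w] * vb[w] for w in va)
            norm = (math.sqrt(sum(c * c for c in va.values())) *
                    math.sqrt(sum(c * c for c in vb.values())))
            return dot / norm if norm else 0.0

        sentences = [["the", "tenant", "shall", "remit", "payment"],
                     ["the", "lessee", "shall", "remit", "rent"]]
        vectors = context_vectors(sentences)
        print(similarity_index("tenant", "lessee", vectors))  # prints 1.0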

    The third phase in developing a synonyms file is simply to pull everything out and finalize it. This was, for obvious reasons, the easy bit.
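    Continuing the hypothetical sketch above, that final phase amounts to little more than sorting each word's neighbors by their scores and writing the results out to a file:

        # Phase three sketch: write each word's highest-scoring neighbors to a
        # plain-text synonyms file. Uses context_vectors and similarity_index
        # from the example above; the threshold and format are arbitrary.
        def write_synonyms_file(vectors, path, top_n=5, threshold=0.5):
            words = list(vectors)
            with open(path, "w") as out:
                for word in words:
                    scored = [(similarity_index(word, other, vectors), other)
                              for other in words if other != word]
                    best = [w for score, w in sorted(scored, reverse=True)
                            if score >= threshold][:top_n]
                    if best:
                        out.write("%s: %s\n" % (word, ", ".join(best)))

        write_synonyms_file(vectors, "synonyms.txt")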

    We have developed a system which begins to automate this process. It is far from perfect, but it is a decent start. The README for the repository contains some notes on the script’s current shortcomings. They are mainly matters of optimization, along with the need for someone more versed in computational linguistics than we are to take a look at how the similarity index is derived.

    So we welcome the community to dig in. The base is built, and we’re hopeful that the community can use this script and help make it better.

    An Experiment in Government Innovation: Legal Hackers in Residence

    18 September 2014

    The following is a guest post written by V. David Zvenyach, General Counsel for the Washington D.C. Council, and is cross-posted from their website. We’re excited about Zvenyach’s new program, as it promises to be an important new method of facilitating open data within government. –Waldo Jaquith, US ODI Director

    In the last two years, my office has been working with civic hackers—Code for DC, DC Legal Hackers, and others—and organizations like US ODI and the OpenGov Foundation to build open-source tools to make DC’s law more accessible. Increasingly, we are working to expand our efforts and join with partner cities through the Free Law Founders. I’ve even learned to code in Python and Node, and I am learning Ruby.

    In the last year, though, I had a realization: it is time to bring the hackers in-house. It is time for government lawyers to embrace innovation and to retool our practice for the digital age. We need to stop using bad tools—or worse, paying overpriced vendors for proprietary software that works some of the time—and instead work with developers. In turn, instead of building one-off applications and fighting with legal teams, developers can be partners in making change.

    And so, today, after months of working to refine the concept, I am delighted to announce the establishment of the Free Law Innovation Fellow.

    The Innovation Fellow is as much an experiment as anything else: can developers, working in the open and side-by-side with practicing lawyers, make government work better and more accessible? Can lawyers learn to improve their practices by observing and integrating collaborative methods and tools that developers have mastered over the years? And can we do this in a sustainable way, engaging fully with the open-source community?

    My theory is that developers can help make government data more accessible—through proactive disclosure of datasets and through smarter internal policies. My theory is that developers can build tools to help government lawyers work more effectively and with a greater focus on serving the public interest.

    Let me be blunt: if my theory is right, the Innovation Fellow can be a complete game-changer for the public sector. But, for this to work, we need the right person for the job. We need a civic-hacker-in-residence who has the ability to understand a problem, ship code, and to get the job done. Most importantly, we need a person who wants to make a difference—in our nation’s capital and beyond.

    If you are that person, please consider applying to be the first ever Free Law Innovation Fellow. I look forward to working with you.

    Open Data in CT, Done Differently

    04 September 2014 Christopher Whitaker

    Map of the state of Connecticut showing Indian trails, villages and sachemdoms; Photo by University of Connecticut

    The Connecticut Data Collaborative is the organization that opened data in Connecticut. How they did it is extraordinary.

    They’re not a government agency executing an open data executive order, and they’re not a private organization foisting open data on an unreceptive government. Instead, the Connecticut Data Collaborative is a public-private effort created to advance effective planning and decision-making in Connecticut at the state, regional, and local levels, through the use of open and accessible data. It’s a project of the New Connecticut Foundation, a nonprofit organization affiliated with the Connecticut Economic Resource Center.

    Starting Out

    Several years back, a grassroots coalition of early childhood education and policy experts came together to improve the availability of open data for making informed policy decisions, and hit upon creating a new organization to open up that data and more. After receiving startup funding from the William C. Graustein Memorial Fund in 2009, they received funding from the Connecticut Health and Educational Facilities Authority to expand their data offerings to include all policy areas, build ctdata.org, and work in collaboration with the Urban Institute’s National Center for Charitable Statistics to build a community platform, the Connecticut Nonprofit Strategy Platform. The coalition has received major funding for its work from the Connecticut Health and Educational Facilities Authority and the State of Connecticut, with additional funding provided by a group of private foundations and state agencies.

    The Collaborative launched their ctdata.org site during a special meeting held in front of the Connecticut General Assembly’s appropriations committee. The legislature had never seen data in such an interactive, visual format before, with the ability to look at a map or scatterplot and compare various towns.

    Collaborative Executive Director Michelle Riordan-Nold says that the visualizations were much more powerful than raw numbers. Each legislator immediately looked at his own hometown, to see how it ranked in the presented dataset of third grade reading scores. The legislature saw value in the program and gave the group two years of funding. Getting the website launched, and the data curated and loaded, took about a year and a half, with the time lengthened because of the nascent organization’s lack of full-time staff.

    This spring, they hired Riordan-Nold as the collaborative’s first executive director, though she’s been involved with the collaborative since 2012. (She’d previously worked at the Connecticut Economic Resource Center.)

    Getting the Collaborative Rolling

    The Connecticut Data Collaborative still struggles with getting data out of cash-strapped agencies. Some of the founders of the collaborative were involved in particular policy areas and had relationships with state agencies, working with the agencies to help free up their data.

    Helpfully, this spring the governor launched an open data portal, and this has enabled the Collaborative to put greater attention on enhancing the visualization tools and telling the data stories, as opposed to working solely on access to data.

    Current Projects

    The Connecticut Data Collaborative is redesigning their website. Their current website uses a visualization software tool called Weave to display data. Riordan-Nold points out that although Weave provides sophisticated visualizations, it requires Java, and its analysis can be too complex for the everyday user. Since Riordan-Nold started, she has advocated for providing user-friendly interfaces and improved visualization tools. Because 80% of the Collaborative’s website visitors are laypeople, they need tools that are easy to use. (Although Weave is at least open source.) The collaborative hopes to launch the new site this month.

    Another project that the Collaborative is working on involves a partnership with the Institute for Municipal and Regional Policy at Central Connecticut State University. IMRP received a grant to collect all traffic-stop data as part of the Racial Profiling Prohibition Project and will analyze the data to determine if racial profiling is going on. The Collaborative will make the data available on their website.

    To help people interpret the provided raw data, the Collaborative will provide interactive visualizations. The state’s police departments also need a one-to-two-page printout of data analysis, which the Connecticut Data Collaborative will provide as well.

    One of the unique things about the Collaborative is that it plays host to both government and non-profit data. It seeks to be the enterprise solution for private organizations and state agencies opening up their data.

    “When you think about open data, people sometimes forget about the policy aspect of it. People think about [using data when] starting a business or government transparency—but I would say a lot of our data is more for public policy research. We’re a little different in that sense,” says Riordan-Nold.

    Find out more about the Connecticut Data Collaborative on their website.

    Casey Kuhlman Has Joined US ODI

    21 August 2014 Waldo Jaquith

    I’m happy to announce our newest addition here at the U.S. Open Data Institute: Casey Kuhlman. He’s our newest open data evangelist, and will be doing a bit of everything here—writing software, assisting our partners in government and the private sector, giving talks, working with vendors, and helping to chart our collective path to a better open data ecosystem.

    Photo of Casey

    Casey is a graduate of Vanderbilt University Law School in the United States and is a member of the Tennessee Bar Association. He was the founder of Watershed Legal Services, a full-service law firm headquartered in Hargeisa, Somaliland. His practice focused on advising foreign entities in compliance areas. Casey was also an infantry officer in the United States Marine Corps. His experiences in the Marines led him to write a New York Times best-selling book, Shooter, with Jack Coughlin. After going to Vanderbilt, Casey worked with the prosecutor’s office of the Special Court for Sierra Leone, where he worked on The Prosecutor v. Charles Taylor. He also worked with the Public International Law & Policy Group as the Chief of Party for its Somaliland Project, which focused on developing the capacity of the Somaliland Parliament, as well as providing expert international legal advice to the Parliament. An avid software engineer and free and open source advocate, Casey has worked on open source projects at the intersection of legal practice and legislative development. In his free time, Casey has been working on systems of smart contracts with a particular focus on automating regulatory compliance in contractual instruments.

    He has a rare collection of unrelated experiences: working with government, developing software, and working with open data. Casey’s background is anything but traditional for somebody in the open data sector, which is exactly what we’d hoped for.

    Follow Casey on Twitter or on GitHub to keep up with his work.

    Announcing the dat Stable Alpha

    19 August 2014 Max Ogden

    The following is an announcement from the dat project, reposted here.

    The first code went into dat one year ago, on August 17th, 2013. Today, after a year of work, we are really excited to release the first major version of dat along with a new website.

    Our overall goal with dat is to make a set of tools for creating and sharing streaming data pipelines: a sort of ETL-style system, but designed from the ground up to be developer-friendly, open source, and streaming. We are aligned with the goals of the frictionless data initiative and see dat as an important tool for sharing data wrangling, munging, and clean-up code so that data consumers can simply dat clone to get good data.

    The first six months of dat development were spent making a prototype (thanks to the Knight Foundation Prototype Fund). In April of this year we were able to expand the team working on dat from one person to three, thanks to support from the Sloan Foundation. At that time dat also became an official US Open Data Institute project, to ensure that open data remains a top priority going forward.

    Sloan’s proposition was that they liked the initial dat prototype but wanted to see scientific data use cases treated as a top priority. As a result we expanded the scope of the project from its tabular-data-specific beginnings and have focused on adding features that will help us work with larger scientific datasets.

    Up until this point, the dat API has been in flux, as we were constantly iterating on it. From this point forward we will be taking backwards compatibility much more seriously, so that third party developers can feel confident building on top of dat.

    How to get involved

    Try it out

    You can install dat today and play around with it by importing or cloning a dataset.

    You can also deploy a dat to Heroku for free, for testing purposes (but be aware of the Heroku ephemeral filesystem limitations).


    The dat REST API comes bundled with the dat-editor web application.

    dat editor

    To start learning about how to use dat please read our getting started guide.

    To help you choose an approach to loading data into dat we have created a data importing guide.

    Write a module or 5

    The benefit of dat isn’t in the dat module itself, but rather in the ecosystem that can be built around it.

    There are a lot of modules that we think would be really awesome to have, and we started a wishlist here. If you see something you are interested in building, please leave a comment on that thread stating your intent. Similarly, if there is a format or storage backend that you would like to see dat support, leave it in the comments.

    Pilot users

    This release of dat represents our efforts to get it to a point where we can start working with scientists on modeling their data workflows with dat. We will now be starting concrete work on these pilot use cases.

    If you have a use case in mind and you want to bounce it off of us, please open an issue on the maxogden/dat repository with a detailed description.

    While we don’t have many details to share today about these pilots, we hope to change that over the next few months.

    Bionode (Bioinformatics – DNA)

    bionode logo

    Dat core team member @bmpvieira, a bioinformatics PhD student at Queen Mary University of London, is working on applying dat to working with various DNA-analysis-related datasets.

    Bruno runs the Bionode project. We will be working on integrating Bionode with dat workflows to solve common problems in DNA bioinformatics research.

    RNA-Seq (Bioinformatics – RNA)

    Two researchers from UC-San Diego reached out to us recently and have started explaining their use case here and here. We hope to use dat to make their data management problems go away.

    Sloan Digital Sky Survey (Astronomy)

    sdss

    We will be working with the SDSS project to share their large scans of the visible universe, and eventually to connect their data with sky survey data from other organizations.

    The future of dat

    This release is the first step towards our goal of creating a streaming interface between every database or file storage backend in the world. We are trying to solve hard problems the right way. This is a process that takes a lot of time.

    In the future we would also like to work on a way to easily host and share datasets online. We envision a sort of data package registry, similar to npmjs.org, but designed with datasets in mind. This kind of project could also eventually turn into a sort of “GitHub for data”.

    We also want to hook dat up to P2P networks, so that we can make downloads faster but also so that datasets become more permanent. Dat advisor Juan Benet is now working on IPFS, which we are excited to hook up to dat when it is ready.

    Certain datasets are simply too large to share, so we also expect to work on a distributed computation layer on top of dat in the future (similar to the ESGF project).

    You can help us discuss these high level future ideas on this issue.

    To keep up to date with the dat project you can follow @dat_project on Twitter or watch the repo on GitHub.



Copyright Notices

Text of this site is Copyright © U.S. Open Data under the Creative Commons Attribution 4.0 International (CC-BY) license.

HTML, CSS, images, and design are based on a design forked from PeoplesOpen.Net which was, in turn, forked from Martini’s template, reused here under the MIT License.

Aerial photo of Boston is by Kevin Tostado, reused under a CC-A-NC 2.0 license.