Intro to Data Cleaning with OpenRefine

OpenRefineLogo

What is OpenRefine?

A power tool for working with messy data

More powerful than a spreadsheet

More interactive and visual than scripting

More experimental/playful than a database

David Huynh, OpenRefine creator

OpenRefine is a free, open-source Java application. You can download it from http://openrefine.org/download.html. There is more information on install options on the setup page.

Using OpenRefine enables you to:

detect instances of inconsistent or messy data
take steps to address those issues
enhance your data by linking to external sources

Or you might not have any clear idea in advance of what you want or need to do. OpenRefine is very useful there too, for exploring a dataset and identify what might be useful next steps.

What is messy data? And why should you clean it?

Data that needs cleaning is often called messy data. It is data that is inconsistent or poorly structured in some way. This could be in relation to how it is formatted, where data appears, or in terminology used.

Typically you will look to clean your data because you’re going to do something else with it. For example, to visualise it or present it in some way, or to analyse or draw conclusions from it. Cleaning your data will help ensure data visualisation or data analysis is accurate and reliable. It is thus an important process for researchers to be aware of when using data in their research.

Some examples of messy data

Dates may be in different formats:

Data you have	Desired Data
1st January 2014	2014-01-01
01/01/2014	2014-01-01
Jan 1 2014	2014-01-01
2014-01-01	2014-01-01

Names of people or places may be misspelt or not capitalised:

Data you have	Desired Data
London	London
London]	London
London,]	London
london	London

Information bundled together that you might want in a more granular layout. For example, you may want to get from address data like this:

Address
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE
University of Birmingham, Barnes Library, Medical School, BIRMINGHAM, West Midlands, B15 2TT
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL

to something more like this:

Institution	Library	Address 1	Town/City	Region	Postcode
University of Wales	Llyfrgell Thomas Parry Library	Llanbadarn Fawr	Aberystwyth	Ceredigion	SY23 3AS
University of Aberdeen	Queen Mother Library	Meston Walk	Aberdeen		AB24 3UE
University of Birmingham	Barnes Library	Medical School	Birmingham	West Midlands	B15 2TT
University of Warwick	Library	Gibbett Hill Road	Coventry		CV4 7AL

Tidy Data

The last example above where data is restructured relates to tidy data. This can be thought of as a subset of clean data, more concerned with structuring data correctly. Hadley Wickham has written a really good introduction to this topic.

“Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:

*each variable is a column, *

*each observation is a row, and *

each type of observational unit is a table.” Hadley Wickham, Tidy Data

In the same article, Wickham writes

““It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected.”

Data Types

OpenRefine accepts data in many different formats - excel, csv, xml, json etc. - but in structure it is tabular. And when we talk about messy data we are talking about values in cells in those tables. Blank cells, duplicates, errors. Data types also relates to how OpenRefine treats numbers, strings etc. - we’ll look more at that later.

Enhancing your Data

As well as cleaning messy data, OpenRefine can be used to enhance your dataset. For example, if you have a list of names of people which can be linked to records in an external authority, related data can then be pulled in from that data source.

Data you have	Date of Birth from VIAF (Virtual Int. Authority File)	Date of Death from VIAF
Braddon, M. E. (Mary Elizabeth)	1835	1915
Rossetti, William Michael	1829	1919
Prest, Thomas Peckett	1810	1879

Other reasons to use OpenRefine

It is important to know what you did to your data. Journals and granting agencies are increasingly requiring documentation of this. With OpenRefine, you can easily capture all of this and share it as supplemental material.
All actions are easily reversed in OpenRefine.
OpenRefine always uses a copy of your data and does not modify your original dataset.
OpenRefine always keeps your data private on your own computer until you choose to share it. It works by running a small server on your computer and using your web browser to interact with it, but your private data never leaves your computer unless you want it to.

💡 OpenRefine helps you…

✅ Clean - Find and fix inconsistency with faceting, clustering, cell transforms

✅ Transform - change formats, restructure, split/join multi-valued cells, split columns, transpose columns/rows

✅ Extend - enrich data by combining files, merging projects, fetching URLs, reconciliation with online databases…

✅ Automate - reuse your processing routine by exporting operation history in JSON!