DCU Library | Digital Humanities Workshops | Intro to Data Cleaning with OpenRefine

Intro to Data Cleaning with OpenRefine

OpenRefineLogo

What is OpenRefine?

A power tool for working with messy data

  • More powerful than a spreadsheet
  • More interactive and visual than scripting
  • More experimental/playful than a database

David Huynh, OpenRefine creator

OpenRefine is a free, open-source Java application. You can download it from http://openrefine.org/download.html. There is more information on install options on the setup page.

Using OpenRefine enables you to:

  1. detect instances of inconsistent or messy data
  2. take steps to address those issues
  3. enhance your data by linking to external sources

Or you might not have any clear idea in advance of what you want or need to do. OpenRefine is very useful there too, for exploring a dataset and identify what might be useful next steps.

What is messy data? And why should you clean it?

Data that needs cleaning is often called messy data. It is data that is inconsistent or poorly structured in some way. This could be in relation to how it is formatted, where data appears, or in terminology used.

Typically you will look to clean your data because you’re going to do something else with it. For example, to visualise it or present it in some way, or to analyse or draw conclusions from it. Cleaning your data will help ensure data visualisation or data analysis is accurate and reliable. It is thus an important process for researchers to be aware of when using data in their research.

Some examples of messy data

Dates may be in different formats:

Data you have Desired Data
1st January 2014 2014-01-01
01/01/2014 2014-01-01
Jan 1 2014 2014-01-01
2014-01-01 2014-01-01

Names of people or places may be misspelt or not capitalised:

Data you have Desired Data
London London
London] London
London,] London
london London

Information bundled together that you might want in a more granular layout. For example, you may want to get from address data like this:

Address
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE
University of Birmingham, Barnes Library, Medical School, BIRMINGHAM, West Midlands, B15 2TT
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL

to something more like this:

Institution Library Address 1 Town/City Region Postcode
University of Wales Llyfrgell Thomas Parry Library Llanbadarn Fawr Aberystwyth Ceredigion SY23 3AS
University of Aberdeen Queen Mother Library Meston Walk Aberdeen   AB24 3UE
University of Birmingham Barnes Library Medical School Birmingham West Midlands B15 2TT
University of Warwick Library Gibbett Hill Road Coventry   CV4 7AL

Tidy Data

The last example above where data is restructured relates to tidy data. This can be thought of as a subset of clean data, more concerned with structuring data correctly. Hadley Wickham has written a really good introduction to this topic. Tidy Data article screengrab

“Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:

  • *each variable is a column, *
  • *each observation is a row, and *
  • each type of observational unit is a table.” Hadley Wickham, Tidy Data

In the same article, Wickham writes

““It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected.”

Data Types

OpenRefine accepts data in many different formats - excel, csv, xml, json etc. - but in structure it is tabular. And when we talk about messy data we are talking about values in cells in those tables. Blank cells, duplicates, errors. Data types also relates to how OpenRefine treats numbers, strings etc. - we’ll look more at that later.

Enhancing your Data

As well as cleaning messy data, OpenRefine can be used to enhance your dataset. For example, if you have a list of names of people which can be linked to records in an external authority, related data can then be pulled in from that data source.

Data you have Date of Birth from VIAF (Virtual Int. Authority File) Date of Death from VIAF
Braddon, M. E. (Mary Elizabeth) 1835 1915
Rossetti, William Michael 1829 1919
Prest, Thomas Peckett 1810 1879

Other reasons to use OpenRefine


💡 OpenRefine helps you…

Clean - Find and fix inconsistency with faceting, clustering, cell transforms

Transform - change formats, restructure, split/join multi-valued cells, split columns, transpose columns/rows

Extend - enrich data by combining files, merging projects, fetching URLs, reconciliation with online databases…

Automate - reuse your processing routine by exporting operation history in JSON!

Back
Next