A power tool for working with messy data
- More powerful than a spreadsheet
- More interactive and visual than scripting
- More experimental/playful than a database
David Huynh, OpenRefine creator
OpenRefine is a free, open-source Java application. You can download it from http://openrefine.org/download.html. There is more information on install options on the setup page.
Using OpenRefine enables you to:
Or you might not have any clear idea in advance of what you want or need to do. OpenRefine is very useful there too, for exploring a dataset and identify what might be useful next steps.
Data that needs cleaning is often called messy data. It is data that is inconsistent or poorly structured in some way. This could be in relation to how it is formatted, where data appears, or in terminology used.
Typically you will look to clean your data because you’re going to do something else with it. For example, to visualise it or present it in some way, or to analyse or draw conclusions from it. Cleaning your data will help ensure data visualisation or data analysis is accurate and reliable. It is thus an important process for researchers to be aware of when using data in their research.
Dates may be in different formats:
Data you have | Desired Data |
---|---|
1st January 2014 | 2014-01-01 |
01/01/2014 | 2014-01-01 |
Jan 1 2014 | 2014-01-01 |
2014-01-01 | 2014-01-01 |
Names of people or places may be misspelt or not capitalised:
Data you have | Desired Data |
---|---|
London | London |
London] | London |
London,] | London |
london | London |
Information bundled together that you might want in a more granular layout. For example, you may want to get from address data like this:
Address |
---|
University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS |
University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE |
University of Birmingham, Barnes Library, Medical School, BIRMINGHAM, West Midlands, B15 2TT |
University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL |
to something more like this:
Institution | Library | Address 1 | Town/City | Region | Postcode |
---|---|---|---|---|---|
University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | Aberystwyth | Ceredigion | SY23 3AS |
University of Aberdeen | Queen Mother Library | Meston Walk | Aberdeen | AB24 3UE | |
University of Birmingham | Barnes Library | Medical School | Birmingham | West Midlands | B15 2TT |
University of Warwick | Library | Gibbett Hill Road | Coventry | CV4 7AL |
The last example above where data is restructured relates to tidy data. This can be thought of as a subset of clean data, more concerned with structuring data correctly. Hadley Wickham has written a really good introduction to this topic.
“Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
- *each variable is a column, *
- *each observation is a row, and *
- each type of observational unit is a table.” Hadley Wickham, Tidy Data
In the same article, Wickham writes
““It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data preparation is not just a first step, but must be repeated many times over the course of analysis as new problems come to light or new data is collected.”
OpenRefine accepts data in many different formats - excel, csv, xml, json etc. - but in structure it is tabular. And when we talk about messy data we are talking about values in cells in those tables. Blank cells, duplicates, errors. Data types also relates to how OpenRefine treats numbers, strings etc. - we’ll look more at that later.
As well as cleaning messy data, OpenRefine can be used to enhance your dataset. For example, if you have a list of names of people which can be linked to records in an external authority, related data can then be pulled in from that data source.
Data you have | Date of Birth from VIAF (Virtual Int. Authority File) | Date of Death from VIAF |
---|---|---|
Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 |
Rossetti, William Michael | 1829 | 1919 |
Prest, Thomas Peckett | 1810 | 1879 |
✅ Clean - Find and fix inconsistency with faceting, clustering, cell transforms
✅ Transform - change formats, restructure, split/join multi-valued cells, split columns, transpose columns/rows
✅ Extend - enrich data by combining files, merging projects, fetching URLs, reconciliation with online databases…
✅ Automate - reuse your processing routine by exporting operation history in JSON!