DCU Library | Digital Humanities Workshops | Intro to Data Cleaning with OpenRefine

Getting Started

Importing data

What kinds of data files can I import?

OpenRefine accepts data in a variety of formats including:

How can I move my data into OpenRefine?

There are a number of ways to get data into OpenRefine:

We will do it by importing a file from a URL, specifically: https://github.com/LibraryCarpentry/lc-open-refine/raw/gh-pages/data/doaj-article-sample.csv.

This is taken from the Carpentries OpenRefine lessons, which this lesson is based on (see credits).
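For comparison, the same "fetch a CSV from a URL and read the first line as column names" step can be sketched outside OpenRefine in Python. This is only an illustration, not how OpenRefine works internally: the function names `parse_csv_text` and `fetch_csv` are made up for this example, and the UTF-8 encoding is an assumption about the sample file.

```python
import csv
import io
import urllib.request

# URL taken from the lesson text above.
SAMPLE_URL = (
    "https://github.com/LibraryCarpentry/lc-open-refine/"
    "raw/gh-pages/data/doaj-article-sample.csv"
)


def parse_csv_text(text, delimiter=","):
    """Parse CSV text into (header, rows).

    The first line is treated as the column names, similar to ticking a
    "first line is header" style option when importing.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    all_rows = list(reader)
    if not all_rows:
        return [], []
    return all_rows[0], all_rows[1:]


def fetch_csv(url=SAMPLE_URL, encoding="utf-8-sig"):
    """Download a CSV file and parse it (needs network access).

    The encoding is an assumption; change it if characters look wrong.
    """
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    return parse_csv_text(text)
```

Calling `fetch_csv()` returns the column names and the data rows, which can be inspected in the same spirit as the preview OpenRefine shows before a project is created.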

Create your first OpenRefine project

Once OpenRefine is launched in your browser:

Import screen

OpenRefine should show you a preview of how it will parse your dataset, along with various options to ensure the data is imported into OpenRefine correctly.

[Screenshot: parsing options screen]

What you choose here will depend on the type of data you are importing, and on whether anything looks unexpected in the preview window. For this exercise we’ll make a few checks:

Once you are happy, click the Create Project >> button at the top right of the screen. This will create the project and open it for you. Projects are saved as you work on them, so there is no need to save copies as you go along.

Going Further

Layout of OpenRefine

OpenRefine displays data in tabular format. Each row will usually represent an instance of the data, while each column represents a type or field of information. This is similar to how you might view data in a spreadsheet or a database. As with a spreadsheet, the individual bits or values of data are in cells at the intersection of a row and a column.

Working with data

Moving Columns

If you didn’t have the correct column as the first column you could fix this in OpenRefine by

Rows and Records

OpenRefine has two modes of viewing data: Rows and Records. It defaults to Rows mode.

At the moment the row and record counts are the same. To show how they can differ, we’ll look for cells that hold multiple values and split them across multiple rows.
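The effect of such a split on the two counts can be sketched in Python. This is a simplified model rather than OpenRefine's own code: it assumes that a record starts at every row whose key (first) column is not blank, and that the extra rows created by a split leave every other column blank. The column names, sample values, the `|` separator, and the function names are all made up for illustration.

```python
def split_multivalued(rows, column, separator="|"):
    """Split cells in `column` on `separator`, giving one value per row.

    Rows added by the split keep all other columns blank (None).
    """
    result = []
    for row in rows:
        cell = row[column]
        values = cell.split(separator) if cell else [cell]
        for index, value in enumerate(values):
            if index == 0:
                new_row = dict(row)  # first value stays on the original row
            else:
                new_row = {key: None for key in row}  # blank continuation row
            new_row[column] = value
            result.append(new_row)
    return result


def count_records(rows, key_column):
    """Count rows that start a record, i.e. have a non-blank key column."""
    return sum(1 for row in rows if row[key_column] not in (None, ""))


# Hypothetical sample data: two articles, the first with two authors.
articles = [
    {"Title": "Article A", "Authors": "Smith, J|Jones, K"},
    {"Title": "Article B", "Authors": "Lee, M"},
]
split_rows = split_multivalued(articles, "Authors", "|")
```

After the split, `len(split_rows)` is 3 while `count_records(split_rows, "Title")` is still 2: there are now more rows than records.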

In these multi-valued cells the character separating each value is called a separator or delimiter. Choose a good separator: a comma in the author field would cause problems, since the values themselves may contain commas. As a rule, choose a separator that does not appear in your data values.
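A quick way to see why the separator matters is to check a candidate against the values before joining them. The sketch below is illustrative only: the helper names and the author names are hypothetical.

```python
def separator_is_safe(values, separator):
    """Return True if `separator` does not occur inside any value."""
    return not any(separator in value for value in values)


def join_values(values, separator):
    """Join values into one multi-valued cell, refusing an unsafe separator."""
    if not separator_is_safe(values, separator):
        raise ValueError(f"separator {separator!r} appears in the data")
    return separator.join(values)


# Hypothetical author names written as "Surname, Initial".
authors = ["Smith, J", "Jones, K"]

comma_ok = separator_is_safe(authors, ",")  # False: names contain commas
pipe_ok = separator_is_safe(authors, "|")   # True: no pipe in any name
cell = join_values(authors, "|")

# Joining on a comma and splitting again breaks the two names into four parts.
broken = ",".join(authors).split(",")
```

Here the two authors survive a round trip through `|`, whereas the comma version comes back as four fragments instead of two names.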

Exercise

So that is some initial reorganising. Next we’ll look at refining functions: faceting, filtering, and clustering.

💡 Key Points

✅ Use the Create Project option to import data

✅ Specify how data imports using parsing options

✅ OpenRefine displays data in rows and columns. Records can link multiple rows to a single record.

✅ Most work in OpenRefine is performed via drop-downs at the top of each column

✅ Split and join multi-valued cells to view or modify individual values

✅ When creating multi-valued cells in your data, choose your separator carefully
