OpenRefine accepts data in a variety of formats, and there are a number of ways to get data into it.
We will do it by importing a file from a URL, specifically: https://github.com/LibraryCarpentry/lc-open-refine/raw/gh-pages/data/doaj-article-sample.csv.
This is taken from the Carpentries OpenRefine lessons, which this lesson is based on (see credits).
Once OpenRefine is launched in your browser:
- Click Create Project in the left hand menu.
- Choose Web Addresses (URLs).
- Enter https://github.com/LibraryCarpentry/lc-open-refine/raw/gh-pages/data/doaj-article-sample.csv in the text box.
- Click Next >>.
OpenRefine should give you a preview of how it will parse your dataset, along with various options to ensure the data is imported correctly.
What you choose here will depend on the type of data you are importing and on whether anything appears unexpectedly in the preview window. For this exercise we’ll make a few checks:
- Click in the Character encoding box and set it to UTF-8. This can be important if you have a dataset that uses certain special characters.
- Check that 1 is entered in Parse next 1 line(s) as column headers (it should default to this when it recognises header content).
- Make sure the Attempt to parse cell text into numbers box is not checked. We will work through this during the exercises.
- The Project Name box in the upper right corner will default to the title of your imported file. Click here and give your project a different name.
Once you are happy, click the Create Project >> button at the top right of the screen. This will create the project and open it for you. Projects are saved as you work on them, so there is no need to save copies as you go along.
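As a rough analogy for what these import options do, the Python sketch below parses a small CSV with the same choices: UTF-8 decoding, the first line used as column headers, and cell text kept as strings rather than converted to numbers. The sample row, the column names, and the helper `parse_csv` are invented for illustration and are not taken from the lesson file; this is not how OpenRefine's own importer is written.

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file (not the lesson dataset).
raw_bytes = "Title,Authors,Year\nCafé culture,Smith J|Doe A,2015\n".encode("utf-8")

def parse_csv(data, encoding="utf-8"):
    """Decode the bytes, use the first line as column headers, keep every cell as text."""
    text = data.decode(encoding)
    rows = list(csv.reader(io.StringIO(text)))
    headers, body = rows[0], rows[1:]
    return [dict(zip(headers, row)) for row in body]

records = parse_csv(raw_bytes)
print(records[0]["Title"])       # the accented character survives with UTF-8
print(type(records[0]["Year"]))  # still a string: no number parsing was attempted
```

Calling `parse_csv(raw_bytes, encoding="latin-1")` on the same bytes garbles the accented character in the title, which is the kind of problem the Character encoding setting is there to avoid.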
If you import hierarchical data such as XML or JSON, you will need to select the record path, i.e. the parts of the file that will form the data rows in the OpenRefine project. This is not covered in this lesson.
OpenRefine displays data in tabular format. Each row will usually represent an instance of the data, while each column represents a type or field of information. This is similar to how you might view data in a spreadsheet or a database. As with a spreadsheet, the individual values of data are in cells at the intersection of a row and a column.
OpenRefine displays a limited number of rows at one time. You can adjust this (from 5 rows up to 1000) at the top left of the table.
In OpenRefine we will not generally work with our data record by record; instead we’ll find ways to group or filter it into batches and then work within those batches.
We will mostly use the drop down menus at the top of each column to carry out operations. Click on the small downward arrow to bring up a menu of options for that column. When you select an option in a column (e.g. to make a change to the data), it will affect all the cells in that column. If you want to make changes across several columns, you do this one column at a time.
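To make the "one column at a time" idea concrete, here is a minimal Python sketch that applies a single change to every cell in one column of a small table. The table contents and the helper name `transform_column` are hypothetical; the sketch only mirrors the behaviour described above and says nothing about how OpenRefine is implemented.

```python
# Hypothetical table: a list of rows, each row a dict keyed by column name.
table = [
    {"Title": "  Open data  ", "Language": "EN"},
    {"Title": "Library metadata ", "Language": "en"},
]

def transform_column(rows, column, func):
    """Return new rows with `func` applied to every cell in `column`; other columns are untouched."""
    return [{**row, column: func(row[column])} for row in rows]

# One operation affects all cells in the chosen column.
cleaned = transform_column(table, "Title", str.strip)
# A second column needs its own, separate operation.
cleaned = transform_column(cleaned, "Language", str.lower)
```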
If you didn’t have the correct column as the first column, you could fix this in OpenRefine with:
Title > Edit column > Move column to beginning
If you are reordering columns like this at the start, an easier way is to use the All menu on the left hand side of the table:
All > Edit Columns > Reorder / Remove Columns
This is also a handy way to remove multiple columns if your raw dataset has lots of data that is not relevant to you.
OpenRefine has two modes of viewing data: Rows and Records. It defaults to Rows mode.
At the moment the row and record counts are the same. To show how they can differ, we’ll look for cells which have multiple values and split them across multiple rows.
The author column is showing lots of cells with more than one author. Click edit on one of these cells for a closer look at what is in that column. Notice the separator is a | or ‘pipe’ character.
To split the values across rows, choose:
Author > Edit cells > Split multi-valued cells
To join them back into a single cell, choose:
Author > Edit cells > Join multi-valued cells
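The effect of splitting and joining on the row and record counts can be sketched in Python. The two-article table and the helper names below are hypothetical. The sketch assumes, as OpenRefine's Records mode does, that a row with a blank first column belongs to the record above it; it is an illustration, not OpenRefine's implementation.

```python
SEPARATOR = "|"

# Hypothetical two-article table; before splitting, each row is one record.
rows = [
    {"Title": "Article A", "Author": "Smith J|Doe A|Lee K"},
    {"Title": "Article B", "Author": "Patel R"},
]

def split_multi_valued(rows, column, sep):
    """Emit one row per value; only the first row of each record keeps the other columns."""
    out = []
    for row in rows:
        for i, value in enumerate(row[column].split(sep)):
            if i == 0:
                out.append({**row, column: value})
            else:
                blank = {key: "" for key in row}
                blank[column] = value
                out.append(blank)
    return out

def join_multi_valued(rows, column, sep, key_column):
    """Fold rows whose key column is blank back into the record above them."""
    out = []
    for row in rows:
        if row[key_column] == "" and out:
            out[-1][column] += sep + row[column]
        else:
            out.append(dict(row))
    return out

split_rows = split_multi_valued(rows, "Author", SEPARATOR)
record_count = sum(1 for row in split_rows if row["Title"] != "")
rejoined = join_multi_valued(split_rows, "Author", SEPARATOR, key_column="Title")
```

After the split there are four rows but still two records, and joining with the same separator restores the original table.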
In these multi-valued cells, the character separating each value is called a separator or delimiter. Choose a good separator: a comma in the author field would cause problems! Generally, choose a separator that does not appear in your data values.
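A short Python sketch shows why the separator matters. The author names and the helper `round_trips` are made up for illustration; the check is simply whether joining the values and splitting them again gives back the same list.

```python
# Hypothetical author values written as "Surname, Initial".
authors = ["Smith, J", "Doe, A"]

def round_trips(values, sep):
    """True if joining with `sep` and splitting again returns the original values."""
    return sep.join(values).split(sep) == values

pipe_ok = round_trips(authors, "|")   # the pipe never occurs inside a name
comma_ok = round_trips(authors, ",")  # the comma also occurs inside each name

print(pipe_ok, comma_ok)
```

With a comma as the separator, the two names come back as four fragments, so the original values cannot be recovered.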
Repeat this for the subject field: split and then join.
So that is some initial reorganising. Next we’ll look at refining functions: faceting, filtering, and clustering.
✅ Use the Create Project option to import data
✅ Specify how data imports using parsing options
✅ OpenRefine displays data in rows and columns. Records can link multiple rows to a single record.
✅ Most work in OpenRefine is performed via drop-downs at the top of each column
✅ Split and join multi-valued cells to view or modify individual values
✅ When creating multi-valued cells in your data, choose your separator carefully