Facets are one of the most useful features in OpenRefine. They can help you explore your data and correct common issues in it.
A facet groups all the values that appear in a column, and then allows you to filter the data by these values (or look at one group of records at a time) and apply batch edits. There are different types of facet, but the first one we’ll look at is called a text facet.
To create one, choose Publisher > Facet > Text Facet.
Sort the facet by count, and we can see what the most frequent value is.
The Include / Exclude / Invert options appear when you put your mouse over a value in the facet.
There are other types of facets (numeric, scatterplot, timeline) which require the data to be in formats other than text. We’ll look at formats in OpenRefine shortly.
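What a text facet does (group the values in a column, count each group, then filter by a chosen value) can be sketched in a few lines of Python. This is only an illustration of the idea, not how OpenRefine is implemented; the sample rows, the publisher names, and the function names `text_facet`, `include`, and `exclude` are made up for the example.

```python
from collections import Counter

# Hypothetical sample rows; titles and publisher names are invented.
rows = [
    {"Title": "A", "Publisher": "Acme Press"},
    {"Title": "B", "Publisher": "Acme Press"},
    {"Title": "C", "Publisher": "Birch Books"},
    {"Title": "D", "Publisher": "Acme Press"},
    {"Title": "E", "Publisher": "Cedar House"},
]


def text_facet(rows, column):
    """Group the rows by the value in `column` and count each group."""
    return Counter(row[column] for row in rows)


def include(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


def exclude(rows, column, value):
    """Keep only the rows whose `column` does not equal `value`."""
    return [row for row in rows if row[column] != value]


facet = text_facet(rows, "Publisher")

# Sorting by count puts the most frequent value first.
by_count = facet.most_common()
most_frequent = by_count[0]

print(by_count)
print(len(include(rows, "Publisher", "Acme Press")), "rows included")
print(len(exclude(rows, "Publisher", "Acme Press")), "rows excluded")
```

Inverting a selection of a single value gives the same rows as excluding that value, which is why the sketch has no separate function for it.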
As well as giving you this clear view of the data, facets also let you start to work on the data.
For example, create a text facet on Language: the values EN and English are a variation of the same value that can be corrected from the facet.
It is very important to note that when you have filtered the data displayed in OpenRefine (as we did above by selecting facet results), any operations you carry out will apply only to the rows that match the filter, i.e. the data currently being displayed. For example, if you wish to remove the rows that match a filter, you can do this as follows:
On the All column heading, choose Edit rows > Remove all matching rows
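The rule that operations apply only to the rows matching the current filter can be illustrated with a small Python sketch. It is a rough model of the behaviour described above, not OpenRefine's own code; the sample rows (including the FR value) and the helper names `matches`, `edit_matching`, and `remove_matching` are assumptions made for the example.

```python
# Hypothetical sample rows; only EN and English come from the lesson text.
rows = [
    {"Title": "A", "Language": "EN"},
    {"Title": "B", "Language": "English"},
    {"Title": "C", "Language": "FR"},
    {"Title": "D", "Language": "EN"},
]


def matches(row):
    """The current filter: rows selected by choosing EN in a Language facet."""
    return row["Language"] == "EN"


def edit_matching(rows, matches, column, new_value):
    """Batch edit: change `column` only in rows that match the filter."""
    return [
        {**row, column: new_value} if matches(row) else row
        for row in rows
    ]


def remove_matching(rows, matches):
    """Remove all matching rows: rows outside the filter are kept."""
    return [row for row in rows if not matches(row)]


# Correcting the variation touches only the filtered rows.
edited = edit_matching(rows, matches, "Language", "English")
print([row["Language"] for row in edited])

# Removing matching rows leaves every other row in place.
remaining = remove_matching(rows, matches)
print([row["Title"] for row in remaining])
```

Both helpers return new lists and leave `rows` unchanged, which makes it easy to compare the data before and after the operation.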
Clustering is another way to clean your data - somewhat similar to the text facet but it uses algorithms to detect similar values and suggest merges rather than the more manual selection of facets. It looks for patterns of variation without you needing to do quite so much detective work. So for very large datasets this can be really useful.
Split the author values again (using Edit cells > Split multi-valued cells, using the pipe ( | ) character as the separator).
On the author column, choose Edit cells > Cluster and edit.
The default method (fingerprinting) looks at variations in whitespace and case. It is pretty good at pulling out the most common and most obvious inconsistencies. After you’ve applied it and merged the suggested values, you can try one of the other algorithms. More information on the heuristics used is available here:
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
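The idea behind fingerprint-style clustering can be sketched in Python: reduce each value to a normalized key, then group the values that share a key. This is a simplified approximation based on the steps described on the page linked above (trimming, lowercasing, removing punctuation, folding accents, sorting and de-duplicating tokens), not OpenRefine's actual implementation; the function names and the sample names are made up for the example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a simplified fingerprint key for a string (an approximation)."""
    # Trim surrounding whitespace and ignore case.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on any whitespace, remove duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by key; keep groups with more than one distinct form."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(set(members)) > 1]


# Hypothetical author values with variations in case, spacing and order.
authors = ["Smith, John", "John Smith", "john  smith ", "Jane Doe"]

for members in cluster(authors):
    print(members)
```

The sketch only suggests groups. As in OpenRefine, deciding which form each group should be merged into is left to the person reviewing the clusters.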
It’s quite helpful that you can have a good look at the suggested clusters and pick individually which ones you want to merge as batch changes.
The clustering is syntactic only, not semantic, so it does not look at meaning. You might want to match on meaning with subject terms, for example. To do that you would have to use reconciliation, where OpenRefine can link with another dataset. We will look at that later.
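After whatever clustering you have done, join the multiple values in each author cell again (using Edit cells > Join multi-valued cells, with the pipe ( | ) character as the separator).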
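The split, merge, and join round trip on a multi-valued author cell can be sketched as follows. In OpenRefine the split values end up in separate rows; a Python list stands in for that here. The function names, the sample cell, and the `merges` mapping are hypothetical and only illustrate the sequence of steps.

```python
SEPARATOR = "|"  # the pipe character used as the separator in the lesson


def split_multi_valued(cell, separator=SEPARATOR):
    """Split one multi-valued cell into its individual values."""
    return cell.split(separator)


def join_multi_valued(values, separator=SEPARATOR):
    """Join individual values back into a single multi-valued cell."""
    return separator.join(values)


# Hypothetical cell holding two authors.
cell = "Smith, J.|Doe, Jane"

# Hypothetical merge chosen while reviewing clusters: variant -> preferred form.
merges = {"Smith, J.": "Smith, John"}

authors = split_multi_valued(cell)
cleaned = [merges.get(author, author) for author in authors]
rejoined = join_multi_valued(cleaned)

print(rejoined)
```

Splitting first means each author is compared on its own, so a variant is found even when it shares a cell with other names.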
OpenRefine is useful to surface near-matches, but it does not automate the work of collapsing them into normalized values. Instead, it focuses one’s attention and labour on exactly that activity. In our initial version of computer-assisted data curation, you still have to touch each data point.
✅ You use facets and filters to explore your data
✅ Facets and filters also enable you to work with subsets of data
✅ You can correct common data issues from a facet
✅ Clustering is a way of finding variant forms within a dataset (e.g. different spellings of a name)
✅ There are a number of different clustering algorithms that work in different ways
✅ The best clustering algorithm to use will depend on your data