DCU Library | Digital Humanities Workshops | Intro to Data Cleaning with OpenRefine

Facets and Clusters

Facets

Facets are one of the most useful features in OpenRefine. They can help you explore your data, work with subsets of it, and correct common data issues.

A facet groups all the values that appear in a column, and then allows you to filter the data by these values (or look at one group of records at a time) and apply batch edits. There are different types of facet, but the first one we’ll look at is called a text facet.
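To make the idea concrete outside OpenRefine, here is a minimal Python sketch of what a text facet does: it counts the rows for each distinct value in a column and lets you pick out the rows for one value. The sample column and the function names `text_facet` and `rows_matching` are made up for illustration; this is not how OpenRefine itself is implemented.

```python
from collections import Counter

# Hypothetical sample column; the values are invented for illustration.
publishers = ["MDPI", "mdpi", "MDPI", "Elsevier", "", "Elsevier ", "MDPI"]


def text_facet(values):
    """Count how many rows hold each distinct value, most frequent first."""
    return Counter(values).most_common()


def rows_matching(values, choice):
    """Return the row indexes whose value exactly matches one facet choice."""
    return [index for index, value in enumerate(values) if value == choice]


facet = text_facet(publishers)
selected = rows_matching(publishers, "MDPI")

for value, count in facet:
    # repr() makes blank values and trailing spaces visible
    print(repr(value), count)
```

Note that "MDPI", "mdpi" and "Elsevier " (with a trailing space) each get their own entry. Spotting variants like these is exactly what makes a text facet useful for cleaning.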

Perform a text facet

There are other types of facets - numeric, scatterplot, timeline - which require the data to be in formats other than text. We’ll look at formats in OpenRefine shortly.

Facet by duplicate

Facet by blank

Use facets to batch edit data

As well as giving you a clear view of the data, facets also let you start working on it.

Working with filtered data

It is very important to note that when you have filtered the data displayed in OpenRefine (as we did above by selecting facet results), any operations you carry out will apply only to the rows that match the filter, i.e. the data currently being displayed. For example, if you remove rows while a filter is active, only the rows that match the filter are removed.
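To illustrate the principle outside OpenRefine, here is a minimal Python sketch in which an operation touches only the rows that match the current filter. The sample rows and the names `matches_filter` and `remove_matching_rows` are hypothetical and are not part of OpenRefine.

```python
# Hypothetical sample rows; titles and languages are invented for illustration.
rows = [
    {"title": "A", "language": "EN"},
    {"title": "B", "language": ""},
    {"title": "C", "language": "EN"},
    {"title": "D", "language": ""},
]


def matches_filter(row):
    """Stand-in for a facet choice: rows whose language cell is blank."""
    return row["language"] == ""


def remove_matching_rows(all_rows, predicate):
    """Drop only the rows that match the filter; all other rows are kept."""
    return [row for row in all_rows if not predicate(row)]


remaining = remove_matching_rows(rows, matches_filter)
print([row["title"] for row in remaining])
```

Rows A and C survive because they never matched the filter. The same reasoning applies to edits: an edit made while a filter is active leaves the hidden rows unchanged.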


Clustering

Clustering is another way to clean your data - somewhat similar to the text facet but it uses algorithms to detect similar values and suggest merges rather than the more manual selection of facets. It looks for patterns of variation without you needing to do quite so much detective work. So for very large datasets this can be really useful.

The default method (fingerprinting) ignores variations in whitespace, case, punctuation and word order. It is pretty good at pulling out the most common and most obvious inconsistencies. After you’ve applied it and merged the clusters you want, you can try one of the other algorithms. More information on the heuristics used is available at https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
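For a feel of how fingerprint-style clustering works, here is a rough Python sketch loosely modelled on the steps described on that wiki page. It trims and lowercases each value, folds accents to ASCII, strips punctuation, then sorts and de-duplicates the words. It is an approximation for illustration, not OpenRefine's actual code, and the sample names and the functions `fingerprint` and `cluster` are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a key that ignores case, spacing, punctuation and word order."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop repeated words, sort, and rejoin
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical author values, invented for illustration
names = ["Smith, John", "john smith", " John  Smith ", "Jon Smith"]
clusters = cluster(names)
print(clusters)
```

The first three values share the key "john smith" and form one cluster. "Jon Smith" is a genuine spelling difference, so this method leaves it out; that kind of variant is what the other algorithms are for.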

It’s quite helpful that you can review each suggested cluster and pick individually where you want to make these batch changes.

The clustering is syntactic only, not semantic: it compares the characters in the values, not their meaning. You might want to match by meaning with subject terms, for example. To do that you would have to use reconciliation, where OpenRefine can link with another dataset. We will look at that later.

After whatever clustering you have done, you can join the multiple values in the author cell back together.

OpenRefine is useful to surface near-matches, but it does not automate the work of collapsing them into normalized values. Instead, it focuses one’s attention and labour on exactly that activity. In our initial version of computer-assisted data curation, you still have to touch each data point.

💡 Key Points

✅ You use facets and filters to explore your data

✅ Facets and filters also enable you to work with subsets of data

✅ You can correct common data issues from a facet

✅ Clustering is a way of finding variant forms within a dataset (e.g. different spellings of a name)

✅ There are a number of different clustering algorithms that work in different ways

✅ The best clustering algorithm to use will depend on your data
