Commonly, OpenRefine is used to »refine« data tables. This might consist of cleaning, formatting, merging and extracting columns. The main functionalities of OpenRefine are:
- Importing various formats
- Filtering huge datasets
- Merging similar/incorrect values
- Operations on entire columns or multiple cells
- Extracting / applying operation commands
- Exporting in various formats
Download and start
Create a new project by importing data from Excel, CSV, TSV, JSON, … file or from the clipboard. For this guide, use the example file MoviesList.csv. If you use the file, use the following options:
- Select to use comma for separation
- Set character encoding to UTF-8
- Ignore the first four lines
- Select to read the first line as header
What makes OpenRefine efficient for big datasets is that it allows viewing only an excerpt of the whole dataset. Usually, the user applies tasks to the complete column. This is for example »convert every cell in this column to lowercase« or »replace the word foo with bar in every cell in this column«. The task is applied to the whole column without the user seeing the actual cell. This saves a great amount of computational power.
The user can choose how many rows this excerpt should include just above the table. Even though it is not possible to view more than 50 rows at once, the user can navigate through each page of the dataset through the option on the right top of the table.
Rows and records
Rows are the absolute lines the tables consists of like in an Excel spreadsheet. Records, however, can consist of multiple rows that are grouped under the same item in the first column. For this, the first row in the record contains a value (i.e.
North America) while the next items do not contain any value in the first column. In this example, the dataset consists of three records and 18 rows.
Rows and records can be easily converted into each other by clicking on the arrow in the header of the first column and selecting
Edit cells/Fill down or
Edit cells/Blank down.
All operations that can be executed on a single column can be selected from the arrow left to the title of the column. Operations that will be executed on all columns can be selected from the default, first column
Renaming a column
Edit column/Rename this column from the
Column column and type in
Hiding a column
View/Collapse this column from any column. Click on the remaining space to show the column again.
Reorder/remove multiple columns
Edit columns/Re-order / remove columns… from the
All column. Change the order by dragging them to the top or bottom or delete them by dragging them to the right field.
Facets allow viewing all occurring values for a column. Through this list, the dataset can be filtered to only show rows with the selected value included.
The most basic form is the text facet. Like every operation we can apply to a column, we click on the arrow left to
Title and select
On the left, we now see a unique list of all titles. The numbers indicate how many times this word appears. If we select
count from the header the list is ordered by this number. In order to export this list we can click on the
17 choices link in the header.
One big advantage of OpenRefine is to filter the list of viewed items. This allows to apply operations to a subset of the dataset and also to export only a fraction of rows.
In this example, we use it to remove duplicates: Sort by
count and click on the first item (
Ulysses' Gaze). Since we can only remove all viewed rows in OpenRefine we need to make one item differ from the other. Do that by clicking on the flag icon in the beginning of the row. Next, select from the
Facet/Facet by flag. In order to remove all currently visible rows, select
Edit rows/Remove all matching rows for the
In order to filter by multiple items simply select one and then select
include when hovering the additional items. Once one item or more are selected, you can also select
invert from the header of the facet list. Filtering with multiple facets is also possible.
If the column we want to apply the facet on consists of numeric values the display as list might not be the best solution. Instead, we can use the
Numeric facet that displays the values as a histogram.
Facet/Numeric facet from the
RottenTomatos UserRating column. In the facet the message
No numeric value present will appear. This is because the column is currently formatted as strings (text). To solve this, select
Edit cells/Common transforms/To number from the
RottenTomatos UserRating column. The values will now appear green and on the right side.
Filtering works the same way by using the handles on both sides.
Custom text/numeric facet
Currently, the items in the facet list match the value in the column. What if we want to modify the value for the facet? Select
Facet/Custom text facet… from the
Title column. Facet by the first character? Use
value. Check if the title contains
The same works for custom numeric facets. Select
Facet/Custom Numeric Facet… from the
RottenTomatos UserRating column and use
value.toNumber() + value.toNumber() to facet by cross sum.
In the menu
Facet/Customized facets you can find commonly used custom facts.
OpenRefine can also handle array values for facts. For example, select Select
Facet/Custom text facet… from the
Genre column. Type
value.split(', ') to split up the columns by the separator
Date facets are similar to numeric facts but require date values. This requires properly formatted string values that can be converted to dates.
One easy way to see possible correlating columns is to use scatterplots. Even though the implementation is very basic, it is a good way to plot every column against every other. Convert the
RottenTomatos column to numeric values and select
Facet/Scatterplot facet. Since the dataset currently, consists of two columns with numeric values, we only select the one scatterplot available. Unfortunately, filtering or labels on the points are not available.
One important functionality is clustering values by similarity. This allows to rapidly clean messy data. Start by creating a text facet for the
Director column and click on
Cluster above the facet list.
The first group of methods are key collision methods that work with alternative/simplified representations of the string. These methods are fast since these alternative versions need to created only once for each clustering. The second group are the nearest neighbour methods that calculate the difference between every pair of words.
The first cluster method is fingerprinting. This simple method is good for matching strings with special characters or an alternative order of words. It works by
- removing leading and trailing whitespace
- changing all characters to their lowercase representation
- removing all punctuation and control characters (tab, new line, …)
- splitting the string into words
- sorting the words and removing duplicates
- joining the words back together
- normalising extended western characters to their ASCII representation (for example
OpenRefine already selects the most occurring appearance as new cells value. You can also click on one of the listed appearances or type your own. In order to apply this change select
Merge? for each cluster and then
Merge Selected & Re-Cluster.
This method uses a similar process but with n-grams. These are parts of the word with a user-defined length. By default, words are split into 2-character chunks that are then sorted and uniquely joined.
This method is most useful with 2- or 1-grams. It can detect shuffled or words with repeating characters.
Most languages allow different writings for similar-sounding words. These methods convert strings into their pronounced writing. Metaphone3 is used for English words, cologne-phonetic for German words.
These methods are useful for incorrect writing, especially for manually written data.
nearest neighbour method calculates the number of changes required from one string to another. For example, starting from
Park we only need one change; hence the Levenshtein distance is 1. Try changing the radius to create clusters that are more diverse.
block Chars will dramatically raise the computational cost. It first performs a key collision clustering that groups words that contain a substring of this size.
This method is useful for common typos or simple alternative writings.
The Prediction by partial matching method is the most complex method. It works by compressing strings into smaller descriptions. For example, the string
ababab can be described as
3 times ab. Now, if you have two strings that are similar then description of theses strings combined should be similar to their independent description. Again, you can change the radius to raise the threshold.
This method is useful if complete parts of a string are missing but can also create many false positive matches.
You can find more information about clustering in the OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
Just like in Excel, we can edit single cells. For this hover over the first cell in the
Year column and select edit. Make the changes and select
We can also apply changes to cells that consist of the same value. For this select one of the cells containing
Asian from the
Continent column. Type
Asia and select
Apply to All Identical Cells.
One of the most important applications is applying transformations on entire columns.
Commonly used transformations can be found under
Edit cells/Common Transformations. This allows, for example, to convert all cells in the
Title column into title case.
Every new dataset should experience a basic cleanup. This includes:
Edit cells/Common transforms/Trim leading and trailing whitespace
Edit cells/Common transforms/Collapse consecutive whitespace
Edit cells/Common transforms/To numberon numerical columns
Since we can use
Edit cells/Transform… on the
Year column. Type in
Convert German number notation to English notation
Edit cells/Transform… on the
IMDb Rating column. Type in
value.replace('.', '').replace(',', '.').toNumber()
Extracting the pure year value can either be done by using this Regex expression
or by removing the additional text.
Sort words in a column
Edit cells/Transform… on the
Genre column. Type in
value.split(', ').sort().join(', ')
Edit column/Split into several columns… on the
Cinematography/Music column. Use
, as a separator and set the number of columns to
2. Next, you might want to rename the two new columns.
Create a new column based on another column
Edit column/Add column based on this column… on the
Genre column and type in
If you want to reference another column use
Merge two tables
Another great feature of OpenRefine is the ability to merge two tables based on a common column. Make sure you have a second table named
Movies Countries (MoviesCountries.csv) with a column
MovieTitle and second column
MovieCountries. In the original table select
Edit column/Add column based on this column… from the
MovieTitle column. Type in
cell.cross("Movies Countries", "MovieTitle").cells["MovieCountries"].value
This opens the table
Movies Countries and searches for all values from the column
Title in the column
MovieTitle in the target dataset. If it finds equal values it copies the value from the column
MoviesCountries from that row.
Edit cells/Transform… on the
Duration column. Type in
value.match(/(\d+)h(\d+)m/).toNumber() * 60 + value.match(/(\d+)h(\d+)m/).toNumber()
You can find all available information in the OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions
All the changes in OpenRefine are non-destructive. The complete history can be viewed in the
Undo/Redo tab below the OpenRefine logo.
Exporting and applying changes
If you need to edit dataset with the same structure multiple times, it is recommended to extract all changes done to one dataset and then apply these to the others. Select
Extract… from the header of the
Undo / Redo tab and copy the JSON object on the right. In the next dataset, select
Apply… and put in the JSON object.
OpenRefine can export to multiple formats. All options can be found under the
Export option in the top right.
Export entire project
You can also export the entire project through
Export/Export project. Save the file and select it when you create a new dataset under the
Import Project section on the start screen.
Custom tabular exporter
If you want to specify exactly which columns, their format and the output file select
Export/Custom tabular exporter…. There you can change the order, select which columns to export and also file format under the
You can also use the JSON object in the
Option Code tab if you want to use these settings multiple times.