owned this note
owned this note
Published
Linked with GitHub
---
title: Open Refine Exercises
spellcheck: true
authors: ["Belinda Weaver", "Tim Dennis", "Library Carpentry community"]
tags: openrefine, data, munging
---
## Library Carpentry - Open Refine Exercises
### Exercise 1: Download data:
* Go to <https://raw.githubusercontent.com/data-lessons/library-openrefine/gh-pages/data/doaj-article-sample.csv>. Right click anywhere on the screen and choose `Save As …` The filename is `doaj-article-sample.csv` and it should be saved to Desktop.
### Exercise 2: Create project
1. Open OpenRefine.
2. Click `Create project` then `Choose File`. Select `doaj-article-sample.csv` and then click Next. In the Parse text section, ensure separators are commas and first line is being parsed as a header. Then click `Create project`.
### Exercise 2 (Alternate): Create project from web address
1. Open OpenRefine: typically located here: <http://127.0.0.1:3333>
2. Click `Create Project` and then `Web Addresses (URLs)`
3. Paste in <https://raw.githubusercontent.com/data-lessons/library-openrefine/gh-pages/data/doaj-article-sample.csv>. In the Parse text section, ensure separators are commas and first line is being parsed as a header. Then click `Create project`.
### Split author names into separate cells
1. To work with the author names effectively we need to split them into separate cells:
2. Click the dropdown menu at the top of the `Author` column
3. Choose `Edit cells ->Split multi-valued cells`
4. In the prompt type the “:” symbol and click `OK`
5. Note that the rows are still numbered sequentially
6. Click the `Records` option to change to `Records` mode and then go back to `Rows`
7. Note how the numbering changes as you toggle - indicating that several rows are related to the same record.
* A common workflow with multi-valued cells is split multi-valued cells into individual cells (what we did above),
* then modify/refine/clean individual cells
* join multi-valued cells back together
We'll cover modifying cells will be covered in future episode, but for now we will cover how to join cells back together that have been split previously.
### Join authors back together
* Click the dropdown menu at the top of the Author column
* Choose `Edit cells->Join multi-valued cells`
* In the prompt type the ( | ) symbol
* Here we are specifying the delimiter character for OpenRefine to use to join the values together.
* Click `OK` to join the Authors cells back together
You will now see that split rows have gone away - the Authors have been joined into a single cell with the specified delimiter. Our Rows and Records values will now be the same since we do not have any more split columns.
Click both the Rows and Records options and observe how the numbers of Rows and Records are equal
### Class exercise:
<https://librarycarpentry.github.io/lc-open-refine/03-working-with-data/index.html#splitting-subjects-into-separate-cells>
## Faceting
* ‘Facet’ groups all the values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.
* text facet is simplest
* The facet consists of a list of values used in the data. You can filter the data displayed by clicking on one of these headings.
* We can `include` facet values or `exclude` and `invert`
* Let's look at this in action
### Publisher - faceting demo
1. Click on the drop down menu at the top of the `publisher` column and choose `Facet > Text Facet`. The facet will then appear in the left hand panel
2. Filter by facet by clicking or include
3. Select multiple values to include
4. You can `invert` your selections to exclude
5. Include a value and then look at top to `invert inclusion`.
### Class Exercise: Which licences are used for articles in this file?
1. Create a `Facet -> Text facet` for the `Licence` column
2. What is the most common Licence in the file?
3. How many articles in the file don’t have a licence assigned?
### Class Exercise: Find all publications without a DOI
* Use the `Facet -> Facet by blank` function to find all publications in this data set without a DOI
### Exercise 5: Correct the Language values via a facet
1. Create a `Facet -> Text facet` on the Language column
* Notice that there is both ‘EN’ and ‘English’
2. Put the mouse over the `English` value
3. Click `Edit`
4. Type `EN` and click `Apply`
### Exercise 6: Replacing text
1. People often want to replace text. This is done via a GREL `value.replace` transform.
2. Click the dropdown menu at the top of the `Author` column
3. Choose `Edit cells -> Transform`.
4. In the Expression box, type `value.replace("|", ":")`
5. You will see how the change will look in the box below. If you are happy, click OK.
6. This will replace the pipe separator with a colon. Use the same syntax (e.g. `value.replace()` to replace words, punctuation and so on.
### Exercise 8: Use Clustering to clean up author data
1. Choose `Edit cells -> Cluster and edit...` from the `Author` column (which should already be split into individual values from the last exercise)
1. Using the `key collision` method with the `fingerprint` Keying Function work through the clusters of values, merging them to a single value where appropriate
1. Try changing the clustering method being used - which ones work well?
## Working with columns
### Reorder columns
* You can re-order the columns by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing `Edit columns->Re-order / remove columns`
### Sort
* sort data in OpenRefine by clicking on the drop-down menu for the column you want to sort on, and choosing Sort.
* Once you have sorted the data, a new Sort drop-down menu will be displayed.
* Unlike in Excel, ‘Sorts’ in OpenRefine are temporary - that is, if you remove the Sort, the data will go back to its original ‘unordered’ state. The ‘Sort’ drop-down menu lets you amend the existing sort (e.g., reverse the sort order), remove existing sorts, and/or make sorts permanent.
## Transformations
* Splitting data that is in a single column into multiple columns (e.g. splitting an address into multiple parts)
* Standardising the format of data in a column without changing the values (e.g. removing punctuation or standardising a date format)
* Extracting a particular type of data from a longer text string (e.g. finding ISBNs in a bibliographic citation)
### Exercise 9: Correct Publisher data
1. Create a `Text facet` on the Publisher column
* Note that in the values there are two that look identical - why does this value appear twice?
1. On the publisher column use the dropdown menu to select `Edit cells -> Common transforms -> Trim leading and trailing whitespace`
1. Look at the publisher facet now - has it changed? (if it hasn’t changed try clicking the Refresh option to make sure it updates)
## Writing transformations
### Exercise 10: Put titles into Title Case
1. `Text Facet` by publisher
1. Select “Akshantala Enterprises” and “Society of Pharmaceutical Technocrats”
1. To select multiple values in the facet use the ‘Include’ link that appears to the right of the facet
1. See that the Titles for these are all in uppercase
1. Click the dropdown menu on the `Title` column
1. Choose `Edit cells->Transform`.
1. In the Expression box, type `value.toTitlecase()`. Show the `Help` tab and how to find the functions.
1. In the Preview, you can see what the affect of running this will be.
1. Click `OK`
1. Explain `Undo/ Redo` (and ramifications if changes are made) and `Extract`:
- that it is possible to extract all the code to reuse or just a section that they might want to re-use in a different file.
- To apply code to a new file, they would paste in the relevant section and click Apply. Also explain the export options, and how original CSV is untouched by the changes made as any export will have a new filename.
### Exercise 11: Reverse author names
1. Most of the author names are written in the natural order. However, a few have been reversed. Let’s find the ones that contain a comma:
1. On the `Authors` column, use the dropdown menu and select `Facet-> Custom text facet…`
1. The `Custom text facet` function allows you to write GREL functions to create a facet
1. In the Expression box type `value.contains(",")`
1. Click `OK`
1. Since the ‘contains’ function outputs a Boolean value, you should see a facet that contains ‘false’ and ‘true’. These represent the outcome of the expression, i.e. true = values containing a comma; false = values not containing a comma
1. In this facet, select `true` to narrow down to the author names that contain a comma. You should see that small subset on the screen.
* Now we have narrowed down to the lines with a comma in a name, we can use the `match` function. The match function allows you to use regular expressions, and output the capture groups as an array, which you can then manipulate.
1. On the `Authors` column use the dropdown menu and select `Edit cells->Transform`
1. In the Expression box type `value.match(/(.*),(.*)/)`
* The “/”, means you are using a regular expression inside a GREL expression. The parentheses indicate you are going to match a group of characters. The “.*” expression will match any character from 0, 1 or more times. So here we are matching any number of characters, a comma, and another set of any number of characters.
* See how this creates an array with two members in each row in the Preview column.
* To get the author name in the natural order, we will then reverse the array and join it back together with a space to create the string you need:
1. In the Expression box, add to the existing expression until it reads `value.match(/(.*),(.*)/).reverse().join(" ")`
1. In the Preview view you should be able see this has reversed the array, and joined it back into a string
1. Click ‘OK’
### Exercise 12: Splitting and joining columns
In this exercise, we will split the `Citation` column and then join it up again, so you can see how easy it is to split and join columns.
1. To split the Citation column, click the down arrow for the column and choose `Edit column -> Split into several columns`.
1. You will then be prompted to specify the separator, which in this case is a comma. Click OK.
1. This will separate the `Citation` into several columns, named Citation 1. Citation 2 and so on.
1. To rejoin them, click the down arrow on `Citation 1` and choose `Edit cells -> Transform`.
1. In the Expression box, enter `cells['Citation 1'].value + ' , ' + cells['Citation 2'].value + ' , ' + cells['Citation 3'].value + ' , ' + cells['Citation 4'].value`
1. This will rejoin the columns data into a single column, and also put a space and a comma between the text from the different columns.
1. Make sure it all looks fine in the box below, and then click OK.
1. You can remove redundant columns by choosing `Edit column -> Remove this column`.
1. You can then rename the column by choosing `Edit column -> Rename this column`.
## Exporting
* Once you have finished working with a data set in OpenRefine you may wish to export it. The export options are accessed through the Export button at the top right of the OpenRefine interface.
* Export formats support include HTML, Excel and comma- and tab-separated value (csv and tsv). You can also write a custom export, selecting to export specific fields, adding a header or footer and specifying the exact format.
## Advanced OpenRefine functions
### Exercise 13: Creating a human-readable date
1. Click the down arrow in the Date column.
2. Choose `Edit cells -> Common transforms -> To date`. This will format the date into an internationally recognisable form.
3. Then choose `Edit column -> Add column` based on this column
4. Give the new column a name, e.g. New Date.
5. Click OK.
6. Then click the down arrow in the New Date column, and choose `Edit cells -> Transform`.
7. In the Expression box, enter `value.toString("dd MMMM yyyy")`
8. Check that the date appears correctly in the box below, and click OK.
9. You can then hide the other date column by clicking the down arrow again and choosing `View -> Collapse this column`.
### Exercise 14: Retrieving journal details from CrossRef via ISSN
1. Select a single row that contains an ISSN
2. Click the star icon for the relevant row in the first column
3. Choose `Facet by Star`
4. Choose the single row
5. In the ISSN column , choose `Edit column->Add column by fetching URLs`
6. Name the column, e.g., Journal details
7. In the expression box, enter `"http://api.crossref.org/journals/"+value`
8. Click 'OK'
9. This GREL is a URL used to retrieve data (the format of the data could be HTML, XML, JSON, or some other text format). At this point you should have a new cell containing a long text string in JSON format (which OpenRefine can parse).
10. In the Journal Details column, choose `Edit column->Add column based on this column`
11. Name the column e.g., Journal Title
12. In the Expression box, enter `value.parseJson().message.title`
13. The reason for using `Add column` based on this column is to allow you to retain the full JSON in the first column and extract further data from it. If you did not need any other information from the JSON, you could use `Edit cells->Transform` with the same GREL expression.
### Exercise 15: Reconcile Publisher names with VIAF IDs
* In this exercise you are going to use the VIAF Reconciliation service written by Jeff Chiu. Jeff offers two ways of using the reconciliation service - either via a public service he runs at <http://refine.codefork.com/>, or by installing and running the service locally using the instructions at <https://github.com/codeforkjeff/refine_viaf>.
* If you are going to do a lot of reconciliation, please install and run your own local reconciliation service - the instructions at <https://github.com/codeforkjeff/refine_viaf> make this reasonably straightforward.
Once you have chosen which service you are going to use:
1. In the Publisher column use the dropdown menu to choose `Reconcile->Start Reconciling`
* If this is the first time you’ve used this particular reconciliation service, you’ll need to add the details of the service now
2. Click `Add Standard Service…` and in the dialogue that appears enter:
<http://refine.codefork.com/reconcile/viaf> for Jeff’s public service
<http://localhost:8080/reconcile/viaf> if you are running the service locally
* You should now see a heading in the list on the left hand side of the Reconciliation dialogue called `VIAF Reconciliation Service`
3. Click on this to choose to use this reconciliation service
* In the middle box in the reconciliation dialogue you may get asked what type of ‘entity’ you want to reconcile to - that is, what type of thing are you looking for. The list will vary depending on what reconciliation service you are using.
4. In this case choose `Corporate Name` (it seems like the VIAF Reconciliation Service is slightly intelligent about this and will only offer options that are relevant)
5. In the box on the righthand side of the reconciliation dialogue you can choose if other columns are used to help the reconciliation service make a match - however it is sometimes hard to tell what use (if any) the reconciliation service makes of these additional columns
6. At the bottom of the reconciliation dialogue there is the option to `Auto-match candidates with high confidence`. This can be a time saver, but in this case you are going to uncheck it, so you can see the results before a match is made
7. Now click `Start Reconciling`
8. Reconciliation is an operation that can take a little time if you have many values to look up. However, in this case there are only 6 publishers to check, so it should work quite quickly.
* Once the reconciliation has completed two Facets should be created automatically:
Publisher: Judgment
Publisher: best candidate’s score
* These are two of several specific reconciliation facets and actions that you can get from the `Reconcile` menu (from the column drop down menu).
9. Close the `Publisher: best candidate’s score` facet, but leave the `Publisher: Judgement` facet open
1. If you look at the Publisher column, you should see some cells have found one or more matches - the potential matches are show in a list in each cell. Next to each potential match there is a ‘tick’ and a ‘double tick’. To accept a reconciliation match you can use the ‘tick’ options in cells. The ‘tick’ accepts the match for the single cell, the ‘double tick’ accepts the match for all identical cells.
1. Create a text facet on the Publisher column
1. Choose ‘International Union of Crystallography’
* In the Publisher column you should be able to see the various potential matches. Clicking on a match will take you to the VIAF page for that entity.
1. Click a ‘double tick’ in one of the Publisher column cells for the option “International Union of Crystallography”
1. This will accept this as a match for all cells - you should see the other options all disappear
1. Check the `Publisher: Judgement` facet. This should now show that 858 items are ‘matched’ (if this does not update, try refreshing the facets)
* We could do these one by one, but if we are confident with matches, there is an option to accept all:
1. Remove all filters/facets from the project so all rows display
1. In the Publisher column use the dropdown menu to choose `Reconcile->Actions->Match each cell to its best candidate`
1. There are two things that reconciliation can do for you. Firstly it gets a standard form of the name or label for the entity. Secondly it gets an ID for the entity - in this case a VIAF id. This is hidden in the default view, but can be extracted:
1. In the Publisher column use the dropdown menu to choose `Edit column->Add column based on this column…`
1. Give the column the name `VIAF ID`
1. In the GREL expression box type `cell.recon.match.id`
* This will create a new column that contains the VIAF ID for the matched entity