Open Refine Exercises

--- title: Open Refine Exercises spellcheck: true authors: ["Belinda Weaver", "Tim Dennis", "Library Carpentry community"] tags: openrefine, data, munging --- ## Library Carpentry - Open Refine Exercises ### Exercise 1: Download data: * Go to <https://raw.githubusercontent.com/data-lessons/library-openrefine/gh-pages/data/doaj-article-sample.csv>. Right click anywhere on the screen and choose `Save As …` The filename is `doaj-article-sample.csv` and it should be saved to Desktop. ### Exercise 2: Create project 1. Open OpenRefine. 2. Click `Create project` then `Choose File`. Select `doaj-article-sample.csv` and then click Next. In the Parse text section, ensure separators are commas and first line is being parsed as a header. Then click `Create project`. ### Exercise 2 (Alternate): Create project from web address 1. Open OpenRefine: typically located here: <http://127.0.0.1:3333> 2. Click `Create Project` and then `Web Addresses (URLs)` 3. Paste in <https://raw.githubusercontent.com/data-lessons/library-openrefine/gh-pages/data/doaj-article-sample.csv>. In the Parse text section, ensure separators are commas and first line is being parsed as a header. Then click `Create project`. ### Split author names into separate cells 1. To work with the author names effectively we need to split them into separate cells: 2. Click the dropdown menu at the top of the `Author` column 3. Choose `Edit cells ->Split multi-valued cells` 4. In the prompt type the “:” symbol and click `OK` 5. Note that the rows are still numbered sequentially 6. Click the `Records` option to change to `Records` mode and then go back to `Rows` 7. Note how the numbering changes as you toggle - indicating that several rows are related to the same record. * A common workflow with multi-valued cells is split multi-valued cells into individual cells (what we did above), * then modify/refine/clean individual cells * join multi-valued cells back together We'll cover modifying cells will be covered in future episode, but for now we will cover how to join cells back together that have been split previously. ### Join authors back together * Click the dropdown menu at the top of the Author column * Choose `Edit cells->Join multi-valued cells` * In the prompt type the ( | ) symbol * Here we are specifying the delimiter character for OpenRefine to use to join the values together. * Click `OK` to join the Authors cells back together You will now see that split rows have gone away - the Authors have been joined into a single cell with the specified delimiter. Our Rows and Records values will now be the same since we do not have any more split columns. Click both the Rows and Records options and observe how the numbers of Rows and Records are equal ### Class exercise: <https://librarycarpentry.github.io/lc-open-refine/03-working-with-data/index.html#splitting-subjects-into-separate-cells> ## Faceting * ‘Facet’ groups all the values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time. * text facet is simplest * The facet consists of a list of values used in the data. You can filter the data displayed by clicking on one of these headings. * We can `include` facet values or `exclude` and `invert` * Let's look at this in action ### Publisher - faceting demo 1. Click on the drop down menu at the top of the `publisher` column and choose `Facet > Text Facet`. The facet will then appear in the left hand panel 2. Filter by facet by clicking or include 3. Select multiple values to include 4. You can `invert` your selections to exclude 5. Include a value and then look at top to `invert inclusion`. ### Class Exercise: Which licences are used for articles in this file? 1. Create a `Facet -> Text facet` for the `Licence` column 2. What is the most common Licence in the file? 3. How many articles in the file don’t have a licence assigned? ### Class Exercise: Find all publications without a DOI * Use the `Facet -> Facet by blank` function to find all publications in this data set without a DOI ### Exercise 5: Correct the Language values via a facet 1. Create a `Facet -> Text facet` on the Language column * Notice that there is both ‘EN’ and ‘English’ 2. Put the mouse over the `English` value 3. Click `Edit` 4. Type `EN` and click `Apply` ### Exercise 6: Replacing text 1. People often want to replace text. This is done via a GREL `value.replace` transform. 2. Click the dropdown menu at the top of the `Author` column 3. Choose `Edit cells -> Transform`. 4. In the Expression box, type `value.replace("|", ":")` 5. You will see how the change will look in the box below. If you are happy, click OK. 6. This will replace the pipe separator with a colon. Use the same syntax (e.g. `value.replace()` to replace words, punctuation and so on. ### Exercise 8: Use Clustering to clean up author data 1. Choose `Edit cells -> Cluster and edit...` from the `Author` column (which should already be split into individual values from the last exercise) 1. Using the `key collision` method with the `fingerprint` Keying Function work through the clusters of values, merging them to a single value where appropriate 1. Try changing the clustering method being used - which ones work well? ## Working with columns ### Reorder columns * You can re-order the columns by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing `Edit columns->Re-order / remove columns` ### Sort * sort data in OpenRefine by clicking on the drop-down menu for the column you want to sort on, and choosing Sort. * Once you have sorted the data, a new Sort drop-down menu will be displayed. * Unlike in Excel, ‘Sorts’ in OpenRefine are temporary - that is, if you remove the Sort, the data will go back to its original ‘unordered’ state. The ‘Sort’ drop-down menu lets you amend the existing sort (e.g., reverse the sort order), remove existing sorts, and/or make sorts permanent. ## Introduction to Transformations in OpenRefine - **Purpose of Transformations:** - Used when facets, filters, and clusters can't standardize or modify data. - **Examples of Data Transformations:** - **Splitting Columns:** Break data in one column into multiple (e.g., split an address into street, city, state). - **Standardizing Formats:** Format data without altering its values (e.g., remove punctuation or standardize date formats). - **Extracting Data:** Pull specific information from a text string (e.g., extract an ISBN from a citation). - **Transformations in OpenRefine:** - Written in **GREL** (General Refine Expression Language). - Similar to Excel formulas, but mainly for text manipulation rather than numerical operations. - GREL documentation link for reference: [GREL Documentation](https://docs.openrefine.org/manual/grelfunctions). | Common Transformation | Action | GREL Expression | |--------------------------------------------|--------------------------------------------------------------------------------------------------------|-----------------------| | Trim leading and trailing whitespace | Removes any ‘whitespace’ characters (e.g. spaces, tabs) from the start and end of the current value | `value.trim()` | | To titlecase | Converts the current value to titlecase (i.e. each word starts with an uppercase character) | `value.toTitlecase()` | | To uppercase | Converts the current value to uppercase | `value.toUppercase()` | | To lowercase | Converts the current value to lowercase | `value.toLowercase()` | ## Correct Publisher data 1. Create a `text facet` on the Publisher column 2. Note that in the values there are two that look almost identical - why do these two values appear separately rather than as a single value? if you know share in chat. 3. On the publisher column use the dropdown menu to select `Edit cells->Common transforms->Collapse consecutive whitespace` 4. Look at the publisher facet now - has it changed? (if it hasn't changed try clicking the `Refresh` option to make sure it updates) ## Writing transformations - **Where to Write GREL Expressions:** - Select the column to transform. - Choose `Edit cells → Transform…`. - A new screen opens with an 'Expression' box for entering your transformation. - **Preview:** The effect of the transformation on the first 10 rows of your data is shown immediately. - **Default Expression:** - The default value is `value`, which means no change will occur. This represents the current value in the selected column. - **GREL Syntax:** - Two types of syntax can be used interchangeably: 1. `value.function(options)` 2. `function(value, options)` - Both work the same way; it’s a matter of personal preference. - **Additional Preview Options:** - **History:** View previous transformations and reuse or star them. - **Starred:** Access starred transformations for quick reference. - **Help:** View a list of all GREL functions with a brief explanation. ### Example Transformation: Title Case - **Goal:** Use facets and GREL to convert text in the 'Title' column to Title Case. --- ## Put titles into Title Case Use Facets and the GREL expression `value.toTitlecase()` to put the titles in Title Case 1. Facet by publisher 2. Select "Akshantala Enterprises" and "Society of Pharmaceutical Technocrats" 3. To select multiple values in the facet use the `include` link that appears to the right of the facet 4. See that the Titles for these are all in uppercase 5. Click the dropdown menu on the Title column 6. Choose `Edit cells->Transform...` 7. In the Expression box type `value.toTitlecase()` 8. In the Preview pane under value.toTitlecase() you can see what the effect of running this will be 9. Click `OK` 10. Find examples of titles that are still not correct, or have been incorrectly cased (abbreviations, species names, etc.) ### keypoints - You can alter data in OpenRefine based on specific instructions - You can preview the results of your GREL expression ## Undo/Redo - **Undo/Redo Features:** - OpenRefine allows you to **undo** or **redo** any steps you've taken in the data cleaning process. - These features give you flexibility to experiment with transformations and easily reverse changes if necessary. - **Accessing Undo/Redo:** - The Undo/Redo options are located in the **left-hand panel**. - The panel shows a list of all steps taken so far. - **How Undo Works:** - Click on the last step you want to preserve, and all steps after that will be undone automatically. - Greyed-out steps in the list can be reapplied by clicking on the last step you want to restore. - **Important Considerations:** - If you undo steps and then perform new transformations, the greyed-out steps will disappear, and you will no longer be able to redo them. - **Saving and Reapplying Steps:** - Use the **Extract** button to save steps as a JSON file, which can be shared or reused in different projects. - To reapply steps, use the **Apply** button and paste the JSON file into the project. - In OpenRefine version 3.6.0 or later, you can also directly save the JSON via the **Export** button. - **Undo/Redo Data is Saved:** - Undo/Redo history is saved automatically with the project, so you can pick up where you left off when reopening the project. ### Key Points: - Undo/Redo helps retrace your steps and undo changes. - You can save and reapply steps using the **Extract** and **Apply** features. ## Transforming Strings, Numbers, Dates, and Booleans ### Data Types: - **Importance of Understanding Data Types:** - Knowing data types helps in writing effective transformations using GREL. - **Data Types in OpenRefine:** - **String:** A piece of text (most common type). - **Number:** Numeric data. - **Date:** Date data, necessary for date-related functions. - **Boolean:** Binary values (true/false). - **Array:** Covered in the next lesson. --- ### Dates and Numbers: - **String Representation:** - Often, dates and numbers may be treated as strings in OpenRefine. - **Need for Conversion:** - Some operations (e.g., sorting) require converting these values to their appropriate types (Date or Number). --- ### Reformatting Dates: 1. **Prepare the Data:** - Remove all facets and filters. 2. **Transform the Date Column:** - Select `Edit cells → Transform`. - Enter the expression: `value.toDate("dd/MM/yyyy")`. 3. **Check the Output:** - Values should display in green, adhering to ISO 8601 format. 4. **Create a New Formatted Column:** - Select `Edit column → Add column based on this column`. - Name the new column "Formatted-Date". - Enter the expression: `value.toString("dd MMMM yyyy")`. --- ### Specifying Date Formatting in GREL Expressions: - **Using Pattern Strings:** - GREL allows specifying date and time formats through case-sensitive pattern strings. #### Date and Time Pattern Representation: | Letter | Date or Time Representation | |--------|-----------------------------| | `y` | Year | | `M` | Month in year | | `D` | Day in year | | `d` | Day in month | | `F` | Day of week in month | | `E` | Day name in week | | `u` | Day number of week | | `a` | AM/PM marker | #### Examples of Date Patterns: | Date and Time Pattern Input | Output | |------------------------------|---------------------------| | `"yyyy-MM-dd"` | 2022-06-05 | | `"dd MMM yyyy"` | 05 Jun 2022 | | `"EEE, MMM d, ''yy"` | Mon, Jun 5, '22 | | `"yyyy.MMMM.dd hh:mm a"` | 2022.June.05 12:10 PM | | `"EEE, d MMM yyyy HH:mm:ss"` | Mon, 5 Jun 2022 12:10:10 | --- ### Booleans: - **Definition:** - A Boolean is a binary value: **true** or **false**. ### Exercise: Find Reversed Author Names 1. **Split Author Names:** - Ensure author names are split into individual cells using `Edit cells → Split multi-valued cells`. 2. **Create a Custom Text Facet:** - Select `Facet → Custom text facet...`. 3. **Write GREL for Faceting:** - In the Expression box, type: `value.contains(",")`. - Click `OK`. 4. **Review Results:** - The facet should show 'true' for values containing a comma (indicating reversed names) and 'false' for others. ### Key Points: - You can alter data in OpenRefine based on specific instructions. - Expand data editing functions by building your own - transformations. --- ## Transformations - Handling Arrays ### Preview: - Example: Transforming a list with duplicated values into a unique list using arrays. --- ### Using Arrays in Transformations: - **Function:** `uniques()` removes duplicates from an array. - **Example Subject Column:** - Original: `crystal structure|clozapinium|crystal structure|molecular configuration|hydrogen bonding|supramolecular assembly|Chemistry|QD1-999` - Goal: Remove duplicates. #### GREL Transformation Steps: 1. **Split the String into an Array:** - `value.split("|")` → `["crystal structure", "clozapinium", "crystal structure", ...]` 2. **Remove Duplicates:** - `uniques()` → `["crystal structure", "clozapinium", "molecular configuration", ...]` 3. **Join the Array Back to a String:** - `join("|")` → `crystal structure|clozapinium|molecular configuration|...` --- ### Understanding Arrays: - **Definition:** - An Array is a data type that contains a list of values. - **Example Arrays:** - Strings: `["crystal structure", "clozapinium", ...]` - Numbers: `[1, 2, 3, 4]` - ![list](https://hackmd.io/_uploads/SJPR-eiaR.png) - **Key Operations:** - Arrays can be sorted, de-duplicated, and manipulated in GREL. - Cannot be stored directly in OpenRefine cells. --- ### Recap on Best Practices for Separators: - Avoid using separator characters already present in the text. - Discuss implications of choosing inappropriate separators. --- ### Exercise: Reverse Author Names 1. **Prerequisite:** Ensure author names are split into individual cells using `Edit cells → Split multi-valued cells`. 2. **Create Custom Text Facet:** - On the Authors column, select `Facet → Custom text facet...`. - In the Expression box, type: `value.contains(",")`. 3. **Review the Facet:** - Click `OK`. You should see 'true' for names with a comma. 4. **Narrow Down Results:** - Select 'true' to focus on reversed author names. 5. **Transform Author Names:** - On the Authors column, select `Edit cells → Transform`. - In the Expression box, type: `value.split(", ")`. 6. **Reverse and Join:** - Update the expression to: `value.split(", ").reverse().join(" ")`. 7. **Preview Changes:** - Verify that the transformation reverses the names to personal name first order. 8. **Apply Changes:** - Click `OK`. --- ### Key Points: - Arrays cannot appear directly in an OpenRefine cell. - Arrays can be used in many ways with GREL expressions. ## Exporting Data ### Objectives: - Explain how to export data in different formats from OpenRefine. ### Key Questions: - How do I export data from OpenRefine? --- ### Note about OpenRefine: - Edits made in OpenRefine are stored within the program and do not affect your original file. - OpenRefine uses "import" and "export" to describe moving data in and out of the interface. - To save your work in a viewable format, you need to export your data. --- ### Exporting Data: - After working with a dataset, you may wish to export it. - Access the export options via the **Export** button at the top right of the OpenRefine interface. #### Available Export Formats: - **HTML** - **Excel** - **CSV (Comma-Separated Values)** - **TSV (Tab-Separated Values)** - **Custom Exports:** - Export specific fields. - Add headers or footers. - Specify the format exactly. --- ### Exporting a Portion of Your Data: - Use facets or filters to select specific rows to export. - The resulting file will only include the selected rows. #### Important Note: - Be cautious not to accidentally export only a portion of your data. - Check the top left of the interface to ensure all rows are being displayed for a full export. --- ### Key Points: - You can export your data in a variety of formats. ## Looking Up Data ### Looking Up Data from a URL: - OpenRefine can retrieve data from URLs for additional information. - Example: Look up names against the **Virtual International Authority File (VIAF)** to retrieve details like birth/death dates and identifiers. #### Two-Step Process: 1. **Retrieve Data:** Use the dropdown menu on a column to select `Edit column → Add column by fetching URLs`. - Enter a GREL expression to create a URL based on existing data. 2. **Extract Relevant Information:** Use OpenRefine functions like: - `parseHtml()` - `parseJson()` --- We are going to query the CrossRef API https://www.crossref.org/ >CrossRef is a nonprofit organization that assigns Digital Object Identifiers (DOIs) to scholarly content, ensuring permanent access and discoverability. It helps publishers, institutions, and researchers by linking citations and maintaining up-to-date content metadata across the academic ecosystem. It has an API: https://api.crossref.org/journals/ and we can query it with our issn (a unique number for journals). ### Challenge: Retrieving Journal Details from CrossRef via ISSN 1. **Select a Single Row:** - Click the star icon for the row containing an ISSN. - Facet by Star to choose the row. 2. **Fetch URLs:** - In the ISSN column, select `Edit column → Add column by fetching URLs`. - Name the new column (e.g., "Journal_details"). - Write GREL to create the URL: ```grel "https://api.crossref.org/journals/"+value ``` - Click 'OK' and wait for data to be fetched. 3. **Parse JSON Data:** - In the new column, select `Edit column → Add column based on this column`. - Name the new column (e.g., "Journal_title"). - Enter the expression: ```grel value.parseJson().message.title ``` - Click 'OK' to extract the journal title. --- ### Reconciliation Services: - Reconciliation services allow you to look up terms in your data against external services. - This method is often quicker and more sophisticated than fetching data from URLs. - To use the Reconciliation function, the external resource must support it. #### Example: Using VIAF Virtual International Authority File Reconciliation Service 1. **Start Reconciliation:** - In the Publisher column, select `Reconcile → Start Reconciling`. - Add the service details: - Public service: `https://refine.codefork.com/reconcile/viaf` - Local service: `http://localhost:8080/reconcile/viaf` 2. **Select Entity Type:** - Choose "Corporate Name" for the entity type. 3. **Match Candidates:** - Uncheck "Auto-match candidates with high confidence". (This can be a time saver, but in this case you are going to uncheck it, so you can see the results before a match is made) - Click 'Start Reconciling'. - Review matches in the Publisher column. 4. **Accept Matches:** - Use the 'tick' or 'double tick' options to accept matches. - Use `Reconcile → Actions → Match each cell to its best candidate` for bulk matching. 5. **Extract VIAF ID:** - In the Publisher column, select `Edit column → Add column based on this column`. - Name the new column "VIAF-ID". - Use the expression: ```grel cell.recon.match.id ``` --- ### Using the ‘cross' Function: - The `cross` function allows you to look up data in other OpenRefine projects on the same computer. - It returns an array of matching rows, enabling data comparisons between projects. --- ### Extensions: - OpenRefine functionality can be enhanced by installing extensions. - A list of available extensions can be found at [OpenRefine Extensions](http://openrefine.org/extensions). --- ### Key Points: - OpenRefine can look up custom URLs to fetch data based on project content. - API calls can be custom-built or utilize existing Reconciliation services. - OpenRefine can be further enhanced by installing extensions.