
2022 UC Carpentries Fall Workshop (OpenRefine)

Workshop Details
Dates: September 6th - 13th, 2022
Time: 9am - 12pm

Workshop Agenda:
https://ucsdlib.github.io/2022-09-06-carpentries-uc/

Day 6: OpenRefine

Software Installation:
http://openrefine.org/download.html

  • Download and install the latest version of OpenRefine
  • Windows: use the Windows kit (zip file)
  • macOS: use the Mac kit

Lesson Data (download)
https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv

  • Right-click the link and choose “Save as” to save the .csv file to your desktop

NOTES:

Introduction to OpenRefine
Intro Slides: https://docs.google.com/presentation/d/1XaF9x9243BOSktfMS8YMhiPohe525HWplhr7FG7PXjw/edit?usp=sharing

Extensions and Distributions can be found here: http://openrefine.org/download.html

Importing data into OpenRefine

Layout of OpenRefine, Rows vs Records

Faceting and filtering
Scatterplot facets are less commonly used. For further information on these see the tutorial at https://web.archive.org/web/20190105063215/http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Exploring_the_data_with_scatter_plots.

Clustering
For more information on the methods used to create Clusters, see https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
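To make the idea of key-collision clustering concrete, here is a minimal Python sketch of a fingerprint-style key. It only approximates the kind of normalisation described in the Clustering-In-Depth page and is not OpenRefine's actual implementation; the function names `fingerprint` and `cluster` are made up for this illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build an approximate fingerprint key (illustrative, not OpenRefine's code)."""
    # Trim surrounding whitespace and lowercase the text.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep only groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


if __name__ == "__main__":
    print(cluster(["Cruise, Tom", "Tom Cruise", "tom  cruise", "Nicole Kidman"]))
```

Values that differ only in case, punctuation, spacing, or word order end up with the same key, which is why they are proposed as one cluster.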

Working with columns and sorting

Introduction to Transformations
Full documentation for GREL (General Refine Expression Language) is available at https://docs.openrefine.org/manual/grelfunctions.

Writing Transformations

Transformations - Undo and Redo

Transforming Strings, Numbers, Dates and Booleans

Transformations - Handling Arrays

Exporting data

Looking Up Data
CrossRef API: https://github.com/CrossRef/rest-api-doc
Read more about the CrossRef service: http://www.crossref.org.

OpenRefine has a function for extracting data from JSON (sometimes referred to as ‘parsing’ the JSON). The ‘parseJson’ function is explained in more detail at https://docs.openrefine.org/manual/grelfunctions/#format-based-functions-json-html-xml.
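As a rough illustration of what "parsing" JSON means, the Python sketch below pulls one field out of a JSON string, much as a GREL expression built on ‘parseJson’ would. The sample response is hypothetical (made-up title, publisher and ISSN) and only assumes a CrossRef-like shape with a top-level `message` object; `journal_title` is a name invented for this example.

```python
import json

# Hypothetical response text; a real CrossRef reply contains many more fields.
sample_response = """
{
  "status": "ok",
  "message": {
    "title": "Example Journal of Hypothetical Studies",
    "publisher": "Example Press",
    "ISSN": ["1234-5678"]
  }
}
"""


def journal_title(raw_json: str):
    """Return message.title from a JSON string, or None if the field is missing."""
    data = json.loads(raw_json)  # turn the text into nested dicts and lists
    return data.get("message", {}).get("title")


if __name__ == "__main__":
    print(journal_title(sample_response))
```

Using `.get` with a default avoids an error when a lookup returns a response without the expected field, which can happen for values that the service does not recognise.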

The official User Manual provides detailed information about the reconciliation feature: https://docs.openrefine.org/manual/reconciling

One of the most common ways to use reconciliation in OpenRefine is through an extension that can reconcile against linked data sources (see below for more on OpenRefine extensions). The RDF extension by Stuart Kenny can be downloaded from https://github.com/stkenny/grefine-rdf-extension/releases.

Other extensions are available to do reconciliation against local data such as csv files (see http://okfnlabs.org/reconcile-csv/) and maintained lists of values (see http://okfnlabs.org/projects/nomenklatura/index.html).

For more information on using Reconciliation services see https://docs.openrefine.org/manual/reconciling.

A list of Extensions (not necessarily complete) is given on the OpenRefine downloads page at http://openrefine.org/download.html.

Workshop Day 6: First name and Last Name/Organization/Dept./Email

| Name (first & last) | Organization | Dept. | Email |
| --- | --- | --- | --- |
| (example) Jane Doe | UCSD | IT | jdoe1@ucsd.edu |
| Roberto Silva | UCSD | SIO | Rosilva@ucsd.edu |
| | UCSD | SIO | kmc001@ucsd.edu |
| Tom Le | UCM | | tle267@ucmerced.edu |
| Nicole Rosenberg | UCSD | Scripps | nrosenberg@ucsd.edu |
| Aleks Leszczynska | UCSD | | aleszczynska@ucsd.edu |
| Edgar Reyna | UCLA | Urban Planning | eareyna@ucla.edu |
| Oishee Misra | UCSD | Economics | omisra@ucsd.edu |
| Marta Sala Climent | UCSD | Medicine | msalacliment@health.ucsd.edu |
| Charles Faulhaber | UCB | Spanish & Portuguese | cbf@berkeley.edu |
| Becky Miller | UCB | Library | rcmiller@berkeley.edu |
| Apisit Kaewsanit | UCSF | Epidemiology and Biostats | apisit.kaewsanit@ucsf.edu |
| KYLE ROKES | UCSB | | kyle_rokes@ucsb.edu |
| Dayana Elizalde | UCR | | deliz002@ucr.edu |
| Alissa Jae Lazo-Kim | UCSD | DBMI internship | alazokim@health.ucsd.edu |
| Shang Su | U Toledo | Cell and Cancer Biology | shang.su@utoledo.edu |
| Jessica Wu-Woods | UCR | Microbiology | jwuw001@ucr.edu |

Day 6 Exercises

  1. Splitting Subjects into separate cells
    What separator character is used in the Subjects cells?
    How would you split these subject words into individual cells?

  2. Joining the Subjects column back together
    Using what we’ve learned, now join the Subjects back together.

  3. Which licences are used for articles in this file?
    Use a text facet for the licence column and answer these questions:
    What is the most common Licence in the file?
    How many articles in the file don’t have a licence assigned?

  4. Find all publications without a DOI
    Use the Facet by blank function to find all publications in this data set without a DOI

  5. Correct the Language values via a facet
    Create a Text facet on the language column and correct the variation in the EN and English values.
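For anyone who wants to see the logic behind these exercises outside OpenRefine, the Python sketch below mimics splitting and joining multi-valued cells and counting values the way a text facet does. It is only an analogy under stated assumptions: the separator is assumed to be a pipe character, the sample values are invented rather than taken from the lesson file, and all function names are hypothetical.

```python
from collections import Counter

SEPARATOR = "|"  # assumption: check the Subjects cells to confirm the real separator


def split_cell(cell: str, sep: str = SEPARATOR):
    """Split one multi-valued cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def join_values(values, sep: str = SEPARATOR) -> str:
    """Join a list of values back into a single cell."""
    return sep.join(values)


def text_facet(values):
    """Count each distinct value, grouping empty or missing cells as "(blank)"."""
    return Counter(v if v and v.strip() else "(blank)" for v in values)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    subjects = split_cell("Medicine | Public Health|Epidemiology")
    print(subjects)
    print(join_values(subjects))

    licences = ["CC BY", "CC BY", "", "CC BY-NC", None, "CC BY"]
    facet = text_facet(licences)
    print(facet.most_common(1))  # most frequent value and its count
    print(facet["(blank)"])      # number of cells with no value
```

The same counting idea covers the licence, DOI and language exercises: a facet is a tally of distinct values, and editing a facet value rewrites every cell that holds it.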

Day 6 Reflection

Please enter how you will use OpenRefine in your work or research here:

  1. To organize metadata
  2. Clean up data
  3. To reconcile PhiloBiblon project data with VIAF and Wikidata

Day 6 Questions:

Please enter any questions not answered during live session here:
1.

OpenRefine Resources:

Official wiki List of OpenRefine External Resources: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
Getting started with OpenRefine by Thomas Padilla: http://thomaspadilla.org/dataprep/
Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh and Max De Wilde: http://programminghistorian.org/en/lessons/cleaning-data-with-openrefine
Blog posts on using OpenRefine from Owen Stephens: http://www.meanboyfriend.com/overdue_ideas/tag/openrefine/?orderby=date&order=ASC
Identifying potential headings for Authority work using III Sierra, MS Excel and OpenRefine: https://epublications.marquette.edu/lib_fac/81/
Free your metadata website: https://freeyourmetadata.org/
Data Munging Tools in Preparation for RDF: Catmandu and LODRefine by Christina Harlow: https://journal.code4lib.org/articles/11013
Cleaning Data with OpenRefine by John Little: https://libjohn.github.io/openrefine/
OpenRefine Blog: https://openrefine.org/category/blog.html

End Day 6

Feedback form:
https://forms.gle/5qgx8X6H3GRMacwD6