changed 6 years ago
Linked with GitHub

2018-UCSD-Library-Carpentry

These notes have been locked by the instructors.

September 14, 2018
9am - 4pm

UC San Diego Biomedical Library Building, Classroom 4


Sign In: Name, Program/Dept, Affiliation(academic, staff, student)
Reid Otsuji, RDCP/DLDP, Librarian
Ryan Johnson, Metadata Services/RDCP, Librarian
Mary Linn Bergstrom, ALP, RDCP, Librarian
Amanda Roth, LSV, Librarian
Amber Gallant, CARS
Judea d'Arnaud, CARS, staffhack
Satomi Saito, CARS, staff
Stephen Cruz, SLA-BLB, Staff
Bie-hwa Ma, SCP
Adriana Moran, CARS
SuHui Ho, DUS
Ruihua Zhang, MS
Renee Chin, MS
Jinhong Qi, MS
Erin Glass, RAS, Librarian
Ho Jung Yoo, RDCP, staff


Start your notes here:

More about Library Carpentry: https://librarycarpentry.org/

Markdown intro

bold
italic
hyperlinks
A line break is two spaces

  • Asterisk or plus for bullets
  1. Numbered lists
    code or 'monospace'
import pandas as pd

# code blocks with syntax highlighting

Data Intro for Librarians

Type your notes here:

Jargon Busting

  1. Pair with a neighbor and decide who will take notes.
  2. Talk for 3 minutes (your instructor will be timing you!) on any terms, phrases, or ideas around code or software development in libraries that you’ve come across and perhaps feel you should know better.

hackmd
shell
UNIX
confluence
Pythontake workshop, get a book on it, ask Ryan
data mining
data curation
regular expressionpractice, trial and error
data scraping
share point
R
machine language
API
data flow
DAMS
Chronopolis - take a workshop
data visualization
Chronopolis
visual basic
GIS formatsTalk to GIS librarian
java - W3Schools for programming langauges in order to learn how to program

Pick problematic 3 terms and think about how or where you might learn more about them

w3school - or for some conceptual questions, talk with an expert
Online courses like DataCamp
Regular expressions - can be (programming) language agnostic

Intro to Unix Shell - BASH

online help www.explainshell.com - use to look up shell command help information

Download data here:
https://librarycarpentry.github.io/lc-shell/setup.html

Ryan's BASH script to update an Ubuntu operating system: link


Shell Cheat sheet commands

Shell: Basics
pwd - print working directory

ls - list contents of a directory

ls -l - list file information
ls -lh - list human readable file information
ls -F - list files and directories (directories will have a trailing /)
ls -a - list all files, including hidden files
ls *.txt - list all files that end with .txt
cd change directory
.. - move up one level in directory
cd pathname - takes you to the directory specified by pathname
cd or cd ~ - takes you to your home directory
&& - separates two commands sequentially

Shell: Interacting with Files
mkdir make a directory

cat print to shell or send file or files to output

head output first 10 lines of a file or files

tail output last 10 lines of a file or files

mv rename or move a file or files. Syntax for renaming a file: mv FILENAME NEWFILENAME

cp make a backup copy of a file or files. Syntax: cp FILENAME NEWFILENAME

redirect output. Syntax with cat: cat FILENAME1 FILENAME2 > NEWFILENAME

redirect output by appending to the filename specified. Syntax with cat: cat FILENAME1 FILENAME2 >> NEWFILENAME

rm remove a file or files. NB: USE WITH EXTREME CAUTION!!!

rmdir -r will delete a directory, even if it is not empty.

rmdir -r-i will delete a directory, even if it is not empty, but will ask you to confirm each deletion.

Shell: Wildcards
? a placeholder for one character or number

* a placeholder for zero or more characters or numbers

[] defines a class of characters

Examples

foobar?: matches 7-character strings starting with foobar and ending with one character or number
foobar*: matches strings that start with foobar and end with zero or more other characters or numbers
foobar*txt: matches strings that start with foobar and end with txt
[1-9]foobar?: matches 8-character strings that start that start with a number, have foobar after the number, and end with any character or number.

Shell: Counting and Mining
wc word count

-w: count words
-l: count lines
-c: count characters
sort sort input

grep global regular expression print

-c: displays counts of matches for each file
-i: match with case insensitivity
-w: match whole words
-v: exclude match
--file=FILENAME.txt: use the file FILENAME.txt as the source of strings used in query
|: (vertical bar character) send output from one command into another command

Shell manual/help
On Mac: man
On PC: help


Tidy Data for Librarians

Lesson: https://librarycarpentry.github.io/lc-spreadsheets/

Group Discussion

  • How many people have used spreadsheets in their work?
  • What kind of operations do you do in spreadsheets?
  • Which ones do you think spreadsheets are good for?
  • What are some things that you’ve been frustrated that you can’t do easily?
  • What have you accidentally done in a spreadsheet?

Problems with Spreadsheets

Real world example: Borwein, Jonathan, and David H. Bailey. “The Reinhart-Rogoff Error – or How Not to Excel at Economics.” The Conversation, The Conversation, 22 Apr. 2013, http://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646.

Tidy data spreadsheets

What do you do with spreadsheets?
Data entry
Organizing data
Subsetting and sorting data
Statistics
Plotting

Types of spreadsheet software
Excel
Googlesheets
LibreOffice
Gnumeric
OpenOffice.org

Problems with spreadsheets:
we push them to the limits
lots -
too many sheets
null vaules
selecting data and deleting it
Excel has it's own mind - transforms data or numeric values

Good practices:
well formatted data
data organization
create new file or tab (create a new sheet for notes)

KEEP YOU RAW DATA RAW!

track your steps! Spreadsheet functions are not reproducible.

Keep variables columns heading (heading column is the variable)
observations (values) in row

columns = variables/Rows - Observations - values

date handling in spreadsheets

dates infomration stored in Excel stores dates as a number

e.g. jan 1, 1900 is number 1

excel does a mathmatical calcualtion to figure out the dates

Good practice:

Split out dates into seperate columns in excel

Remember, dates are a big iusse in Excel:

be aware a out how your are formatting dates in the spreadsheet
Have a plan to format dates properly

remember setup the spreadsheet to minimize errors as much as possible

Use readme files to document work you have done.

Sorting


Exercise 1: Messy library training data

  1. Download the data
    https://librarycarpentry.github.io/lc-spreadsheets/data/training_attendance.xlsx
  2. Open up the data in Excel or another spreadsheet program
  • There are three tabs.
  • Various people have recorded training attendance statistics over 2016 and 2017, and they have kept track of the data in their own way.
  • Now you’re being asked to evaluate the training program and you want to be able to start doing statistics with the data.
  1. With the person next to you, reorganize the messy data so that a computer will be able to understand it (10 minutes)
  2. Clean up the 2016 and 2017 tabs, and put them all together in one spreadsheet. Start by choosing your columns.
  • Do not forget of our first piece of advice: create a new file (or tab) for the cleaned data, never modify the original (raw) data.

Exercise 1 responses:

Problems with data in the spreadsheet:
different formatting
different information
wrong years
formats dates
multiple information in the wrong column
multiple tables on one sheet

what did you do to reformat data?
create new sheet setting up columns (variables)
added year to dates

Discussion: Messy library training data

Exercise 2: pulling month, day and year out of dates in Excel

  • In the Dates tab of your Excel file you have some more training data from 2017. There’s a date column.
  • First, let’s insert three new columns: month, day, year
  • Now let’s extract month, day, and year with the built in Excel functions
    =MONTH(A2)
    =DAY(A2)
    =YEAR(A2)
    And then autofill down

To do:

create a new sheet
create new columns for month, day, and year
use formula above to calculate each date property

Bonus Exercise(1a) pulling date of year out of dates in Excel

  • Let’s insert one more new columns: day_of_year
  • Now let’s extract these with a more complex Excel functions
    =A2-DATE(YEAR(A2),1,0)
    And then autofill down

Exercise: Quality Assurance

We can’t have half a person attending a workshop, so let’s try this out by setting the num_registered column in our Dates spreadsheet to only allow whole numbers between 1 and 100.

  1. Select the num_registered column
  2. On the Data tab select Data Validation
  3. In the Allow box select Whole number
  4. Set the minimum and maximum values to 1 and 100.
  5. Now let’s try entering a new value in the num_registered column that isn’t a valid size, like 15.1. The spreadsheet stops us from entering the wrong value and asks us if we would like to try again.
  6. You can also customize the resulting message to be more informative by entering your own message in the Input Message tab and allow invalid data to just result in a warning by modifying the Style option on the Error Alert tab.

Quality assurance can make data entry easier as well as more robust. For example, if you use a list of options to restrict data entry, the spreadsheet will provide you with a drop-downlist of the available items. So, instead of trying to remember the initials of all your trainers, you can just select the right option from the list.

Exercise: Sorting

  • Let’s try this with the Date tab in our messy Dates spreadsheet. Go to that tab. Select Data then select Sort
  • Sort by column date in the order Smallest to Largest
  • When you do this sort, do you notice anything strange?
  • Try sorting by other columns. Anything strange there?

Exercise: Conditional formatting

  1. Let’s try this again with length of time. Highlight the len_hours column.
  2. Go to Home tab and click Conditional Formatting.
  3. We’ll do the 2-Color Scale with Lowest to Highest for the orange colors.
  4. Now we can scan through and different colors will stand out. Again, do we notice any strange values?
    It is nice to do be able to do these scans in spreadsheets, but we also can do these checks in a programming language like Python or R, or in OpenRefine or SQL

Exercise: Export to CSV

  • To save a file you have opened in Excel in *.csv format:
  • From the top menu select ‘File’ and ‘Save as’.
  • In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
  • Double check the file name and the location where you want to save it and hit ‘Save’.

Intro to OpenRefine

We will get data for this lesson here:
https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv

Start OpenRefine using Chrome!

There seem to be problems starting open refine in other web browsers

use for:
cleaning up messy data
quick summary of data

open refine can open a variety of data files.
cross platform compatability

works with google data

be aware of sorting and importing in Excel.
principles of avoid creating messy data

You can apply json scripts saved from previous OR session to re run another process.

UTF-8 encoding is the recommended default selection

OR exports as .tar file.

Dates in OR can be slpit out in to seperate columns
Month, day, year
Avtivities are useful! Please take time to check students were able to type commands before showing the result (for upgrading the instruction).

it's easier, compared with Excel, to edit the faceted data within the cluster without highlighting the specific columns and messing out the same data in other columns


Minute notes:

Please write one thing that you thought was positive during the workshop day:

The hands on activities and the command short cut notes.

The hands-on activities were well-suited to the length of time that we had and felt like we could build upon them.

Please write one thing that you thought we could improve or what you would like to learn more about:

I appreciated the number of helpers in the room! Felt very supportive. Need to take time to apply what I learned in this fast paced day.

I just want to delve into more Library Carpentry topics! I felt like it wasn't long enough. I would also have really liked real world examples of data curation and data visualization stemming from projects that librarians have done in the past, so I could really visualize the applications.

I'm not sure how I'm going to be able to keep this fresh in my mind since the applications wouldn't be something I would usd daily.

There is a lot of potential in improving work effeiencies. The morning session was helpful in understanding the fundamentals of commands. OpenRefine offers quite a bit of potential for helping our work with spreadsheets (which comprise 80% of my work). I appreciated the help with the programs/laptop.

Activities are useful! Please take time to check the majority of students are able to finish typing commands before showing the results (for upgrading instruction :))

This very helpful workshop. a lot of things to learn.

Select a repo