owned this note
owned this note
Published
Linked with GitHub
# 2018-UCSD-Library-Carpentry
# These notes have been locked by the instructors.
**September 14, 2018**
**9am - 4pm**
**UC San Diego Biomedical Library Building, Classroom 4**
---
**Sign In: Name, Program/Dept, Affiliation(academic, staff, student)**
Reid Otsuji, RDCP/DLDP, Librarian
Ryan Johnson, Metadata Services/RDCP, Librarian
Mary Linn Bergstrom, ALP, RDCP, Librarian
Amanda Roth, LSV, Librarian
Amber Gallant, CARS
Judea d'Arnaud, CARS, staffhack
Satomi Saito, CARS, staff
Stephen Cruz, SLA-BLB, Staff
Bie-hwa Ma, SCP
Adriana Moran, CARS
SuHui Ho, DUS
Ruihua Zhang, MS
Renee Chin, MS
Jinhong Qi, MS
Erin Glass, RAS, Librarian
Ho Jung Yoo, RDCP, staff
---
### Start your notes here:
More about Library Carpentry: https://librarycarpentry.org/
### Markdown intro
**bold**
_italic_
[hyperlinks](https://github.com)
A line break is two spaces
* Asterisk or plus for bullets
1. Numbered lists
`code` or 'monospace'
```python
import pandas as pd
# code blocks with syntax highlighting
```
# Data Intro for Librarians
Type your notes here:
## Jargon Busting
1. Pair with a neighbor and decide who will take notes.
2. Talk for 3 minutes (your instructor will be timing you!) on any terms, phrases, or ideas around code or software development in libraries that you’ve come across and perhaps feel you should know better.
hackmd
shell
UNIX
confluence
Python--take workshop, get a book on it, ask Ryan
data mining
data curation
regular expression--practice, trial and error
data scraping
share point
R
machine language
API
data flow
DAMS
Chronopolis - take a workshop
data visualization
Chronopolis
visual basic
GIS formats--Talk to GIS librarian
java - W3Schools for programming langauges in order to learn how to program
## **Pick problematic 3 terms and think about how or where you might learn more about them**
w3school - or for some conceptual questions, talk with an expert
Online courses like [DataCamp](https://datacamp.com)
Regular expressions - can be (programming) language agnostic
# Intro to Unix Shell - BASH
online help www.explainshell.com - use to look up shell command help information
Download data here:
https://librarycarpentry.github.io/lc-shell/setup.html
Ryan's BASH script to update an Ubuntu operating system: [link](https://github.com/remerjohnson/linux-stuff/blob/master/update)
___
### Shell Cheat sheet commands
**Shell: Basics**
`pwd` - print working directory
`ls` - list contents of a directory
`ls -l` - list file information
`ls -lh` - list human readable file information
`ls -F` - list files and directories (directories will have a trailing /)
`ls -a` - list all files, including hidden files
`ls *.txt` - list all files that end with .txt
`cd` change directory
`..` - move up one level in directory
`cd pathname` - takes you to the directory specified by pathname
`cd` or `cd ~` - takes you to your home directory
`&&` - separates two commands sequentially
**Shell: Interacting with Files**
`mkdir` make a directory
`cat` print to shell or send file or files to output
`head` output first 10 lines of a file or files
`tail` output last 10 lines of a file or files
`mv` rename or move a file or files. Syntax for renaming a file: mv FILENAME NEWFILENAME
`cp` make a backup copy of a file or files. Syntax: `cp FILENAME NEWFILENAME`
> redirect output. Syntax with cat: cat FILENAME1 FILENAME2 > NEWFILENAME
>> redirect output by appending to the filename specified. Syntax with cat: cat FILENAME1 FILENAME2 >> NEWFILENAME
`rm` remove a file or files. NB: USE WITH EXTREME CAUTION!!!
`rmdir -r` will delete a directory, even if it is not empty.
`rmdir -r-i` will delete a directory, even if it is not empty, but will ask you to confirm each deletion.
**Shell: Wildcards**
`?` a placeholder for one character or number
`*` a placeholder for zero or more characters or numbers
`[]` defines a class of characters
**Examples**
`foobar?`: matches 7-character strings starting with foobar and ending with one character or number
`foobar*`: matches strings that start with foobar and end with zero or more other characters or numbers
`foobar*txt`: matches strings that start with foobar and end with txt
`[1-9]foobar?`: matches 8-character strings that start that start with a number, have foobar after the number, and end with any character or number.
**Shell: Counting and Mining**
`wc` word count
`-w`: count words
`-l`: count lines
`-c`: count characters
`sort` sort input
`grep` global regular expression print
`-c`: displays counts of matches for each file
`-i`: match with case insensitivity
`-w`: match whole words
`-v`: exclude match
`--file=FILENAME.txt`: use the file FILENAME.txt as the source of strings used in query
`|`: (vertical bar character) send output from one command into another command
**Shell manual/help**
On Mac: `man`
On PC: `help`
___
# Tidy Data for Librarians
Lesson: https://librarycarpentry.github.io/lc-spreadsheets/
## Group Discussion
* How many people have used spreadsheets in their work?
* What kind of operations do you do in spreadsheets?
* Which ones do you think spreadsheets are good for?
* What are some things that you’ve been frustrated that you can’t do easily?
* What have you accidentally done in a spreadsheet?
## Problems with Spreadsheets
Real world example: Borwein, Jonathan, and David H. Bailey. “The Reinhart-Rogoff Error – or How Not to Excel at Economics.” The Conversation, The Conversation, 22 Apr. 2013, http://theconversation.com/the-reinhart-rogoff-error-or-how-not-to-excel-at-economics-13646.
## Tidy data spreadsheets
What do you do with spreadsheets?
Data entry
Organizing data
Subsetting and sorting data
Statistics
Plotting
**Types of spreadsheet software**
Excel
Googlesheets
[LibreOffice](https://www.libreoffice.org/)
Gnumeric
OpenOffice.org
**Problems with spreadsheets:**
we push them to the limits
lots -
too many sheets
null vaules
selecting data and deleting it
Excel has it's own mind - transforms data or numeric values
**Good practices:**
well formatted data
data organization
create new file or tab (create a new sheet for notes)
#### KEEP YOU RAW DATA RAW!
track your steps! Spreadsheet functions are not reproducible.
Keep variables columns heading (heading column is the variable)
observations (values) in row
### columns = variables/Rows - Observations - values
### date handling in spreadsheets
dates infomration stored in Excel stores dates as a number
e.g. jan 1, 1900 is number 1
excel does a mathmatical calcualtion to figure out the dates
### Good practice:
Split out dates into seperate columns in excel
### Remember, dates are a big iusse in Excel:
be aware a out how your are formatting dates in the spreadsheet
Have a plan to format dates properly
remember setup the spreadsheet to minimize errors as much as possible
Use **readme** files to document work you have done.
## Sorting
---
## Exercise 1: Messy library training data
1. Download the data
https://librarycarpentry.github.io/lc-spreadsheets/data/training_attendance.xlsx
2. Open up the data in Excel or another spreadsheet program
* There are three tabs.
* Various people have recorded training attendance statistics over 2016 and 2017, and they have kept track of the data in their own way.
* Now you’re being asked to evaluate the training program and you want to be able to start doing statistics with the data.
3. With the person next to you, reorganize the messy data so that a computer will be able to understand it (10 minutes)
4. Clean up the 2016 and 2017 tabs, and put them all together in one spreadsheet. Start by choosing your columns.
* Do not forget of our first piece of advice: create a new file (or tab) for the cleaned data, never modify the original (raw) data.
## Exercise 1 responses:
**Problems with data in the spreadsheet:**
different formatting
different information
wrong years
formats dates
multiple information in the wrong column
multiple tables on one sheet
**what did you do to reformat data?**
create new sheet setting up columns (variables)
added year to dates
## Discussion: Messy library training data
* What was wrong with this data?
** Formatting Problems - Common Mistakes: https://librarycarpentry.github.io/lc-spreadsheets/02-common-mistakes/
* How did you fix it?
* An excellent reference, in particular with regard to R scripting is
Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.
## Exercise 2: pulling month, day and year out of dates in Excel
* In the Dates tab of your Excel file you have some more training data from 2017. There’s a date column.
* First, let’s insert three new columns: month, day, year
* Now let’s extract month, day, and year with the built in Excel functions
=MONTH(A2)
=DAY(A2)
=YEAR(A2)
And then autofill down
### To do:
create a new sheet
create new columns for month, day, and year
use formula above to calculate each date property
## Bonus Exercise(1a) pulling date of year out of dates in Excel
* Let’s insert one more new columns: day_of_year
* Now let’s extract these with a more complex Excel functions
=A2-DATE(YEAR(A2),1,0)
And then autofill down
## Exercise: Quality Assurance
We can’t have half a person attending a workshop, so let’s try this out by setting the num_registered column in our Dates spreadsheet to only allow whole numbers between 1 and 100.
1. Select the num_registered column
1. On the Data tab select `Data Validation`
1. In the `Allow box` select `Whole number`
1. Set the minimum and maximum values to 1 and 100.
1. Now let’s try entering a new value in the `num_registered` column that isn’t a valid size, like 15.1. The spreadsheet stops us from entering the wrong value and asks us if we would like to try again.
1. You can also customize the resulting message to be more informative by entering your own message in the `Input Message` tab and allow invalid data to just result in a warning by modifying the Style option on the Error Alert tab.
Quality assurance can make data entry easier as well as more robust. For example, if you use a list of options to restrict data entry, the spreadsheet will provide you with a drop-downlist of the available items. So, instead of trying to remember the initials of all your trainers, you can just select the right option from the list.
## Exercise: Sorting
* Let’s try this with the Date tab in our messy Dates spreadsheet. Go to that tab. Select Data then select Sort
* Sort by column date in the order Smallest to Largest
*
* When you do this sort, do you notice anything strange?
* Try sorting by other columns. Anything strange there?
## Exercise: Conditional formatting
1. Let’s try this again with length of time. Highlight the len_hours column.
1. Go to Home tab and click Conditional Formatting.
1. We’ll do the 2-Color Scale with Lowest to Highest for the orange colors.
1. Now we can scan through and different colors will stand out. Again, do we notice any strange values?
It is nice to do be able to do these scans in spreadsheets, but we also can do these checks in a programming language like Python or R, or in OpenRefine or SQL
## Exercise: Export to CSV
* To save a file you have opened in Excel in *.csv format:
* From the top menu select ‘File’ and ‘Save as’.
* In the ‘Format’ field, from the list, select ‘Comma Separated Values’ (*.csv).
* Double check the file name and the location where you want to save it and hit ‘Save’.
---
# Intro to OpenRefine
We will get data for this lesson here:
https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv
# Start OpenRefine using Chrome!
**There seem to be problems starting open refine in other web browsers**
use for:
cleaning up messy data
quick summary of data
open refine can open a variety of data files.
cross platform compatability
works with google data
be aware of sorting and importing in Excel.
principles of avoid creating messy data
You can apply json scripts saved from previous OR session to re run another process.
UTF-8 encoding is the recommended default selection
OR exports as .tar file.
Dates in OR can be slpit out in to seperate columns
Month, day, year
Avtivities are useful! Please take time to check students were able to type commands before showing the result (for upgrading the instruction).
it's easier, compared with Excel, to edit the faceted data within the cluster without highlighting the specific columns and messing out the same data in other columns
---
# Minute notes:
## **Please write one thing that you thought was positive during the workshop day:**
The hands on activities and the command short cut notes.
The hands-on activities were well-suited to the length of time that we had and felt like we could build upon them.
**Please write one thing that you thought we could improve or what you would like to learn more about:**
I appreciated the number of helpers in the room! Felt very supportive. Need to take time to apply what I learned in this fast paced day.
I just want to delve into more Library Carpentry topics! I felt like it wasn't long enough. I would also have really liked real world examples of data curation and data visualization stemming from projects that librarians have done in the past, so I could really visualize the applications.
I'm not sure how I'm going to be able to keep this fresh in my mind since the applications wouldn't be something I would usd daily.
There is a lot of potential in improving work effeiencies. The morning session was helpful in understanding the fundamentals of commands. OpenRefine offers quite a bit of potential for helping our work with spreadsheets (which comprise 80% of my work). I appreciated the help with the programs/laptop.
Activities are useful! Please take time to check the majority of students are able to finish typing commands before showing the results (for upgrading instruction :))
---
This very helpful workshop. a lot of things to learn.