changed 3 years ago
Linked with GitHub

2021-GPS-Data-Skills-Course-Spreadsheets

Introduction to Data Management

Data Organization using Excel Spread Sheet

What kinds of things went wrong?

  • Load a CSV file, and Excel eats all the precision
  • had to repeat a lot of monotonous steps with excel
  • Dates in excel

Link to Gene article with excel error: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80

A graduate student recreate the analysis and discovered the error.

Create a data dictionary link: https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary

Best practices

  1. Leave raw data raw! Work with a copy.
  2. Keep versions (thinking about using git or other version control system).
  3. Keep a record of any cleaning steps you do.
  4. Export to format like csv

Tidy-data principles

  1. Put all variables (Things you're measuring) in columns - Tidydata!
  2. Put each observation in its own row.
  3. Don't combine multiple piece of informatin in one cell

Question: Any other data type? CSV seems too primitive, and not enough support to preserve the structure of the data.

What's wrong with the SAFI_messy data?

Mozambique Sheet

  • Multiple data in livestock_owned_and_numbers
  • under "plots" sometimes they use yes or not and sometimes the number 1
  • -999 null value? Not clear.
  • inconsistent naming of roof_type
  • Multiple tables in a sheet BAD BAD for export

Tanzania sheet:

  • inconsistent naming
  • multiple tables on one sheet
  • formatting of livestock

Common Formatting Problems

  1. Using multiple tables and/or tabs
  2. Not filling in zeros
  3. Using non-standard null values
  4. Using formatting to convey information
  5. Entering more than one piece of information per cell
  6. Using confusing field names
  7. Using special characters in data
  8. Issues with dates as data

Data as dates

Excel stores dates as series numbers - counts the dats from Dec 31, 1899.

Excel is unable to aprse dates from before this date. Be careful when working with historical data

Microsoft explaining the date system: https://docs.microsoft.com/en-US/office/troubleshoot/excel/1900-and-1904-date-system

One solution: having Day, Month, and Year columns - no ambiguity

Metadata

Metadata is "data about your data"
Don't save them in the sheet itself

SQL Notes -

Database used is book_store.db.sql

  • What are some common operations that you want to do to your inventory?
    Codify the books as per genre, year, author etc

  • What are some tools?
    Spreadsheet

  • What are some problems with these tools?
    data entry, accessing data, keeping track of what kind of access do you want to give to people.

Database management system (DBMS) can address these questions

database - shared collection of related data

Database management system (DBMS) - prgrams that support CRUD operations and control access to database

  • CRUD - acronym - Create, Read, Update, Delete

  • SQL - Structured QUery Language - language used to talk to DBMS or (relational database management systems) RDBMS

Keys link tables together
primary key - the column that uniquely identify each row in a table
foreign key - a column in a table that references the foreign key in another table

SQL intro

  • SQL or sequel - data query language

    • e.g. sql can be used with Google big query
  • SQL is formaily defined but implemented differently , there are imcompatibilities with other RDBMS .

  • popular extensions - MySQL, PostgreSQL

SQL can be run in programming langages like R and Python

SQL language and syntax

  • SQL is case insensitive
    • UPPERCASE is FOR KEYWORDS
  • End your statment with';'

Working with the book database:
open DBbrowser
open database - load book_store.db database


First queries:

SELECT * FROM books;

SELECT title, author
FROM books; 

commenting in sql is a -- e.g. --this is a comment

SELECT * FROM Transactions;
--how many books are sold (not how many total copies of books)? SELECT DISTINCT ISBN FROM transactions;

Using DISTINCT statement

SELECT DISTINCT purchase_number, ISBN FROM transactions, customers;

using ORDER BY statement

SELECT title, autor FROM books ORDER BY author DESC;
SELECT first, last, transaction_id FROM customers, transactions ORDER BY transaction_id ASC;

find out who purchased the most books?

SELECT first, last, purchase_number FROM customers ORDER BY purchase_number DESC;

Day 4 SQL - part 2#####

Review:

Primary Key
Foreign Key
Database
Relational Database Management system
SQL

In Class slide review questions:

--question 1: SELECT first, last FROM customers; --questions 2: SELECT * FROM customers; --question 3: SELECT DISTINCT customer_id FROM transactions; --question 4: SELECT * FROM books ORDER BY title DESC;

Boolean Algebra in SQL

AND and OR and NOT

boolean example:
(1 < -1) OR (0 == 0) -> True
(1 < -1) AND (-1 < 2) -> False

WHERE clause
example
who purcahsed more than 1 book?

SELECT first, last, purchase_number FROM Customers WHERE purchase_number > 1;

Whose first name started with letter A or last name with letter L?

SELECT first, last FROM customers WHERE (first LIKE 'A%') OR (last LIKE 'L%')

What books did customers whose id is 11111 and 11112 purchase?

SELECT ISBN FROM transactions WHERE (customer_id == 11111) OR (customer_id == 11112); -- using parenthasis helps readability
--using IN statement, result is similar to using OR SELECT ISBN FROM transactions WHERE customer_id IN (11111,11112);

Question: I also want to know who these customers and what books are. Using WHERE, how can I find out about these information?

--manually joining 2 tables SELECT id, first, last FROM Customers WHERE id IN (11111,11112); SELECT ISBN13, title FROM books WHERE ISBN13 IN (9780872205468, 9780140449044, 9780141026282);

Question: I want to check who purchased between 1 and 3 books. WHat is wrong with the following code?

SELECT first, last, purchase_number FROM customers WHERE (purcahse_number > 1) OR (purchase_number < 3); --change OR to AND SELECT first, last, purchase_number FROM customers WHERE (purchase_number > 1) AND (purchase_number < 3);

caluculations

'||' renaming using AS clause

--- using AS to rename columns and combine full name columns SELECT first ||''|| last AS full_name FROM customers;

Aggregate function

avg
count
min
max

What is the average rating of the books?

--using the builtin average avg() calculation function SELECT avg(avg_rating) FROM books; --get book count SELECT count(avg_rating) FROM books; --get min avg rating with title SELECT title, min(avg_rating) FROM books;

group by clause

Usually used with aggreate functions

Example: what is teh number of purchased by each person in transaction table?

--group results by customer ID SELECT customer_id, count(*) FROM transactions GROUP BY customer_id;

Joins

combine data from multiple tables

How do we get infomration about each customers' fav_book?
Not just the ISBN, but the title, author

what does this statement do?

-- Joining customers and books data SELECT Customers.last ||" "|| Customers.first, Customers.fav_book, Books.title, Books.author FROM Customers JOIN Books ON Customers.fav_book == books.ISBN13;

Tell SQL the specific columns we are joining, us ON clause:

Use where filter out all the transactins made to customers whos IDs are 11111, and 11112. Use JOIN find out who they are and USE JOIN to find out what books they purchased.

--combining everything you learned SELECT Customers.last, Customers.first, Customers.id, Books.title FROM Books JOIN Transactions JOIN Customers ON Books.ISBN13 == Transactions.ISBN AND Transactions.customer_id == Customers.id WHERE Customers.id IN (11111, 11112);

Using SQL in R:
load
library RSQLite

Full SQL lesson:
https://datacarpentry.org/sql-socialsci/

Google bigquery

######Unix Shell notes####

ls

listing the content of the curent directory

ls -l

Listing out, with more details using the -l flag

pwd

print current directory - check where you are at

man ls

Read the documentary of command ls

cd ..

moving up from the current folder

cat

concatenate - "print" content of a file

mv

move the files, notice that you also rename files using mv.

rm

DANGEROUS! You cannot recover once you removed a file

sort -n length.txt

sort the content of the file

sort -n length.txt > sorted-lengths.txt

sort the content, pipe (>) it to the file sorted-lengths.txt

head 
tail

print the head or tail of the file

wc

word count

mkdir result

make a folder called result

grep

Global Regular Express Print

[Regular expression hand-on tutorial] (https://regexone.com/)

cat [filename]
grep 1999 *.tsv
grep -c 1999 *.tsv
grep -c revolution *.tsv

Search the number of instances of "1999"
Search the number of instances of "revolution"

grep -ci revolution *.tsv

the -i flag means ignore case.

grep -i revolution *.tsv > results/2021-03-04_JAi-revolution.tsv

pipe the result to the file 2021-03-04_JAi-revolution.tsv

grep -iw revolution *.tsv > results/2021-03-04_JAiw-revolution.tsv

-w search for whole word, no partial match


What is an API? (Application Programming Interface)

API is the acronym for Application Programming Interface, which is a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you’re using an API.

HTTP - HyperText Transfer Protocol - set of rules taht the client and server agree upon on what information to send and what to be expected

API request status code 200
200 means "ok"

HTTP response Status Codes
always look for 200 which means everyone is "ok"

bad errors
e.g. 404 Not found code - server cannot find information

**server error **
500 codes are server errors

APIs use: these services have APIs available to use:
Twitter
spotify
worldbank

HTTP: query String

part of a URL that assigns values to specific parameters

https://www.google.com/search?newwindow=1&q=is+santa+true

  • Request Methods (Get) - most
  • important concept to remember
  • response status codes
  • Query string

Demo

OpenFEC API

to use the API, you will need to request a key - it i slike an acocunt

Select a repo