owned this note
owned this note
Published
Linked with GitHub
---
tags: gps
---
# 2021-GPS-Data-Skills-Course-Spreadsheets
# Introduction to Data Management
# Data Organization using Excel Spread Sheet
## What kinds of things went wrong?
* Load a CSV file, and Excel eats all the precision
* had to repeat a lot of monotonous steps with excel
* Dates in excel
Link to Gene article with excel error: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-80
A graduate student recreate the analysis and discovered the error.
Create a data dictionary link: https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary
## Best practices
1. Leave raw data raw! Work with a copy.
1. Keep versions (thinking about using git or other version control system).
1. Keep a record of any cleaning steps you do.
1. Export to format like csv
**Tidy-data principles**
1. Put all variables (Things you're measuring) in columns - **Tidydata**!
1. Put each observation in its own row.
1. Don't combine multiple piece of informatin in one cell
**Question**: Any other data type? CSV seems too primitive, and not enough support to preserve the structure of the data.
## What's wrong with the SAFI_messy data?
**Mozambique** Sheet
* Multiple data in livestock_owned_and_numbers
* under "plots" sometimes they use yes or not and sometimes the number 1
* -999 -- null value? Not clear.
* inconsistent naming of roof_type
* Multiple tables in a sheet -- BAD BAD for export
**Tanzania** sheet:
* inconsistent naming
* multiple tables on one sheet
* formatting of livestock
## Common Formatting Problems
1. Using multiple tables and/or tabs
2. Not filling in zeros
3. Using non-standard null values
4. Using formatting to convey information
5. Entering more than one piece of information per cell
6. Using confusing field names
7. Using special characters in data
8. Issues with dates as data
## Data as dates
Excel stores dates as series numbers - counts the dats from Dec 31, 1899.
Excel is unable to aprse dates from before this date. Be careful when working with historical data
Microsoft explaining the date system: https://docs.microsoft.com/en-US/office/troubleshoot/excel/1900-and-1904-date-system
One solution: having Day, Month, and Year columns - no ambiguity
## Metadata
Metadata is "data about your data"
Don't save them in the sheet itself
#### SQL Notes - ####
**Database used is book_store.db.sql**
* What are some common operations that you want to do to your inventory?
Codify the books as per genre, year, author etc
* What are some tools?
Spreadsheet
* What are some problems with these tools?
data entry, accessing data, keeping track of what kind of access do you want to give to people.
`Database management system (DBMS)` can address these questions
`database` - shared collection of related data
`Database management system (DBMS)` - prgrams that support CRUD operations and control access to database
* CRUD - acronym - Create, Read, Update, Delete
* SQL - Structured QUery Language - language used to talk to DBMS or (relational database management systems) RDBMS
Keys link tables together
primary key - the column that uniquely identify each row in a table
foreign key - a column in a table that references the foreign key in another table
### SQL intro
- SQL or sequel - data query language
* e.g. sql can be used with Google big query
- SQL is formaily defined but implemented differently , there are imcompatibilities with other RDBMS .
- popular extensions - MySQL, PostgreSQL
SQL can be run in programming langages like R and Python
SQL language and syntax
* SQL is case insensitive
* UPPERCASE is FOR KEYWORDS
* End your statment with';'
Working with the book database:
open DBbrowser
open database - load book_store.db database
---
#### First queries:
```sql
SELECT * FROM books;
```
```sql
SELECT title, author
FROM books;
```
commenting in sql is a `--` e.g. `--this is a comment`
```sql=
SELECT *
FROM Transactions;
```
```sql=
--how many books are sold (not how many total copies of books)?
SELECT DISTINCT ISBN
FROM transactions;
```
Using DISTINCT statement
```sql=
SELECT DISTINCT purchase_number, ISBN
FROM transactions, customers;
```
using ORDER BY statement
```sql=
SELECT title, autor
FROM books
ORDER BY author DESC;
```
```sql=
SELECT first, last, transaction_id FROM customers, transactions
ORDER BY transaction_id ASC;
```
find out who purchased the most books?
```sql=
SELECT first, last, purchase_number
FROM customers
ORDER BY purchase_number DESC;
```
## Day 4 SQL - part 2#####
### Review:
Primary Key
Foreign Key
Database
Relational Database Management system
SQL
In Class slide review questions:
```sql=
--question 1:
SELECT first, last FROM customers;
--questions 2:
SELECT * FROM customers;
--question 3:
SELECT DISTINCT customer_id FROM transactions;
--question 4:
SELECT * FROM books
ORDER BY title DESC;
```
#### Boolean Algebra in SQL
`AND` and `OR` and `NOT`
boolean example:
(1 < -1) OR (0 == 0) -> True
(1 < -1) AND (-1 < 2) -> False
`WHERE` clause
example
who purcahsed more than 1 book?
```sql=
SELECT first, last, purchase_number FROM Customers
WHERE purchase_number > 1;
```
Whose first name started with letter A or last name with letter L?
```sql=
SELECT first, last FROM customers
WHERE (first LIKE 'A%') OR (last LIKE 'L%')
```
What books did customers whose id is 11111 and 11112 purchase?
```sql=
SELECT ISBN FROM transactions
WHERE (customer_id == 11111) OR (customer_id == 11112);
-- using parenthasis helps readability
```
```sql=
--using IN statement, result is similar to using OR
SELECT ISBN FROM transactions
WHERE customer_id IN (11111,11112);
```
Question: I also want to know who these customers and what books are. Using WHERE, how can I find out about these information?
```sql=
--manually joining 2 tables
SELECT id, first, last FROM Customers
WHERE id IN (11111,11112);
SELECT ISBN13, title FROM books
WHERE ISBN13 IN (9780872205468, 9780140449044, 9780141026282);
```
Question: I want to check who purchased between 1 and 3 books. WHat is wrong with the following code?
```sql=
SELECT first, last, purchase_number FROM customers
WHERE (purcahse_number > 1) OR (purchase_number < 3);
--change OR to AND
SELECT first, last, purchase_number FROM customers
WHERE (purchase_number > 1) AND (purchase_number < 3);
```
#### caluculations
'||' renaming using AS clause
```sql=
--- using AS to rename columns and combine full name columns
SELECT first ||''|| last AS full_name
FROM customers;
```
#### Aggregate function
avg
count
min
max
What is the average rating of the books?
```sql=
--using the builtin average avg() calculation function
SELECT avg(avg_rating) FROM books;
--get book count
SELECT count(avg_rating) FROM books;
--get min avg rating with title
SELECT title, min(avg_rating) FROM books;
```
#### group by clause
Usually used with aggreate functions
Example: what is teh number of purchased by each person in transaction table?
```sql=
--group results by customer ID
SELECT customer_id, count(*) FROM transactions
GROUP BY customer_id;
```
#### Joins
combine data from multiple tables
How do we get infomration about each customers' fav_book?
Not just the ISBN, but the title, author....
what does this statement do?
```sql=
-- Joining customers and books data
SELECT Customers.last ||" "|| Customers.first, Customers.fav_book, Books.title, Books.author
FROM Customers
JOIN Books
ON Customers.fav_book == books.ISBN13;
```
Tell SQL the specific columns we are joining, us ON clause:
Use where filter out all the transactins made to customers whos IDs are 11111, and 11112. Use JOIN find out who they are and USE JOIN to find out what books they purchased.
```sql=
--combining everything you learned
SELECT Customers.last, Customers.first, Customers.id, Books.title
FROM Books
JOIN Transactions JOIN Customers
ON Books.ISBN13 == Transactions.ISBN
AND Transactions.customer_id == Customers.id
WHERE Customers.id IN (11111, 11112);
```
Using SQL in R:
load
library RSQLite
Full SQL lesson:
https://datacarpentry.org/sql-socialsci/
Google bigquery
# ######Unix Shell notes####
```
ls
```
listing the content of the curent directory
```
ls -l
```
Listing out, with more details using the -l flag
```
pwd
```
print current directory - check where you are at
```
man ls
```
Read the documentary of command `ls`
```
cd ..
```
moving up from the current folder
```
cat
```
concatenate - "print" content of a file
```
mv
```
move the files, notice that you also rename files using mv.
```
rm
```
DANGEROUS! You cannot recover once you removed a file
```
sort -n length.txt
```
sort the content of the file
```
sort -n length.txt > sorted-lengths.txt
```
sort the content, pipe (`>`) it to the file `sorted-lengths.txt`
```
head
tail
```
print the head or tail of the file
```
wc
```
word count
```
mkdir result
```
make a folder called `result`
```
grep
```
**G**lobal **R**egular **E**xpress **P**rint
[Regular expression hand-on tutorial] (https://regexone.com/)
```
cat [filename]
```
```
grep 1999 *.tsv
```
```
grep -c 1999 *.tsv
grep -c revolution *.tsv
```
Search the number of instances of "1999"
Search the number of instances of "revolution"
```
grep -ci revolution *.tsv
```
the `-i` flag means ignore case.
```
grep -i revolution *.tsv > results/2021-03-04_JAi-revolution.tsv
```
pipe the result to the file `2021-03-04_JAi-revolution.tsv`
```
grep -iw revolution *.tsv > results/2021-03-04_JAiw-revolution.tsv
```
`-w` search for whole word, no partial match
------
### What is an API? (Application Programming Interface)
API is the acronym for Application Programming Interface, which is a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you’re using an API.
**HTTP** - HyperText Transfer Protocol - set of rules taht the client and server agree upon on what information to send and what to be expected
**API request status code 200**
200 means "ok"
**HTTP response Status Codes**
always look for 200 which means everyone is "ok"
**bad errors**
e.g. 404 Not found code - server cannot find information
**server error **
500 codes are server errors
**APIs use:** these services have APIs available to use:
Twitter
spotify
worldbank
**HTTP: query String**
part of a URL that assigns values to specific parameters
https://www.google.com/search?newwindow=1&q=is+santa+true
* Request Methods (Get) - most
* important concept to remember
* response status codes
* Query string
## Demo
**OpenFEC** API
to use the API, you will need to request a key - it i slike an acocunt