Codebook
=
[TOC]
# A Random dataframe
## Python
Below code creates a random pandas dataframe with columns with all the data types.
specify n=# for required sample size.
This first creates a **dictionary** ddict which is supplied as 'data' to dataframe.
This uses **list comprehensions** for creating ID and names variables.
```{python}
import pandas as pd
import random as rm
import numpy as np
import string as st
n = 6
ddict = {'ID': [''.join(rm.choices(st.ascii_letters + st.digits, k = 5)) for i in range(n)],
'name' : [''.join(rm.choices(st.ascii_lowercase, k = 9)) for i in range(n)],
'order':range(n),
'items': rm.choices(range(n), k = n),
'age': rm.sample(range(20,40),n),
'x': np.random.normal(size=n),
'date': pd.date_range(start='2022-01-31', periods=n, freq='M') }
df = pd.DataFrame(ddict)
df
```
**Output**
| | ID | name | order | items | age | x | date |
|-|-|-|-|-|-|-|-|
| 0 | yFUiR | ryrafuhjx | 0 | 2 | 27 | -3.29647 | 1/31/2022 |
| 1 | 9nqMw | bpnlwpjhw | 1 | 0 | 33 | -0.27372 | 2/28/2022 |
| 2 | QX1Mf | rxxmtpkhl | 2 | 4 | 22 | -0.51878 | 3/31/2022 |
| 3 | izsaB | ygezcbhga | 3 | 1 | 23 | -0.40394 | 4/30/2022 |
| 4 | 19NeT | xhydvnkko | 4 | 4 | 24 | 0.223726 | 5/31/2022 |
| 5 | JPrCT | pgeqjgdxz | 5 | 0 | 31 | -0.42457 | 6/30/2022 |
## R
```{R}
set.seed(42)
n <- 6
df <- data.frame(id=1:n,
date=seq.Date(as.Date("2020-12-1"), by="month", length.out = n),
group=rep(LETTERS[1:2], n/2),
name=sample(LETTERS, n, replace=TRUE),
age=sample(18:30, n, replace=TRUE),
type=factor(paste("type", 1:n)),
x=rnorm(n))
df
```
Output
| | id | date | group | name | age | type | x |
|-|-|-|-|-|-|-|-|
| 1 | 1 | 12/1/2020 | A | Q | 19 | type 1 | -0.09466 |
| 2 | 2 | 1/1/2021 | B | E | 27 | type 2 | 2.018424 |
| 3 | 3 | 2/1/2021 | A | A | 18 | type 3 | -0.06271 |
| 4 | 4 | 3/1/2021 | B | Y | 25 | type 4 | 1.30487 |
| 5 | 5 | 4/1/2021 | A | J | 24 | type 5 | 2.286645 |
| 6 | 6 | 5/1/2021 | B | D | 21 | type 6 | -1.38886 |
# Library / packages
## python
insall a package:
```{python}
!pip install <matplotlib>
```
import a package
```{python}
import pandas as pd
import matplotlib.pyplot as plt
from random import sample
```
## R
install a library:
```{R}
install.packages('tidyverse')
```
Import a library
```{R}
library(tidyverse)
```
# Directories / paths
## python
### Import / read from file
To read data from a csv file in current working directory
```{python}
df = pd.read_csv('lending_club_loan_two.csv')
```
To read data from a local csv file
```{python}
df = pd.read_csv(r'C:\Users\Lenovo\Documents\lending_club_loan_two.csv')
```
> Python raw string is created by prefixing a string literal with ‘r’ or ‘R’. Python raw string treats backslash `\` as a literal character. This is useful when we want to have a string that contains backslash and don’t want it to be treated as an escape character.
To read data from a local excel file
```{python}
df = pd.read_excel(r'C:\Users\Lenovo\Documents\lending_club_loan_two.xlsx')
```
Read csv file in an url
```{python}
pd.read_csv(r"https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv")
```
Read a sas dataset
```
pd.read_sas(r'C:\Users\Lenovo\Documents\KGB.sas7bdat')
```
When you need to read data from multiple files and have to append all of them horizontally.
```{python}
import glob import os
#File path joined with file names using wild charecter
file_path = os.path.join(r"C:\Users\Lenovo\Downloads\2021", "*NIFTY.txt")
#Lists all the file paths
file_list = glob.glob(file_path)
#Since we do not have header in the files, we need to use function in order to be able to create headers and use in map function
def f(i):
return pd.read_csv(i, names = ['Index','Date','Time','Open','High','Low','Close','c1','c2'])
#appends all the data horizontally
dfj = pd.concat(map(f, file_list))
```
### Export / Write to file
```
df.to_csv(r'C:\Users\Lenovo\Documents\KGB_data.csv', index=False)
```
index=False is required to remove the index column in output file.
## R
set current working directory
```
setwd("C:/Users/docuements/")
```
### Import / Read from file
```
df <- read.csv(r"(C:\Users\Documents\text.csv)")
```
### Export / Write to file
```
write.csv(df, "/kaggle/working/file.csv")
```
# View the data
## python
### View rows
```
# first few and last few rows
df
# To show first 5 rows of a data frame
df.head()
# To show last 5 rows of a data frame
df.tail()
# To show first 10 rows of a data frame
df.head(10)
```
### View columns
```
#To see the information of each column header and respective data type
df.info()
#To see the columns names only
df.columns
#To see all the column names
list(df.columns)
#OR
list(df.keys())
#OR
df2.columns.values
```
### data types
```
#see data types of all columns
df.dtypes
#to see the data type of selected columns
df['col'].dtypes
#to see unique values of a column also shows the data type of selected column at the end
df.col.unique()
#OR
df['col'].unique()
```
### Summary stats
```
#To see number of rows and columns
df.shape
# To show summary statistics (min,max,mean,percentiles, etc) of all columns in a data frame
df.describe()
# To show summary statistics of a specific column of a data frame
df['col'].describe()
#To see count / frequency of each values in a column
df['col'].value_counts()
# OR
df.col.value_counts()
#to see unique values of a column
df.col.unique()
#OR
df['col'].unique()
#To see unique values of all columns of a dataframe
print(df.apply(lambda col: col.unique()))
#To see the count of NA / null values in a column
#df.col.isna().sum()
```
## R
```
#view whole table in viewer
View(df)
```
### View rows
```
#View first 6 rows
head(df)
#View first 10 rows
head(df, n=10)
#View last 6 rows
tail(df)
```
### View columns
```
#View column names
names(dataset)
colnames(dataset)
#See values of a column
dataset$column
```
### data types
```
#list the structure of mydata
#which inclues listing of each columns
#their data types
#their first few values
str(mydata)
#data type of an object
str(x)
# class of an object (numeric, matrix, data frame, etc)
class(object)
```
### Summary stats
```
# dimensions of an object
dim(object)
#see the unique values of a column
unique(df$col)
#See the table of unique values of an object with frequency of each value
table(df$col)
```
# Strings
## python
Write string in multiple lines with use of 3 quotes and still read whole string as one:
```
“”” This is string,
In multiple lines,
Considered as one!
“””
```
# Loops
## python
### List comprehension
List comprehension offers a **shorter** syntax when you want to create a new list based on the values of an existing list (using loops and conditional statements).
Below is **without** list comprehension:
```
fruits = ["apple", "banana", "cherry", "kiwi", "mango"\]
newlist = []
for x in fruits:
if "a" in x:
newlist.append(x)
print(newlist)
```
Output: ['apple', 'banana', 'mango']
Below is **with** list comprehension:
```
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = [x for x in fruits if "a" in x]
print(newlist)
```
Output: ['apple', 'banana', 'mango']
List comprehension with conditional statement:
```
#Syntax:
newlist = [_expression_ for _item_ in _iterable_ if _condition_ == True]
#Example:
newlist = [x for x in fruits if x != "apple"]
```
Output: ['banana', 'cherry', 'kiwi', 'mango']
# Plotting
## python
Example: plotting a distribution of a variable
```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.normal(size=100000)
sns.distplot(x, hist=False)
plt.show()
```

# Data filtering
## python
## R
# Data treatments
## Python
### Data type conversions
### Null Values
```
#Drop null value records from data
df.dropna()
```
## R
### Data type conversions
use `as.datatype` to convert to required datatype
```
#convert charecter to date
date1 <- as.Date("2010-12-31")
#convert vector to dataframe
df <- as.data.frame(vec1)
```
### Null values
# Oracle DB and SQL
## Python
```{python}
import cx_Oracle
import pandas as pd
connection = cx_Oracle.connect('Username/password@Hostname:port/ServiceName')
cursor = connection.cursor()
query = """
select *
from tablename
where <conditions>
"""
df = pd.read_sql_query(query, connection)
```
# Estimate Execution time
## Python
```
import time
# get the start time
st = time.time()
# main program
# find sum to first 1 million numbers
sum_x = 0
for i in range(1000000):
sum_x += i
# wait for 3 seconds
time.sleep(3)
print('Sum of first 1 million numbers is:', sum_x)
# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
```
## R
# Python ⇔ R
```
df.head() ⇔ head(df)
df.head(3) ⇔ head(df,3)
df.tail(3) ⇔ tail(df,3)
df.shape[0] ⇔ nrow(df)
df.shape[1] ⇔ ncol(df)
df.shape ⇔ dim(df)
df.info() ⇔ NO EQUIVALENT
df.describe() ⇔ summary(df) # similar
NO EQUIVALENT ⇔ str(df)
```ckedit.io/).