Codebook

A Random dataframe

Python

Below code creates a random pandas dataframe with columns with all the data types.

specify n=# for required sample size.

This first creates a dictionary ddict which is supplied as 'data' to dataframe.

This uses list comprehensions for creating ID and names variables.

import pandas as pd
import random as rm
import numpy as np
import string as st

n = 6

ddict = {'ID': [''.join(rm.choices(st.ascii_letters + st.digits, k = 5)) for i in range(n)],
         'name' : [''.join(rm.choices(st.ascii_lowercase, k = 9)) for i in range(n)],
         'order':range(n), 
         'items': rm.choices(range(n),  k = n),
         'age': rm.sample(range(20,40),n), 
         'x': np.random.normal(size=n), 
         'date': pd.date_range(start='2022-01-31', periods=n, freq='M') }
        
df = pd.DataFrame(ddict)
df

Output

ID name order items age x date
0 yFUiR ryrafuhjx 0 2 27 -3.29647 1/31/2022
1 9nqMw bpnlwpjhw 1 0 33 -0.27372 2/28/2022
2 QX1Mf rxxmtpkhl 2 4 22 -0.51878 3/31/2022
3 izsaB ygezcbhga 3 1 23 -0.40394 4/30/2022
4 19NeT xhydvnkko 4 4 24 0.223726 5/31/2022
5 JPrCT pgeqjgdxz 5 0 31 -0.42457 6/30/2022

R

set.seed(42)
n <- 6
df <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-1"), by="month", length.out = n),
                  group=rep(LETTERS[1:2], n/2),
                  name=sample(LETTERS, n, replace=TRUE),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
                  
df

Output

id date group name age type x
1 1 12/1/2020 A Q 19 type 1 -0.09466
2 2 1/1/2021 B E 27 type 2 2.018424
3 3 2/1/2021 A A 18 type 3 -0.06271
4 4 3/1/2021 B Y 25 type 4 1.30487
5 5 4/1/2021 A J 24 type 5 2.286645
6 6 5/1/2021 B D 21 type 6 -1.38886

Library / packages

python

insall a package:

!pip install <matplotlib>

import a package

import pandas as pd
import matplotlib.pyplot as plt
from random import sample

R

install a library:

install.packages('tidyverse')

Import a library

library(tidyverse)

Directories / paths

python

Import / read from file

To read data from a csv file in current working directory

df = pd.read_csv('lending_club_loan_two.csv')

To read data from a local csv file

df = pd.read_csv(r'C:\Users\Lenovo\Documents\lending_club_loan_two.csv')

Python raw string is created by prefixing a string literal with β€˜r’ or β€˜R’. Python raw string treats backslash \ as a literal character. This is useful when we want to have a string that contains backslash and don’t want it to be treated as an escape character.

To read data from a local excel file

df = pd.read_excel(r'C:\Users\Lenovo\Documents\lending_club_loan_two.xlsx')

Read csv file in an url

pd.read_csv(r"https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv")

Read a sas dataset

pd.read_sas(r'C:\Users\Lenovo\Documents\KGB.sas7bdat')

When you need to read data from multiple files and have to append all of them horizontally.

import glob import os

#File path joined with file names using wild charecter
file_path = os.path.join(r"C:\Users\Lenovo\Downloads\2021", "*NIFTY.txt")

#Lists all the file paths
file_list = glob.glob(file_path)

#Since we do not have header in the files, we need to use function in order to be able to create headers and use in map function
def f(i):
   return pd.read_csv(i, names = ['Index','Date','Time','Open','High','Low','Close','c1','c2'])

#appends all the data horizontally
dfj = pd.concat(map(f, file_list))

Export / Write to file

df.to_csv(r'C:\Users\Lenovo\Documents\KGB_data.csv', index=False)

index=False is required to remove the index column in output file.

R

set current working directory

setwd("C:/Users/docuements/")

Import / Read from file

df <- read.csv(r"(C:\Users\Documents\text.csv)")

Export / Write to file

write.csv(df, "/kaggle/working/file.csv")

View the data

python

View rows

# first few and last few rows
df

# To show first 5 rows of a data frame
df.head()

# To show last 5 rows of a data frame
df.tail()

# To show first 10 rows of a data frame
df.head(10)

View columns

#To see the information of each column header and respective data type
df.info()

#To see the columns names only
df.columns

#To see all the column names
list(df.columns) 
#OR
list(df.keys())
#OR
df2.columns.values

data types

#see data types of all columns
df.dtypes

#to see the data type of selected columns
df['col'].dtypes

#to see unique values of a column also shows the data type of selected column at the end
df.col.unique()
#OR
df['col'].unique()

Summary stats

#To see number of rows and columns
df.shape

# To show summary statistics (min,max,mean,percentiles, etc) of all columns in a data frame
df.describe()

# To show summary statistics of a specific column of a data frame
df['col'].describe()

#To see count / frequency of each values in a column
df['col'].value_counts()
# OR
df.col.value_counts()

#to see unique values of a column
df.col.unique()
#OR
df['col'].unique()

#To see unique values of all columns of a dataframe
print(df.apply(lambda col: col.unique()))

#To see the count of NA / null values in a column
#df.col.isna().sum()

R

#view whole table in viewer
View(df)

View rows

#View first 6 rows
head(df)

#View first 10 rows
head(df, n=10)

#View last 6 rows
tail(df)

View columns

#View column names
names(dataset)
colnames(dataset)

#See values of a column
dataset$column


data types

#list the structure of mydata 
#which inclues listing of each columns
#their data types 
#their first few values
str(mydata)

#data type of an object
str(x)

# class of an object (numeric, matrix, data frame, etc)
class(object)

Summary stats

# dimensions of an object
dim(object)

#see the unique values of a column
unique(df$col)

#See the table of unique values of an object with frequency of each value
table(df$col)

Strings

python

Write string in multiple lines with use of 3 quotes and still read whole string as one:


β€œβ€β€ This is string, 

In multiple lines, 

Considered as one! 

β€œβ€β€ 

Loops

python

List comprehension

List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list (using loops and conditional statements).

Below is without list comprehension:


fruits = ["apple", "banana", "cherry", "kiwi", "mango"\]  
newlist = []  
  
for x in fruits:  
 if "a" in x:  
 newlist.append(x)  
  
print(newlist) 

Output: ['apple', 'banana', 'mango']

Below is with list comprehension:


fruits = ["apple", "banana", "cherry", "kiwi", "mango"]  
  
newlist = [x   for x in fruits if "a" in x]  
  
print(newlist) 

Output: ['apple', 'banana', 'mango']

List comprehension with conditional statement:


#Syntax: 

newlist = [_expression_ for _item_ in _iterable_ if _condition_ == True] 

#Example: 

newlist = [x for x in fruits if x != "apple"] 

Output: ['banana', 'cherry', 'kiwi', 'mango']

Plotting

python

Example: plotting a distribution of a variable


import numpy as np 

import matplotlib.pyplot as plt 

import seaborn as sns 

x = np.random.normal(size=100000) 

sns.distplot(x, hist=False) 

plt.show() 

Data filtering

python

R

Data treatments

Python

Data type conversions

Null Values

#Drop null value records from data
df.dropna()

R

Data type conversions

use as.datatype to convert to required datatype

#convert charecter to date
date1 <- as.Date("2010-12-31")

#convert vector to dataframe
df <- as.data.frame(vec1)

Null values

Oracle DB and SQL

Python

import cx_Oracle
import pandas as pd

connection = cx_Oracle.connect('Username/password@Hostname:port/ServiceName')
cursor = connection.cursor()

query = """
select * 
from tablename 
where <conditions>
"""

df = pd.read_sql_query(query, connection)

Estimate Execution time

Python

import time

# get the start time
st = time.time()

# main program
# find sum to first 1 million numbers
sum_x = 0
for i in range(1000000):
    sum_x += i

# wait for 3 seconds
time.sleep(3)
print('Sum of first 1 million numbers is:', sum_x)

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

R

Python ⇔ R

df.head()       ⇔  head(df)
df.head(3)      ⇔  head(df,3)
df.tail(3)      ⇔  tail(df,3)
df.shape[0]     ⇔  nrow(df)
df.shape[1]     ⇔  ncol(df)
df.shape        ⇔  dim(df)
df.info()       ⇔  NO EQUIVALENT
df.describe()   ⇔  summary(df)     # similar
NO EQUIVALENT   ⇔  str(df)
```ckedit.io/).