Codebook
=

[TOC]

# A Random dataframe

## Python

The code below creates a random pandas dataframe with columns of several data types. Set `n` to the required sample size. It first creates a **dictionary** `ddict`, which is supplied as `data` to the dataframe constructor, and uses **list comprehensions** to create the ID and name variables.

```{python}
import pandas as pd
import random as rm
import numpy as np
import string as st

n = 6
ddict = {'ID': [''.join(rm.choices(st.ascii_letters + st.digits, k=5)) for i in range(n)],
         'name': [''.join(rm.choices(st.ascii_lowercase, k=9)) for i in range(n)],
         'order': range(n),
         'items': rm.choices(range(n), k=n),
         'age': rm.sample(range(20, 40), n),
         'x': np.random.normal(size=n),
         'date': pd.date_range(start='2022-01-31', periods=n, freq='M')
         }
df = pd.DataFrame(ddict)
df
```

**Output**

| | ID | name | order | items | age | x | date |
|-|-|-|-|-|-|-|-|
| 0 | yFUiR | ryrafuhjx | 0 | 2 | 27 | -3.29647 | 1/31/2022 |
| 1 | 9nqMw | bpnlwpjhw | 1 | 0 | 33 | -0.27372 | 2/28/2022 |
| 2 | QX1Mf | rxxmtpkhl | 2 | 4 | 22 | -0.51878 | 3/31/2022 |
| 3 | izsaB | ygezcbhga | 3 | 1 | 23 | -0.40394 | 4/30/2022 |
| 4 | 19NeT | xhydvnkko | 4 | 4 | 24 | 0.223726 | 5/31/2022 |
| 5 | JPrCT | pgeqjgdxz | 5 | 0 | 31 | -0.42457 | 6/30/2022 |

## R

```{R}
set.seed(42)
n <- 6
df <- data.frame(id = 1:n,
                 date = seq.Date(as.Date("2020-12-1"), by = "month", length.out = n),
                 group = rep(LETTERS[1:2], n/2),
                 name = sample(LETTERS, n, replace = TRUE),
                 age = sample(18:30, n, replace = TRUE),
                 type = factor(paste("type", 1:n)),
                 x = rnorm(n))
df
```

Output

| | id | date | group | name | age | type | x |
|-|-|-|-|-|-|-|-|
| 1 | 1 | 12/1/2020 | A | Q | 19 | type 1 | -0.09466 |
| 2 | 2 | 1/1/2021 | B | E | 27 | type 2 | 2.018424 |
| 3 | 3 | 2/1/2021 | A | A | 18 | type 3 | -0.06271 |
| 4 | 4 | 3/1/2021 | B | Y | 25 | type 4 | 1.30487 |
| 5 | 5 | 4/1/2021 | A | J | 24 | type 5 | 2.286645 |
| 6 | 6 | 5/1/2021 | B | D | 21 | type 6 | -1.38886 |

# Library / packages

## python

Install a package (the leading `!` runs the command from a Jupyter/IPython cell):

```{python}
!pip install matplotlib
```

Import a package:

```{python}
import pandas as pd
import matplotlib.pyplot as plt
from random import sample
```

## R

Install a library:

```{R}
install.packages('tidyverse')
```

Import a library:

```{R}
library(tidyverse)
```

# Directories / paths

## python

### Import / read from file

To read data from a CSV file in the current working directory:

```{python}
df = pd.read_csv('lending_club_loan_two.csv')
```

To read data from a local CSV file:

```{python}
df = pd.read_csv(r'C:\Users\Lenovo\Documents\lending_club_loan_two.csv')
```

> A Python raw string is created by prefixing a string literal with `r` or `R`. A raw string treats the backslash `\` as a literal character. This is useful when a string contains backslashes that should not be interpreted as escape characters.

To read data from a local Excel file:

```{python}
df = pd.read_excel(r'C:\Users\Lenovo\Documents\lending_club_loan_two.xlsx')
```

Read a CSV file from a URL:

```{python}
pd.read_csv(r"https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv")
```

Read a SAS dataset:

```
pd.read_sas(r'C:\Users\Lenovo\Documents\KGB.sas7bdat')
```
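`read_csv` also accepts arguments to keep only selected columns and to parse dates while reading. A minimal sketch reusing the file above; the column names `loan_amnt` and `issue_d` are only assumed for illustration:

```{python}
import pandas as pd

# read only two columns and parse 'issue_d' as a datetime
# (column names are placeholders; adjust them to the actual file)
df = pd.read_csv('lending_club_loan_two.csv',
                 usecols=['loan_amnt', 'issue_d'],
                 parse_dates=['issue_d'])
df.dtypes
```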
When you need to read data from multiple files and append them all into one dataframe (stacking the rows):

```{python}
import glob
import os
import pandas as pd

# file path joined with file names using a wildcard character
file_path = os.path.join(r"C:\Users\Lenovo\Downloads\2021", "*NIFTY.txt")

# list all the matching file paths
file_list = glob.glob(file_path)

# the files have no header row, so a small function supplies the column
# names; it is then used in map() below
def f(i):
    return pd.read_csv(i, names=['Index', 'Date', 'Time', 'Open', 'High', 'Low', 'Close', 'c1', 'c2'])

# stacks all the files into one dataframe
dfj = pd.concat(map(f, file_list))
```

### Export / Write to file

```
df.to_csv(r'C:\Users\Lenovo\Documents\KGB_data.csv', index=False)
```

`index=False` is required to remove the index column from the output file.

## R

Set the current working directory:

```
setwd("C:/Users/Documents/")
```

### Import / Read from file

```
df <- read.csv(r"(C:\Users\Documents\text.csv)")
```

### Export / Write to file

```
write.csv(df, "/kaggle/working/file.csv")
```

# View the data

## python

### View rows

```
# first few and last few rows
df

# to show the first 5 rows of a data frame
df.head()

# to show the last 5 rows of a data frame
df.tail()

# to show the first 10 rows of a data frame
df.head(10)
```

### View columns

```
# to see the information of each column: header and respective data type
df.info()

# to see the column names only
df.columns

# to see all the column names as a list
list(df.columns)
# OR
list(df.keys())
# OR
df.columns.values
```

### data types

```
# see the data types of all columns
df.dtypes

# to see the data type of a selected column
df['col'].dtypes

# to see the unique values of a column (also shows the data type of the column at the end)
df.col.unique()
# OR
df['col'].unique()
```

### Summary stats

```
# to see the number of rows and columns
df.shape

# to show summary statistics (min, max, mean, percentiles, etc.) of all columns in a data frame
df.describe()

# to show summary statistics of a specific column of a data frame
df['col'].describe()

# to see the count / frequency of each value in a column
df['col'].value_counts()
# OR
df.col.value_counts()

# to see the unique values of a column
df.col.unique()
# OR
df['col'].unique()

# to see the unique values of all columns of a dataframe
print(df.apply(lambda col: col.unique()))

# to see the count of NA / null values in a column
df.col.isna().sum()
```

## R

```
# view the whole table in the viewer
View(df)
```

### View rows

```
# view the first 6 rows
head(df)

# view the first 10 rows
head(df, n=10)

# view the last 6 rows
tail(df)
```

### View columns

```
# view column names
names(dataset)
colnames(dataset)

# see the values of a column
dataset$column
```

### data types

```
# list the structure of mydata,
# which includes a listing of each column,
# its data type,
# and its first few values
str(mydata)

# data type of an object
str(x)

# class of an object (numeric, matrix, data frame, etc.)
class(object)
```

### Summary stats

```
# dimensions of an object
dim(object)

# see the unique values of a column
unique(df$col)

# see the table of unique values of a column with the frequency of each value
table(df$col)
```

# Strings

## python

Write a string across multiple lines using triple quotes; it is still read as one string:

```
"""
This is string,
In multiple lines,
Considered as one!
"""
```
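As a minimal sketch, assigning the triple-quoted string to a variable shows that it is a single `str` object with the line breaks preserved:

```{python}
s = """This is string,
In multiple lines,
Considered as one!"""

print(s)              # prints all three lines
print(type(s))        # <class 'str'>
print(len(s.splitlines()))  # 3
```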
# Loops

## python

### List comprehension

List comprehension offers a **shorter** syntax when you want to create a new list based on the values of an existing list (replacing a loop and conditional statements).

Below is **without** list comprehension:

```
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = []

for x in fruits:
    if "a" in x:
        newlist.append(x)

print(newlist)
```

Output: ['apple', 'banana', 'mango']

Below is **with** list comprehension:

```
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = [x for x in fruits if "a" in x]

print(newlist)
```

Output: ['apple', 'banana', 'mango']

List comprehension with a conditional statement:

```
# Syntax:
newlist = [expression for item in iterable if condition]

# Example:
newlist = [x for x in fruits if x != "apple"]
```

Output: ['banana', 'cherry', 'kiwi', 'mango']

# Plotting

## python

Example: plotting the distribution of a variable.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.random.normal(size=100000)

# distplot is deprecated in recent seaborn versions; sns.kdeplot(x) draws the same curve
sns.distplot(x, hist=False)
plt.show()
```

![](https://i.imgur.com/a1YNmgB.png)

# Data filtering

## python

## R

# Data treatments

## Python

### Data type conversions

### Null Values

```
# drop records with null values from the data
df.dropna()
```

## R

### Data type conversions

Use `as.<datatype>` to convert to the required data type:

```
# convert character to date
date1 <- as.Date("2010-12-31")

# convert a vector to a dataframe
df <- as.data.frame(vec1)
```

### Null values

# Oracle DB and SQL

## Python

```{python}
import cx_Oracle
import pandas as pd

connection = cx_Oracle.connect('Username/password@Hostname:port/ServiceName')
cursor = connection.cursor()

query = """
        select *
        from tablename
        where <conditions>
        """

df = pd.read_sql_query(query, connection)
```

# Estimate Execution time

## Python

```
import time

# get the start time
st = time.time()

# main program
# find the sum of the first 1 million numbers
sum_x = 0
for i in range(1000000):
    sum_x += i

# wait for 3 seconds
time.sleep(3)
print('Sum of first 1 million numbers is:', sum_x)

# get the end time
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
```

## R

# Python ⇔ R

```
df.head()     ⇔ head(df)
df.head(3)    ⇔ head(df, 3)
df.tail(3)    ⇔ tail(df, 3)
df.shape[0]   ⇔ nrow(df)
df.shape[1]   ⇔ ncol(df)
df.shape      ⇔ dim(df)
df.info()     ⇔ NO EQUIVALENT
df.describe() ⇔ summary(df)  # similar
NO EQUIVALENT ⇔ str(df)
```
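A few more rough equivalences, pulled from the commands listed earlier in this codebook (the mapping is approximate, not exact in every case):

```
df.columns               ⇔ names(df) / colnames(df)
df['col'].unique()       ⇔ unique(df$col)
df['col'].value_counts() ⇔ table(df$col)
df.dropna()              ⇔ na.omit(df)
```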