# [Exercise] LAB1

:::info
:pencil: Name: Mai Nguyen Viet Phu
:pencil: Id: 10421047
:pencil: Introduction to Data Science and AI
:::

## :beginner: Lab 1 Requirement

:bulb: Integrated Development Environment (IDE): PyCharm.
:bulb: Version Control: Git and TortoiseGit.
:bulb: Compiler & Interpreter: Python 3 (WinPython on Windows or Anaconda on Linux).
:bulb: Additional Libraries: Pandas, NumPy, SciPy, Matplotlib, Sklearn, (and PyTorch).
:bulb: Data Sets: Iris and MNIST

:mag_right: 1.1) Download and install all the items in the requirements.
:mag_right: 1.2) Check that the installation is correct.
:mag_right: 1.3) Load the IRIS data set by using NumPy.
:mag_right: 1.4) Print the IRIS data set to the console.
:mag_right: 1.5) Make 5% of the values in IRIS nan.
:mag_right: 1.6) Preprocess missing data (i.e. nan) by using all the methods in the lecture.
:mag_right: 1.7) For each preprocessing method, use a classification model (e.g., naive Bayes) and evaluate the accuracy.
:mag_right: 1.8) Repeat steps 1.6 and 1.7 with 10% nan values.
:mag_right: 1.9) Repeat steps 1.6 and 1.7 with 15% nan values.
:mag_right: 1.10) Repeat steps 1.6 and 1.7 with 20% nan values.
:mag_right: 1.11) Use min-max normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.
:mag_right: 1.12) Use z-normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.

## :triangular_flag_on_post: Explanation

:dart: <span style="color:orange">Import Libraries:</span>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

:dart: <span style="color:orange">1.3. Load the IRIS data set by using NumPy.</span> + :dart: <span style="color:orange">1.4. Print the IRIS data set to the console.</span>

```python
iris_data = load_iris()   # 1.3: load the Iris data set
print(iris_data)          # 1.4: print iris_data to the console
```

<span style="color:cyan">load_iris() loads the Iris data set, and print writes iris_data to the console.</span>

:dart: <span style="color:orange">1.5. Make 5% of the values in IRIS nan.</span>

```python
# calculate the number of NaN values to create, based on the percentage
nan_percent = 5
n_samples, n_features = iris_data.data.shape
n_nan = int(n_samples * n_features * nan_percent / 100)

# create random indices for the NaN values
nan_index = np.random.choice(n_samples * n_features, n_nan, replace=False)

# set the selected values to NaN (this is step 1.5)
iris_data.data.ravel()[nan_index] = np.nan

# count the number of NaN values and print it
nan_count = np.isnan(iris_data.data).sum()
print(f"Number of NaN values created: {nan_count} ({nan_percent}% of the dataset)")
```

<span style="color:cyan">With the percentage set to 5, data.shape gives the number of samples and features of iris_data, from which the formula computes the number of NaN values to create. np.random.choice then draws that many random flat indices from the data without replacement, ravel() gives a flat view of the array so that those positions can be set to NaN in place, and np.isnan counts the resulting NaN values before printing. A reusable version of this injection step is sketched below.</span>
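Since steps 1.8–1.10 repeat this injection at other percentages, it can help to wrap the logic above in a small helper. This is a minimal sketch, not part of the lab handout: the function name `inject_nan` and the `seed` parameter are my own additions, and a fixed seed is used only so that runs are reproducible.

```python
import numpy as np

def inject_nan(X, percent, seed=42):
    """Return a copy of X with `percent`% of its entries set to NaN."""
    rng = np.random.default_rng(seed)        # fixed seed: reproducible runs
    X = X.copy()                             # leave the caller's array intact
    n_nan = int(X.size * percent / 100)      # same formula as in step 1.5
    idx = rng.choice(X.size, size=n_nan, replace=False)
    X.ravel()[idx] = np.nan                  # ravel() is a view of the copy, so this writes into X
    return X
```

With this helper, `iris_data.data = inject_nan(iris_data.data, 5)` reproduces step 1.5, and changing the percentage argument covers 1.8–1.10.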
:dart: <span style="color:orange">1.6. Preprocess missing data (i.e. nan) by using all the methods in the lecture.</span>

:books: There are five methods:

:notebook: Method 1: ignore rows containing missing data

```python
ignore_rows_data = iris_data.data[~np.isnan(iris_data.data).any(axis=1)]
ignore_rows_target = iris_data.target[~np.isnan(iris_data.data).any(axis=1)]
```

<span style="color:cyan">np.isnan(...).any(axis=1) creates a boolean array with one entry per row of iris_data.data: True for rows containing at least one NaN value and False for rows without NaN. The ~ inverts it, so indexing keeps only the rows that do not contain NaN. The second line builds the matching target array from iris_data.target by keeping only the entries whose corresponding rows in iris_data.data contain no NaN.</span>

:notebook: Method 2: fill in manually

```python
manual_data = iris_data.data.copy()
manual_target = iris_data.target.copy()

for i in range(n_features):
    mean_val = np.mean(manual_data[:, i][~np.isnan(manual_data[:, i])])
    manual_data[:, i][np.isnan(manual_data[:, i])] = mean_val
```

<span style="color:cyan">Make a copy of iris_data.data and of the original target values. The loop iterates over each feature (column), computes the mean of that feature while ignoring NaN, and replaces every NaN in the column with the calculated mean.</span>

:notebook: Method 3: use a global constant

```python
constant_data = iris_data.data.copy()
constant_target = iris_data.target.copy()

constant_value = 0
constant_data[np.isnan(constant_data)] = constant_value
```

<span style="color:cyan">Copy the data and the target values, set the global constant to 0, then use np.isnan to find every position in the data set whose value is NaN and overwrite those positions with the common constant value (0).</span>

:notebook: Method 4: fill with a measure of central tendency

```python
central_data = iris_data.data.copy()
central_target = iris_data.target.copy()

for i in range(n_features):
    central_data[:, i][np.isnan(central_data[:, i])] = np.mean(central_data[:, i][~np.isnan(central_data[:, i])])
```

<span style="color:cyan">For each column, select the entries that are NaN and assign them the mean of the non-NaN entries of that column.</span>

:notebook: Method 5: fill with the most probable value

```python
probable_data = iris_data.data.copy()
probable_target = iris_data.target.copy()

imp = SimpleImputer(strategy="most_frequent")
probable_data = imp.fit_transform(probable_data)
```

<span style="color:cyan">Initialize a SimpleImputer with strategy "most_frequent", so missing values are replaced with the most frequent value of each column. imp.fit_transform performs the substitution and returns the resulting array, which is assigned back to probable_data. All five preprocessed data sets are evaluated with the classifier from 1.7; a sketch that loops over them follows below.</span>
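Before moving on to 1.7: the five methods produce five (data, target) pairs, so they can be collected in a dictionary and scored in one loop instead of copying the classifier code five times. A minimal sketch assuming the variables defined above; the dictionary name `methods` and the loop are my own additions, not part of the lab code:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# each entry: method name -> (preprocessed features, matching targets)
methods = {
    "ignore rows":      (ignore_rows_data, ignore_rows_target),
    "fill manually":    (manual_data, manual_target),
    "global constant":  (constant_data, constant_target),
    "central tendency": (central_data, central_target),
    "most probable":    (probable_data, probable_target),
}

for name, (X, y) in methods.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    acc = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```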
:dart: <span style="color:orange">1.7. For each preprocessing method, use a classification model (e.g., naive Bayes) and evaluate the accuracy.</span>

```python
X_train, X_test, y_train, y_test = train_test_split(ignore_rows_data, ignore_rows_target, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naive Bayes classifier with ignore rows method: {acc}")
```

<span style="color:cyan">train_test_split splits the preprocessed data: the first two arguments are the input data and the class labels, test_size=0.2 reserves 20% of the data for testing, and random_state=42 seeds the split for reproducibility (changing the seed changes which samples end up in the training and test sets). gnb.fit trains the classifier on the training data, gnb.predict makes predictions on the test data using the trained classifier and assigns them to y_pred, and accuracy_score compares those predictions with the true labels to compute the accuracy, which is then printed. The same code is repeated for the other four preprocessing methods.</span>

:dart: <span style="color:orange">1.8. Repeat steps 1.6 and 1.7 with 10% nan values.</span> + <span style="color:orange">1.9. Repeat steps 1.6 and 1.7 with 15% nan values.</span> + <span style="color:orange">1.10. Repeat steps 1.6 and 1.7 with 20% nan values.</span>

<span style="color:cyan">Only the percentage in this line has to change before rerunning the pipeline:</span>

```python
nan_percent = 10   # for 1.8
nan_percent = 15   # for 1.9
nan_percent = 20   # for 1.10
```

<span style="color:cyan">After running the code many times, I observe that the larger the percentage of missing values, the harder the prediction becomes.</span>

:dart: <span style="color:orange">1.11. Use min-max normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.</span>

```python
scaler_min_max = MinMaxScaler()

# perform min-max normalization on the data
iris_data_normalized = scaler_min_max.fit_transform(iris_data.data)
```

<span style="color:cyan">MinMaxScaler normalizes iris_data: fit_transform computes the minimum and maximum value of each feature in the data set and returns a normalized data set in which the minimum of each feature is mapped to 0 and the maximum to 1, i.e. every feature is rescaled to the range [0, 1] by the min-max normalization formula.</span>

:dart: <span style="color:orange">1.12. Use z-normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.</span>

```python
scaler_Z = StandardScaler()
iris_data_normalized = scaler_Z.fit_transform(iris_data.data)
```

<span style="color:cyan">StandardScaler performs z-normalization: fit_transform computes the mean and standard deviation of each feature, then transforms the data by centering each feature around zero and scaling it to unit variance. The normalized data from 1.11 and 1.12 is then classified and evaluated exactly as in 1.7; a sketch is given below.</span>
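The evaluation step for 1.11 and 1.12 follows the same pattern as 1.7. A minimal sketch, assuming a freshly loaded copy of Iris so that the scalers and GaussianNB see no injected NaN values; the loop structure is my own addition:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()  # fresh copy, without the injected NaN values

for name, scaler in [("min-max", MinMaxScaler()), ("z-normalization", StandardScaler())]:
    X = scaler.fit_transform(iris.data)  # rescale the features
    X_train, X_test, y_train, y_test = train_test_split(X, iris.target, test_size=0.2, random_state=42)
    acc = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```

Because GaussianNB fits a per-feature mean and variance, a linear rescaling of each feature typically leaves its predictions, and hence the reported accuracy, unchanged; normalization matters mainly for scale-sensitive models.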
<div align="center">

## :computer: Code
[![View on GitHub](https://img.shields.io/badge/View%20on-GitHub-brightgreen)](https://github.com/HalloPeanut/PeanutLab1.github.io)

</div>

## :file_folder: Contact
<center>

![Profile Picture](https://scontent.fsgn2-8.fna.fbcdn.net/v/t39.30808-6/324094541_6363090273704637_6864410919998281766_n.jpg?_nc_cat=102&ccb=1-7&_nc_sid=174925&_nc_ohc=YGw0g-n5ql4AX8HBd8b&_nc_ht=scontent.fsgn2-8.fna&oh=00_AfAuCdyrCxENh_veTI39mwgJRA8zThc-1aweFaRzONpDVA&oe=643E5BCB)

:phone: 0905657088
:email: 10421047@student.vgu.edu.vn
:card_index: [Peanut](https://www.facebook.com/profile.php?id=100014377148812)

</center>