# [Exercise] LAB1

:::info
:pencil: Name: Mai Nguyen Viet Phu
:pencil: Id: 10421047
:pencil: Introduction to Data Science and AI
:::

## :beginner: Lab 1 Requirement

:bulb: Integrated Development Environment (IDE): PyCharm.
:bulb: Version Control: Git and TortoiseGit.
:bulb: Compiler & Interpreter: Python 3 (WinPython on Windows or Anaconda on Linux).
:bulb: Additional Libraries: Pandas, NumPy, SciPy, Matplotlib, Sklearn, (and PyTorch).
:bulb: Data Sets: Iris and MNIST

:mag_right: 1.1) Download and install all the items in the requirements.
:mag_right: 1.2) Check that the installation is correct.
:mag_right: 1.3) Load the IRIS data set by using NumPy.
:mag_right: 1.4) Print the IRIS data set to the console.
:mag_right: 1.5) Make 5% of the values in IRIS nan.
:mag_right: 1.6) Preprocess missing data (i.e. nan) by using all the methods in the lecture.
:mag_right: 1.7) For each preprocessing method, use a classification model (e.g., naive Bayes) and evaluate the accuracy.
:mag_right: 1.8) Repeat steps 1.6 and 1.7 with 10% nan values.
:mag_right: 1.9) Repeat steps 1.6 and 1.7 with 15% nan values.
:mag_right: 1.10) Repeat steps 1.6 and 1.7 with 20% nan values.
:mag_right: 1.11) Use min-max normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.
:mag_right: 1.12) Use z-normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.

## :triangular_flag_on_post: Explanation

:dart: <span style="color:orange">Import Libraries:</span>

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

:dart: <span style="color:orange">1.3. Load the IRIS data set by using NumPy.</span> + :dart: <span style="color:orange">1.4. Print the IRIS data set to the console.</span>

```python
iris_data = load_iris()   # 1.3: load the Iris data set
print(iris_data)          # 1.4: print iris_data to the console
```

<span style="color:cyan">load_iris() loads the Iris data set, and print writes iris_data to the console.</span>

:dart: <span style="color:orange">1.5. Make 5% of the values in IRIS nan.</span>

```python
# calculate the number of NaN values to create, based on the percentage
nan_percent = 5
n_samples, n_features = iris_data.data.shape
n_nan = int(n_samples * n_features * nan_percent / 100)

# create random indices for the NaN values
nan_index = np.random.choice(n_samples * n_features, n_nan, replace=False)

# set the selected values to NaN (this is step 1.5)
iris_data.data.ravel()[nan_index] = np.nan

# count the number of NaN values and print it
nan_count = np.isnan(iris_data.data).sum()
print(f"Number of NaN values created: {nan_count} ({nan_percent}% of the dataset)")
```

<span style="color:cyan">With the percentage set to 5, data.shape gives the number of samples and features of iris_data, from which the formula computes the number of NaN values to create. np.random.choice then draws that many random flat indices from the data without replacement, ravel() gives a flat view of the array so that those positions can be set to NaN in place, and np.isnan counts the resulting NaN values before printing. A reusable version of this injection step is sketched below.</span>
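Since steps 1.8–1.10 repeat this injection at other percentages, it can help to wrap the logic above in a small helper. This is a minimal sketch, not part of the lab handout: the function name `inject_nan` and the `seed` parameter are my own additions, and a fixed seed is used only so that runs are reproducible.

```python
import numpy as np

def inject_nan(X, percent, seed=42):
    """Return a copy of X with `percent`% of its entries set to NaN."""
    rng = np.random.default_rng(seed)        # fixed seed: reproducible runs
    X = X.copy()                             # leave the caller's array intact
    n_nan = int(X.size * percent / 100)      # same formula as in step 1.5
    idx = rng.choice(X.size, size=n_nan, replace=False)
    X.ravel()[idx] = np.nan                  # ravel() is a view of the copy, so this writes into X
    return X
```

With this helper, `iris_data.data = inject_nan(iris_data.data, 5)` reproduces step 1.5, and changing the percentage argument covers 1.8–1.10.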
:dart: <span style="color:orange">1.6. Preprocess missing data (i.e. nan) by using all the methods in the lecture.</span>

:books: There are five methods:

:notebook: Method 1: ignore rows containing missing data

```python
ignore_rows_data = iris_data.data[~np.isnan(iris_data.data).any(axis=1)]
ignore_rows_target = iris_data.target[~np.isnan(iris_data.data).any(axis=1)]
```

<span style="color:cyan">np.isnan(...).any(axis=1) creates a boolean array with one entry per row of iris_data.data: True for rows containing at least one NaN value and False for rows without NaN. The ~ inverts it, so indexing keeps only the rows that do not contain NaN. The second line builds the matching target array from iris_data.target by keeping only the entries whose corresponding rows in iris_data.data contain no NaN.</span>

:notebook: Method 2: fill in manually

```python
manual_data = iris_data.data.copy()
manual_target = iris_data.target.copy()

for i in range(n_features):
    mean_val = np.mean(manual_data[:, i][~np.isnan(manual_data[:, i])])
    manual_data[:, i][np.isnan(manual_data[:, i])] = mean_val
```

<span style="color:cyan">Make a copy of iris_data.data and of the original target values. The loop iterates over each feature (column), computes the mean of that feature while ignoring NaN, and replaces every NaN in the column with the calculated mean.</span>

:notebook: Method 3: use a global constant

```python
constant_data = iris_data.data.copy()
constant_target = iris_data.target.copy()

constant_value = 0
constant_data[np.isnan(constant_data)] = constant_value
```

<span style="color:cyan">Copy the data and the target values, set the global constant to 0, then use np.isnan to find every position in the data set whose value is NaN and overwrite those positions with the common constant value (0).</span>

:notebook: Method 4: fill with a measure of central tendency

```python
central_data = iris_data.data.copy()
central_target = iris_data.target.copy()

for i in range(n_features):
    central_data[:, i][np.isnan(central_data[:, i])] = np.mean(central_data[:, i][~np.isnan(central_data[:, i])])
```

<span style="color:cyan">For each column, select the entries that are NaN and assign them the mean of the non-NaN entries of that column.</span>

:notebook: Method 5: fill with the most probable value

```python
probable_data = iris_data.data.copy()
probable_target = iris_data.target.copy()

imp = SimpleImputer(strategy="most_frequent")
probable_data = imp.fit_transform(probable_data)
```

<span style="color:cyan">Initialize a SimpleImputer with strategy "most_frequent", so missing values are replaced with the most frequent value of each column. imp.fit_transform performs the substitution and returns the resulting array, which is assigned back to probable_data. All five preprocessed data sets are evaluated with the classifier from 1.7; a sketch that loops over them follows below.</span>
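Before moving on to 1.7: the five methods produce five (data, target) pairs, so they can be collected in a dictionary and scored in one loop instead of copying the classifier code five times. A minimal sketch assuming the variables defined above; the dictionary name `methods` and the loop are my own additions, not part of the lab code:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# each entry: method name -> (preprocessed features, matching targets)
methods = {
    "ignore rows":      (ignore_rows_data, ignore_rows_target),
    "fill manually":    (manual_data, manual_target),
    "global constant":  (constant_data, constant_target),
    "central tendency": (central_data, central_target),
    "most probable":    (probable_data, probable_target),
}

for name, (X, y) in methods.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    acc = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```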
:dart: <span style="color:orange">1.7. For each preprocessing method, use a classification model (e.g., naive Bayes) and evaluate the accuracy.</span>

```python
X_train, X_test, y_train, y_test = train_test_split(ignore_rows_data, ignore_rows_target, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naive Bayes classifier with ignore rows method: {acc}")
```

<span style="color:cyan">train_test_split splits the preprocessed data: the first two arguments are the input data and the class labels, test_size=0.2 reserves 20% of the data for testing, and random_state=42 seeds the split for reproducibility (changing the seed changes which samples end up in the training and test sets). gnb.fit trains the classifier on the training data, gnb.predict makes predictions on the test data using the trained classifier and assigns them to y_pred, and accuracy_score compares those predictions with the true labels to compute the accuracy, which is then printed. The same code is repeated for the other four preprocessing methods.</span>

:dart: <span style="color:orange">1.8. Repeat steps 1.6 and 1.7 with 10% nan values.</span> + <span style="color:orange">1.9. Repeat steps 1.6 and 1.7 with 15% nan values.</span> + <span style="color:orange">1.10. Repeat steps 1.6 and 1.7 with 20% nan values.</span>

<span style="color:cyan">Only the percentage in this line has to change before rerunning the pipeline:</span>

```python
nan_percent = 10   # for 1.8
nan_percent = 15   # for 1.9
nan_percent = 20   # for 1.10
```

<span style="color:cyan">After running the code many times, I observe that the larger the percentage of missing values, the harder the prediction becomes.</span>

:dart: <span style="color:orange">1.11. Use min-max normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.</span>

```python
scaler_min_max = MinMaxScaler()

# perform min-max normalization on the data
iris_data_normalized = scaler_min_max.fit_transform(iris_data.data)
```

<span style="color:cyan">MinMaxScaler normalizes iris_data: fit_transform computes the minimum and maximum value of each feature in the data set and returns a normalized data set in which the minimum of each feature is mapped to 0 and the maximum to 1, i.e. every feature is rescaled to the range [0, 1] by the min-max normalization formula.</span>

:dart: <span style="color:orange">1.12. Use z-normalization on the data set, use a classification model (e.g., naive Bayes), and evaluate the accuracy.</span>

```python
scaler_Z = StandardScaler()
iris_data_normalized = scaler_Z.fit_transform(iris_data.data)
```

<span style="color:cyan">StandardScaler performs z-normalization: fit_transform computes the mean and standard deviation of each feature, then transforms the data by centering each feature around zero and scaling it to unit variance. The normalized data from 1.11 and 1.12 is then classified and evaluated exactly as in 1.7; a sketch is given below.</span>
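The evaluation step for 1.11 and 1.12 follows the same pattern as 1.7. A minimal sketch, assuming a freshly loaded copy of Iris so that the scalers and GaussianNB see no injected NaN values; the loop structure is my own addition:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()  # fresh copy, without the injected NaN values

for name, scaler in [("min-max", MinMaxScaler()), ("z-normalization", StandardScaler())]:
    X = scaler.fit_transform(iris.data)  # rescale the features
    X_train, X_test, y_train, y_test = train_test_split(X, iris.target, test_size=0.2, random_state=42)
    acc = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")
```

Because GaussianNB fits a per-feature mean and variance, a linear rescaling of each feature typically leaves its predictions, and hence the reported accuracy, unchanged; normalization matters mainly for scale-sensitive models.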
<div align="center">

## :computer: Code
[![View on GitHub](https://img.shields.io/badge/View%20on-GitHub-brightgreen)](https://github.com/HalloPeanut/PeanutLab1.github.io)

</div>

## :file_folder: Contact
<center>

![Profile Picture](https://scontent.fsgn2-8.fna.fbcdn.net/v/t39.30808-6/324094541_6363090273704637_6864410919998281766_n.jpg?_nc_cat=102&ccb=1-7&_nc_sid=174925&_nc_ohc=YGw0g-n5ql4AX8HBd8b&_nc_ht=scontent.fsgn2-8.fna&oh=00_AfAuCdyrCxENh_veTI39mwgJRA8zThc-1aweFaRzONpDVA&oe=643E5BCB)

:phone: 0905657088
:email: 10421047@student.vgu.edu.vn
:card_index: [Peanut](https://www.facebook.com/profile.php?id=100014377148812)

</center>