# Notes Database 3

# H2: AI, Data Mining and Machine Learning

## 2.1 Definitions
- Artificial Intelligence: the automation of activities that we associate with human thinking, such as decision making, problem solving and learning.
- Machine Learning: the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed.
- Data Mining: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. Often, machine learning techniques are used in data mining.

## 2.2 Supervised Learning
- The data has a specifically designated attribute, called a label.
- We use the given data to predict the value of the label for instances that have not yet been seen.
    - Classification (categorical label)
    - Regression (numerical label)

## 2.3 Unsupervised Learning
Data that does not have any specifically designated attribute (unlabeled).

# H3: Process Model
![](https://i.imgur.com/BIGh1lU.png)
- It is assumed that ca. 80-90 % of the overall project time is spent on preprocessing, because understanding the business and the data and cleaning the data consume a lot of time.
- Most data analytics algorithms are available off-the-shelf in software libraries or even complete applications.
- Important remark: although some attributes are represented by an integer (like ticket class), they are categorical by nature. This is an important observation when building a model.

# H4: Data Exploration
https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/Exploration_Titanic.ipynb

# H5: Classification Algorithms

## 5.1 Naive Bayes Classification (Probability Theory)
- Conditional probability is the probability that something will happen, given that something else has already occurred. Using conditional probability, we can calculate the probability of an event using prior knowledge.
- **P(H|E) = P(E|H) ∗ P(H) / P(E)**
    - P(H) is the probability of hypothesis H being true. This is known as the prior probability.
    - P(E) is the probability of the evidence (regardless of the hypothesis).
    - P(E|H) is the probability of the evidence given that the hypothesis is true.
    - P(H|E) is the probability of the hypothesis given that the evidence is there.
- The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP).
- The Naive Bayes classifier assumes that all features are unrelated to each other: the presence or absence of a feature does not influence the presence or absence of any other feature.

### 5.1.1 Example: Fruits
![](https://i.imgur.com/zIOnLc8.png)

Prior Probabilities
```
P(Banana) = 500/1000
P(Orange) = 300/1000
P(Others) = 200/1000
```

Evidence Probabilities
```
P(Long)   = 500/1000
P(Sweet)  = 650/1000
P(Yellow) = 800/1000
```

Likelihood Probabilities
```
P(Long | Banana)   = 400/500
P(Long | Orange)   = 0
P(Yellow | Others) = 50/200
```

Given a fruit, how do we classify it?
```
P(Banana | Long, Sweet and Yellow)
P(Orange | Long, Sweet and Yellow)
P(Others | Long, Sweet and Yellow)

P(Banana | Long, Sweet and Yellow)
= P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana) * P(Banana) / (P(Long) * P(Sweet) * P(Yellow))
= 0.8 * 0.7 * 0.9 * 0.5 / P(Evidence)
= 0.252 / P(Evidence)

P(Orange | Long, Sweet and Yellow) = 0

P(Others | Long, Sweet and Yellow) = ...
= 0.01875 => We can classify this fruit as likely to be a Banana ``` ### 5.1.2 Gaussian Naive Bayes (Normally Distributed) - When attributes can take lots of possible values or are even continuous it’s not longer that easy to calculate the probabilities for each combination in your dataset. - This model can be fit by simply finding the mean and standard deviation of the points within each label, which is all you need to define such a distribution. ![](https://i.imgur.com/oM4GpZm.png) ![](https://i.imgur.com/hkThcLO.png) ### 5.1.3 Multinomial Naive Bayes - The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates from a simple multinomial distribution. - For example used when predicting spam based on count rates of words. ## 5.2 Tree Classification ### 5.2.1 Introduction - This algorithm is a widely-used method of constructing a model from a dataset in the form of a decision tree which has the advantage of being meaningful and easy to interpret. - In a well-constructed tree, each question will cut the number of options by approximately half, very quickly narrowing the options even among a large number of classes. - Each node will iteratively split the data along one or the other axis according to some quantitative criterion, and at each level assign the label of the new region according to a majority vote of points within it. ### 5.2.2 Creating a decision tree ![](https://i.imgur.com/tPS5xfc.png) - Overfitting turns out to be a general property of decision trees; it is very easy to go too deep in the tree, and thus to fit details or **noise** of the particular data rather than the overall properties of the distributions they are drawn from. ![](https://i.imgur.com/YXEp1Nm.png) ### 5.2.3 Random Forests (Ensembles) - The key observation is that the inconsistencies tend to happen where the classification is less certain, and thus by using information from both of these trees, we might come up with a better result. - Multiple overfitting estimators can be combined to reduce the effect of this overfitting, which underlies an ensemble method called **bagging**. This uses **parallel** estimators, each of which overfits the data, and averages the results to find a better classification. - The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data. - The randomness injected into the training process includes: - Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping). - Considering different random subsets of features to split on at each tree node. - To make a prediction on a new instance, a random forest must **aggregate** the predictions from its set of decision trees, which is done differently for classification and regression. - Classification : Majority vote. Each tree’s prediction is counted as a vote for one class. The label is predicted to be the class which receives the most votes. - Regression : Averaging. Each tree predicts a real value. The label is predicted to be the average of the tree predictions. ## 5.3 Logistic Regression - A technique in which essentially a (linear or nonlinear) decision boundary is determined, used for **binary classification**. 
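The probability that logistic regression outputs is obtained by passing the linear score through the logistic (sigmoid) function. The following is a minimal, self-contained sketch of that step; the weights and the sample point are made-up numbers and are not part of the course notebook.

```python=
# minimal sketch: how logistic regression turns a linear score into a probability
# (the weights w, intercept b and the point x are made-up numbers)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -0.8]), 0.2      # hypothetical coefficients for two features
x = np.array([2.0, 1.0])               # hypothetical instance
score = np.dot(w, x) + b               # the linear part: w1*x1 + w2*x2 + b
print(sigmoid(score))                  # probability of the positive class, here about 0.92
```
Instances with a probability above 0.5 (a positive score) fall on one side of the decision boundary; all others fall on the other side.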
![](https://i.imgur.com/wuJnvQm.png) - In the above example we have to determine a line with the equation ![](https://i.imgur.com/BeFpEMc.png) ## 5.4 Ensemble Methods - Random Forest is an example of an ensemble method in which several estimators of the same type (i.e. decision tree) are combined to determine the estimated class by **voting**. - The same principle can be used when combining estimators of different types and predict the label based on a voting mechanism, which improves the predictive accuracy. ## 5.5 Estimating the predictive accuracy of a classifier ### 5.5.1 Introduction - The predictive accuracy of a classifier is the proportion of a set of unseen instances that it correctly classifies - The available data is split into two parts: a training set and a test set. - The training set is used to construct a classifier which is then used to predict the classification for the instances in the test set. - If the test set contains N instances of which C are correctly classified the predictive accuracy of the classifier for the test set is; **p = C/N**. This can also be used as an estimate of its performance on any unseen dataset. ![](https://i.imgur.com/DKqmHfM.png) - A random division into two parts in proportions such as 1:1, 2:1, 70:30 or 60:40 is customary (the largest part is the training set). Obviously, the larger the proportion of the training set, the better the model, but the worse the correctness of the calculated accuracy and vice versa. ### 5.5.2 Other estimators for the predictive accuracy - Suppose you want to predict if a person has cancer based on some features like tumor size and age. Only a very small subset of the people really have cancer, so your model should at least be better than a “guessing” method. - We make use of a **confusion matrix** ![](https://i.imgur.com/jVNcF5R.png) - We divide different types of errors - Type I: False positives - Type II: False negatives ![](https://i.imgur.com/Ecdo5O8.png) ![](https://i.imgur.com/rxz1N4w.png) ![](https://i.imgur.com/hfasKaQ.png) ### 5.5.3 Exercise: Down ![](https://i.imgur.com/ag1c2zo.png) # H6: Machine Learning with Python Scikit-Learn ## 6.1 Naive Bayes Classification ### 6.1.1 Data Exploration ```python= # import the library import pandas as pd # read the data into a dataframe url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) # show the first 5 lines titanic.head() ``` ```python= # explore the data to estimate if we have enough (statistically relevant) data for both classes titanic.groupby('Survived').count() ``` ![](https://i.imgur.com/6x4q635.png) ### 6.1.2 Data Preparation or Feature Selection ```python= # We drop clearly irrelevant attributes. Pay attention for bias! Don't let your own opinion play. 
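# (added note, not in the original notebook) before dropping columns it can help to see
# how many values are missing per column; in the well-known Titanic data set the 'Cabin'
# column is largely empty, which is one more reason to leave it out
# print(titanic.isnull().sum())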
titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1)
titanic.head()
```

![](https://i.imgur.com/gUrJbep.png)

### 6.1.3 Data Cleaning
```python=
print('Before')
print(titanic.count())
print()
# drop all lines that contain empty (null or NaN) values
titanic = titanic.dropna()
print('After')
print(titanic.count())
```

![](https://i.imgur.com/vudsyqz.png)
![](https://i.imgur.com/FnBBIKk.png)

```python=
# see what remains
titanic.groupby('Survived').count()
```

![](https://i.imgur.com/aeBKSCf.png)

### 6.1.4 Data Type Conversion
```python=
# convert string to numeric for input of machine learning algorithms
# numpy is a Python library that offers lots of data manipulation functions
import numpy as np
titanic['Sex'] = np.where(titanic['Sex']=='male', 1, 2)
titanic.head()
```

![](https://i.imgur.com/rfK1ury.png)

### 6.1.5 Define feature matrix, label column and split the data set
```python=
# define feature matrix and label column
import sklearn
from sklearn.model_selection import train_test_split
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
# split the data set in a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
```

![](https://i.imgur.com/wHNIOAJ.png)
![](https://i.imgur.com/B7bpQsf.png)

### 6.1.6 Choose the model and fit the data
```python=
# assume values are Gaussian distributed, so choose Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# here the training happens: the model is constructed based on the training data
model.fit(X_train,y_train)
```

### 6.1.7 Estimate the predictive accuracy of the classifier
```python=
# use the model to predict the label for the test data
y_test2 = model.predict(X_test)
```

```python=
# compare the predicted labels with the real values to obtain the estimated accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test2)

0.7581395348837209
```
We see that the accuracy is about 75 %, which is better than simply predicting "not survived" for everyone (only 61 %).

```python=
# Determine the false negative rate: what's the proportion of the passengers
# who survived that we declared dead (assuming survived = positive)
results = pd.DataFrame({'true':y_test,'estimated':y_test2})
print(results.head())
```

![](https://i.imgur.com/UrSp9Z1.png)

```python=
results['TP'] = np.where((results['true'] == 1) & (results['estimated'] == 1),1,0)
results['TN'] = np.where((results['true'] == 0) & (results['estimated'] == 0),1,0)
results['FP'] = np.where((results['true'] == 0) & (results['estimated'] == 1),1,0)
results['FN'] = np.where((results['true'] == 1) & (results['estimated'] == 0),1,0)
```

![](https://i.imgur.com/QIQ8Rul.png)

```python=
FNrate = results['FN'].sum()/(results['FN'].sum() + results['TP'].sum())
print(FNrate)

0.24
```
This means that for about 24 % of all passengers who did survive we predicted they didn't.

```python=
# show confusion matrix
from sklearn.metrics import confusion_matrix
# Matplotlib is a Python visualization library
import matplotlib.pyplot as plt
# set the matplotlib visualization style
plt.style.use('classic')
# specify that matplotlib graphs are shown "inline" in the output
%matplotlib inline
# Seaborn is a Python data visualization library based on matplotlib
import seaborn as sns; sns.set()
```

```python=
# calculate the confusion matrix.
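# (added note) in scikit-learn the rows of confusion_matrix correspond to the true classes
# and the columns to the predicted classes; the transpose (mat.T) used below puts the
# predicted class on the vertical axis of the heatmap, matching the axis labels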
mat = confusion_matrix(y_test,y_test2) print(mat) #rename data labels: 0 = not survived, 1 = survived labels = ['not survived','survived'] # keep the alphanumeric order of the original class labels! # mat.T = transpose the matrix # data labels (0,1) are sorted from left to right (for the horizontal axis) # square=True: each cell will be square-shaped # and from top to bottom (for the vertical axis) # annot=True: data value in each cell # fmt='d': format labels as 'double' (not scientific notation) # cbar=True: color side bar sns.heatmap(mat.T,square=True, annot=True, fmt='d', cbar=True, xticklabels=labels, yticklabels=labels) plt.xlabel('true category') plt.ylabel('predicted category') ``` ![](https://i.imgur.com/DTDpTAo.png) ## 6.2 Random Forest Classification ### 6.2.0 Random Tree ```python= # The process of fitting a decision tree to our data can be done as follows; # However, it’s better to use Random Forest to reduce overfitting. from sklearn.tree import DecisionTreeClassifier tree = DecisionTreeClassifier().fit(X, y) ``` ### 6.2.1 Data Exploration ```python= # import the library import pandas as pd url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) titanic.head() ``` ```python= # explore the data to estimate if we have enough (statistically relevant) data for both classes titanic.groupby('Survived').count() ``` ### 6.2.2 Data Preparation ```python= # We drop clearly irrelevant attributes. Pay attention for bias! Don't let your own opinion play. titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1) titanic.head() ``` ### 6.2.3 Data Cleaning ```python= print('Before') print(titanic.count()) print() # drop all lines that contain empty (null or NaN) values titanic = titanic.dropna() print('After') print(titanic.count()) ``` ```python= # see what remains titanic.groupby('Survived').count() ``` ### 6.2.4 Data Type Conversion ```python= import numpy as np titanic['Sex'] = np.where(titanic['Sex']>='male', 1, 2) titanic.head() ``` ### 6.2.5 Define feature matrix, label column and split the data set ```python= from sklearn.model_selection import train_test_split X = titanic.drop('Survived',axis=1) y = titanic['Survived'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) ``` ### 6.2.6 Choose the model and fit the data ```python= from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(n_estimators=300) model.fit(X_train, y_train) ``` ### 6.2.7 Estimate the predictive accuracy of the classifier ```python= y_test2 = model.predict(X_test) ``` ```python= from sklearn.metrics import accuracy_score accuracy_score(y_test, y_test2) 0.813953488372093 ``` OR ```python= # We could also try to find the optimal number of trees in a automated way. best_accuracy = 0 best_trees = 0 for trees in range(50,550,50): model = RandomForestClassifier(n_estimators=trees) model.fit(X_train, y_train) y_test2 = model.predict(X_test) accuracy = accuracy_score(y_test, y_test2) if accuracy > best_accuracy: best_accuracy = accuracy best_trees = trees print('Optimal number of trees = % s' %(best_trees)) print('Accuracy = % 3.2f' % (best_accuracy)) Optimal number of trees = 350 Accuracy = 0.82 ``` - The above approach involves a certain risk of "leaking" the test data to the tuning of the algorithm. 
- A better approach would be to use a training set, a validation set and a test set: use the validation set for hyperparameter tuning and only use the test set to determine the accuracy after this tuning.

```python=
X_remainder, X_test, y_remainder, y_test = train_test_split(X,y,test_size=0.30)
best_accuracy = 0
best_trees = 0
for trees in range(50,550,50):
    X_train, X_validation, y_train, y_validation = train_test_split(X_remainder,y_remainder,test_size=0.30)
    model = RandomForestClassifier(n_estimators=trees)
    model.fit(X_train, y_train)
    y_validation2 = model.predict(X_validation)
    accuracy = accuracy_score(y_validation, y_validation2)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_trees = trees
        best_test = model.predict(X_test)
        best_model = model
print('Optimal number of trees = % s' %(best_trees))
print('Accuracy on validation set = % 3.2f' % (best_accuracy))
accuracyOnTestSet = accuracy_score(y_test, best_test)
print('Accuracy on test set = % 3.2f' % (accuracyOnTestSet))

Optimal number of trees = 50
Accuracy on validation set = 0.83
Accuracy on test set = 0.80
```

- Even though the accuracy of Random Forest is very close to that of Naive Bayes, Decision Tree and Random Forest classifiers have one major advantage: you can determine the relative importance of each feature.

```python=
print(X_train.columns)
print(best_model.feature_importances_)
```

```python=
# we now combine those two collections into a dataframe
pd.DataFrame(best_model.feature_importances_,columns=['Importance'],index=X_train.columns).sort_values(by='Importance',ascending=False)
```

![](https://i.imgur.com/qIvzZiX.png)

```python=
# Determine the false negative rate: what's the proportion of the passengers
# who survived that we declared dead.
results = pd.DataFrame({'true':y_test,'estimated':y_test2})
results['TP'] = np.where((results['true'] == 1) & (results['estimated'] == 1),1,0)
results['TN'] = np.where((results['true'] == 0) & (results['estimated'] == 0),1,0)
results['FP'] = np.where((results['true'] == 0) & (results['estimated'] == 1),1,0)
results['FN'] = np.where((results['true'] == 1) & (results['estimated'] == 0),1,0)
FNrate = results['FN'].sum()/(results['FN'].sum() + results['TP'].sum())
print(FNrate)

0.5851063829787234
```

### 6.2.8 One Hot Encoding
With the numeric encoding used above (male = 1, female = 2) the algorithm would assume that a female is "twice" a male, which makes no sense. One hot encoding effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0 respectively.

We can replace the line
```python=
titanic['Sex'] = np.where(titanic['Sex']>='male', 1, 2)
```
by
```python=
titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"])
```

Let's now redo the modeling using one-hot encoding for Sex; a small illustration of what `get_dummies` produces is sketched right below. For brevity we don't search for the optimal number of trees, but use what we have found above.
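As a quick, hedged illustration (not part of the original notebook) of what `pd.get_dummies` does to the Sex column, consider a two-row toy frame:

```python=
# toy illustration of one hot encoding with get_dummies (made-up data)
import pandas as pd

demo = pd.DataFrame({'Sex': ['male', 'female'], 'Age': [40, 27]})
print(pd.get_dummies(demo, columns=['Sex'], prefix=['Sex']))
# the Sex column is replaced by Sex_female and Sex_male columns containing 0/1
# (or booleans, depending on the pandas version)
```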
```python= url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1) titanic = titanic.dropna() titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"]) print(titanic.head()) X = titanic.drop('Survived',axis=1) y = titanic['Survived'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) model = RandomForestClassifier(n_estimators=best_trees) # ues the optimal number of trees we have found above model.fit(X_train, y_train) y_test2 = model.predict(X_test) accuracy_score(y_test, y_test2) ``` ![](https://i.imgur.com/2IOQ8Sl.png) ```python= importances = pd.DataFrame(model.feature_importances_,columns=['Importance'],index=X_train.columns).sort_values(by='Importance',ascending=False).reset_index() importances ``` ![](https://i.imgur.com/4bvb5lm.png) ```python= # We can group these relative importances together and make the sum of there values: importances['index'] = np.where(importances['index'].str.startswith ('Sex'),'Sex',importances['index']) imp = importances.groupby(['index'])['Importance'].sum().reset_index().sort_values(by='Importance',ascending=False).reset_index() imp ``` ![](https://i.imgur.com/pZ4MSxW.png) ## 6.3 Logistic Regression Classification ### 6.3.1 Data Exploration ```python= # import the library import pandas as pd url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) titanic.head() ``` ### 6.3.2 Data Preparation ```python= titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1) titanic.head() ``` ![](https://i.imgur.com/5jSvOBA.png) ### 6.3.3 Data Cleaning ```python= titanic = titanic.dropna() ``` ![](https://i.imgur.com/ws6zdBS.png) ### 6.3.4 Data Type Conversion ```python= titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"]) print(titanic.head()) ``` ### 6.3.5 Define feature matrix, label column and split the data set ```python= from sklearn.model_selection import train_test_split X = titanic.drop('Survived',axis=1) y = titanic['Survived'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) ``` ### 6.3.6 Choose the model and fit the data ```python= from sklearn.linear_model import LogisticRegression model = LogisticRegression(solver='newton-cg') model.fit(X_train, y_train) y_test2 = model.predict(X_test) from sklearn.metrics import accuracy_score accuracy_score(y_test, y_test2) 0.7906976744186046 ``` ## 6.4 Ensemble Classification ### 6.4.1 Data Exploration ```python= # import the library import pandas as pd url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) titanic.head() ``` ### 6.4.2 Data Preparation ```python= # We drop clearly irrelevant attributes. Pay attention for bias! Don't let your own opinion play. 
titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1) titanic.head() ``` ### 6.4.3 Data Cleaning ```python= titanic = titanic.dropna() ``` ### 6.4.4 Data Type Conversion ```python= titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"]) ``` ### 6.4.5 Define feature matrix, label column and split the data set ```python= from sklearn.model_selection import train_test_split X = titanic.drop('Survived',axis=1) y = titanic['Survived'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) ``` The class is selected based on a voting mechanism - **Hard**: counts for each instance the number of times each predicted class appears and chooses the class which appears most. - **Soft**: predicts the class label based on the sum of the predicted probabilities of each class, which is recommended in general. ### 6.4.6 Choose the model and fit the data ```python= from sklearn.naive_bayes import GaussianNB from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier, VotingClassifier lr = LogisticRegression(solver='newton-cg') rf100 = RandomForestClassifier(n_estimators=100) rf150 = RandomForestClassifier(n_estimators=150) rf200 = RandomForestClassifier(n_estimators=200) rf250 = RandomForestClassifier(n_estimators=250) gnb = GaussianNB() model = VotingClassifier(estimators=[('lr', lr), ('rf100', rf100),('rf150', rf150), ('rf200', rf200), ('rf250', rf250), ('gnb', gnb)], voting='soft') model.fit(X_train, y_train) y_test2 = model.predict(X_test) from sklearn.metrics import accuracy_score accuracy_score(y_test, y_test2) 0.772093023255814 ``` ## 6.5 Model Deployment ### 6.5.1 Build Model ```python= # we first build the model. import pandas as pd url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv' titanic = pd.read_csv(url) titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1) titanic = titanic.dropna() titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"]) from sklearn.model_selection import train_test_split X = titanic.drop('Survived',axis=1) y = titanic['Survived'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) from sklearn.naive_bayes import GaussianNB from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier, VotingClassifier lr = LogisticRegression(solver='newton-cg') rf100 = RandomForestClassifier(n_estimators=100) rf150 = RandomForestClassifier(n_estimators=150) rf200 = RandomForestClassifier(n_estimators=200) rf250 = RandomForestClassifier(n_estimators=250) gnb = GaussianNB() model = VotingClassifier(estimators=[('lr', lr), ('rf100', rf100),('rf150', rf150), ('rf200', rf200), ('rf250', rf250), ('gnb', gnb)], voting='soft') model.fit(X_train, y_train) ``` ### 6.5.2 Save Model ```python= # we now save the model to a file # see https://scikit-learn.org/stable/modules/model_persistence.html from google.colab import drive drive.mount('/content/gdrive') from joblib import dump dump(model, '/content/gdrive/My Drive/survival_prediction_model.joblib') ``` ### 6.5.3 Use Model ```python= # We will now use this model to guess wether or not an unseen passenger (one that has not been used to build the model) # has survived or not. 
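# (added outline, not in the original notebook) the helper below:
#   1. builds a one-row DataFrame with the feature columns the model was trained on
#   2. one-hot encodes Sex by hand (get_dummies can't be used here because a single
#      passenger only carries one of the two categories)
#   3. returns the predicted class together with the highest class probability
#      obtained from predict_proba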
def PredictSurvival(model,Pclass,Sex,Age,SibSp,Parch): import pandas as pd passenger=pd.DataFrame(columns=['Pclass','Sex','Age','SibSp','Parch']) new_passenger = {'Pclass':Pclass, 'Sex':Sex, 'Age':Age, 'SibSp':SibSp, 'Parch':Parch} passenger = passenger.append(new_passenger,ignore_index=True) if Sex == 'male': passenger['Sex_male'] = 1 passenger['Sex_female'] = 0 else: passenger['Sex_male'] = 0 passenger['Sex_female'] = 1 passenger.drop(columns=['Sex'],axis=1,inplace=True) # we can't use pd.get_dummies here because not all values (male,female) are available # for a single customer survived = model.predict(passenger) # most sklearn algorithms also offer a predict_proba method that returns an array of # probabilities per class: survived_proba = model.predict_proba(passenger) return survived[0],survived_proba[0].max() from joblib import load model = load('/content/gdrive/My Drive/survival_prediction_model.joblib') survived = PredictSurvival(model,Pclass=3,Sex='male',Age=40,SibSp=0,Parch=0) print(survived) survived = PredictSurvival(model,Pclass=1,Sex='female',Age=27,SibSp=0,Parch=0) print(survived) (1, 1.0) (0, 0.93) ``` # H7: Regression ## 7.1 Simple Linear Regression ### 7.1.1 Data Exploration ```python= import pandas as pd url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/wintertempbrussels.xlsx' brussels = pd.read_excel(url) print(brussels.head(3)) print(brussels.tail(3)) ``` ### 7.1.2 Data Visualisation ```python= # Plot the dataset import matplotlib.pyplot as plt # enable Matplotlib in this notebook. %matplotlib inline xmin = brussels.Year.min() - 5 xmax = brussels.Year.max() + 5 ymin = brussels.Temperature.min() - 2 ymax = brussels.Temperature.max() + 2 plt.xlim([xmin, xmax]) plt.ylim([ymin,ymax]) plt.scatter(brussels.Year, brussels.Temperature) plt.xlabel('Year') plt.ylabel('Temperature') plt.show() ``` ![](https://i.imgur.com/ueUPxpZ.png) ### 7.1.3 Splitting the Data for Training and Testing By default the `sklearn.linear_model` estimator uses all the numeric features in a dataset to perform multiple linear regression. For simple linear regression select one feature as the independent variable In this example Date and Temperature are the **independent** and **dependent** variable respectively. ```python= from sklearn.model_selection import train_test_split X = brussels.drop('Temperature',axis=1) y = brussels['Temperature'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20) print(X_train.shape) print(X_test.shape) (148, 1) (38, 1) ``` ### 7.1.4 Training the Model ```python= from sklearn.linear_model import LinearRegression linear_regression = LinearRegression() linear_regression.fit(X=X_train, y=y_train) ``` LinearRegression estimator iteratively adjusts the slope and intercept to minimize the sum of the squares of the data points’ distances from the line to find the best regression line for the data. ![](https://i.imgur.com/wdArl5U.png) ```python= # The slope (i.e. 
yearly increase) can be defined as linear_regression.coef_ array([0.01056579]) ``` ```python= # Estimated temperature in year 0 linear_regression.intercept_ -17.747780091567012 ``` ### 7.1.5 Testing the Model ```python= predicted = linear_regression.predict(X_test) expected = y_test ``` ```python= # The code predicted[::5] uses a step of 5 to create a slice with every 5th element for p, e in zip(predicted[::5], expected[::5]): # check every 5th element print(f'predicted: {p:.2f}, expected: {e:.2f}') ``` ### 7.1.6 Determine the accuracy of the regression model ![](https://i.imgur.com/nd1dn1f.png) ```python= from sklearn import metrics MAE = metrics.mean_absolute_error(expected,predicted) print('Mean Absolute Error: '+ str(MAE)) print() MSE = metrics.mean_squared_error(expected,predicted) print('Mean Squared Error: '+ str(MSE)) print() import numpy as np RMSE = np.sqrt(metrics.mean_squared_error(expected,predicted)) print('Root Mean Squared Error: '+ str(RMSE)) print() r2 = metrics.r2_score(expected,predicted) print('R square: ' + str(r2)) print() ``` ### 7.1.7 Predicting Temperatures ```python= # lambda implements y = mx + b predict = (lambda x: linear_regression.coef_ * x + linear_regression.intercept_) print(predict(1890)) print(predict(2020)) print(predict(2100)) [2.2215542] [3.5951063] [4.44036912] ``` ### 7.1.8 Visualize Regression Line ```python= plt.scatter(X_train,y_train) plt.plot(X_train, linear_regression.intercept_ + X_train * linear_regression.coef_, color='red') plt.show() ``` ![](https://i.imgur.com/mS27uzf.png) ### 7.1.9 Exercise ```python= # Is global warming getting worse since 1950? brussels = brussels[brussels['Year'] >= 1950] ``` ```python= xmin = brussels.Year.min() - 5 xmax = brussels.Year.max() + 5 ymin = brussels.Temperature.min() - 2 ymax = brussels.Temperature.max() + 2 plt.xlim([xmin, xmax]) plt.ylim([ymin,ymax]) plt.scatter(brussels.Year, brussels.Temperature) plt.xlabel('Year') plt.ylabel('Temperature') plt.show() ``` ![](https://i.imgur.com/CHIHK8V.png) ```python= X = brussels.drop('Temperature',axis=1) y = brussels['Temperature'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20) linear_regression.fit(X=X_train, y=y_train) linear_regression.coef_ array([0.02809793]) ``` ```python= # What will, according to this model, be the temperature in 2100? print(predict(2100)) [6.52241764] ``` ```python= predicted = linear_regression.predict(X_test) expected = y_test MAE = metrics.mean_absolute_error(expected,predicted) print('Mean Absolute Error: '+ str(MAE)) print() Mean Absolute Error: 0.8281620022711979 ``` ## 7.2 Multiple Linear Regression - This concept of linear can be extended to cases where there are more than two variables. For instance, the dependent variable (target) is dependent upon several independent variables. 
- You can use the multiple linear regression to find out which factor has the highest impact on the predicted output and how different variables relate to each other ![](https://i.imgur.com/kFLfyZL.png) ### 7.2.1 Data Exploration ```python= import pandas as pd import numpy as np import matplotlib.pyplot as plt # Read the file Advertising.csv url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/advertising.csv' data = pd.read_csv(url) data.head() ``` ```python= # Give the dimensions of the dataset data.shape (200, 5) ``` ### 7.2.2 Data Preparation ```python= # Drop the column Unnamed: 0 data = data.drop('Unnamed: 0', axis = 1) data.head() ``` ### 7.2.3 Data Visualisation ```python= import math def roundup(x): return int(math.ceil(x / 10.0)) * 10 def rounddown(x): return int(math.floor(x / 10.0)) * 10 ``` ```python= # Plot the dataset Sales vs TV # First calculate the minimum and the maximumvalue for TV xminTV = rounddown(data['TV'].min()) - 50 xmaxTV = roundup(data['TV'].max()) + 50 plt.scatter(data['TV'], data['Sales']) plt.xlim([xminTV, xmaxTV]) plt.xlabel('TV') plt.ylabel('Sales') ``` ![](https://i.imgur.com/p90NBsL.png) ```python= # Plot the dataset Sales vs Radio # First calculate the minimum and the maximumvalue for Radio xminRadio = rounddown(data['Radio'].min()) - 20 xmaxRadio = roundup(data['Radio'].max()) + 20 plt.scatter(data['Radio'], data['Sales'], color='purple') plt.xlim([xminRadio, xmaxRadio]) plt.xlabel('Radio') plt.ylabel('Sales') ``` ![](https://i.imgur.com/GwCNRNf.png) ```python= # Plot the dataset Sales vs Newspaper # First calculate the minimum and the maximumvalue for Newspaper xminNewspaper = rounddown(data['Newspaper'].min()) - 20 xmaxNewspaper = roundup(data['Newspaper'].max()) + 20 plt.scatter(data['Newspaper'], data['Sales'], color='yellow') plt.xlim([xminNewspaper, xmaxNewspaper]) plt.xlabel('Newspaper') plt.ylabel('Sales') ``` ![](https://i.imgur.com/VJAOJFt.png) In this case, `y = b0 + m1∗TV + m2∗Radio + m3∗Newspaper` ### 7.2.4 Splitting the Data for Training and Testing ```python= # Use LinearRegression to predict the Sales from sklearn.model_selection import train_test_split X = data.drop('Sales',axis=1) y = data['Sales'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) ``` ### 7.2.5 Training the Model ```python= from sklearn import metrics from sklearn.linear_model import LinearRegression # First we are using LinearRegression model = LinearRegression() model.fit(X_train,y_train) #To retrieve the intercept: print("Intercept") print(model.intercept_) print() #For retrieving the coefficients: print("Coefficients") print(model.coef_) Intercept 3.0303203591416192 Coefficients [0.04758891 0.17324881 0.00411952] ``` Depending on the random split in training and test set we get something like `y = 2.99 + 0.0448∗TV + 0.191∗Radio − 0.003∗Newspaper` Important notes: - This is a statement of correlation, not causation - An increase in Newspaper ad spending is associated with a decrease in sales because the Newspaper coefficient is negative. 
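To make the fitted model concrete, the short sketch below plugs one hypothetical advertising budget into it. The budget numbers are made up; the `model` object, the column names and the coefficients are the ones from the cells above, and the exact outcome will differ with a different random train/test split.

```python=
# hedged sketch: predict the sales for one made-up advertising budget
import pandas as pd
budget = pd.DataFrame({'TV': [150.0], 'Radio': [25.0], 'Newspaper': [20.0]})
print(model.predict(budget))
# by hand, with the coefficients printed above:
# 3.03 + 0.0476*150 + 0.1732*25 + 0.0041*20  ≈ 14.6
```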
```python= # Calculate the common evaluation metrics for regression problems y_predict = model.predict(X_test) MAE = metrics.mean_absolute_error(y_test,y_predict) print('Mean Absolute Error: '+ str(MAE)) print() MSE = metrics.mean_squared_error(y_test,y_predict) print('Mean Squared Error: '+ str(MSE)) print() RMSE = np.sqrt(metrics.mean_squared_error(y_test,y_predict)) print('Root Mean Squared Error: '+ str(RMSE)) print() r2 = metrics.r2_score(y_test,y_predict) print('R square: ' + str(r2)) print() Mean Absolute Error: 1.3046420399153504 Mean Squared Error: 3.0339274271223067 Root Mean Squared Error: 1.7418172771913554 R square: 0.8786156588740994 ``` ## 7.3 Polynomial Regression A regression equation is a polynomial regression equation if the power of an independent variable is more than 1, noted as `y= a+b∗x+ c∗x²` ![](https://i.imgur.com/DAhC8zB.png) ## 7.3.1 Data Exploration ```python= import pandas as pd import numpy as np import matplotlib.pyplot as plt url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/wintertempbrussels.xlsx' brussels = pd.read_excel(url) brussels.head() ``` ## 7.3.2 Splitting the Data for Training and Testing ```python= from sklearn.model_selection import train_test_split X = brussels.drop('Temperature',axis=1) y = brussels['Temperature'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) from sklearn import metrics from sklearn.preprocessing import PolynomialFeatures from sklearn.linear_model import LinearRegression ``` ## 7.3.3 Training the Model ```python= # Now we are using Polynomial Regression. # This is still LinearRegression because the coefficients/weights associated with the features are still linear poly = PolynomialFeatures(degree=2) # fit_transform will turn x**2 into a feaure X_train_transform = poly.fit_transform(X_train) X_test_transform = poly.fit_transform(X_test) model = LinearRegression() model.fit(X_train_transform,y_train) #To retrieve the intercept: print("Intercept") print(model.intercept_) print() #For retrieving the coefficients: print("Coefficients") print(model.coef_) print() Intercept 162.54852970489443 Coefficients [ 0.00000000e+00 -1.78486448e-01 4.95971814e-05] ``` ```python= y_predict = model.predict(X_test_transform) MAE = metrics.mean_absolute_error(y_test,y_predict) print('Mean Absolute Error: '+ str(MAE)) print() MSE = metrics.mean_squared_error(y_test,y_predict) print('Mean Squared Error: '+ str(MSE)) print() RMSE = np.sqrt(metrics.mean_squared_error(y_test,y_predict)) print('Root Mean Squared Error: '+ str(RMSE)) print() mean = brussels['Temperature'].mean() print ('Mean: ' + str(mean)) print() r2 = metrics.r2_score(y_test,y_predict) print('R square: ' + str(r2)) print() Mean Absolute Error: 1.4291245632304352 Mean Squared Error: 3.040690942327554 Root Mean Squared Error: 1.7437577074604012 Mean: 2.7284946236559136 R square: -0.03916388267722226 ``` ## 7.3.4 Data Visualisation ```python= # Calculate the result of the polynomial for a specific value of x def p(x): result = model.intercept_ for i in range(0, len(model.coef_)): result += model.coef_[i] * x**i return result ``` ```python= # Plot the dataset plt.scatter(X_test, y_test) xmin = brussels['Year'].min() xmax = brussels['Year'].max() plt.xlim([xmin, xmax]) plt.xlabel('year') plt.ylabel('temperature') # Plot the polynomial t1 = np.arange(xmin, xmax, 0.01) plt.plot(t1, p(t1), color='red') plt.show() ``` ![](https://i.imgur.com/kgJWJKH.png) ```python= # We will now use a for loop to create a model voor polynomials of 
degree = 1 .. 5 and to write out the root mean squared error from sklearn.model_selection import train_test_split X = brussels.drop('Temperature',axis=1) y = brussels['Temperature'] X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30) for i in range(1,6): poly = PolynomialFeatures(degree=i) X_train_transform = poly.fit_transform(X_train) X_test_transform = poly.fit_transform(X_test) model = LinearRegression() model.fit(X_train_transform,y_train) y_predict = model.predict(X_test_transform) RMSE = np.sqrt(metrics.mean_squared_error(y_test,y_predict)) print('Root Mean Squared Error for i (test set) = ' + str(i) + ' is '+ str(RMSE)) print() y_predict = model.predict(X_train_transform) RMSE = np.sqrt(metrics.mean_squared_error(y_train,y_predict)) print('Root Mean Squared Error for i (training set) = ' + str(i) + ' is '+ str(RMSE)) print() Root Mean Squared Error for i (test set) = 1 is 1.6792491728402292 Root Mean Squared Error for i (training set) = 1 is 1.6274607486082815 Root Mean Squared Error for i (test set) = 2 is 1.6579951890388185 Root Mean Squared Error for i (training set) = 2 is 1.6214100836516399 Root Mean Squared Error for i (test set) = 3 is 1.6695299217903294 Root Mean Squared Error for i (training set) = 3 is 1.6170428308597287 Root Mean Squared Error for i (test set) = 4 is 1.6700552702359055 Root Mean Squared Error for i (training set) = 4 is 1.6167530117877327 Root Mean Squared Error for i (test set) = 5 is 1.670593847414687 Root Mean Squared Error for i (training set) = 5 is 1.6164627174172808 ``` This is a clear example of overfitting: accuracy of the test set is getting better till degree 2. For higher degrees the accuracy is getting worse because of overfitting. This is confirmed by a better accuracy of the training set for higher degrees. 
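The degree could also be chosen with cross-validation instead of a single train/test split, which avoids tuning on the test set (the same concern raised for the random forest earlier). The following is a minimal sketch, assuming `X` and `y` are the Brussels winter-temperature feature matrix and label from the cells above; it is not part of the original notebook.

```python=
# hedged sketch: choose the polynomial degree with 5-fold cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

for degree in range(1, 6):
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_root_mean_squared_error')
    print('degree %d: mean RMSE over the folds = %.3f' % (degree, -scores.mean()))
```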
## 7.4 Random Forest Regression
### 7.4.1 Data Exploration & Preparation
- Bluebike

```python=
import pandas as pd
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/bluebike.csv'
bb = pd.read_csv(url,sep=';')
```

```python=
# restrict to all observations as of 2018-01-24
bb = bb[bb['Timestamp'] >= '2018-01-24']
print(bb.head(10))
```

```python=
# drop columns irrelevant for the model
bb = bb.drop('SPCapacityTotal',axis=1).drop('SPCapacityInUse',axis=1).drop('SPCapacityInMaintenance',axis=1)\
    .drop('SPCheckSum',axis=1).drop('DPCapacityTotal',axis=1).drop('DPCapacityInUse',axis=1).drop('DPCapacityAvailable',axis=1)\
    .drop('DPCapacityInMaintenance',axis=1).drop('DPCheckSum',axis=1)
```

- Weather

```python=
# read weather observations
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/weatherhistory.csv'
weather = pd.read_csv(url)
weather.head()
```

```python=
# strip off seconds
weather['observationtime'] = weather.observationtime.str[0:16]
weather.head()
```

```python=
# round to half hours and drop duplicates
import numpy as np
weather['observationtime'] = np.where(weather.observationtime.str[14:16]<'30',\
    weather.observationtime.str[0:14]+'00',weather.observationtime.str[0:14]+'30')
weather = weather.drop_duplicates(subset='observationtime', keep='first')
print(weather.head())

bb['Timestamp'] = np.where(bb.Timestamp.str[14:16]<'30',\
    bb.Timestamp.str[0:14]+'00',bb.Timestamp.str[0:14]+'30')
bb = bb.drop_duplicates(subset='Timestamp', keep='first')
print(bb.head())
```

```python=
# merge the two dataframes together based on Timestamp and observationtime
bbweather = pd.merge(bb,weather,left_on='Timestamp',right_on='observationtime')
bbweather.head(10)
```

```python=
# extract hour from observationtime and convert to float
bbweather['hour'] = bbweather.observationtime.str[11:13].astype(float)\
    + np.where(bbweather.observationtime.str[14:16] == '00',0,0.5)
```

```python=
# extract weekday from date
from datetime import date,datetime
bbweather['date'] = bbweather.observationtime.str[0:10]
bbweather['date'] = bbweather['date'].apply(pd.to_datetime,format='%Y-%m-%d')
bbweather['weekday'] = bbweather['date'].apply(date.weekday)
bbweather = bbweather.drop('date',axis=1)
bbweather = bbweather.drop('Timestamp',axis=1)
bbweather = bbweather.drop('observationtime',axis=1)
print(bbweather.head(10))
```

```python=
# use one hot encoding for the weather status
bbweather = pd.get_dummies(bbweather, columns=["weather"], prefix=["weather"])
print(bbweather.head(10))
```

### 7.4.2 Splitting the Data for Training and Testing
```python=
from sklearn.model_selection import train_test_split
X = bbweather.drop('SPCapacityAvailable',axis=1)
y = bbweather['SPCapacityAvailable']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
```

### 7.4.3 Training the Model
We will now build a model to predict the number of available blue-bikes based on the hour of the day, the day of the week and weather predictions.
Building and using the model is very similar to the RandomForestClassifier ```python= from sklearn.metrics import mean_absolute_error from sklearn.metrics import r2_score from sklearn.ensemble import RandomForestRegressor # create dictionary of models modeldict = {'modelForest25':RandomForestRegressor(n_estimators=25),\ 'modelForest50':RandomForestRegressor(n_estimators=50),\ 'modelForest100':RandomForestRegressor(n_estimators=100),\ 'modelForest150':RandomForestRegressor(n_estimators=150),\ 'modelForest200':RandomForestRegressor(n_estimators=200),\ 'modelForest250':RandomForestRegressor(n_estimators=250)} # initialize MAE by choosing a high value MAE = 100000 # initialize bestmodel bestmodel = 'modelForest25' for modelkey in modeldict: model = modeldict[modelkey] model.fit(X_train,y_train) y_predict = model.predict(X_test) NEWMAE = mean_absolute_error(y_test,y_predict) if NEWMAE < MAE: MAE = NEWMAE bestmodel = modelkey print('Bestmodel: ' + modelkey) print('Mean Absolute Error: '+ str(MAE)) r2 = r2_score(y_test,y_predict) print('R square: ' + str(r2)) Bestmodel: modelForest250 Mean Absolute Error: 4.754012722412812 R square: 0.6877796659147795 ``` We learn that the more trees the better the prediction (at least till 250 trees). Our predictions from the best model have an error (mean absolute value of deviation) of less than 5 bikes. We could try to further improve the model by e.g. hot encoding the weekday or adding extra features, for example “holiday (y/n)”. # H8: Image Classification ## 8.1 Explore Digit Recognition ```python= import numpy as np import pandas as pd # keras import for the dataset from keras.datasets import mnist ``` We learn from the Keras documentation that mnist.load_data() returns 2 tuples: - X_test: uint8 array of grayscale image data with shape (num_samples, 28, 28). - y_train, y_test: uint8 array of digit labels (integers in range 0-9) with shape (num_samples ```python= (X_train, y_train), (X_test, y_test) = mnist.load_data() ``` ```python= # each training and test element is a 28 x 28 pixel grayvalue image print(X_train[0].shape) np.set_printoptions(linewidth=np.inf) # avoid line wrapping when printing array print(X_train[0]) ``` ![](https://i.imgur.com/w5BTgaF.png) ```python= # the corresponding label is the "real" digit print(y_train[0]) print(np.unique(y_train, return_counts=True)) # show all unique labels 5 ``` Remark: imshow() expects a numpy array as its first parameter ```python= import matplotlib.pyplot as plt %matplotlib inline plt.figure() nrows,ncols = 3,4 plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 12)) for i in range(12): # show first 12 digits plt.subplot(nrows,ncols,i+1) # i+1 is position of subplot in 3 x 4 table # show bitmap, interpret 0 as white and 255 as black (grayvalues) plt.imshow(X_train[i], cmap=plt.cm.gray_r) plt.title(y_train[i]) # real value as title plt.xticks([]) # no ticks on x axis plt.yticks([]) # not ticks on y axis ``` ![](https://i.imgur.com/FYb2Sqa.png) ## 8.2 Image Classification with Random Forest ### 8.2.1 Data Exploration ```python= import numpy as np # keras import for the dataset from keras.datasets import mnist ``` ### 8.2.2 Data Preparation ```python= (X_train, y_train), (X_test, y_test) = mnist.load_data() ``` - For the machine learning algorithms we can’t use images directly but we need a feature matrix. To obtain a matrix we “linearize” each picture into a 784 length vector. Each entry in the vector holds a value between 0 and 255. - We also normalize the values to a floating point number between 0 and 1. 
Normalized data usually delivers better models. ```python= # let's print the shape before we reshape and normalize print("X_train shape", X_train.shape) print("y_train shape", y_train.shape) print("X_test shape", X_test.shape) print("y_test shape", y_test.shape) # building the input vector from the 28x28 pixels = linearize the image to get a 784 (= 28x28) vector X_train = X_train.reshape(60000, 784) X_test = X_test.reshape(10000, 784) # normalizing the data to help with the training # normalized data leads to better models X_train = X_train.astype('float32') X_test = X_test.astype('float32') X_train /= 255 X_test /= 255 # print the final input shape ready for training print("Train matrix shape", X_train.shape) print("Test matrix shape", X_test.shape) ``` ### 8.2.3 Training the Model ```python= from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(n_estimators=300) model.fit(X_train, y_train) ``` ```python= y_test2 = model.predict(X_test) ``` ```python= from sklearn.metrics import accuracy_score accuracy_score(y_test, y_test2) 0.9716 ``` ## 8.3 [Artificial Neural Networks (ANN)](https:https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/Deep_Learning.ipynb//) ## 8.4 Image Classification with ANN ### 8.4.1 Setup ```python= # Check if gpu is used (optional) from tensorflow.python.client import device_lib print(device_lib.list_local_devices()) ``` ```python= try: # %tensorflow_version only exists in Colab. %tensorflow_version 2.x except Exception: pass # TensorFlow and tf.keras import tensorflow as tf from tensorflow import keras print(tf.__version__) # Helper libraries import numpy as np import matplotlib.pyplot as plt import sklearn as sk import pandas as pd # fix random seed for reproducibility seed = 2020 np.random.seed(seed) import sklearn as sk from sklearn.model_selection import train_test_split from tensorflow.keras.datasets import mnist from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Dropout from tensorflow.keras.optimizers import Adam from tensorflow.keras.constraints import max_norm from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint from tensorflow.keras.models import load_model ``` ```python= # helper functions for visualisation # plotting the loss functions used in this notebook # we plot the loss we want to optimise on the left (in this case: accuracy) def plot_history(history): plt.figure(figsize = (12,4)) plt.subplot(1,2,1) plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.plot(history.epoch, np.array(history.history['accuracy']),'g-', label='Train accuracy') plt.plot(history.epoch, np.array(history.history['val_accuracy']),'r-', label = 'Validation accuracy') plt.legend() plt.subplot(1,2,2) plt.xlabel('Epoch') plt.ylabel('Loss minimised by model') plt.plot(history.epoch, np.array(history.history['loss']),'g-', label='Train loss') plt.plot(history.epoch, np.array(history.history['val_loss']),'r-', label = 'Validation loss') plt.legend() ``` ### 8.4.2 Loading the data ```python= # load train and test data (X_train_all, y_train_all), (X_test, y_test) = mnist.load_data() ``` ```python= # let's print the shape before we reshape and normalize print("X_train_all shape", X_train_all.shape) print("y_train_all shape", y_train_all.shape) print("X_test shape", X_test.shape) print("y_test shape", y_test.shape) # building the input vector from the 28x28 pixels = linearize the image to get a 784 (= 28x28) vector X_train_all = X_train_all.reshape(60000, 784) X_test = 
X_test.reshape(10000, 784) # some preprocessing ... convert integers to floating point and rescale them to [0,1] range # normalized data leads to better models X_train_all = X_train_all.astype('float32') X_test = X_test.astype('float32') X_train_all /= 255 X_test /= 255 # print the final input shape print("Train_all matrix shape", X_train_all.shape) print("Test matrix shape", X_test.shape) ``` ```python= # This data set contains a training set and a test set # we still need to split off a validation set # Number of test samples N_test = X_test.shape[0] # split off 10000 samples for validation N_val = 10000 N_train = X_train_all.shape[0] - N_val # now extract the samples into train, validate and test sets # set random state = 0 to make sure you get the same split each time X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, test_size = N_val, random_state=0) ``` ![](https://i.imgur.com/TeujqLO.png) ### 8.4.3 Multiclass Classification ```python= y_train_all = keras.utils.to_categorical(y_train_all) y_train = keras.utils.to_categorical(y_train) y_val = keras.utils.to_categorical(y_val) y_test = keras.utils.to_categorical(y_test) # look at the new labels for the first sample print(y_train[0]) ``` ```python= num_classes = 10 # this first network has 2 hidden layers # the first layer needs to be told explicitly what the input shape is # the output layer has 10 neurons: one neuron per class (digit) # Note that we use the "He" initialisation scheme here, since this is often advised # for layers with ReLu neurons # Also note that "dropout" is implemented in separate layers in Keras # they are added below in comment # note that you can also start your network with a dropout layer (randomly setting input features to 0) def initial_model(): # we create a variable called model, and we set it equal to an instance of a Sequential object. model = Sequential() # The first Dense object is the first hidden layer. Dense is one particular type of layer, but there are many other types # Dense is the most basic kind of layer in an ANN and each output of a dense layer is computed using every input to the layer # The input shape parameter input_shape=(784,) tells us how many neurons our input layer has, so in our case, we have 784. # The neural network needs to start with some weights and then iteratively updates them to better values. # The term kernel_initializer is a fancy term for the statistical distribution or function # to use for initialising the weights. # The input layer shape is specified as a parameter to the first Dense object’s constructor. model.add(Dense(32, activation='relu', input_shape=(784,), kernel_initializer='he_uniform')) # then add some dropout, set at a very low value for now # model.add(Dropout(0.5)) # a second dense layer with half as many neurons model.add(Dense(16, activation='relu', kernel_initializer='he_uniform')) # some more dropout # model.add(Dropout(0.5)) # and the output layer model.add(Dense(num_classes, activation='softmax')) # Before we can train our model, we must compile it # To the compile() function, we are passing the optimizer, the loss function, and the metrics that we would like to see. # Notice that the optimizer we have specified is called Adam. Adam is just a variant of SGD. 
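    # (added note) 'categorical_crossentropy' is the loss that matches one-hot encoded labels
    # (created above with to_categorical) combined with a softmax output layer; with plain
    # integer labels you would use 'sparse_categorical_crossentropy' instead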
model.compile(loss='categorical_crossentropy', optimizer= tf.keras.optimizers.Adam(learning_rate = 0.001), metrics=['accuracy']) return model ``` ### 8.4.4 Training ```python= # Create your model model_1 = initial_model() model_1.summary() # We now add batch size to the mix of training parameters # If you don't specify batch size below, all training data will be used for each learning step batch_size = 16 epochs = 20 # We fit our model to the data. Fitting the model to the data means to train the model on the data. # X_train is a numpy array consisting of the training samples. # y_train is a numpy array consisting of the corresponding labels for the training samples. # batch_size specifies how many training samples should be sent to the model at once. # epochs = how many times the complete training set (all of the samples) will be passed to the model. # verbose = 1 indicates how much logging we will see as the model trains. (other values are a.o. 0, 2) history_1 = model_1.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(X_val, y_val) ) # The output gives us the following values for each epoch: # Epoch number # Duration in seconds # Loss # Accuracy ``` ```python= # model_1 now contains the model at the end of the training run # We analyse the result: [train_loss, train_accuracy] = model_1.evaluate(X_train, y_train, verbose=0) print("Training set Accuracy:{:7.2f}".format(train_accuracy)) print("Training set Loss:{:7.4f}\n".format(train_loss)) [val_loss, val_accuracy] = model_1.evaluate(X_val, y_val, verbose=0) print("Validation set Accuracy:{:7.2f}".format(val_accuracy)) print("Validation set Loss:{:7.4f}\n".format(val_loss)) #Now we visualise what happened during training plot_history(history_1) ``` ![](https://i.imgur.com/hJtWi4S.png) ### 8.4.5 Final model and Analysis ```python= model_for_test = initial_model() model_for_test.summary() # We now add batch size to the mix of training parameters # If you don't specify batch size below, all training data will be used for each learning step batch_size = 128 epochs = 50 history_for_test = model_for_test.fit(X_train_all, y_train_all, batch_size=batch_size, epochs=epochs, verbose=1 ) ``` ```python= plt.figure(figsize = (12,4)) plt.subplot(1,2,1) plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.plot(history_for_test.epoch, np.array(history_for_test.history['accuracy']),'g-', label='Train accuracy') # g-: green solid line plt.legend() plt.subplot(1,2,2) plt.xlabel('Epoch') plt.ylabel('Loss minimised by model') plt.plot(history_for_test.epoch, np.array(history_for_test.history['loss']),'g-', label='Train loss') plt.legend() ``` ![](https://i.imgur.com/Q4oaSn1.png) ```python= [train_loss, train_accuracy] = model_for_test.evaluate(X_train_all, y_train_all, verbose=0) print("Training set Accuracy:{:7.2f}".format(train_accuracy)) print("Training set Loss:{:7.4f}\n".format(train_loss)) [test_loss, test_accuracy] = model_for_test.evaluate(X_test, y_test, verbose=0) print("Test set Accuracy:{:7.2f}".format(test_accuracy)) print("Test set Loss:{:7.4f}\n".format(test_loss)) Training set Accuracy: 1.00 Training set Loss: 0.0002 Test set Accuracy: 0.98 Test set Loss: 0.1490 ``` ### 8.4.6 Evaluation ```python= predictions = model_for_test.predict(X_test) # The first digit should be a 7 (shown as 1. 
at index 7) print(y_test[0]) # Check the probabilities returned by predict for first test sample # The function enumerate() receives and iterable and creates an iterator that, for each element, # returns a tuple containing the element's index and value for index, probability in enumerate(predictions[0]): print(f'{index}: {probability:.10%}') # Our model believes this digit is a 7 with nearly 100% certainty # Not all predictions have this level of certainty ``` ```python= # Locating the Incorrect Predictions images = X_test.reshape((10000, 28, 28)) incorrect_predicted_images = [] predicted_digits = [] expected_digits = [] for i, (p, e) in enumerate(zip(predictions, y_test)): predicted, expected = np.argmax(p), np.argmax(e) if predicted != expected: # prediction was incorrect incorrect_predicted_images.append(images[i]) predicted_digits.append(predicted) expected_digits.append(expected) ``` ```python= import matplotlib.pyplot as plt %matplotlib inline plt.figure() nrows,ncols=4,6 plt.subplots(nrows,ncols, figsize=(16, 12)) for i in range(nrows*ncols): # show first 24 digits plt.subplot(nrows,ncols,i+1) # i+1 is position of subplot in nrows x ncols table # show bitmap, interpret 0 as white and 255 as black (grayvalues) plt.imshow(incorrect_predicted_images[i].reshape(28,28), cmap=plt.cm.gray_r) plt.title(f'p: {predicted_digits[i]}; e: {expected_digits[i]}') plt.xticks([]) # no ticks on x axis plt.yticks([]) # not ticks on y axis ``` ## 8.5 [Convolutional Neural Networks (CNN)](https://https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/CNN_Introduction.ipynb) ## 8.6 Image Classification with CNN ### 8.6.1 Setup ```python= # Check if gpu is used (optional) from tensorflow.python.client import device_lib print(device_lib.list_local_devices()) ``` ```python= try: # %tensorflow_version only exists in Colab. 
%tensorflow_version 2.x except Exception: pass # TensorFlow and tf.keras import tensorflow as tf from tensorflow import keras print(tf.__version__) # Helper libraries import numpy as np import matplotlib.pyplot as plt import sklearn as sk import pandas as pd # fix random seed for reproducibility seed = 2020 np.random.seed(seed) import sklearn as sk from sklearn.model_selection import train_test_split from tensorflow.keras.datasets import mnist from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Dropout from tensorflow.keras.optimizers import Adam from tensorflow.keras.constraints import max_norm from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint from tensorflow.keras.models import load_model from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D ``` ```python= # helper functions for visualisation # plotting the loss functions used in this notebook # we plot the loss we want to optimise on the left (in this case: accuracy) def plot_history(history): plt.figure(figsize = (12,4)) plt.subplot(1,2,1) plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.plot(history.epoch, np.array(history.history['accuracy']),'g-', label='Train accuracy') plt.plot(history.epoch, np.array(history.history['val_accuracy']),'r-', label = 'Validation accuracy') plt.legend() plt.subplot(1,2,2) plt.xlabel('Epoch') plt.ylabel('Loss minimised by model') plt.plot(history.epoch, np.array(history.history['loss']),'g-', label='Train loss') plt.plot(history.epoch, np.array(history.history['val_loss']),'r-', label = 'Validation loss') plt.legend() ``` ### 8.6.2 Loading the data ```python= # load train and test data (X_train_all, y_train_all), (X_test, y_test) = mnist.load_data() ``` ```python= # let's print the shape before we reshape and normalize print("X_train_all shape", X_train_all.shape) print("y_train_all shape", y_train_all.shape) print("X_test shape", X_test.shape) print("y_test shape", y_test.shape) ``` ### 8.6.3 Data Preparation ```python= X_train_all = X_train_all.reshape((60000, 28, 28, 1)) print(X_train_all.shape) X_test = X_test.reshape((10000, 28, 28, 1)) print(X_test.shape) # some preprocessing ... 
convert integers to floating point and rescale them to [0,1] range # normalized data leads to better models X_train_all = X_train_all.astype('float32') X_test = X_test.astype('float32') X_train_all /= 255 X_test /= 255 # print the final input shape print("Train_all matrix shape", X_train_all.shape) print("Test matrix shape", X_test.shape) ``` ```python= # This data set contains a training set and a test set # we still need to split off a validation set # Number of test samples N_test = X_test.shape[0] # split off 10000 samples for validation N_val = 10000 N_train = X_train_all.shape[0] - N_val # now extract the samples into train, validate and test sets # set random state = 0 to make sure you get the same split each time X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, test_size = N_val, random_state=0) ``` ### 8.6.4 Multiclass Classification ```python= y_train_all = keras.utils.to_categorical(y_train_all) y_train = keras.utils.to_categorical(y_train) y_val = keras.utils.to_categorical(y_val) y_test = keras.utils.to_categorical(y_test) # look at the new labels for the first sample print(y_train[0]) ``` ```python= num_classes = 10 # A typical convnet consists of # input layer that receives training samples # hidden layers that learn from training samples # output layer that produces predictions def initial_model(): # we create a variable called model, and we set it equal to an instance of a Sequential object. model = Sequential() # We'll start with a convolution layer # A convolutional layer uses the relationships between pixels in close proximity to learn useful features (or patterns) in small areas of each sample # These features become inputs to subsequent layers # Kernels typically are 3-by-3 (or 5-by-5 or 7-by-7) # Kernel-size is a hyperparameter # By looking at features near one another, the network begins to recognize features, like edges, straight lines and curves # Next, convolution layer moves kernel one pixel to the right (the stride) # Complete pass left-to-right and top-to-bottom is called a filter # For a 3-by-3 kernel, the filter dimensions will be two less than the input dimensions # For each 28-by-28 MNIST image, the filter will be 26-by-26 # Number of filters in the convolutional layer is commonly 32 or 64 for small images model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1))) # Input samples are 28-by-28-by-1—that is, 784 features each # Specified 64 filters and a 3-by-3 kernel for the layer, so the feature map size is 26-by-26-by-64 for a total of 26x26x64=43.264 features # Adding a pooling layer # Outputs maximum feature from each pool # Stride for a 2-by-2 pool is 2 # Every group of four features is reduced to one, so 2-by-2 pooling compresses number of features by 75% # Reduces previous layer’s output from 26-by-26-by-64 to 13-by-13-by-64 model.add(MaxPooling2D(pool_size=(2, 2))) # Convnets often have many convolution and pooling layers. 
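    # In general, with stride 1 and no padding, a Conv2D layer's output side = input side - kernel side + 1
    # (here 28 - 3 + 1 = 26), and a MaxPooling2D layer's output side = floor(input side / pool side) (here floor(26 / 2) = 13)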
# Input to the second convolution layer is the 13-by-13-by-64 output of the first pooling layer # Output of this Conv2D layer will be 11-by-11-by-128 # For odd dimensions like 11-by-11, Keras pooling layers round down by default (in this case to 10-by-10), so this pooling layer’s output will be 5-by-5-by-128 model.add(Conv2D(filters=128, kernel_size=(3, 3), activation='relu')) model.add(MaxPooling2D(pool_size=(2, 2))) # Flattening the results # Model's final output will be a one-dimensional array of 10 probabilities that classify the digits # To prepare for one-dimensional final predictions, need to flatten the previous layer’s output to one dimension # Flatten layer's output will be 1-by-3200 (5 × 5 × 128) model.add(Flatten()) # Adding a Dense Layer to Reduce the Number of Features # Layers before the Flatten layer learned digit features # Now must learn the relationships among those features to classify which digit each image represents # Accomplished with fully connected Dense layers # The following Dense layer creates 128 neurons (units) that learn from the 3200 outputs of the previous layer model.add(Dense(units=128, activation='relu')) # Adding Another Dense Layer to Produce the Final Output # Final Dense layer classifies inputs into neurons representing the classes 0-9 # The softmax activation function converts values of these 10 neurons into classification probabilities # Neuron with highest probability represents the prediction for a given digit image model.add(Dense(units=10, activation='softmax')) # Before we can train our model, we must compile it # To the compile() function, we are passing the optimizer, the loss function, and the metrics that we would like to see. # Notice that the optimizer we have specified is called Adam. Adam is just a variant of SGD. model.compile(loss='categorical_crossentropy', optimizer= tf.keras.optimizers.Adam(learning_rate = 0.001), metrics=['accuracy']) return model ``` ### 8.6.5 Training the Model ```python= # Create your model model_1 = initial_model() model_1.summary() # We now add batch size to the mix of training parameters # If you don't specify batch size below, all training data will be used for each learning step batch_size = 128 epochs = 20 # We fit our model to the data. Fitting the model to the data means to train the model on the data. # batch_size specifies how many training samples should be sent to the model at once. # epochs = how many times the complete training set (all of the samples) will be passed to the model. # verbose = 1 indicates how much logging we will see as the model trains. (other values are a.o. 
0, 2)
history_1 = model_1.fit(X_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(X_val, y_val))

# The output gives us the following values for each epoch:
# Epoch number
# Duration in seconds
# Loss
# Accuracy
```
```python=
# model_1 now contains the model at the end of the training run
# We analyse the result:
[train_loss, train_accuracy] = model_1.evaluate(X_train, y_train, verbose=0)
print("Training set Accuracy:{:7.2f}".format(train_accuracy))
print("Training set Loss:{:7.4f}\n".format(train_loss))

[val_loss, val_accuracy] = model_1.evaluate(X_val, y_val, verbose=0)
print("Validation set Accuracy:{:7.2f}".format(val_accuracy))
print("Validation set Loss:{:7.4f}\n".format(val_loss))

# Now we visualise what happened during training
plot_history(history_1)
```

![](https://i.imgur.com/yyIBZHG.png)

### 8.6.6 Final model and Analysis

```python=
model_for_test = initial_model()
model_for_test.summary()

# We now add batch size to the mix of training parameters
# If you don't specify batch size below, all training data will be used for each learning step
batch_size = 128
epochs = 50

history_for_test = model_for_test.fit(X_train_all, y_train_all,
                                      batch_size=batch_size,
                                      epochs=epochs,
                                      verbose=1)
```
```python=
plt.figure(figsize = (12,4))
plt.subplot(1,2,1)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.plot(history_for_test.epoch, np.array(history_for_test.history['accuracy']),'g-', label='Train accuracy')
plt.legend()

plt.subplot(1,2,2)
plt.xlabel('Epoch')
plt.ylabel('Loss minimised by model')
plt.plot(history_for_test.epoch, np.array(history_for_test.history['loss']),'g-', label='Train loss')
plt.legend()
```

![](https://i.imgur.com/PLK6d4n.png)

```python=
[train_loss, train_accuracy] = model_for_test.evaluate(X_train_all, y_train_all, verbose=0)
print("Training set Accuracy:{:7.2f}".format(train_accuracy))
print("Training set Loss:{:7.4f}\n".format(train_loss))

[test_loss, test_accuracy] = model_for_test.evaluate(X_test, y_test, verbose=0)
print("Test set Accuracy:{:7.2f}".format(test_accuracy))
print("Test set Loss:{:7.4f}\n".format(test_loss))

Training set Accuracy:    1.00
Training set Loss:  0.0000

Test set Accuracy:    0.99
Test set Loss:  0.0519
```

### 8.6.7 Evaluation

```python=
predictions = model_for_test.predict(X_test)

# The first digit should be a 7 (shown as 1. at index 7)
print(y_test[0])

# Check the probabilities returned by predict for the first test sample
# The function enumerate() receives an iterable and creates an iterator that, for each element,
# returns a tuple containing the element's index and value
for index, probability in enumerate(predictions[0]):
    print(f'{index}: {probability:.10%}')

# Our model believes this digit is a 7 with nearly 100% certainty
# Not all predictions have this level of certainty
```
```python=
# Locating the incorrect predictions
images = X_test.reshape((10000, 28, 28))
incorrect_predicted_images = []
predicted_digits = []
expected_digits = []
for i, (p, e) in enumerate(zip(predictions, y_test)):
    predicted, expected = np.argmax(p), np.argmax(e)
    if predicted != expected:  # prediction was incorrect
        incorrect_predicted_images.append(images[i])
        predicted_digits.append(predicted)
        expected_digits.append(expected)
```
```python=
import matplotlib.pyplot as plt
%matplotlib inline

nrows, ncols = 4, 6
plt.subplots(nrows, ncols, figsize=(16, 12))
for i in range(nrows*ncols):  # show the first 24 misclassified digits
    plt.subplot(nrows, ncols, i+1)  # i+1 is the position of the subplot in the nrows x ncols table
    # show the bitmap, interpreting 0 as white and 255 as black (gray values)
    plt.imshow(incorrect_predicted_images[i].reshape(28, 28), cmap=plt.cm.gray_r)
    plt.title(f'p: {predicted_digits[i]}; e: {expected_digits[i]}')
    plt.xticks([])  # no ticks on x axis
    plt.yticks([])  # no ticks on y axis
```

# H9: Document Classification and NLP

## 9.1 Feature Engineering in NLP

- The main problem is converting plain text into a format that standard classification methods can use; this conversion is called feature engineering.
- NLP is a component of text mining that performs a special kind of linguistic analysis that essentially helps a machine “read” text.
- Part-of-speech tagging is the process of determining to which word category a specific word belongs. Word categories are, for example, verbs, nouns and pronouns (see the short sketch below).
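To make part-of-speech tagging concrete, here is a minimal sketch using NLTK's `pos_tag`; this function and the `averaged_perceptron_tagger` resource are assumptions here and are not used elsewhere in these notes:

```python=
# Hedged sketch: part-of-speech tagging with NLTK (not part of the course notebooks)
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# prints (word, tag) pairs, e.g. ('fox', 'NN') for a noun and ('jumps', 'VBZ') for a verb
```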
### 9.1.1 Detect Language

```python=
import pandas as pd
import numpy as np
import nltk

!pip install langdetect  # install langdetect if necessary (required for Google Colab)
from langdetect import detect

detect("Should I wear a mask when I'm exercising outside?")
detect("Lufthansa bietet wieder Urlaubsflüge an")
```

### 9.1.2 Tokenization

```python=
from nltk.tokenize import word_tokenize

# install the necessary files
nltk.download('punkt')

# sample text for performing tokenization
text = """Controversial trials in which volunteers are intentionally infected with Covid-19 could accelerate vaccine development, according to the World Health Organization, which has released new guidance on how the approach could be ethically justified despite the potential dangers for participants. So-called challenge trials are a mainstream approach in vaccine development and have been used in malaria, typhoid and flu, but there are treatments available for these diseases if a volunteer becomes severely sick. For Covid-19, a safe dose of the virus has not been established and there are no failsafe treatments if things go wrong."""

# Passing the string text into word_tokenize to split the text into tokens.
token = word_tokenize(text)
print(token)
```

### 9.1.3 Stop word removal

```python=
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

def remove_stopwords_en(text):
    stop_words_en = set(stopwords.words('english'))
    punctuations = r"?:!.,;<>/\+-"
    # turn the string into a list of words based on separators (blank, comma, etc.)
    word_tokens = word_tokenize(text.lower())
    # create a list of all words that are neither stopwords nor punctuation
    result = [x for x in word_tokens if x not in stop_words_en and x not in punctuations]
    # create a new string of all remaining words
    seperator = ' '
    return seperator.join(result)

print(remove_stopwords_en(text))
```

### 9.1.4 Stemming

Words like welcome and welcoming refer to essentially the same concept, so for document classification it is better to combine them into a single term. Stemming does this with simple rules (like removing suffixes such as -e or -ing), which can sometimes produce stems that are not meaningful words.

```python=
# Stemming: examples
from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")

stm = ["welcome", "welcoming"]
for word in stm:
    print(word + ":" + englishStemmer.stem(word))
print()

stm = ["ball", "balls"]
for word in stm:
    print(word + ":" + englishStemmer.stem(word))
print()

stm = ["waited", "waiting", "waits"]
for word in stm:
    print(word + ":" + englishStemmer.stem(word))
print()

stm = ["giving", "give", "given", "gave"]
for word in stm:
    print(word + ":" + englishStemmer.stem(word))
print()
```
```python=
dutchStemmer = SnowballStemmer("dutch")

stm = ["worden", "wordt"]
for word in stm:
    print(word + ":" + dutchStemmer.stem(word))
print()

stm = ["dader", "daders", "daad"]
for word in stm:
    print(word + ":" + dutchStemmer.stem(word))
print()

stm = ["las", "lezen", "gelezen", "lees"]
for word in stm:
    print(word + ":" + dutchStemmer.stem(word))
print()
```
```python=
# Stemming: replace words by their stem
def stemming_en(text):
    word_tokens = word_tokenize(text.lower())
    seperator = ' '
    result = [englishStemmer.stem(x) for x in word_tokens]
    return seperator.join(result)

print(stemming_en(text))
```

### 9.1.5 Lemmatization

The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, which often leads to incorrect meanings and spelling errors.

```python=
# Lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("rocks:", lemmatizer.lemmatize("rocks"))
print("corpora:", lemmatizer.lemmatize("corpora"))

words = ["gone", "going", "went"]
for word in words:
    print(word + ":" + lemmatizer.lemmatize(word))
```
```python=
def lemmatizing_en(text):
    word_tokens = word_tokenize(text.lower())
    seperator = ' '
    result = [lemmatizer.lemmatize(x) for x in word_tokens]
    return seperator.join(result)

print(lemmatizing_en(text))
```

### 9.1.6 Representing text documents for data mining

#### 9.1.6.1 Bag-of-Words

The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document.

![](https://i.imgur.com/2O3FWEI.png)
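As a minimal sketch of this idea, here is how the vocabulary and the per-document word counts for a toy corpus could be built with scikit-learn's `CountVectorizer` (an assumption here; the course notebooks themselves use NLTK's `FreqDist` below):

```python=
# Hedged sketch: a Bag-of-Words document-term matrix with scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "dogs and cats can be friends"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: one row per document, one column per vocabulary word

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # how many times each word appears in each document
```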
```python=
# finding the frequency distribution of the tokens
# Importing FreqDist from nltk and passing the tokens into FreqDist
text = """Controversial trials in which volunteers are intentionally infected with Covid-19 could accelerate vaccine development, according to the World Health Organization, which has released new guidance on how the approach could be ethically justified despite the potential dangers for participants. So-called challenge trials are a mainstream approach in vaccine development and have been used in malaria, typhoid and flu, but there are treatments available for these diseases if a volunteer becomes severely sick. For Covid-19, a safe dose of the virus has not been established and there are no failsafe treatments if things go wrong."""

# Passing the string text into word_tokenize to split the text into tokens.
token = word_tokenize(text)

from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist
```

A weakness of simple counting is that very frequent but uninformative words (such as "the" or "and") dominate the counts. To address this problem there is an advanced variant of the Bag-of-Words that, instead of simple counting, uses the **term frequency–inverse document frequency** (or Tf–Idf): the value of a word increases proportionally to its count in a document, but it is inversely proportional to the frequency of the word across the corpus. [See example](https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/feature_engineering_in_nlp.ipynb#scrollTo=Nc5sAthN85i0)

#### 9.1.6.2 Word Embedding

#### 9.1.6.3 Global Vectors for Word Representation (GloVe)

#### 9.1.6.4 Approach

Here's how we will solve a classification problem:

- convert all text samples in the dataset into sequences of word indices. A "word index" is simply an integer ID for the word. We will only consider the top 20,000 most commonly occurring words in the dataset, and we will truncate the sequences to a maximum length.
- prepare an "embedding matrix" which will contain at index i the embedding vector for the word of index i in our word index.
- load this embedding matrix into a Keras Embedding layer, set to be frozen (its weights, the embedding vectors, will not be updated during training).
- build on top of it a 1D convolutional neural network, ending in a softmax output over the number of categories.

## 9.2 Pretrained Word Embeddings

- [Case Spam Detection](https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/pretrained_word_embeddings.ipynb)
- [Blog Gender Classification](https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/notebooks/pretrained_word_embeddings_blog_gender.ipynb)

# H10: [Solutions](https://colab.research.google.com/github/HOGENT-Databases/DB3-Workshops/blob/master/index.ipynb)