# **Preclass Session 2: Linear Model**

[toc]

## **Prediction Types**

1. Four statistical data types

    ```graphviz
    digraph Datatypes {
        A [label="Types of data"]
        B [label="Numerical or \n Quantitative data"]
        B1 [label="Discrete data \n (can count)"]
        B2 [label="Continuous data \n (can measure)"]
        C [label="Categorical or \n Qualitative data"]
        C1 [label="Nominal data \n (can name)"]
        C2 [label="Ordinal data \n (can rank/order)"]
        A->B
        A->C
        B->B1
        B->B2
        C->C1
        C->C2
    }
    ```

2. Two prediction types

    <center><img src=https://d1q4qwyh0q55bh.cloudfront.net/images/RwMftMVZBjde06KPVcgqrYoA69vggibULM9aPrSLvxH4AoX3eRRYgxDE0OGxjWMB.png?d=desktop-thumbnail></center>
    <br>

    - Predict a continuous value: [regression](https://en.wikipedia.org/wiki/Regression_analysis). For each input ${\bf x}^t$ the model (i.e., our `“fantastic function”`) predicts a real value ${\hat y}^t\in\mathbb{R}$. Optional reading: [Why the name regression?](https://blog.minitab.com/en/statistics-and-quality-data-analysis/so-why-is-it-called-regression-anyway)
    - Predict a discrete `class/type`: [classification](https://en.wikipedia.org/wiki/Statistical_classification) & [clustering](https://en.wikipedia.org/wiki/Cluster_analysis). For each input ${\bf x}^t$ the model predicts a class label ${\hat y}^t$ from a set ${\bf Y} \ni {\hat y}^t$. For example, binary classification has ${\bf Y}=\{0,1\}$ or ${\bf Y}=\{-1,1\}$, and, for convenience, the multiclass case has ${\bf Y}=\{1,\dots,K\}$.

## **Function & TEFPA**

1. Modeling `“fantastic functions”`: input ${\bf x}$ $\overset{f_\theta}{\to}$ predictions ${\bf {\hat y}}$
    $$\text{input } {\bf x} \xrightarrow[(\phi_1\dots\phi_n)]{\text{features}} \text{embedding coordinates } {\bf z} =(z_1, \dots, z_n) \xrightarrow[\text{generators}]{\text{predictors}} \text{output } {\bf \hat{y}}$$
    - The input/output pair $({\bf x},y)$ defines the specific task $\mathcal{T}$ of this `“fantastic”` function (in ML we call it a **model**).
    - A function can be parametrized using parameters ${\boldsymbol \theta}$ (it is then called a parametric function).
    - Its chosen `structure/type` induces a function space $\mathcal{F}$ for these parametric functions to live in.
    - The type/structure is called the `model class`.
2. Analogy: think of a function as a machine (see the code sketch after this list).
    - Its parameters are just like knobs, switches, and sliders adjusting the machine.
    - All possible configurations/settings of the knobs/parameters constitute a space of machines/functions of this particular type.
    - The model classes are like diamond-shaped, box-shaped, or pipeline-shaped machine types.

    <center><img src=https://d1q4qwyh0q55bh.cloudfront.net/images/VJEgPEbAixmSwzGvdTPClA9poWFexcRbTZWqMuvpCBEU54EXiSaUGXRZIli5EUik.png?d=desktop-thumbnail></center>
    <br>

    - Example:
        - Linear function $y=ax+b$ with input $x$, output $y$, params (knobs) ${\boldsymbol \theta} = (a,b)$.
        - Quadratic function $y=ax^2+b$ also has 2 parameters but a different type/structure.
        - Notation: $y=f_\theta(x) = f(x;\theta)$.
        - Coefficient $a$ acts as a knob [rotating/reflecting](https://www.geogebra.org/m/hqPTmW83) the graph of $y$ about the origin in the linear case, and [curving/bending](https://www.geogebra.org/m/uXz7MEhY) the graph of $y$ in the quadratic case.
        - [Intercept $b$](https://en.wikipedia.org/wiki/Y-intercept) acts as a slider [translating](https://en.wikipedia.org/wiki/Translation_(geometry)) the graph of $y$, hence it is also known as the `bias/offset` term.
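A minimal Python sketch of this machine analogy (our own illustration; the function names and values are made up): the same two knobs ${\boldsymbol \theta} = (a,b)$ plug into two different model classes.

```python
# Two model classes ("machine types") sharing the same knobs theta = (a, b).

def linear(x, theta):
    """Linear machine: f_theta(x) = a*x + b."""
    a, b = theta
    return a * x + b

def quadratic(x, theta):
    """Quadratic machine: same two knobs, different structure."""
    a, b = theta
    return a * x**2 + b

# Turning the knobs moves us to another function inside each model class.
theta = (1.5, -1.0)
print(linear(2.0, theta))     # 1.5*2.0 - 1.0 = 2.0
print(quadratic(2.0, theta))  # 1.5*4.0 - 1.0 = 5.0
```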
3. Learning (or training; more details in coming lectures) means giving the computer a sample dataset $D$ for it to find (near-)optimal parameters ${\boldsymbol {\hat \theta}}$.
    - Learning/training then means getting ${\boldsymbol {\hat \theta}} \xleftarrow[\text{optimize}]{\text{search}} (f_\theta,D)$ by `“fitting”` $f_\theta$ to the dataset $D$.
    - The training dataset $D$ is thus a form of experience ${\mathcal E}$ for the computer to `“learn”` from.
    - For **classification** and **regression**, we need to **annotate** or **label** the dataset, i.e., assign an output $y^t$ to each input ${\bf x}^t$: $D=\{({\bf x}^t,y^t)\}_{t=1}^N$, hence the name `supervised learning`.
    - We need a **performance measure** $\mathcal{P}$ (in ML we have metrics & losses; more in coming classes) to say how well a specific model $f_{\theta^k}$ performs on the given dataset $D$ using a specific set of parameters ${\boldsymbol \theta}^k$.
        - Usually it is a scalar number, i.e., ${\mathcal P}: {\boldsymbol \theta}^k \xrightarrow{(f_\theta,D)} \text{a score}\in\mathbb{R}$.
        - Essentially, the performance measure $\mathcal{P}$ is itself a regression function with the parameters ${\boldsymbol \theta}$ as its input.
    - To find a (near-)optimal set of parameters ${\boldsymbol {\hat \theta}}$ we need to give the computer a search/optimization (i.e., learning/training) algorithm ${\mathcal A}$ to move through the function space, i.e., to change ${\boldsymbol \theta}^k$ into ${\boldsymbol \theta}^{k+1}$ for better performance.
4. The unified Machine Learning framework `TEFPA`:

    ```mermaid
    graph LR
    Task --> Experience --> Function_Space --> Performance_measure --> Algorithm
    ```

5. Our fantastic functions can be parametrized as $f_\theta = ({\boldsymbol \phi}_\alpha, p_w)$ for prediction and $f_\theta = ({\boldsymbol \phi}_\alpha, g_\beta)$ for generation. When we learn/train both the features/encoder ${\boldsymbol \phi}_\alpha$ and the predictor $p_w$ (or generator/decoder $g_\beta$) simultaneously, it is called `end-to-end` learning/training. In this Session 2 we only consider training the predictor $p_w$, assuming the embeddings ${\bf z}$ are given by a pre-trained model ${\boldsymbol \phi}_\alpha$ for feature extraction.

## **Classifiers**

<center><img src=https://th.bing.com/th/id/R.260b4b920f801cab5801558086493577?rik=f54u7RP6Q6JGqg&riu=http%3a%2f%2fmachinelearningcoban.com%2fassets%2f13_softmax%2fsoftmax_nn.png&ehk=KPBWyBFQLhIMBNMUutKadJlOxMeFpTYHP1E1Tp%2borTI%3d&risl=&pid=ImgRaw&r=0></center><br>

Note: The above graphic uses notation different from the explanations below:
- Prediction score ${\hat z}_i\in\mathbb{R}$ $\rightarrow$ ${\hat y}_i\in\mathbb{R}$
- ${\bf {\hat z}} = ({\hat z}_1,\dots,{\hat z}_i,\dots,{\hat z}_C)$ $\rightarrow$ ${\bf {\hat y}} = ({\hat y}_1,\dots,{\hat y}_i,\dots,{\hat y}_K)$
- $C$ classes $\rightarrow$ $K$ classes

**1. Multiclass Classification**

- We need to give a prediction score ${\hat y}_i\in\mathbb{R}$ for each class ID $i$. Thus, for the case of $K$ classes, the prediction score is a vector ${\bf {\hat y}} = ({\hat y}_1,\dots,{\hat y}_i,\dots,{\hat y}_K)^\top\in\mathbb{R}^K$.
- Now the output predicted label of $f_\theta$ is simply $\arg\max_k {\hat y}_k$ (or the top-$k$ scores if we want to predict the $k$ classes with the highest scores).

**2. Binary Classification**

- For the case of a binary label $y\in\{0,1\}$ (`True/False`) or $y\in\{-1,1\}$, we can use a single score value ${\hat y}\in \mathbb{R}$ and compare it with a threshold $\delta$ to give the predicted label; usually $\delta = 0.5$ for the case $y\in\{0,1\}$.
    - If ${\hat y}\geq\delta$ then output a predicted label of `1/True`; otherwise output a predicted label of `0/False` or `-1`.
    - We can write this compactly using the indicator function: **predicted label** $=\mathbf{1}_{[\hat{y}\geq\delta]}$.
    - If $\delta = 0$ and $y\in\{-1,1\}$, we can use $\text{sign}({\hat y})$ to output the predicted label.

Both the multiclass and the binary decision rules are sketched in code at the end of this section.

**3. Activation Functions**

- We can convert the prediction scores ${\hat y}_i\in\mathbb{R}$ into a desired range using a **transfer function** $s(\cdot)$, also called an [activation function](https://en.wikipedia.org/wiki/Activation_function) in ML. For example (see the code sketch at the end of this section):
    - The [logistic sigmoid function](https://en.wikipedia.org/wiki/Logistic_function) $\sigma({\hat y})\in [0,1]$ gives the probability of the True class. [Geogebra viz](https://www.geogebra.org/m/vegkdavv).
    - The [tanh function](https://paperswithcode.com/method/tanh-activation) $\tanh({\hat y}) \in [-1,1]$ can be converted into the probability of the True class as $\frac{1+\tanh({\hat y})}2$. Geogebra viz.
    - The [softmax function](https://en.wikipedia.org/wiki/Softmax_function) exponentiates each component ${\hat y}_k$ and normalizes by their sum, converting the prediction score vector ${\bf {\hat y}}$ into a probability vector $\text{softmax}({\bf {\hat y}})$.

<center><img src=https://raphaelmcobe.github.io/dataSanJose2019_nn_presentation/activation_functions.png></center>
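To make the decision rules concrete, here is a minimal NumPy sketch (our own illustration; the scores and variable names are made up):

```python
import numpy as np

# Multiclass: the predicted label is the class ID with the highest score.
y_hat = np.array([0.1, 2.3, -0.7, 1.9])   # scores for K = 4 classes
label = np.argmax(y_hat)                   # -> 1

# Top-k: the k class IDs with the highest scores.
top2 = np.argsort(y_hat)[::-1][:2]         # -> array([1, 3])

# Binary with y in {0, 1}: indicator function 1_[y_hat >= delta].
delta = 0.5
label01 = int(0.8 >= delta)                # -> 1 (True)

# Binary with y in {-1, 1} and delta = 0: sign of the score.
label_pm = int(np.sign(-1.2))              # -> -1
```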
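Here is also a minimal NumPy sketch of the three activation functions themselves (our own illustration; subtracting the max inside softmax is a standard numerical-stability trick not discussed above):

```python
import numpy as np

def sigmoid(y_hat):
    """Logistic sigmoid: squashes a real score into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-y_hat))

def tanh_prob(y_hat):
    """Map tanh's output in [-1, 1] to a probability in [0, 1]."""
    return (1.0 + np.tanh(y_hat)) / 2.0

def softmax(scores):
    """Turn a score vector into a probability vector (non-negative, sums to 1)."""
    e = np.exp(scores - np.max(scores))  # stability: avoid overflow in exp
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(tanh_prob(0.0))                      # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```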
## **Tensorflow library**

1. Declare the type of model

    ```python
    from tensorflow.keras.models import Sequential

    # A Sequential model stacks layers one after another
    model = Sequential()
    ```

2. Declare layers

    ```python
    from tensorflow.keras.layers import Input, Dense

    # Layer parameters
    n_features = (784,)          # input shape as a tuple, e.g., flattened 28x28 images
    n_outputs = 10               # number of classes K
    output_function = "softmax"  # converts the K scores into a probability vector

    # Initialize layers (avoid naming a variable `input`, which shadows a Python builtin)
    inputs = Input(shape=n_features)
    dense = Dense(n_outputs, activation=output_function)

    # Add layers into the model
    model.add(inputs)
    model.add(dense)
    ```

3. Compile

    ```python
    # Define loss, metrics and optimizer
    loss = "categorical_crossentropy"
    metrics = ["accuracy"]
    optimizer = "adam"
    model.compile(loss=loss, metrics=metrics, optimizer=optimizer)

    # Display the architecture of the model
    model.summary()
    ```

4. Fit (or train)

    ```python
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
    ```

5. Predict

    ```python
    y_pred = model.predict(x_test)
    # Sigmoid Regression: apply the threshold to determine the label.
    # Softmax Regression: apply the argmax to determine the label.
    ```
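Finally, a minimal sketch (our own illustration; assumes `x_test` is preprocessed the same way as the training data) of turning the predicted probabilities into class labels:

```python
import numpy as np

probs = model.predict(x_test)        # shape (N, 10): one probability vector per sample
labels = np.argmax(probs, axis=1)    # softmax case: predicted class ID per sample

# Sigmoid case (a single output unit): threshold at delta = 0.5 instead, e.g.,
# labels = (model.predict(x_test) >= 0.5).astype(int).ravel()
```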