# Feature Engineering
This guide aims to motivate and introduce the basic concepts of feature engineering. This guide is **not** comprehensive; it is meant to supplement lecture, not replace it.
---
## What is a feature?
A **feature** is an attribute of the data that a model uses to make predictions. For example, a person's age might be a feature in a model that predicts salary. In a table of data, the features are the columns of the table.
Mathematically, when we propose a model
$$ \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p$$
the $x_1 , ... , x_p$ are the features of the model.
----
Until this point in the class, we have been using the columns of the data we've been given (maybe with some minor tweaks) as the inputs to our models. But we aren't restricted to just the columns we've been given!
**Feature engineering** is the process of generating new features and choosing which features to use as input to our model.
---
## Motivation for Feature Engineering
There's nothing wrong with only using the raw data as input to a model. But since we want our models to be as accurate as possible, we want to give our models as much information as possible to be able to make predictions. This is why we do feature engineering.
Below are some specific scenarios in which feature engineering is especially helpful.
----
### 1. Predicting Nonlinear Trends
Let's say the data we want to model looks like
![](https://i.imgur.com/PIq7WgN.png)
If we try to fit a model of the form $\hat{y} = \theta_0 + \theta_1 x$, we get the following:
![](https://i.imgur.com/dDbmXVu.png)
It makes sense that our model does poorly, because we're using a linear model to predict data that seems to be quadratic.
It seems like there's nothing we can do, since we can only use linear models.
But a linear model only means that we can write the model as $\hat{y} = \theta \cdot x$. We have complete freedom to choose what $x$ is, as long as the prediction stays a linear combination of the entries of $x$, with the parameters $\theta$ as the weights.
> Note: the first model is $\hat{y} = \theta_0 + \theta_1 x = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x \end{bmatrix} = \theta \cdot x$, so it is linear.
Since we have the freedom to choose $x$ and we know the data looks quadratic, let's choose $x = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix}$ and use the model $\hat{y} = \theta \cdot x = \theta_0 + \theta_1 x + \theta_2 x^2$ with our updated $x$.
![](https://i.imgur.com/kvQnjqj.png)
Our new model seems to be fitting the data well! The model's improved performance is due to the fact that we ++engineered a quadratic feature++.
This example is indicative of a more general use case of feature engineering. When the data exhibits a nonlinear trend, it's often helpful to engineer nonlinear features.
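To make this concrete, here is a minimal sketch that fits both models to some made-up quadratic data (not the data in the plots above). It uses NumPy's least-squares solver rather than sklearn just to stay self-contained; the names `X_line` and `X_quad` are my own.

```python
import numpy as np

# Hypothetical noiseless quadratic data: y = 1 + 2x + 0.5x^2
x = np.linspace(-3, 3, 20)
y = 1.0 + 2.0 * x + 0.5 * x**2

# Model 1: y_hat = theta_0 + theta_1 * x, so the columns are [1, x]
X_line = np.column_stack([np.ones_like(x), x])
theta_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)

# Model 2: same linear machinery, plus an engineered x^2 column: [1, x, x^2]
X_quad = np.column_stack([np.ones_like(x), x, x**2])
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

# Mean squared error of each model on the training data
mse_line = np.mean((X_line @ theta_line - y) ** 2)
mse_quad = np.mean((X_quad @ theta_quad - y) ** 2)
print("MSE without x^2:", mse_line)
print("MSE with x^2:   ", mse_quad)
```

The model is still linear in $\theta$ in both cases; the only thing that changed is which columns we handed to the solver.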
### 2. Including Qualitative Data in Models
Sometimes we get data that isn't quantitative (e.g. Sex, Zip Code, City, etc.). As of now, we have no way of including such information in our models, so we just throw it away.
There are ways to engineer features out of qualitative information, which ultimately enable models to utilize more information to make predictions.
## Common Feature Engineering Methods
Below is a list of some of the most common feature engineering techniques.
1. **Polynomial Features**
Engineering polynomial features is an extension of the previous quadratic feature example. Instead of (or in addition to) using a quadratic feature, we can use $x^3$, $x^4$, or any other power of $x$.
2. **One-Hot Encoding**
One-hot encoding is a method used to convert categorical data into numerical features that can be inputted into a model. I'll refer you to lecture for a more detailed explanation. ++One-hot encoding is one of the most useful feature engineering methods.++
3. **Bag-of-Words Encoding**
Bag-of-words encoding is a method used to convert text data (e.g. emails) into quantitative data that can be inputted into a model. Again, I'm going to refer you to lecture for a more detailed explanation.
4. **Dropping Features**
Nobody says you have to use all your data to make predictions! If you have good reason to believe one of the features in your raw data is useless for making predictions, it's OK to drop it! But make sure you really have a good reason to drop it.
This might not seem like feature engineering, but the process of choosing which features to include in the model is also part of feature engineering.
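As a rough illustration of the two encodings in the list above, here is a simplified sketch in plain Python. The helper names `one_hot` and `bag_of_words` and the toy inputs are my own, and the version from lecture may differ in details (for example, how text is split into words).

```python
from collections import Counter

def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def bag_of_words(documents):
    """Return (vocabulary, rows): one word-count column per distinct word."""
    # Simplifying assumption: lowercase the text and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Made-up categorical column
cities = ["Berkeley", "Oakland", "Berkeley"]
city_columns, city_features = one_hot(cities)
print(city_columns, city_features)

# Made-up text data
emails = ["free money now", "meeting now"]
vocab, email_features = bag_of_words(emails)
print(vocab, email_features)
```

In both cases the output rows are purely numerical, so they can be used as columns of the input to a model.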
## Feature Functions
When you get raw data, each data point has some number of attributes (columns). Let's say the number of attributes in the raw data is $d$.
After doing feature engineering, you'll have a set of features you want to include in your model. Let's say that the number of features you have is $p$.
> For example, in our example quadratic model, $d=1$ (the raw data is just $x$) and $p=2$ (the features are $x$ and $x^2$).
In general, $p \neq d$. Most of the time, $p > d$, but this is not a requirement.
The main takeaway is that there is a difference between the raw data and the data after you've done feature engineering. Each raw data point is $d$-dimensional, and each data point after feature engineering is $p$-dimensional.
We call the $d$-dimensional vectors ++data vectors++ and the $p$-dimensional vectors ++feature vectors++. For each data vector, there is a corresponding feature vector.
The job of a feature function is to transform data vectors into feature vectors. Mathematically, the feature function is denoted $\phi$ and the feature vector corresponding to the data vector $x$ is $\phi(x)$.
Since the feature function $\phi$ only transforms one data vector $x$ into a feature vector $\phi(x)$ and we usually have $n$ data vectors $x_1, x_2, ... , x_n$, we have to apply $\phi$ to all $n$ data vectors to get $\phi(x_1), \phi(x_2), ... , \phi(x_n)$.
Then, in order to take advantage of the power of ~ linear algebra ~, we construct a matrix $\Phi = \begin{bmatrix}
& \phi(x_1) & \\
& \phi(x_2) & \\
& \vdots & \\
& \phi(x_n) &
\end{bmatrix}$. Note that since each $\phi(x_i)$ is $p$-dimensional, the $\Phi$ matrix is $n \times p$ dimensional.
> Side note: It is this $\Phi$ matrix you use when you call `model.fit()` in sklearn, not the data matrix, which is usually denoted $X$. I just wanted to state this explicitly since it can be a point of confusion for students.
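Here is a small sketch of how the pieces above fit together in code. The particular feature function `phi` (which adds a product feature) and the numbers in `X` are made up for illustration.

```python
import numpy as np

def phi(x):
    """Hypothetical feature function: maps a d=2 data vector to a p=3 feature vector."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

# n = 4 data vectors, each d = 2 dimensional (made-up numbers)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

# Apply phi to every data vector and stack the results as the rows of Phi
Phi = np.vstack([phi(x) for x in X])

print(X.shape)    # n x d
print(Phi.shape)  # n x p
```

Each row of `Phi` is the feature vector for the corresponding row of `X`.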
### Feature Functions Example
All the above was very abstract, so let's take a look at an example. Let's imagine that you don't know the formula for the area of a circle and you're trying to discover it using linear regression. You measure the radius and area of some circles and gather the information in a table:
| radius | area |
| ------ | ---- |
| $3$ | $9 \pi$ |
| $7$ | $49 \pi$ |
| $10$ | $100 \pi$ |
| $20$ | $400 \pi$ |
You set up for linear regression, defining your data matrix $X = \begin{bmatrix} 3 \\ 7 \\ 10 \\ 20 \end{bmatrix}$ and $y = \begin{bmatrix} 9\pi \\ 49\pi \\ 100\pi \\ 400\pi \end{bmatrix}$. The raw data you gathered (the radius) is $d=1$-dimensional, so your data matrix $X$ is $n \times 1$ dimensional.
Excited that you're just a few lines of Python away from uncovering the formula for the area of a circle, you call `model.fit(X, y)` and `model.predict(X)` to see the predictions of your model. Much to your chagrin, you find that your model predicted $\hat{y} = \begin{bmatrix} 158 \\ 369 \\ 527 \\ 1055 \end{bmatrix}$, which is nowhere close to the areas you recorded.
**What went wrong?** You trained a model of the form $\hat{y} = \theta r$, which clearly was not good enough to predict the area. You realize your model included all the data you gathered yet was still unable to correctly predict the area, so all hope seems lost.
Determined to find the formula, you hypothesize that since the area of a square is proportional to the square of its side length, maybe the area of a circle is proportional to the square of its radius. To test out your hypothesis, you apply the feature function $\phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$ to your raw data to get a new feature $\text{radius}^2$:
| $\text{radius}$ | $\text{radius}^2$ | $\text{area}$ |
| ------ | ---- | ---- |
| $3$ | $9$ | $9 \pi$ |
| $7$ | $49$ | $49 \pi$ |
| $10$ | $100$ | $100 \pi$ |
| $20$ | $400$ | $400 \pi$ |
You set up for linear regression again, this time with a new matrix that includes your brand new feature: $\Phi = \begin{bmatrix} 3 & 9 \\ 7 & 49 \\ 10 & 100 \\ 20 & 400 \end{bmatrix}$.
The feature vectors are 2-dimensional since $\phi(x)$ outputted a 2-dimensional vector. The $y$-vector remains unchanged, since you're still trying to predict the area. The new model is now $\hat{y} = \theta_1 r + \theta_2 r^2$.
You call `model.fit(Phi, y)` and `model.predict(Phi)` to see the predictions of your model. This time, your model predicts $\hat{y} = \begin{bmatrix} 9\pi \\ 49\pi \\ 100\pi \\ 400\pi \end{bmatrix}$! This matches perfectly with the areas you recorded!
Looking at the weights found by linear regression, you see that $\theta_1 = 0$ and $\theta_2 = \pi$. You conclude that the area of a circle is $\text{area} = \pi r^2$.
> In real life, you'd want to test out this formula using some test data, but for this example we'll ignore that.
To reiterate, the example above uses $\phi(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix}$ to transform the raw data which is 1-dimensional into a feature vector which is 2-dimensional. The linear regression model is then fit using the matrix that is composed of the feature vectors, NOT the data matrix.
This example shows the value of having feature functions. Without feature functions, we would be restricted to just using the raw data alone, which in many cases does not lead to good models (as we saw in this example).
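If you want to try the circle example yourself, here is a minimal sketch. It uses NumPy's least-squares solver in place of the sklearn `model.fit` call from the text (my substitution, to keep the sketch self-contained), with no bias term so that it matches the model forms $\hat{y} = \theta r$ and $\hat{y} = \theta_1 r + \theta_2 r^2$.

```python
import numpy as np

radius = np.array([3.0, 7.0, 10.0, 20.0])
area = np.pi * radius**2

# First attempt: raw data only, y_hat = theta * r
X = radius.reshape(-1, 1)                      # n x 1 data matrix
theta_raw, *_ = np.linalg.lstsq(X, area, rcond=None)
pred_raw = X @ theta_raw

# Second attempt: apply phi(x) = [x, x^2], y_hat = theta_1 * r + theta_2 * r^2
Phi = np.column_stack([radius, radius**2])     # n x 2 feature matrix
theta_feat, *_ = np.linalg.lstsq(Phi, area, rcond=None)
pred_feat = Phi @ theta_feat

print("Predictions from raw data:", pred_raw)   # far from the true areas
print("Weights with r^2 feature: ", theta_feat) # theta_1 near 0, theta_2 near pi
print("Predictions with r^2:     ", pred_feat)
```

The only difference between the two fits is whether the solver sees `X` or `Phi`.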
### Clarifications about Feature Functions
1. It is not always necessary to use feature functions. In these cases, just to keep the notation consistent, we say $\phi(x) = x$ so the feature vectors are just the raw data.
==2. Feature vectors do not include the $1$ that is necessary to have a bias term in the model. This is because ++the bias term is not considered a feature of the model.++==
## Mathematical Implications of Feature Engineering
Recall that when it comes time to train a least squares model, we need to solve the normal equations
$$ (\Phi^T \Phi) \theta^* = \Phi^Ty$$
Since feature engineering changes the $\Phi$ matrix, feature engineering has an impact on the normal equations. This section looks at those impacts.
### 1. Too many features
If you add too many features, the normal equations will no longer have a unique solution.
To see why, remember that each feature has an associated parameter, so there are $p$ parameter values to solve for. The normal equations are a system of $p$ linear equations in these $p$ unknowns, but they are built from only $n$ data points: the rank of $\Phi^T \Phi$ equals the rank of $\Phi$, which is at most $n$.
If $p > n$, then at most $n$ of those $p$ equations are independent, so you effectively have more unknowns than equations and the system does not have a unique solution. There are infinitely many solutions to the normal equations, meaning there are infinitely many choices of $\theta ^*$ that achieve the minimum value of loss on the training data.
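Here is a tiny numerical sketch of the $p > n$ case, using a made-up $\Phi$ with $n = 2$ rows and $p = 3$ columns. The vector `v` is chosen by hand so that `Phi @ v` is zero, which lets us build a second solution from the first.

```python
import numpy as np

# Hypothetical problem with n = 2 data points and p = 3 features
Phi = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
y = np.array([1.0, 2.0])

gram = Phi.T @ Phi                              # the p x p matrix Phi^T Phi
rank = np.linalg.matrix_rank(gram, tol=1e-8)    # cannot exceed n = 2
print("rank of Phi^T Phi:", rank, "but p =", Phi.shape[1])

# One solution of the normal equations (lstsq returns the minimum-norm one)
theta_a, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Phi @ v = 0, so adding any multiple of v gives another solution
v = np.array([1.0, -2.0, 1.0])
theta_b = theta_a + 5.0 * v

# Both parameter vectors satisfy (Phi^T Phi) theta = Phi^T y
print(np.allclose(gram @ theta_a, Phi.T @ y))
print(np.allclose(gram @ theta_b, Phi.T @ y))
```

Since the multiple of `v` is arbitrary, there are infinitely many such solutions.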
### 2. Linearly dependent features
If you have features that are linearly dependent, then the normal equations will again not have a unique solution.
To see why, note that the unique solution to the normal equations $\theta ^* = (\Phi^T \Phi)^{-1} \Phi^Ty$ only exists if $(\Phi^T \Phi)^{-1}$ exists.
But $(\Phi^T \Phi)^{-1}$ exists only when $\Phi$ has full column rank (i.e. its columns are linearly independent).
> If you want to see a proof of this, check out my discussion 9 notes from Fall 2019 on [my resources page](tinyurl.com/raguvirData100). You don't need to understand the proof, but you'll need to know that $(\Phi^T \Phi)^{-1}$ exists only when $\Phi$ is full rank.
So if $\Phi$ is not full rank (i.e. does not have linearly independent columns), then there is not a unique solution to the normal equations. Again, there are infinitely many solutions to the normal equations, meaning there are infinitely many choices of $\theta ^*$ that achieve the minimum value of loss on the training data.
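As a quick check of this in code, here is a sketch with a made-up $\Phi$ whose third column is exactly twice its first (think of the same measurement recorded in two different units). Dropping the redundant column is one possible fix, shown here only as an illustration.

```python
import numpy as np

# Hypothetical features: column 3 = 2 * column 1, so the columns are linearly dependent
Phi = np.array([[1.0, 0.0, 2.0],
                [2.0, 1.0, 4.0],
                [3.0, 0.0, 6.0],
                [4.0, 1.0, 8.0]])
y = np.array([3.0, 7.0, 9.0, 13.0])

gram = Phi.T @ Phi
rank_Phi = np.linalg.matrix_rank(Phi, tol=1e-8)
det_gram = np.linalg.det(gram)
print("rank of Phi:", rank_Phi, "out of", Phi.shape[1], "columns")
print("det of Phi^T Phi:", det_gram)  # (numerically) zero, so no inverse

# Dropping the redundant column restores linearly independent columns
Phi_fixed = Phi[:, :2]
gram_fixed = Phi_fixed.T @ Phi_fixed
theta = np.linalg.solve(gram_fixed, Phi_fixed.T @ y)  # now a unique solution
print("unique theta after dropping the column:", theta)
```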
#### Why are infinitely many solutions bad?
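If there are infinitely many solutions to the normal equations, then the training data alone does not determine which $\theta ^*$ you get. The $\theta ^*$ returned when you train a linear regression model (e.g. `model.fit(Phi, y)`) depends on how the solver happens to pick among them, so a different solver or implementation can return a different $\theta ^*$.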
This means that when you do `model.predict(Phi_test)`, you can get wildly different predictions even if you use the same training data. ++A model that doesn't reliably give the same predictions for the same inputs is not a model that can be trusted.++
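Here is a hand-built sketch of that problem. The two training columns are identical, and `theta_a` and `theta_b` are two parameter vectors I picked by hand that both fit the made-up training data perfectly.

```python
import numpy as np

# Made-up training data where the two features are identical (linearly dependent)
Phi_train = np.array([[1.0, 1.0],
                      [2.0, 2.0],
                      [3.0, 3.0]])
y_train = np.array([2.0, 4.0, 6.0])

# Two different parameter vectors, both with zero training error
theta_a = np.array([2.0, 0.0])
theta_b = np.array([-98.0, 100.0])
print(Phi_train @ theta_a)  # equals y_train
print(Phi_train @ theta_b)  # also equals y_train

# A test point where the two features happen to differ
Phi_test = np.array([[1.0, 2.0]])
pred_a = Phi_test @ theta_a
pred_b = Phi_test @ theta_b
print(pred_a, pred_b)  # 2 versus 102
```

Nothing in the training data tells us which of these two predictions to prefer.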
## Overfitting
In the pursuit of getting more accurate models, we use feature engineering to add features for our model to use to get lower error on the training data. But it is possible to add features that are really good at predicting the training data but aren't good at predicting the test data.
Models that use such features are said to ++overfit++ to the training data. This problem is so common that there is an entire section of this course devoted to it.
That's what we'll be focusing on in the coming weeks.
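In the meantime, here is a rough sketch you can play with. It fits polynomial features of degree 2 and degree 9 to made-up noisy quadratic data and compares training and test error. The data-generating process, the seed, and the helper names are all my own assumptions, and the features here include a column of $1$s for the bias.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Made-up data: a quadratic trend plus a little noise."""
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.1, n)
    return x, y

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def mse(Phi, theta, y):
    return float(np.mean((Phi @ theta - y) ** 2))

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

results = {}
for degree in (2, 9):
    Phi_train = poly_features(x_train, degree)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = mse(Phi_train, theta, y_train)
    test_mse = mse(poly_features(x_test, degree), theta, y_test)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree 9 model can never do worse than the degree 2 model on the training data, since it has every degree 2 feature available to it. Whether its test error is worse depends on the particular data, so try a few seeds and training set sizes.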
# Feedback
If you have any feedback about this guide or suggestions on how to make it a better resource for you (e.g. what to add, explain more, etc.), please fill out my [feedback form](https://tinyurl.com/raguvirTAfeedback).
If you have questions about the content of this guide, email me at rkunani@berkeley.edu!