# Public Lecture Notes for UVA neural networks

Author: Hadi Daneshmand (dhadi at virginia dot edu)

Instructor: Hadi Daneshmand

Citation:
```bibtex
@book{hadi2025,
  title={UVA Lecture Notes for "Neural Networks: A Theory Lab"},
  author={Hadi Daneshmand},
  year={2025}
}
```

**Presentation structure:** Lectures feature puzzles and coding exercises designed to engage students and deepen their understanding of the theoretical background. This approach is inspired by my discussions with [**Professor Leslie Kaelbling**](https://www.csail.mit.edu/person/leslie-kaelbling) at MIT, who shared examples of combining experiments with theory to effectively teach machine learning concepts. I am developing this course to enhance students' grasp of the theoretical foundations of machine learning with experimental observations.

# Universality: Bridging kernel methods and neural networks

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\sign}{\mathrm{sign}}$

There is an important motivation behind deep learning: **"automatic feature extraction from data."** To elaborate on this motivation, I briefly review the traditional methods used before deep learning to process data. Let's look at one of the UCI datasets for machine learning, called the [Heart Disease Dataset](https://archive.ics.uci.edu/dataset/45/heart+disease), released in 1988. This dataset aims to predict whether a patient has heart disease based on features such as age, serum cholesterol, maximum heart rate, and 10 other features. Traditional data processing and statistics heavily relied on feature engineering: *How do we extract features that can accurately predict heart disease?*

Collecting the right features for the right task was very challenging. A traditional method in statistics was to collect as many features as possible with expert opinion and then prune and remove unrelated features using methods like LASSO. Feature extraction remains crucial, especially in medical applications where extracting meaningful features is very expensive. The following figure shows the cycle of conventional feature engineering.

![image](https://hackmd.io/_uploads/SJcJO18wkg.png)

> Question 1: Which features can help to predict skin cancer? Do you think these features are different from those for heart disease?

When thinking about the above question, we quickly notice how time-consuming and domain-specific feature engineering is. Deep learning aims to automatically extract features from raw data without labor-intensive feature engineering. The main goal of deep learning is to automate this feature extraction. But how? In this chapter, we introduce the important notion of universality, which allows us to better grasp the mechanism of feature extraction in neural networks and connect it with kernel methods.
Remarkably, neural networks can be introduced in completely different ways: through a biologically inspired approach, or as a black-box method defined by its inputs and outputs. Here, however, we take a different approach and reason about why neural networks are defined the way they are.

# Kernel methods

# Lab: Observation

**Notations.** Let $\mathbb{R}$ denote the real numbers. $x\in \mathbb{R}^d$ means that $x$ is a $d$-dimensional vector whose elements are real numbers.

**Regression (an informal recap).** You are probably familiar with regression in machine learning: given a set of points $\{(x_1, y_1), (x_2,y_2), \dots, (x_n,y_n)\}$ where $x_i \in \R^{d}$ and $y_i \in \R$, the $x_i$ are independent variables and the $y_i$ depend on the $x_i$; namely, there is an unknown function $f$ such that $y_i = f(x_i)$. Our goal is to estimate this unknown function so that we can predict $y_{n+1} = f(x_{n+1})$.

To find $f$ with $f(x_i) = y_i$, one can define a general parametric form for $f$ as $f(x) = \sum_{i=1}^m w_i \varphi_i(x)$, where $\varphi_i:\R^d \to \R$ can be any non-linear function and the parameters $w_1,\dots, w_m$ can be chosen carefully. We call $\varphi_1, \dots, \varphi_m$ features. Consider the simple coordinate features $\varphi_i(x)=[x]_i$, where $[x]_i$ is the $i$-th coordinate of the vector $x$. This choice of features is effective when the data $y_i$ can be approximated by a projection of $x_i$.

<img src="https://hackmd.io/_uploads/BJCgdDgW1l.png" alt="drawing" width="300" style='float:right'/>

> Q2: Design features $\varphi_i$ for the dataset in the right figure such that $y_i = \sum_{j} w_j \varphi_j(x_i)$.

It is easy to check that $\varphi_i(x) = (x_i-1)^2$ is a good feature for the data. But what if $y_i = (x_1 - 1)^2 + x_2^3 + e^{x_3+x_4} + \text{noise}$? Notably, we do not have access to the underlying function generating the $y_i$ from the $x_i$; thus, we cannot manually adapt the model to the data. Is it possible to design features $\varphi_j(x)$ that can approximate $y_i$ effectively, regardless of its complex dependency on $x_i$? Moreover, can such features be versatile enough to approximate multiple functions simultaneously? For instance, consider the following 3 functions. Can we design $\varphi_j(x)$ to approximate all of them well? I call such features $\varphi_1, \dots, \varphi_n$ *universal*, since they enable fitting many nonlinear functions of $x$. Notably, the notion of "universality" comes from *function approximation theory*.

![square_10](https://hackmd.io/_uploads/SkIPl3Uvyg.png =29%x)![square_100](https://hackmd.io/_uploads/ry0Dx3IwJx.png =29%x)![square_1000](https://hackmd.io/_uploads/SJcYg38D1l.png =29%x)

:::warning
[In-Classroom Exercise] Generate $\varphi_1(x),\dots, \varphi_n(x)$ such that we can approximate the above functions by solving least-squares on the generated features $\varphi_1(x),\dots, \varphi_n(x)$: [code](https://colab.research.google.com/drive/1M-nRBdhg1XJiV8sy4wMkUsfPJLKKSCB0?usp=sharing)
:::
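To make the setup concrete, here is a minimal sketch of the least-squares step once some features $\varphi_j$ have been chosen. The polynomial features and the name `get_polynomial_features` below are placeholders for illustration only, not a solution to the exercise above.

```python
import numpy as np

# Least-squares regression on hand-chosen features: a sketch of the mechanics only.
# The polynomial features below are placeholders, not a solution to the in-class exercise.
def get_polynomial_features(x, m=5):
    # returns the n x m matrix with entries phi_j(x_i) = x_i**j for j = 0, ..., m-1
    return np.stack([x**j for j in range(m)], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = (x - 1)**2 + 0.05 * rng.standard_normal(200)   # noisy quadratic target

Phi = get_polynomial_features(x)                   # features phi_j(x_i)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]         # fitted weights w_j
y_hat = Phi @ w                                    # predictions sum_j w_j phi_j(x_i)
print(np.mean((y_hat - y)**2))                     # mean squared error, close to the noise level
```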
# Theory: Representer Theorem

To solve the above exercise, we delve into the elegant *Representer Theorem* formulated by Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola in their seminal paper [A Generalized Representer Theorem](https://alex.smola.org/papers/2001/SchHerSmo01.pdf).

---

### **Kernel Density Estimation**

In 1977, Tukey investigated how to recover a probability density from samples $x_1, \dots, x_n$ drawn i.i.d. from a distribution $p$. Kernel density estimation is an intuitive approach to this problem: put a small bump on each point and compute the average of the bumps, namely:
$$ \widehat{p}(x) = \frac{1}{n} \sum_{i=1}^n k(x_i, x) $$
where $k(x, y)$ is called a **kernel**. An example of a kernel is the **Gaussian kernel**, defined as:
$$ k(x, y) = e^{-\|x - y\|^2 / 2}. $$
Indeed, kernel smoothing resolves the non-smooth structure of the empirical distribution of i.i.d. samples. The following plot shows an example of density estimation with a Gaussian kernel.

![image](https://hackmd.io/_uploads/HyKmWRrvJx.png)

**Reproducing Property: Asymptotics of Density Estimation**

For some distributions, kernel density estimation becomes exact in the asymptotic regime $n \to \infty$, where the sample size grows larger and larger. The following figure shows kernel density estimation for different sample sizes (values of $n$). We observe that as the sample size increases, the kernel density estimate converges to the true data distribution, which in this example is a mixture of two Gaussian distributions.

![image](https://hackmd.io/_uploads/B16w-RBvyx.png)

Let $\widehat{p}(x)$ denote the kernel density estimate, and let $p$ be the true underlying data distribution. When $n\to \infty$, an application of the law of large numbers yields
$$ \widehat{p}(x) = \lim_{n\to \infty} \frac{1}{n} \sum_{i=1}^n k(x,x_i) = \int_{-\infty}^{\infty} k(x,y)p(y) dy. $$
Thus, the estimate $\widehat{p}$ becomes exact when $p(x) = \widehat{p}(x)=\int_{-\infty}^{\infty} k(x,y)p(y) dy$. This property is called the reproducing property. Intuitively speaking, the reproducing property ensures recovery of the density value at $x$ from a weighted combination of the density at all $y\neq x$.

**Representer theorem.** The Representer Theorem is a remarkable result, implying that nonlinear features can cast the approximation of a rich family of functions as a linear regression problem.

:::info
[[Representer Theorem]](https://alex.smola.org/papers/2001/SchHerSmo01.pdf) (oversimplified) Consider the following regression problem:
$$ f^* = \arg\min_{g} \sum_{i=1}^n (g(x_i) - y_i)^2 + \lambda \int_{-\infty}^\infty g(x)^2 dx $$
subject to $g$ obeying the reproducing property with a positive semi-definite kernel $k$: $g(x) = \int k(x,y)g(y) dy$. Then $f^*$ can be **represented** as
$$ f^*(x) = \sum_{i} \alpha_i k(x,x_i), \text{ where } \alpha_1, \dots, \alpha_n \in \mathbb{R}. $$
Notably, the representer theorem holds only for positive semi-definite kernels $k$.
:::

Just as a note, a kernel is positive semi-definite if for all $n$, all $x_1,\dots, x_n$, and all $c_1,\dots, c_n \in \mathbb{R}$,
$$ \sum_{i=1}^n \sum_{j=1}^n k(x_i,x_j) c_i c_j \geq 0 \text{ holds.} $$
We will discuss this property in greater detail later.

According to the representer theorem, a linear combination of the features $\varphi_j(x) = k(x,x_j)$ is sufficient to approximate many functions (obeying the reproducing property) at the same time. Indeed, $\varphi_j(x) = k(x,x_j)$ are universal features. We adapted the code from the in-class exercise to elaborate on this important fact.
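As a quick illustration of how these features are used, the sketch below builds $\varphi_j(x) = k(x, x_j)$ from the training points with a Gaussian kernel and fits the coefficients by least squares. It is a sketch mirroring the attachment code at the end of these notes; the one-dimensional toy target is only for illustration.

```python
import numpy as np

# Universal features phi_j(x) = k(x, x_j) built from a Gaussian kernel and the
# training points, fit by least squares (mirrors the attachment code below).
def gaussian_kernel_features(x, centers):
    # returns the matrix K with K[i, j] = exp(-(x_i - centers_j)^2 / 2)
    return np.exp(-(x[:, None] - centers[None, :])**2 / 2.0)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 100)
y_train = np.sin(2 * x_train)                    # a toy nonlinear target
K = gaussian_kernel_features(x_train, x_train)   # n x n feature (kernel) matrix
alpha = np.linalg.pinv(K) @ y_train              # least-squares coefficients alpha_i

x_test = np.linspace(-3, 3, 200)
y_pred = gaussian_kernel_features(x_test, x_train) @ alpha   # f*(x) = sum_i alpha_i k(x, x_i)
```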
Using the same features, we are able to approximate three different functions, as shown in the following plots (the implementation is available in the last cell of the [colab](https://colab.research.google.com/drive/1M-nRBdhg1XJiV8sy4wMkUsfPJLKKSCB0?usp=sharing)).

![square_10](https://hackmd.io/_uploads/B1xv8THvJe.png =29%x)![square_100](https://hackmd.io/_uploads/S1O_8arvJx.png =29%x)![square_1000](https://hackmd.io/_uploads/HybbOaHw1l.png =29%x)

We used the Gaussian kernel in the above plots.

:::warning
[Takehome Exercise 1] In the right plot above, we see that the performance of kernel regression is considerably worse compared to the other two plots. This may be because the underlying function in this plot does not obey the reproducing property with the Gaussian kernel. How can we prove that this function does not obey the property?
You do not need to submit your solution now. We will collect the takehome exercises in a pdf with more details and upload it to Canvas. Then, you can submit your solution to Canvas.
:::

To estimate $\alpha_1,\dots, \alpha_n$, one can use linear regression, as you will practice in the following exercise.

:::warning
[Takehome Exercise 2] Given the representation $f^*(x) = \sum_{i} \alpha_i k(x,x_i)$, explain how finding $\alpha_1, \dots, \alpha_n$ can be cast as a linear regression problem.
:::

Intuitively speaking, the representer theorem implies that the observations $x_1, \dots, x_n$ are sufficient to extract the optimal solution of the regression. Without the representer theorem, one might require $k(x,x_+)$ for $x_+$ not among the observations ($x_+\notin \{x_1,\dots, x_n\}$) to obtain a good regressor, so we might need to extrapolate beyond our observations. Remarkably, the representer theorem is much stronger than the statement above; we state a weak version of the theorem for simplicity.

**Limitations of kernel methods: A need for neural networks.** What is the accuracy of kernel methods on the ImageNet benchmark for image classification? Do they perform better or worse than neural networks? Surprisingly, this question remains unanswered due to the computational limitations of kernel methods. To find the parameters $\alpha_1, \dots, \alpha_n$, we need to construct a kernel matrix containing $k(x_i,x_j)$, which is an $n\times n$ matrix. For the ImageNet dataset, $n=14{,}000{,}000$; hence the size of the kernel matrix is $O(10^{14})$, which is beyond our memory and computational limits. In general, kernel methods have three main limitations compared to neural networks:
- High computational and memory cost (see the back-of-the-envelope sketch after this list). The next lecture reviews methods to reduce the cost.
- The curse of dimensionality (for function approximation) is a more fundamental issue of kernel methods. We will explain later that neural networks do not suffer from this issue.
- Designing an appropriate kernel for various applications requires domain knowledge. In other words, kernel methods do not fully automate feature extraction. Later, we will discuss how neural networks address this limitation.
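To make the memory claim concrete, here is a rough back-of-the-envelope calculation (a sketch assuming `float64` entries; the exact ImageNet sample count varies by version):

```python
# Rough memory footprint of the full ImageNet kernel matrix (float64 entries assumed).
n = 14_000_000                                  # number of samples
bytes_per_entry = 8                             # one float64
kernel_matrix_bytes = n**2 * bytes_per_entry    # n x n entries
print(kernel_matrix_bytes / 1e15, "petabytes")  # roughly 1.6 petabytes
```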
# Random features

As previously mentioned, applying kernel methods to large datasets with numerous samples, such as the ImageNet dataset, is practically infeasible due to computational limitations. Over the past couple of decades, there has been an ongoing pursuit to scale up kernel methods effectively. In this context, we introduce one of the most fundamental breakthroughs in this vein, a development that bridges the gap between kernels and neural networks.

# Lab Observations

We revisit the in-class exercise explored in the kernel methods lecture. Recall that our objective was to design features capable of transforming a non-linear regression problem into a linear regression problem, so that we can fit the following three functions.

![square_10](https://hackmd.io/_uploads/B1xv8THvJe.png =29%x)![square_100](https://hackmd.io/_uploads/S1O_8arvJx.png =29%x)![square_1000](https://hackmd.io/_uploads/HybbOaHw1l.png =29%x)

We discussed the representer theorem, which allows us to design $n$ features for $n$ samples to approximately fit all three functions above using a linear combination of the same features. Now, we aim to reduce the number of features from $n$ to 30.

:::warning
[In-Classroom Exercise] Generate $\varphi_1(x),\dots, \varphi_{\color{red}{30}}(x)$ where $\varphi_i: \mathbb{R}\to \mathbb{R}$ such that we can approximate the above functions by solving least-squares on the generated features $\varphi_1(x),\dots, \varphi_{\color{red}{30}}(x)$, namely by finding $w_1,\dots, w_{30}$ such that $y\approx \sum_{i=1}^{30} w_i \varphi_i(x)$. [Colab code to practice](https://colab.research.google.com/drive/1M-nRBdhg1XJiV8sy4wMkUsfPJLKKSCB0?usp=sharing) is available, where you need to implement a function called get_features(x) which takes a numpy array of size $n$ containing $x_1, \dots, x_n$ and returns an $n\times 30$ matrix $M$ such that $M_{ij}= \varphi_j(x_i)$ (check task 2).
:::

# Random Gaussian features

To solve the above puzzle, we introduce [random features](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).

:::info
[Random Gaussian features.](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf) Let $v_1, \dots, v_{m} \in \mathbb{R}^d$ be vectors whose elements are i.i.d. standard Gaussian, and let $b_1, \dots, b_m \in \mathbb{R}$ be drawn uniformly from $[0, 2\pi]$. Then, $\varphi_j(x) = \cos(v_j^\top x+ b_j)$ are random Gaussian features.
:::

**Solution for the in-class exercise.** We assess the performance of random Gaussian features in solving the in-class exercise. Implementing random Gaussian features is straightforward and can be done using the following code (the solution is included in the [colab](https://colab.research.google.com/drive/1M-nRBdhg1XJiV8sy4wMkUsfPJLKKSCB0?usp=sharing)):

```python
m = 30                         # the number of random features
v = np.random.randn(m)         # random frequencies v_1, ..., v_m (inputs are one-dimensional here)
b = 2*np.pi*np.random.rand(m)  # random phases b_1, ..., b_m, uniform on [0, 2*pi]

def get_features(x):
    # returns the n x m matrix of random Gaussian features cos(v_j * x_i + b_j)
    return np.cos(np.outer(x, v) + b)
```

**Striking universality of random features.** The following plots show the performance of random features.

![square_10](https://hackmd.io/_uploads/rkfzgMyu1g.png =29%x)![square_100](https://hackmd.io/_uploads/By6QefyOJg.png =29%x)![square_1000](https://hackmd.io/_uploads/HkarxM1_1l.png =29%x)

Why can we use the same features to fit 3 different functions? Why is the approximation not as good in the right plot? To address these questions, we present the theoretical background behind random features. We will see that random features can approximate a wide family of functions connected to the Gaussian kernel and its associated reproducing property.

# Theory behind random features

**PSD matrices.** Recall that a square matrix $M \in \mathbb{R}^{d\times d}$ is PSD if for all $x_1,\dots, x_d \in \mathbb{R}$,
$$\sum_{ij} M_{ij} x_i x_j \geq 0.$$
PSD matrices have non-negative eigenvalues, hence can be represented as
$$M = \sum_i \lambda_i e_i e_i^\top$$
where $e_i$ are eigenvectors and $\lambda_i \geq 0$ are eigenvalues.
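As a quick numerical illustration (a sketch, not part of the lecture code), the eigenvalues of a Gaussian-kernel Gram matrix built on random points are all non-negative up to floating-point error, previewing the PSD kernels discussed next:

```python
import numpy as np

# Eigenvalues of a Gaussian-kernel Gram matrix M_ij = k(x_i, x_j) on random points.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)                        # 50 one-dimensional points
M = np.exp(-(x[:, None] - x[None, :])**2 / 2.0)    # Gaussian kernel matrix
eigvals = np.linalg.eigvalsh(M)                    # eigenvalues of the symmetric matrix M
print(eigvals.min())                               # non-negative up to rounding error
```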
**PSD kernels and Mercer's theorem.** A kernel $k$ is PSD if for all $x_1, \dots, x_n$ and $\alpha_1, \dots, \alpha_n$, $\sum_{ij} \alpha_i \alpha_j k(x_i,x_j)\geq 0$ holds. Mercer's theorem provides the following representation for PSD kernels:
$$k(x,y) = \sum_{i} \lambda_i e_i(x) e_i(y)$$
where $\lambda_i>0$ and $e_i$ is called an eigenfunction. A consequence of Mercer's theorem is that a PSD kernel can be represented as an inner product. In particular, let $\phi(x) = [\sqrt{\lambda_1} e_1(x), \sqrt{\lambda_2} e_2(x), \dots ]$, which is an infinite-dimensional vector. Then, it is easy to check that $k(x,y)= \langle \phi(x), \phi(y) \rangle$.

**Kernel methods and linear regression.** The inner-product representation of PSD kernels allows us to interpret kernel methods as linear regression on the features $\phi(x)$. Recall that the representer theorem implies the following representation for the optimal function (fitting a set of observations):
$$f^*(x) = \sum_i \alpha_i k(x,x_i). $$
According to Mercer's theorem, $k(x,x_i) = \langle \phi(x), \phi(x_i) \rangle$. Plugging this into the above equation yields
$$f^*(x) = \langle \phi(x), \underbrace{\sum_i \alpha_i \phi(x_i)}_{w^*} \rangle. $$
Thus, finding the optimal function $f^*$ can be seen as finding an optimal $w^*$, an infinite-dimensional vector, hence solving linear regression with infinite-dimensional features $\phi(x)$. We do not know how to solve infinite-dimensional regression; kernel methods cast this infinite-dimensional problem to a regression on $n$ variables. But what if we could approximate the kernel with a finite number of features?

**Hypothetical random features.** Assume that we are given the functions $e_i(x)$ and the values $\lambda_i$ in Mercer's representation of a PSD kernel $k$, i.e. $k(x,y) = \sum_i \lambda_i e_i(x) e_i(y)$, and assume $\sum_{i} \lambda_i = 1$. Then, define a random integer $r$ with $p(r=i) = \lambda_i$. We can use this random variable to reconstruct $k$ via the random feature $\phi_r(x) = e_r(x)$. According to the definition of $r$, we have
$$\mathbb{E}[ \phi_r(x) \phi_r(y)] = \sum_{m} p(r=m) e_m(x) e_m(y) = \sum_{m} \lambda_m e_m(x) e_m(y) = k(x,y).$$
Constructing the kernel in expectation allows us to reduce the variance by increasing the number of features: we can draw i.i.d. $r_1, \dots, r_m$ and define $\phi(x) = \frac{1}{\sqrt{m}}[e_{r_1}(x), \dots, e_{r_m}(x)]$. Then, it is easy to check that
$$\mathbb{E}[ \langle \phi(x), \phi(y) \rangle ] = k(x,y)$$
with a variance vanishing in $m$.

The main problem with Mercer's theorem is that it is not constructive and does not specify an explicit form for the functions $e_i(x)$ and the values $\lambda_i$. Bochner's theorem provides an explicit form for random features that can be used to approximate shift-invariant kernels. Shift-invariant kernels are kernel functions that depend only on the difference between two points, hence are denoted by $k(x-y)$ instead of $k(x,y)$. The Gaussian (RBF) kernel is an example of a shift-invariant kernel since $k(x,y) = k(x-y) = e^{-\|x-y\|^2/2}$.

:::info
[Bochner's Theorem.](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf) A PSD shift-invariant kernel $k$ can be represented as $k(\Delta) = \int e^{i \langle \Delta, w \rangle} p(w) dw$ where $p(w)\geq 0$.
:::
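A quick Monte Carlo sanity check of Bochner's theorem for the one-dimensional Gaussian kernel (a sketch; it assumes the convention above, with $p(w)$ the standard normal density as in the table below):

```python
import numpy as np

# Monte Carlo check: for the Gaussian kernel, k(delta) = E_w[cos(w * delta)] with w ~ N(0, 1).
rng = np.random.default_rng(0)
w = rng.standard_normal(200_000)         # samples from p(w) = N(0, 1)
delta = 1.3                              # a fixed difference x - y
mc_estimate = np.mean(np.cos(w * delta))
exact = np.exp(-delta**2 / 2.0)          # Gaussian kernel value k(delta)
print(mc_estimate, exact)                # the two values should nearly match
```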
The following table shows the explicit form of $p(w)$ for 3 important kernels. Interestingly, $p$ is a probability density function for these kernels, i.e. $\int p(w)dw = 1$.

![image](https://hackmd.io/_uploads/r1horEkd1x.png)

**Random Fourier features.** A straightforward calculation with complex exponentials on top of Bochner's representation yields random features: given a random vector $w$ drawn from $p(w)$ and $b\sim\text{uniform}[0,2\pi]$, a PSD shift-invariant kernel can be represented as
$$k(x,y) = 2\,\mathbb{E} \left[\cos(\langle w, x \rangle + b) \cos(\langle w, y \rangle + b) \right]. $$

:::warning
[Takehome Exercise 3] Derive the above equation from Bochner's Theorem.
:::

While we defined random Gaussian features above, it is also possible to draw the random vectors $v_1, \dots, v_m$ from the density function $p$ associated with other kernels. Since random features are constructed based on the probability density $p(w)$, derived from Bochner's representation (the Fourier transform) of a kernel, they can effectively estimate a specific function class that adheres to the reproducing property associated with $k$, with an additional constraint. For more details on the universality of random features, refer to [Random Kitchen Sinks](https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf).

**Kernels -- Random Features -- Neural networks.** Random features bridge kernel methods and neural networks. When the number of random features $m$ (in the definition above) is significantly smaller than $n$, they reduce the computational and storage complexity of kernel methods while still approximating a similar family of functions. However, the performance of random features heavily depends on the choice of $p$, which is often challenging to determine. Moreover, both random features and kernel methods are affected by the "curse of dimensionality," a limitation we will explore later. Neural networks, however, address these issues effectively.
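To make this bridge explicit, the sketch below (a toy illustration; the names `fit_random_features` and `predict` are mine, not from the course code) treats random Fourier features as a one-hidden-layer network with a cosine activation: the first-layer weights and biases are random and frozen, and only the linear readout is trained by least squares.

```python
import numpy as np

# Random features viewed as a one-hidden-layer network: frozen random first layer,
# cosine activation, and a trainable linear readout fit by least squares.
def fit_random_features(X, y, m=300, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = rng.standard_normal((d, m))           # frozen first-layer weights
    b = 2 * np.pi * rng.random(m)             # frozen biases, uniform on [0, 2*pi]
    H = np.cos(X @ V + b)                     # hidden representation phi(x)
    w = np.linalg.lstsq(H, y, rcond=None)[0]  # train only the readout weights
    return V, b, w

def predict(X, V, b, w):
    return np.cos(X @ V + b) @ w

# toy usage: fit y = sin(3x) from scattered one-dimensional inputs
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0])
V, b, w = fit_random_features(X, y)
print(np.mean((predict(X, V, b, w) - y) ** 2))  # small training error
```

Replacing the frozen random first layer with a trained one is precisely where neural networks depart from random features.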
# Attachments

Code is available on colab at https://colab.research.google.com/drive/1Ub98Pnapr6SNlp17w4Tr903YuwhVLhdU?usp=sharing

```python
%pylab
%matplotlib inline
import numpy as np
from scipy import signal
from matplotlib import pyplot as plt

# function repository
def input_peicewise_linear(n):
    # returns n samples (x,y) of a peicewise linear function
    x = 12*np.random.rand(n)
    x = np.asarray(sorted(x))
    y = np.zeros(n)
    inds = x<3
    y[inds] = x[inds]
    inds = np.logical_and(x>3, x<6)
    y[inds] = -x[inds]+6
    inds = np.logical_and(x>6, x<9)
    y[inds] = x[inds]-6
    inds = x>9
    y[inds] = -x[inds]+12
    return x, y

def input_sin(n):
    # returns n samples (x,y) of sin function
    x = np.random.rand(n)  # np.asarray(range(256))/10
    x = np.asarray(sorted(x))
    y = np.sin(4*pi*x)
    return x, y

## generating tricky dataset
def triangular_signal(n):
    x = range(n)
    x = np.asarray(x)
    x = 2*x/n-1
    ticks = [-1,-.9,-.8,-.7,-.2,0.3,0.4,0.5,0.8,1]
    # ticks = [-4,-.5,0]
    start = 0
    def getvalue(xi, ticks, start=0, alpha=1):
        sign = 1
        start = start
        for i in range(len(ticks)-1):
            if ((xi >= ticks[i]) & (xi < ticks[i+1])):
                # print(xi, sign, start, -ticks[i], '--------')
                return alpha*sign*(xi-ticks[i]+start)
            start = -1*(ticks[i+1]-ticks[i]+start)
            sign = -1*sign
        # print('---------')
        return 0
    y = []
    for xi in x:
        y.append(getvalue(xi, ticks, start))
    y = np.asarray(y)
    # plt.plot(x,y)
    return x, y

def smooth_function(n):
    x = 2*(np.random.rand(n)-0.5)
    y = 0.1*(x-0.5)**2 - 0.1*(x)**3 + 0.2*np.sin(x*3)
    return x, y

## generating tricky dataset
def square_signal(n):
    x = range(n)
    x = np.asarray(x)
    x = 2*x/n-1
    ticks = [-1,-.8,-.2,0.3,1]
    # ticks = [-4,-.5,0]
    start = 0
    def getvalue(xi, ticks, start=0, alpha=1):
        sign = 1
        start = start
        for i in range(len(ticks)-1):
            if ((xi >= ticks[i]) & (xi < ticks[i+1])):
                # print(xi, sign, start, -ticks[i], '--------')
                return 2*(i % 2 - .5)
            start = -1*(ticks[i+1]-ticks[i]+start)
            sign = -1*sign
        # print('---------')
        return 0
    y = []
    for xi in x:
        y.append(getvalue(xi, ticks, start))
    y = np.asarray(y)
    # plt.plot(x,y)
    return x, y

from scipy.stats import laplace, cauchy

functions = [smooth_function, triangular_signal, square_signal]
n = 600
for function in functions:
    x, y = function(n)  # generates samples x1, ..., xn and y1, ..., yn
    # x and y are vectors of length n
    m = 20  # the number of features
    v = 20*np.random.randn(m)
    b = np.random.rand(m)
    def get_features(x):
        return np.exp((-1*np.outer(x**2,np.ones(n))-1*np.outer(np.ones(n),x**2)+2*np.outer(x,x))/2.0)  # this line is the solution of in-class exercise.
    inds = np.random.permutation(n)
    traing_inds = inds[0:160]
    test_inds = inds[160:n]
    features = get_features(x)
    train_features = features[traing_inds,:]
    train_features = train_features[:,traing_inds]
    w = np.linalg.pinv(train_features) @ y[traing_inds]
    test_features = features[:,traing_inds]
    test_features = test_features[test_inds,:]
    ypred = test_features @ w
    plt.figure()
    plt.scatter(x[test_inds], ypred, label='least squares', color='r', alpha=0.3)
    plt.scatter(x[test_inds], y[test_inds], label='y', color='b', alpha=0.3)
    plt.legend()
```