# INF367 Selected Topics in Artificial Intelligence
Shared notes for INF367 Selected Topics in Artificial Intelligence. Feel free to add your notes and check the correctness of what is here so far. If you're not sure about something you're writing, just note that.
[TOC]
## Unsupervised Learning (topics.pdf)
1. Density Estimation (Fromsa)
* Histograms, including number of bins
* Kernel density estimation, including kernels, binwidth, derivatives
* Supervised density estimation
* Density validation
* Curse of dimensionality
2. Anomaly Detection (Ondrej)
* Anomaly scoring and validation
* Density-based anomaly detection
* Lightweight online detector of anomalies (LODA)
* Local outlier factor (LOF)
* One-class support vector machine
* Isolation forest
* Representation-learning based approaches
* Anomaly detection benchmarks
3. Cluster Analysis (Håkon & Stian)
* Hill Climbing, including MeanShift, MedianShift, QuickShift++, ToMaTo
* k-means
* Hierarchical clustering, including single, complete, average, and Ward linkage
* Spectral clustering
* Density-based spatial clustering of applications with noise (DBSCAN)
* Hierarchical DBSCAN (HDBSCAN)
* Kleinberg impossibility theorem
* Validation
* Number of clusters (elbow / silhouette / gap)
* External validation (confusion / pair counts / information theoretic)
* Internal validation (silhouette / Dunn / Davies-Bouldin / SD)
* Noise
4. Representation Learning (Jonas)
* Self-organizing maps
* Matrix factorization
* Classical principal component analysis (PCA)
* Sparse PCA
* Nonnegative matrix factorization
* Kernel PCA
* Multidimensional scaling
* Number of dimensions
* Neighborhood-graph-based methods
* Isomap
* Spectral embedding
* Locally linear embedding (LLE)
* t-distributed stochastic neighborhood embedding (tSNE)
* Uniform manifold approximation and projection (UMAP)
* Autoencoders
* Standard autoencoder
* Sparse autoencoder
* Denoising autoencoder
* Contractive autoencoder
5. Other topics (Naphat)
* Gaussian mixture models, EM algorithm
* Generative models
* Variational autoencoders
* Generative adversarial models (GAN)
* Topological data analysis
* Filtered simplicial complexes
* Covers
* Čech and Rips complexes
* Persistence diagrams and images
* Semisupervised learning
* Laplace regularization
# Notes
## 1. Density Estimation
_Responsible: Fromsa_
- GOAL:
- Given data $X$ sampled from $f$ with an unknown underlying density, find a function $\hat{f}$ that approximates $f$.
- Useful For:
- Univariate density estimation for visualization
- <b><i>Clustering</i></b>, <b><i>anomaly detection</i></b>, etc.: clusters correspond to high-density regions (modes of the density), while anomalies are points lying in low-density regions.
- <b><i>Supervised learning</i></b>: class-conditional density estimates can be plugged into a Bayes classifier.
### Histograms, including number of bins
- Used On
- Discrete data.
- Straightforward: just choose bins.
- Multidimensional discrete data
- Straightforward: just choose bins for each dimension.
- Continuous data
- Use the derivative of the empirical cumulative distribution $f_n(x)$[^1] to create a histogram density estimate $\hat{f}(x)$[^2], given bins $B_k$
- It is the discrete uniform density over the data with probability $\frac{1}{n}$ at each point $x\in X$.
- Obviously not a good estimation of a continuous density distribution.
- Here we assume all bins to have the same width $h$ and $v_k$ is the number of points in bin $k$.
- How To Choose Bins:
- Sturges' rule [^8]
- Results in a reasonable number of bins if the data is normally distributed.
- If data has heavy tails, more bins are needed
- Not recommended.
- Scott's rule [^9]
- Uses the normal distribution as a reference and the sample standard deviation to derive the binwidth.
- Freedman-Diaconis rule [^10]
- Replace the sample standard deviation with the interquartile range in Scott's rule to get a more robust rule.
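
As a quick sanity check of the rules above, here is a minimal numpy sketch (variable names are my own) that computes the number of bins / binwidths for a sample; `numpy.histogram_bin_edges` also implements "sturges", "scott" and "fd" directly.

```python
import numpy as np

x = np.random.standard_normal(1000)            # toy sample
n = len(x)

k_sturges = 1 + int(np.ceil(np.log2(n)))       # Sturges: number of bins
h_scott = 3.5 * x.std(ddof=1) / n ** (1 / 3)   # Scott: binwidth
iqr = np.subtract(*np.percentile(x, [75, 25]))
h_fd = 2 * iqr / n ** (1 / 3)                  # Freedman-Diaconis: binwidth

# numpy implements the same rules as named estimators:
edges_fd = np.histogram_bin_edges(x, bins="fd")
print(k_sturges, h_scott, h_fd, len(edges_fd) - 1)
```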
### Kernel density estimation, including kernels, binwidth, derivatives
- GOAL
- Combine the smoothness of GMMs with the simplicity of histograms: histograms are not smooth, and GMMs are parametric (the number of components has to be chosen).
- IDEA [here is a good visual explanation](https://mathisonian.github.io/kde/)
- Place a kernel, scaled by a bandwidth, on each data point and sum the kernels to obtain a density estimate. See the animation for a good explanation.
- KERNEL
- Definition
- $\hat{f}(x)$ [^11]
- $K_h(x) = \frac{K(x/h)}{h}$
- $h >0$ is the bandwidth.
- A kernel $K$ must have these properties
- $K(x) \geq 0$ for all $x$
- $\int K(x) dx = 1$
- $\int xK(x) dx = 0$
- $\int x^2K(x)\, dx = \sigma_K^2 \in(0,\infty)$
- K is usually
- $K(x)=K(-x)$ i.e. symmetric.
- $K(0) \geq K(x)$ for all $x$ i.e. centered.
- Typical kernels:
- Uniform $K(x) = 0.5$ if $-1 \leq x \leq 1$
- Triangle $K(x) = \max\{0, 1-|x|\}$
- Normal $K(x) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}x^2\right\}$
- Epanechnikov $K(x) = \max\left\{0, \frac{3}{4}(1-x^2)\right\}$

- Bandwidth Selection
- Given a normal kernel, the reference rule is $h^∗=\frac{1.06\hat{\sigma}}{n^{1/5}}$.
- The best approach is cross-validation.
- Density Derivative Estimate
- Given a kernel density estimate $\hat{f}(x)$ [^11], a natural density derivative estimate $\hat{f}'(x)$[^12] falls out.
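
A minimal numpy sketch of the estimator $\hat{f}(x)$ above with a normal kernel, using the reference rule for the bandwidth (function names are my own):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    # \hat{f}(x) = (1 / (n h)) * sum_i K((x - X_i) / h)
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

data = np.random.standard_normal(500)
h = 1.06 * data.std(ddof=1) * len(data) ** (-1 / 5)   # normal reference rule
grid = np.linspace(-4, 4, 200)
density = kde(grid, data, h)
print(np.trapz(density, grid))                         # should be close to 1
```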
### Supervised density estimation
- IDEA
- Given data $X$ from density $f$ and reference density $f_0(x)$, estimate $f$ using $f_0(x)$
- Process
1. Random sample $X_0$ of size $N_0$ from $f_0$
2. Combine to get sample from mixture distribution $\frac{f+f_0}{2}$
3. Assign the value $Y=1$ to points drawn from $f$ and $Y=0$ from points drawn from $f_0$
4. Use supervised methods to estimate $\hat{\mu}(x) = E[Y|x] = \frac{f(x)}{f(x)+f_0(x)}$
5. Find the density estimate $\hat{f}(x) = f_0(x)\,\frac{\hat{\mu}(x)}{1-\hat{\mu}(x)}$
- How to choose Reference
- In principle arbitrary
- In practice, depends on what we want to learn:
- Departure from uniformity.
- Departure from independence (could get a sample from $f_0$ by applying random permutations to each variable).
- Increase accuracy ($f_0$ should be chosen such that $\mu$ and $f$ can easily be approximated).
- etc.
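
A hedged sketch of the procedure above, using a uniform reference density and a random-forest classifier for $\hat{\mu}$ (the choice of classifier and reference here is illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                       # sample from the unknown f
X0 = rng.uniform(X.min(0), X.max(0), size=X.shape)   # reference sample from f_0 (uniform)

Z = np.vstack([X, X0])
y = np.concatenate([np.ones(len(X)), np.zeros(len(X0))])   # Y=1 for f, Y=0 for f_0

clf = RandomForestClassifier(n_estimators=200).fit(Z, y)
mu = clf.predict_proba(X)[:, 1].clip(1e-6, 1 - 1e-6)       # \hat{mu}(x) = f / (f + f_0)

f0 = 1.0 / np.prod(X.max(0) - X.min(0))                    # uniform reference density
f_hat = f0 * mu / (1 - mu)                                 # \hat{f}(x) = f_0 * mu / (1 - mu)
```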
### Density validation
- GOAL:
- Given <b>training data $X$</b> and <b>Validation data $X^{val}$</b>, validate the <b>Density estimate $\hat{f}$</b>.
- How:
- The $MISE\left(\hat{f}\right)$ [^3], approximated by $\hat{J}\left(\hat{f}\right)$[^4], is usually used.
- Error Measures:
- Kullback-Leibler divergence $D_{KL}\left(\hat{f},f\right)$[^5]
- A measure of how one probability distribution differs from a second, reference probability distribution.
- Mean Squared error $MSE\left(\hat{f}\right)$[^6]
- Lp-distance $l_p\left(\hat{f}\right)$ [^7]
- Mean integrated squared error $MISE\left(\hat{f}\right)$, i.e. integrated L2 distance. [^3]
### Curse of dimensionality
- IDEA:
- Local neighbourhoods in higher dimensions are almost surely empty
- The number of points needed to make a prediction with a given error grows exponentially with the dimension.
- Proof:
1. What is the side length $s$ of a smaller cube, inside a $d$-dimensional unit cube, that contains a proportion $p$ of the data? $$s(d,p) = p^{1/d}$$

2. Consider the ==multivariate standard normal distribution==. Intuition suggests ==most points== of the multivariate normal ==are close to the origin== and therefore ==have a relatively small norm==. But the ==distribution of the norms== of multivariate normal points ==follows a chi-distribution== with ==degree of freedom equal to the dimension== and therefore ==the modes grow with the square-root of the dimension==.
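
Both statements can be checked numerically; a small sketch (assuming the formulas above):

```python
import numpy as np

# Side length of a sub-cube of the d-dimensional unit cube containing a fraction p of the data
for d in (1, 10, 100):
    print(d, 0.01 ** (1 / d))          # s(d, 0.01): 0.01, ~0.63, ~0.95

# Norms of standard normal points follow a chi distribution; the typical norm grows like sqrt(d)
for d in (2, 10, 100):
    x = np.random.standard_normal((10000, d))
    print(d, np.linalg.norm(x, axis=1).mean(), np.sqrt(d))
```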
---
## 2. Anomaly Detection
_Responsible: Ondrej_
- Anomalies are data points $x\in X$ that are different from nominal points
> “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” (Hawkins 1980)
**Idea:** Data consists of a mixture between "nominal" and "anomaly" data.
**Approaches:**
- Labeled data (classification)
- Clean data (novelty detection)
- Mixed unlabeled data (outlier detection)
### Anomaly scoring and validation
Score:
- Distance from nominal points
- Probability of anomaly
- Ranking from most to least anomalous
Validation:
- Receiver operating characteristic (ROC) curve and the area under it (AUC); requires labels, i.e. supervised
- Plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
- $TPR = \frac{TP}{P}$
- $FPR = \frac{FP}{N}$
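
When labels are available, the ROC curve and AUC can be computed directly from the anomaly scores, e.g. with scikit-learn (toy data for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 1, 0, 1])               # 1 = anomaly
scores = np.array([0.1, 0.3, 0.2, 0.9, 0.4, 0.7])   # higher = more anomalous

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))
```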
### Density-based anomaly detection
1. Naive marginal density estimation
- **Assumption:** columns of data independent, so
- $f(x_1, \dots, x_d) = \prod_{i=1}^d f_i(x_i)$
- Use 1/density as an anomaly score
2. Mixture model anomaly detection
- log likelihood statistic as an anomaly score
- works better for multi-dimensional data
- must optimize hyperparameter (number of mixture components)
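
A minimal sketch of the naive marginal-density detector (per-column KDE, score monotone in 1/density), assuming independent columns as above; bandwidth and function name are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def marginal_density_scores(X, bandwidth=0.5):
    """Anomaly score based on the product of per-column density estimates."""
    log_density = np.zeros(len(X))
    for j in range(X.shape[1]):
        kde = KernelDensity(bandwidth=bandwidth).fit(X[:, [j]])
        log_density += kde.score_samples(X[:, [j]])   # log f_j(x_j)
    return -log_density                               # monotone in 1 / density

X = np.random.standard_normal((500, 3))
scores = marginal_density_scores(X)
print(X[np.argmax(scores)])                            # most anomalous point
```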
### Lightweight online detector of anomalies (LODA)
- Collection of $k$ one-dimensional histograms $\{h_i\}^k_{i=1}$.
- Each approximates the probability density of the input data projected onto a single projection vector $\{w_i \in \mathbb{R}^d\}^k_{i=1}$.
- Projection vectors diversify the individual histograms.
- LODA's output $f(x)$ on sample $x$ is the average of the logarithms of the probabilities estimated on the individual projections.
- **Random projections**
- Each random vector is created during initialization of the corresponding histogram:
- first randomly select $\lceil\sqrt{d}\,\rceil$ features to be non-zero,
- then draw the non-zero entries from $N(0, 1)$.
- In high-dimensional spaces LODA can be related to a PCA-based detector.
- Because projection vectors will be approximately orthogonal.
- **Advantages**
- Very fast and scalable (ensemble of simple models)
- Resistant to missing features (if a feature is missing, LODA can use only the histograms whose projection vectors have weight 0 for that feature)
- Online and updatable on the go. It uses unequal histogram bin widths (bins are wider in sparse regions).
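
A rough LODA sketch following the description above (batch histograms rather than the online version; the parameters are illustrative):

```python
import numpy as np

def loda_scores(X, k=100, bins=20, rng=np.random.default_rng(0)):
    n, d = X.shape
    nnz = max(1, int(round(np.sqrt(d))))           # ~sqrt(d) non-zero entries per projection
    scores = np.zeros(n)
    for _ in range(k):
        w = np.zeros(d)
        idx = rng.choice(d, size=nnz, replace=False)
        w[idx] = rng.standard_normal(nnz)          # sparse random projection vector
        z = X @ w
        hist, edges = np.histogram(z, bins=bins, density=True)
        which = np.clip(np.digitize(z, edges[1:-1]), 0, bins - 1)
        prob = np.maximum(hist[which], 1e-12)      # estimated density of each projected point
        scores += -np.log(prob)                    # negative log-probability
    return scores / k                              # average over projections; higher = more anomalous

X = np.vstack([np.random.standard_normal((500, 10)), 6 + np.random.standard_normal((5, 10))])
print(np.argsort(loda_scores(X))[-5:])             # indices with the highest anomaly scores
```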
### Local outlier factor (LOF)
- $k$-nearest neighbor and density-based outlier detection
- set of $k$ nearest neighbors
- $N_k(x) = \left\{ y \mid d(x, y) \le d(x, kNN(x)) \right\}$, where $kNN(x)$ is the $k$-th nearest neighbor of $x$
- $k$-reachability distance
- $R_k(x , y) = \max\left\{d(y, kNN(y)), d (x, y)\right\}$
- $k$-local reachability density of $x$
- $\mbox{lrd}_k(x) = \frac{1}{\left({\frac {\sum _{y\in N_{k}(x)}R_k(x,y)}{|N_{k}(x)|}}\right)}$
- $k$-local outlier factor
- ${\mbox{LOF}}_{k}(x) = \frac{\sum _{y\in N_{k}(x)}{\mbox{lrd}}_k(y)}{|N_{k}(x)|\,{\mbox{lrd}}_k(x)}$
- **Advantages**
- Local density measure
- A point at a given distance from a dense cluster (small within-cluster distances) is more anomalous than a point at the same distance from a sparse cluster (large within-cluster distances)
- **Disadvantage**
- The anomaly score is only local: scores around 1 indicate inliers, but there is no universal threshold and scores are hard to compare across datasets
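
scikit-learn implements LOF; a short usage sketch (the toy data and `n_neighbors` value are just examples):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.standard_normal((200, 2)), [[6, 6]]])

lof = LocalOutlierFactor(n_neighbors=20)          # outlier detection on the training data itself
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_            # higher = more anomalous
print(np.argmax(scores))                          # should be 200, the injected outlier
```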
### One-class support vector machine
- Typically, support vector machines are used for binary classification problems, where they divide the training sample into two categories linearly separated by as wide a gap as possible in a high-dimensional implicit feature space. However, they can also be used to separate a small region containing the normal class from the low-density anomaly regions.
- This can then be solved as usual using Lagrangian multipliers and the dual formulation similar to standard support vector machine.
- **Idea:** Anomalies are isolated points and can be separated from the other points in an implicit feature space.
- **Setup:**
- Data $X = x_1, \dots, x_n \in \mathbb{R^d}$.
- Feature map $\Phi: \mathbb{R}^d \to F$
- kernel function $k: \mathbb{R^d}\times \mathbb{R^d} \to \mathbb{R}$, with $k(x, y) = \langle \Phi(x) , \Phi(y) \rangle$.
- **Strategy:** Separate the data from the origin with maximum margin.
- **Result:** Decision boundary
- Kernel trick: no need to compute $\Phi$, we only need to compute $k$
- Separate the points from the origin $(0,0)$ in feature space
- (Figures: feature space for anomaly detection vs. feature space for classification.)
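
A short scikit-learn sketch of a one-class SVM with an RBF kernel ($\nu$, $\gamma$ and the toy data are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.standard_normal((500, 2))          # assumed mostly nominal data
X_test = np.array([[0.0, 0.0], [5.0, 5.0]])

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(ocsvm.predict(X_test))            # 1 = nominal, -1 = anomaly
print(ocsvm.decision_function(X_test))  # signed distance to the learned boundary
```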
### Isolation forest
- does not rely on a distance or density measure
- anomalies are very different from normal instances, i.e. isolated.
- basic building blocks are isolation trees: binary tree with random splits
- Ensemble of simple isolation trees
- **Isolation tree**
- A binary decision tree with limited depth, where at every step the split threshold is chosen at random on a randomly chosen feature, between that feature's min and max.
- Anomalous points are easy to separate, so on average they end up at a shallow depth (short path length)
- The anomaly score is the average depth (path length) of an example over the trees (lower == more anomalous)
- **Advantages**
- Very fast and scalable
- Very good performance
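
A short scikit-learn sketch; note that `score_samples` returns a negated score, so lower values mean more anomalous (toy data is illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.vstack([np.random.standard_normal((500, 2)), [[8, 8]]])

iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
print(iforest.predict([[0, 0], [8, 8]]))        # 1 = nominal, -1 = anomaly
print(iforest.score_samples([[0, 0], [8, 8]]))  # lower = shorter average path = more anomalous
```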
### Representation-learning based approaches
- Autoencoders. *Probably possible with other representation-learning methods as well, but I'm not sure.*
- Because we don't have many anomalies and they are different from all the other examples, representation-learning methods tend to learn a representation of the more common, i.e. non-anomalous, points
- Anomaly score is reconstruction error
### Anomaly detection benchmarks
Measures
- quality metrics
- training and inference performance metrics
- dependency on problem dimensions for representative problems
To produce good anomaly detection benchmarks, datasets have to vary in their construction across several dimensions
that are important to real-world applications:
1. point difficulty (pd)
- How difficult it is to distinguish an anomalous point from the nominal points
2. relative frequency of anomalies (rf)
- Percentage of anomalies in the data
3. clusteredness of anomalies (cl)
- If anomalous points are forming clusters of some kind
4. irrelevance of features (fi)
- A feature is not useful for anomaly detection (noise, uniform, etc.)
Standard methods to create anomaly detection benchmark datasets are
- take data from a specific application area,
- create synthetic data, or
- treat some classes in a supervised classification problem as anomalies.
Each of these approaches has some problems. Data from application areas are often not publicly available and are difficult to find. It is difficult to generate synthetic data that does not underestimate the complexity of real-world data. On the other hand, it is difficult to test all problem dimensions without creating synthetic data. The classes in datasets for supervised classification problems do not have the properties we expect from anomalies.
The most important paradigm to construct benchmark datasets for anomaly detection is that we should generate data by
combining two different real-world data generating processes.
**A note on time series**
- $y_{t} = L + T_{t} + S_{t} + N_{t}$
- Level $L$: The average value in the time series.
- Trend $T_t$: The increasing or decreasing slope in the time series.
- Seasonality $S_t$: The repeating cycle in the time series.
- Noise $N_t$: The random variation in the time series
Decompose before analysis!
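
For example, with statsmodels one can do an additive decomposition before analysing the residual (noise) component for anomalies; the series and the period used below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(240)
y = 10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 0.3, 240)
series = pd.Series(y, index=pd.date_range("2000-01", periods=240, freq="MS"))

result = seasonal_decompose(series, model="additive", period=12)   # L + T_t + S_t + N_t
residual = result.resid.dropna()                 # N_t: analyse this for anomalies
print(residual.abs().idxmax())                   # time stamp with the largest residual
```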
---
## 3. Cluster Analysis
_Responsible: Håkon and Stian_
#### What is a cluster:
- Clusters are *homogeneous groups*
- *Hard vs soft(fuzzy)* type of clustering.
- Similarity / dissimilarity measures.
- Common dissimilarity measures:
- Euclidean distance
- squared Euclidean distance
- Minkowski distance
- Angular distance
- Mahalanobis distance
- Hamming distance
- Jaccard dissimilarity
- Canberra distance
- Hausdorff distance
- Edit distance
- symmetrized Kullback-Leibler divergence
- Clustering models, see below.
### Hill Climbing
General idea: estimate a density, move each starting point uphill (towards higher density) until it reaches a mode, and group together the points that converge to the same mode. The variants below differ in their initial points, density estimator, update step, stop criterion, and post-processing.
### Variants of Hill Climbing
#### MeanShift
Initial points: **Random**
Density estimator: **KDE**
Update step: **Weighted mean**
Stop criterion: **Zero gradient**
Post-processing: **Get rid of saddles. Merge nodes with dist < low**
<b style="color:red">TODO: Add more info/illustration</b>
#### MedianShift
Initial points: **Representative points (medians of LSH (Locality-sensitive hashing) collisions)**
Density estimator: **Depth**
Update step: **Weighted median**
Stop criterion: **Median does not change**
Post-processing: **None**
<b style="color:red">TODO: Add more info/illustration</b>
#### QuickShift++
Initial points: **All points**
Density estimator: **k-NN**
Update step: **Closest point with higher density**
Stop criterion: **1 iteration**
Post-processing: **Combine points whose path end in cluster core**
<b style="color:red">TODO: Add more info/illustration</b>
#### ToMaTo
Initial points: **All points**
Density estimator: **Any**
Update step: **kNN with highest density**
Stop criterion: **1 iteration**
Post-processing: **Combine clusters with low prominence**
<b style="color:red">TODO: Add more info/illustration</b>
### k-means
### Hierarchical clustering, including single, complete, average, and Ward linkage
### Spectral clustering
HOW:
1. Transform data into a **similarity graph**
2. Calculate graph Laplacian $L = D-A$
- $A$ is the graph's *affinity matrix*
- $D$ is the graph's *degree matrix*
3. Define a cut-type objective function
- Minimization problem $$\operatorname*{argmin}_B Cut(B, \bar{B}) = \operatorname*{argmin}_u u^TLu$$
- $L$ is symmetric, positive semi-definite
- The multiplicity of the eigenvalue $0$ of $L$ equals the number of connected components in the graph $G$
4. Solve it
- Lagrange multipliers $\implies L u =\lambda u$
- The smallest eigenvalue gives the minimum cut.
- *Normalize to avoid trivial cut.*
5. Take the eigenvectors of $L$ corresponding to the $k$ (number of clusters) smallest eigenvalues
6. Cluster using standard clustering algorithm (usually k-means)
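
A compact numpy sketch of steps 1-6 with an RBF similarity graph and the unnormalized Laplacian (scikit-learn's `SpectralClustering` wraps the same idea; $\gamma$ and the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k, gamma=1.0):
    A = rbf_kernel(X, gamma=gamma)                 # similarity graph (affinity matrix)
    np.fill_diagonal(A, 0)
    D = np.diag(A.sum(axis=1))                     # degree matrix
    L = D - A                                      # unnormalized graph Laplacian
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = eigenvectors[:, :k]                        # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

X = np.vstack([np.random.normal(0, 0.3, (50, 2)), np.random.normal(3, 0.3, (50, 2))])
print(spectral_clustering(X, k=2)[:5])
```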
### Density-based spatial clustering of applications with noise (DBSCAN)
### Hierarchical DBSCAN (HDBSCAN)
### Kleinberg impossibility theorem
Additional article on the subject: http://alexhwilliams.info/itsneuronalblog/2015/10/01/clustering2/
> For each $n\geq2$, there is no clustering function $f$ that satisfies scale-invariance, richness, and consistency. This holds where $d(x,y)=0 \implies x=y$
#### Clustering Properties
- **A clustering function $f$** takes a dissimilarity function $d$ on $X$ and returns a partition $C$ of $X$.
- **Richness**: $f$ can have any partition $C$ as a possible output.
- **Scale-invariance**: For any dissimilarity function $d$ and any $\alpha>0$, we have $f(d) =f(\alpha d)$.
- **Consistency**: Let the dissimilarity $d$ result in a clustering $f(d) = C$. Let $d'$ be a dissimilarity with the properties that for points $x$ and $y$ in the same cluster $d'(x, y) \le d(x, y)$ and for points $x$, $y$ from different clusters $d'(x, y) \ge d(x, y)$. The clustering function $f$ is consistent if $f(d') = C$.
### Validation
- All data inside a cluster should be highly similar (*intra-cluster*)
- Data between clusters should be highly dissimilar (*inter-cluster*)
#### Number of clusters (elbow / silhouette / gap)
#### External validation (confusion / pair counts / information theoretic)
#### Internal validation (silhouette / Dunn / Davies-Bouldin / SD)
#### Noise
### Håkons points about clustering
### Kmeans
- Takes hyperparameter $K$: the number of centers
- training step: (takes training data X)
- Pick initial cluster centers (randomly, etc.)
- For each data point, compute the distance to all cluster centers (using Euclidean distance, etc.)
- Assign each data point to the closest cluster center
- Compute new cluster centers: the average of all points assigned to each cluster.
- Repeat for $i$ iterations or until convergence.
Note: the final clusters depend on the initial centers picked.
- Classification step: (takes input data X with the same number of features as the training data)
- For each data point compute the distance to the cluster centers
- For each data point pick the cluster center $c$ with minimum distance and give the data point the corresponding label $c$ (a minimal numpy sketch of both steps follows below)
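
A minimal numpy sketch of the training and classification steps described above (toy data; initial centers picked from the data):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), size=k, replace=False)]       # pick initial centers from the data
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # assign points to the closest center
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):                     # converged
            break
        centers = new_centers
    return centers, labels

def predict(X, centers):
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)

X = np.vstack([np.random.normal(0, 0.5, (100, 2)), np.random.normal(4, 0.5, (100, 2))])
centers, labels = kmeans(X, k=2)
print(predict(np.array([[0.0, 0.0], [4.0, 4.0]]), centers))
```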
### GMM(Gaussian Mixture Model)
A Gaussian mixture model is a generalization of KMeans: here we have a model that is composed of many Gaussian functions. Each Gaussian has parameters $\mu$ (describing the center of the Gaussian) and $\Sigma$ (describing its variation); in addition we also have a strength (mixing) parameter $\phi$ per Gaussian (these always sum to 1). So to evaluate the GMM we simply sum all the Gaussians weighted by their strength parameters: $\sum_{j=1}^k \phi_j \, \mathcal{N}(x; \mu_j, \Sigma_j)$
- Takes hyperparameters K: amount of centers
- training step: (takes training data X)
- Pick initial Gaussian centers, usually randomly from the domain of the data, randomly from the data points, or by using KMeans
- Pick initial Gaussian variances, usually the variance of the original data
- Pick initial strengths for the Gaussians: $\phi = \frac{1}{k}$
- Each iteration consists of two steps, expectation and maximization:
- Expectation: compute the responsibilities, i.e. for each point the posterior probability that it was generated by each component.
- Maximization: re-estimate $\phi$, $\mu$ and $\Sigma$ for each component from the responsibility-weighted data. (A numpy sketch of one EM iteration follows below.)
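
A numpy sketch of one EM iteration for a GMM, initialized as in the steps above (toy data; full covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, sigma):
    n, k = len(X), len(phi)
    # E-step: responsibilities r[i, j] = P(Z_i = j | x_i)
    r = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j]) for j in range(k)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update weights, means and covariances with responsibility-weighted data
    nk = r.sum(axis=0)
    phi = nk / n
    mu = (r.T @ X) / nk[:, None]
    sigma = [((r[:, j, None] * (X - mu[j])).T @ (X - mu[j])) / nk[j] for j in range(k)]
    return phi, mu, sigma

X = np.vstack([np.random.normal(0, 1, (200, 2)), np.random.normal(5, 1, (200, 2))])
k = 2
phi = np.full(k, 1 / k)                                   # initial strengths
mu = X[np.random.choice(len(X), k, replace=False)]        # initial centers from the data
sigma = [np.cov(X.T) for _ in range(k)]                   # initial covariance = data covariance
for _ in range(50):
    phi, mu, sigma = em_step(X, phi, mu, sigma)
print(np.round(mu, 2))
```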
### Hierarchical clustering
Hierarchical clustering differs from the other clustering algorithms in that it gives you a tree of clusters, so after running the algorithm you can choose how many clusters you want. It does this by building a dendrogram, which represents this clustering tree and contains information about how and when points are merged into clusters.
#### Divisive clustering
All points start out in the same cluster; find the separation of clusters that maximizes some criterion. This was not really covered in depth in class, so it's probably not important, but it works similarly to agglomerative clustering.
#### Agglomerative clustering
Opposite of divisive clustering: all points start out as their own cluster.
We then merge the two clusters that have the highest similarity score; this score can be one of many.
##### Different types of similarity scores
- *single linkage clustering:*
$$d_{SL}(G, H) = \min_{x \in G, y \in H} d(x, y)$$
- *complete linkage clustering:*
$$d_{CL}(G, H) = \max_{x \in G, y \in H} d(x, y)$$
- *average linkage clustering:*
$$d_{AL}(G, H) = \frac{1}{|G|\,|H|} \sum_{x \in G} \sum_{y \in H} d(x, y)$$
- *Ward linkage clustering:*
$$ d_{W}(G, H) = \frac{|G|\,|H|}{|G|+|H|}\, d_e(m_G, m_H)^2, $$
where $m_G$ and $m_H$ are the centroids of $G$ and $H$ and $d_e$ is the Euclidean distance; this equals the increase in within-cluster sum of squares caused by merging $G$ and $H$.
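
The linkages above are available directly in scipy; a short usage sketch with toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.vstack([np.random.normal(0, 0.5, (30, 2)), np.random.normal(4, 0.5, (30, 2))])

Z = linkage(X, method="ward")                     # also: "single", "complete", "average"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels[:5], labels[-5:])
# dendrogram(Z) draws the clustering tree (with matplotlib)
```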
### Kleinberg's impossibility
States three properties for a desired clustering.
Scaling: Clustering should remain the same after scaling the data

Consistency: The clustering should remain the same if intra-cluster distances are decreased and inter-cluster distances are increased (see the formal definition above)

Richness: The clustering function should be able to produce every possible clustering of the data, for some choice of distance measure

Kleinberg proves that you can achieve at most two of these properties at the same time, or relaxed versions of all three.

### Cluster validation
Methods for measuring how good or bad your clustering is. Used, for example, to choose the number of clusters or the stopping criterion for your clustering.
#### Internal validation
##### WSS elbow method
Compute the within-cluster sum of squares (WSS) for all clusters, take the average over clusters, and plot it against the number of clusters $k$; a bend (elbow) in this plot corresponds to a good value for $k$.
Does not take cluster separation into account.
Validation without external labels
- average silhouette score: take into account distance between clusters as well as distances within a cluster
For a point $x$, with $x \in C_x$, its silhouette $s(x)$ is
$$ a(x) = \frac{1}{|C_x|} \sum_{y \in C_x} d(x, y) $$
$$ b(x) = \min_{C \neq C_x} \left\{ \frac{1}{|C|} \sum_{y \in C} d(x, y) \right\}$$
$$ s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} $$
$a$ is a cohesion measure and $b$ is a separation measure


^ from lecture slides
The average silhouette is just the average silhouette over all datapoints.
Plot this against the number of clusters or some other clustering hyperparameter;
the location of the maximum indicates the appropriate number of clusters.
- Gap statistic: Compares the total within cluster variation with their expected values under a null reference distribution of the data.
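
A short sketch computing the WSS (elbow) curve and the average silhouette for a range of $k$ with scikit-learn (toy data with 3 true clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.normal(i * 4, 0.6, (60, 2)) for i in range(3)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    wss = km.inertia_                          # within-cluster sum of squares (elbow plot)
    sil = silhouette_score(X, km.labels_)      # average silhouette; its maximum indicates a good k
    print(k, round(wss, 1), round(sil, 3))
```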
#### External validation
Validation with external labels
##### Confusion based approaches:
Count the number of points placed in the right groups.
##### Pair count approaches
- $N_{11}$: Number of pairs of points that are in the same cluster under both $\mathcal{C}$ and $\mathcal{C}'$
- $N_{00}$: Number of pairs of points that are in different clusters under both $\mathcal{C}$ and $\mathcal{C}'$
- $N_{10}$: Number of pairs of points that are in the same cluster under $\mathcal{C}$, but in different clusters in $\mathcal{C}'$
- $N_{01}$: Number of pairs of points that are in different clusters under $\mathcal{C}$, but in the same cluster in $\mathcal{C}'$
- Jaccard index
$$ J(\mathcal{C}, \mathcal{C}') = \frac{N_{11}}{N_{11} + N_{01} + N_{10}}$$
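
A small numpy sketch that computes the pair counts and the Jaccard index for two labelings (function name is my own):

```python
import numpy as np
from itertools import combinations

def pair_counts(labels_a, labels_b):
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        n11 += same_a and same_b              # same cluster in both
        n00 += (not same_a) and (not same_b)  # different clusters in both
        n10 += same_a and not same_b
        n01 += (not same_a) and same_b
    return n11, n00, n10, n01

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([0, 0, 1, 2, 2, 2])
n11, n00, n10, n01 = pair_counts(a, b)
print(n11 / (n11 + n01 + n10))                # Jaccard index of the two clusterings
```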
---
## 4. Representation Learning
_Responsible: Jonas_
### Self-organizing maps
- Combined manifold learning and clustering
- Neural network
- Locally similar to kmeans
- Connect cluster centers to a low-dimensional discrete output space
Algorithm:
- Initialize centers $m_i$ in original space (using e.g. PCA), corresponding to $l_i$ on the grid.
- For each point $x \in X$:
- Find the closest center $m_j$
- For all $k$, update $m_k = m_k + \alpha h( \Vert l_k - l_j \Vert)(x - m_k)$
Pros:
- Find clusters in large amounts of data (dimensionality reduction)
- Combine diverse datasets to find patterns
- Powerful visualizations
Cons:
- Might be computationally heavy with lots of data.
TL;DR;
- 2-layer NN where the second layer corresponds to the cluster centers ($w$). On top of this layer there is a one- or two-dimensional grid. This grid is used for SOM training.
- SOM without the grid is basically k-Means.
- During training, for each training example $x$ find the closest center $w$. This center is updated, so it moves "closer". All centers adjacent to it in the grid are updated as well. In this way we can find the "shape" of the data.
**Validation**
- Reconstruction error $\sum_{i=1}^{n} \Vert x_i - m(x_i) \Vert$
- Reconstruction error is always larger than for k-Means (because SOM is a constrained version of K-means)
- Should not be too much higher.
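
A minimal numpy sketch of the SOM update rule above, with a 1-D grid and a Gaussian neighborhood function (grid size, learning rate and neighborhood width are illustrative choices):

```python
import numpy as np

def train_som(X, grid_size=10, epochs=20, alpha=0.3, width=2.0, rng=np.random.default_rng(0)):
    grid = np.arange(grid_size)                               # 1-D output grid l_1, ..., l_m
    centers = X[rng.choice(len(X), grid_size, replace=False)].astype(float)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            j = np.linalg.norm(centers - x, axis=1).argmin()              # closest center m_j
            h = np.exp(-((grid - grid[j]) ** 2) / (2 * width ** 2))       # neighborhood on the grid
            centers += alpha * h[:, None] * (x - centers)                 # m_k += alpha h(||l_k - l_j||)(x - m_k)
    return centers

X = np.random.standard_normal((500, 3))
centers = train_som(X)
reconstruction_error = sum(np.linalg.norm(centers - x, axis=1).min() for x in X)
print(reconstruction_error)
```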
### Matrix factorization
- **Setup**: Data $X \in \mathbb{R}^{N\times D}$
- **Idea**: $X \approx UV$, with $U \in \mathbb{R}^{N\times d}$ and $V \in \mathbb{R}^{d\times D}$
- **General matrix factorization model**:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} \operatorname*{Loss}\left(X_{ij}, (UV)_{ij}\right)$$ subject to some constraints
##### Classical principal component analysis (PCA)
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} \left(X_{ij} - (UV)_{ij} \right) ^2 $$
- Constraints: None
Classical PCA has a closed form solution and maximizes the variance for each principal component.
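
A minimal sketch of classical PCA via the SVD (equivalently `sklearn.decomposition.PCA`), with random toy data:

```python
import numpy as np

X = np.random.standard_normal((200, 10)) @ np.random.standard_normal((10, 10))
Xc = X - X.mean(axis=0)                      # center the data

# SVD gives the closed-form solution: X ≈ U V with U = scores, V = principal directions
U_full, S, Vt = np.linalg.svd(Xc, full_matrices=False)
d = 2
scores = U_full[:, :d] * S[:d]               # N x d representation (U)
components = Vt[:d]                          # d x D loading matrix (V)
explained = S ** 2 / (S ** 2).sum()
print(explained[:d].sum())                   # fraction of variance kept by d components
```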
MNIST example using PCA:

PCA tries to minimize the lost variance or maximize the remaining variance after projecting the data. This is illustrated in the figure below.

#### Sparse PCA
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} \left(X_{ij} - (UV)_{ij} \right) ^2 $$
- Constraints: $\Vert U_k \Vert_2 = 1$ for all $k$ and $\Vert U \Vert_0 \le k$
Interpretation of constraints:
1. The first constraint specifies that each component $U_k$ is a unit vector.
2. In the second constraint, ${\displaystyle \left\Vert U\right\Vert _{0}}$ is the L0 norm of $U$, i.e. the number of its non-zero components. So the second constraint specifies that the number of non-zero components is at most $k$, which is typically an integer much smaller than the dimension $D$.
One usually tends to use sparse PCA when one would like to interpret the principal components. Sparse PCA's principal components are determined by only a few variables, as opposed to PCA which includes all of them; this might be better for visualization purposes, for example.
MNIST example using Sparse PCA:

#### Nonnegative matrix factorization (NMF)
For NMF we learned about two loss functions. The first one is the Frobenius loss (Euclidean loss):
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} \left(X_{ij} - (UV)_{ij} \right) ^2 $$
- Constraints: $U_{ij} \ge 0$ and $V_{ij} \ge 0$
And the second is the KL loss:
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} (UV)_{ij} - X_{ij} \log \left((UV)_{ij} \right) $$
- Constraints: $U_{ij} \ge 0$ and $V_{ij} \ge 0$
The KL-loss corresponds to a model in which data $X_{ij}$ has a Poisson distribution with mean $(UV)_{ij}$.
NMF has the constraint that all values in $U$ and $V$ have to be greater than or equal to zero. This is useful when you are working with data that cannot be negative in some sense (think image data).
MNIST example using NMF:

#### Kernel PCA (KPCA)
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{D} \left(\Phi(X_{ij}) - (UV)_{ij} \right) ^2 $$
- Constraints: None
- Idea: $$ k(x, y) = \Phi(x)^T \Phi(y) $$
Why do we need kernel PCA? PCA is linear and cannot capture non-linear structure in the data effectively. We add the kernel method to PCA to handle non-linear data better.
The kernel method implicitly maps the original data to a higher-dimensional space where the patterns may become linear; in other words, the data may become linearly separable in the new feature space. We use a kernel function (Gaussian, polynomial, etc.) to embed our data, and then apply PCA on the embedded data to get our final projection.
Kernel PCA is related to spectral clustering. For example, for the radial basis kernel $k(x,y)=\exp\{-\gamma\Vert x-y\Vert^2\}$, finding the largest eigenvalues of $K$ corresponds to finding the smallest eigenvalues of $I-K$. This is very similar to the Laplacian $L=D-K$.
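
A short scikit-learn sketch (the RBF kernel and its $\gamma$ here are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.standard_normal((300, 5))
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
X_embedded = kpca.fit_transform(X)           # nonlinear 2-D embedding
print(X_embedded.shape)
```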
MNIST example using KPCA (polynomial kernel):

#### Multidimensional scaling (MDS)
The MDS objective is usually minimized using gradient descent methods.
We learned two different versions of MDS, one without similarities and one with. The version with similarities is actually equivalent to principal component analysis if the similarities are inner products in a Euclidean space.
Without similarities:
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} \left(d(X_i, X_j) - d(U_i, U_j) \right) ^2 $$
- Constraints: None
With similarities (computing covariances):
- Loss / reconstruction error:
$$ \sum_{i=1}^{N} \sum_{j=1}^{N} \left( (X_i-\bar{X}_i)^T (X_j-\bar{X}_j) - (U_i-\bar{U}_i)^T (U_j-\bar{U}_j) \right) ^2 $$
- Constraints: None
MNIST example using MDS:

#### To summarize (matrix factorization)
To perform matrix factorization, we first have to select a lower dimension $d$, and then which loss function and which constraints we would like to apply. We have looked at a number of matrix factorization methods, and they all have their benefits and drawbacks.
#### Number of dimensions
To select the number of dimensions in matrix factorization methods, one either (a) uses 2-3 dimensions for visualization purposes or (b) has to test which number of dimensions works best for the particular use case. For instance, if we would like to find the best number of dimensions using PCA, we can reduce the data to a number of dimensions $d$ such that we explain 90% of the variance in the original data. This is a good rule of thumb, but it must be used with caution; like we said earlier, it depends on what data we are working with. Sometimes it is best to plot some value on the y-axis (e.g. for PCA: the singular values) against the number of dimensions on the x-axis and use the elbow rule to find the best number of dimensions.
### Neighborhood-graph-based methods
- **Setup**: Data $X \in \mathbb{R}^{N\times D}$
- **Goal**: Find $Y \in \mathbb{R}^{N\times d}$, such that locally $X$ and $Y$ 'behave similarly'.
With neighborhood-graph-based methods we don't factorize our data into two matrices, but rather try to find similarities locally and reconstruct the data using those similarities in a lower dimension.
For instance, let's say that we have data like so:

By using the local distances between the points we can construct a connected graph of our data:

We can then use this graph to further embed the data as we like, for instance.
#### Isomap
- **Graph**: k-nearest neighbors
- **Weights**: distance
- **Embedding**:
+ build complete graph weighted by shortest path length
+ adjacency matrix is a distance matrix
+ compute multidimensional scaling (MDS)
Isomap builds a graph similar to the one above and weights the edges by shortest path length. The resulting distance matrix is then fed into MDS to get the desired result.
MNIST example using Isomap:

#### Spectral embedding
- **Graph**: k-nearest neighbors
- **Weights**: similarities, e.g. kernel
- **Embedding**:
+ calculate Graph Laplacian
+ eigenvectors corresponding to the smallest eigenvalues (removing the smallest one)
Spectral embedding does essentially the same as spectral clustering, except for the last part (k-means run on the eigenvectors, etc.).
MNIST example using Spectral embedding:

#### Locally linear embedding (LLE)
- **Graph**: k-nearest neighbors of $x_i$ (not including $x_i$), called $N_i$.
- **Weights**: weights $w_{ij}$, minimizing
$$ \sum_{i=1}^{N} \left\Vert x_i - \sum_{j \in N_i} w_{ij}x_j \right\Vert^2, $$
subject to $\sum_{j \in N_i} w_{ij} = 1$ for all $i$ (=> lagrange multipliers)
- **Embedding**: $y \in \mathbb{R}^d$, minimizing
$$ \sum_{i=1}^{N} \left\Vert y_i - \sum_{j \in N_i} w_{ij}y_j \right\Vert^2,$$
subject to $\bar{Y} = 0$ and $Y^T Y = 1$ (=> lagrange multipliers & eigenvectors)
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.
MNIST example using LLE:

#### t-distributed stochastic neighborhood embedding (tSNE)
<iframe width="560" height="315" src="https://www.youtube.com/embed/NEaUSP4YerM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
t-SNE first builds a directed graph using an RBF kernel. Then it normalizes the weights such that the sum of outgoing edge weights is 1. Next, it converts the directed graph to an undirected graph by averaging, and normalizes so that all weights sum up to one.
- Given low-dimensional representation, build graph as described before, except
+ t-distribution kernel
+ fixed bandwidth
+ undirected (since bandwidth is fixed)
- Minimize reconstruction error $$ \sum_e w_h(e) \log\left(\frac{w_h(e)}{w_l(e)} \right)$$
Force directed layout:
- Attractive force: high-dimensional weights
- Repulsive force: normalization to 1
KL-divergence as an error measure:
- Large high-dimensional weights result in large low-dimensional weights
- Small high-dimensional weights result in any low-dimensional weights
What is important to take away is that t-SNE preserves local structure in the data.
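
A short scikit-learn sketch (the perplexity controls the effective neighborhood size and is an illustrative choice):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.standard_normal((500, 50))
X_embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
print(X_embedded.shape)        # (500, 2); only local structure should be interpreted
```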
MNIST example using t-SNE:

#### Uniform manifold approximation and projection (UMAP)
<iframe width="560" height="315" src="https://youtu.be/nq6iPZVUxZUf" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- **Graph**:
- k-nearest neighbor
- **Weights**:
- $\vec{w}(x, y) = \exp\left(-\max\{0, d(x, y) - d(x, nn(x))\}/\sigma\right)$
- $w(x, y) = \vec{w}(x, y) + \vec{w}(y, x) - \vec{w}(x, y)\vec{w}(y, x)$
- **Embedding**:
- Given a low-dimensional representation apply the same process to get a weighted graph, except
+ Manifold is euclidean space
- Minimize distance between graphs using cross-entropy
$$ \sum_{a \in A} \mu(a) \log\left(\frac{\mu(a)}{\nu(a)} \right) + (1-\mu(a)) \log\left(\frac{1-\mu(a)}{1-\nu(a)} \right)$$
- Attractive force (same as t-SNE): high-dimensional weights
- Repulsive force: global repulsive force
UMAP claims to preserve both local and most of the global structure in the data.
This means with t-SNE you cannot interpret the distance between clusters A and B at different ends of your plot. You cannot infer that these clusters are more dissimilar than A and C, where C is closer to A in the plot. But within cluster A, you can say that points close to each other are more similar objects than points at different ends of the cluster image.
With UMAP, you should be able to interpret both the distances between / positions of points and clusters.
MNIST example using UMAP:

### Autoencoders
- $X \subset \mathbb{R}^{N \times D}$
- Function $f \colon \mathbb{R}^D \to \mathbb{R}^d$
- Function $g \colon \mathbb{R}^d \to \mathbb{R}^D$
$$X \approx g(f(X)) $$
Autoencoders are composed of an encoding function $f$ and a decoding function $g$ which we would like to optimize such that the reconstruction error is minimized. One usually selects non-linear functions for $f$ and $g$, otherwise we would not learn anything beyond a linear projection of our data.
One can structure an autoencoder using a neural network, and they learn through backpropagation.
Such a neural network can be illustrated like so:

In the neural network above, we can see that we try to create a three dimensional representation of the original data.
Autoencoders can also be used to generate new data from existing data, as we will see in the next couple of sections.
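
A minimal PyTorch sketch of such an autoencoder (layer sizes and training details are illustrative, not the exact network from the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, D=784, d=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 784)                    # placeholder for e.g. flattened MNIST in [0, 1]
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)             # reconstruction error
    loss.backward()
    optimizer.step()
```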
#### Standard autoencoder
A (simple) standard autoencoder consists mainly of the input layer, encoded layer and decoded (lossy) layer.
Here we can see a 2D reconstruction of the MNIST data using a standard autoencoder:

While this reconstruction is not particularly interesting, we put our focus onto some reconstructed images using the decoding layer:

Here we can see that the autoencoder did not do very well, mostly because we have a too simple autoencoder.
If we increase the number of dimensions to 10 in the encoding layer, we get the following result which arguably is better:

#### Sparse autoencoder
- Loss $\mathcal{L}(X, g(f(X))) + \Omega(h)$, where $h$ is the hidden representation layer.
+ Typically $\Omega$ is the lasso-penalty $\Omega(h) = \lambda \sum_{j=1}^{d} \vert h_j \vert$
- Sparse autoencoders can have $d > D$
MNIST example using sparse autoencoders ($d=10$):

Comparing the result to the standard autoencoder with $d=10$, we see a slight improvement in the reconstructed images. This might be due to the fact that the images have several sparse areas (black pixels).
#### Denoising autoencoder
- Corrupt input $\tilde{X} \sim q(x \mid X)$ for some noise distribution $q$
- Minimize loss $\mathcal{L}(X, g(f(\tilde{X})))$
With a denoising autoencoder, we first apply some noise to our data (corrupting the input) and then try to minimize the loss between the original data and the reconstruction of the corrupted data. This makes our autoencoder robust to noise; it will most likely try to remove noise from the data points as much as possible in the reconstruction.
#### Contractive autoencoder
- Learn a function that is robust to slight variations of input values
- Loss $\mathcal{L}(X, g(f(X))) + \lambda \sum_{j=1}^{d} \Vert \nabla_{X}h_{j} \Vert^{2}$
A contractive autoencoder tries to avoid uninteresting solutions by adding an explicit term in the loss that penalizes such solutions. We wish to extract features that only reflect variations observed in the training set.
Both the denoising and contractive autoencoder perform well.
---
## 5. Other topics
_Responsible: Naphat_
### Gaussian mixture models (GMM), EM algorithm
- Used In
- density estimation
- Anomaly detection
- Cluster Analysis
- Representation learning
- Generative models
- Unsupervised Learning
- GOAL
- Given data $X\in \mathbb{R}^{n\times d}$ ($n$ samples, $d$ dimensions), approximate the distribution of the data $X$ by the GMM.
- GMM [^13] is a model with
1. A multinomially distributed latent variable ==$Z \in \{1\dots k\}^n$==, with mixture weights ==$\phi_i$ for $i \in \{1, \dots, k\}$==, and ==$\sum_{i=1}^{k}\phi_i = 1$==
2. ==$k$ multivariate normal distributions== [^14] with means ==$\mathbf{\mu}_i$== and covariances ==$\mathbf{\Sigma}_i$ for $i \in \{1, \dots, k\}$==.
- HOW
- EM algorithm
- Expectation–Maximization algorithm
- iterative method to find a maximum likelihood estimate
- E-Step
- Creates a function for the expectation of the log-likelihood [^15] evaluated using the current estimate for the parameters
- M-step
- Computes parameters maximizing the expected log-likelihood found on the E step
(Figure: the red curve corresponds to the observed-data log-likelihood.)
- QUESTIONS
- How many components should the model include?
- Use methods in clustering.
- How can we validate the results?
- Use cluster internal and external validation methods.
- How can we handle data that comes from different distributions?
- <b style="color:RED">TODO:explain!</b>
- How well do Gaussian mixture models work in high dimensions?
- Parameter estimation can be challenging because of the large number of parameters that need to be estimated.
- Using regularisation to enforce sparsity might help reduce the effective number of parameters. (L. Ruan 2011)
- How much data do we need to successfully use Gaussian mixture models?
- <b style="color:RED">TODO:explain!</b>
### Generative models
As the name suggests, generative models generate new data. More specifically, the models attempt to generate samples that are similar to the data they were trained on. Intuitively, one could estimate the probability distribution of the data and then simply sample new points from the estimated probability distribution. Formally, we want to generate a sample $x$ from a distribution $p_{model}$ which is similar to the distribution of the data; $p_{model} \approx p_{data}$. However, the methods that we are going to use sample without having an explicit model for the probability distribution.
##### [Variational autoencoders (VAEs)](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)
*Latent space in this context means the domain of the encoded data.*
To have some context, let's briefly sum up what autoencoders are. Autoencoders consist of an encoder and a decoder.

Say we have a trained autoencoder, then the encoded representation of whatever is fed to the encoder contains information such that the decoder can reconstruct the encoded information in a way that resembles the original.
Intuitively, if we wanted to generate new data, we could feed the decoder some values that are close to the encodings produced during training, hoping it produces something that looks like whatever it was trained on. More formally, we hope that the latent space is in a sense continuous (also called regular in this context). Unfortunately said method does not work very well. The problem is that the distribution of "valid" inputs (inputs that will give some reasonable decoding) to the decoder is not well organized in the latent space. We say that the latent space is irregular.

Simple autoencoders does not have any guarantees that the latent space should be regular. After all, the training process is to **encode and decode with as minimal loss as possible, regardless of how the encodings are organized in the latent space**.

The method used to get the wanted property of the latent space is to further constrain the training process as well with some modifications. The encoder of a VAE encodes an input to a distribution over the latent space, opposed to a point (the motivation behind this will be covered below). The model will then be trained as follows:
- Encode batch of inputs to distributions over the latent space.
- Sample a point from each distribution.
- Decode to original space, then calculate loss.
- Update network by backpropagation.

The motivation behind encoding into distributions instead of points is that it forces the encoder to designate whole "fuzzy" areas in the latent space for different classes of inputs. Of course, if one would only use MSE as a loss function, the neural network can still just place the distributions really far apart and make them have a very small covariance, effectively just encoding into points again which does not guarantee the wanted properties. A solution to force the network to give us the continuous properties in the latent space is to consider the latent distributions in the loss function. In practice we add a measure of how different a latent distribution is from $\mathcal{N}(0,1)$ to the loss function. The measure used for our course is the Kullback-Leibler divergence, which in conjunction with the MSE reconstruction loss gives the complete loss function:
$$\mathcal{L} = \Vert x-\hat{x}\Vert^2 + KL[\mathcal{N}(\mu_x, \Sigma_x), \ \mathcal{N}(0,I)]$$
High values of the KL-divergence between distributions $p$ and $q$ mean a large difference between the distributions. The latent space of a VAE will then be "continuous" and have the desired properties.

The general architecture for a VAE is as follows:

###### Reparameterization trick
The encoder of a VAE returns a probability distribution (i.e. its parameters). A point is then sampled from said distribution and passed on to the decoder. The problem with that is: how do you do backpropagation when there is a random sampling step in the network? The solution is called the *reparameterization trick*, which is a simple alternative way to sample from a Gaussian distribution. Let $(\mu_x, \Sigma_x)$ be the parameters for some encoded distribution, then we can sample from said distribution by:
$$z = \Sigma_x \Phi + \mu_x \qquad \Phi \sim \mathcal{N}(0,I)$$

The reason this makes backpropagation possible opposed to the naïve way is that backpropagation requires taking the derivative with respect to the parameters, which are $\mu_x$ and $\Sigma_x$. So a more specific schema of a VAE is:
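
In code the trick is just a deterministic transformation of standard normal noise; a small PyTorch sketch (assuming the encoder outputs a mean and a log-variance per latent dimension):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) in a way that keeps gradients w.r.t. mu and log_var."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)          # noise from N(0, I); no gradient needed here
    return mu + sigma * eps

mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                          # gradients flow back to mu and log_var
print(mu.grad.shape, log_var.grad.shape)
```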

###### Visual example of latent space
Here we can see the resulting points from encoding unseen images from MNIST using a trained encoder. Observe that the data is centered around 0.

Here are images sampled from the same domain as the points above.

##### Generative adversarial models (GAN)
Another way to generate new data is to use GANs. Generally, a GAN is a structure that consists of two neural networks. One network generates data, while the other tries to tell whether its input is a real input or generated input. Their respective names are the generator and the discriminator.

###### Discriminator
- Standard binary classifier
- Input: Real or generated data
- Output: Label indicating whether input was real or generated
###### Generator
- Input:
- Random point, often sampled from uniform distribution
- Dimension of input is usually less than the dimension of the data
- Output: Generated data
###### Alternating training
The idea of training the networks is simple. Simply alternate the training, for example let the discriminator train for an epoch, then let the generator train for an epoch and so on.
###### Loss functions
**Minimax losses**:
- Discriminator loss:
$$ E_x \left[ \log(D(x)) \right] + E_z \left[ \log(1-D(G(z))) \right] $$
- Generator loss:
$$ E_z \left[ \log(1-D(G(z))) \right] $$
- Alternative generator loss:
$$ - E_z \left[ \log(D(G(z))) \right] $$
- $E_x$ is the expected value over all real data instances.
- $E_z$ is the expected value over all noise instances.
- Discriminator loss is binary cross-entropy.
- The generator can't directly affect the $\log(D(x))$ term in the function, so, for the generator, minimizing the loss is equivalent to minimizing $\log(1 - D(G(z)))$.
###### Complications
- Train both generator and discriminator:
+ Discriminating is the easier task, so the discriminator can overpower the generator, with a risk of vanishing generator gradients.
+ Identify convergence: When do we know that the generator has reached its max potential?
+ Mode collapse: Generator learns one really good trick to fool discriminator and "overfits" to the good trick. Once the discriminator manages to overcome the trick, the generator is kind of stuck in a local optimum.
### Topological data analysis
#### Why topological data analysis?
The motivation behind why we want to do topological data analysis stems from the fact that it is hard to know which type of analysis we should perform on data. Let's say we have some data in 10 dimensions. How do we know what analysis to perform on it? Clustering? Regression? We cannot simply rely on dimensionality reduction techniques. Topological data analysis helps us answer that question.
For example, in the plots below, we could clearly see that in the first plot we should perform clustering and in the second we should perform some kind of regression analysis, but this is mainly because the data is in 2D and is easy to interpret.

Now, take a look at these plots below. Which type of analysis should we perform? It is simply not easy to tell, especially for the second plot.

#### Simplicies

In topology, we have something called simplices. They might remind us of vertices and edges from graph theory because they behave somewhat similarly. Collections of simplices form a simplicial complex, which in the one-dimensional case would be a graph with vertices and edges.
A 0-simplex is simply a point. If we connect two 0-simplices we get a 1-simplex. So far, nothing special.
Now, for the 2-simplex we add one more vertex to the "graph" and fill it in. If we can fill the triangle, we call it a 2-simplex. A 3-simplex behaves similarly; we add one more 0-simplex to the 2-simplex and fill the new part. We also have 4-, 5-, ..., n-simplices which follow the same structure.
#### Simplicial complex
> A family $K$ of non-empty subsets of $X$ is an *abstract simplicial complex*, if for every set $\sigma \in K$, every non-empty subset $\tau$ of $\sigma$ is also in $K$.
A simplicial complex consists of many n-simplices.
$\sigma \in K \text{ and } \tau \subseteq \sigma \implies \tau \in K$

#### Filtered simplicial complexes

#### Covers

#### Čech and Rips complexes

#### Persistence diagrams and images
How to create a persistence diagram:

Then, creating a persistence image from a persistence diagram:

#### Semisupervised learning
Semisupervised learning is the case where you have a mix of unlabeled and labeled data. What we can do is approximate the labels of the unlabeled data using the labeled ones.
##### Laplace regularization
https://www.futurelearn.com/courses/advanced-machine-learning/0/steps/49579
Most of this text is a shameless copy-paste from lecture slides.
Given a dataset $X = X_1, \ldots, X_n, \ldots, X_N$, where you have the labels $y_1,\ldots, y_n$, then the goal is to estimate a function $f$ that minimizes
$$\frac{1}{n} \sum_{i=1}^n Loss(f(X_i), y_i) + \gamma ||f||_K^2, $$
for some $\gamma > 0$ and some norm $||\cdot||_K$
We want to use a fancier norm called the Laplacian norm. This method assumes that the data comes from a manifold $M$, that the manifold has a gradient operator $\nabla_M$, and that the data is distributed according to a probability density $P_X$. The Laplacian norm is then as follows:
$$||f||^2_I = \int_M ||\nabla_M f(x)||^2 dP_X(x)$$
Then by Stoke's theorem,
$$||f \Vert^2_I = \int_M ||\nabla_M f(x) \Vert^2 dP_X(x) = \int_M f(x) \Delta_M f(x) dP_X(x)$$
But we still can't compute this. So we use graph Laplacian as a proxy measure instead.
> Theorem: Let the $k$ data points $X = (x_1, \dots, x_k)$ be sampled from the uniform distribution over the embedded $d$-dimensional manifold $M$. Let $\varepsilon = k^{-\alpha}$ for $0 < \alpha < \frac{1}{d+2}$. Then for all $f \in C^{\infty}$ and $x \in X$, there is a constant $C$, such that
$$\lim_{k \to \infty} C \frac{\varepsilon^{\frac{d+2}{2}}}{k} \mathcal{L}(f)(x) = \Delta_M f(x)$$
in probability, where $\mathcal{L}$ is the graph Laplacian. Somehow this shows that
$$||f||^2_I \approx f^T L f$$
- Minimize
$$\frac{1}{n} \sum_{i = 1}^{n} Loss(f(X_i), y_i) + \gamma (f(X_1), \dots, f(X_N))^T L (f(X_1), \dots, f(X_N)),$$
for some $\gamma > 0$
###### Representer theorem
Given the function
$$f(x) = \sum_{i=1}^n c_i K(x,x_i), $$
and MSE loss, the optimal solution is
$$ c = M^{-1} y, $$
for
$$ M = JK + \gamma \frac{n^2}{N^2} LK, $$
where
$J$ is the diagonal matrix with $n$ 1's and $N-n$ 0's on the diagonal and $y$ is $0$ for unlabeled data.
How I did it:
Given $X_{labeled}$, $y_{labeled}$ and $X_{unlabeled}$
1. Concatenate all labeled and unlabeled samples together such that the labeled points are on top of the array.
2. Let $n_{unlabeled}$ be the number of unlabeled samples.
3. Let $n_{labeled}$ be the number for labeled samples.
4. L = GraphLaplacian(X) (for example the same Graph Laplacian from Spectral Clustering)
5. K = pairwise_distances(X, X), just a pairwise distance matrix
6. K = $\exp (- K / \gamma_l)$, calculate weights
7. J = diagonal matrix where first $n_{labeled}$ points are one, rest is zero.
8. Y = vector where the first $n_{labeled}$ points are $y_{labeled}$.
9. $c = (JK + \gamma_l LK)^{-1} Y$
$c$ are parameters for the function $f(x) = \sum_{i=1}^n c_i K(x,x_i)$.
Then you use the function to "classify" the samples, which effectively assigns a label to each sample.
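
A numpy sketch of the steps above (RBF kernel, unnormalized graph Laplacian built from the same kernel; the hyperparameters and toy data are illustrative, and I use separate parameters for the kernel width and the regularization):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def laplacian_rls(X_labeled, y_labeled, X_unlabeled, gamma_l=1.0, gamma=0.1):
    X = np.vstack([X_labeled, X_unlabeled])            # labeled points on top
    n, N = len(X_labeled), len(X)

    D2 = pairwise_distances(X, X) ** 2
    K = np.exp(-D2 / gamma_l)                          # kernel / weight matrix
    L = np.diag(K.sum(axis=1)) - K                     # graph Laplacian (as in spectral clustering)

    J = np.diag(np.r_[np.ones(n), np.zeros(N - n)])    # 1 for labeled, 0 for unlabeled
    Y = np.r_[y_labeled, np.zeros(N - n)]              # y is 0 for unlabeled points

    c = np.linalg.solve(J @ K + gamma * L @ K, Y)      # c = (JK + gamma L K)^{-1} Y
    return K @ c                                       # f(x_i) = sum_j c_j K(x_i, x_j) for all points

X_l = np.array([[0.0, 0.0], [5.0, 5.0]])
y_l = np.array([-1.0, 1.0])
X_u = np.random.normal(0, 0.5, (20, 2))
f = laplacian_rls(X_l, y_l, X_u)
print(np.sign(f[:5]))                                   # predicted labels (sign of f)
```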
---
[^1]: $f_n(x) = \frac{d}{dx}F_n(x) = \frac{1}{n}\sum_{i=1}^n \delta(x-x_i)$, $\delta$ is dirac delta function.
[^2]: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^n I_{B_k}(x_i)=\frac{v_k}{nh}, x \in B_k$
[^3]: $MISE\left(\hat{f}\right) = E\left[\int\left(\hat{f}(x) - f(x)\right)^2\right]dx$
[^4]: $\hat{J}\left(\hat{f}\right) = \int\hat{f}(x)^2\,dx - \frac{2}{n}\sum_{i=1}^n\hat{f}\left(X_i^{val}\right)$
[^5]: $D_{KL}\left(\hat{f},f\right) = \int\hat{f}(x)\log\left(\frac{\hat{f}(x)}{f(x)}\right)dx$
[^6]: $MSE\left(\hat{f}(x)\right) = E\left[\hat{f}(x) - f(x)\right]^2 = Var\left(\hat{f}(x)\right) + Bias^2\left(\hat{f}(x)\right)$
[^7]: $l_p(\hat{f}) = \left[\int|\hat{f}(x)-f(x)|^p dx\right]^{1/p} = ||\hat{f}-f||_p$
[^8]: $k = 1+\lceil \log_2(n) \rceil$
[^9]: $h = \frac{3.5 \hat{\sigma}}{n^{1/3}}$
[^10]: $h = \frac{2\mathop{IQR}}{n^{1/3}}$
[^11]: $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} K_h\left(x - X_i\right)$
[^12]: $\hat{f}'(x) = \frac{1}{nh^2} \sum_{i=1}^{n} K'\left(\frac{x - X_i}{h}\right)$
[^13]: $p(x) = \sum_{i=1}^k \phi_i\mathcal{N}(x|\mu_i,\Sigma_i)$
[^14]: $\mathcal{N}(\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^k |\Sigma|}}\, e^{\frac{-(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}}$, where $\mu \in \mathbb{R}^k$ and $\Sigma \in \mathbb{R}^{k\times k}$ must be positive-definite
[^15]: $\mathcal{l}(\phi, \mu, \Sigma) = \sum_{j=1}^{n} \log P(X_j \mid \phi, \mu, \Sigma) \\= \sum_{j=1}^{n} \log \sum_{i=1}^{K} P(X_j \mid Z_j, \mu_i, \Sigma_i) P(Z_j \mid \phi_i)$
---
# REFERENCE
L. Ruan & M. Yuan & H. Zou (2011) ["Regularized Parameter Estimation in High-Dimensional Gaussian Mixture Models"](http://www.columbia.edu/~my2550/papers/gmmlasso.final.pdf)