# UE Machine Learning: Unsupervised Techniques Exam 2021
###### tags: `Exam`
**Which statements about estimation theory are true?
Multiple answers are possible!**
- [ ] The Fisher information quantifies how much information the model parameter $w$ carries about the variance of the parametrized data distribution.
- [ ] The Cramér-Rao lower bound equals the Fisher information (matrix).
- [ ] Under certain reasonable regularity conditions, the Fisher information is the expected value of the derivative of the log-likelihood.
- [ ] Denoting with $\hat{w}$ the estimator of the true parameter $w$, the mean squared error can be written as the sum of variance and bias: $\mathrm{mse}(w,\hat{w}) = \mathrm{Var}(\hat{w}) + \mathrm{Bias}(\hat{w},w)$.
- [ ] Under certain reasonable regularity conditions, the Fisher information is the negative curvature of the log-likelihood.
- [x] An unbiased estimator is called efficient if its variance equals the Cramér-Rao lower bound.
- [x] An estimator is called unbiased if, in expectation over the data distribution, it yields the true parameter.
- [x] The Cramér-Rao lower bound is a lower bound for the variance of an unbiased estimator.
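A minimal NumPy sketch (assumptions: Gaussian data with known variance $\sigma^2$, estimating the mean; sample size and parameter values are arbitrary choices) illustrating the ticked statements: the sample mean is unbiased, and its variance attains the Cramér-Rao lower bound $1/I(w) = \sigma^2/n$, i.e. it is efficient.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, sigma, n, trials = 2.0, 1.5, 50, 20_000

# each row is one dataset of n i.i.d. samples; the estimator is the sample mean
estimates = rng.normal(w_true, sigma, size=(trials, n)).mean(axis=1)

fisher_info = n / sigma**2   # Fisher information about the mean in n samples
crlb = 1.0 / fisher_info     # Cramér-Rao lower bound on the variance of an unbiased estimator

print("bias    :", estimates.mean() - w_true)   # ~0: the sample mean is unbiased
print("variance:", estimates.var())             # ~CRLB: the sample mean is efficient
print("CRLB    :", crlb)
```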
---
**Which statements about Principal Component Analysis (PCA) are true?
We keep the notation of the exercise slides: X is the data matrix. U is the matrix of eigenvectors of $X^TX$.
Multiple answers are possible!**
- [x] PCA transforms data into a new coordinate system such that the data has the largest variance along the first coordinate (first principal component), the second largest variance along the second coordinate, and so on.
- [x] The first principal component corresponds to the largest eigenvalue in the eigendecomposition of the sample covariance matrix.
- [ ] If there are n data samples, each of which is d-dimensional, the matrix U has dimension n×d.
- [ ] The sample covariance matrix is orthogonal and positive-semidefinite.
- [x] The singular values from the singular value decomposition of the data matrix X are equal to the square root of the eigenvalues of the eigendecomposition of $X^TX$.
- [ ] The eigenvalues obtained from the eigendecomposition of $X^TX$ are all positive and sum up to 1.
- [x] PCA is commonly used to visualize data by downprojecting it to a smaller number of dimensions, keeping the most relevant information.
- [ ] The matrix U is a symmetric matrix.
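A minimal NumPy sketch in the slide notation (the toy data is my own choice; X is n×d and centered): it checks that the singular values of X equal the square roots of the eigenvalues of $X^TX$, that U is d×d (not n×d), and that the variance along the principal components is non-increasing.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5])
X -= X.mean(axis=0)                      # center the data

eigvals, U = np.linalg.eigh(X.T @ X)     # U is d x d, columns are eigenvectors of X^T X
eigvals, U = eigvals[::-1], U[:, ::-1]   # sort eigenvalues/eigenvectors in descending order

singular_values = np.linalg.svd(X, compute_uv=False)
print(np.allclose(singular_values, np.sqrt(eigvals)))   # True: sigma_i = sqrt(lambda_i)

Y = X @ U                                # data in the new coordinate system
print(Y.var(axis=0))                     # variances are non-increasing along the components
```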
---
**Mark the correct statements about kernel PCA below (multiple might be correct).**
- [ ] The kernel function k(⋅,⋅) corresponds to a cross product in feature space.
- [ ] Kernel PCA is a non-linear classification method.
- [x] Due to the kernel trick the feature map ϕ(⋅) need never be computed explicitly.
- [ ] Kernel PCA considers an eigenvalue problem in feature space.
---
**Mark the correct statements about independent component analysis (ICA) below (multiple might be correct).**
- [ ] The components of ICA are ranked according to their variance.
- [ ] Independent components are always orthogonal.
- [x] FastICA looks for non-Gaussianity in the data, which is measured via the kurtosis.
- [x] For ICA we need at least as many signals as components.
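A minimal sketch, assuming scikit-learn is available (the two synthetic sources and the mixing matrix are my own choices): two observed mixtures suffice to recover two independent components with FastICA, and the recovered components have clearly non-zero excess kurtosis, i.e. they are non-Gaussian.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)), np.sin(2 * t)]   # two non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                          # mixing matrix
X = S @ A.T                                         # two observed mixtures (>= number of components)

S_est = FastICA(n_components=2, random_state=0).fit_transform(X)

def excess_kurtosis(s):                             # FastICA exploits non-Gaussianity, e.g. via kurtosis
    s = (s - s.mean(axis=0)) / s.std(axis=0)
    return (s ** 4).mean(axis=0) - 3.0

print(excess_kurtosis(S_est))                       # clearly non-zero -> non-Gaussian components
```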
---
**Mark the correct statements about factor analysis (FA) below (multiple might be correct).**
- [ ] The model parameters of FA are the factor loadings U and the factors y themselves.
- [x] FA can be viewed as a decomposition of the covariance matrix of the data.
- [x] FA is usually solved using the expectation maximization algorithm.
- [ ] FA is usually solved using gradient descent on the likelihood function.
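A minimal sketch, assuming scikit-learn is available (toy data and dimensions are my own choices; sklearn's `FactorAnalysis` fits the model by iterative maximum likelihood): it checks that the fitted loadings $U$ and the diagonal noise covariance $\Psi$ reproduce the data covariance as $UU^T + \Psi$, i.e. FA as a covariance decomposition.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 2))                       # latent factors (not model parameters)
U_true = rng.normal(size=(5, 2))                     # true factor loadings
X = Z @ U_true.T + 0.1 * rng.normal(size=(1000, 5))  # observed data with diagonal noise

fa = FactorAnalysis(n_components=2).fit(X)
U, psi = fa.components_.T, fa.noise_variance_        # loadings (d x k) and diagonal noise Psi

cov_model = U @ U.T + np.diag(psi)                   # FA models Cov(X) as U U^T + Psi
cov_data = np.cov(X, rowvar=False)
print(np.abs(cov_model - cov_data).max())            # small -> covariance well reproduced
```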
---
**Tick the correct statements about scaling and projection methods (several may be correct). We use the notation from the slides and the lecture, i.e. we collect the original data points in a matrix X and the mapped (e.g. down-projected) points in a matrix Y.**
- [x] The t-distribution has heavier tails than the Gaussian distribution, thus in t-SNE the joint distributions $q_{ij}$ between $y_i$ and $y_j$ are computed using a t-distribution in order to alleviate the crowding problem.
- [x] Isomap tries to approximate the geodesic distance between two points on a manifold by constructing a neighborhood graph.
- [ ] In SNE and t-SNE, the perplexity term is computed using the entropy of the similarities $q_{j|i}$ of the mapped data points $y_i$ and $y_j$.
- [ ] In SNE and t-SNE, the objective is the mean squared distance between $x_i$ and $y_i$.
- [ ] Locally linear embedding and multidimensional scaling both are probabilistic methods that rely on minimizing the KL-divergence between the distributions of X and Y.
- [x] In SNE and t-SNE, the computation of the similarity $p_{j|i}$ between $x_i$ and $x_j$ relies on assuming a Gaussian distribution centered at $x_i$.
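A minimal sketch, assuming scikit-learn is available (the swiss-roll toy data and all parameter values are my own choices): both methods map the data X to 2-D coordinates Y; t-SNE builds Gaussian similarities $p_{j|i}$ in the original space and heavy-tailed $q_{ij}$ in the low-dimensional space, while Isomap approximates geodesic distances on a k-nearest-neighbour graph.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE, Isomap

X, _ = make_swiss_roll(n_samples=500, random_state=0)   # 3-D points on a 2-D manifold

# t-SNE: Gaussian similarities p_{j|i} around x_i, heavy-tailed t-distribution for the q_{ij}
Y_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Isomap: geodesic distances approximated on a k-nearest-neighbour graph, then classical MDS
Y_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(Y_tsne.shape, Y_iso.shape)                         # (500, 2) (500, 2)
```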
---
**Which statements about k-means and affinity propagation are correct?
Multiple answers are possible!**
- [x] The k-means objective is to minimize the within-cluster sum of squared deviations.
- [ ] In affinity propagation, the larger the preference, the smaller the number of cluster centers.
- [ ] In affinity propagation, the larger the damping factor, the faster the convergence.
- [x] In affinity propagation, cluster centers are data points.
- [x] In affinity propagation, the similarity between two points is typically chosen as their negative squared distance.
- [ ] k-means always converges to the global optimum.
- [ ] The parameter k in k-means is automatically chosen by the algorithm to best fulfill the optimization objective.
- [x] k-means is sensitive to initialization of the cluster centers.
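A minimal sketch, assuming scikit-learn is available (toy blobs and parameter values are my own choices): k-means needs k and an initialization up front (hence the `n_init` restarts), while affinity propagation selects exemplars that are actual data points and, by default, uses negative squared Euclidean distances as similarities.

```python
from sklearn.cluster import KMeans, AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)   # k is chosen by the user, not the algorithm
print("k-means inertia (within-cluster SSE):", km.inertia_)

ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)  # similarity = -||x_i - x_j||^2 by default
exemplars = X[ap.cluster_centers_indices_]                    # cluster centers are data points
print("affinity propagation found", len(exemplars), "clusters")
```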