---
disqus: hackmd
---
{%hackmd j3B-lCmVRBe5-hFP1kq-iA %}
# DSP Final Project
<p style="text-align:center;text-indent:0">Team members: b09902064 楊冠柏, b09902067 易冠廷</p>
<br>
<br>
<br>
<br>
<br>
<br>
## 1 Statement
This project and its report involve no double assignment and no work completed before this semester; all of the work was done within this semester.
## 2 Abstract
We read three papers related to matrix factorization mentioned in class, summarize their contents, and discuss them.
## 3 Introduction
The first paper [^1] discusses the use of matrix factorization techniques for creating recommender systems. The authors also discuss various extensions and variations of matrix factorization to improve the accuracy of recommendations.
The second paper [^2] describes a solution for track 1 of KDDCup 2011, which involves predicting music ratings for users based on their past ratings. The solution includes four key components: well-developed individual models, validation-set non-linear blending, test-set linear ensemble with offline RMSE, and post-processing techniques. The success of the solution is attributed to the integration of various types of information retrieved from the dataset, such as hidden factors, statistical features, and observations from the data collection system, through feature engineering and blending techniques.
The third paper [^3] proposes an optimization algorithm for online dictionary learning and proves its convergence. The authors then extend their algorithm to other matrix factorization and sparse coding problems. Finally, experiments show that their algorithm performs well compared with other optimization algorithms of the time.
We will first give a brief summary of each of these papers, and then discuss the matrix factorization techniques covered in the three papers along with some related topics.
<div style="page-break-after: always; break-after: page;"></div>
## 4 Paper Summarization
### 4.1 Matrix Factorization Techniques for Recommender Systems
This paper explores the use of matrix factorization techniques for building recommender systems. The authors introduce the two main approaches for creating these systems: content-based filtering and collaborative filtering. Content-based filtering involves creating profiles for each user and product, which can be used to match users with products based on their characteristics. Collaborative filtering, on the other hand, relies on past user behavior and relationships between users and products to identify new user-item associations. Matrix factorization, a technique used in collaborative filtering, has gained popularity in recent years due to its scalability, predictive accuracy, and ability to incorporate additional information such as implicit feedback.
```graphviz
digraph G {
graph [rankdir="LR"]
node [shape=box,fontname=Courier,fontsize=15,penwidth=2]
edge [style=solid,arrowsize=0.6,penwidth=2,arrowhead=none]
"Recommender\nSystems"->"Content-Based\nFiltering"
"Recommender\nSystems"->"Collaborative\nFiltering"->"Neighborhood\nMethods"
"Collaborative\nFiltering"->"Latent Factor\nModels"->"Matrix\nFactorization"
}
```
Matrix factorization is a technique used in latent factor models for creating recommender systems. It involves mapping users and items to a latent factor space, where the dot product of the vectors representing each user and item approximates the user's overall interest in the item:
$$
\hat{r}_{ui} = q_i^Tp_u
$$
This model is similar to singular value decomposition, but recent matrix factorization models have focused on directly modeling the observed ratings and using regularization to prevent overfitting. The constant that controls the amount of regularization is usually determined through cross-validation.
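Concretely, the factor vectors are fit by minimizing the regularized squared error over the set $\kappa$ of $(u,i)$ pairs for which the rating $r_{ui}$ is known:
$$
\min_{q^*,\,p^*} \sum_{(u,i)\in\kappa} \left( r_{ui} - q_i^Tp_u \right)^2 + \lambda\left( ||q_i||^2 + ||p_u||^2 \right)
$$
where $\lambda$ is the regularization constant mentioned above.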
The authors present two main approaches for minimizing the error in a matrix factorization model: stochastic gradient descent and alternating least squares (ALS). Stochastic gradient descent loops through all ratings in the training set, predicts the rating for each example, and then modifies the parameters based on the prediction error. This approach is relatively fast and easy to implement, but may not be as effective as ALS in some cases. ALS fixes one set of parameters while optimizing the other, and then alternates between the two; this ensures that each step decreases the error until convergence. ALS is useful when the system can use parallelization, or when it is based on implicit data, as it can efficiently handle dense training sets.
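As a concrete illustration of the SGD variant (our own sketch, not code from the paper), each observed rating produces an error $e_{ui}=r_{ui}-q_i^Tp_u$, and both factor vectors are moved against that error with learning rate `lr` and regularization `reg`:
```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=20, lr=0.005, reg=0.02, epochs=20, seed=0):
    """Train a plain matrix factorization model with SGD (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factor vectors p_u
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factor vectors q_i
    for _ in range(epochs):
        for u, i, r in ratings:                   # loop over observed ratings
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - qi @ pu                     # prediction error e_ui
            P[u] += lr * (err * qi - reg * pu)    # move p_u against the error
            Q[i] += lr * (err * pu - reg * qi)    # move q_i against the error
    return P, Q

# Usage: ratings is a list of (user_index, item_index, rating) triples.
# P, Q = mf_sgd(ratings, n_users, n_items)
# predicted_rating = Q[i] @ P[u]
```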
Next, the authors explain that biases can be added to the matrix factorization model to deal with systematic tendencies in ratings between users and items. These biases can be incorporated into the model by extending MF to include individual biases for each user and item. This allows each bias to explain only the part that is relevant to it, and the model can be learned by minimizing the squared error function. Moreover, additional sources of information, such as implicit feedback and known user attributes, can be used to improve the model's ability to make recommendations for users with few ratings, a problem known as the cold start problem. This is achieved by enhancing the user representation through the incorporation of these additional sources of information.
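With biases included, the prediction rule becomes
$$
\hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u
$$
where $\mu$ is the overall average rating and $b_i$, $b_u$ are the item and user biases.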
The authors also explain that temporal dynamics, the fact that preferences and popularity change over time, can be incorporated into the matrix factorization model by treating certain parameters, like the item biases, user biases, and user preferences, as functions of time. This allows the model to deal with the dynamic nature of user-item interactions and can significantly improve accuracy. Additionally, not all observations should be given equal weights, so attaching a confidence score to the estimated preference can be useful, based on the frequency of actions or recurring events.
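Written out, the time-aware model simply lets those terms vary with time $t$:
$$
\hat{r}_{ui}(t) = \mu + b_i(t) + b_u(t) + q_i^Tp_u(t)
$$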
### 4.2 A Linear Ensemble of Individual and Blended Models for Music Rating Prediction
In this paper, the team presents a solution for predicting music ratings for users based on their past ratings $\mathcal{D}$. They visualize their workflow as follows.
```graphviz
digraph G {
graph [rankdir="LR"]
node [shape=box,fontname=Courier,fontsize=15,penwidth=2]
edge [style=solid,arrowsize=0.6,penwidth=2]
"Data-Set"->"Individual\nModels"->"Test-Set\nBlending"->"Post-\nProcessing"
"Individual\nModels"->"Validation-Set\nBlending"
"Data-Set"->"Validation-Set\nBlending"->"Test-Set\nBlending"
}
```
For the individual models, the authors use various approaches for building collaborative filtering models using matrix factorization (MF), Restricted Boltzmann Machines (RBMs), k-Nearest Neighbors (k-NN), Probabilistic Latent Semantic Analysis (pLSA), and Probabilistic Principal Component Analysis (pPCA). The authors describe different variants and techniques for improving the performance of these models, and also discuss how to incorporate additional information, such as artist information and temporal information, into the models. The authors also mention the methods for training and optimizing these models, including stochastic gradient descent (SGD) and the Expectation Maximization (EM) algorithm. Finally, the authors discuss how these models can be combined in an ensemble model to improve performance. For example, they applied k-NN to the residuals of other models rather than the original ratings, and found that the resulting residual k-NN models improved the performance of the MF and RBM models significantly.
<div style="page-break-after: always; break-after: page;"></div>
|Model|Variants|
|:---:|--------|
|MF|Regularized MF, Biased MF, Binned Biased MF, Biased MF with Time Deviation, Compound SVD, Non-Negative MF, Kernelized MF, Ordinal Regression MF|
|RBM|Conditional RBMs (cRBMs), Factorized RBMs (fRBMs), Gaussian RBM (gRBM)|
|k-NN|Item-based k-NN, Time Decay k-NN, Rating Decay k-NN, Polynomial Transforms of Weights|
|pLSA|pLSA with Binary Ratings, pLSA with Soft Evidence|
|pPCA|Inversed pPCA|
|Regression|Linear/Ridge Regression, Neural Network, and Gradient Boosting Regression Trees|
In MF, they apply novel ideas to improve the MF models for this competition, such as Day/Hour Biased MF (adding a $\mu_{d,h}^{time}$ term), motivated by the observation that users may rate differently between weekdays and weekends. Additionally, they assign larger weights to newer examples when calculating the gradient during SGD, as newer examples may contain more relevant information for predicting ratings; both approaches improve the performance. In k-NN, the authors also propose two new correlation functions: Temporal Common User Support and Normalized Common Count. Moreover, they find that incorporating artist information from the taxonomy and emphasizing same-time neighbors can also improve the performance of k-NN.
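To illustrate the idea of weighting newer examples (our own sketch; the exact weighting function used by the authors is not reproduced here), one can scale each example's gradient by a recency weight:
```python
# Hypothetical recency weight: newer examples (smaller age) contribute more.
def example_weight(age_days, half_life=100.0):
    return 0.5 ** (age_days / half_life)   # exponential decay, half-life in days

# Inside the MF SGD loop sketched earlier, the updates would become:
#   w = example_weight(age_of_example_in_days)
#   P[u] += lr * w * (err * Q[i] - reg * P[u])
#   Q[i] += lr * w * (err * P[u] - reg * Q[i])
```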
To prevent overfitting, the pLSA model compresses the rating space to binary values and uses each original rating as soft evidence with a certain evidence level. In addition, the validation set can be emphasized in the M-step of EM in order to focus more on examples that are closer in time to the test set. Also, to make the model computationally feasible, the authors propose a sparse formulation that only considers the rated pairs during the E-step of the EM algorithm and skips the unrated pairs. Finally, the authors use supervised regression algorithms on features extracted from the data set and from other collaborative filtering models to make rating predictions. They use the random subspace method to reduce the memory consumption of the Neural Network and Gradient Boosting Regression Trees models, and find that the k-NN features work best for the regression models.
The validation-set blending part combines the predictions of multiple models trained on a validation set to produce a new, more accurate model. This is done through the use of supervised regression algorithms, such as AdaBoost.RT and Neural Networks, which take the predictions of the individual models as features and learn a regressor that combines them in a nonlinear way. The goal of this process is to introduce diversity and better performance into the final ensemble of models that will be used to make predictions on the test set. The authors also describe the use of the Binned Linear Regression (BLR) algorithm, which partitions the validation set into bins based on certain criteria and applies a local linear regression model to each bin. They also propose a technique called Multiple-feature BLR, which uses greedy search to locate suitable bins based on multiple features, and a technique called Multi-stage Regression, which merges the blended models through multiple stages of linear regression to further improve performance.
The test-set ensemble part combines the individual models and the validation-set-blended models using the test-set blending technique. This technique uses RMSE to estimate the linear blending weights, which are used to compute the ensemble prediction $\hat{r}$. The authors also developed a technique to estimate the RMSE of the blended model before actually submitting it to the leaderboard, allowing them to make offline choices on which models to include in the final ensemble.
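A minimal sketch of linear blending follows (our own simplification: it estimates the weights by least squares on a held-out set, whereas the paper derives them from test-set RMSE feedback; the function name `linear_blend` is ours):
```python
import numpy as np

def linear_blend(preds, targets):
    """Estimate linear blending weights by least squares on a held-out set.

    preds:   (n_examples, n_models) matrix, one column of predictions per model.
    targets: (n_examples,) true ratings.
    """
    w, *_ = np.linalg.lstsq(preds, targets, rcond=None)   # blending weights
    blended = preds @ w
    rmse = np.sqrt(np.mean((blended - targets) ** 2))     # offline RMSE estimate
    return w, rmse

# Usage:
# w, rmse = linear_blend(val_preds, val_ratings)   # fit weights offline
# test_prediction = test_preds @ w                 # blended ensemble prediction
```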
Finally, in post-processing, the authors apply two techniques to adjust their predictions based on observations of the data collection system. The first technique involves adjusting ratings given during a period when the rating system was a four-star system, by mapping the four-star rating to a value of 90 instead of the usual 100. The second technique involves setting the ratings of all tracks within an album to zero if the album is rated zero by a user, as this may have been done automatically by the data collection system. These techniques are expected to improve the Test1 RMSE slightly.
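Read literally, the two rules could be applied to each prediction roughly as follows (illustrative only; the flag and argument names are hypothetical, and the exact form of the four-star adjustment is our own reading):
```python
def postprocess(pred, in_four_star_period, album_rated_zero):
    """Apply the two post-processing rules to one prediction (illustrative sketch)."""
    if in_four_star_period:
        pred = min(pred, 90.0)   # top rating in the four-star period maps to 90, not 100
    if album_rated_zero:
        pred = 0.0               # tracks of a zero-rated album are set to zero
    return pred
```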
Overall, the authors concluded that the techniques of feature engineering and blending were important for their winning solution.
### 4.3 Online Learning for Matrix Factorization and Sparse Coding
In this paper, the authors proposed an online learning algorithm that focuses on sparse coding and large-scale matrix factorization, including dictionary learning, non-negative matrix factorization (NMF), and principal component analysis (PCA).
The authors first focused on the dictionary learning problem, and then extended their algorithm to the other problems, including sparse coding, NMF, and PCA. In brief, the dictionary learning problem aims to find a sparse approximation of a signal $\mathbf{x} \in \mathbb{R}^m$ over a dictionary $\mathbf{D} \in \mathbb{R}^{m\times k}$, where each column of $\mathbf{D}$ is called an *atom*; that is, dictionary learning selects some of the atoms in $\mathbf{D}$ and represents the signal $\mathbf{x}$ as a linear combination of those atoms.
The training of dictionary learning usually comes with a training set of signals $\mathbf{X}=[\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n]\in \mathbb{R}^{m\times n}$, and the goal is to optimize the empirical cost
$$
f_n(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n} l(\mathbf{x_i}, \mathbf{D})
$$
where
$$
l(\mathbf{x_i}, \mathbf{D}) = \min_{\mathbf{\alpha}\in \mathbb{R}^{k}} {\frac{1}{2} ||\mathbf{x_i}-\mathbf{D}\mathbf{\alpha}||^2_2 + \lambda||\mathbf{\alpha}||_1}
$$
In addition, to prevent $\mathbf{D}$ from having arbitrarily large values (which would in turn make $\mathbf{\alpha}$ arbitrarily small), the authors impose a constraint set $\mathcal{C}$ requiring every column of $\mathbf{D}$ to have an L2 norm less than or equal to one. The definition of $\mathcal{C}$ is shown below.
$$
\mathcal{C} = \{ \mathbf{D}\in \mathbb{R}^{m\times k}\ s.t.\ \forall j=1,..,k,\ \mathbf{d}_j^T\mathbf{d}_j \le 1 \}
$$
In the paper, the authors pointed out, following past literature, that one usually does not minimize the *empirical cost* $f_n(\mathbf{D})$ directly; instead, one wants to minimize the *expected cost* $f(\mathbf{D})=\mathbb{E}_{\mathbf{x}}[l(\mathbf{x}, \mathbf{D})] = \lim_{n\rightarrow\infty}{f_n(\mathbf{D})}$.
Hence, the authors proposed a stochastic-approximation online learning algorithm for the dictionary learning problem. The pseudocode of this algorithm, as provided by the authors, is shown below.
<figure><img width=600 src="https://i.imgur.com/nCK0guY.png"><figcaption></figcaption></figure>
<figure><img width=600 src="https://i.imgur.com/DyskBSZ.png"><figcaption></figcaption></figure>
In their algorithm, they assumed the training set signals are i.i.d. samples from a distribution $p(\mathbf{x})$. The goal of this algorithm is to minimize
$$
\hat{f}_t(\mathbf{D}) = \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2} ||\mathbf{x}_i-\mathbf{D}\mathbf{\alpha}_i||^2_2 + \lambda||\mathbf{\alpha}_i||_1 \right)
$$
Therefore, in each iteration the algorithm draws one element $\mathbf{x}_t$, fixes $\mathbf{D}_{t-1}$, computes $\mathbf{\alpha}_t$ using LARS (least angle regression), and updates $\mathbf{A}_t, \mathbf{B}_t$. Then the algorithm updates the dictionary $\mathbf{D}$ using Algorithm 2. After $T$ iterations, the algorithm returns the learned dictionary $\mathbf{D}_T$. In this algorithm, the dictionary $\mathbf{D}$ and the matrices $\mathbf{A}, \mathbf{B}$ are fixed and updated alternately. In addition, to avoid storing all past $\mathbf{x}_i$ and $\mathbf{\alpha}_i$, which is inefficient, the authors keep only the matrices $\mathbf{A}, \mathbf{B}$, which summarize all past information. It is also mentioned in the paper that this algorithm has the advantage of requiring no learning-rate tuning. Finally, the authors provide a proof that the algorithm converges to a stationary point.
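To make the procedure concrete, here is a compact sketch of the main loop (our own illustration of Algorithms 1 and 2; it uses scikit-learn's `Lasso` solver as a stand-in for LARS, and the function name and default parameters are ours):
```python
import numpy as np
from sklearn.linear_model import Lasso

def online_dictionary_learning(X, k=100, lam=0.1, n_iter=1000, seed=0):
    """Sketch of online dictionary learning.

    X: (m, n) matrix whose columns are the training signals x_i.
    Returns a dictionary D whose columns have L2 norm at most one.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)           # project columns onto the constraint set C
    A = np.zeros((k, k))                      # accumulates alpha_t alpha_t^T
    B = np.zeros((m, k))                      # accumulates x_t alpha_t^T
    # sklearn's Lasso minimizes (1/(2m))||x - D a||^2 + alpha_reg ||a||_1,
    # so alpha_reg = lam / m matches the paper's sparse coding objective.
    lasso = Lasso(alpha=lam / m, fit_intercept=False, max_iter=1000)
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]             # draw one signal x_t
        alpha = lasso.fit(D, x).coef_         # sparse coding step (LARS in the paper)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        for _ in range(10):                   # dictionary update: block coordinate descent
            for j in range(k):
                if A[j, j] < 1e-12:
                    continue                  # skip atoms that have not been used yet
                u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```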
After proposing the algorithm above, the authors further proposed several refinements, including handling fixed-size data sets, scaling past data, a mini-batch extension, and slowing down the first iterations. For fixed-size data sets, they proposed replacing the updates of $\mathbf{A},\mathbf{B}$ with $\mathbf{A}_t \leftarrow \mathbf{A}_{t-1} + \mathbf{\alpha}_t\mathbf{\alpha}_t^T - \mathbf{\alpha}_0\mathbf{\alpha}_0^T$ and $\mathbf{B}_t \leftarrow \mathbf{B}_{t-1} + \mathbf{x}_t\mathbf{\alpha}_t^T - \mathbf{x}_0\mathbf{\alpha}_0^T$ when the same $\mathbf{x}_i$ is drawn in both the $0$-th and $t$-th iterations: if the training set is finite, the same $\mathbf{x}$ may occur multiple times, so "old" information has to be replaced with new information. A similar idea applies to scaling past data, where the authors proposed a new parameter $\beta_t=(1-\frac{1}{t})^{\rho}$ that rescales $\mathbf{A}_{t-1}$ and $\mathbf{B}_{t-1}$ instead of using the previous values directly, so that the influence of past information can be controlled. In the mini-batch extension, the algorithm draws several training samples at a time and uses their mean to update, instead of a single element. Slowing down the first iterations prevents large update steps that would cause a large deviation from the initial dictionary.
All of the above focuses on dictionary learning, but the authors also extended their algorithm to matrix factorization. First, they proposed several alternative regularizers for $\mathbf{\alpha}$, such as Tikhonov regularization, the elastic net, and the group Lasso, as well as different constraint sets for $\mathbf{D}$, which are used when extending their algorithm to matrix factorization. The following are the objective functions to be minimized for non-negative matrix factorization (NMF), non-negative sparse coding (NNSC), and sparse principal component analysis (SPCA), respectively. Note that for the regular versions of matrix factorization and sparse coding, we can simply drop the non-negativity constraints. Since these objective functions are closely related to the dictionary learning objective, the algorithm above can be extended to these problems as well.
<div style="page-break-after: always; break-after: page;"></div>
NMF:
$$
\min_{\mathbf{D\in\mathcal{C}}, \mathbf{\alpha}\in\mathbb{R}^{k\times n}} \sum_{i=1}^{n} {\frac{1}{2} ||\mathbf{x}_i-\mathbf{D}\mathbf{\alpha}_i||^2_2}\ s.t.\ \mathbf{D}\ge 0, \forall i, \mathbf{\alpha}_i \ge 0
$$
NNSC:
$$
\min_{\mathbf{D\in\mathcal{C}}, \mathbf{\alpha}\in\mathbb{R}^{k\times n}} \sum_{i=1}^{n} ({\frac{1}{2} ||\mathbf{x}_i-\mathbf{D}\mathbf{\alpha}_i||^2_2} + \lambda\sum_{j=1}^{k} \mathbf{\alpha}_i[j]) \ s.t.\ \mathbf{D}\ge 0, \forall i, \mathbf{\alpha}_i \ge 0
$$
SPCA:
$$
\min_{\mathbf{\alpha}\in\mathbb{R}^{k\times n}} \sum_{i=1}^{n} ({\frac{1}{2} ||\mathbf{x}_i-\mathbf{D}\mathbf{\alpha}_i||^2_2} + \lambda||\mathbf{\alpha}_i||_1) \ s.t.\ \forall j=1,...,k,\ ||\mathbf{d}_j||_2^2 + \gamma ||\mathbf{d}_j||_1 \le 1
$$
The authors then implemented their algorithm and conducted experiments. For dictionary learning, they compared their online algorithm with a batch method and stochastic gradient descent (SGD): their algorithm converged faster and to a better objective than all three batch-method settings, and was very close to, or even slightly better than, SGD. In addition, they compared the performance on NMF and NNSC with some previous works, and the results showed that their algorithm outperformed those works in convergence speed.
Finally, they demonstrated their online dictionary learning algorithm on the task of removing text from an image and obtained very positive results. The authors therefore claimed that their algorithm can be applied to realistic, non-trivial tasks.
## 5 Discussion
### 5.1 Matrix factorization in 3 papers
Since our topic is matrix factorization, we must discuss how matrix factorization appears in all three papers. First, the first paper gives us a basic picture of matrix factorization and its use in recommender systems, including learning algorithms mentioned in class such as SGD and ALS, the addition of biases, and so on. The second paper shows us multiple variants of matrix factorization, their practical use in a competition, and how the authors adapted the models to the competition. As for the third paper, it not only provides a useful optimization algorithm that could help train matrix factorization models faster, but also allows us to better understand the relation between matrix factorization and other methods such as dictionary learning and PCA.
In general, these three papers help us understand matrix factorization much better, and also let us learn the engineering and mathematical foundations of the method.
### 5.2 Additional information we learned
From our summarization above, it is apparent that many of the methods mentioned in these papers were actually taught in our DSP lectures, including RBM, LSA, PCA, k-NN, EM, and SGD. Due to limited lecture time, we might not have been able to fully understand all of these useful methods during class, but by reading these three papers we could study them in more detail.
### 5.3 Combination of these papers
After reading all three papers, one combination we can think of is to take the optimization algorithm provided by the third paper and combine it with the matrix factorization variants used in the second paper. In the second paper, the authors use SGD for optimization; however, the experiments in the third paper show that the proposed algorithm achieves performance similar to SGD without the need to tune hyperparameters such as the learning rate. Hence, we think using their algorithm could not only reduce the complexity of training and implementation, but might also yield better performance.
## 6 Conclusion
All three of these papers provide a wealth of information beyond the basics of matrix factorization. In the first paper, we learn advanced techniques such as incorporating biases, using additional input sources, modeling temporal dynamics, and handling inputs with varying confidence levels. The second paper teaches us about diverse models and how to combine them using an ensemble method, as well as how to adjust the models to fit the desired dataset. The third paper delves into the relationship between matrix factorization and methods like sparse coding, dictionary learning, and PCA, and also covers the proof of convergence for the objective function and the details of designing and implementing algorithms.
## 7 Appendix
### 7.1 Teamwork
| Member | Contribution |
| :----: | ------------ |
| 楊冠柏 | Reading Paper 1 & 2, Report |
| 易冠廷 | Reading Paper 3, Report |
## Reference
[^1]: Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5197422
[^2]: A linear ensemble of individual and blended models for music rating prediction. In JMLR Workshop and Conference Proceedings (KDD Cup 2011), volume 18, 2011.
https://dl.acm.org/doi/pdf/10.5555/3000375.3000379
[^3]: J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.
https://arxiv.org/pdf/0908.0050.pdf