---
layout: global
title: SystemML Algorithms Reference - Factorization Machines
displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a>
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# 7. Factorization Machines

### Description

The Factorization Machine (FM) is a general predictor, like SVMs, but it is also able to estimate reliable parameters under very high sparsity. The FM models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parameterization instead of a dense parameterization as in SVMs.

## Core Model

##### 1.
Model Equation:

<img src="https://tex.s2cms.ru/svg/%20%5Chat%7By%7D(x)%20%3D%0Aw_0%20%2B%0A%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20w_i%20x_i%20%2B%0A%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20%5Csum_%7Bj%3Di%2B1%7D%5E%7Bn%7D%20%5Cleft%20%3Cv_i%2C%20v_j%20%5Cright%20%3E%20x_i%20x_j%0A" alt=" \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left &lt;v_i, v_j \right &gt; x_i x_j " />

where the model parameters that have to be estimated are:

<img src="https://tex.s2cms.ru/svg/%0A%20w_0%20%5Cin%20R%2C%0A%20W%20%20%20%5Cin%20R%5En%2C%0A%20V%20%20%20%5Cin%20R%5E%7Bn%20%5Ctimes%20k%7D%0A%20" alt=" w_0 \in R, W \in R^n, V \in R^{n \times k} " />

and <img src="https://tex.s2cms.ru/svg/%0A%20%20%20%5Cleft%20%3C%5Ccdot%2C%20%5Ccdot%20%5Cright%20%3E%0A%20" alt=" \left &lt;\cdot, \cdot \right &gt; " /> is the dot product of two vectors of size <img src="https://tex.s2cms.ru/svg/k" alt="k" />:

<img src="https://tex.s2cms.ru/svg/%0A%20%5Cleft%20%3Cv_i%2C%20v_j%20%5Cright%20%3E%20%3D%20%5Csum_%7Bf%3D1%7D%5E%7Bk%7D%20v_%7Bi%2Cf%7D%20%5Ccdot%20v_%7Bj%2Cf%7D%0A%20" alt=" \left &lt;v_i, v_j \right &gt; = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f} " />

A row <img src="https://tex.s2cms.ru/svg/%20v_i%20" alt=" v_i " /> in <img src="https://tex.s2cms.ru/svg/%20V%20" alt=" V " /> describes the <img src="https://tex.s2cms.ru/svg/i" alt="i" />th variable with <img src="https://tex.s2cms.ru/svg/k%20%5Cin%20N_0%5E%2B" alt="k \in N_0^+" /> factors. <img src="https://tex.s2cms.ru/svg/k" alt="k" /> is a hyperparameter that defines the dimensionality of the factorization.
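To make the model equation concrete, here is a small NumPy sketch (illustrative only; the function names are hypothetical and not part of the SystemML scripts). It evaluates the prediction both with the naive double sum over pairwise interactions and with the linear-time factorized form derived in the Computation subsection below, and checks that the two agree:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """FM model equation, evaluated term by term: O(k * n^2)."""
    n = len(x)
    y = w0 + w @ x                       # global bias + linear terms
    for i in range(n):
        for j in range(i + 1, n):        # pairwise interactions <v_i, v_j> x_i x_j
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

def fm_predict_linear(x, w0, w, V):
    """Equivalent O(k * n) form: 1/2 * sum_f ((sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2)."""
    s = V.T @ x                          # shape (k,): sum_i v_{i,f} x_i
    s2 = (V ** 2).T @ (x ** 2)           # shape (k,): sum_i v_{i,f}^2 x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
w0 = 0.5
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

# Both formulations compute the same prediction.
assert np.isclose(fm_predict_naive(x, w0, w, V), fm_predict_linear(x, w0, w, V))
```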
* <img src="https://tex.s2cms.ru/svg/%20w_0%20" alt=" w_0 " /> : global bias
* <img src="https://tex.s2cms.ru/svg/%20w_j%20" alt=" w_j " /> : models the strength of the <img src="https://tex.s2cms.ru/svg/j" alt="j" />th variable
* <img src="https://tex.s2cms.ru/svg/%20w_%7Bi%2Cj%7D%20%3D%20%5Cleft%20%3Cv_i%2C%20v_j%20%5Cright%3E%20" alt=" w_{i,j} = \left &lt;v_i, v_j \right&gt; " /> : models the interaction between the <img src="https://tex.s2cms.ru/svg/i" alt="i" />th & <img src="https://tex.s2cms.ru/svg/j" alt="j" />th variables.

Instead of using its own model parameter <img src="https://tex.s2cms.ru/svg/%20w_%7Bi%2Cj%7D%20%5Cin%20R%20" alt=" w_{i,j} \in R " /> for each interaction, the FM models the interaction by factorizing it.

##### 2. Expressiveness:

It is well known that for any positive definite matrix <img src="https://tex.s2cms.ru/svg/W" alt="W" />, there exists a matrix <img src="https://tex.s2cms.ru/svg/V" alt="V" /> such that <img src="https://tex.s2cms.ru/svg/W%20%3D%20V%20%5Ccdot%20V%5Et" alt="W = V \cdot V^t" />, provided that <img src="https://tex.s2cms.ru/svg/k" alt="k" /> is sufficiently large. This shows that an FM can express any interaction matrix <img src="https://tex.s2cms.ru/svg/W" alt="W" /> if <img src="https://tex.s2cms.ru/svg/k" alt="k" /> is chosen large enough.

##### 3. Parameter Estimation Under Sparsity:

In sparse settings, there is usually not enough data to estimate the interactions between variables directly and independently. FMs can estimate interactions well even in these settings, because they break the independence of the interaction parameters by factorizing them.

##### 4. Computation:

Due to the factorization of pairwise interactions, there is no model parameter that directly depends on two variables (e.g., a parameter with an index <img src="https://tex.s2cms.ru/svg/(i%2Cj)" alt="(i,j)" />). So, the pairwise interactions can be reformulated as shown below.
$$
\sum_{i=1}^n \sum_{j=i+1}^n \langle v_i, v_j \rangle x_i x_j
= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \langle v_i, v_j \rangle x_i x_j - \frac{1}{2} \sum_{i=1}^n \langle v_i, v_i \rangle x_i x_i
$$

$$
= \frac{1}{2} \left( \sum_{i=1}^n \sum_{j=1}^n \sum_{f=1}^k v_{i,f} v_{j,f} x_i x_j - \sum_{i=1}^n \sum_{f=1}^k v_{i,f} v_{i,f} x_i x_i \right)
$$

$$
= \frac{1}{2} \sum_{f=1}^k \left( \left( \sum_{i=1}^n v_{i,f} x_i \right) \left( \sum_{j=1}^n v_{j,f} x_j \right) - \sum_{i=1}^n v_{i,f}^2 x_i^2 \right)
$$

$$
= \frac{1}{2} \sum_{f=1}^k \left( \left( \sum_{i=1}^n v_{i,f} x_i \right)^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2 \right)
$$

With this reformulation, the pairwise interactions can be computed in linear time O(kn) instead of O(kn&sup2;).

##### 5.
Learning Factorization Machines:

The model parameters are learned with a gradient-based optimizer (the scripts below use Adam), using the following partial derivatives of the model equation:

$$
\frac{\partial}{\partial \theta} \hat{y}(x) =
\begin{cases}
1 & \text{if } \theta \text{ is } w_0 \\
x_i & \text{if } \theta \text{ is } w_i \\
x_i \sum_{j=1}^{n} v_{j,f} \cdot x_j - v_{i,f} \cdot x_i^2 & \text{if } \theta \text{ is } v_{i,f}
\end{cases}
$$

##### 6. Factorization Models as Predictors:

* **Regression:** <img src="https://tex.s2cms.ru/svg/%5Chat%7By%7D(x)" alt="\hat{y}(x)" /> can be used directly as the predictor, and the optimization criterion is the minimal least-squares error on <img src="https://tex.s2cms.ru/svg/D" alt="D" />.

_**Usage:**_ The `train` function in the [fm-regression.dml](https://github.com/apache/systemml/blob/master/scripts/staging/fm-regression.dml) script takes the input feature matrix and the corresponding target vector, with some of the input held out for validation during training.

```
train = function(matrix[double] X, matrix[double] y,
                 matrix[double] X_val, matrix[double] y_val)
    return (matrix[double] w0, matrix[double] W, matrix[double] V) {
  /*
   * Trains the FM model.
   *
   * Inputs:
   *  - X     : n examples with d features, of shape (n, d)
   *  - y     : Target matrix, of shape (n, 1)
   *  - X_val : Input validation data matrix, of shape (n, d)
   *  - y_val : Target validation matrix, of shape (n, 1)
   *
   * Outputs:
   *  - w0, W, V : updated model parameters.
   *
   * Network Architecture:
   *
   * X --> [model] --> out --> l2_loss::backward(out, y) --> dout
   */
  ...

  # 7. Call adam::update for all parameters
  [w0, mw0, vw0] = adam::update(w0, dw0, lr, beta1, beta2, epsilon, t, mw0, vw0);
  [W, mW, vW] = adam::update(W, dW, lr, beta1, beta2, epsilon, t, mW, vW);
  [V, mV, vV] = adam::update(V, dV, lr, beta1, beta2, epsilon, t, mV, vV);
}
```

Once the `train` function returns the weights of the FM model, these are passed to the `predict` function.

```
predict = function(matrix[double] X, matrix[double] w0,
                   matrix[double] W, matrix[double] V)
    return (matrix[double] out) {
  /*
   * Computes the predictions for the given inputs.
   *
   * Inputs:
   *  - X        : n examples with d features, of shape (n, d).
   *  - w0, W, V : trained model parameters.
   *
   * Outputs:
   *  - out : target vector, y.
   */
  out = fm::forward(X, w0, W, V);
}
```

_**Running with dummy data:**_ The [fm-regression-dummy-data.dml](https://github.com/apache/systemml/blob/master/scripts/nn/examples/fm-regression-dummy-data.dml) file can serve as a template to extend.

```
> spark-submit ./SystemML.jar -f ./scripts/nn/examples/fm-regression-dummy-data.dml
```

* **Binary Classification:** The sign of <img src="https://tex.s2cms.ru/svg/%5Chat%7By%7D(x)" alt="\hat{y}(x)" /> is used as the prediction, and the parameters are optimized for the hinge loss or logit loss.

_**Usage:**_ The `train` function in the [fm-binclass.dml](https://github.com/apache/systemml/blob/master/scripts/staging/fm-binclass.dml) script takes the input feature matrix and the corresponding target vector, with some of the input held out for validation during training. This script also contains `train()` and `predict()` functions, as in the regression case.

_**Running with dummy data:**_ The [fm-regression-dummy-data.dml](https://github.com/apache/systemml/blob/master/scripts/nn/examples/fm-regression-dummy-data.dml) file can serve as a template to extend.
* **Ranking:** The instance vectors are ordered by the score of <img src="https://tex.s2cms.ru/svg/%5Chat%7By%7D(x)" alt="\hat{y}(x)" />, and optimization is done over pairs of instance vectors <img src="https://tex.s2cms.ru/svg/%20(x(a)%2C%20x(b))%20%5Cin%20D%20" alt=" (x(a), x(b)) \in D " /> with a pairwise classification loss. Regularization terms such as <img src="https://tex.s2cms.ru/svg/L2" alt="L2" /> are usually applied to the optimization objective to prevent overfitting.
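Whichever predictor is used, training relies on the per-parameter gradients given in the Learning Factorization Machines subsection, and those closed forms can be sanity-checked numerically. The following NumPy sketch (hypothetical helper names, not part of the SystemML scripts) compares the closed-form gradient for a factor v_{i,f} against a central finite difference of the prediction:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """O(k*n) FM prediction from the model equation."""
    s = V.T @ x                          # sum_i v_{i,f} x_i, per factor f
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))

def grad_v(x, V, i, f):
    """Closed-form gradient w.r.t. v_{i,f}: x_i * sum_j v_{j,f} x_j - v_{i,f} * x_i^2."""
    return x[i] * (V[:, f] @ x) - V[i, f] * x[i] ** 2

rng = np.random.default_rng(1)
n, k = 5, 4
x = rng.normal(size=n)
w0 = 0.3
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

# Central finite difference with respect to a single factor entry V[i, f].
i, f, eps = 2, 1, 1e-6
Vp = V.copy(); Vp[i, f] += eps
Vm = V.copy(); Vm[i, f] -= eps
numeric = (fm_predict(x, w0, w, Vp) - fm_predict(x, w0, w, Vm)) / (2 * eps)

assert np.isclose(grad_v(x, V, i, f), numeric, atol=1e-6)
```

Because the prediction is quadratic in each v_{i,f}, the central difference matches the analytic gradient up to floating-point error.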