###### tags: `Social Media Mining`
# LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation
Reference link: https://hackmd.io/vv39m5QXSDuCeI3ki4F36Q?view
## Abstract
Although GCN has also been adopted in state-of-the-art methods for collaborative filtering, it does not actually work that well there, because it was designed for graph classification tasks and is equipped with many neural network operations.
Their experiments show that the two most common designs in GCN, feature transformation and nonlinear activation, contribute little to collaborative filtering.
> Collaborative Filtering (CF) vs Content-based Filtering:
>
> 
They propose LightGCN, which keeps only the most essential component of GCN, neighborhood aggregation, for collaborative filtering.
Specifically, LightGCN learns user and item embeddings by **linearly propagating** them, and uses the weighted sum of the embeddings learned at all layers as the final embedding.
It achieves substantial gains (about 16.0% relative improvement on average) over Neural Graph Collaborative Filtering (NGCF) under exactly the same experimental setting.
## 1. INTRODUCTION
This part reads more like storytelling about the history xd
The most common paradigm for CF is to learn **latent features** (a.k.a. embedding) to represent a user and an item, and perform prediction based on the embedding vectors.
Matrix factorization is an early such model, which directly projects the single ID of a user to her embedding.
> Matrix factorization is covered in Hung-yi Lee's course
> Reference: https://wenwu53.com/matrix-factorization/
Later it talks about SVD++ and Neural Attentive Item Similarity (NAIS).
These improvements can be seen as coming from using the subgraph structure of a user, more specifically her one-hop neighbors, to improve the embedding learning.
To deepen the use of subgraph structure with high-hop neighbors, Wang et al. [39] recently proposed NGCF, which achieves state-of-the-art performance for CF.
NGCF follows the same propagation rule as GCN to refine embeddings:
- feature transformation
- neighborhood aggregation
- nonlinear activation.
> refine: to make a substance pure by taking other substances out of it
Although its performance is good, it is quite heavy and burdensome.
GCN is originally proposed for node classification on attributed graph, where each node has rich attributes as input features
> Saying it again: GCN is designed for classification, and in such graphs each node usually has rich attributes
But in the user-item graph, each node is only described by a one-hot ID. Starting from such ID embeddings, stacking multiple layers of nonlinear feature transformation (the key to the success of modern neural networks) brings no benefit and even makes training harder.
This part basically says that, after an ablation study, they found feature transformation and nonlinear activation do not help the network; removing them even improves performance.
So they propose LightGCN, including only the most essential component of GCN, neighborhood aggregation, for collaborative filtering.
First, it propagates the embeddings on the user-item interaction graph.
Then, it combines the embeddings learned **at different propagation layers** with a weighted sum to obtain the final embedding for prediction.
> e.g., there could be 1, 2, or 3 layers
Oh my, dissing yet another paper: better than state-of-the-art methods like Mult-VAE
Good, the whole paper can be summarized in three points:
- We empirically show that two common designs in GCN, feature transformation and nonlinear activation, have no positive effect on the effectiveness of collaborative filtering.
(feature transformation & nonlinear activation: these two things are useless)
- We propose LightGCN, which largely simplifies the model design by including only the most essential components in GCN for recommendation.
(LightGCN is a simplified model)
- We empirically compare LightGCN with NGCF by following the same setting and demonstrate substantial improvements. In-depth analyses are provided towards the rationality of LightGCN from both technical and empirical perspectives.
(compare LightGCN with NGCF, plus in-depth analyses)
## 2. PRELIMINARIES
It says NGCF will be introduced first.

### 2.1 NGCF Brief
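For reference, my reconstruction of NGCF's propagation rule (roughly the paper's Eq. (1); treat the exact form as a sketch), where $N_u$ and $N_i$ are the neighbors of user $u$ and item $i$ and $\odot$ is the element-wise product:

$$e_u^{(k+1)} = \sigma\Big(W_1 e_u^{(k)} + \sum_{i \in N_u} \frac{1}{\sqrt{|N_u||N_i|}}\big(W_1 e_i^{(k)} + W_2 (e_i^{(k)} \odot e_u^{(k)})\big)\Big)$$

with a symmetric rule for the item side.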

If you don't understand this formula, see the Reference link at the top for a detailed explanation.
$W_1$ and $W_2$ are trainable weight matrices that perform **feature transformation** in each layer.
> the very thing the paper keeps saying is useless xd
By propagating $L$ layers, NGCF obtains $L + 1$ embeddings to describe a user $(e_u^{(0)}, e_u^{(1)}, ..., e_u^{(L)})$ and an item $(e_i^{(0)}, e_i^{(1)}, ..., e_i^{(L)})$
It then **concatenates** these $L + 1$ embeddings to obtain the final user embedding and item embedding, and uses the inner product to generate the prediction score.
- nonlinear activation function: $\sigma$
- feature transformation matrices $W_1$ and $W_2$
### 2.2. Empirical Explorations on NGCF
Since the core of GCN is to refine embeddings by propagation, we are more interested in the embedding quality under the same embedding size. Thus, we change the way of obtaining final embedding from concatenation (i.e., $e_u^* = e_u^{(0)} || ... || e_u^{(L)}$) to sum (i.e., $e_u^* = e_u^{(0)} + ... + e_u^{(L)}$).
> Because what they care about is embedding quality, they change the way the final embedding is obtained, from concatenation $\rightarrow$ sum
Note that this change has little effect on NGCF’s performance, but makes the following ablation studies more **indicative** of the embedding quality refined by GCN.

- NGCF-f, which removes the feature transformation matrices $W_1$ and $W_2$.
- NGCF-n, which removes the non-linear activation function $\sigma$.
- NGCF-fn, which removes both the feature transformation matrices and the non-linear activation function.
> This result is so cool: literally just removing them (-fn) already improves the performance
The conclusion is that the deterioration of NGCF comes from training difficulty rather than overfitting. **Theoretically** speaking, NGCF has higher representation power than NGCF-f, since setting the weight matrices $W_1$ and $W_2$ to the identity matrix $I$ fully recovers the NGCF-f model. However, **in practice**, NGCF demonstrates higher training loss and worse generalization performance than NGCF-f.
> In theory, NGCF should beat NGCF-f, since it adds $W_1$ & $W_2$ to help the model
> But in practice this is not the case
## 3. METHOD
After all the dissing, they finally start introducing the Light Graph Convolution Network (LightGCN) model.
### 3.1. LightGCN

> "only the normalized sum of neighbor embeddings is performed towards the next layer": I don't quite get this sentence (or maybe I sort of do)
> > Actually the formula probably already expresses exactly what this sentence means
The idea of GCN is to learn node representations by performing graph convolution iteratively, i.e., aggregating the features of neighbors.
This can be expressed as $e_u^{(k+1)} = \text{AGG}(e_u^{(k)}, \{e_i^{(k)} : i \in N_u\})$.
#### 3.1.1. Light Graph Convolution (LGC)
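From memory, the LGC propagation rule (the paper's Eq. (3), give or take notation) is:

$$e_u^{(k+1)} = \sum_{i \in N_u} \frac{1}{\sqrt{|N_u|}\sqrt{|N_i|}} e_i^{(k)}, \qquad e_i^{(k+1)} = \sum_{u \in N_i} \frac{1}{\sqrt{|N_i|}\sqrt{|N_u|}} e_u^{(k)}$$

Only the normalized sum of neighbor embeddings is carried to the next layer; there is no feature transformation and no nonlinear activation.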

The symmetric normalization term in front is there to avoid the scale of embeddings increasing with the graph convolution operations; other choices could also be applied, such as the $L_1$ norm (but they find symmetric normalization works well).
Only when the paper pointed it out did I notice: they do not aggregate the node itself, only its neighbors!!
The layer combination operation introduced in the next subsection captures the same effect as self-connections, so LGC (Light Graph Convolution) does not need to include self-connections.
> But what exactly is this formula supposed to learn?
> Isn't $e_i$ already fixed??
#### 3.1.2 Layer Combination and Model Prediction
The only trainable parameters in LightGCN are the embeddings at the 0-th layer, i.e., $e_u^{(0)}$ for all users and $e_i^{(0)}$ for all items.
> Wait, isn't $e_i$ already fixed?? (apparently not: the 0-th layer embeddings are exactly what gets trained)
Once these are given, the embeddings at higher layers can be computed via Eq. (3) above. They then further combine the embeddings obtained at each layer to form the final representation:
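From memory of the paper (its Eq. (4)), the combination should be roughly:

$$e_u = \sum_{k=0}^{K} \alpha_k \, e_u^{(k)}, \qquad e_i = \sum_{k=0}^{K} \alpha_k \, e_i^{(k)}$$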

$\alpha_k \geq 0$ denotes the importance (weight) of the $k$-th layer; it can be treated as a hyper-parameter tuned manually, or as a model parameter to be optimized automatically.
In their experiments, they find that setting $\alpha_k$ uniformly to $\frac{1}{K+1}$ already works well.
Their reasons for using layer combination are three-fold:
1. As the number of layers increases, the embeddings become over-smoothed, so using only the last layer is problematic.
> What does over-smoothed mean??
> > My guess: the more layers you stack, the more the node's original features get diluted
2. Different layers capture different semantics, so taking several layers into account makes the representation more comprehensive.
3. Combining the embeddings of different layers with a weighted sum captures the effect of graph convolution with **self-connections**.
model prediction:
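From memory, the prediction is just the inner product of the final user and item representations:

$$\hat{y}_{ui} = e_u^{\top} e_i$$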

#### 3.1.3. Matrix Form
Let the user-item interaction matrix be $R \in \mathbb{R}^{M \times N}$, where $M$ is the number of users and $N$ is the number of items.
$R_{ui}$ is 1 if $u$ has interacted with item $i$, and 0 otherwise.
Then, the adjacency matrix is
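Reconstructed from memory (the paper's Eq. (6) or thereabouts), the adjacency matrix of the user-item bipartite graph should be:

$$A = \begin{pmatrix} \mathbf{0} & R \\ R^{\top} & \mathbf{0} \end{pmatrix}$$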

Let the 0-th layer embedding matrix be $E^{(0)} \in \mathbb{R}^{(M+N) \times T}$, where $T$ is the embedding size. Then the matrix equivalent form of LGC is:
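Roughly (my reconstruction of the paper's Eq. (7)):

$$E^{(k+1)} = \left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) E^{(k)}$$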

where $D \in \mathbb{R}^{(M+N) \times (M+N)}$ is a diagonal matrix (also called the degree matrix), in which each entry $D_{ii}$ denotes the number of nonzero entries in the $i$-th row vector of the adjacency matrix $A$.
> $D$ is a diagonal matrix; each entry on its diagonal is the total number of nodes connected to the node of that row
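The final embedding matrix, the Eq. (8) that the later analyses keep referring to, should then look roughly like this (again from memory, with $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$):

$$E = \alpha_0 E^{(0)} + \alpha_1 \tilde{A} E^{(0)} + \alpha_2 \tilde{A}^2 E^{(0)} + \cdots + \alpha_K \tilde{A}^K E^{(0)}$$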

> This part is a bit harder; there is quite a lot of math
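To make the matrix form concrete, here is a minimal NumPy sketch of the propagation plus layer combination. The function name, the toy interaction matrix, and the random 0-th layer embeddings are my own illustration, not the authors' code:

```python
import numpy as np

def lightgcn_embeddings(R, E0, K=3, alphas=None):
    """R: (M, N) binary interaction matrix; E0: (M+N, T) 0-th layer embeddings."""
    M, N = R.shape
    # Adjacency matrix of the user-item bipartite graph.
    A = np.block([[np.zeros((M, M)), R],
                  [R.T, np.zeros((N, N))]])
    # Symmetric sqrt normalization: D^{-1/2} A D^{-1/2}.
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Uniform layer-combination weights alpha_k = 1/(K+1) by default.
    if alphas is None:
        alphas = [1.0 / (K + 1)] * (K + 1)
    E, out = E0, alphas[0] * E0
    for k in range(1, K + 1):
        E = A_tilde @ E              # linear propagation: no transform, no activation
        out = out + alphas[k] * E    # weighted sum over layers
    return out[:M], out[M:]          # final user / item embeddings

# Toy usage: 3 users, 4 items, embedding size 2.
R = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]], dtype=float)
E0 = np.random.randn(3 + 4, 2)
user_emb, item_emb = lightgcn_embeddings(R, E0)
scores = user_emb @ item_emb.T       # inner-product prediction
```

The whole model really is just repeated multiplication by the normalized adjacency matrix followed by a weighted sum.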
### 3.2. Model Analysis
First, they discuss the relation with Simplified GCN (SGCN), a recent linear GCN model that integrates self-connection into graph convolution.
The purpose of this analysis is to show that LightGCN already captures the effect of self-connections.
Then, they discuss the relation with Approximate Personalized Propagation of Neural Predictions (APPNP), a recent GCN variant that addresses oversmoothing, taking inspiration from Personalized PageRank.
The purpose of this analysis is to show that LightGCN can propagate over long ranges with controllable oversmoothing.
Lastly, they analyze the second-layer LGC to show how it smooths a user with her second-order neighbors, providing more insight into the working mechanism of LightGCN.
#### 3.2.1. Relation with SGCN.
SGCN simplifies GCN by removing nonlinearities and collapsing the weight matrices into one weight matrix.
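If I recall the paper correctly, the SGCN propagation used in this comparison integrates self-connections like this (take the exact normalization as a sketch):

$$E^{(k+1)} = (D + I)^{-\frac{1}{2}} (A + I) (D + I)^{-\frac{1}{2}} E^{(k)}$$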

> You can see the $I$ in there, i.e., the self-connection

You can see the final result is very similar to Eq. (8) (though at first I had no idea how it is derived...).
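The trick, as I understand it, is that the $(D + I)^{-\frac{1}{2}}$ factors only rescale the embeddings and can be dropped; the last-layer embedding then becomes a binomial expansion:

$$E^{(K)} = (A + I)^{K} E^{(0)} = \binom{K}{0} E^{(0)} + \binom{K}{1} A E^{(0)} + \cdots + \binom{K}{K} A^{K} E^{(0)}$$

which is exactly the form of Eq. (8) with fixed coefficients $\alpha_k = \binom{K}{k}$, so the self-connection buys nothing that layer combination does not already provide.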
#### 3.2.2. Relation with APPNP
APPNP can propagate over long ranges without the risk of oversmoothing.
To prevent oversmoothing, APPNP complements each propagation layer with the starting features (i.e., the 0-th layer embeddings).
The propagation layer in APPNP is defined as:
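My reconstruction of the APPNP propagation (where $\tilde{A}$ is the normalized adjacency matrix and $\beta$ is the teleport, a.k.a. restart, probability):

$$E^{(k+1)} = \beta E^{(0)} + (1 - \beta)\, \tilde{A} E^{(k)}$$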

> $\beta$ is just there to adjust the relative importance (of the starting features)
last layer:
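Unrolling the recursion, the last layer should come out as (again reconstructed, not copied verbatim):

$$E^{(K)} = \beta E^{(0)} + \beta(1-\beta) \tilde{A} E^{(0)} + \beta(1-\beta)^{2} \tilde{A}^{2} E^{(0)} + \cdots + (1-\beta)^{K} \tilde{A}^{K} E^{(0)}$$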

Compare this with Eq. (8): the form is basically identical.
#### 3.2.3 Second-Order Embedding Smoothness.
The second layer smooths users that have overlap on their interacted items. More concretely, we have:

> both $u$ and $v$ are users here
If $v$ has co-interacted items with $u$, the smoothing strength of $v$ on $u$ is measured by the following coefficient (otherwise it is 0):
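From memory, this coefficient (the paper's Eq. (14)) is roughly:

$$c_{v \rightarrow u} = \frac{1}{\sqrt{|N_u|}\sqrt{|N_v|}} \sum_{i \in N_u \cap N_v} \frac{1}{|N_i|}$$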

This coefficient is rather interpretable: the influence of a second-order neighbor v on u is determined by
1. the number of co-interacted items, the more the larger
2. the popularity of the co-interacted items, the less popularity (i.e., more indicative of user personalized preference) the larger
> what does popularity mean here, and how is it different from the number of items?
> > (my reading: the number is how many items $u$ and $v$ co-interacted with, while popularity is how many users each of those items has, i.e., the $|N_i|$ in the denominator)
3. the activity of $v$, the less active the larger
> Also, about that second sum: it runs over the items $i$ that both $u$ and $v$ interacted with; shouldn't having more of them mean the two users are more similar??
> > (it does: each co-interacted item adds a positive term, so more shared items do give a larger coefficient; the $\frac{1}{|N_i|}$ inside only down-weights popular items)
A similar rule can be derived on the item side to measure the similarity between two items.
### 3.3. Model Training
The only trainable parameters of LightGCN are the embeddings of the 0-th layer, i.e., $\Theta = \{E^{(0)}\}$; in other words, the model complexity is the same as standard matrix factorization (MF).
They use the *Bayesian Personalized Ranking* (BPR) loss, a pairwise loss that encourages the prediction of an observed entry to be higher than its unobserved counterparts (I didn't get this sentence at first; it just means an item the user has interacted with should score higher than a sampled item she has not):
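The BPR objective, reconstructed from memory (so treat the exact indexing as approximate):

$$L_{BPR} = -\sum_{u=1}^{M} \sum_{i \in N_u} \sum_{j \notin N_u} \ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big) + \lambda \left\lVert E^{(0)} \right\rVert^{2}$$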

where $\lambda$ controls the $L_2$ regularization strength
They use the Adam optimizer.
We are aware of other **advanced negative sampling strategies** which might improve the LightGCN training, such as the hard negative sampling [31] and adversarial sampling [9]. We leave this extension in the future since it is not the focus of this work.
> No idea what these fancy methods are, advanced negative sampling??
> > Nice, luckily they are not used here xd
Note that they do not use dropout, even though GCNs and NGCF usually do. Since LightGCN has no feature transformation weight matrices, enforcing $L_2$ regularization on the embedding layer is already sufficient to prevent overfitting.
In contrast, NGCF has to tune two dropout ratios (node dropout and message dropout) and normalize the embedding of each layer to unit length.
It is technically viable to also learn the layer combination coefficients $\{\alpha_k\}_{k=0}^{K}$, or to parameterize them with an attention network, but they find that learning $\alpha$ on the training data does not help. The reason is probably that the training data does not carry enough signal to learn a good $\alpha$ that generalizes to unseen data (e.g., validation data). They also tried learning $\alpha$ from the **validation data**, inspired by [5], and the performance improved slightly (less than 1%). They leave this exploration as future work.
> Learning $\alpha$ from validation data, what a wild idea!!
## 4. EXPERIMENTS
Ablation studies and embedding analyses are presented in Section 4.4.
The hyper-parameter study is presented in Section 4.5.
### 4.1. Experimental Settings
dataset:

#### 4.1.1. Compared Methods
The main comparison is with NGCF; since NGCF has already been compared against many models, they do not repeat those comparisons. In addition, they include two more relevant and competitive CF models:
- Mult-VAE [28]
This is an item-based CF method based on the variational autoencoder (VAE)
- GRMF [30]
This method smooths matrix factorization by adding the graph Laplacian regularizer.
#### 4.1.2. Hyper-parameter Settings
Same as NGCF, the embedding size is fixed to 64 for all models and the embedding parameters are initialized with the Xavier method [12] (this was covered in the intro deep learning course, already forgotten xd).
default learning rate: 0.001
default mini-batch size: 1024 (on Amazon-Book: 2048)
The $L_2$ regularization coefficient $\lambda$ is searched in the range $\{1e^{-6}, 1e^{-5}, ..., 1e^{-2}\}$, and in most cases the optimal value is $1e^{-4}$.
The layer combination coefficient $\alpha_k$ is **uniformly** set to $\frac{1}{1+K}$, where $K$ is the number of layers. We test $K$ in the range of 1 to 4, and **satisfactory performance** can be achieved when $K$ equals 3.
Note the wording: $K = 3$ gives *satisfactory* performance, not necessarily the best.
> No wonder they stack three layers
Typically, 1000 epochs are sufficient for LightGCN to converge.
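Putting the stated hyper-parameters together (embedding size 64, Xavier initialization, Adam with learning rate 0.001, BPR loss with $\lambda = 1e^{-4}$), here is a hedged PyTorch sketch of one training step. The user/item counts and the mini-batch sampling are made up for illustration, and the propagation is collapsed to the 0-th layer (the MF special case) to keep it short:

```python
import torch
import torch.nn.functional as F

num_users, num_items, dim = 1000, 2000, 64               # toy sizes, dim = 64 as in the paper
E0 = torch.nn.Parameter(torch.empty(num_users + num_items, dim))
torch.nn.init.xavier_uniform_(E0)                         # Xavier initialization
opt = torch.optim.Adam([E0], lr=1e-3)                     # default learning rate 0.001
lam = 1e-4                                                # L2 regularization strength

def bpr_step(users, pos_items, neg_items):
    # In LightGCN these would be the layer-combined embeddings (Sec. 3.1);
    # here we read the 0-th layer directly, i.e., the MF special case.
    user_emb, item_emb = E0[:num_users], E0[num_users:]
    pos = (user_emb[users] * item_emb[pos_items]).sum(-1)    # y_hat_ui
    neg = (user_emb[users] * item_emb[neg_items]).sum(-1)    # y_hat_uj
    reg = lam * E0.norm(2).pow(2)                            # L2 on the 0-th layer embeddings
    loss = -F.logsigmoid(pos - neg).mean() + reg             # BPR: observed > unobserved
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy mini-batch (the notes say the default size is 1024; tiny here for illustration).
u = torch.randint(0, num_users, (8,))
i = torch.randint(0, num_items, (8,))
j = torch.randint(0, num_items, (8,))
bpr_step(u, i, j)
```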
### 4.2. Performance Comparison with NGCF

- In all cases, LightGCN outperforms NGCF
- Combining Table 4 with Table 1, LightGCN is better even than NGCF-fn, presumably because NGCF-fn still keeps more operations than LightGCN (e.g., self-connection, the interaction between user embedding and item embedding in graph convolution, and dropout).
- Increasing the number of layers can improve the performance, but the benefits diminish.
In plain words: more layers give better performance, but the gain shrinks as layers are added; Table 3 shows this clearly.
- LightGCN consistently obtains a lower training loss, and this successfully transfers to testing accuracy, indicating the strong generalization power of LightGCN.
Although increasing the learning rate of NGCF can reduce its training loss (even below that of LightGCN), the testing recall does not improve.
### 4.3. Performance Comparison with State-of-the-Arts

The performance can be further improved by tuning $\alpha_k$ (see Figure 4 for evidence), but here they only use the uniform setting mentioned above.
### 4.4. Ablation and Effectiveness Analyses
They run ablation studies showing how **layer combination** and **symmetric sqrt normalization** affect performance.
To justify the rationality of LightGCN as analyzed in Section 3.2.3, we further investigate the effect of embedding smoothness — the key reason of LightGCN’s effectiveness.
#### 4.4.1. Impact of Layer Combination
LightGCN-single means using only the last layer, without layer combination.

You can see the performance drops sharply at the fourth layer.
> Hey, but on Amazon-Book LightGCN-single performs really well at the second layer
> Still, even as the number of layers keeps growing, LightGCN's performance stays stable (smooth)
OK, the paper admits it itself: LightGCN (with layer combination) does well on Gowalla but not on Amazon-Book and Yelp2018. Regarding this phenomenon, two points need to be noted before we draw conclusions (time to find excuses xd):
1. LightGCN-single is a special case of LightGCN: set $\alpha_K$ to 1 and all other $\alpha_k$ to 0 and you get it.
2. They did not tune $\alpha_k$.
> Fine, I'll let this excuse pass, but it is also kind of unbeatable: everyone else is a special case of yours, contained inside your model, so whenever you feel like it you can be just as good as them (?)
#### 4.4.2. Impact of Symmetric Sqrt Normalization
symmetric sqrt normalization: $\frac{1}{\sqrt{|N_u|} \sqrt{|N_i|}}$
They test keeping only the left-side normalization (i.e., the target node's coefficient), or only the right-side normalization (i.e., the neighbor node's coefficient).
They also test $L_1$ normalization, i.e., removing the square root. Note that when normalization is removed entirely, training becomes numerically unstable and suffers from not-a-value (NaN) issues, so they do not show that setting.

#### 4.4.3. Analysis of Embedding Smoothness
Eq. (14) describes the smoothing strength between two users, $c_{v \rightarrow u}$.
To verify this, we first define the smoothness of user embeddings as:

They compare matrix factorization (i.e., using $E^{(0)}$ for model prediction) with the 2-layer LightGCN-single (i.e., using $E^{(2)}$ for prediction).

> I still don't understand how the smoothness loss is computed, or what it actually means
### 4.5. Hyper-parameter Studies
Apart from the learning rate, the most important hyper-parameter to tune is the $L_2$ regularization coefficient $\lambda$.

It actually does not differ that much, just don't go above $1e^{-3}$.
## 5. RELATED WORK
### 5.1. Collaborative Filtering
Collaborative Filtering (CF) is a prevalent technique in modern recommender systems [7, 45].
Users and items are parameterized as embeddings, and the embedding parameters are learned by reconstructing historical user-item interactions; CF models like matrix factorization (MF) work exactly this way. Recent neural recommender models such as NCF [19] and LRML [34] do the same, while enhancing the interaction modeling with neural networks.
The embeddings in the approaches above only consider ID information. Some approaches instead treat a user's historical items as her pre-existing features, e.g., FISM [21] and SVD++ [25], which take the weighted average of the interacted items' embeddings as the user embedding.
More recently, it was observed that historical items contribute differently to a user's interest, so attention mechanisms were introduced, e.g., ACF and NAIS, to automatically learn the importance of each historical item.
### 5.2. Graph Methods for Recommendation
Another relevant research line exploits the user-item graph structure for recommendation, e.g., ItemRank [13], which uses a label propagation mechanism, i.e., encouraging connected nodes to have similar labels.
Recently, GNNs have become prominent, especially for modeling high-hop neighbors. Early studies define graph convolution in the spectral domain, e.g., via Laplacian eigen-decomposition [1] and Chebyshev polynomials [8], which are computationally expensive. Later, GraphSage and GCN define graph convolution in the spatial domain, i.e., aggregating the embeddings of neighbors to refine the target node's embedding.
Several studies provide deeper insights into GNNs [22, 27, 40].
SGCN (Simplified GCN) is specifically mentioned: it removes nonlinearities and collapses multiple weight matrices into one.
The main difference between SGCN and LightGCN is the purpose: SGCN is for node classification, while LightGCN is for CF (collaborative filtering).
## 6. CONCLUSION AND FUTURE WORK
#### Conclusion
The two most essential components:
- light graph convolution
- layer combination
#### Future work
Another direction is to personalize the layer combination weights $\alpha_k$.