Amazon user-item recommender with TGN and Memgraph

--- title: Amazon user-item recommender with TGN and Memgraph description: Over the course of the last few months, we at Memgraph have been working on something that we believe could help you guys with classical graph prediction tasks. With our latest newborn query module, you will have the possibility to do both label classification and link prediction. author: Antonio Filipovic parentCategory: Tutorial --- # PYTORCH x MEMGRAPH x GNN = :heart_decoration: Over the course of the last few months, we at **Memgraph** have been working on something that we believe could help you guys with classical graph prediction tasks. With our latest newborn query module, you will have the possibility to do both **label classification** and **link prediction**. But, how come a query module can do both label classification and link prediction? It's all thanks to **graph neural networks**, for short **GNNs**. :heart: # Graph neural networks Whether you are a software engineer or a deep learning enthusiast, there is a high chance you heard of **[graph neural networks](https://en.wikipedia.org/wiki/Graph_neural_network)** as a rising :star:. Maybe you even deep-doved into this topic and you are now ready for **[new MAGE spell](https://github.com/memgraph/mage)**. But even if you haven't, don't worry, I will try to give you a quick overview so you can catch up and follow along. Firstly as you already know graph consists of **nodes(vertices)** and **edges(relationships)**. ![](https://i.imgur.com/s3QOvNJ.png) Every node can have its **feature vector**, which essentially describes that node with a vector of numbers. We can look at this **feature vector** as the **representation vector** of each node, also called **embedding** of the node. To avoid getting lost in technical details, **graph neural networks** work as a **message passing<sup>[2]</sup>** system, where each node aggregates feature representations of its 1-hop neighbors. To be more precise, nodes don’t aggregate feature representations directly, but feature vectors obtained by dimensionality reduction using the *W* matrix (you can look at them as fully connected linear layers). This matrix projected feature vectors are called **messages** and they give expressive power to graph neural networks. ![](https://i.imgur.com/IeKYzNj.png) This idea originates from the field of **graph signal processing**. Now, we don't have time here to explain all about how we got from **signals** to **message passing**, but it's all math :heart:. Feel free to drop us a message on [Discord](https://discord.gg/memgraph) and we will make sure to create a blog post explaining the **graph neural network** introduction topic, not a simplified version, but one explaining all of it from the beginning, somewhere about [Tutorial on Spectral Clustering](https://arxiv.org/pdf/0711.0189.pdf) in 2007. If you would like to get a better understanding of the topic of the **graph neural networks** before continuing, I suggest you do the following: * read a following **[blog post](https://distill.pub/2021/gnn-intro/)** - it gives a gentle introduction to the topic * you can also check the **[the video explanation](https://www.youtube.com/watch?v=6g9vtxUmfwM&list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&index=14&ab_channel=StanfordOnline)** by the Stanford professor Jure Leskovec about graph neural networks - I would honestly suggest to binge-watch the whole series, but if you don't have that much time, just watch the lectures called **[Message passing and Node Classification](https://www.youtube.com/watch?v=6g9vtxUmfwM&list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&index=15&ab_channel=StanfordOnline)** and **[Introduction to Graph Neural Networks](https://www.youtube.com/watch?v=F3PgltDzllc&list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&index=17&ab_channel=StanfordOnline)** * and if you want to deep-dive, which I suggest, I will leave [the following blog post](https://gordicaleksa.medium.com/how-to-get-started-with-graph-machine-learning-afa53f6f963a), it will be more than enough :mage: A reason why we wanted to give our **[MAGE](https://github.com/memgraph/mage)** a new spell in form of **GNN** is because **GNNs** are to **graphs** what **[CNNs](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)** are to **images**. **GNNs** can inductively learn about your dataset, which means that after training is complete you can apply their knowledge to a similar use-case, which is very cool since you don't need to retrain the whole algorithm. With other representation learning methods like **[DeepWalk](https://arxiv.org/abs/1403.6652)**, **[Node2Vec](https://arxiv.org/abs/1607.00653)**, **[Planetoid](https://arxiv.org/abs/1603.08861)**, we haven't been able to do that until, well, **GNNs**. **Now, why temporal graph neural networks?** Imagine you are in charge of a product where users interact with items every minute of every day, and they like some and hate the others. You would like to present them with more items they like, and not just that, you would love it if they bought those new items. This way you have a **stream** of data. Interactions appear across time, so you are dealing with a **temporal** dataset. The classical **GNNs** are not designed to work with streams, although they work very well on unseen data. But it is not all nails if you have a hammer - some methods work better on stream, others on static data. ## Temporal graph networks As you already know, we in [Memgraph](https://memgraph.com/) are all about streams. :dash: Thanks to the guys at [Twitter](https://twitter.com/twitterresearch), they developed a **GNN** that works on **temporal** graph networks. This way **GNNs** can deal with **continuous-time dynamic graphs**. In the image below you can see a schematic view of **temporal graph networks**. It is a lot to take, but the process once explained is not that complicated. <a href="https://i.imgur.com/Uy0UHBW.jpg"> <img border="10" alt="TGN" src="https://i.imgur.com/Uy0UHBW.jpg"> </a> Firstly, in **continuous-time dynamic graphs** you can model changes on graph which include **edge or node addition**, **edge or node feature transformation(update)**, **edge or node delition** as a time listed events. **[Temporal graph networks<sup>[1]</sup>](https://arxiv.org/pdf/2006.10637.pdf)** , shortend **TGNs**, work as following: * firstly, node embedding calculations work on the concept of message passing, which I hope you are familiar with until now :crossed_fingers: * TGNs use **events**, and whenever a new edge appears it represents **interaction event** between two nodes involved * from every **event** we create a **message** and use a **message aggregator** for all messages of the same node to get the **aggregated message** of every node * every node has it's own **memory** which represents **accumulated** state, updated with **aggregated message** by one of the **[LSTM or GRU](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)** * Lastly, the embedding module is used to generate the temporal embedding There are two embedding module types we integrated into our **TGN** implementation: * **Graph attention layer**: it is a similar concept as in **[Graph attention networks](https://arxiv.org/abs/1710.10903)**, but here they use the original idea from Vaswani et al. **[Attention is all you need](https://arxiv.org/abs/1706.03762)** which includes queries, keys and values and everything else is the same. I suggest you look at the **TGN** paper to check the exact embedding calculation details. * **Graph sum layer**: this mechanism is completely similar to the **message passing** system There is a certain problem when dealing with **embedding updates**. We don't update embeddings for every node, only for ones that appear in a batch. Also, in order not to get so much into implementation details, we will try to abstractly explain the following problem. Nodes in the batch appear at different points in time. That's why we need to take into account when was their update so that we can only use neighbors which appeared in the graph before them. This is in case we update the whole representation of the graph with batch information, and then do the calculation, which is what we did. This leads to having a different computation graph for every node. You can see what it looks like in the image below: ![](https://i.imgur.com/hdCSd48.png) ## Amazon data example Now to try how this works, we have prepared a Jupyter Notebook on our **[GitHub repository](https://github.com/memgraph/jupyter-memgraph-tutorials)**. It is about [Amazon user-item reviews](http://snap.stanford.edu/data/amazon/productGraph/). In the following example, you will see how to do **link prediction** with **TGN**. # Exploring a Amazon data network in Memgraph Through this short tutorial, you will learn how to install Memgraph, connect to it from a Jupyter Notebook and perform data analysis on Amazon data set using **graph neural network** called **Temporal graph networks**. ## 1. Prerequisites For this tutorial, you will need to install: - [Jupyter](https://jupyter.org/install) - [Docker](https://docs.docker.com/get-docker/) - [GQLAlchemy](https://pypi.org/project/gqlalchemy/) Docker is used because Memgraph is a native Linux application and cannot be installed on Windows and macOS. ## 2. Installation using Docker After installing Docker, you can set up Memgraph by running: ``` docker run -it -p 7687:7687 -p 3000:3000 -p 7444:7444 memgraph/memgraph-platform ``` This command will start the download and after it finishes, run the Memgraph container. ## 3. Connecting to Memgraph with GQLAlchemy We will be using the **GQLAlchemy** object graph mapper (OGM) to connect to Memgraph and execute **Cypher** queries easily. GQLAlchemy also serves as a Python driver/client for Memgraph. You can install it using: ``` pip install gqlalchemy ``` > **Hint**: You may need to install [CMake](https://cmake.org/download/) before installing GQLAlchemy. Maybe you got confused when I mentioned Cypher. You can think of Cypher as SQL for graph databases. It contains many of the same language constructs like `CREATE`, `UPDATE`, `DELETE`... and it's used to query the database. ```python from gqlalchemy import Memgraph ``` ```python memgraph = Memgraph("127.0.0.1", 7687) ``` Let's make sure that Memgraph is empty before we start with anything else. ```python memgraph.drop_database() ``` Following command should output {number_of_nodes:0} ```python results = memgraph.execute_and_fetch( """ MATCH (n) RETURN count(n) AS number_of_nodes ; """ ) print(next(results)) ``` ## 4. Data analysis on Amazon product dataset You will load an **amazon product dataset** as list of cypher queries. Here is how it looks: <a href="https://i.imgur.com/yuL5SrP.png"> <img border="10" alt="dataset" src="https://i.imgur.com/yuL5SrP.png"> </a> One example of mentioned query is following one: ```cypher MERGE (a:User {id: 'A1BHUGKLYW6H7V', profile_name:'P. Lecuyer'}) MERGE (b:Item {id: 'B0007MCVQ2'}) MERGE (a)-[:REVIEWED {review_text:'Like all Clarks, these guys didnt disappoint. They fit great and look even better. For the price, I dont think a better deal exists out there for casual shoes.', feature: [161.0, 133.0, 0.782608695652174, 0.0, 0.031055900621118012, 0.17391304347826086, 0.043478260869565216, 36.0, 36.0, 1.0, 3.6944444444444446, 0.0, 0.0, 3.0, 1.0, 12.0, 0.055, 0.519, 0.427, 0.9238], review_time:1127088000, review_score:5.0}]->(b); ``` So as you can see, we have `User` and we have `Item` in our graph sheme. Every user has left a very **positive** review for **Item**. This wasn't a case for all reviews in original dataset, but we processed it and removed negative reviews (all reviews with `review_score` <= 3.0). Every `User` as an `id` and every `Item` user has reviewed has also an `id`. In this one query, we find the `User` and the `Item` with mentioned ids or we create one if such `User` or `Item` is missing in database. And we create an interaction event between them in term of an `edge` which has list of **20** edge features. This `edge_features` we created from review user has left to an item: ``` 1. Number of characters 2. Number of characters without counting white space 3. Fraction of alphabetical characters 4. Fraction of digits 5. Fraction of uppercase characters 6. Fraction of white spaces 7. Fraction of special characters, such as comma, exclamation mark, etc. 8. Number of words 9. Number of unique works 10. Number of long words (at least 6 characters) 11. Average word length 12. Number of unique stopwords 13. Fraction of stopwords 14. Number of sentences 15. Number of long sentences (at least 10 words) 16. Average number of words per sentence 17. Positive sentiment calculated by VADER # VADER - Valence Aware Dictionary and sEntiment Reasoner lexicon # and rule-based sentiment analysis tool 18. Negative sentiment calculated by VADER 19. Neutral sentiment calculated by VADER 20. Compound sentiment calculated by VADER ``` We should have also prepared features for a `User` and `Item`, but following features seemed enough for our **toy** example. One more **note**: In those dataset of queries we already prepared for you, there is one query that will change "working mode" of our **temporal graph networks** module to **evaluation(eval)** mode. When mode is changed for **tgn** it also stops doing **training** of the model, and starts doing **evaluation** of trained model. If you look inside of file you should find following query: ```cypher CALL tgn.set_mode("eval") YIELD *; ``` ### Trigger creation In order to process a dataset, we need to create trigger on **edge create** event if trigger with our name already doesn't exist. You will see, this check is a neat feature to have in your Jupyter notebook if you want just to rerun it without dumping **local Memgraph** instance if you are not working with **Docker**. ```python results = memgraph.execute_and_fetch("SHOW TRIGGERS;") trigger_exists = False for result in results: if result['trigger name'] == 'create_embeddings': print("Trigger already exists") trigger_exists = True break; if not trigger_exists: memgraph.execute( """ CREATE TRIGGER create_embeddings ON --> CREATE BEFORE COMMIT EXECUTE CALL tgn.update(createdEdges) RETURN 1; """ ) ``` ### Index creation for dataset **Memgraph** works best with **indexes** defined for nodes, in our case we will have **User** and **Item** types. ```python index_queries = ["CREATE INDEX ON :User(id);", "CREATE INDEX ON :Item(id);"] for query in index_queries: results = memgraph.execute_and_fetch(query) for result in results: continue ``` ### Training and evaluating Temporal Graph Networks In order to train **Temporal graph networks** on **Amazon dataset** we will split dataset into **train** and **eval** queries. Let's first load our **raw queries**. Each query creates an **edge** between **User** and **Item** thus representing a **positive** review of a certain **Item** by a **User**. ```python import os dir_path = os.getcwd() with open(f"{dir_path}/data/queries.cypherl", "r") as fh: raw_queries = fh.readlines() train_eval_split_ratio = 0.8 queries_index_split = int(len(raw_queries) * train_eval_split_ratio) train_queries = raw_queries[:queries_index_split] eval_queries = raw_queries[queries_index_split:] print(f"Num of train queries {len(train_queries)}") print(f"Num of eval queries {len(eval_queries)}") ``` Before we start importing **train** queries, first we need to set parameters for **temporal graph networks**. ```python # since we are doing link prediction, we use self_supervised mode learning_type = "self_supervised" batch_size = 200 #optimal size as defined in paper num_of_layers = 2 # GNNs don't need multiple layers, contrary to CNNs. layer_type = "graph_attn" # choose between graph_attn or graph_sum edge_message_function_type = "identity" # choose between identity or mlp message_aggregator_type = "last" # choose between last or mean memory_updater_type = "gru" # choose between gru or rnn attention_heads = 1 memory_dimension = 100 time_dimension = 100 num_edge_features = 20 num_node_features=100 # number of sampled neighbors num_neighbors = 15 # message dimension must be defined in the case we use MLP, # because then we define dimension of **projection** message_dimension = time_dimension + num_node_features + num_edge_features tgn_param_query = f"CALL tgn.set_params({{learning_type:'{learning_type}', batch_size: {batch_size}, num_of_layers:{num_of_layers}, layer_type:'{layer_type}', memory_dimension:{memory_dimension}, time_dimension:{time_dimension}, num_edge_features:{num_edge_features}, num_node_features:{num_node_features}, message_dimension:{message_dimension}, num_neighbors:{num_neighbors}, edge_message_function_type:'{edge_message_function_type}', message_aggregator_type:'{message_aggregator_type}', memory_updater_type:'{memory_updater_type}', attention_heads:{attention_heads}}}) YIELD *;" print(f"TGN param query: {tgn_param_query}") results = memgraph.execute_and_fetch(tgn_param_query) for result in results: print(result) ``` Now it is time to execute queries and do **first** epoch of **training**. ```python for query in train_queries: results = memgraph.execute_and_fetch(query.strip()) for result in results: continue ``` Now we need to change **TGN** mode to **eval** and start importing our **evaluation** queries. ```python results = memgraph.execute_and_fetch("CALL tgn.set_eval() YIELD *;") for result in results: print(result) for query in eval_queries: results = memgraph.execute_and_fetch(query.strip()) for result in results: continue ``` After our **stream** is done, we should probably do few more rounds of training and evaluation in order to have properly working model. We can do so with following query: ```python num_of_epochs = 5 results = memgraph.execute_and_fetch( f""" CALL tgn.train_and_eval({num_of_epochs}) YIELD * RETURN epoch_num, batch_num, precision, batch_process_time, batch_type ORDER BY epoch_num, batch_num; """ ) for result in results: continue ``` Now, let's get the results and then do some **plotting** to check whether **precision** increases between epochs. ```python from collections import defaultdict import matplotlib.pyplot as plt import numpy as np results_train_dict = defaultdict(list) results_eval_dict = defaultdict(list) results = memgraph.execute_and_fetch( """ CALL tgn.get_results() YIELD epoch_num, batch_num, precision, batch_process_time, batch_type RETURN epoch_num, batch_num, precision, batch_process_time, batch_type ORDER BY epoch_num, batch_num; """ ) for result in results: if result['batch_type'] == 'Train': results_train_dict[result['epoch_num']].append(result['precision']) else: results_eval_dict[result['epoch_num']].append(result['precision']) ``` Now when we have collected the results, let's first plot the average `accuracy` of `train` batches inside epoch, and the average `accuracy` of `eval` batches inside epoch. We can do that since every batch is the same size. (**NB**: `TGN` uses a predefined batch size.) ```python X_train = [] Y_train = [] for epoch, batches_precision in results_train_dict.items(): Y_train.append(np.mean(batches_precision)) X_train.append(epoch) X_eval = [] Y_eval = [] for epoch, batches_precision in results_eval_dict.items(): Y_eval.append(np.mean(batches_precision)) X_eval.append(epoch) #scatter plot plt.plot(X_train, Y_train, 'b', label="train") plt.plot(X_eval, Y_eval, 'r', label="eval") #add title plt.title('epoch - average batch precision') #add x and y labels plt.xlabel('epoch') plt.ylabel('precision') plt.legend(loc="upper left") #show plot plt.show() ``` ![](https://i.imgur.com/8TZYE7b.png) We can see that **average accuracy** increases, which is really good. Now we can start doing some recommending. Let's find `Users` who reviewed one `Item` positively and those who reviewed multiple `Items` positively. Our module will return what it believes should be a prediction score for yet unreviewed `Items`. ```python results = memgraph.execute_and_fetch( """ MATCH (n:User) WITH n LIMIT 15 MATCH (m:Item) OPTIONAL MATCH (n)-[r]->(m) WHERE r is null CALL tgn.predict_link_score(n,m) YIELD prediction WITH n,m, prediction ORDER BY prediction DESC LIMIT 10 MERGE (n)-[:PREDICTED_REVIEW {likelihood:prediction}]->(m); """ ) for result in results: print(result) ``` Now we can apply make following query in Memgraph Lab: ```cypher MATCH (u:User)-[pr:PREDICTED_REVIEW]->(i:Item), (u)-[r:REVIEWED]->(oi:Item) RETURN *; ``` And after [applying style](https://github.com/memgraph/jupyter-memgraph-tutorials/tree/main/pytorch_amazon_network_analysis), we get the following visualization. From image below we can see that most predictions are oriented towards one of the most popular items. <a href="https://i.imgur.com/eqQBcTm.jpg"> <img border="10" alt="dataset" src="https://i.imgur.com/eqQBcTm.jpg"> </a> ## Where to next? Well, I hope this was fun and that you have learned something. You can check out what we else we have prepared in **[MAGE 1.2 release](https://memgraph.com/blog/mage-1-2-release)**. If you loved our implementation, don't hesitate to give as a star on **[GitHub](https://github.com/memgraph/mage)** :star:. If you have any comment, you can contact us on **[Discord](https://discord.gg/memgraph)**. And lastly if you wish to continue on reading our blog posts, jump on **[our page](https://memgraph.com/blog)**. ## References [1] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, M. Bronstein (2020). Temporal Graph Networks for Deep Learning on Dynamic Graphs [2] W.L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. U Advances in Neural Information Processing Systems 30, 2017

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.