## BUILD MULTILINGUAL RECOMMENDER SYSTEMS

### Problem Statement

Modelling customer shopping intentions is crucial for e-commerce stores, as it directly impacts user experience and engagement. To this end, we use the **Multilingual Shopping Session Dataset**, which consists of millions of user sessions from six different locales, where the major product languages are English, German, Japanese, French, Italian, and Spanish. The dataset has two main components:

1. User sessions: a list of products that a user has engaged with, in chronological order.
2. Product attributes: details such as product title, price in local currency, brand, color, and description.

The two major tasks to be completed are:

1. Next Product Recommendation.
2. Next Product Recommendation for Underrepresented Languages/Locales.

### Solution

#### *Task 1*

Task 1 aims to predict the next product that a customer is likely to engage with, given their session data and the attributes of each product. The test set for Task 1 comprises data from the English, German, and Japanese locales. The predicted product IDs should be stored in a list sorted in decreasing order of confidence, with the most confident prediction at index 0 and the least confident prediction at index 99.

#### *Task 2*

Utilize the knowledge learned from the Task 1 languages to train the model. You are encouraged to transfer the knowledge gained from languages with sufficient data, such as English, German, and Japanese, to improve the quality of recommendations for French, Italian, and Spanish.

### Timeline

**Week 1: `May 22 2023 - May 28 2023`**

* Understand concepts related to the problem statement:
    * **Basic NLP techniques**: pre-processing, tokenisation, word embedding generation, etc.
    * **Large Language Models**, their different utilities and versions.
    * **Transfer Learning** basics.
* Download the dataset and get familiar with its datatypes and attributes.

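
As a first hands-on step while getting familiar with the data, the Task 1 output format (product IDs ordered from most to least confident) can be sketched with a simple "what usually comes next" counting baseline. This is only an illustrative sketch, not the intended solution: the toy sessions, the product IDs, and the helper names `build_next_item_counts` and `recommend` are all hypothetical, and skipping products already seen in the session is a design choice of this sketch rather than a rule of the challenge.

```python
from collections import Counter, defaultdict

def build_next_item_counts(sessions):
    """Count how often each product directly follows another within a session."""
    next_counts = defaultdict(Counter)
    for session in sessions:
        for current_item, following_item in zip(session, session[1:]):
            next_counts[current_item][following_item] += 1
    return next_counts

def recommend(session, next_counts, popularity, k=100):
    """Return up to k product IDs, most confident first (index 0)."""
    seen = set(session)
    ranked = []
    # Products that most often followed the last item of this session.
    for item, _ in next_counts.get(session[-1], Counter()).most_common():
        if item not in seen:
            ranked.append(item)
    # Back-fill with globally popular products until k candidates are found.
    for item, _ in popularity.most_common():
        if len(ranked) >= k:
            break
        if item not in seen and item not in ranked:
            ranked.append(item)
    return ranked[:k]

# Toy sessions with made-up product IDs (assumption: real sessions are
# chronological lists of product IDs, as described in the problem statement).
sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C", "E"],
    ["A", "B", "C"],
]
next_counts = build_next_item_counts(sessions)
popularity = Counter(item for session in sessions for item in session)

predictions = recommend(["A", "B"], next_counts, popularity)
print(predictions)  # ['C', 'D', 'E']
```

On the real dataset the same function would return up to 100 IDs per session; a learned model would replace the raw counts with predicted scores while keeping the same ranked-list output.
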
**Week 2: `May 29 2023 - June 5 2023`**

* Apply different NLP techniques to preprocess the data and generate embeddings:
    * Lemmatization
    * Stemming
    * Stop word removal
* Use the embeddings as features for your transformer model.
* Implement a pipeline that puts all the elements together, and run your code for a few epochs.
* Use transfer learning techniques to train the model for the underrepresented languages.

**Week 3: `June 6 2023 - June 12 2023`**

* Experiment with different models and tune their hyperparameters.
* Analyse accuracy metrics with graphs and write a conclusion reporting the qualitative and quantitative results.

This is just a tentative timeline; you are free to move at your own pace.

### Resources

* [Introduction to PyTorch](https://www.dataquest.io/blog/pytorch-for-beginners/)
* [Basics of Deep Learning](https://github.com/vlgiitr/DL_Topics)
* [Data Preprocessing](https://www.analyticsvidhya.com/blog/2021/09/essential-text-pre-processing-techniques-for-nlp/)
* [Language Models](https://builtin.com/data-science/beginners-guide-language-models)
* [Transfer Learning](https://towardsdatascience.com/why-should-you-leverage-transfer-learning-14d08a60f616)
* [Dataset](https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge/problems/task-1-next-product-recommendation/dataset_files)\*

\*Please make an account on AIcrowd.com and participate in the challenge to access the dataset.
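
The Week 2 preprocessing steps can be sketched in plain Python to show what tokenisation, stop word removal, and stemming each do to a product title. This is a toy illustration under stated assumptions: the stop-word list and the suffix rules below are hypothetical and deliberately tiny, the crude suffix stripper only stands in for a real stemmer, and lemmatization is not covered. In practice an NLP library with per-language resources would be the more likely choice, and Japanese text needs a dedicated tokenizer because words are not separated by spaces.

```python
import re

# Hypothetical, deliberately tiny English stop-word list; real lists are
# language-specific and much longer.
STOP_WORDS = {"the", "a", "an", "and", "for", "with", "of", "in", "is"}

# Toy suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "s")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(title):
    """Tokenise a product title, remove stop words, then stem each token."""
    return [crude_stem(token) for token in remove_stop_words(tokenize(title))]

print(preprocess("Charging Cables for the Phones"))  # ['charg', 'cable', 'phone']
```

The resulting tokens are what an embedding step would consume; each stage is a separate function so it can be swapped for a library implementation without changing the rest of the pipeline.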