# Datasets for session-based recommender systems
The following datasets are mainly collected from one of the papers mentioned in the course [https://arxiv.org/pdf/1802.05814.pdf](https://arxiv.org/pdf/1802.05814.pdf):
* Netflix
* link: https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download
* This dataset contains 11 columns and 8,807 rows (records).
* It contains information about the type (film/show), title, cast, production year, duration, description of the content, release date, etc.
* Million Song Dataset (MSD) / Million Song Dataset Challenge / Last.fm Dataset
* original paper on the song dataset: https://ismir2011.ismir.net/papers/OS6-1.pdf
* https://cs229.stanford.edu/proj2012/NiuYinZhang-MillionSongDatasetChallenge.pdf
* the paper [https://arxiv.org/pdf/1802.05814.pdf](https://arxiv.org/pdf/1802.05814.pdf) describes the data as "user-song play counts released as part of the Million Song Dataset" and says "We binarize play counts and interpret them as implicit preference data. We only keep users with at least 20 songs in their listening history and songs that are listened to by at least 200 users", but doing so is not straightforward
* http://millionsongdataset.com/pages/getting-dataset/
* contains user-song play counts
* features of the songs are described here: http://millionsongdataset.com/pages/example-track-description/
* MovieLens 1M (ML-1M) / MovieLens 20M (ML-20M)
* MovieLens 1M/20M movie ratings.
* The 1M version contains about 1 million ratings from roughly 6,000 users on roughly 4,000 movies.
* https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset
* Pinterest
* https://github.com/edervishaj/pinterest-recsys-dataset
* https://sites.google.com/site/xueatalphabeta/academic-projects
* contains 468 popular interests on Pinterest with 1,456,540 images and 1,000,000 users who have interacted with these images (info from the original paper: https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Geng_Learning_Image_and_ICCV_2015_paper.pdf)
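
The Million Song Dataset preprocessing quoted above (binarize play counts, keep users with at least 20 songs and songs with at least 200 listeners) could be sketched roughly as follows. This is only an illustration, not the paper's actual code: the function name `binarize_and_filter`, the `(user, song, count)` input format, and the order of the two filters (songs first, then users, in a single pass) are assumptions.

```python
from collections import Counter


def binarize_and_filter(play_counts, min_songs_per_user=20, min_users_per_song=200):
    """Hypothetical helper: turn (user, song, count) triples into implicit feedback.

    Returns a set of (user, song) pairs that survive both filters.
    """
    # Binarize: any positive play count becomes a single (user, song) interaction.
    pairs = {(user, song) for user, song, count in play_counts if count > 0}

    # Keep only songs listened to by at least `min_users_per_song` users.
    users_per_song = Counter(song for _, song in pairs)
    pairs = {(u, s) for u, s in pairs if users_per_song[s] >= min_users_per_song}

    # Keep only users with at least `min_songs_per_user` remaining songs.
    # Assumption: a single pass; song counts are not re-checked afterwards.
    songs_per_user = Counter(user for user, _ in pairs)
    pairs = {(u, s) for u, s in pairs if songs_per_user[u] >= min_songs_per_user}
    return pairs
```

Because the user filter runs after the song filter, some songs may end up below the song threshold again; whether the paper iterates until both conditions hold is not stated in the quote.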
Other datasets:
* LastFM
* http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html
* This dataset contains `<user, artist, plays>` tuples (for ~360,000 users) collected from the Last.fm API
* [Spotify Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
* The evaluation task is automatic playlist continuation: given a seed playlist title and/or an initial set of tracks in a playlist, predict the subsequent tracks in that playlist.
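
To make the playlist continuation task more concrete, here is a minimal sketch of how a playlist might be split into seed tracks and tracks to predict, and how a prediction could be scored. The helper names `split_playlist` and `recall_at_k` are hypothetical, and recall@k is used only as a simple illustrative metric; it is not claimed to be the challenge's official evaluation.

```python
def split_playlist(tracks, n_seed):
    """Split an ordered playlist into (seed tracks, tracks to predict)."""
    if not 0 <= n_seed <= len(tracks):
        raise ValueError("n_seed must be between 0 and the playlist length")
    return tracks[:n_seed], tracks[n_seed:]


def recall_at_k(predicted, held_out, k):
    """Fraction of held-out tracks that appear among the top-k predictions."""
    if not held_out:
        return 0.0
    hits = len(set(predicted[:k]) & set(held_out))
    return hits / len(held_out)


# Example usage with made-up track IDs.
seed, target = split_playlist(["t1", "t2", "t3", "t4"], n_seed=2)
score = recall_at_k(["t3", "t9", "t4"], target, k=2)
```

With `n_seed=0` the seed is empty, which corresponds to the case where only the playlist title is given.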
Also see this table listing datasets for recommender systems:
https://github.com/rn5l/session-rec?tab=readme-ov-file#related-datasets
Datasets from Papers with Code (https://paperswithcode.com/task/session-based-recommendations):
* RetailRocket:
* https://paperswithcode.com/dataset/retailrocket
## Actual session-based datasets
* [YOOCHOOSE](https://www.kaggle.com/code/raceproptit/yoochose/input?select=yoochoose-clicks.dat)
* originally appeared in the RecSys Challenge 2015
* original source: https://2015.recsyschallenge.com/ (could not be accessed because of an HTTP 503 error)
* "The data represents six months of activities of a big e-commerce businesses in Europe selling all kinds of stuff such as garden tools, toys, clothes, electronics and much more."
* includes "clicks" and "buys" data
* Diginetica
* benchmark on Papers with Code: https://paperswithcode.com/sota/session-based-recommendations-on-diginetica
* original source: https://competitions.codalab.org/competitions/11161
* shopping domain
* contains anonymized search and browsing logs, product data, anonymized transactions, and a large dataset of product images
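
Click logs such as the ones listed above are typically turned into per-session item sequences before training a session-based recommender. The sketch below shows one possible way to do this; the `(session_id, timestamp, item_id)` event format and the helper names `build_sessions` and `next_item_examples` are assumptions for illustration and do not describe the exact file layout of any dataset in this list.

```python
from collections import defaultdict


def build_sessions(events):
    """Group (session_id, timestamp, item_id) events into ordered item lists."""
    by_session = defaultdict(list)
    for session_id, timestamp, item_id in events:
        by_session[session_id].append((timestamp, item_id))
    # Sort each session's clicks by timestamp and keep only the item IDs.
    return {
        session_id: [item for _, item in sorted(clicks, key=lambda c: c[0])]
        for session_id, clicks in by_session.items()
    }


def next_item_examples(session):
    """Turn one session into (prefix, next item) training pairs."""
    return [(session[:i], session[i]) for i in range(1, len(session))]


# Example usage with made-up events.
events = [(1, 10, "a"), (2, 5, "x"), (1, 12, "b"), (1, 11, "c")]
sessions = build_sessions(events)
pairs = next_item_examples(sessions[1])
```

Sessions with a single click yield no training pairs, which is one reason very short sessions are often dropped during preprocessing.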
## Other resources
[ACM RecSys Challenge](http://www.recsyschallenge.com/)