# CI2023 Reproducibility Challenge - Team 1 Notes
###### tags: `CI2023` `Reproducibility Challenge`
## Welcome Team 1 👋
This is an online collaborative document written in Markdown. Use it to keep team notes, track your progress, and collect useful resources for reproducing the assigned paper.
:::info
If you are new to HackMD, please see this link: https://hackmd.io/s/features
:::
:::warning
This document is public, please do not share any personal data :warning:
:::
---
## Aim :checkered_flag:
Reproduce the paper titled *PaleoRec: A sequential recommender system for the annotation of paleoclimate datasets*, https://doi.org/10.1017/eds.2022.3.
## Notes :pencil:
### Meeting 08-05-23
#### Goal(s)
- Kickoff, assign roles (Dilini: Data lead, Erica: Technical lead)
- Determine scopes
#### Actions
- Dilini & Erica: set up first notebook (with same train/test split)
- Next meeting: Sunday 2pm
----
## Extras :bulb:
This shared document: https://hackmd.io/@eds-book/B1gUHb67h/edit
### Useful links
- [Tools for Hosting Online Calls](https://the-turing-way.netlify.app/collaboration/event-tools.html?#hosting-online-calls): This section in the Turing Way describes some useful tools to consider for facilitating virtual team meetings.
### Code of conduct
* When interacting with your team and other teams, please remember that the EDS book [Code of Conduct](https://github.com/alan-turing-institute/environmental-ds-book/blob/master/CODE_OF_CONDUCT.md) applies, so that everyone has a safe and healthy space to collaborate!
### Communication channels
To facilitate internal communication among your team, you have a private Slack channel, **#repro-challenge-team-1**.
Please use the **#6-reproducibility-challenge** channel in the CI2023 Slack workspace to collaborate more openly with other teams or ask for support from the [organising committee](https://eds-book.github.io/reproducibility-challenge-2023/details/team.html).
Finally, we encourage you to follow [@Climformatics](https://twitter.com/Climformatics) on Twitter for occasional programme updates and reminders beyond those sent in Slack and via email!
## Personal Notes
### Josh
Architecture:
- `PaleoRec` is a transaction-based sequential recommender system (SRS) with anonymous users, based on the `GRU4Rec` architecture with an `LSTM` layer.
- `PaleoRec` was implemented using PyTorch, and the exact architecture consists of:
    - Input: Label Encoding of all items
    - Embedding Layer [what kind of embeddings?]
    - LSTM Layer
    - Dense Layer
    - Output: Top `k` items
- The model was trained for 100 epochs on the chain that describes how the measurements are made (Chain 1) and for 150 epochs on the longer chain representing the environmental information; all training used a cross-entropy loss function.
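
The input and output ends of the architecture listed above can be illustrated without PyTorch. This is a minimal sketch, not the authors' code: the helper names (`build_label_encoder`, `encode`, `top_k_items`), the toy item names, and the score vector are all made up for illustration. It assumes that "label encoding" means mapping each distinct item to an integer id, and that the dense layer yields one score per item, from which the top `k` are returned.

```python
def build_label_encoder(sequences):
    """Map each distinct item to an integer id, in order of first appearance."""
    item_to_id = {}
    for seq in sequences:
        for item in seq:
            if item not in item_to_id:
                item_to_id[item] = len(item_to_id)
    return item_to_id


def encode(seq, item_to_id):
    """Turn a sequence of items into the integer ids fed to the embedding layer."""
    return [item_to_id[item] for item in seq]


def top_k_items(scores, id_to_item, k):
    """Return the k items with the highest scores, best first."""
    ranked_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [id_to_item[i] for i in ranked_ids[:k]]


# Hypothetical annotation sequences (not taken from the LiPDverse data).
sequences = [
    ["marine sediment", "foraminifera", "d18O"],
    ["marine sediment", "foraminifera", "Mg/Ca"],
    ["coral", "d18O"],
]

item_to_id = build_label_encoder(sequences)
id_to_item = {i: item for item, i in item_to_id.items()}

encoded = encode(sequences[1], item_to_id)

# Stand-in for the dense layer output: one made-up score per item id.
fake_scores = [0.05, 0.10, 0.50, 0.30, 0.05]
top2 = top_k_items(fake_scores, id_to_item, 2)

print(encoded)  # [0, 1, 3]
print(top2)     # ['d18O', 'Mg/Ca']
```

In the real model the scores would come from the embedding, LSTM, and dense layers rather than a hard-coded list.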
Data:
- Training data is obtained from 1,996 datasets available through [LiPDverse](https://lipdverse.org), which themselves result from 4 major compilations (PAGES2k; TEMP12k; ISO2k; and PALMOD).
- Prior to training, data was cleaned/harmonized.
- 5,478 sequences were used for training and another 1,034 for testing.
Evaluation metric:
HR (Hit Ratio, i.e. recall) is the fraction of test examples whose ground truth item appears in the $k$-size list of ranked items:
$$HR@k = \frac{1}{M}\sum_{u\in U}{\mathbb{1}(R_{u,g_{u}} \leq k)}$$
where $U$ is the set of test examples, $M$ is the total number of examples in the test dataset, $k$ is the size of the recommendation set returned by PaleoRec, $g_u$ is the ground truth item for example $u$, $R_{u,g_u}$ is the rank of the ground truth item in the recommendation set, and $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition holds, 0 otherwise).
MRR (Mean Reciprocal Rank) takes into account the position of the ground truth item in the ranked list: it averages, over all test examples, the reciprocal of the rank at which the ground truth item was placed:
$$MRR@k = \frac{1}{M}\sum_{u\in U}{\frac{1}{R_{u,g_{u}}}}$$
PaleoRec was evaluated for $k \in \{3, 5, 7, 10, 12, 14, 16\}$.
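
Both metrics can be computed directly from ranked recommendation lists. This is a minimal sketch in plain Python; the function names and the toy data are hypothetical. One assumption goes beyond the formulas as written: for $MRR@k$, a ground truth item ranked below position $k$ (or missing from the list) contributes 0, which is a common convention but should be checked against the paper's code.

```python
def rank_of_truth(ranked_items, truth):
    """Return the 1-based rank of `truth` in `ranked_items`, or None if absent."""
    try:
        return ranked_items.index(truth) + 1
    except ValueError:
        return None


def hit_ratio_at_k(recommendations, truths, k):
    """Fraction of examples whose ground truth item is ranked within the top k."""
    hits = 0
    for ranked, truth in zip(recommendations, truths):
        rank = rank_of_truth(ranked, truth)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(truths)


def mrr_at_k(recommendations, truths, k):
    """Mean reciprocal rank; assumes ranks beyond k (or misses) contribute 0."""
    total = 0.0
    for ranked, truth in zip(recommendations, truths):
        rank = rank_of_truth(ranked, truth)
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(truths)


# Toy example with made-up items: the ground truth is ranked 1st, 2nd, 3rd,
# and not at all in the four recommendation lists.
recs = [
    ["d18O", "Mg/Ca", "TEX86"],
    ["Mg/Ca", "d18O", "TEX86"],
    ["TEX86", "Mg/Ca", "d18O"],
    ["Mg/Ca", "TEX86", "alkenone"],
]
truths = ["d18O", "d18O", "d18O", "d18O"]

print(hit_ratio_at_k(recs, truths, 1))  # 0.25
print(hit_ratio_at_k(recs, truths, 3))  # 0.75
print(mrr_at_k(recs, truths, 2))        # 0.375
```

Looping `k` over the values listed above would give one HR and one MRR value per cutoff for comparison with the paper.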