# Metatrain hackathon
GOAL: make it easier to use and develop metatrain
NO NEW FEATURES, refactor and documentation
## Organization
- Tue, 10am: explain the concept & distribute work
- Tue, 1pm: pizza
- Tue, 5pm: wrap-up
- Wed, 10am: start, stand up, distribute new work as needed
- Wed, 1pm: burgers
- Wed, 5pm: wrap-up
## Needs to happen before the hackathon
- figure out remote attendance?
- zoom with breakout rooms
- pizza remote?
- merge scaler PR
- maybe the new dataloader?
## Tasks
### Review needed
- https://github.com/metatensor/metatrain/pull/792 # this one is waiting for #802 to add forces
- [merged] https://github.com/metatensor/metatrain/pull/798
- https://github.com/metatensor/metatrain/pull/799 [name=Sofiia]
- https://github.com/metatensor/metatrain/pull/801
- [merged] https://github.com/metatensor/metatrain/pull/802 [name=Joe]
- [merged] https://github.com/metatensor/metatrain/pull/804
### Code tasks
- [name=johannes will not investigate further, we don't care for now] Look into repository size and try to reduce it
- apparently the documentation is included
- some branches contain big data files, like `src/metatrain/experimental/phace/modules/physical_LE/eigenvectors.npy` (15 MiB) on `origin/hartmut-cgnet`, `origin/hartmut-transformer`, `origin/phace`, `origin/phace-hartmut`, or `tests/resources/water_4fs.zip` (11 MiB) on `origin/flashmd-with-pet`
- [name=Rohit] overall repo size is large, ~124MB
- Deprecate NanoPET [name=joe - done]
- [name=johannes - done] Remove old PET architecture
- [name=joe - done] Update architecture to use new CompositionModel
- [name=joe - done] and then remove the old one
- [name=joe - done] Remove the old loss functions
- [name=Rocco] Check why CSCS CI takes more than 1h when the main one takes ~10min
- :white_check_mark: Reduce number of `pytest` workers
- :white_check_mark: Run `tox` in-memory
- :clock1: Ensure CUDA is used by the `tox` tests; ARM+CUDA CI is too brittle
- :white_check_mark: Updated `torch` and CUDA versions
- :clock1: Add check that GPU is used
- [name=Sofiia, in review] Collect and show code coverage for architectures
- Then try to improve it as relevant
- Change tests using `mtt train` & co to run in the same process instead of `subprocess.run()`. Maybe some `mtt_run(*args)` function? [name=Guillaume]
- [Setup regression tests on CSCS](https://github.com/metatensor/metatrain/issues/705)
- Look through all code in `utils`, and add/improve docstrings [name=Pol] and [name=Paolo] will do
- Update type hints to use builtin types: `list[int]` instead of `List[int]` [name=Qianjun - to do when there won't be merge conflicts]
- [name=johannes wondering what's going on - done, waiting for 786 by rohit] Only train models once when building docs?
- [name=johannes - done] Remove arg 'sliding_factor' from loss definitions: https://github.com/metatensor/metatrain/issues/542
- Change minimal supported Python to v3.10 now that 3.9 is EOL
### Documentation tasks
- Overall proposed new structure:
- Installation
- Getting started
- Quick Start
- Configuration and Units
- Available Architectures
- Tutorials
- Beginner
- Advanced
- Concepts and Design
- Citing
- FAQ
- Developer documentation
- Create tutorial categories (beginner/intermediate/expert)
- [name=Hanna] and [name=Cesare] will do this
- Rethink the "getting started" section: move tutorials from getting started to beginner tutorials, e.g. split "Advanced Base Configuration" into smaller pieces such as "choose a device and precision", "run a reproducible training run", "use wandb for training logging", etc.
- [name=Hanna] and [name=Cesare] can handle the restructuring of tutorials (both tasks above)
- Add new tutorials:
- training a MLIP from scratch
- [name=Qianjun, done]
- link to the cookbook fine-tuning example from metatrain
- comparing different architectures on a single dataset
- visualization
- [name=Markus]
- run metatrain with lammps? or at least link to it somewhere <= should be a link to the metatomic examples
- [name=Egor] data validation with parity plots for energies and forces
- Go over the hyperparameter reference and improve it (at least for PET/SoapBPNN)
- [name=Raymond] and [name=Alessandro] will handle hyperparameter doc improvement for SOAP-BPNN and LLPR -- DONE
- Create training decision tree (see below)
- [name=Rohit] consider the super fancy https://twinery.org
- [name=Rohit] finds the state of interactive JS storytelling to be quite ugly, between sphinx-needs (draft PR) and the "story driven" renpy js / inky / twinery / monogatari / etc which seem to be geared at a.. different set of people...
- maybe even just a sphinx-design based dropdown / tab thing...
- am experimenting with a revealjs style presentation per user story thing
- TOX has a flowchart -> https://tox.wiki/en/latest/user_guide.html
- [name=Philip] Documentation page about the supported data formats
- Look at the overall doc organization and decide on changes to be made
- Create FAQ/troubleshooting section
- [name=Cesare] and [name=Hanna] will create this and first questions, then feel free to edit
- Add questions to FAQ
- Restructure the examples directory.
- [name=Philip], [name=Cesare], and [name=Hanna] will look at this on 8.10.2025, morning or afternoon. Let's see
- [name=Rohit] would like to help
- [name=Rohit] has several questions about the design and CLI guidelines, but these are wider restructures, e.g. using `uv` or something similar to dispatch so that the dependencies of architectures are decoupled from the main set, or using `pydantic` for schema-based option validation and documentation. `tox -e tests` takes a long time and uses a lot of resources; can we have a smaller subset?
- [name=Rohit] will update contributing
- in particular, how do we want people to tell us / contribute that they use `metatrain` ?
- xref `Contributing.rst`: "The first and best way to contribute to metatrain is to use it and advertise it"
### Maybe?
- metatensor.org landing page
- [name=Michele] first working version
- DNS records broken and then fixed
- Google search console property created
- move metatensor docs to docs.metatensor.org/metatensor/?
- move metatrain documentation to docs.metatensor.org/metatrain/
## Needs to happen AFTER the hackathon
- Anything not done needs to be an issue
-----------------
# Random notes
## User stories
- I'm a new student, and I want to go from `ssh kuma` to training start in 1h
- I'm an expert, I want to find the name of a parameter so I can change it myself
- I'm a model developer, I have a finished and published model I want to put in metatrain
- (NOT CURRENTLY A GOAL) I'm a model developer, I have an idea about a new model I want to develop with metatrain
### New documentation
#### categorize tutorials into beginner/advanced
- getting started
- tutorials
- beginner:
- train new MLIP from scratch [name=Qianjun, done]
- fine tune PET-MAD [name=Markus]
- advanced
- compare architectures on a given dataset
- hyper-parameters sweeping [name=Rohit]
- advanced concepts
- multi-gpu training [name=Qianjun]
#### Currently supported data inputs, help people prepare data
- how to prepare input data from DFT data: create an xyz file with the array fields (per-atom data), info fields (per-structure data), etc.
- Maybe use a DiskDataset if the dataset is BIG
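
To make the xyz preparation step concrete, here is a minimal sketch that formats one structure in the extended XYZ layout (atom count, a comment line with `Lattice`, `Properties` and per-structure keys, then one line per atom). The key names `energy` and `forces`, the units, and the numbers are assumptions made up for illustration, not DFT results; the keys would have to match whatever the training options refer to.

```python
def format_extxyz_frame(symbols, positions, forces, cell, energy):
    """Return one extended XYZ frame as a string.

    symbols: chemical symbols, e.g. ["O", "H", "H"]
    positions, forces: one (x, y, z) tuple per atom
    cell: three lattice vectors, each an (x, y, z) tuple
    energy: total energy of the structure
    """
    if not (len(symbols) == len(positions) == len(forces)):
        raise ValueError("symbols, positions and forces must have the same length")

    # Lattice is written as the nine cell components, vector by vector
    lattice = " ".join(f"{x:.8f}" for vector in cell for x in vector)
    comment = (
        f'Lattice="{lattice}" '
        # per-atom columns: symbol, 3 position components, 3 force components
        "Properties=species:S:1:pos:R:3:forces:R:3 "
        # per-structure values; periodic in all three directions is assumed
        f'energy={energy:.8f} pbc="T T T"'
    )

    lines = [str(len(symbols)), comment]
    for symbol, pos, force in zip(symbols, positions, forces):
        numbers = " ".join(f"{x:.8f}" for x in (*pos, *force))
        lines.append(f"{symbol} {numbers}")
    return "\n".join(lines) + "\n"


# Made-up values for an H2 molecule in a 10 x 10 x 10 box
frame = format_extxyz_frame(
    symbols=["H", "H"],
    positions=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)],
    forces=[(0.0, 0.0, 0.1), (0.0, 0.0, -0.1)],
    cell=[(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)],
    energy=-31.7,
)
```

Concatenating several such frames gives a multi-structure file. In practice a library such as ASE can write this format directly; the hand-written version is only meant to show which fields end up where.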
#### model training decision tree
- What do you want to do?
- Run MD
- Predict properties of a dataset
- Train a new potential
- What’s your target?
- Energy only
- Energy + forces
- Long-range properties (dipoles, charges, etc.)
- What’s your training data situation?
- I have no data :(
- I have a small dataset (<10k structures)
- I have a large dataset
- What resources do you have?
- CPU only
- Single GPU
- Multiple GPUs / HPC cluster
- Map to recommendations: based on the answers, guide them to:
- Use case: e.g. “Fine-tune an existing pretrained water potential”
- Suggested method: e.g. “Try architecture X with descriptor Y”
- Next step in docs: link to tutorial or config template.
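
One possible way to prototype the decision tree before choosing a rendering tool is a plain data structure plus a lookup function. This is only a sketch: the answer keys are invented shorthand for the options listed above, and the rules and returned texts are placeholders, not actual recommendations.

```python
# Questions and allowed answers, mirroring the list above (keys are made up)
QUESTIONS: dict[str, list[str]] = {
    "goal": ["run_md", "predict_properties", "train_new_potential"],
    "target": ["energy", "energy_forces", "long_range"],
    "data": ["none", "small", "large"],
    "resources": ["cpu", "single_gpu", "multi_gpu"],
}


def recommend(answers: dict[str, str]) -> dict[str, str]:
    """Map a full set of answers to a (placeholder) recommendation."""
    for question, options in QUESTIONS.items():
        if answers.get(question) not in options:
            raise ValueError(f"{question!r} must be one of {options}")

    # Placeholder rule based on the data situation only; the real tree
    # would also branch on goal, target and resources.
    if answers["data"] == "none":
        use_case = "use an existing pretrained model"
    elif answers["data"] == "small":
        use_case = "fine-tune an existing pretrained model"
    else:
        use_case = "train a new model from scratch"

    return {
        "use_case": use_case,
        "suggested_method": "architecture X with descriptor Y (placeholder)",
        "next_step": "link to tutorial or config template (to be filled in)",
    }


example = recommend(
    {
        "goal": "train_new_potential",
        "target": "energy_forces",
        "data": "small",
        "resources": "single_gpu",
    }
)
```

Keeping the questions as data means the same table could later feed a sphinx-design dropdown, a flowchart, or a small interactive page without rewriting the content.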
#### FAQ, how to troubleshoot common issues
- Training diverges
- data creation
- convergence issues
- GPU out of memory