# MSc Summer 2025 - Saniya
###### tags: `aeon-msc`
__Contributor:__ Saniya Inamdar
__Project:__ Matrix Profile
__Project length:__ 12 Weeks
__Thesis submission:__
__Regular meeting time:__
## Project objectives
1. Conduct a systematic review of MP theory and its current applications in time-series data mining.
2. Benchmark existing MP implementations (STUMPY, tsmp) against aeon’s univariate wrapper.
3. Identify and document functional gaps in aeon’s MP module.
4. Engineer MP-based feature vectors that capture discriminative motifs and discords in preprocessed EEG epochs.
5. Compare HIVE-COTE 2.0 and MultiROCKET Hydra (MP features) with deep nets (EEGNet, Bi-LSTM, PatchTST).
6. Evaluate interpretability through an expert survey of surfaced motifs/discords.
7. Report findings with rigorous statistical testing and a reflexive discussion of clinical implications.
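Objectives 3 and 4 turn on what a matrix profile actually exposes. As a reference point, here is a brute-force sketch (an O(n²·m) illustration for intuition only, not the FFT-based implementations benchmarked in objective 2; `naive_matrix_profile` and the planted-motif setup are made up for this example):

```python
import numpy as np

def naive_matrix_profile(x, m):
    """Brute-force matrix profile: for each length-m window, the
    z-normalised Euclidean distance to its nearest non-trivial match."""
    n = len(x) - m + 1
    windows = np.lib.stride_tricks.sliding_window_view(x, m).astype(float)
    # z-normalise every sliding window so matches are offset/scale invariant
    windows = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(
        axis=1, keepdims=True
    )
    mp = np.full(n, np.inf)
    excl = max(1, m // 2)  # exclusion zone around i to ignore trivial matches
    for i in range(n):
        d = np.linalg.norm(windows - windows[i], axis=1)
        d[max(0, i - excl) : i + excl + 1] = np.inf
        mp[i] = d.min()
    return mp

rng = np.random.default_rng(0)
x = rng.normal(size=300)
x[50:70] = x[200:220] = np.sin(np.linspace(0, 4 * np.pi, 20))  # planted motif pair
mp = naive_matrix_profile(x, m=20)
motif = int(np.argmin(mp))    # low MP value -> repeated pattern (motif)
discord = int(np.argmax(mp))  # high MP value -> most anomalous window (discord)
```

Minima of the profile locate motif candidates and maxima locate discords; those indices, plus the profile values themselves, are the raw material for the feature vectors in objective 4.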
## Project Summary
__Tony availability:__
- Away June 24th to July 6th
- Away July 26th to August 3rd
## Week 1: June 12th
- Sign up to the aeon and tsml Slacks :+1:
- GitHub profile to Tony: CodeFor2001 :+1:
- Fork aeon :+1:
- Read the contributors' guide :+1:
- Survey MP code in aeon
- Register to use Iridis
- tsml-eval
  - https://github.com/time-series-machine-learning/tsml-eval
- Start background review of the matrix profile for classification
- Friday 10:30 am regular meetings
## Week 4: July 4th
Survey of code and algorithms is ongoing; the next stage is to contribute to aeon.
## Good first issue (to exercise the contribution pipeline)
| Line | Issue | Fix |
| --- | --- | --- |
| 23 | “winodw” typo | change to “window” |
| 17 | “th stumpy tutorial” | change to “the STUMPY tutorial” |
```diff
@@
- Length of the sliding winodw for the matrix profile calculation.
+ Length of the sliding window (subsequence) in samples.
@@
- For more information on the matrix profile, see `th stumpy tutorial
- <https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html>`
+ For an introduction to matrix profiles, see `the STUMPY tutorial
+ <https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html>`_.
```
#### Adding a unit test (*optional*)

```python
# tests/transformations/series/test/test_matrix_profile_shape.py
import numpy as np

from aeon.transformations.series import MatrixProfileSeriesTransformer


def test_mp_series_length():
    x = np.arange(20)
    m = 5
    mp = MatrixProfileSeriesTransformer(window_length=m).fit_transform(x)
    # the matrix profile has one entry per subsequence position
    assert mp.shape == (len(x) - m + 1,)
```
## Recommendation for a Single MatrixProfile Implementation in aeon
I believe MatrixProfileSeriesTransformer is the better choice:
1. **Performance & scalability.** stumpy’s FFT-based core routinely runs orders of magnitude faster than a pure-Python loop, especially on long series or large sliding windows. It also supports multi-threading and optional GPU back-ends, giving users room to scale without extra aeon maintenance effort.
2. **Maintenance burden.** Leveraging an actively maintained library externalizes most low-level optimizations and bug fixes. aeon contributors can focus on high-level API consistency instead of duplicating complex signal-processing code.
3. **Logical clarity.** A series transformer is the natural primitive; broadcasting it across collections is already supported by aeon’s estimator framework. This keeps the API consistent with scikit-learn conventions and avoids a parallel “collection-only” path.
4. **Feature completeness & proven correctness.** stumpy is the de facto reference implementation in the Python ecosystem. Aligning with it ensures numerical robustness and makes it easier for users to cross-validate results against the wider literature.
5. **Future extensions.** Maintaining a single, lean wrapper simplifies adding flags for GPU use, different distance metrics, or incremental/streaming variants, all developments already on stumpy’s roadmap.
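The "logical clarity" argument can be made concrete: once a per-series transform exists, the collection-level version is just an application of it to each series. A minimal sketch (`series_transform` is a hypothetical stand-in, not aeon's API; aeon's estimator framework performs this broadcasting internally):

```python
import numpy as np

def series_transform(x, m=10):
    # Hypothetical per-series transform; a real implementation would
    # compute the matrix profile, but a rolling std suffices to show shape.
    return np.lib.stride_tricks.sliding_window_view(x, m).std(axis=1)

def broadcast_over_collection(X, m=10):
    # Collection behaviour falls out of the series primitive: apply it
    # independently to each series in X.
    return np.stack([series_transform(x, m) for x in X])

X = np.random.default_rng(1).normal(size=(8, 100))  # 8 series of length 100
out = broadcast_over_collection(X)
# out.shape == (8, 100 - 10 + 1)
```

Because the loop is the only collection-specific code, there is nothing for aeon to maintain beyond the series primitive itself.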
## Implementation of the research paper
1. Set up the project structure
- Made folders for raw data, intermediate results, processed results, and logs.
- Created a config file so all settings (filters, epoch length, etc.) are in one place.
2. Made reproducibility tools
- Every time we run the code, it saves the date, settings used, and software versions so we can always repeat exactly the same run later.
3. Wrote a dataset indexer
- A script that will scan the MODMA EEG dataset when we get it.
- It will list all subjects, file locations, and whether they are MDD or healthy.
- Output is a CSV we can feed into the rest of the pipeline.
4. Preprocessing scripts
- Filter: Keeps only brain‑relevant frequencies (2–47 Hz) and removes slow drifts & high‑frequency noise.
- Resample: Changes data speed to 200 Hz so all recordings are consistent.
- Re‑reference: Re‑aligns voltage readings so all channels use a common baseline.
- Quality check: Finds and flags bad data (spikes, flatlines, or too much noise).
- Epoching: Cuts the first 6 minutes into 20.48‑second chunks and keeps the first 10 clean ones.
5. Feature extraction (what we measure from the EEG)
- pMP (in‑phase matrix profile): Measures how similar brain-signal patterns are within each channel; the main method from the research paper.
- HFD (Higuchi fractal dimension): Measures waveform complexity; used as the baseline in the paper.
6. Feature aggregation
- For each channel, take the median feature value across the 10 clean chunks.
- Save everything into a CSV with one row per subject, ready for statistics and machine learning.
7. Sanity‑tested the code
- Ran everything on fake/simulated EEG signals to check that each step works.
- No errors; ready to run on the real dataset when approved.
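Step 6 reduces to a single median over the epoch axis. A minimal sketch with assumed shapes (10 clean epochs and a 128-channel montage are assumptions for illustration; the random values stand in for real pMP/HFD features):

```python
import numpy as np

# Assumed shape: 10 clean epochs x 128 channels, one feature value
# (e.g. pMP or HFD) per (epoch, channel) pair; random stand-in values.
rng = np.random.default_rng(42)
features = rng.normal(loc=0.5, scale=0.1, size=(10, 128))

# Median over the epoch axis: one robust value per channel, i.e. a
# single 128-dimensional row per subject for the output CSV.
subject_row = np.median(features, axis=0)
```

The median (rather than the mean) keeps a single residually noisy epoch from distorting a subject's channel value.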
Start drafting the thesis:
- EEG classification: background, techniques used, EEG classification methods; highlight the matrix profile, with full details (don't assume the marker has ever heard of it); describe how it is being used; reproduction of results.
- Comparison of different algorithms.

Work on the next chapter: description and implementation -> correctness evaluation -> experimental evaluation (new dataset: describe it and how the algorithms are useful; build a simplified version of the dataset; run the comparison on that dataset, comparing to other classifiers).
Set up for the bigger experiment:
- demo algorithm, implementation, evaluation