# MSc Summer 2025 - Saniya

###### tags: `aeon-msc`

__Contributor:__ Saniya Inamdar
__Project:__ Matrix Profile
__Project length:__ 12 Weeks
__Thesis submission:__
__Regular meeting time:__

## Project objectives

1. Conduct a systematic review of matrix profile (MP) theory and its current applications in time-series data mining.
2. Benchmark existing MP implementations (STUMPY, tsmp) against aeon's univariate wrapper.
3. Identify and document functional gaps in aeon's MP module.
4. Engineer MP-based feature vectors that capture discriminative motifs and discords in preprocessed EEG epochs.
5. Compare HIVE-COTE 2.0 and MultiROCKET-Hydra (with MP features) against deep networks (EEGNet, Bi-LSTM, PatchTST).
6. Evaluate interpretability through an expert survey of surfaced motifs/discords.
7. Report findings with rigorous statistical testing and a reflexive discussion of clinical implications.

## Project Summary

__Tony availability:__
- Away June 24th to July 6th
- Away July 26th to August 3rd

## Week 1: June 12th

- Sign up to aeon and tsml Slacks :+1:
- GitHub profile to Tony: CodeFor2001 :+1:
- Fork aeon :+1:
- Read the contributors guide :+1:
- Survey MP code in aeon
- Register to use Iridis
- tsml-eval - https://github.com/time-series-machine-learning/tsml-eval
- Start background review of the matrix profile for classification
- Friday 10:30 am regular meetings

## Week 4: July 4th

Survey of code and algorithms ongoing; the next stage is to contribute to aeon.

## Good first issue (to exercise the contribution pipeline)

| Line | Issue | Fix |
|------|-------|-----|
| 23 | "winodw" typo | change to "window" |
| 17 | "th stumpy tutorial" | change to "the STUMPY tutorial" |

```diff
@@
-Length of the sliding winodw for the matrix profile calculation.
+Length of the sliding window (subsequence) in samples.
@@
-For more information on the matrix profile, see `th stumpy tutorial
-<https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html>`
+For an introduction to matrix profiles, see `the STUMPY tutorial
+<https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html>`_.
```

#### Adding a unit test (*optional*)

```python
# tests/transformations/series/test/test_matrix_profile_shape.py
import numpy as np

from aeon.transformations.series import MatrixProfileSeriesTransformer


def test_mp_series_length():
    x = np.arange(20)
    m = 5
    mp = MatrixProfileSeriesTransformer(window_length=m).fit_transform(x)
    assert mp.shape == (len(x) - m + 1,)
```

## Recommendation for a Single MatrixProfile Implementation in aeon

I believe `MatrixProfileSeriesTransformer` is the better choice.

1. **Performance & scalability.** stumpy's FFT-based core routinely runs orders of magnitude faster than a pure-Python loop, especially on long series or large sliding windows. It also supports multi-threading and optional GPU back-ends, giving users room to scale without extra aeon maintenance effort.
2. **Maintenance burden.** Leveraging an actively maintained library externalizes most low-level optimizations and bug fixes. aeon contributors can focus on high-level API consistency instead of duplicating complex signal-processing code.
3. **Logical clarity.** A series transformer is the natural primitive; broadcasting it across collections is already supported by aeon's estimator framework. This keeps the API consistent with scikit-learn conventions and avoids a parallel "collection-only" path.
4. **Feature completeness & proven correctness.** stumpy is the de facto reference implementation in the Python ecosystem. Aligning with it ensures numerical robustness and makes it easier for users to cross-validate results against the wider literature.
5. **Future extensions.** Maintaining a single, lean wrapper simplifies adding flags for GPU use, different distance metrics, or incremental/streaming variants, all developments already on stumpy's roadmap.

## Implementation of research paper

1. Set up the project structure
   - Made folders for raw data, intermediate results, processed results, and logs.
   - Created a config file so all settings (filters, epoch length, etc.) are in one place.
2. Made reproducibility tools
   - Every run of the code saves the date, settings used, and software versions, so exactly the same run can be repeated later.
3. Wrote a dataset indexer
   - A script that will scan the MODMA EEG dataset when we get it.
   - It will list all subjects, file locations, and whether they are MDD or healthy.
   - Output is a CSV we can feed into the rest of the pipeline.
4. Preprocessing scripts
   - Filter: keeps only brain-relevant frequencies (2-47 Hz) and removes slow drifts and high-frequency noise.
   - Resample: resamples to 200 Hz so all recordings are consistent.
   - Re-reference: re-aligns voltage readings so all channels use a common baseline.
   - Quality check: finds and flags bad data (spikes, flatlines, or excessive noise).
   - Epoching: cuts the first 6 minutes into 20.48-second chunks and keeps the first 10 clean ones.
5. Feature extraction (what we measure from the EEG)
   - pMP (in-phase matrix profile): measures how similar brain-signal patterns are within each channel; the main method from the research paper.
   - HFD (Higuchi fractal dimension): measures waveform complexity; used as the baseline in the paper.
6. Feature aggregation
   - For each channel, take the median feature value across the 10 clean chunks.
   - Save everything into a CSV with one row per subject, ready for statistics and machine learning.
7. Sanity-tested the code
   - Ran everything on fake/simulated EEG signals to check that each step works.
   - No errors; ready to run on the real dataset when approved.

Start drafting the thesis:
- EEG classification: background and techniques used; highlight the matrix profile, with full details (don't assume the marker has ever heard of it); describe how it is being used; reproduction of results.
- Comparison of different algorithms.

Work on the next chapter: description and implementation -> correctness evaluation -> experimental evaluation (new dataset: describe it, explain how the algorithms are useful, build a simplified version of the dataset, run the comparison on it, and compare against other classifiers).

Set-up for the bigger experiment: demo algorithm, implementation, evaluation.
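Since the thesis should not assume the marker has seen a matrix profile before, a minimal brute-force sketch of the idea may help: for every length-`m` subsequence, record the z-normalised Euclidean distance to its nearest non-trivial match. This is an O(n²) illustration in pure NumPy; it is *not* the FFT-accelerated STOMP algorithm used by stumpy/aeon, and the names (`naive_matrix_profile`, the injected discord) are ours, not library API.

```python
import numpy as np


def znorm(a):
    """Z-normalise a subsequence; constant subsequences map to zeros."""
    sd = a.std()
    return (a - a.mean()) / sd if sd > 0 else np.zeros_like(a)


def naive_matrix_profile(x, m):
    """Brute-force matrix profile: distance from each length-m subsequence
    to its nearest match outside a trivial-match exclusion zone."""
    n = len(x) - m + 1
    subs = np.array([znorm(x[i:i + m]) for i in range(n)])
    mp = np.full(n, np.inf)
    excl = m // 2  # skip neighbours that trivially match themselves
    for i in range(n):
        for j in range(n):
            if abs(i - j) > excl:
                d = np.linalg.norm(subs[i] - subs[j])
                if d < mp[i]:
                    mp[i] = d
    return mp


x = np.sin(np.linspace(0, 8 * np.pi, 200))
x[100:110] += 2.0  # inject an anomaly (discord)
mp = naive_matrix_profile(x, m=20)
discord = int(np.argmax(mp))  # the profile peaks on subsequences covering the anomaly
```

Repeated motifs show up as low points of `mp` (they have a close match elsewhere), while the injected discord shows up as the peak, which is exactly the property the MP-based EEG features exploit.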
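The HFD baseline from the feature-extraction step can likewise be sketched in a few lines. This follows Higuchi's standard construction (fit the slope of log curve-length against log 1/k); the function name and the `kmax` default are our own choices for illustration, not the project's actual implementation.

```python
import numpy as np


def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension of a 1-D signal.

    For each scale k, average the normalised lengths of the k subsampled
    curves x[m], x[m+k], ...; the slope of log L(k) vs log(1/k) is the
    dimension: near 1 for smooth curves, near 2 for white noise.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    lks = []
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):
            idx = np.arange(m, n, k)  # subsampled curve at offset m, step k
            if len(idx) < 2:
                continue
            length = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / ((len(idx) - 1) * k)  # Higuchi normalisation
            lm.append(length * norm / k)
        lks.append(np.mean(lm))
    k_vals = np.arange(1, kmax + 1)
    slope, _ = np.polyfit(np.log(1.0 / k_vals), np.log(lks), 1)
    return slope


rng = np.random.default_rng(0)
fd_sine = higuchi_fd(np.sin(np.linspace(0, 10 * np.pi, 1000)))  # close to 1
fd_noise = higuchi_fd(rng.standard_normal(1000))                # close to 2
```

Running it on a smooth sine versus white noise gives values near the theoretical extremes of 1 and 2, a quick sanity check before applying it to EEG epochs.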