# Intern Summer 2024 - Ivan and Shapelets
###### tags: `aeon-intern`
__Contributor:__ Ivan Knyazev
__Project:__ XAI for shapelets
__Project length:__ 8 Weeks
__Mentors:__ Tony Bagnall, Matthew Middlehurst
__Start date:__ Tuesday, July 23rd
__End date:__ Monday, September 16th
__Regular meeting time:__ Friday, ??
## Project Summary
- Weeks 1-2: Familiarisation with open source, aeon and the visualisation module. Make a contribution for a good first issue.
- Weeks 3-4: Understand the shapelet transform algorithm, engage in ongoing discussions about possible improvements, and run experiments to create predictive models for a test data set
- Weeks 5-6: Design and prototype visualisation tools for shapelets, involving a range of summary measures and visualisation techniques, including plotting shapelets on training data, calculating frequency, and measuring similarity between shapelets
- Weeks 7-8: Debug, document and make PRs to merge contributions into the aeon toolkit.
## Project Timeline
## Getting started tasks
- [x] Introduce yourself in the community Slack channels. Use __#introductions__ to introduce yourself to the wider community if you have not already and __#summer-2024__ to introduce yourself and your project to other students and mentors.
- [x] Go through the contributor guide on the _aeon_ website (https://www.aeon-toolkit.org/en/stable/contributing.html).
- [x] Set up a development environment, including _pytest_ and _pre-commit_ dependencies. This will make development a lot easier for you, as you must pass the PR tests to have your code merged (https://www.aeon-toolkit.org/en/stable/developer_guide/dev_installation.html).
- [ ] Review some of the important dependencies for developing aeon at a basic level:
- [x] __scikit-learn__ the interface aeon estimators extend from. We aim to keep as compatible as possible with sklearn tools.
- [x] __pytest__ for unit testing. Any code added will have to be covered by tests.
- [x] __sphinx/myst__ for documentation. New functions and classes will have to be added to the API docs.
- [ ] __numba__ for writing efficient functions.
- [x] Make some basic Pull Requests (PRs) to gain some experience with contributing to _aeon_ through GitHub.
- [x] Add the project time line objects to this document.
# Make notes of progress here
- 21 PRs
- 14 Issues
- 10 Medium articles totalling over 15,000 words and 200 views
## Week 1: 22nd July
Question for next meeting: an email from workexp said I can't work after 9th September. Does this make the internship 7 weeks long?
1. Find good first issue in aeon, make first contribution
- I found some minor typos in the getting started guide and corrected them. Opened a PR, hopefully following all the required conventions.
- Questions: I saw there are coding standards, so I would like to know if there are expectations on commit messages. Also, I am not clear on how to add myself to the contributors page. Finally, after opening the issue I attempted to assign myself by commenting 'Aeon-Assign bot assign @IRKnyazev', which did not work.
2. Install tsml-eval
- Question: Do I downgrade aeon? I got an error when trying to build the docs locally (first I ran `pip install .[docs]`), which said: tsml-eval 0.4.0 requires a version of aeon that is less than 0.10.0 and at least 0.9.0
- Forked the tsml-eval git repo and cloned it in my conda env https://github.com/IRKnyazev/tsml-eval
- Question: Not sure if I was required to simply pip install this dependency? (I did this too)
3. Background reading on shapelets. Question: How should I approach reading these papers, and what should I focus on understanding?
- I am approaching this by reading them in order of publication. Question: any recommendations for approaching the 70-page paper?
- Ye, L., Keogh, E. Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Disc 22, 149–182 (2011). https://doi.org/10.1007/s10618-010-0179-5
- Lines, J., Davis, L., Hills, J. and Bagnall, A. A shapelet transform for time series classification. KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2012) https://doi.org/10.1145/2339530.2339579
- Bagnall, A., Lines, J., Bostrom, A., Large, J. and Keogh, E. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, Volume 31, pages 606–660 (2017) https://doi.org/10.1007/s10618-016-0483-9
4. Check I can log on to the cluster
- Installed MobaXterm and connected to the server node via iridis5_a using my University credentials
- Verified that it works by loading up MATLAB
- Wrote a [medium article](https://medium.com/@vanya.knyazev/an-interns-meditations-the-coming-months-5ddeb282e9b4?source=your_stories_page-------------------------------------) introducing myself and the internship
## Week 2: 29th July
Sidenote: Found useful documentation on [shapelets](https://www.aeon-toolkit.org/en/latest/examples/classification/shapelet_based.html)
1. One task remaining from last week: Reading the bakeoff paper.
2. [sphinx/myst for documentation](https://www.aeon-toolkit.org/en/stable/developer_guide/documentation.html#documentation-build)
- [Numpy docs style guide](https://numpydoc.readthedocs.io/en/latest/format.html)
- Familiarise with sphinx documentation generation and numpydoc docstring standards.
- Improve the API documentation for a few classes/functions and go through the Pull Request and review process.
- Sphinx takes docstrings and parses them into HTML.
3. pytest for test coverage
- I did this online [tutorial](https://www.tutorialspoint.com/pytest/index.htm). It covered:
- naming conventions,
- running desired groups of tests,
- markers (group names, xfail, skip, parametrize, fixtures),
- `conftest.py` for fixtures, flags (`-v`, `-k`, `-m`, `--maxfail`, `--junitxml`),
- and running tests in parallel.
4. Daniele messaged me saying we will be working on random dilated shapelet together,
- and sent over [this paper](https://arxiv.org/pdf/2109.13514) for a read.
- We met up and discussed our understanding of different parts of the paper
5. get set up with tsml-eval on the cluster
- Followed the [instructions](https://github.com/time-series-machine-learning/tsml-eval/blob/main/_tsml_research_resources/soton/iridis/iridis_python.md)
- My original mistake when setting up the cluster was referencing a bash script instead of a Python one, which led to strange syntax errors. Matthew spotted this and helped me get it working.
- There were some other obstacles like finding the correct directory with the dataset list.
- I opened a PR with my attempted improvements to the md, which was not approved. Despite this, it was good practice to reflect on the method and find out where I had any misunderstandings.
6. I also want to read
- the [aeon research paper](https://arxiv.org/abs/2406.14231)
- the [logical shapelet paper](https://dl.acm.org/doi/abs/10.1145/2020408.2020587?casa_token=b7K0ZW1w7xQAAAAA:l1uL43Nh1FZewsrfILXDeHccuC-VAdoXG4x5bM0wChv3NaWn0tu7cMr2qrKBczwT43dOQW9BTklGLQ) referenced in the shapelet transform classifier paper
- *READ HALF SO FAR*
- Wrote a [medium article](https://medium.com/@vanya.knyazev/an-interns-meditations-mistakes-932a6e8ea8ed?source=your_stories_page-------------------------------------) reflecting on current activities
## Week 3: 5th August
- Worked on some docstring PRs; found an issue relating to building the docs locally. Current recommended practice is to make a draft PR and build the docs there.
- Since then, [a developer has found a way to build the docs locally](https://github.com/aeon-toolkit/aeon/issues/1896#issuecomment-2284693255).
- Study existing jupyter notebooks on shapelets
- [Made some improvements to the viz notebook](https://github.com/aeon-toolkit/aeon/pull/1930)
- Make a jupyter notebook on the four different transformers in aeon
- You can see local changes by having the development install of aeon and using the following code in the first cell of a notebook:
```
%load_ext autoreload
%autoreload 2
```
- Still WIP but has been great at testing my understanding and also improving existing code
- Found that SAST wasn't tagged correctly
- Shapelet Viz module wasn't correctly handling class index parameters
- Extended shapelet Viz to return both best and worst shapelets according to input param
- RDST is implemented very differently to ST, so I have [added some useful things](https://github.com/aeon-toolkit/aeon/pull/1959) from ST to it.
- RDST feature ranking [wasn't ranking shapelets](https://github.com/aeon-toolkit/aeon/pull/1971) but individual features, while grouping features to a shapelet. This meant a shapelet could be both the best and worst.
- **The notebook structure is:**
- Highlight differences between the four transformers, in order of publication
- Explain the Gun/No gun problem, this will help with interpreting shapelets later
- Visualise the time series from both classes
- for each transformer:
- Show how the data is transformed as a pandas dataframe
- Compare how tree based and linear classifier rank shapelets (only on first transformer)
- Visualise the extracted shapelets and group them by class
- Show the best and worst shapelet for each class using Viz module
- Interpret the shapelets, and try to get some insight into the problem to understand classification.
- Spoke to Antoine to clarify misconceptions from Daniele's and my code review
- He explained the reasoning behind the Manhattan distance & alpha similarity logic
- He shared his [PhD thesis](https://hal-sfo.ccsd.cnrs.fr/THESES-UO/tel-04368849v1) which goes into further detail beyond the published paper I read.
- Also started a writing this [medium article](https://medium.com/@vanya.knyazev/an-interns-meditations-random-dilated-shapelet-transform-9800eee80aee?source=your_stories_page-------------------------------------)
- I still don't understand the subtraction of 1 in line 432:
- `upper_bounds = np.log2(np.floor_divide(n_timepoints - 1, lengths - 1))`
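One possible reading of the two `- 1`s, sketched below under the assumption that a shapelet of length `l` with dilation `d` spans `(l - 1) * d + 1` timepoints; requiring that span to fit in the series then gives `d <= (n_timepoints - 1) / (l - 1)`. The values are illustrative only:

```python
import numpy as np

n_timepoints = 150  # illustrative series length (e.g. GunPoint)
lengths = np.array([7, 9, 11])

# Largest dilation whose span (l - 1) * d + 1 still fits in the series
max_dilation = np.floor_divide(n_timepoints - 1, lengths - 1)
# RDST samples dilations as 2**x with x drawn below this bound
upper_bounds = np.log2(max_dilation)

for length, d in zip(lengths, max_dilation):
    span = (length - 1) * d + 1
    assert span <= n_timepoints  # the fully dilated shapelet still fits
print(max_dilation)  # [24 18 14]
```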
- I had two curious questions regarding the shapelet domain which Tony explored with me
- Are dilations useful for time series recorded at different frequencies? If time series are allowed to be recorded at different frequencies (but the same length) and stored in the same dataset, dilation would be able to ignore the noise of the more frequently recorded series
- Tony said time series are generally assumed to be sampled at the same rate. Dilation is a means of downsampling which allows less data preprocessing.
- Also, I was interested in any experiments done on making shapelets rotationally invariant. This could allow for fewer shapelets by searching for the minimal Euclidean distance over at least x-axis reflected shapelets.
- Tony said avoiding variations in shapelet selection has often led to arbitrary findings. A crude heuristic for finding similar shapelets seems good enough at the moment. There is still a lot of scope for exploring shapelets.
- Why in TSC do we have a smaller training set relative to the test set? How come we don't follow the typical ML 70/30 split?
- When Eamonn was setting up the archive he made the train sets smaller to make the problems harder to solve. They tend to go 50/50 for new problems.
## Week 4: 12th August
- Fixed length shapelet experiment
- Add a method to extract_random_shapelet to fix the lengths of all generated shapelets to [9,11,13]
- I misunderstood Tony here. I thought we wanted the same random shapelet length for all shapelets, but we want a random length out of these options for each shapelet.
- Adapt STC (made a copy called dstc extending stc) to have the option to fix the lengths of all generated shapelets
- Modified the set_classifier to have a fixed length STC experiment
- Compare the fixed length STC to the original STC implementation using the cluster (112 univariate datasets)
- First made a bash script to run stc & fixedlengthshapelettransform
- conditional length_selector kwarg (empty string or FIXED)
- Run fixed_length_STC.py after all experiments complete. *Not currently working*
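A minimal sketch of what I mean by fixing the length options. The function name and signature are hypothetical, not aeon's actual API:

```python
import numpy as np

# Hypothetical sketch of the FIXED length_selector behaviour: each shapelet
# draws a random length from a small fixed set of options, reproducibly
# via random_state (this reproducibility matters for the experiments).
def sample_shapelet_lengths(n_shapelets, options=(9, 11, 13), random_state=None):
    rng = np.random.default_rng(random_state)
    return rng.choice(options, size=n_shapelets)

lengths = sample_shapelet_lengths(5, random_state=0)
```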
- **Cluster Workflow:**
```bash
conda activate tsml-eval
pip uninstall tsml-eval
cd aeon/tsml-eval
git fetch origin
git reset --hard origin/stc2
pip install git+https://github.com/IRKnyazev/tsml-eval.git@stc2
sinteractive
conda activate aeon_hpc
cd "/mainfs/lyceum/ik2g21/aeon/tsml-eval/_tsml_research_resources/soton/iridis/"
chmod +x fixed_length_STC.sh
squeue -u ik2g21 --format="%12i %15P %20j %10u %10t %10M %10D %20R" -r
scancel -u ik2g21
```
It would be nice not to have to uninstall and reinstall tsml-eval each time, but for some reason the cluster doesn't stay up to date with pushes to the stc2 branch.
## Week 5: 19th August
- Fixed length shapelet experiment
- Made a jupyter notebook to debug the steps implemented above
- It turns out that, despite the cluster producing results, there were quite a few bugs in my implementation
- One of which related to rng not enforcing random_state for shapelet lengths when set to FIXED
- ~~One ENH I've made is prioritising the given start pos over the length as this is something the user has control over~~
- Finished experiment, found that fixing lengths reduces accuracy - this was expected.
- Continuing work on the shapelet notebook
- tried removing the -1 from rdst line 443
- this didn't break anything for gunpoint, but the shapelets generated changed
- Using sast and rsast in the notebook, I found that they don't share some class attributes with ST, so I [added them in a PR](https://github.com/aeon-toolkit/aeon/pull/2006)
- for some reason pre-commit wasn't running locally for me despite being installed
- rerunning the install fixed this
## Week 6: 26th August
- For the previous experiment I need to check for significant results
- A p-value of less than 0.001 indicates a statistically significant difference.
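A minimal sketch of such a check, assuming a Wilcoxon signed-rank test over paired per-dataset accuracies (the numbers below are made up, not the experiment's actual results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative only: paired per-dataset accuracies of the two classifiers
acc_stc = np.array([0.91, 0.84, 0.77, 0.95, 0.88, 0.80])
acc_fixed = np.array([0.89, 0.83, 0.74, 0.94, 0.85, 0.79])

# The Wilcoxon signed-rank test on paired accuracies is a common way to
# test for a significant difference between two classifiers over datasets
stat, p_value = wilcoxon(acc_stc, acc_fixed)
print(p_value)
```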
- Next experiment is to see if increasing the fixed length range improves performance
- a function to get the max dilation for a dataset
- a function to create a list of possible lengths: [7,9,11] * all dilations within the max dilation range
- ^^ Note: not actually a dilated transform, just using a wider range of lengths via possible dilation stretches
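A hypothetical helper matching the notes above, assuming the dilated span formula `(l - 1) * d + 1` for a shapelet of length `l` at dilation `d` (the function name is my own, not aeon's):

```python
# Widen the candidate length list: for each base length, take the span of
# every power-of-two dilation that still fits within the series.
def candidate_lengths(n_timepoints, base_lengths=(7, 9, 11)):
    out = set()
    for length in base_lengths:
        dilation = 1
        while (length - 1) * dilation + 1 <= n_timepoints:
            out.add((length - 1) * dilation + 1)
            dilation *= 2
    return sorted(out)

print(candidate_lengths(150))
# [7, 9, 11, 13, 17, 21, 25, 33, 41, 49, 65, 81, 97, 129]
```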

- didn't get as many dataset results for rdst due to a numba cache error; 53 ran successfully
- Extra task: lots of classifiers in aeon have no reference results. Setting some of these off on the cluster as a background task would be really helpful
- Find aeon's reference results
- compare fixedlen to reference STC
- Run some other estimators on cluster - redcomets & proximity forest
- can kill the job at any point as progress is tracked
- had to run `dos2unix reference_results.sh` for the script to work
- *Current progress* 
- Document what results we have and don't have for classifiers and regressors on https://timeseriesclassification.com/.
- Compare the list of all estimators to the lists on the website
-  *find aeon's regressors and classifiers*
-  *referenced classifiers and regressors from the website*
- Had to manually compare the lists due to differences in naming conventions and uses of abbreviations.
- Then create an issue on the [repo github]( https://github.com/time-series-machine-learning/tsml-repo) with this list.
- There are examples of doing this sort of thing in the benchmarking module notebooks
- For the jupyter notebook, explain how Ridge CV assigns coefficients and why it's hard at times to distinguish best from worst shapelets
- Decided to swap to logistic regression, whose coefficient weighting is easier to interpret.
- Found an [issue](https://github.com/aeon-toolkit/aeon/issues/2016) in the shapelet viz module
- [distance should be negatively correlated with a class](https://github.com/aeon-toolkit/aeon/pull/2017)
- also, ranking RDST shapelets is not as easy as thought, due to the SO and argmin features being best at a certain value rather than trending up or down
- Implement correction for multiclass data too - [the PR](https://github.com/aeon-toolkit/aeon/pull/2018)
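A toy illustration of the sign convention behind this: with logistic regression on a single shapelet-distance feature, a negative coefficient means small distances push the prediction towards the class, which is what ranking should be based on (the data below is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one shapelet-distance feature; class 1 when distance is small
rng = np.random.default_rng(0)
dist = rng.uniform(0, 1, size=(100, 1))
y = (dist[:, 0] < 0.5).astype(int)

clf = LogisticRegression().fit(dist, y)
print(clf.coef_)  # negative: distance is anti-correlated with class 1
```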
- Attempted to provide assistance in the helpdesk channel
- Q: Is there documentation on why aeon decided to have time data points across the columns instead of the rows? I'm interested in the reasons behind it
- My answer: I'm just an intern, but here's what I think. In scientific tables, the convention is to have the independent variable in the columns and the dependent variables in the rows. In a time series dataset, time is the independent variable because it provides the reference point across which measurements are taken, so each column represents a specific time point. The different time series (or the variables within a series), which represent the observed values at each time point, are the dependent variables and are naturally placed in the rows.
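For reference, this is how the layout looks in code. As I understand it, aeon stores a collection of series as a 3D numpy array shaped `(n_cases, n_channels, n_timepoints)`:

```python
import numpy as np

# Within one case, each row is a channel and the columns index time
X = np.zeros((10, 1, 150))  # 10 univariate series of length 150
single_case = X[0]
print(single_case.shape)  # (1, 150): one channel, time across the columns
```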
## Week 7: 2nd September
- Found that [viz module doesn't work for rotational forests](https://github.com/aeon-toolkit/aeon/issues/2007) last week, spoke to Antoine who said that the PCA makes getting coefs challenging
- [documented the scope of the viz module](https://github.com/aeon-toolkit/aeon/pull/2027)
- Enquired Monash about entry requirements
- Hi, I am looking to apply for an iPhd in machine learning, specifically for the Woodside energy partnership scholarship. I wanted to clarify if I am eligible for application.
I have graduated with a First class honours in computer science from the University of Southampton, this was a 3 year course but had a significant research component. I am currently interning at aeon - an open source time series framework - with the purpose of publishing a research paper, and I graduated top 10% in my cohort.
- They replied: Ivan, specific questions such as yours are only assessed once you have made an application by the faculty admissions team and potentially the Graduate Research Committee
- Finalising the transform notebook by writing up medium articles for each section
- [GunPoint article](https://medium.com/@vanya.knyazev/an-interns-meditations-gunpoint-b765df0f6964?source=your_stories_page-------------------------------------)
- [ST article](https://medium.com/@vanya.knyazev/an-interns-meditations-shapelet-transform-0d4243712485?source=your_stories_page-------------------------------------)
- [RDST article](https://medium.com/@Ivan-Knyazev/an-interns-meditations-rdst-for-gunpoint-2c8bc39c5984?source=your_stories_page-------------------------------------)
- [RSAST article](https://medium.com/@Ivan-Knyazev/an-interns-meditations-r-sast-for-gunpoint-ad917a7aabe7?source=your_stories_page-------------------------------------)
- [Review article](https://medium.com/@Ivan-Knyazev/an-interns-meditations-shapelet-driven-insights-to-gunpoint-0de644eb73e0?source=your_stories_page-------------------------------------)
- Cluster still running in the background
- need to ask what to do with the results once all done
## Week 8: 9th September
- Revisited the shapelet primitives paper and wrote a [post](https://medium.com/@Ivan-Knyazev/an-interns-meditations-why-you-should-use-shapelets-for-time-series-classification-aa32ae19d3d2?source=your_stories_page-------------------------------------)
- Despite the cluster not finishing, I did the summary work as if it had
- can just rerun the code for the full experiments later
- collate results
- multiple classifier evaluation
- pull results of other algorithms - look at the benchmarking examples for a step-by-step guide
- compare proximity forest to distance-based & redcomets to shapelet-based classifiers
- Wrote about my ST experiments