# ShapeletTransform
How the sampling currently works (sketched in code below):
1. Create an rng
2. Randomly choose an instance to sample from
3. Get the worst quality kept so far for that class (only matters if the class is full)
4. Randomly sample a length
5. Randomly sample a starting point
6. Extract the normalised subseries
7. Find its quality via `self._find_shapelet_quality`
8. Return the candidate
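A minimal sketch of one sampling iteration, matching the steps above. Only `_find_shapelet_quality` is named in the actual code; the placeholder quality function, every other name, and the toy data here are illustrative assumptions, not the real implementation.

```python
import numpy as np


def _find_shapelet_quality(cand, X, y, cls):
    # Placeholder quality: separation between other-class and same-class
    # distances. The real code computes an information-gain style quality and
    # early-abandons against the worst quality kept so far for the class
    # (step 3 above), which this sketch omits.
    def best_dist(s):
        m = len(cand)
        dists = []
        for i in range(len(s) - m + 1):
            w = s[i:i + m]
            w = (w - w.mean()) / (w.std() + 1e-8)
            dists.append(np.linalg.norm(cand - w))
        return min(dists)

    d = np.array([best_dist(s) for s in X])
    return d[y != cls].mean() - d[y == cls].mean()


def sample_one_shapelet(X, y, rng, min_len, max_len):
    # 2. randomly choose an instance to sample from
    idx = rng.integers(len(X))
    series, cls = X[idx], y[idx]
    # 4./5. randomly sample a length and a starting position
    length = rng.integers(min_len, max_len + 1)
    start = rng.integers(0, len(series) - length + 1)
    # 6. extract and z-normalise the candidate subseries
    cand = series[start:start + length]
    cand = (cand - cand.mean()) / (cand.std() + 1e-8)
    # 7. score the candidate; 8. return it with its quality and location
    return _find_shapelet_quality(cand, X, y, cls), cand, idx, start, length


# 1. create an rng, then draw a pool of candidates on some toy data
rng = np.random.default_rng(0)
X = [np.sin(np.linspace(0, np.pi * (1 + c), 50)) + rng.normal(0, 0.1, 50)
     for c in (0, 1) for _ in range(5)]
y = np.array([0] * 5 + [1] * 5)
candidates = [sample_one_shapelet(X, y, rng, 5, 20) for _ in range(100)]
print(max(candidates, key=lambda c: c[0])[0])
```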
# Shapelets
###### tags: `Research`
1. Summarise current SOTA (see redux)
2. Transform: full dive into the code to see how it works
3. Regression: evaluate a simple regression implementation
4. Components: assess the existing structure, e.g.
   - pipeline vs guided search
   - random search vs guided search
## 21/11/23
1. Profile where STC does badly
2. Summarise the number of shapelets sampled, plot by series
RDST vs STC (REFERENCE run), split by dataset. The first plot is by series length, and RDST seems to be better on short series.

STC is reference, SQRT is now on main

29

| STC | RDST | ROCKET | MrSQM | STDEV |
| --- | --- | --- | --- | --- |
| 0.85400519 | 0.87714886 | 0.8690555 | 0.86408254 | 0.86496199 |
so already 1% up on STC.
## Q1: What if we configure STC differently?
### A) Does looking for shorter shapelets improve STC?
Simple change in the ST fit:
```python=
if self.max_shapelet_length is None:
    if self.min_series_length_ < 20:
        self._max_shapelet_length = self.min_series_length_
    elif self.min_series_length_ < 40:
        self._max_shapelet_length = self.min_series_length_ // 2
    else:
        self._max_shapelet_length = self.min_series_length_ // 4
```


tl;dr: this does not help!
### B) Does defaulting to always find 1000 shapelets make a difference?
The current default is:
```python
if self.max_shapelets is None:
    self._max_shapelets = min(10 * self.n_instances_, 1000)
```
So does STC do worse when max_shapelets is capped low? There are 36 problems where the cap comes out below 1000, and on those the difference between STC and RDST is actually smaller than overall (0.8% gap vs a 2% gap overall). But it is worth trying out.
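For illustration (not from the experiments), the default rule above only bites on small training sets:

```python
# Hypothetical illustration of the default cap: the 1000 limit only applies
# once there are at least 100 training instances.
for n_instances in (30, 100, 500):
    print(n_instances, min(10 * n_instances, 1000))  # 300, 1000, 1000
```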
### C) What if we only keep 500 instead of 1000?
Accidentally tried searching 500 ... will keep it for reference; this is basically 500 random shapelets.
Keeping 1000 shapelets
Sampling 500 shapelets
It's worse, but not massively so.
FOLDER: Random 500 searched

84% instead of 86%. This does raise the question of how much the learning process is actually adding. By default we keep 1000 from 10000 sampled; what if we keep 500 from 10k sampled?
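A hedged sketch of the configuration this tests, assuming aeon's `RandomShapeletTransform` and its parameter names; not the exact experiment script.

```python
# Hedged sketch: keep 500 of 10,000 sampled shapelets (assumed aeon API).
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform

st = RandomShapeletTransform(n_shapelet_samples=10_000, max_shapelets=500)
# st.fit(X_train, y_train); X_t = st.transform(X_train)
```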
26/11/23:
Keeping 500 shapelets
Sampling 10000 shapelets


Very little difference overall:
MAIN: 0.865277143
STCL: 0.862414457
Fit time is 92% of MAIN's, so not really worth changing.
## Q: What if we just keep them all and use a linear classifier, same as ROCKET?
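A hedged sketch of the setup being tested here, assuming aeon's `RandomShapeletTransform` plus scikit-learn's `StandardScaler` and `RidgeClassifierCV`; the parameter pair corresponds to the max_shapelets, n_shapelet_samples configs named in the subsections below, and this is not the exact experiment code.

```python
# Keep the shapelet distance features and fit a ridge head on them, ROCKET-style.
# Class names and parameters are assumptions based on aeon/scikit-learn.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform

clf = make_pipeline(
    RandomShapeletTransform(
        n_shapelet_samples=10_000,  # shapelets sampled
        max_shapelets=1_000,        # shapelets kept (10_000 to keep them all)
    ),
    StandardScaler(),               # per-feature scaling before the linear head
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),  # same head as ROCKET
)
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```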
### 1 RidgeCV_2 1000,10000
With 1000/10000, RidgeCV is significantly worse than RotF:

Results RidgeCV_1 had incorrect scaling here: rerun.
### 2 RidgeCV_3 10000,10000
No better, maybe a bit worse:

| testAccuracy | MAIN | REF | STC |
| --- | --- | --- | --- |
| testAccuracyMean | 0.8640978568950123 | 0.85265767161018 | 0.8388526659602056 |
## Revisit: August 2025
Very little gain from learning. So the new idea is to see whether we can marginally improve RDST through filtering: RDST fits 10k shapelets then uses a linear classifier.
## Select 1k then use rotation forest
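A hedged sketch of that idea, assuming aeon's class names and module paths; the ANOVA F-score filter is purely an illustrative stand-in for whatever selection criterion ends up being used.

```python
# Fit the RDST transform, keep the 1k best-scoring features, then train a
# rotation forest on the selection. Names and parameters are assumptions.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from aeon.classification.sklearn import RotationForestClassifier
from aeon.transformations.collection.shapelet_based import RandomDilatedShapeletTransform

clf = make_pipeline(
    RandomDilatedShapeletTransform(max_shapelets=10_000),  # RDST feature space
    SelectKBest(f_classif, k=1_000),                       # filter to 1k features
    RotationForestClassifier(),                            # replace the linear head
)
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```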