# ShapeletTransform
How the sampling currently works (sketched in code below):
1. Create an rng
2. Randomly choose an instance to sample from
3. Get the worst quality kept so far for that class (only matters if the class is full)
4. Randomly sample a length
5. Randomly sample a starting point
6. Extract the normalised subseries
7. Find its quality via `self._find_shapelet_quality`
8. Return the candidate
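A minimal sketch of one sampling iteration, matching the steps above. Only `_find_shapelet_quality` is named in the actual code; the placeholder quality function, every other name, and the toy data here are illustrative assumptions, not the real implementation.

```python
import numpy as np


def _find_shapelet_quality(cand, X, y, cls):
    # Placeholder quality: separation between other-class and same-class
    # distances. The real code computes an information-gain style quality and
    # early-abandons against the worst quality kept so far for the class
    # (step 3 above), which this sketch omits.
    def best_dist(s):
        m = len(cand)
        dists = []
        for i in range(len(s) - m + 1):
            w = s[i:i + m]
            w = (w - w.mean()) / (w.std() + 1e-8)
            dists.append(np.linalg.norm(cand - w))
        return min(dists)

    d = np.array([best_dist(s) for s in X])
    return d[y != cls].mean() - d[y == cls].mean()


def sample_one_shapelet(X, y, rng, min_len, max_len):
    # 2. randomly choose an instance to sample from
    idx = rng.integers(len(X))
    series, cls = X[idx], y[idx]
    # 4./5. randomly sample a length and a starting position
    length = rng.integers(min_len, max_len + 1)
    start = rng.integers(0, len(series) - length + 1)
    # 6. extract and z-normalise the candidate subseries
    cand = series[start:start + length]
    cand = (cand - cand.mean()) / (cand.std() + 1e-8)
    # 7. score the candidate; 8. return it with its quality and location
    return _find_shapelet_quality(cand, X, y, cls), cand, idx, start, length


# 1. create an rng, then draw a pool of candidates on some toy data
rng = np.random.default_rng(0)
X = [np.sin(np.linspace(0, np.pi * (1 + c), 50)) + rng.normal(0, 0.1, 50)
     for c in (0, 1) for _ in range(5)]
y = np.array([0] * 5 + [1] * 5)
candidates = [sample_one_shapelet(X, y, rng, 5, 20) for _ in range(100)]
print(max(candidates, key=lambda c: c[0])[0])
```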
# Shapelets
###### tags: `Research`
1. Summarise current SOTA (see redux)
2. Transform: full dive into the code to see how it works
3. Regression: evaluate a simple regression implementation
4. Components: assess the existing structure, e.g.
   - pipeline vs guided search
   - random search vs guided search
## 21/11/23
1. Profile where STC does badly
2. Summarise the number of shapelets sampled, plot by series
RDST vs STC (REFERENCE run), split by dataset. The first plot is by series length, and RDST seems to be better on short series.

STC is reference, SQRT is now on main

29

| STC | RDST | ROCKET | MrSQM | STDEV |
| --- | --- | --- | --- | --- |
| 0.85400519 | 0.87714886 | 0.8690555 | 0.86408254 | 0.86496199 |
so already 1% up on STC.
## Q1: What if we configure STC differently?
### A) Does looking for shorter shapelets improve STC?
Simple change in the ST fit:
```python=
if self.max_shapelet_length is None:
    if self.min_series_length_ < 20:
        self._max_shapelet_length = self.min_series_length_
    elif self.min_series_length_ < 40:
        self._max_shapelet_length = self.min_series_length_ // 2
    else:
        self._max_shapelet_length = self.min_series_length_ // 4
```


tl;dr: this does not help!
### B) Does defaulting to always find 1000 shapelets make a difference?
The current default is:
```python
if self.max_shapelets is None:
    self._max_shapelets = min(10 * self.n_instances_, 1000)
```
So does STC do worse when max_shapelets is capped low? There are 36 problems where the cap comes out below 1000, and on those the difference between STC and RDST is actually smaller than overall (0.8% gap vs a 2% gap overall). But it is worth trying out.
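For illustration (not from the experiments), the default rule above only bites on small training sets:

```python
# Hypothetical illustration of the default cap: the 1000 limit only applies
# once there are at least 100 training instances.
for n_instances in (30, 100, 500):
    print(n_instances, min(10 * n_instances, 1000))  # 300, 1000, 1000
```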
### C) What if we only keep 500 instead of 1000?
Accidentally tried searching 500 ... will keep it for reference; this is basically 500 random shapelets.
Keeping 1000 shapelets
Sampling 500 shapelets
It's worse, but not massively so.
FOLDER: Random 500 searched

84% instead of 86%. This does raise the question of how much the learning process is actually adding. By default we keep 1000 from 10000 sampled; what if we keep 500 from 10k sampled?
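A hedged sketch of the configuration this tests, assuming aeon's `RandomShapeletTransform` and its parameter names; not the exact experiment script.

```python
# Hedged sketch: keep 500 of 10,000 sampled shapelets (assumed aeon API).
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform

st = RandomShapeletTransform(n_shapelet_samples=10_000, max_shapelets=500)
# st.fit(X_train, y_train); X_t = st.transform(X_train)
```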
26/11/23:
Keeping 500 shapelets
Sampling 10000 shapelets


Very little difference overall:
MAIN: 0.865277143
STCL: 0.862414457
Fit time is 92% of MAIN's, so not really worth changing.
## Q: What if we just keep them all and use a linear classifier, same as ROCKET?
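A hedged sketch of the setup being tested here, assuming aeon's `RandomShapeletTransform` plus scikit-learn's `StandardScaler` and `RidgeClassifierCV`; the parameter pair corresponds to the max_shapelets, n_shapelet_samples configs named in the subsections below, and this is not the exact experiment code.

```python
# Keep the shapelet distance features and fit a ridge head on them, ROCKET-style.
# Class names and parameters are assumptions based on aeon/scikit-learn.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from aeon.transformations.collection.shapelet_based import RandomShapeletTransform

clf = make_pipeline(
    RandomShapeletTransform(
        n_shapelet_samples=10_000,  # shapelets sampled
        max_shapelets=1_000,        # shapelets kept (10_000 to keep them all)
    ),
    StandardScaler(),               # per-feature scaling before the linear head
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),  # same head as ROCKET
)
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```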
### 1 RidgeCV_2 1000,10000
With 1000/10000, RidgeCV is significantly worse than RotF:

Results RidgeCV_1 had incorrect scaling here: rerun.
### 2 RidgeCV_3 10000,10000
No better, maybe a bit worse:

| testAccuracy | MAIN | REF | STC |
| --- | --- | --- | --- |
| testAccuracyMean | 0.8640978568950123 | 0.85265767161018 | 0.8388526659602056 |
## Revisit: August 2025
Very little gain from learning. So the new idea is to see whether we can marginally improve RDST through filtering: RDST fits 10k shapelets then uses a linear classifier.
## Select 1k then use rotation forest
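A hedged sketch of that idea, assuming aeon's class names and module paths; the ANOVA F-score filter is purely an illustrative stand-in for whatever selection criterion ends up being used.

```python
# Fit the RDST transform, keep the 1k best-scoring features, then train a
# rotation forest on the selection. Names and parameters are assumptions.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from aeon.classification.sklearn import RotationForestClassifier
from aeon.transformations.collection.shapelet_based import RandomDilatedShapeletTransform

clf = make_pipeline(
    RandomDilatedShapeletTransform(max_shapelets=10_000),  # RDST feature space
    SelectKBest(f_classif, k=1_000),                       # filter to 1k features
    RotationForestClassifier(),                            # replace the linear head
)
# clf.fit(X_train, y_train); clf.score(X_test, y_test)
```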