Mentorship Programme Notes: Katie Buchhorn
===
###### tags: `mentorship`
This is a collaborative document to keep track of progress during the mentorship programme.
:::info
- **Call time**: 6:45-7:30 UTC, Tuesdays (Lovkush/Franz can make it from 7-7:30)
- **Project name**: Anomaly detection in high-dimensional data
- **Mentees**: Katie Buchhorn
- **Mentor(s)**: Łukasz Mentel, Franz Király, Lovkush Agarwal
- **Call joining link**: https://qut.zoom.us/j/85767181332?pwd=TXYxQVpmN1BlL3B2TDljODdpM2ZpZz09
:::
GSoC Week | Stand up/Mentor meetings |Notes | Friday Workflow
---|---|---|---
Week 2| Mon 20th, Tues 21st*, Wed 22nd |(early stand up) [UTC + 10] | attending: annotation
Week 3| -| Conference | ?tentative: base object, annotation
Week 4| Mon 4th (July), Tues 5th |(late stand up) [UTC - 4] | -
Week 5| Every day |Dev days London! | attending: base object, annotation
Week 6| - | Wedding| -
Week 7| Tue 26th*, Wed 27th, Thur 28th | (early stand up) [UTC + 10] | attending: base object, annotation
Week 8| Tue 2nd (Aug)*, Wed 3rd, Thur 4th |(early stand up) [UTC + 10]| attending: base object, annotation
Week 9| Mon 8th, Tue 9th*, Wed 10th |(early stand up) [UTC + 10]| attending: base object, annotation
Week 10| Tue 16th*, Wed 17th, Thur 18th|(early stand up) [UTC + 10]| attending: base object, annotation
Week 11| Tue 23rd*, Wed 24th, Thur 25th|(early stand up) [UTC + 10]| attending: base object, annotation
Week 12| Tue 30th*, Wed 31st, Thur 1st (Sep)|(early stand up) [UTC + 10]| attending: base object, annotation
Week 13| Mon 5th, Tue 6th*, Wed 7th |(early stand up) [UTC + 10]| attending: base object, annotation
Last week: starting 5th Oct
*our weekly zoom meeting is just afterwards, currently scheduled at 6:45am UTC.
350/12 ≈ 29h per week
## References
The sktime mentorship programme is inspired by the [Turing Way](https://the-turing-way.netlify.app/welcome) project and the [Open Life Science](https://openlifesci.org) programme, used under a CC BY 4.0 license (Open Life Science, OLS-2, 2020).
**Contents**
[TOC]
## Week 0
- For reference: sktime's [mentorship programme call for applications](https://github.com/alan-turing-institute/sktime/wiki/Mentorship-programme)
- Get started with HackMD using this short guide: https://hackmd.io/@openlifesci/OLS-HackMD-guide
- Please read our [Code of Conduct](https://github.com/alan-turing-institute/sktime/blob/master/CODE_OF_CONDUCT.rst)
- Check out sktime's [development roadmap](https://github.com/alan-turing-institute/sktime/issues/228)
**Preferred time and days provided by the mentor and mentee**
Please indicate your preferred day/time for your regular call:
6:30-7:30 UTC, Tuesdays (Lovkush/Franz can make it from 7-7:30)
### Prep work for week 1
**Mentees will:**
1. Read https://ideas.ted.com/are-you-mentorable/ (done)
2. Set 1-2 personal development goals for yourself:
During GSoC:
* improve software development skills, i.e. best practices in full-scope module development (class inheritance, unit testing, notebook explanations)
* gain a deeper understanding of the stats/maths (I chose the topics because they looked interesting and will be useful in my PhD work and hopefully many others' work)
* add annotators to sktime
Long-term (Dec 2022):
* create package for "Graph Neural Network-Based Anomaly Detection in Multivariate Time Series" i.e. an extension and refactor of https://github.com/d-ailin/GDN (possibly interfacing sktime)
3. State how your mentors can best support you in your contribution to sktime (e.g., providing code reviews, sharing useful resources, explaining ML concepts):
* dev/environment set-up
* sktime codebase overview/introduction
* providing ongoing code reviews
* pre-emptive discussion about how to structure an idea (i.e. annotation, series_as_features, forecasting)
... I'll have plenty of questions when I start ...
4. Open an issue on the [sktime/mentoring](https://github.com/sktime/mentoring/issues) repo. This can be updated during the mentorship program.
See https://github.com/sktime/mentoring/issues/23
## Week 1 - 2022-05-31
### Agenda
* discussing aim of mentoring
* how sktime mentoring programme works
* discussing project to work on & next steps
* developer set-up
### Notes
Resources
Link | Notes
---|---
Practices of the Python Pro by Dane Hillard | book on design patterns (downloaded pdf)
https://sourcemaking.com/ | website for design patterns
https://github.com/kamranahmedse/design-patterns-for-humans |
https://github.com/faif/python-patterns |
LTTC | lecture notes on software eng for data science - lecture 1 has pointers to programming books and math books -> page 4 |
- long-term goal of creating new package
- PyPI, documentation, open source, automated unit testing
- LM suggests joining release note meetings
- FK suggests watching the creation of new packages from scratch, e.g. the project that FK and Ryan are going to start
- cookiecutter: folder/file template (https://pypi.org/project/cookiecutter/)
- PyScaffold (https://pypi.org/project/PyScaffold/)
- software engineering sub-goals
- enough understanding of sktime, enough to be able to create one's own package.
- LM. generally a big part of coding is knowing what already exists and how to use Google/Stack Overflow, not coding things from scratch.
- LM. knowing general rules of thumb, e.g. avoiding for loops in numpy/pandas.
- LM. another big part is design / software design / system design. contract-based design, avoiding spaghetti code, good balance of rigidity vs flexibility.
- FK suggested various books/resources
- mathematics/stats
- fk suggested various books
- KB briefly showed current research that is being done. will share link to arxiv when ready.
- how sktime mentoring works
- this hackmd is shared and we all work on it.
- each week, should have an agenda ready for mentoring meetings. any questions that arise throughout the week are worth adding to next week's agenda. possible for mentors to provide answers before the actual meeting.
- ensure you are familiar with Code of Conduct
- sktime meetings: slack and discord
- governance documents: see https://www.sktime.org/en/stable/about.html. how sktime makes decisions, what are the processes we follow, etc.
- there will be regular workgroup meetings - e.g. meeting for forecasters, meeting for annotators, etc. planning for the next week.
- daily standup.
- free to join the various other meetings in sktime: governance, developer meeting
- flexibility
- there is table above in which KB detailed availability/timeline
- sktime is fully flexible
- presentations
- we provide opportunity to give presentations. at dev day, at doc sprint, in python conferences.
- ask Guzal or Franz
- blogs: explaining how mentoring works, experience, showcasing sktime, onboarding, community aspect, etc.
- individuals who have blogged in the past: Outreachy mentees - Guzal; Alexandra Amidon (no longer active at sktime) - successful (data) science blogger; Nina Miolane (allied package - geomstats, collaborated on 2021 GSoC)
- tutorials are in the "examples" directory, where they can be run via Binder
- event specific tutorials are in sktime org, under conference name, e.g., pydata2021 https://github.com/sktime/sktime-tutorial-pydata-global-2021
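As an aside on LM's "avoid for loops in numpy/pandas" rule of thumb above, a minimal sketch (toy example, not sktime code) of the same computation written both ways:

```python
import numpy as np

def demean_loop(X):
    """Column-wise demeaning with an explicit Python loop (anti-pattern)."""
    out = X.astype(float).copy()
    for j in range(X.shape[1]):
        out[:, j] = X[:, j] - X[:, j].mean()
    return out

def demean_vectorized(X):
    """Same result via broadcasting: one numpy expression, no Python loop."""
    return X - X.mean(axis=0)

X = np.arange(12.0).reshape(4, 3)
assert np.allclose(demean_loop(X), demean_vectorized(X))
```

The vectorized version is both shorter and much faster on large arrays, since the looping happens in compiled C rather than Python.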
### development goals
### project goals
### Actions
- good first issue (content/algos to learn about estimators)
- maybe adding or modifying an algorithm to get familiar with structure?
- optimally something that can be interfaced, does not need implementation
- goal: to get familiar with the software architecture of the algorithms
- examples:
- https://github.com/alan-turing-institute/sktime/issues/2499
- https://github.com/alan-turing-institute/sktime/issues/2417
- https://github.com/alan-turing-institute/sktime/issues/2357
- https://github.com/alan-turing-institute/sktime/issues/2173
- https://github.com/alan-turing-institute/sktime/issues/2059
- some of these are already worked on but abandoned, could start new or with existing PR if picking one of those
- reading list
## Week 2 - 2022-06-07
### actions from 2022-05-31
- good first issue (content/algos to learn about estimators)
- reading list
### work done
Working on this issue:
https://github.com/alan-turing-institute/sktime/issues/2499
- class VARMAX(_StatsModelsAdapter)
- init, fit okay
- error with predict "ValueError: Prediction must have `end` after `start`"
### Agenda
1. review of actions & work done
2. questions of Katie
3. scheduling, calendar check
### Questions
- debugging environment
- is there a tool to "break" and retain variable values
- questions as comments in varma.py
- committed to varmax branch
- FK: these are not "seen" by the default review process. I would suggest for something like this: (a) make a draft PR, (b) put questions there, possibly pointers to code locations, (c) ping GitHub IDs of people you'd like to look at it
- question: where is `to_absolute_int` coming from?
- that is in `ForecastingHorizon` class, the `fh` is always wrapped in it before it reaches `_fit`
- people most familiar with this class are khrapovs and fkiraly (and mloning but he is not active)
- searching for code origin in VSCode
- Edit -> Find in files
- python extension -> CTRL click on function name to go to definition
- recommended development setup
- virtual environment either conda env/venv/pipenv etc
- editable installation of sktime including dev deps: `pip install -e .[all_extras,dev]`
- jupyter installation in the same virtual env (`pip install jupyter` or `pip install jupyterlab`)
- ensure you can run repository test suite
- description of how to write and then test a new estimator here:
https://www.sktime.org/en/stable/developer_guide/add_estimators.html
### Notes
- LM briefly showed how he runs pytest tests. runs some kind of makefile, creates a directory, runs pytest with `-m`
- FK showed how to use the VS Code GUI for pytest. also explained the parametrize decorator.
- LA asked if there is plan for separate annotation module. not yet decided, though FK has strong opinions.
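For reference, a minimal sketch of the `parametrize` decorator FK mentioned (toy example, unrelated to sktime code):

```python
import pytest

def is_even(n):
    return n % 2 == 0

# parametrize runs the same test body once per (n, expected) pair,
# so pytest reports three separate test results here
@pytest.mark.parametrize("n,expected", [(2, True), (3, False), (0, True)])
def test_is_even(n, expected):
    assert is_even(n) == expected
```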
### Actions, short-term goals
- make a draft PR on VARMAX
- write some unit tests for VARMAX
- read up on conda environment management, official documentation
- read up on testing in vs code (pytest), official documentation
- look at api designs for annotation that already exist i.e. [pyod](https://github.com/yzhao062/Pyod), [adtk](https://github.com/arundo/adtk), [luminaire](https://github.com/zillow/luminaire), ruptures? things to understand include:
- what is structure of base class or classes? do they have fit/predict or something else? what is the precise inputs and outputs for the base methods.
- what is format of annotation data?
- what kind of data they can handle (e.g. univariate vs multivariate).
- what 'learning tasks' can they handle? e.g. supervised vs unsupervised? offline versus online? any other variations?
- read up on decorator @...
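For the decorator reading item: a minimal sketch of what the `@` syntax does (a decorator is just a function that takes a function and returns a replacement):

```python
import functools

def counted(func):
    """Toy decorator: wrap func so that each call is counted."""
    @functools.wraps(func)  # preserves func's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@counted  # sugar for: square = counted(square)
def square(x):
    return x * x
```

`square(3)` still returns 9, but `square.calls` now tracks how often it was called - the same wrapping mechanism that pytest's decorators build on.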
---
## Week 3 - 2022-06-14
### work done
https://github.com/alan-turing-institute/sktime/pull/2763
### Agenda
- previous week's goals
- how to iterate through get_test_params
- git branch help, accidental commits
- discuss annotation (katie to share intrinsic dimension code)
- debugging walk-through (time permitting)
### Questions
Note by LA: I moved the questions that were here to the agenda, and then copied the agenda to the notes section. This way, during the meeting, the Notes section becomes the logical place in which to make notes, rather than having notes spread out between three different sections (Agenda, Questions and Notes). Feel free to revert back to how things were if you prefer the other arrangement.
FK: Makes sense! Should we also change the template?
### Notes
- previous week's goals
- how to iterate through get_test_params
- FK: the test framework does this automatically, usually
- if you want to do this manually, you can use `create_test_instances_and_names`
- or, of course, use `get_test_params` directly, loop through elements, and call the constructor with `**params`
- LA. If you are not already familiar, learn about `*args` and `**kwargs`. https://www.freecodecamp.org/news/args-and-kwargs-in-python/
- git branch help, accidental commits
- KB created a new branch `annotation` of another branch `varmax`, instead of creating it off `main`.
- LM recommended creating a totally fresh branch (off main), checking it out, and then using `git cherry-pick` to pick precisely those commits you want from `annotation` into the newly created branch
- LA recommends this lecture explaining what commits and branches actually *are*. https://www.youtube.com/watch?v=2sjqTHE0zok
- FK: having a graphical user interface such as the branch/commit interface in vs code with gitlens might be helpful. Or, using GitHub Desktop. Otherwise, you need to constantly visualize the "branches" conceptual layer between the code and the git console
- discuss annotation (katie to share intrinsic dimension code)
- this was discussed.
- had paint file with different versions of annotation visualised.
- LA: in the paint file, you had 'multivariate' as an example, and it looked like we were annotating each individual series that made up the multivariate series. I think this is more commonly known as panel annotation? (To my knowledge, multivariate is you have multiple inter-related series and we are trying to annotate them as a whole. E.g. in an EEG, you get dozens of series all measuring the same brain at the same time; in this case we are more interested in annotating the whole collection of series not each series independently.)
- KB showed code for draft of implementing Intrinsic Dimensionality
- KB to continue understanding, experimenting and refactoring code. hopefully ready for Friday
- debugging walk-through (time permitting)
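The manual route through `get_test_params` discussed above can be sketched with a toy class (illustrative only - real sktime estimators inherit this machinery from their base class, and the test framework normally does the looping for you):

```python
# Toy estimator mimicking the get_test_params convention.
class ToyEstimator:
    def __init__(self, alpha=1.0, method="exact"):
        self.alpha = alpha
        self.method = method

    @classmethod
    def get_test_params(cls):
        # a list of dicts, each one a valid constructor parameter set
        return [{"alpha": 0.5}, {"alpha": 2.0, "method": "approx"}]

# manual iteration: build one instance per parameter set via **params
instances = [ToyEstimator(**params) for params in ToyEstimator.get_test_params()]
```

This is also a compact illustration of `**kwargs`: each parameter dict is unpacked into keyword arguments of the constructor.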
### Review actions
- [x] make a draft PR on VARMAX
- [x] write some unit tests for VARMAX
- read up on conda environment management, official documentation
- read up on testing in vs code (pytest), official documentation
- look at api designs for annotation that already exist i.e. [pyod](https://github.com/yzhao062/Pyod), [adtk](https://github.com/arundo/adtk), [luminaire](https://github.com/zillow/luminaire), ruptures? things to understand include:
- what is structure of base class or classes? do they have fit/predict or something else? what is the precise inputs and outputs for the base methods.
- what is format of annotation data?
- what kind of data they can handle (e.g. univariate vs multivariate).
- what 'learning tasks' can they handle? e.g. supervised vs unsupervised? offline versus online? any other variations?
- read up on decorator @...
### Actions, short-term goals
-
---
## Week 4 - 2022-06-21
### work done
- Actually understanding Hidalgo (Heterogeneous Intrinsic Dimension Algorithm)
- Hidalgo/twoNN pull request
- https://github.com/alan-turing-institute/sktime/pull/2828
- Bug fix for statsmodel in VARMAX interface
- https://github.com/alan-turing-institute/sktime/pull/2763
- Research on other annotation APIs
- https://hackmd.io/@KatieBuc/B1xayCKt5
### Agenda
- overview of work done
- previous week's goals, review
- questions by Katie
- schedule for next weeks - does that change because of Katie's travels?
### Notes
#### review of actions from previous weeks
- make a draft PR on VARMAX
- done, 2763, under review
- write some unit tests for VARMAX
- also in 2763?
- read up on conda environment management, official documentation
- read up on updating or resetting/deleting environment (as necessary for current situation with kalman filter)
- read up on testing in vs code (pytest), official documentation
- started, still working on this
- look at api designs for annotation that already exist i.e. [pyod](https://github.com/yzhao062/Pyod), [adtk](https://github.com/arundo/adtk), [luminaire](https://github.com/zillow/luminaire), ruptures? things to understand include:
- working on this
- https://hackmd.io/@KatieBuc/B1xayCKt5
- read up on decorator @...
- done
- Debugging environment walk-through
- done (thank you everyone, equally)
#### Questions
- question: what's the best environment to run C code in? to be able to break it down into the components and see the inputs/outputs we get.
- FK: where does the situation come from?
- KB: hidalgo/python calls C from python
- LM: looks like the hidalgo code has a ".so" file, which is compiled C code and platform specific. It will probably not run unless on the specific platform it was originally built for
- best way forward? analysis of options (see below)
- decide: can we run the code easily?
- if yes: interface?
- if no: what's the best way, e.g., translation or reimplementation?
- KB review of Hidalgo repo - looks like it calls an mex.h file which does not exist
- don't know, matlab-to-C interface?
- question: VARMAX example in docstring (printing a bunch of stuff - suppress print?)
- FK would suggest: suppress any print in `_fit`, that's where it is coming from? Users probably do not want this
- perhaps conditional suppression via a "verbose" hyperparameter
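One possible way to implement that conditional suppression (a hypothetical sketch, not the actual VARMAX code) is to redirect stdout while the noisy third-party fit runs:

```python
import contextlib
import io

def fit_with_optional_output(verbose=False):
    """Sketch: discard prints from a noisy fit unless verbose=True."""
    def _noisy_fit():
        print("iteration 1: loglik=-123.4")  # stand-in for statsmodels output
        return "fitted"

    if verbose:
        return _noisy_fit()
    # capture and throw away anything printed during the fit
    with contextlib.redirect_stdout(io.StringIO()):
        return _noisy_fit()
```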
#### schedule for next weeks - does that change because of Katie's travels?
see above, table outlines this precisely
### Actions, short-term goals
* investigate micheleallegra/Hidalgo code -> does this run
* more generally: what are the options. Can we interface? reimplement? how does that compare?
* does the existing code run?
* existing Hidalgo has their own gibbs, could we use pymc4? e.g., is partial replace an option?
* implement from micheleallegra repo? Or implement from paper?
- read up on conda environment management, official documentation
- read up on updating or resetting/deleting environment (as necessary for current situation with kalman filter), todo: [conda env docs](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#updating-an-environment)
- https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
- read up on testing in vs code (pytest), official documentation
- ensure you can test via vs code GUI
- look at api designs for annotation
- main goal: continue on Hidalgo
- main goal: finish VARMAX
- failing doctest
## Week 6 - 2022-07-26
### work done
- multivariate data generating function
https://github.com/alan-turing-institute/sktime/compare/main...KatieBuc:sktime:data_generator_multivar
- finished writing Gibbs sampler in python, debugging and unit testing
https://github.com/micheleallegra/Hidalgo/compare/master...KatieBuc:Hidalgo:master
- python/gibbs_katie_read_random.py (c++ code in python)
- python/tests/test_hidalgo.py (unit testing)
- python/dimension/hidalgo.py (unfinished: docstrings, linting)
TODO:
- refactor gibbs (try different baseobjects) and hidalgo (get rid of for loops)
- move across to an sktime draft PR
- profile code (to learn)
TODO:
Investigate speed up of CI Test Routine
https://github.com/sktime/BaseObject/issues/24
### Agenda
- overview of work done
- LA had a quick look at multivariate data generation. makes sense. good docstring!
- KB quickly showed the three files related to hidalgo.
- LM. will eventually move away from C code as 'ground truth'. will introduce tests on simple inputs where we have good expectations for outputs. investigating the limits of the algorithm is not a 'test' as such, but more suitable for a notebook/research paper. no strong boundary though.
- KB. mentioned some variants in data generation. asked how hmm would do if you only changed covariance between segments. LM discussed some high level points about segmentation research. often segments are determined by means. in some applications, means do not change (e.g. acoustics?). third large group is periodic patterns (e.g. ekg) - detecting changes in periodic patterns. LM. the FLUSS matrix profile works well on these periodic patterns, and ruptures works well with means/variances.
- KB. very curious to see how hidalgo performs when we have different amount of covariance in multivariate series
- KB. do we want a function that generates means and lengths for us, rather than having the user specify them? LM: as long as we keep the ground truth information so we can evaluate the algorithm.
- LM. what would be even better would be 'scenario generator' that could have different combinations of the above and different level of complexities. this is not common at all! 'high complexity' in mean shift means 'lots of changes that are small and hard to find'.
- LA. recommended reading ruptures paper
- previous week's goals, review
- LA and LM for conda, big things are creating new environments, update and delete.
- pytest using gui. done!
- api designs. some discussions were had. want to summarise/record discussions
- continue on hidalgo. discussed above
- continue on varmax. done!
- goals for next week
- try deleting and creating a conda environment
- finish: hackmd for suite of anomaly detection APIs
- organise pair programming with us for hidalgo
---
- where to live? estimator
- sklearn object to wrap in sktime?
- LM: separate issue between user vs developer. for user, good docstring should suffice.
- LM: can use k-fold, and others that do not require ordered data
- to make explicit (via interface) reduction:
- forecasting, reduce given time series to predictions (create sklearn object, explicitly wrap to a forecaster object), instantiating the sklearn object.
- clustering timeseries data (benchmark)
---
## error in pytest discovery in VS Code (final week of July)
ERROR - Pytest discovery in sktime

(whereas I'm able to with the hidalgo repository, conda environment agnostic too)

conda run -n sktime-dev --no-capture-output python ~\.vscode\extensions\ms-python.python-2022.10.1\pythonFiles\get_output_via_markers.py ~\.vscode\extensions\ms-python.python-2022.10.1\pythonFiles\testing_tools\run_adapter.py discover pytest -- --rootdir . -s --cache-clear sktime
[ERROR 2022-7-1 12:10:50.504]: Error discovering pytest tests:
[n [Error]: 2022-08-01 12:09:36.821756: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-08-01 12:09:36.824194: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Exception: pytest discovery failed (exit code 2)
ERROR conda.cli.main_run:execute(49): `conda run python c:\Users\n10907700\.vscode\extensions\ms-python.python-2022.10.1\pythonFiles\get_output_via_markers.py c:\Users\n10907700\.vscode\extensions\ms-python.python-2022.10.1\pythonFiles\testing_tools\run_adapter.py discover pytest -- --rootdir c:\Users\n10907700\repos\sktime -s --cache-clear sktime` failed. (See above for error)
at ChildProcess.<anonymous> (c:\Users\n10907700\.vscode\extensions\ms-python.python-2022.10.1\out\client\extension.js:2:232783)
at Object.onceWrapper (node:events:510:26)
at ChildProcess.emit (node:events:390:28)
at maybeClose (node:internal/child_process:1064:16)
at Process.ChildProcess._handle.onexit (node:internal/child_process:301:5)]
ACTIONS:
1. installed older version of python extension as karthik suggested [here](https://github.com/microsoft/vscode-python/issues/18493)
- got the same message
- updated python extension version: no change
2. deleted environment, created new one
```
conda deactivate
conda env remove -n sktime-dev
conda create -n sktime-dev
conda install -c conda-forge sktime-all-extras
```
```
pip install -e .
```
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))
2.1. new SSL error: tried to upgrade pip itself, as suggested [here](https://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi)
`pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org pip setuptools`
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
---
## Week 7. 2nd August
### notes
- Katie described the past week. in BaseObject, has a task regarding caching conda environments.
- also been struggling with pytest discovery in VS Code. see bug described above. had a session with Lukasz, but it seems to work now.
- created draft pr for hidalgo. https://github.com/alan-turing-institute/sktime/pull/3158
- goals from last week
- conda deletion and creation. done
- hackmd suite for anomaly detection. still to do.
- organise pair programming
### goals for next week
- continue refactoring hidalgo. https://github.com/alan-turing-institute/sktime/pull/3158
- use numpy's `isclose`
- export conda environment, and then create environment based on exported file.
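On the `numpy.isclose` item: a quick refresher on why tolerance-based comparison is needed in float-heavy unit tests:

```python
import numpy as np

expected = np.array([0.1 + 0.2, 1.0 / 3.0])
actual = np.array([0.3, 0.3333333333])

# exact equality is fragile for floats (0.1 + 0.2 != 0.3 in binary)
assert not np.array_equal(expected, actual)

# element-wise comparison within tolerances is the robust alternative
assert np.allclose(expected, actual, atol=1e-6)
```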
### paused goals
- hackmd suite for anomaly detection.
- continue with baseobject task.
- discuss base object, interface design (methods called), end state
---
## Week 8. 9th August
### work done / notes
- (almost) READY FOR REVIEW (PR for Hidalgo) https://github.com/alan-turing-institute/sktime/pull/3158
`E AttributeError: 'Hidalgo' object has no attribute 'labels'`
### questions
- safest way to delete/retract pull request/branch? i.e. https://github.com/alan-turing-institute/sktime/pull/2828
* simply "closing"
- How do I download file (from VScode interface -> environment.yml)
* FK: why would you like to do that?
* KB: to generate a report, e.g., to share with others
* FK: interesting - don't know how to do this
* but probably others have thought about this already
* can you let me know if you find out a good way?
- Please resend the github link to collaborate on BaseObject (Franz)
* https://github.com/sktime/BaseObject
* reinvited with write access
Contacting authors of Hidalgo:
* FK: what are the aims of discussing the Hidalgo algo with original author?
* KB: to inform them, ask for review
Discussion - next algorithm?
* FK: think Noa made a list of algorithms from her literature research
* https://github.com/alan-turing-institute/sktime/issues/2868
* https://github.com/alan-turing-institute/sktime/issues/2820
* https://github.com/alan-turing-institute/sktime/issues/2826
* you could ask which one she is not working on and share
* or we could make a wishlist later
* revisit next mentoring meeting
### goals for next week
- finish addressing changes in data generator and merge
- https://github.com/alan-turing-institute/sktime/pull/3114
- discuss HidAlgo with original author.
- hackmd suite for anomaly detection.
- continue with baseobject task.
- discuss base object, interface design (methods called), end state
- try refactoring HidAlgo with different BaseObject design?
- FK: not needed - `baseobject` BaseObject is 1:1 compatible with current `sktime` `BaseObject`
- another algo?!?!
## Week 9. 19th August
### work done / notes
Multivariate data generator
Hidalgo as transformer (?)
Anomaly detection API notes ready to share
https://hackmd.io/o02lCf4aT22FAdXsq9JGaQ
### questions and notes
(hackmd) - but not sure about [LADStructuralModel](https://github.com/zillow/luminaire/blob/master/luminaire/model/lad_structural.py)
- LM. looks good. add a bit of commentary / highlights.
(hackmd) - if input is X does that automatically mean offline method? if input (to fit or predict) is a single number that means online?
- LM. for me, difference between offline and online is what happens to internal state of model. offline - no updating of state of model. online - can update model with new data. shape of data should not determine if it is online/offline.
- LM. given an offline algorithm, can always make it online. by tracking all data, and refitting on all past data. this is usually very inefficient.
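LM's "track everything and refit" idea can be sketched as a toy wrapper (hypothetical, not an sktime API; a z-score stands in for a real offline anomaly scorer):

```python
import numpy as np

class RefitOnlineWrapper:
    """Make an offline scorer 'online' by keeping all past data and
    refitting from scratch on every update - simple but inefficient."""

    def __init__(self):
        self._history = []

    def update(self, x):
        # online interface: absorb new observations, then refit on everything
        self._history.extend(np.atleast_1d(x).tolist())
        data = np.asarray(self._history)
        self._mean = data.mean()
        self._std = data.std() + 1e-9  # avoid division by zero
        return self

    def score(self, x):
        # toy anomaly score: absolute z-score against the refitted model
        return abs((x - self._mean) / self._std)

detector = RefitOnlineWrapper().update([1.0, 1.1, 0.9, 1.0])
assert detector.score(1.0) < detector.score(10.0)
```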
LA. before jumping into algorithm discussion, what are your goals for rest of mentorship?
- KB. would like to get 3 algorithms coded up. learn about more algorithms by coding them up.
- maybe also ci/cd via baseobject.
- LM. high level. two broad things to consider. one is tool-building (currently main focus of sktime). second is case-studies / using tools to create example pipelines and full workflows (problem definition, data preprocessing, model selection, parameter selection, metrics, visualisations, current weaknesses, discussion). could become a talk or blogpost or paper. would be good for both sktime and for KB personally.
- KB. would this be in notebook?
- LM. yes, that is good option.
- LA. this might not fit into mentorship timeframe, because all tools do not yet exist in sktime.
- LM. do not need tools to yet exist in sktime. could either write functions in notebook, or, use 'sub-optimal' strategies and have this as way of exploring workflows and ideas. e.g. 'i evaluate using sklearn classification metrics, but there are many other metrics in literature that are worth trying and implementing instead'
KB - currently thinking about three things: the STRAY algorithm (feels most comfortable with this), wrapping ruptures (less comfortable with this), doing a case study notebook (sounds like lots of micro-decisions to be made).
- LM. we are of course happy to help with any of these.
KB. current preference: case study notebooks / exploring workflows.
- KB: where would I get data, Kaggle? LM has a whole load of industrial data. can also generate simulated data - this should be simpler.
Next anomaly detection algorithm:
1. "You could probably start with the stray algorithm (https://robjhyndman.com/publications/stray/). It can work with any high-dimensional data set, not just time series, but we apply it to time series in that paper. It uses time series features, and then finds anomalies within the set of features -- it can be applied to a collection of time series, or use windowing on a single time series. There is already a good python library called tsfresh which computes time series features, and that works with sktime as far as I know. So implementing stray can then work with tsfresh as well."
2. "An alternative is to do dimension reduction on the feature space, and find outliers in the components of the reduced space. Our DOBIN algorithm is designed to do that:" https://robjhyndman.com/publications/dobin/
### goals for next week
- AnnotatorBaseObject design from scratch (Monday meeting)
- Merge Hidalgo PR
- Merge Data generator PR
- Decide on a data set (figure out a problem to solve?)
## Week 10. 23rd August
### work done / notes
merged data generator
two separate PR for hidalgo - need help with "Test instance not found for..."
### questions and notes
and also in Hidalgo PRs
2022-08-23T04:19:02.9836259Z =========================== short test summary info ============================
2022-08-23T04:19:02.9837056Z FAILED sktime/tests/test_softdeps.py::test_est_construct_without_modulenotfound[Hidalgo]
2022-08-23T04:19:02.9837686Z FAILED sktime/tests/test_softdeps.py::test_est_get_params_without_modulenotfound[Hidalgo]
2022-08-23T04:19:02.9838235Z FAILED sktime/tests/test_softdeps.py::test_est_fit_without_modulenotfound[Hidalgo]
computer crashes when running tests locally. Need to debug these manually.
FK: an alternative way to run tests is `check_estimator`, also see
https://www.sktime.org/en/latest/developer_guide/add_estimators.html
runs tests according to which base class you inherit from
#### data discussion
Some ideas (from top-rated Kaggle datasets):
1. Mall Customer Segmentation Data (clustering)
https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
* we can use the k-means clustering algorithm, multivariate data (non-time dependent)
2. Credit Card Fraud Detection (anomaly detection)
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
* PCA outputs, with time and amount (0/1 anomaly score)
** sktime methods for this? regression -> CNN?
3. Trending YouTube Video Statistics (?)
https://www.kaggle.com/datasets/datasnaek/youtube-new?select=CAvideos.csv
* not sure what to do with this, but panel data
4. European Soccer Database (exploratory analysis)
https://www.kaggle.com/datasets/hugomathien/soccer
* 25k+ matches, player & team attributes for European professional football
* example prediction below (could add more features)
https://www.kaggle.com/code/airback/match-outcome-prediction-in-football
example blog: https://www.embecosm.com/2021/12/18/forget-arima-going-bayesian-with-time-series-analysis/
could talk about state space models in sktime? hyperparameter selection?
find a long, univariate time series to segment - makes for an interesting example (e.g. stock prices, energy prices?)
### goals for next week
Merge Hidalgo
draft segmentation notebook with different generated data (mean shift, covariance change, *seasonality component)
*which will need to be coded in data_generator
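The kind of generated segmentation data described above can be sketched like this. `mean_shift_series` is a hypothetical helper, not the actual `data_generator` code; covariance change and a seasonality component would be analogous extensions.

```python
import numpy as np

def mean_shift_series(means, seg_len=100, sigma=1.0, seed=0):
    """Generate a univariate series with piecewise-constant mean.

    Sketch of segmentation test data: each entry of `means` gives the
    mean of one segment of length `seg_len`; noise is i.i.d. Gaussian.
    """
    rng = np.random.default_rng(seed)
    segments = [rng.normal(mu, sigma, seg_len) for mu in means]
    return np.concatenate(segments)

# three segments with clearly different means -> two change points
y = mean_shift_series([0.0, 5.0, -3.0], seg_len=200)
```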
## Week 11. 30th August
### work done / notes
1. Ready to merge: https://github.com/alan-turing-institute/sktime/pull/3158
- class STLTransformer(BaseTransformer)
- fix transform logic, call fit
2. STRAY anomaly detection PR: https://github.com/alan-turing-institute/sktime/pull/3338
- check_X ?? BaseSeriesAnnotator changes?
- change to pd.Series?
- (what kind of tags do we need to set? closest proxy with tags?)
- inherit from BaseTransformer?
3. began writing unit tests for DOBIN (pre-processing transformation)
- take arbitrary distances ElbowClassSum
- fkiraly's PR that extends the original class to arbitrary distances: https://github.com/alan-turing-institute/sktime/pull/3256
### questions and notes
- with HMM, what's the benefit of writing class methods?
- function as argument
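One way to picture the class-method vs function-as-argument question for the HMM: the same emission model can be a plain callable passed in, or a method on a class that holds the parameters. All names below are hypothetical; the class form pays off when parameters need to be stored, validated, or exposed (as in sktime's `get_params`/`set_params` machinery), while the callable form is lighter for one-off use.

```python
import numpy as np

# Option A: pass the emission log-density in as a plain function argument
def score_path(observations, states, emission_logpdf):
    """Log-likelihood of observations under a fixed state path."""
    return sum(emission_logpdf(x, s) for x, s in zip(observations, states))

def gaussian_logpdf(x, state, means=(0.0, 5.0), sigma=1.0):
    mu = means[state]
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Option B: the same behaviour as a method, with parameters kept on the object
class GaussianEmission:
    def __init__(self, means=(0.0, 5.0), sigma=1.0):
        self.means, self.sigma = means, sigma

    def logpdf(self, x, state):
        return gaussian_logpdf(x, state, self.means, self.sigma)

obs, states = [0.1, 4.9], [0, 1]
a = score_path(obs, states, gaussian_logpdf)        # function as argument
b = score_path(obs, states, GaussianEmission().logpdf)  # bound method works too
```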
### goals for next week
- merge Hidalgo
- STRAY: fix indexing tuple error and get it under review
- DOBIN, draft PR
- LOOKOUT, draft PR
## Next next week
- data preprocessing
- can we talk about what kinds of solutions?
## Week 12. 6th Sept
### work done / notes
1. STRAY: https://github.com/alan-turing-institute/sktime/pull/3338
- ready to merge (all checks passing, 2 unresolved conversations)
- one about output types including NAs. no consensus yet, so leaving as is
- update user guides, no normalization: pipeline functionality (pre-processing pipelines)
- docstring example/description with pipeline "the algo presented in this paper corresponds to the following pipeline"
2. DOBIN: https://github.com/alan-turing-institute/sktime/pull/3373
- 1 failing check (AssertionError: Estimator DOBIN should not change or mutate the parameter k from None to 2 during fit.)
3. scoping out LOOKOUT anomaly detection
https://github.com/alan-turing-institute/sktime/issues/3388
* ADVENTURE [not to infinity, day of reckoning: 13th Sept]
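The failing DOBIN check enforces the scikit-learn/sktime convention that `__init__` hyperparameters are stored as passed and never overwritten; a resolved default goes into a new fitted attribute (conventionally with a trailing underscore). A minimal sketch with hypothetical names:

```python
import numpy as np

class SketchEstimator:
    """Minimal sketch of the convention behind the failing DOBIN check:
    parameters set in __init__ must not be mutated during fit."""

    def __init__(self, k=None):
        self.k = k  # stored exactly as passed, never overwritten

    def fit(self, X):
        # resolve the default into a *new* fitted attribute instead of
        # writing back to self.k (which trips the parameter-mutation test)
        self.k_ = self.k if self.k is not None else min(2, np.asarray(X).shape[1])
        return self

est = SketchEstimator().fit(np.zeros((10, 5)))
```

After `fit`, `est.k` is still `None` while `est.k_` carries the resolved value, so `get_params` round-trips cleanly.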
### questions and notes
- with HMM, what's the benefit of writing class methods?
### goals for next week
- Merge STRAY (normalization pipeline)
- Merge DOBIN (unit test, review)
- Scope LOOKOUT, make decision next week
## Week 13. 13th Sept
### work done / notes
1. STRAY to merge: https://github.com/alan-turing-institute/sktime/pull/3338
2. DOBIN to merge: https://github.com/alan-turing-institute/sktime/pull/3373
3. researched LOOKOUT anomaly detection (not keen to pursue) https://github.com/alan-turing-institute/sktime/issues/3388
4. researched E-Agglo clustering https://github.com/alan-turing-institute/sktime/issues/3397
### questions and notes
should E-Agglo be in annotation or clustering?
### goals for next week
draft PR for E-Agglo
## Week 14. 20th Sept
### work done / notes
1. Draft PR for E-Agglo clustering: https://github.com/alan-turing-institute/sktime/actions/runs/3087388402/jobs/4992705882
### questions and notes
There's an infinite loop where pre-commit edits the file to remove whitespace after a comma, then complains about no whitespace after the comma.
* FK: how odd, which file?
### goals for next week
* get all the PRs merged!
* final feedback round, next steps, think about this. probably for early Oct
examples:
- sktime.org
- blog e.g. https://guzal.hashnode.dev/
- update this issue https://github.com/sktime/mentoring/issues/23
- write open issues for others (e.g. Hidalgo hyperparameter fitting, etc.)
reflecting on learning goals and feedback for last mentoring session
## Week 15. 27th Sept
### work done / notes
1. E-Agglo ready for review: https://github.com/alan-turing-institute/sktime/pull/3430
### questions and notes
How to utilize numpy broadcasting / numpy functionality (notebook)
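On the broadcasting question: inserting length-1 axes lets numpy expand arrays against each other, which replaces explicit Python loops in pairwise computations (the pattern that shows up in E-Agglo-style distance sums). A small self-contained example:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

# (3, 1) broadcast against (1, 3) yields the full (3, 3) matrix of
# pairwise differences x_i - x_j, with no explicit loop
pairwise_diff = x[:, None] - x[None, :]

# the same trick vectorizes pairwise distance sums: sum of |x_i - x_j|
# over all ordered pairs (i, j) in one expression
total = np.abs(pairwise_diff).sum()
```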
### goals for next week
* Merge E-Agglo!
* next week is the last week! let's have the final feedback session
* any loose ends or hand-over?
* summary of your thoughts on the mentorship
* what worked well, what worked less well?
* where are you vs original learning goals? did you learn what you wanted?
* where are you vs original project goals?
* positive feedback, critical feedback (appreciated)!
* mentors, mentoring style
* sktime community
* org around the internship
* next steps (if any)?
* thanks for volunteering to help with the fall dev sprint
* presentation at dev sprint?
* completing mentoring entry in sktime/mentoring repo
* does GSoC want you to write a concluding blog post or a report? (can link that)
* GSoC submission
## Week 16, 4th Oct
### feedback round!
KB.
- genuinely loved experience, in more ways than one.
- not much bad to say.
- found very helpful.
- daily check-ins. written option that was introduced a few weeks in. particularly useful when in a state of flow, didn't want to break concentration. (had to use meeting room to attend daily). particularly when LM was there, willing to pick up a shovel and dig into wild problems they came across.
- generally being very accommodating. fk coordinating meetings. felt that I was welcomed and considered
- financially generous, too.
- super well organised. e.g. good documentation.
- weekly meetings are well structured. having somebody write notes is helpful - though this just happened, not planned.
- things i would do differently
- scope out algorithms more at start. e.g. hidalgo took lots of time and could have anticipated this.
- felt anxious about making good progress. travel plans also affected this. to help bypass anxiety, be more realistic about goals for the project. help be comfortable not working during the travels
- interview was most challenging part of gsoc! coming out of that felt like i did not showcase what i knew. be more flexible with questions. mainly to put interviewee at ease. there were things that i clearly did not know and spent significant time on that. but extra time does give chance for person to reveal what they know.
- idea of question at end: is there anything you want to showcase or explain that you felt like you could not explain in interview so far.
fk: any constructive comments for anything else?
kb. not at all. really was not expecting this much, e.g. compared to academic experiences. like the amount of time spent on nuanced detail. liked how each of us have different styles, so good combination.
kb. would have been good to get the base class settled! not sure if there is a solution to this, but that would have been good. only thing that comes to mind
lm.
technical feedback:
- you have demonstrated good technical skill/knowledge. able to understand problem and map to code.
- learnt to read other people's code, understand it, and adapt it. both into python *and* into sktime
- very positive: level of improvement from original code base to code that you wrote.
- areas of development on the technical side: invest time in learning numpy, scipy and pandas. this is a lifelong journey. getting tools from there into your toolkit. learning one-line solutions to common situations.
personal
- quickly became independent. i remember session about how to setup vscode, debugging, etc. realised how much is being 'dumped' on people. but you absorbed very quickly and pragmatically.
- only stopped when you had problems.
- had nice cadence. got stuff in quickly. did not get dragged down by not understanding everything. many ppl get held up trying to understand all the nuance of the package.
- good that you had some involvement in base object, release.
- you've proven you can absorb technical aspects of coding quickly
- kudos for time management. sometimes meetings are less important than finishing what you're doing. had several instances where you cancelled meeting because you were in good place and ready to go. creates situation where we only talk about problems - which is good!
- ability to say no and decide where you provide value is very important. so the fact you did this here is excellent!
community aspect.
- came to dev days in london. that was fantastic.
- helping with new dev days. admirable!
- personal note. enjoyed working with you. easy to collaborate with. openness of approaching problems. receptiveness to feedback. clear you cared about problem and seeking best solution. you show you are listening, and trying to verify you understand. active listening is important skill.
- really enjoyed pair programming sessions.
- hope you stay in sktime, or another open source, so can you pass this forward
- helps us develop as humans! coding is not something one only does in isolation. about communicating. and that is one of your strong suits.
summary
- really enjoyed. you've made massive contributions. kudos and congratulations! hope you will stay.
la.
- impressed by how much progress you made. speed of prs compared to first pr is huge.
- very pleasant, in terms of personality and communication style.
- got a sense of hesitation to ask for help at start. (this was correct). in which case, big congratulations for overcoming that!
- possible area of improvement: put more consideration on 'good' coding practice. ones I notice are writing docstrings up front rather than at the end, and using self-explaining variable names (rather than single letter names)
- interview note. you were one of the three selected, so you were in the top three out of all candidates! useful learning experience would be to be interviewer yourself.
fk.
- echo what lm has said. large improvement.
- under-estimation. in hindsight, don't underestimate yourself.
- technical aspect. you learnt a lot. hidalgo - no problem for choosing this. learning lots of things in parallel, so not a 'waste of time'.
- did also feel hesitation to communication. that has developed really well. perhaps has to do with overcoming anxiety or perhaps imposter syndrome.
- there were growth areas. (la did not note them down)
- consider becoming coredev!