# Diatom Shapes Project
## **Meeting Notes**
### To-Do:
1. Fix annotation issue (ASJ).
1. Newer dataset with a different ratio (so more testing) (ASJ).
1. Re-run original with Encyonema and others (SK).
1. Run CNN on new data (SK).
1. Future: get more data.
1. Create new model for binary classification: Gomphonema vs. non-Gomphonema.
---
**Meeting Date**: 25/01/2024
**Location**: PhD Offices, UCL
**Immediate To Do:**
- Get dataset of tip outlines (ASJ).
- Run Karcher means / tPCA and geodesic distances on tip outlines (SK).
- Create CNN JSON for multiclass classification (ASJ)
- Create CNN JSON for Gomphonema CNN (ASJ)
- Send detectron notebook to SK (ASJ)
- Multiclass CNN (filtered species) (SK)
- Gomphonema CNN (SK)
- Send cropped images to SK (ASJ)
- ~~Create metadata CSV with additional columns such as Lake, Year, Species, and environmental info (acidity) (SK).~~
**Minor To Do:**
- ~~Install Github desktop (SK)~~
- ~~Push code (exc. KNN .py file) to Github (SK)~~
- ~~Push KNN code to Github (ASJ)~~
- Send SK detectron colab notebook (ASJ)
**Other / In Future:**
- Include size (or width) as a parameter to ML code.
- Merge CNN w. morphological analysis on curves.
- Cluster within Gomphonema species.
- Try other clustering method.
- ~~Plot tPCA with other groups highlighted (e.g., Lake, Year etc)~~.
- CNN on videos from microscope.
- Frame cropping from microscope videos.
- Detectron2 repository: https://github.com/facebookresearch/detectron2
---
**Meeting Date**: 13/12/2023
**Location**: Informatics, NHM
**Big Picture**:
- CNN to automatically segment and classify diatoms from images.
- Use CNN to create dataset of diatoms per image.
- From images, get outlines of diatoms. (Hopefully using basic contour extraction).
- Shape analysis on diatoms for GM, per site.
- New pipeline for analysing shapes of diatoms, automatically, from images.
**Next Steps**:
- Fix bug in KNN code. (ASJ)
- Send over CNN code. (ASJ)
- ~~Plot Karcher Means. (SK)~~
- Publish repository on Github. (ASJ)
- ~~Get Github. (SK)~~
- Plot tPCA results (e.g., PC1 vs PC2 for all species on one graph, 3D plots, etc.). (SK)
---
## Other Notes
**Things to Check/Ask**:
- Were curves closed?
- How many points per curve - should we increase?
- Is it worth smoothing the curves?
- How to define "tips".
- K-means clustering on curves.
- Cropping images to have more images. Then splitting into training/testing/val -- possible to have multiple curves per cropped image.
- Creating a dataset of pairwise distances and diatom area (or some other variable?)
- Genus level classification
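
The questions above about closure, point count, and smoothing can be tried on a single outline before touching the whole dataset. The following is a minimal sketch, assuming an outline is stored as a list of `(x, y)` tuples. The helper names `is_closed`, `resample_closed`, and `smooth_closed` are hypothetical and not part of the project code, and the moving average is only one possible choice of "basic smoothing".

```python
import math

def is_closed(points, tol=1e-9):
    """True if the first and last points coincide (within tol)."""
    return math.dist(points[0], points[-1]) <= tol

def resample_closed(points, n_points):
    """Resample a closed outline to n_points spaced equally along its perimeter.

    The result does not repeat the first point at the end.
    """
    pts = list(points)
    if len(pts) > 1 and is_closed(pts):
        pts = pts[:-1]  # drop the repeated end point
    m = len(pts)
    # Segment lengths, including the closing segment back to the first point.
    seg = [math.dist(pts[i], pts[(i + 1) % m]) for i in range(m)]
    step = sum(seg) / n_points
    out = []
    i, start = 0, 0.0  # current segment and the arc length where it begins
    for k in range(n_points):
        target = k * step
        while i < m - 1 and start + seg[i] < target:
            start += seg[i]
            i += 1
        t = (target - start) / seg[i] if seg[i] > 0 else 0.0
        (xa, ya), (xb, yb) = pts[i], pts[(i + 1) % m]
        out.append((xa + t * (xb - xa), ya + t * (yb - ya)))
    return out

def smooth_closed(points, window=3):
    """Circular moving average over an odd-sized window (wraps around the curve)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    m = len(points)
    out = []
    for i in range(m):
        xs = [points[(i + j) % m][0] for j in range(-half, half + 1)]
        ys = [points[(i + j) % m][1] for j in range(-half, half + 1)]
        out.append((sum(xs) / window, sum(ys) / window))
    return out

# Toy outline (a unit square), not real diatom data.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
resampled = resample_closed(square, 8)
smoothed = smooth_closed(resampled, window=3)
```

Comparing geodesic distances before and after `smooth_closed`, and for a few values of `n_points`, would be one way to judge whether smoothing or denser sampling changes the classification results.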
---
- Worth symmetrizing diatoms about the longer (apical) axis (e.g., like Greek vases).
- test basic smoothing
- width (along x axis) - add this variable to the geodesic distances.
- ASJ will send kiwi notebooks to combine distance matrices.
- Try classification on genus level.
- Kmeans https://uc-r.github.io/kmeans_clustering
- Create those 2 datasets
1. ~~SK will send over updated meta CSV (with species).~~
1. ASJ will create a notebook to do basic smoothing and symmetrization.
1. ASJ will create a dataset of smoothed, symmetrized diatoms.
1. ASJ will create a dataset of half-diatoms.
1. SK will run previous classification algorithms and tPCA / KMs on the two new datasets.
1. SK will test out K-means clustering on the original for now (and then on the other two datasets).
1. ~~SK will draw bounding boxes on original images and then send that to ASJ.~~
1. ASJ will create dataset of cropped images based on SK's bounding boxes.
1. ASJ, SK will run CNN (detectron) algorithm on cropped images.
1. ~~SK will gather general width data for diatoms.~~
1. ASJ will compute widest width (in x axis).
1. ASJ will send SK notebook to combine geodesic distance matrix with width distance matrix for classification.
1. ~~SK will re-run classification on genus-level.~~
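
One possible way to combine the geodesic distance matrix with a width distance matrix, as planned in the list above, is to rescale each matrix to [0, 1] and take a weighted sum, then classify with nearest neighbours on the combined distances. This is only a sketch under assumptions: the function names, the `weight` parameter, and the toy numbers are hypothetical, and the notebook ASJ sends may combine the matrices differently.

```python
from collections import Counter

def width_distance_matrix(widths):
    """Pairwise absolute differences in width."""
    return [[abs(a - b) for b in widths] for a in widths]

def rescale(matrix):
    """Divide by the largest entry so that values lie in [0, 1]."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [list(row) for row in matrix]
    return [[v / peak for v in row] for row in matrix]

def combine(geodesic, width, weight=0.5):
    """Weighted sum of two rescaled distance matrices (weight applies to width)."""
    g, w = rescale(geodesic), rescale(width)
    n = len(g)
    return [[(1 - weight) * g[i][j] + weight * w[i][j] for j in range(n)]
            for i in range(n)]

def knn_predict(dist_row, labels, k=1, exclude=None):
    """Majority label among the k nearest items, optionally skipping one index."""
    order = sorted((j for j in range(len(dist_row)) if j != exclude),
                   key=lambda j: dist_row[j])
    votes = Counter(labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]

# Toy values for four outlines; not real measurements.
geodesic = [[0.0, 0.2, 0.9, 1.0],
            [0.2, 0.0, 0.8, 0.9],
            [0.9, 0.8, 0.0, 0.1],
            [1.0, 0.9, 0.1, 0.0]]
widths = [5.0, 5.5, 12.0, 11.0]
labels = ["A", "A", "B", "B"]

combined = combine(geodesic, width_distance_matrix(widths), weight=0.5)
# Leave-one-out prediction: each item is classified from all the others.
predictions = [knn_predict(combined[i], labels, k=1, exclude=i)
               for i in range(len(labels))]
```

The `weight` value controls how much width matters relative to shape; it would need tuning (for example by cross-validation) rather than being fixed at 0.5.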
---
Potential Things for SK to Try:
1. Kmeans https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
2. Random Forest https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
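
As a starting point for the two scikit-learn methods linked above, here is a minimal sketch. It assumes each diatom has already been reduced to a fixed-length feature vector (for example tPCA scores); the synthetic two-group data below is a stand-in for illustration only, not project data, and the parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-diatom features (e.g. two tPCA scores per outline).
group_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 30 + [1] * 30)

# Unsupervised: K-means assigns each row to one of two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Supervised: random forest trained on a stratified train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)
```

Note that `KMeans` expects feature vectors rather than a precomputed distance matrix, so the geodesic distances cannot be passed to it directly.
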
Useful Links:
1. Colab - https://colab.research.google.com/
2. Kloster's paper - https://pubmed.ncbi.nlm.nih.gov/36827378/
3. Sherpa Link: https://www.uni-due.de/phycology/sherpa
To-Do:
1. ~~Create fake diatom images for CNN. (ASJ)~~
2. Run CNN on fake diatoms. (SK - but ASJ will send code)
3. Get outlines from CNN (e.g. Segment Anything) (ASJ)
4. Run KNN on CNN outlines.
5. ~~Symmetrize using Karcher mean way (ASJ)~~
6. Run KNN on Symmetrized curves (SK).
7. Run template shortest match method on original (SK).
8. Run template shortest match method on symmetrized (SK).
9. Find measurements of diatoms from images. (ASJ will think)
# Code
```
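import os
import pandas as pd

# Folder of Karcher-mean CSVs, one file per group (placeholder path).
path_to_means = 'folder_paths'
all_means = {}
for file in os.listdir(path_to_means):
    mean_df = pd.read_csv(os.path.join(path_to_means, file))
    # Store each mean curve as [x-coordinates, y-coordinates].
    all_means[file] = [list(mean_df['X']), list(mean_df['Y'])]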
```
```
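import tqdm

# Assumes df (columns Filename, X, Y), names, n, all_means and geodDistance
# are already defined elsewhere.
closest_matches = {}
for i in tqdm.tqdm(range(n)):
    curve = df[df['Filename'] == names[i]]
    x1 = list(curve['X'])
    y1 = list(curve['Y'])
    d_min = float('inf')
    closest_match = ''
    # Loop over the keys (file names), not the values, so all_means[...] works.
    for mean_name in all_means:
        kmx = all_means[mean_name][0]
        kmy = all_means[mean_name][1]
        d, _, _, _ = geodDistance(x1, y1, kmx, kmy, k=7)
        if d < d_min:
            d_min = d
            closest_match = mean_name
    # Record the name of the closest mean curve for outline i.
    closest_matches[i] = closest_match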
```
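
The matching loop above depends on `geodDistance`, which is not shown in these notes. To check the matching logic on its own, the sketch below swaps in a plain mean point-to-point Euclidean distance (`mean_pointwise_distance`, a hypothetical stand-in that is not the geodesic distance) and wraps the search in a hypothetical helper, `closest_template`. The template names and coordinates are made up for illustration.

```python
import math

def mean_pointwise_distance(x1, y1, x2, y2):
    """Stand-in distance: mean Euclidean gap between corresponding points.

    Assumes both curves have the same number of points in the same order.
    """
    if not (len(x1) == len(y1) == len(x2) == len(y2)):
        raise ValueError("curves must have the same number of points")
    gaps = (math.hypot(a - c, b - d) for a, b, c, d in zip(x1, y1, x2, y2))
    return sum(gaps) / len(x1)

def closest_template(x, y, templates, distance=mean_pointwise_distance):
    """Return (name, distance) of the template curve closest to (x, y)."""
    best_name, best_d = None, float("inf")
    for name, (tx, ty) in templates.items():
        d = distance(x, y, tx, ty)
        if d < best_d:
            best_name, best_d = name, d
    return best_name, best_d

# Toy templates in the same [X list, Y list] layout as all_means.
templates = {
    "narrow.csv": [[0, 1, 2, 3], [0, 0.2, 0.2, 0]],
    "wide.csv": [[0, 1, 2, 3], [0, 1, 1, 0]],
}
query_x, query_y = [0, 1, 2, 3], [0, 0.9, 1.1, 0]
match, match_distance = closest_template(query_x, query_y, templates)
```

To use the project's own distance instead, a wrapper such as `lambda x1, y1, x2, y2: geodDistance(x1, y1, x2, y2, k=7)[0]` could be passed as `distance`, assuming `geodDistance` returns the distance as the first of four values as in the loop above.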