# Introductions.
# Background: How are we using flow cytometry? Instruments? Science?
# Data formats and storage.
# Tools - proprietary and open source software. Compute platforms.
# Visualisation - Pulse shapes - Images.
# Clustering.
# Machine learning.
# Collaboration opportunities.
# Next steps.
Project outputs in this period and in future
1. Intros
2. Background
- longer term aims with cytometry - can we use it as a way to get to more general image analysis
- Cefas collaborating with another institute (?) on merging data from multiple machines, to look at a wider group of organisms (in marine plankton)
- Ecotaxa - free tool for recognition of plankton. Comes from the marine environment. Phytoplankton and zooplankton, but not the smaller plankton ("tiny plankton" - meaning picoplankton?)
There is a cost if you are not a member of existing projects
- Deep learning and convolutional neural networks that are being used require huge amounts of training data, and don’t generalise well. Issue with “explainable AI” too – we don’t know how the algorithms arrive at the answers they do (what discriminators are being used to classify objects). A change of camera can completely undermine the performance of the algorithms!
- Large vision models may be coming along
- Also interest in looking at whether the diversity of images could be an indicator of biodiversity, rather than going to ID of images. Perhaps akin to measures like functional diversity and dispersion.
Cefas context - interested in fish stocks and change over time. Sampling and acoustics are used, but these still represent a small sample. It would be interesting to get higher-frequency sampling of the whole food chain - density and abundance of copepods - and look at correlations with acoustics and flow cytometry. This has been difficult in the past with siloed groups, making it hard to get joined-up fisheries models, e.g. how do phytoplankton blooms affect the fishery. Within this, dealing with big data has been a problem, and losing data in the process of dealing with it.
Need to find new ways of working with large datasets.
2 applications here - monitoring, and what we can do with the data. High-frequency monitoring is good for capturing variability in abundance and diversity. Zooplankton variability is more complicated and needs more data; phytoplankton are more homogeneous. Want to see how the phytoplankton community switches between larger and smaller sizes and the effect on the carbon cycle. Estimating carbon is quite hard.
Even flow cytometers have a lot of variation...
UKCEH has 15 years of data with an older cytometer. There is some overlap with the new machine but the outputs are not aligned at all. Datasets need to be joined but this is very hard. Some correction is possible.
Cytometry data may be too protocol-specific, so moving to different environments (rivers to ponds) is not possible, whereas this seems as if it should be possible with imagery.
Some groups can only be separated by secondary pigments, but this is not possible with the cytometer.
Veronique has another flow cytometer and can get more in this area using pulse shape, but ends up with fewer images. Isabelle's machine only gets broad characteristics of the pulse.
With expert knowledge you can identify things from the pulse shape but still need to know how to set up the machines each time.
There is an initiative to try to standardise things, so there is something to work from but it is still in early stages.
Cefas also talking about getting a FlowCam.
Different conditions will always mean that the plankton will give different results due to storage conditions, lighting conditions, etc. Images should generally be the same in this regard.
There is information in the cytometry that you can't get from the imagery, in relation to the pico-plankton, which are small but significant in total.
Cefas instrument - gets images for some particles, and pulse shapes for all particles. Whilst pulse shape may not tell you more than the image, it can be used to properly ID things that do not have images. You could use linked flow cytometry data and labelled images to build predictive classification tools for cases where you only have flow cytometry particle data.
Absolute values of fluorescence, etc., may change, but it is probably more about the shape - so could be normalised.
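A minimal sketch of the kind of shape-only normalisation mentioned here, assuming pulses arrive as 1-D arrays of detector readings (the function and baseline handling are illustrative, not the instruments' own processing):

```python
import numpy as np

def normalise_pulse(pulse):
    """Normalise a 1-D pulse shape so comparisons focus on shape,
    not absolute fluorescence/scatter intensity.

    `pulse` is assumed to be an array of detector readings over time
    for a single particle (an assumption about how the data arrive)."""
    pulse = np.asarray(pulse, dtype=float)
    shifted = pulse - pulse.min()      # crude baseline removal
    area = shifted.sum()
    if area == 0:
        return shifted                 # avoid dividing by zero for empty pulses
    return shifted / area              # unit-area pulse: shape only

# Two pulses with the same shape but different gain give the same result
a = normalise_pulse([1, 2, 5, 2, 1])
b = normalise_pulse([10, 20, 50, 20, 10])
assert np.allclose(a, b)
```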
What other properties are collected (e.g. width)? FlowCam collects a lot of other metadata; this all belongs to the image.
Isabelle's Attune CytPix doesn't get size, etc. It provides, for each particle, an image, forward scatter, side scatter and fluorescence (three lasers – red, blue, violet). It excites with 3 lasers but gets one output from all of these, so you get 13x3 values plus the image, per cell: 11 fluorescence channels plus forward and side scatter, each with height (amplitude), width and area.
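As a concrete illustration of that per-cell record (13 parameters x 3 statistics plus the image), a minimal sketch of one way to lay it out in a tidy table; the channel names and statistic labels below are placeholders, not the instrument's actual export fields:

```python
import numpy as np
import pandas as pd

# 13 parameters per particle: 11 fluorescence channels + forward and side scatter
# (names are placeholders, not the instrument's actual channel labels).
PARAMETERS = [f"FL{i}" for i in range(1, 12)] + ["FSC", "SSC"]
STATISTICS = ["height", "width", "area"]   # 3 statistics per parameter -> 13 x 3 values

def make_particle_record(values, image):
    """Build one row of a per-particle table.

    `values` is a (13, 3) array of measurements; `image` is the particle image
    (e.g. a numpy array), kept separate since it doesn't fit a flat table well."""
    values = np.asarray(values)
    record = {f"{p}_{s}": values[i, j]
              for i, p in enumerate(PARAMETERS)
              for j, s in enumerate(STATISTICS)}
    record["image"] = image
    return record

# Example: a table of two fake particles with random values and tiny blank images
rows = [make_particle_record(np.random.rand(13, 3), np.zeros((32, 32), dtype=np.uint8))
        for _ in range(2)]
df = pd.DataFrame(rows)
print(df.filter(like="FSC").head())   # the three forward-scatter statistics
```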
Veronique working with a network of people on flow cytometry. Different ways of working, but trying to harmonise e.g. putting beads in the samples. Helps to see if laser is correct, etc.
Isabelle doesn't use beads. Used to with non-imaging cytometer for the count. Now the abundance comes anyway.
Could the beads data be used to calibrate things? Should be trying to harmonise into a universally recognised measure.
Could be standards in other areas (e.g. health) but these might be hard to understand!
FAIRness of data.
Use of lasers has moved to focus on phytoplankton but is then missing a lot from larger and smaller parts of the community. These other groups (e.g. heterotrophic protists, fungi, rotifers...) have important roles in the ecosystem.
Looking at 4 orders of magnitude of size differences.
Want to know what we can and can't cover with the instruments.
Flow cytometer good for smaller things, flowcam macro may be good for the bigger things. Want to understand how to best cover this range of sizes, and organism functional types.
OBAMANext project (??) looking at the step to join plankton analysis and remote sensing data.
Lombard 2019 paper (https://www.frontiersin.org/articles/10.3389/fmars.2019.00196/full) looks across the space of size, pigments, etc. and shows which instruments cover the space.
Heather has written a paper on the river Thames and phytoplankton in a similar way: https://www.sciencedirect.com/science/article/pii/S0048969717335544
Sometimes there is a huge abundance of things (?) that aren't usually looked at. Need to make sure these are being picked up by automated methods.
Check in with agenda:
- look at some flowcam data
- looking at data and file formats
FlowCam - have the VisualSpreadsheet software that comes with the machine. Found that you can save individual images but these have to be selected; they are saved into collages, which have been classified. Very difficult to save them individually. Have asked FlowCam if there is a different way - if we upgrade the software this could be possible. Is there a way (there seem to be online examples) of using code to extract and save individual images?
Ezra - looking at ImageJ (open source) - writing a macro to run within the software to extract specific images from a microscope slide. It detects edges and makes the biggest shapes possible. Tested on Isabelle's data and it wasn't that good.
Sari Giering has a Python script for extracting images from the .lst file.
Also, get a "true colour" image and a mask of the image...
FlowCam can take photos of multiple particles in one photo. The metadata is then incorrect and harder to interpret.
These companies (Fluidic) work very closely with LOV. Are we linked with them at all?
Sometimes they have downgraded the quality of the image when they make the collage. Rob uses PIL and OpenCV for some of these things. Usually the segmentation algorithms work quite well.
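A hedged sketch of that kind of collage splitting with OpenCV (thresholding plus contour detection), in the spirit of the PIL/OpenCV approach mentioned above; the light-background assumption and the minimum-area filter are guesses that would need checking against real FlowCam collages (or replaced by using the mask image):

```python
import cv2

def extract_particles_from_collage(collage_path, out_prefix, min_area=100):
    """Segment a collage image into individual particle crops.

    Assumes particles appear as darker regions on a light background
    (an assumption -- real collages may need a different threshold,
    or the accompanying mask image could be used instead)."""
    collage = cv2.imread(collage_path)
    grey = cv2.cvtColor(collage, cv2.COLOR_BGR2GRAY)
    # Otsu's threshold, inverted so particles become white foreground
    _, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    saved = []
    for i, contour in enumerate(contours):
        if cv2.contourArea(contour) < min_area:
            continue                          # skip tiny fragments / noise
        x, y, w, h = cv2.boundingRect(contour)
        crop = collage[y:y + h, x:x + w]
        path = f"{out_prefix}_{i:04d}.png"    # PNG to avoid lossy re-compression
        cv2.imwrite(path, crop)
        saved.append(path)
    return saved
```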
Ezra and Erica - could talk to Heather to dig into the documentation and formats.
We have good links with Sari Giering - now back 2 days per week.
Also, Veronique could help if talking to Fluidic.
May be other useful metadata in the tif image.
Beware lossy compression when converting images (e.g. to JPEG; PNG is lossless), and record the provenance of any derived files.
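A minimal sketch of doing both at once with Pillow: saving crops losslessly as PNG and attaching basic provenance in the PNG text metadata (the metadata keys are only examples, not an agreed standard):

```python
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def save_with_provenance(image, out_path, source_collage, particle_id):
    """Save a particle crop as lossless PNG with provenance recorded
    in the PNG text chunks (keys here are illustrative)."""
    info = PngInfo()
    info.add_text("source_collage", source_collage)
    info.add_text("particle_id", str(particle_id))
    image.save(out_path, format="PNG", pnginfo=info)

# Reading the provenance back:
# Image.open(out_path).text  -> {'source_collage': ..., 'particle_id': ...}
```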
CSV files come with headers including:
- Particle ID (doesn't match the label used in the collage image)
- Aspect ratio
- Circularity (Hu)
- Diameter (ESD)
- Edge gradient
- Original reference ID
- Source image (number)
The .LST file has much more than this and links to the GUIDs of the images...
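A hedged sketch of linking that per-particle metadata to extracted image files via an ID column; the column names follow the header list above but the exact export names (and the .lst layout) are assumptions - Sari Giering's script would be the authoritative reference:

```python
import pandas as pd

def link_metadata_to_images(csv_path, image_index_csv):
    """Join per-particle features to extracted image paths.

    `csv_path`        : per-particle export (column names assumed below).
    `image_index_csv` : a table we build ourselves when extracting images,
                        with columns 'particle_id' and 'image_path'.
    """
    features = pd.read_csv(csv_path)
    images = pd.read_csv(image_index_csv)
    # Column name 'particle_id' is an assumption -- check the real export header,
    # especially since the particle ID may not match the label in the collage.
    linked = features.merge(images, on="particle_id", how="left",
                            validate="one_to_one")
    missing = linked["image_path"].isna().sum()
    print(f"{missing} particles have no extracted image")
    return linked
```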
Need to link in to Scivision project...
Data formats
- There is an FCS binary format used in the medical flow cytometry field (something of a known standard). See: https://rdrr.io/bioc/flowCore/man/write.FCS.html (a Python-side reading sketch is given after this list)
- Paper on a workflow for flow cytometry data: "flowCore: a Bioconductor package for high throughput flow cytometry", BMC Bioinformatics
- Cefas looking into adopting this approach rather than JSON
- Noushin would be good to talk to about building datasets, as she is doing so for the Cefas plankton imager. She has written code to place images in categories and has been developing ResNet-50 algorithms (ResNet-50 convolutional neural network, via MATLAB's resnet50). She may have re-usable code.
- When we have good, labelled images, we can look at general AI tools that could be applied to images from several instruments.
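flowCore is R/Bioconductor; for the Python side, a minimal sketch of reading an FCS file with the fcsparser package (assuming standard FCS files and that the package is installed; the file path is a placeholder):

```python
import fcsparser

# Parse an FCS file into metadata (a dict of the TEXT-segment keywords) and a
# pandas DataFrame with one row per event/particle and one column per channel.
meta, events = fcsparser.parse("sample.fcs", reformat_meta=True)

print(meta.get("$TOT"))          # number of events recorded in the file
print(events.columns.tolist())   # channel names, e.g. scatter and fluorescence
print(events.describe())         # quick per-channel summary statistics
```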
Potential to do both a simple model (decision tree) based on feature information (either re-extracted or from FlowCam data) and an NN-type model (ResNet-50, used by Noushin), so there is a comparison between them. Also look at kNN models.
Classical features - Noushin has a project using these alongside the image, e.g. size and shape of the particle, and the location it came from. Also multi-modal data.
Could start with introducing the simple data... classical features ... and then move onto the full flow cytometry dataset.
Could investigate how beneficial this information is in addition to the image.
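A hedged sketch of the "simple models on classical features" half of that comparison, using scikit-learn; the feature column names and the existence of a labelled per-particle table are assumptions, and the ResNet-50 image model would be trained separately for the comparison:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed input: a labelled per-particle table with classical features
# (column names are illustrative, echoing the FlowCam-style headers above).
FEATURES = ["aspect_ratio", "circularity_hu", "diameter_esd", "edge_gradient"]

def compare_simple_models(table: pd.DataFrame, label_col: str = "label"):
    """5-fold cross-validated accuracy for a decision tree and a kNN classifier
    on classical features, as a baseline to set against the image-based model."""
    X, y = table[FEATURES], table[label_col]
    models = {
        "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}
```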
Also interesting to understand which part of the image the model is using most - make sure it is a useful feature...
For FlowCam, need to decide whether to merge libraries from previous work (where some are natural samples and some from mesocosms) or whether to start from scratch. May need to consider some acceptance criteria, e.g. around size. Can we get quantified classification uncertainties for each image? We could use these to find the "optimal" level of classification (the number of groups, level of discrimination).
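On quantified uncertainties, a minimal sketch of one common option - normalised entropy of a model's predicted class probabilities (softmax output of a ResNet-50, or predict_proba from a simpler classifier); thresholding it for review is only a suggestion, not an agreed workflow:

```python
import numpy as np

def prediction_uncertainty(probs):
    """Normalised entropy of a vector of class probabilities:
    0 = completely confident, 1 = maximally uncertain."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    probs = probs / probs.sum()
    entropy = -(probs * np.log(probs)).sum()
    return entropy / np.log(len(probs))

# Example: confident vs ambiguous predictions over 4 classes
print(prediction_uncertainty([0.97, 0.01, 0.01, 0.01]))  # close to 0
print(prediction_uncertainty([0.25, 0.25, 0.25, 0.25]))  # 1.0

# Images above some uncertainty threshold could be flagged for expert review,
# or used to decide how finely the groups can be split.
```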
Cefas moving to WSL (Windows Subsystem for Linux)
ML work - Cefas - a few go-to algorithms:
- ResNet - in Keras and TensorFlow, etc. Rob / Noushin could help to carve out something to use
- For object detection and classification - YOLO
Also could be useful to look into classical feature extraction and metrics like those that FlowCam produces, but we may need to produce these independently.
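A hedged sketch of re-deriving FlowCam-style metrics independently from a binary particle mask with scikit-image's regionprops; the circularity formula is the common 4πA/P² definition and the pixel size is a placeholder, so values may not match FlowCam's own outputs exactly:

```python
import numpy as np
from skimage.measure import label, regionprops

def classical_features(mask, pixel_size_um=1.0):
    """Compute simple FlowCam-style metrics from a binary particle mask.

    Returns features for the largest connected region; `pixel_size_um`
    converts pixel measurements to micrometres (placeholder value)."""
    regions = regionprops(label(mask.astype(int)))
    if not regions:
        return None
    r = max(regions, key=lambda reg: reg.area)
    area_um2 = r.area * pixel_size_um ** 2
    perimeter_um = r.perimeter * pixel_size_um
    return {
        "area_um2": area_um2,
        # equivalent spherical (circular) diameter from the area
        "esd_um": 2.0 * np.sqrt(area_um2 / np.pi),
        "aspect_ratio": (r.minor_axis_length / r.major_axis_length
                         if r.major_axis_length else np.nan),
        "circularity": (4.0 * np.pi * area_um2 / perimeter_um ** 2
                        if perimeter_um else np.nan),
    }
```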
Consider taxonomically what we can do across the various instruments. Does this depend on opportunities for methods? Probably not, but we could describe it in terms of what we can do by eye. Compare to a benchmark of light microscopy, as the closest to ground truth that we can get.
Ideally we should classify the images down to species where possible. Can we also include information like size range and other physical properties, and could this tell us something similar to "Reynolds categories" (which are based upon surface area, volume and maximum linear dimension, and don't always hold for rivers) and other functional schemes (e.g. morpho-functional groups, https://link.springer.com/article/10.1007/s10750-006-0437-0)?
Would be interesting to look into whether it's possible to get straight to functional characteristics without going through species (if this is too difficult) - can it move (flagellates), is it spiky (easy to eat), does it have a shell, could it block filters at water treatment plants... Will algorithms perform better on taxonomy or on functional-trait-defined groups (based upon classification uncertainties)?
e.g. could cluster together things that could have mucilage...
So would need a secondary classification related to the functional aspects...
Penetrators and blockers of filters....
Collaboration opportunities
- Sharing technology
- Looking at Thames from source to estuary / sea
- Salmon population across rivers to sea
- Phytoplankton effects on water treatment processes and costs
- Impacts of pollution on community composition
- DRI development and shared image analysis workflows....
- High throughput or long-term monitoring across freshwater to coastal?
Danubius - mainly modelling and less biology....
To do:
- Create initial classification hierarchies/.csv templates for the flow cytometer and flow cam macro, inspired by Rob's example (Ellie, Heather, Isabelle, Steve).
- Lancaster UKCEH team (Ellie, Heather, Steve) to contact Ezra & Erica re the separation of collages into individual images, with GUIDs and linked to metadata, and upload to JASMIN. OR to read from LST files and get access to raw images - this should be the preference as the collages will not retain all the original information.
- UKCEH to work with Noushin on image library building and ResNet-50 code, for classification.
- Work with Evgeniya and Will (and Alba) to discuss what could be possible with common tooling for: pushing images to Jasmin object storage, managing / previewing images on Jasmin, tools for simple classification / PCA, etc. On DataLabs?? (Matt / Ezra / Erica)
- Link to wider UKCEH work on ML - maybe data science group meeting around image analysis (Matt to set up)
- Look into defining a machine / set of tools for UKCEH scientific computing cloud for image analysis - and using VSCode? (Ezra)
- Engage with the Scivision team to discuss links, and potentially the need for simple feature extraction.... (Matt to set up - once we have some data and initial questions)
- Start to consider the functional characteristics. But these could be added later based on species / taxonomy