# Western Antarctic Peninsula Transcriptome Analysis Project Synthesis
January 2023
## Reorienting to the dataset

Progress thus far:
- eukrhythmic assemblies separately for all _in situ_ data and all of the experiments (more on that later in the future work section)
- more recently: all annotation steps direct to MAD with added step of the implemented post-process `Salmon` filtering of NumReads>0
- Preliminary pseudo-differential expression analysis/all the infrastructure to complete this work in the coming weeks with new trophy-implicated gene cluster workflow
Intended submission by the end of March to _Limnology and Oceanography_. Plan is also to get ASLO abstract in on this with the new trophy-implicated gene cluster results by Feb 23 2023. Heretofore working title: "Mixotrophy in Southern Ocean cryptophytes revealed by metatranscriptomics". More recent idea "Phylogeny predicts prevailing trophy in Southern Ocean algae".


Cryptophytes make up as much as 40% of assigned TPM during diel sampling, but then <5% in other samples; diel samples also have the greatest taxonomic coherence.
_Differential expression analysis: day vs. night samples in diel dataset_

Upregulation of signal processing during the day; digestive system and metabolism processing during the night. This was alongside the observation that back-of-the-envelope trophy calculations _a là_ _Tara_ MAGs project were also more swayed towards heterotrophy than phototrophy during the night. However, the problem is that expression patterns are also most distinct in general between the two (hence the approach that I've been working on lately described later on).
But, all with the caveat that annotation rates are low. :-1:

More evidence of the cryptophytes having more upregulated heterotrophy processes:

"Senescence" genes were uniquely ascribed to cryptophytes, whereas endocytosis and lysosome activity would be upregulated in diatoms or cryptophytes, whichever was more abundant. Reassuringly (w.r.t. normalization), photosynthesis was always upregulated in diatoms relative to cryptophytes, even when cryptophytes were the more abundant party.
RAIN analysis had shown diel-related genes in the past, planning on moving this over to GeneCycle; this program is supposed to be better for [weakly informative datasets](https://pubmed.ncbi.nlm.nih.gov/32182235/). I would put this diel dataset in that category.
## The plans for a way forward

Broad-brush workflow:
### Trophy model training
For the Tara MAGs project, we classified each of the MMETSP and EukProt transcriptomes according to their trophic state. The idea for this project is to leverage that information to train a feature selection algorithm.
In order to identify trophy-implicated gene clusters from this training dataset, we need expression information for the MMETSP. Then we calculate the TPM via Salmon of all of the contigs in the collection. We calculate this abundance relative to all of MMETSP co-clustered with mmseqs2 at 80% coverage of query and target and 100% sequence identity. We'll then filter the contigs with have the most predictive power for the trophic label. Then, we'll go back and filter all the genes from the contigs which popped up as being trophy-implicated. Finally, we'll do an orthogroup clustering across all the MMETSP contig files to ID trophy-implicated gene clusters from the expression data. It'll be an orthogroup clustering with the WAP cryptophyte and diatom-subsetted assemblies to those files that will tell us whether the trophy-implicated clusters tend to show up in our Western Antarctic Peninsula dataset.
As far as identifying these genes, I am currently testing using the L1 norm as shown in the figure.
(In the past and as shown in the figure, the thought was to find the Z-score of each of these expression values across all of those different MMETSP IDs. The genes that are significantly more highly expressed w.r.t. the mean expression in each transcriptome would be retained. I have abandoned this approach in favor of the consolidated co-assembly approach).
## Notes on new steps taken (28 January 2023)
_all of this work is going on in `trophy-model-WAP-filtered` in the `akrinos/2022-Krinos-WAP/code/jupyter-notebooks` folder on Poseidon_
For identifying the trophy-implicated genes, I:
1. downloaded all of the MMETSP raw files
2. clustered all MMETSP contigs using `mmseqs2` with a minimum coverage (of shorter sequence) of 80% and percentage identity of 100%
3. quantified clustered contigs w/ the raw files using `salmon`
4. combined quantified results with Sarah's trophy annotations
5. performed logistic regression on counts with an inverse regularization parameter of 1000, L1 norm penalty, using the `liblinear` solver. The `liblinear` solver was the only one with an efficiency high enough to complete in reasonable time (alternative would have been saba solver). For this reason I needed to use the OvR ("one versus rest") algorithm for the logistic regression. This meant that 3 different models were created: heterotrophy (1) condition vs. rest (0) condition; same for the other 2 trophic modes
6. pulled contigs with nonzero coefficients for one of the the three trophic modes & saved all coefficients
7. extracted corresponding predicted proteins in the MMETSP
8. DIAMOND search of those predicted proteins against WAP MAD predicted proteins
This search yielded 125366 genes that were significantly implicated in trophy. Adjusting downward that inverse regularization parameter adjusts this number of genes to be smaller.
#### Relative to the taxonomic origin of trophy-implicated genes from the MMETSP that had best bitscore and pident>75% hits to the WAP dataset
Working with a total of 1867537 WAP genes that had hits with pident>75% to trophy-implicated genes

36222 of these genes were from cryptophytes and 842319 from diatoms. Pulling the coefficients associated with those genes in the logistic model, we can break that down further into the genes that got certain coefficients per the model. For now, I'm using:
- coefficient>0.01 in model for mixotrophy + positive or zero coefficient for photo+hetero = mixotrophy
- coefficient>0.01 in model for phototrophy + negative or zero for hetero = phototrophy
- coefficient>0.01 in model for heterotrophy + negative or zero for photo = heterotrophy
Using that rationale, I get the following breakdown of genes for some relevant protistan classes:

Frustrating, because Apicomplexans don't photosynthesize and diatoms don't (probably) eat. However, this is a data-driven approach, after all. If we consider this according to number of genes and not proportion, we get:

Which means that the bulk of our genes are making sense - the weird situations we're getting are rare relative to all genes that are found. And they may also be taxonomic misannotations. And they also may be very lowly expressed (and we based this dimensionality reduction on relative expression, _not_ presence-absense).
I think it's kind of interesting that the Classes of Chlorophyte coming from the couple of taxa that have known mixotrophs in the SO do have heterotrophy genes whilst _Chloropicophyceae_ does not. Those two other classes are _Mamiellophyceae_ and _Pyramimondadaceae_ https://www.frontiersin.org/articles/10.3389/fmars.2018.00273/full
Note that using this approach we can adjust things to be more/less stringent as we like. For example, we could say that a mixotrophic gene only needs to clear a threshold of 0.001 instead of 0.01:

Or we could keep our old threshold, but also say that a >0 photo + a >0 hetero == a Mixotroph.
Modifications based on highest coefficient change the total pool of genes we're using significantly: using a cutoff of $10^{-3}$ yields 3046216 genes, whereas a cutoff of $10^{-2}$ yields 357717 genes and a cutoff of $10^{-3/2}$ yields 40538 total genes of interest to be found in our BLAST results against the WAP dataset. But, if you cut all the coefficients to a cutoff of $10^{-3/2}$, you get nothing but phototrophy genes. If you do a combination, e.g. $10^{-3/2}$ for phototrophs and $10^{-2}$ for the other categories, you can get something like:

Which is a little unfair to phototrophs but captures e.g. the Apicomplexans being heterotrophic.
Let's try a different approach. No consideration of any model but one, and just select the top 5% of coefficients for each model, assigning the label of Phototroph, Heterotroph, Mixotroph accordingly.
https://stackoverflow.com/questions/23332330/why-does-multiclass-logistic-regression-give-different-results-than-choosing-the
## Another new step: trying `GeneCycle` in place of `RAIN` for rhythm detection in the diel data
This is motivated by [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7100990/) which describes how `GeneCycle` (the algorithm implemented therein) might be most effective for weakly-informative datasets.
I started off by just using the easier Fisher g-test option from the `GeneCycle` package, and of 1,000,732 genes tested for the cryptophyte sub-assembly, 3381 genes based on the tail area-based FDR and 3299 genes based on the local FDR showed up as significant.

I then moved to the actual method that's recommended by the paper, which is the `robust_spectrum` and `robust.g.test` suite. This has a lot more options and potentially a very long runtime. I used the rank algorithm rather than the regression one and used 100 permutations. The code from the [paper above](https://github.com/laloumdav/rhythm_detection_benchmark/blob/master/scripts/GeneCycle_script.R) is more helpful than the `GeneCycle` documentation.
This number of genes was intense for this function. So I did a pre-subsetting of the gene to those with>100 total TPM across the diel samples (the GeneCycle algorithm is probably not accurate for those with lower abundance, anyway. This only yielded 1163 genes, which was fewer than I expected. This yielded 57 high-abundance, diel-cycling genes from the combined assembly of all genes identified in the WAP dataset to be of cryptophyte origin.
Trying this with a 50 TPM cutoff, 2639 genes survive the cull and of these 166 are predicted to cycle.
I took a look at these genes, and I wasn't sure GeneCycle was quite understanding what I meant as far as replicated time points in the beginning of the time series.

So I tried again with the `regression` approach that uses an `rlm` model. I fed to `GeneCycle` that my data were periodic with period 24, but then gave them their actually post-T0 numeric labels (up to 28 hours past that first 9am). Most of these `rlm`s fail to converge, leaving you with a null spectrum and null p-value.
## Another new step: trying to improve upon the `EUKulele` annotations
We know that only half of sequences get `EUKulele`
annotations. How much of the time is this due to a conflict at that level of percentage identity and how much of the time is that due to an utter lack of hits? First of all, we need to reconsider the `EUKulele` annotations now that we've done a pre-filtering of our assembly to only include those that had >0 reads assigned by `Salmon` in the original mapping. These fragments do get assigned reads, but they may variably receive functional and taxonomic annotations.
To answer this, we need to parse the DIAMOND file similar to what we've been doing for the Narragansett Bay dataset (time-consuming).
## Another new step: revisiting the RiboDetector workflow for assembling + interpreting 18S rRNA genes
I think for this it will be necessary to go back and assemble the reads pulled from RiboDetector with EMIRGE or other pipeline designed for 18S rRNA assembly (https://github.com/csmiller/EMIRGE)