# Changes Made to `TransPi.nf` and `nextflow.config`
Ryan Seaman | Feb 15, 2023
## `TransPi.nf`
### Busco Changes:
- Container updated to `quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0` for all three BUSCO processes.
- For `busco4` and `busco4_tri`, the output declaration was switched from `file` to `path`:
For **`busco4`**
```groovy=1361
tuple sample_id, path("${sample_id}.TransPi.bus4") into busco4_ch
```
For **`busco4_tri`**
```groovy=1598
path("${sample_id}.Trinity.bus4")
```
- Changed from offline to online (for `busco4`, `busco4_tri`, and `busco4_all`).
For **`busco4`**
```groovy=1370
busco -i ${sample_id}.combined.okay.fa -o ${sample_id}.TransPi.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```
For **`busco4_tri`**
```groovy=1605
busco -i ${sample_id}.Trinity.fa -o ${sample_id}.Trinity.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```
For **`busco4_all`**
```groovy=1558
busco -i \${name}.fa -o \${name}.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```
---
### `.versions` Changes:
- Versions are now saved to the output directory (`outdir`) rather than the work directory. This was done with a global search and replace on the version `publishDir` directives, so they now all look like this:
```groovy=
publishDir "${params.outdir}/.versions", mode: "copy", overwrite: true, pattern: "*.version.txt"
```
- Versions are put into a channel named `all_versions_ch` for `get_run_info` to use. Here is how the channel was made and used:
```groovy=391
Channel
.fromPath("${params.outdir}/.versions", type: 'any')
.set{ all_versions_ch}
```
```groovy=3030
path(versionspath) from all_versions_ch
```
- This collapses the `get_run_info` process and the creation of `RUN_INFO.txt` into one simple step.
- `get_run_info` now waits for `get_GO_comparison` (the last process that writes to the `.versions` directory) to complete, and then runs.
- This is done with a text file named `go_time.txt`, which is created in `get_GO_comparison`, sent through the channel `go_time_ch`, and consumed by `get_run_info` as an input (see the sketch below).
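A rough sketch of that handshake (the names match the description above, but the exact declarations in the file may differ slightly):
```groovy=
// in get_GO_comparison (sketch): its script ends with something like `touch go_time.txt`
output:
    file("go_time.txt") into go_time_ch

// in get_run_info (sketch): consuming the flag forces it to wait for get_GO_comparison
input:
    file(go_time) from go_time_ch
```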
- The script within `get_run_info` was also changed to a series of `if` statements that append every version file found in the `.versions/` directory:
```groovy=3086
if [ -f ${versionspath}/fastqc.version.txt ];then
v=\$( cat ${versionspath}/fastqc.version.txt )
echo -e "\$v " >>RUN_INFO.txt
fi
```
- Lastly, `workflow.onComplete` no longer contains code regarding versions.
---
### `evigene` Changes:
- Container has been switched to `rerv/transpi:v1.0.0`.
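The directive presumably now reads something like this (a sketch following the container pattern used elsewhere in the file; the `oneContainer` branch is assumed unchanged):
```groovy=
// Sketch only: evigene's container switched to the custom image
if (params.oneContainer){ container "${params.TPcontainer}" } else {
    container "rerv/transpi:v1.0.0"
}
```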
---
### `summary_trinotate_individual` Changes:
- This process did not specify a container, but it needed one that provides `bc`. A container directive was added, set to `rerv/transpi:v1.0.0`:
```groovy=2727
conda (params.condaActivate && params.myConda ? params.localConda : params.condaActivate ? "-c conda-forge bioconda::cd-hit=4.8.1 bioconda::exonerate=2.4 bioconda::blast=2.11.0" : null)
if (params.oneContainer){ container "${params.TPcontainer}" } else {
container (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container ? "https://depot.galaxyproject.org/singularity/mulled-v2-962eae98c9ff8d5b31e1df7e41a355a99e1152c4:0ed9db56fd54cfea67041f80bdd8b8fac575112f-0" : "rerv/transpi:v1.0.0")
}
```
---
### DB, Bin, and Conf Access:
- Databases needed to be staged through channels, because files in Google Cloud Storage buckets cannot be found if they are only referenced by path inside the process scripts.
- The following channels were created:
```groovy=367
Channel
.fromPath("${params.uniprotdb}", type: 'any', checkIfExists: true)
.into{ uniprotdb1_ch; uniprotdb2_ch; uniprotdb3_ch}
Channel
.fromPath("${params.hmmerdb}", type: 'any', checkIfExists: true)
.into{ hmmerdb1_ch; hmmerdb2_ch; hmmerdb3_ch}
Channel
.fromPath("${params.sqllitedb}", type: 'any', checkIfExists: true)
.set{ sqlitedb1_ch}
Channel
.fromPath("${params.Tsql}", type: 'any', checkIfExists: true)
.set{ tsql1_ch}
Channel
.fromPath("${params.path2bin}", type: 'any', checkIfExists: true)
.into{ binpath1_ch; binpath2_ch; binpath3_ch; binpath4_ch}
Channel
.fromPath("${params.path2conf}", type: 'any', checkIfExists: true)
.set{ confpath1_ch}
```
- Note: some of these are new params I added in the `nextflow.config` file, which I will show later.
- Here is the mapping between each new channel and the process that consumes it:
| Channel | Associated Process |
| ------------- | ------------------------ |
| uniprotdb1_ch | transdecoder_diamond |
| uniprotdb2_ch | custom_diamond_trinotate |
| uniprotdb3_ch | get_run_info |
| hmmerdb1_ch | transdecoder_hmmer |
| hmmerdb2_ch | hmmer_trinotate |
| hmmerdb3_ch | get_run_info |
| sqlitedb1_ch | swiss_diamond_trinotate |
| tsql1_ch | trinotate |
| binpath1_ch | summary_custom_uniprot |
| binpath2_ch | get_GO_comparison |
| binpath3_ch | get_report |
| binpath4_ch | get_busco4_comparison |
| confpath1_ch | summary_custom_uniprot |
- Here is an example of how the channels are used; every process listed above follows a variation of this pattern:
```groovy=1872
process transdecoder_diamond {
label 'med_cpus'
tag "${sample_id}"
publishDir "${params.outdir}/transdecoder", mode: "copy", overwrite: true
conda (params.condaActivate && params.myConda ? params.localConda : params.condaActivate ? "-c conda-forge bioconda::diamond=0.9.30=h56fc30b_0" : null)
if (params.oneContainer){ container "${params.TPcontainer}" } else {
container (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container ? "https://depot.galaxyproject.org/singularity/diamond:0.9.30--h56fc30b_0" : "quay.io/biocontainers/diamond:0.9.30--h56fc30b_0")
}
input:
tuple sample_id, file(pep) from transdecoder_diamond
path(uniprotdb) from uniprotdb1_ch
output:
tuple sample_id, file("${sample_id}.diamond_blastp.outfmt6") into transdecoder_predict_diamond
script:
"""
dbPATH=${uniprotdb}
echo -e "\\n-- Starting Diamond (blastp) --\\n"
if [ ! -d \${dbPATH} ];then
echo "Directory \${dbPATH} not found. Run the precheck to fix this issue"
exit 0
elif [ -d \${dbPATH} ];then
if [ ! -e \${dbPATH}/${params.uniname} ];then
echo "File \${dbPATH}/${params.uniname} not found. Run the precheck to fix this issue"
exit 0
elif [ -e \${dbPATH}/${params.uniname} ];then
if [ ! -e \${dbPATH}/${params.uniname}.dmnd ];then
cp \${dbPATH}/${params.uniname} .
diamond makedb --in ${params.uniname} -d ${params.uniname} -p ${task.cpus}
diamond blastp -d ${params.uniname}.dmnd -q ${pep} -p ${task.cpus} -f 6 -k 1 -e 0.00001 >${sample_id}.diamond_blastp.outfmt6
elif [ -e \${dbPATH}/${params.uniname}.dmnd ];then
cp \${dbPATH}/${params.uniname}.dmnd .
diamond blastp -d ${params.uniname}.dmnd -q ${pep} -p ${task.cpus} -f 6 -k 1 -e 0.00001 >${sample_id}.diamond_blastp.outfmt6
fi
fi
fi
echo -e "\\n-- Done with Diamond (blastp) --\\n"
"""
}
```
- The channel is consumed as an input on line 1887.
- Its value is then assigned to a variable on line 1894 and used throughout the script.
---
### `mapping_evigene` Changes:
- Here, the only change was restricting `publishDir` to `.txt` files (the only output this process is supposed to produce):
```groovy=1301
publishDir "${params.outdir}/mapping", mode: "copy", overwrite: true, pattern: "*.txt"
```
---
### `rna_quast` Changes:
- Here, the only change was switching the output line from `file` to `path` (just like `busco4`):
```groovy=1264
tuple sample_id, path("${sample_id}.rna_quast") into rna_quast_sum
```
---
### Changes for `onlyAnn`
- By default, the pipeline looked locally for a specific directory called `onlyAnn`.
- That would not work on Google Cloud, so I instead created a param called `params.transcriptomeDir` that points to a bucket containing the assembled transcriptome.
- Here is what I changed:
```groovy=1751
if (params.onlyAnn){
println("\n\tRunning only annotation analysis\n")
Channel
.fromFilePairs("${params.transcriptomeDir}/*.{fa,fasta}", size: -1, checkIfExists: true)
.into{ annotation_ch_transdecoder; annotation_ch_transdecoderB; evigene_ch_rnammer_ann; evigene_ch_trinotate; evi_dist_ann}
}
```
- I then added the location of `params.transcriptomeDir` to the `log.info` on line 330.
```groovy=321
====================================================
TransPi - Transcriptome Analysis Pipeline v${workflow.manifest.version}
====================================================
TransPi.nf Directory: ${projectDir}/TransPi.nf
Launch Directory: ${launchDir}
Results Directory: ${params.outdir}
Work Directory: ${workDir}
TransPi DBs: ${params.pipeInstall}
Uniprot DB: ${params.uniprot}
Transcriptome Directory: ${params.transcriptomeDir}
""".stripIndent()
```
---
## `nextflow.config`
### DB, Bin, and Conf Access:
- Here I added params that point to various locations within the bucket. A few examples:
```=75
// Directory to Transcriptome (for onlyAnn and onlyEvi)
transcriptomeDir='gs://nosi-workshop-bucket/03-tp-bucket/trans'
// Path to Bin
path2bin='gs://nosi-workshop-bucket/03-tp-bucket/bin'
// Path to Conf
path2conf='gs://nosi-workshop-bucket/03-tp-bucket/conf'
```
My idea is to change `gs://nosi-workshop-bucket/03-tp-bucket` to a placeholder like `userBucketPath`; then, within their notebook, users will...
1. First, create a variable that points to the target location in their bucket:
```python=1
myBucket='gs://mybucket/transpi_test'
```
2. They will then copy our bucket into that location:
```python=2
!gsutil -m cp -r gs://our_source_bucket/transpi_bucket $myBucket
```
3. Then do a global search and replace within `nextflow.config`. Using `perl -pi` edits the file in place, and `|` serves as the substitution delimiter because the bucket path contains both `:` and `/`:
```python=3
!perl -pi -e "s|userBucketPath|$myBucket|g" ./nextflow.config
```
This should, in theory, set everything in place for them, and they would also not need to run the precheck.
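As an optional sanity check (my suggestion, not part of the current setup), the notebook could confirm that no placeholder is left after the substitution:
```python=
# Hypothetical check: count remaining occurrences of the placeholder (should print 0)
!grep -c "userBucketPath" ./nextflow.config
```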
### Adding the gls profile:
```
gls {
process.executor = 'google-lifesciences'
process.container = 'ubuntu'
google.location = 'us-central1'
google.region = 'us-central1'
google.project = 'nosi-mdibl-inbrecloud'
workDir = 'gs://nosi-workshop-bucket/03-tp-bucket/onlyAnn/theWORK'
process.machineType = 'c2-standard-30'
dag.overwrite = true
params.outdir='gs://nosi-workshop-bucket/03-tp-bucket/onlyAnn/theOUTPUT'
params.bucket='gs://nosi-workshop-bucket/03-tp-bucket'
google.lifeSciences.bootDiskSize=50.GB
google.storage.parallelThreadCount = 100
google.storage.maxParallelTransfers = 100
}
```
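With the profile in place, a run is launched by selecting it on the command line; for an `onlyAnn` run it would look roughly like this (flags are illustrative):
```
nextflow run TransPi.nf -profile gls --onlyAnn
```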
### Overwrite --> true:
This allows the `pipeline_info` items to be overwritten so that they always contain the most up-to-date information:
```=344
trace.overwrite=true
dag.overwrite=true
report.overwrite=true
timeline.overwrite=true
```
---
#### Tentative changes:
- In the config, I changed the CPU and memory maximums (see the sketch below):
  - Mem max: 100 GB
  - CPU max: 15
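For reference, the change would look roughly like this in `nextflow.config`, assuming the limits are exposed as params named `max_memory` and `max_cpus` (the actual names in the file may differ):
```groovy=
// Hypothetical param names; match them to the ones already used in nextflow.config
params.max_memory = 100.GB
params.max_cpus   = 15
```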