# Changes Made to `TransPi.nf` and `nextflow.config`

Ryan Seaman | Feb 15, 2023

## `TransPi.nf`

### Busco Changes:

- Container updated to `quay.io/biocontainers/busco:5.4.3--pyhdfd78af_0` for all three busco processes.
- For `busco4` and `busco4_tri`, the output folder was switched from `file` to `path`:

For **`busco4`**

```groovy=1361
tuple sample_id, path("${sample_id}.TransPi.bus4") into busco4_ch
```

For **`busco4_tri`**

```groovy=1598
path("${sample_id}.Trinity.bus4")
```

- Changed from offline to online (for `busco4`, `busco4_tri`, and `busco4_all`).

For **`busco4`**

```groovy=1370
busco -i ${sample_id}.combined.okay.fa -o ${sample_id}.TransPi.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```

For **`busco4_tri`**

```groovy=1605
busco -i ${sample_id}.Trinity.fa -o ${sample_id}.Trinity.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```

For **`busco4_all`**

```groovy=1558
busco -i \${name}.fa -o \${name}.bus4 -l vertebrata_odb10 -m tran -c ${task.cpus}
```

---

### `.versions` Changes:

- Versions are now saved to the outdir (`params.outdir`) rather than the work dir. This was done with a global search-and-replace on the versions `publishDir`, so they all now look like this:

```groovy=
publishDir "${params.outdir}/.versions", mode: "copy", overwrite: true, pattern: "*.version.txt"
```

- Versions are put into a channel named `all_versions_ch` for `get_run_info` to use. Here is how the channel was made and used:

```groovy=391
Channel
    .fromPath("${params.outdir}/.versions", type: 'any')
    .set{ all_versions_ch }
```

```groovy=3030
path(versionspath) from all_versions_ch
```

- This simplifies the `get_run_info` process and the creation of `RUN_INFO.txt` to one simple step.
- `get_run_info` now waits for `get_go_comparison` (the last process that adds to the `.versions` dir) to complete, and then runs.
- This is done with a text file named `go_time.txt` that is created in `get_go_comparison`, goes to the channel `go_time_ch`, and enters `get_run_info` as an input (see the sketch at the end of this section).
- The script within `get_run_info` was also changed to a series of if statements that add all of the versions within the `.versions/` dir:

```groovy=3086
if [ -f ${versionspath}/fastqc.version.txt ];then
    v=\$( cat ${versionspath}/fastqc.version.txt )
    echo -e "\$v " >>RUN_INFO.txt
fi
```

- Lastly, `workflow.onComplete` no longer contains code regarding versions.
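A minimal sketch of the `go_time.txt` hand-off (the channel wiring mirrors the real change, but the process bodies here are just placeholders, not TransPi's actual code):

```groovy
// Illustrative DSL1 sketch of the sentinel-file synchronization.
process get_go_comparison {
    output:
    path("go_time.txt") into go_time_ch   // sentinel emitted when the GO work is done

    script:
    """
    # ... real GO comparison work, publishing its *.version.txt, happens here ...
    touch go_time.txt
    """
}

process get_run_info {
    input:
    path(go_time) from go_time_ch             // forces get_run_info to run last
    path(versionspath) from all_versions_ch   // staged copy of ${params.outdir}/.versions

    script:
    """
    # safe to read ${versionspath} now: every *.version.txt has been published
    """
}
```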
---

### `evigene` Changes:

- Container has been switched to `rerv/transpi:v1.0.0`.

---

### `summary_trinotate_individual` Changes:

- This process did not have a container, but needed one that provides `bc`. This was added with the container set to `rerv/transpi:v1.0.0`:

```groovy=2727
conda (params.condaActivate && params.myConda ? params.localConda : params.condaActivate ? "-c conda-forge bioconda::cd-hit=4.8.1 bioconda::exonerate=2.4 bioconda::blast=2.11.0" : null)
if (params.oneContainer){
    container "${params.TPcontainer}"
} else {
    container (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container ? "https://depot.galaxyproject.org/singularity/mulled-v2-962eae98c9ff8d5b31e1df7e41a355a99e1152c4:0ed9db56fd54cfea67041f80bdd8b8fac575112f-0" : "rerv/transpi:v1.0.0")
}
```

---

### DB, Bin, and Conf access:

- Databases needed to be put into channels because Google buckets cannot be found if referenced directly within the process scripts (see the staging sketch at the end of this section).
- The following channels were created:

```groovy=367
Channel
    .fromPath("${params.uniprotdb}", type: 'any', checkIfExists: true)
    .into{ uniprotdb1_ch; uniprotdb2_ch; uniprotdb3_ch }
Channel
    .fromPath("${params.hmmerdb}", type: 'any', checkIfExists: true)
    .into{ hmmerdb1_ch; hmmerdb2_ch; hmmerdb3_ch }
Channel
    .fromPath("${params.sqllitedb}", type: 'any', checkIfExists: true)
    .set{ sqlitedb1_ch }
Channel
    .fromPath("${params.Tsql}", type: 'any', checkIfExists: true)
    .set{ tsql1_ch }
Channel
    .fromPath("${params.path2bin}", type: 'any', checkIfExists: true)
    .into{ binpath1_ch; binpath2_ch; binpath3_ch; binpath4_ch }
Channel
    .fromPath("${params.path2conf}", type: 'any', checkIfExists: true)
    .set{ confpath1_ch }
```

- Note: some of these params are new params I made in the `nextflow.config` file, which I will show later.
- Here is the connection between each new channel and its process:

| Channel       | Associated Process       |
| ------------- | ------------------------ |
| uniprotdb1_ch | transdecoder_diamond     |
| uniprotdb2_ch | custom_diamond_trinotate |
| uniprotdb3_ch | get_run_info             |
| hmmerdb1_ch   | transdecoder_hmmer       |
| hmmerdb2_ch   | hmmer_trinotate          |
| hmmerdb3_ch   | get_run_info             |
| sqlitedb1_ch  | swiss_diamond_trinotate  |
| tsql1_ch      | trinotate                |
| binpath1_ch   | summary_custom_uniprot   |
| binpath2_ch   | get_GO_comparison        |
| binpath3_ch   | get_report               |
| binpath4_ch   | get_busco4_comparison    |
| confpath1_ch  | summary_custom_uniprot   |

- This is an example of how it is used; a variation of this appears in all of them:

```groovy=1872
process transdecoder_diamond {

    label 'med_cpus'

    tag "${sample_id}"

    publishDir "${params.outdir}/transdecoder", mode: "copy", overwrite: true

    conda (params.condaActivate && params.myConda ? params.localConda : params.condaActivate ? "-c conda-forge bioconda::diamond=0.9.30=h56fc30b_0" : null)
    if (params.oneContainer){
        container "${params.TPcontainer}"
    } else {
        container (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container ? "https://depot.galaxyproject.org/singularity/diamond:0.9.30--h56fc30b_0" : "quay.io/biocontainers/diamond:0.9.30--h56fc30b_0")
    }

    input:
    tuple sample_id, file(pep) from transdecoder_diamond
    path(uniprotdb) from uniprotdb1_ch

    output:
    tuple sample_id, file("${sample_id}.diamond_blastp.outfmt6") into transdecoder_predict_diamond

    script:
    """
    dbPATH=${uniprotdb}

    echo -e "\\n-- Starting Diamond (blastp) --\\n"
    if [ ! -d \${dbPATH} ];then
        echo "Directory \${dbPATH} not found. Run the precheck to fix this issue"
        exit 0
    elif [ -d \${dbPATH} ];then
        if [ ! -e \${dbPATH}/${params.uniname} ];then
            echo "File \${dbPATH}/${params.uniname} not found. Run the precheck to fix this issue"
            exit 0
        elif [ -e \${dbPATH}/${params.uniname} ];then
            if [ ! -e \${dbPATH}/${params.uniname}.dmnd ];then
                cp \${dbPATH}/${params.uniname} .
                diamond makedb --in ${params.uniname} -d ${params.uniname} -p ${task.cpus}
                diamond blastp -d ${params.uniname}.dmnd -q ${pep} -p ${task.cpus} -f 6 -k 1 -e 0.00001 >${sample_id}.diamond_blastp.outfmt6
            elif [ -e \${dbPATH}/${params.uniname}.dmnd ];then
                cp \${dbPATH}/${params.uniname}.dmnd .
                diamond blastp -d ${params.uniname}.dmnd -q ${pep} -p ${task.cpus} -f 6 -k 1 -e 0.00001 >${sample_id}.diamond_blastp.outfmt6
            fi
        fi
    fi
    echo -e "\\n-- Done with Diamond (blastp) --\\n"
    """
}
```

- The database channel is set as an input on line 1887, and then assigned to a variable (`dbPATH`) used throughout the script on line 1894.
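To make the reasoning concrete, here is a minimal sketch (not TransPi code) contrasting a raw bucket reference with the channel-staging pattern used above:

```groovy
// Broken on Google Life Sciences: the gs:// URI is interpolated as plain
// text into the bash script, and no such local path exists inside the task VM.
process broken_example {
    script:
    """
    ls ${params.uniprotdb}    # fails on GLS: nothing has been staged
    """
}

// The pattern used above: declaring the value as a `path` input makes
// Nextflow download/stage the bucket object(s) into the task work dir.
process working_example {
    input:
    path(uniprotdb) from uniprotdb1_ch

    script:
    """
    ls ${uniprotdb}           # staged local copy of the database directory
    """
}
```

Declaring the value as a `path` input is what triggers Nextflow to stage the object from the bucket into the task's working directory, which is why the channels are needed.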
---

### `mapping_evigene` Changes:

- Here, the only change was switching the `publishDir` to only look for `.txt` files (the only output it is supposed to have):

```groovy=1301
publishDir "${params.outdir}/mapping", mode: "copy", overwrite: true, pattern: "*.txt"
```

---

### `rna_quast` Changes:

- Here, the only change was shifting the output line from `file` to `path` (just like `busco4`):

```groovy=1264
tuple sample_id, path("${sample_id}.rna_quast") into rna_quast_sum
```

---

### Changes for `onlyAnn`

- Here, the default was to look locally for a specific directory called `onlyAnn`.
- That was not going to work on Google, so instead I created a param called `params.transcriptomeDir` that points to a bucket containing the assembled transcriptome.
- Here is what I changed:

```groovy=1751
if (params.onlyAnn){

    println("\n\tRunning only annotation analysis\n")

    Channel
        .fromFilePairs("${params.transcriptomeDir}/*.{fa,fasta}", size: -1, checkIfExists: true)
        .into{ annotation_ch_transdecoder; annotation_ch_transdecoderB; evigene_ch_rnammer_ann; evigene_ch_trinotate; evi_dist_ann }
}
```

- I then added the location of `params.transcriptomeDir` to the `log.info` on line 330:

```groovy=321
====================================================
 TransPi - Transcriptome Analysis Pipeline v${workflow.manifest.version}
====================================================
TransPi.nf Directory: ${projectDir}/TransPi.nf
Launch Directory: ${launchDir}
Results Directory: ${params.outdir}
Work Directory: ${workDir}
TransPi DBs: ${params.pipeInstall}
Uniprot DB: ${params.uniprot}
Transcriptome Directory: ${params.transcriptomeDir}
""".stripIndent()
```

---

## `nextflow.config`

### DB, Bin, and Conf Access:

- Here I added params that point to various locations within the bucket. Here are a few, for example:

```=75
// Directory to Transcriptome (for onlyAnn and onlyEvi)
transcriptomeDir='gs://nosi-workshop-bucket/03-tp-bucket/trans'

// Path to Bin
path2bin='gs://nosi-workshop-bucket/03-tp-bucket/bin'

// Path to Conf
path2conf='gs://nosi-workshop-bucket/03-tp-bucket/conf'
```

My idea is to change the `gs://nosi-workshop-bucket/03-tp-bucket` prefix to a variable like `userBucketPath`, and then within their notebook the users will...

1. First, create a variable that points to the location in their bucket:

```python=1
myBucket='gs://mybucket/transpi_test'
```

2. Then copy our bucket into that location:

```python=2
!gsutil -m cp -r gs://our_source_bucket/transpi_bucket $myBucket
```

3. Then do a global search and replace within `nextflow.config` (the `-i -pe` flags make perl edit the file in place rather than just print to stdout):

```python=3
!perl -i -pe "s:userBucketPath:$myBucket:g" ./nextflow.config
```

This should, in theory, set everything in place for them, and they would also not need to run the precheck.
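For concreteness, this is roughly what the placeholder form of those params would look like before substitution. `userBucketPath` is the literal token that the perl one-liner rewrites; this is a sketch of the proposal, not committed code:

```groovy
// Hypothetical placeholder form of nextflow.config, before the user's
// notebook rewrites the literal token `userBucketPath` to their gs:// prefix.
transcriptomeDir='userBucketPath/trans'
path2bin='userBucketPath/bin'
path2conf='userBucketPath/conf'
```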
### Adding the gls profile:

```
gls {
    process.executor = 'google-lifesciences'
    process.container = 'ubuntu'
    google.location = 'us-central1'
    google.region = 'us-central1'
    google.project = 'nosi-mdibl-inbrecloud'
    workDir = 'gs://nosi-workshop-bucket/03-tp-bucket/onlyAnn/theWORK'
    process.machineType = 'c2-standard-30'
    dag.overwrite = true
    params.outdir='gs://nosi-workshop-bucket/03-tp-bucket/onlyAnn/theOUTPUT'
    params.bucket='gs://nosi-workshop-bucket/03-tp-bucket'
    google.lifeSciences.bootDiskSize=50.GB
    google.storage.parallelThreadCount = 100
    google.storage.maxParallelTransfers = 100
}
```

### Overwrite --> true:

This allows the pipeline_info items to be overwritten so that they always hold the most up-to-date information:

```=344
trace.overwrite=true
dag.overwrite=true
report.overwrite=true
timeline.overwrite=true
```

---

#### Tentative changes:

- In the config, changed the CPU and memory maxes to:
    - Mem Max: 100 GB
    - CPU Max: 15
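As a sketch of what that tentative change might look like, assuming TransPi follows the common nf-core-style `max_memory`/`max_cpus` params (the actual param names in `nextflow.config` may differ):

```groovy
// Hypothetical sketch of the tentative resource caps; the real param names
// in TransPi's nextflow.config may differ from these nf-core-style ones.
params.max_memory = 100.GB
params.max_cpus   = 15
```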