# 3. Raw fastq to uBAM and mark adapters

###### tags: `2. Main Steps`

### Step 1. Convert fastq to uBAM

This step converts your raw fastq files to an unaligned BAM (uBAM) file and, by default, sorts it by query name. The sort order matters because MarkIlluminaAdapters requires its input to be sorted by query name. This step can also add read groups for you!

**NOTE**: In hindsight, these steps are better run in a /work directory, because there is more space and jobs may run faster. The code below was run in the /datacommons/noor directory, which still worked fine (but again, not recommended in hindsight). We started working with our temporary files in the /work directory at the variant filtration step (so go to that step for more details and examples!).

This is where another benefit of SLURM comes in handy! Say you want to run the same code on 100 different files, with only tiny tweaks between runs; for instance, you want to look through the 22 human autosomes in 22 files. You don't want to run the code *sequentially* when you can run multiple jobs at the same time. We implemented this with a ***SLURM job array***. [See further instructions here](https://github.com/Noor-WGS-data/Genome_sequence_data/blob/main/Tutorials/Running_jobs_in_parallel.md).

```
#!/bin/bash
# The errs/ and outs/ directories must exist before submitting;
# SLURM does not create them.
#SBATCH -e errs/%a_uBAM-%A.err
#SBATCH -o outs/%a_uBAM-%A.out
#SBATCH --mem-per-cpu=20G
#SBATCH -p scavenger
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=ALL --mail-user=rp280@duke.edu
#SBATCH -a 19,30,34,45,59,73

# Each array task handles one sample; the task ID is the sample number.
LL=${SLURM_ARRAY_TASK_ID}

module load GATK/4.1.9.0

# Convert the paired fastq files to a query-name-sorted uBAM and add read group info.
java -Xmx8G -jar $GATK FastqToSam \
    --FASTQ ../data/2021_09_21_fastq/L${LL}_R1_001.fastq \
    --FASTQ2 ../data/2021_09_21_fastq/L${LL}_R2_001.fastq \
    --OUTPUT ../data/2021_09_21_fastq/uBAMs/unmapped_L${LL}.bam \
    --READ_GROUP_NAME L${LL} \
    --SAMPLE_NAME L${LL} \
    --LIBRARY_NAME L${LL} \
    --PLATFORM Illumina \
    --TMP_DIR /work/$USER
```

### Step 2. Mark Illumina Adapters (input needs to be sorted by QUERY NAME)

```
#!/bin/bash
#SBATCH -e errs/%a_markadapt-%A.err
#SBATCH -o outs/%a_markadapt-%A.out
#SBATCH --mem-per-cpu=20G
#SBATCH -p scavenger
#SBATCH --cpus-per-task=4
#SBATCH --mail-type=ALL --mail-user=rp280@duke.edu
#SBATCH -a 19

# Each array task handles one sample; the task ID is the sample number.
LL=${SLURM_ARRAY_TASK_ID}

module load GATK/4.1.9.0

# Mark adapter sequences in the uBAM from Step 1 and write a metrics file.
java -jar $GATK MarkIlluminaAdapters \
    -I ../data/2021_09_21_fastq/uBAMs/unmapped_L${LL}.bam \
    -M adapter_metrics_L${LL}.txt \
    -O ../data/2021_09_21_fastq/uBAMs/unmapped_adapter_L${LL}.bam \
    --TMP_DIR /work/$USER
```
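
The sample numbers passed to `-a` in the scripts above are typed by hand. As a minimal sketch, assuming the `L<number>_R1_001.fastq` / `L<number>_R2_001.fastq` naming used above, the helper below builds that list from the files actually present in a directory and skips any sample whose R2 mate is missing. The function name `array_ids` is hypothetical and not part of the original workflow.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not part of the original workflow):
# print the sample numbers that have BOTH an R1 and an R2 fastq file in a
# directory, joined by commas so the result can be passed to `sbatch -a`.
array_ids() {
    local dir=$1 r1 id
    local ids=()
    for r1 in "$dir"/L*_R1_001.fastq; do
        [ -e "$r1" ] || continue              # glob matched nothing
        id=${r1##*/L}                         # drop the path and the leading "L"
        id=${id%_R1_001.fastq}                # drop the suffix, leaving the number
        [[ $id =~ ^[0-9]+$ ]] || continue     # array indices must be integers
        if [ -e "$dir/L${id}_R2_001.fastq" ]; then
            ids+=("$id")                      # keep only complete pairs
        fi
    done
    local IFS=,
    echo "${ids[*]}"                          # join the numbers with commas
}
```

One possible use is `sbatch -a "$(array_ids ../data/2021_09_21_fastq)" your_script.sh`, where `your_script.sh` is a placeholder for either script above. An `-a` value given on the command line takes precedence over the `#SBATCH -a` line inside the script.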