processing of test data

--- tags: QC, STAR, filename-editing, slurm-script --- # processing of test data 12April22 1. Downloaded read data and put in `/mnt/lustre/macmaneslab/macmanes/physiology/reads` 2. edited read file names to remove `001` ``` for filename in *fastq.gz; do [ -f "$filename" ] || continue mv "$filename" "${filename//_001/}" done ``` 3. put genome and gff in `/mnt/lustre/macmaneslab/macmanes/physiology/genome` 4. Indexed genome using STAR. The script is in `$HOME/physiology` and named `star_gg.slurm` ``` #!/bin/bash #SBATCH --partition=macmanes #SBATCH -J star_gg #SBATCH --cpus-per-task=40 #SBATCH --output star.log source ~/.bash_profile mkdir $HOME/physiology/STAR_indices STAR --runMode genomeGenerate --runThreadN 40 --genomeDir $HOME/physiologyc/STAR_indices \ --genomeFastaFiles $HOME/physiology/genome/Peer2.0.1.fasta \ --sjdbGTFfile $HOME/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3 ``` 5. moved the index files to `$HOME/physiology/STAR_index` 6. Did the mapping job. The script is in `$HOME/physiology` and is named `ReadCount.job`. It is executed like this ``` sbatch $HOME/ReadCount.job \ /mnt/lustre/macmaneslab/macmanes/physiology/genome/Peer2.0.1.fasta \ /mnt/lustre/macmaneslab/macmanes/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3 \ /mnt/lustre/macmaneslab/macmanes/physiology/reads/first/pilot/220408_A01346_0055_AH22JTDRX2_16MerMACMANES_RNA_PILOT \ .fastq.gz ``` the actual mapping command is ``` STAR --runMode alignReads \ --genomeDir $DIR/STAR_index \ --readFilesIn ${READ_DIR}/${sample}R1${SUFFIX} ${READ_DIR}/${sample}R2${SUFFIX} \ --readFilesCommand zcat \ --runThreadN 40 \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 10 \ --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.3 \ --quantMode GeneCounts \ --outFileNamePrefix ${DIR}/STAR_mapped/${sample} \ --outSAMtype BAM Unsorted \ --genomeLoad LoadAndKeep ``` 7. once the mapping is done, I pull out all the mapping data using the following command to make a nice table. ``` printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "date" "sample" "reads" "unique" "multi" "toomany" "unmapped"> mappingdata.txt for file in $(lr *Log.final.out | awk '{print $9}') do sample=$(echo $file | sed "s/_Log.final.out//g" | sed "s/.Log.final.out//g") date=$(grep "Started job" $file | cut -f 2 | cut -d ' ' -f1-2) reads=$(grep "Number of input reads" $file | cut -f 2) unique=$(grep "Uniquely mapped reads %" $file | cut -f 2) multi=$(grep "% of reads mapped to multiple loci" $file | cut -f 2) toomany=$(grep "% of reads mapped to too many loci" $file | cut -f 2) unmapped=$(grep "Number of reads unmapped: too short" $file | cut -f 2) printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$date" "$sample" "$reads" "$unique" "$multi" "$toomany" "$unmapped" >> mappingdata.txt done ``` it looks like this ``` Sample Number_reads Uniquely_mapping_reads Mapped_to_multiple_loci Mapped_to_too_many_loci Unmapped_reads_too_short P1_A10_S73_L001_ 741096 77.05% 22.13% 0.08% 4776 P1_A11_S81_L001_ 360614 80.69% 18.40% 0.08% 2577 P1_A12_S89_L001_ 567409 78.29% 20.82% 0.08% 4215 P1_A1_S1_L001_ 589599 80.62% 18.48% 0.11% 3931 P1_A2_S9_L001_ 715087 77.35% 21.82% 0.08% 4616 P1_A3_S17_L001_ 478268 80.11% 18.84% 0.08% 4150 P1_A4_S25_L001_ 504477 79.56% 19.46% 0.09% 3866 P1_A5_S33_L001_ 263045 78.02% 21.18% 0.08% 1674 P1_A6_S41_L001_ 258891 79.53% 18.74% 0.07% 3756 ``` On the basis of this experiments, I asked Steve to repool the libraries, bring some concentratons up, some down, and to exclude some. That spreadsheet is [here](https://universitysystemnh-my.sharepoint.com/:x:/g/personal/mdm2000_usnh_edu/EebVFrf5eehAgPnuOVoSUsIBOx1ZiYuYdekPpUtVttrYAw?e=Z57b1S) ``` sbatch ./ReadCount.job \ /mnt/lustre/macmaneslab/macmanes/physiology/genome/Peer2.0.1.fasta \ /mnt/lustre/macmaneslab/macmanes/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3 \ /mnt/lustre/macmaneslab/macmanes/physiology/raw_reads/second_run/pilot/cobb.sr.unh.edu/managed/220810_A01346_0075_BH5H53DRX2_8MerMACMANES_RNA_P4/reads \ .fastq.gz ```