---
tags: QC, STAR, filename-editing, slurm-script
---
# processing of test data
12April22
1. Downloaded read data and put in `/mnt/lustre/macmaneslab/macmanes/physiology/reads`
2. edited read file names to remove `001`
```
for filename in *fastq.gz; do
[ -f "$filename" ] || continue
mv "$filename" "${filename//_001/}"
done
```
3. put genome and gff in `/mnt/lustre/macmaneslab/macmanes/physiology/genome`
4. Indexed genome using STAR. The script is in `$HOME/physiology` and named `star_gg.slurm`
```
#!/bin/bash
#SBATCH --partition=macmanes
#SBATCH -J star_gg
#SBATCH --cpus-per-task=40
#SBATCH --output star.log
source ~/.bash_profile
mkdir $HOME/physiology/STAR_indices
STAR --runMode genomeGenerate --runThreadN 40 --genomeDir $HOME/physiologyc/STAR_indices \
--genomeFastaFiles $HOME/physiology/genome/Peer2.0.1.fasta \
--sjdbGTFfile $HOME/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3
```
5. moved the index files to `$HOME/physiology/STAR_index`
6. Did the mapping job. The script is in `$HOME/physiology` and is named `ReadCount.job`. It is executed like this
```
sbatch $HOME/ReadCount.job \
/mnt/lustre/macmaneslab/macmanes/physiology/genome/Peer2.0.1.fasta \
/mnt/lustre/macmaneslab/macmanes/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3 \
/mnt/lustre/macmaneslab/macmanes/physiology/reads/first/pilot/220408_A01346_0055_AH22JTDRX2_16MerMACMANES_RNA_PILOT \
.fastq.gz
```
the actual mapping command is
```
STAR --runMode alignReads \
--genomeDir $DIR/STAR_index \
--readFilesIn ${READ_DIR}/${sample}R1${SUFFIX} ${READ_DIR}/${sample}R2${SUFFIX} \
--readFilesCommand zcat \
--runThreadN 40 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 10 \
--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.3 \
--quantMode GeneCounts \
--outFileNamePrefix ${DIR}/STAR_mapped/${sample} \
--outSAMtype BAM Unsorted \
--genomeLoad LoadAndKeep
```
7. once the mapping is done, I pull out all the mapping data using the following command to make a nice table.
```
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "date" "sample" "reads" "unique" "multi" "toomany" "unmapped"> mappingdata.txt
for file in $(lr *Log.final.out | awk '{print $9}')
do
sample=$(echo $file | sed "s/_Log.final.out//g" | sed "s/.Log.final.out//g")
date=$(grep "Started job" $file | cut -f 2 | cut -d ' ' -f1-2)
reads=$(grep "Number of input reads" $file | cut -f 2)
unique=$(grep "Uniquely mapped reads %" $file | cut -f 2)
multi=$(grep "% of reads mapped to multiple loci" $file | cut -f 2)
toomany=$(grep "% of reads mapped to too many loci" $file | cut -f 2)
unmapped=$(grep "Number of reads unmapped: too short" $file | cut -f 2)
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$date" "$sample" "$reads" "$unique" "$multi" "$toomany" "$unmapped" >> mappingdata.txt
done
```
it looks like this
```
Sample Number_reads Uniquely_mapping_reads Mapped_to_multiple_loci Mapped_to_too_many_loci Unmapped_reads_too_short
P1_A10_S73_L001_ 741096 77.05% 22.13% 0.08% 4776
P1_A11_S81_L001_ 360614 80.69% 18.40% 0.08% 2577
P1_A12_S89_L001_ 567409 78.29% 20.82% 0.08% 4215
P1_A1_S1_L001_ 589599 80.62% 18.48% 0.11% 3931
P1_A2_S9_L001_ 715087 77.35% 21.82% 0.08% 4616
P1_A3_S17_L001_ 478268 80.11% 18.84% 0.08% 4150
P1_A4_S25_L001_ 504477 79.56% 19.46% 0.09% 3866
P1_A5_S33_L001_ 263045 78.02% 21.18% 0.08% 1674
P1_A6_S41_L001_ 258891 79.53% 18.74% 0.07% 3756
```
On the basis of this experiments, I asked Steve to repool the libraries, bring some concentratons up, some down, and to exclude some. That spreadsheet is [here](https://universitysystemnh-my.sharepoint.com/:x:/g/personal/mdm2000_usnh_edu/EebVFrf5eehAgPnuOVoSUsIBOx1ZiYuYdekPpUtVttrYAw?e=Z57b1S)
```
sbatch ./ReadCount.job \
/mnt/lustre/macmaneslab/macmanes/physiology/genome/Peer2.0.1.fasta \
/mnt/lustre/macmaneslab/macmanes/physiology/genome/Peromyscus_eremicus__Peer2.0.1.fasta_v3.functional.gff3 \
/mnt/lustre/macmaneslab/macmanes/physiology/raw_reads/second_run/pilot/cobb.sr.unh.edu/managed/220810_A01346_0075_BH5H53DRX2_8MerMACMANES_RNA_P4/reads \
.fastq.gz
```