---
tags: STAR, slurm script, htseq-count
---
# Hypothalamus seq
reads are in `/mnt/lz01/macmaneslab/shared/hypothalamus_seq/raw_reads`
index is here: `/mnt/lz01/macmaneslab/shared/STAR_index`
### I did this to the reads to remove the `_001` suffix, which seems to be problematic
```
for filename in *fastq.gz; do
[ -f "$filename" ] || continue
mv "$filename" "${filename//_001/}"
done
```
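To preview the renames without touching anything (same substitution, just echoed as a dry run):
```
for filename in *fastq.gz; do
[ -f "$filename" ] || continue
echo "$filename -> ${filename//_001/}"
done
```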
### This is a working slurm script for mapping and counting
### Usage `sbatch MapCount.job index_location reads_dir results_dir`
current location is: `/mnt/lz01/macmaneslab/macmanes/MapCount.job`
Note-1: this is substantially changed from Dani's script to make it more general.
Note-2: genome indexing was already done beforehand.
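For example, with the paths above (the results directory name here is just a placeholder):
```
sbatch MapCount.job /mnt/lz01/macmaneslab/shared/STAR_index \
/mnt/lz01/macmaneslab/shared/hypothalamus_seq/raw_reads \
/mnt/lz01/macmaneslab/shared/hypothalamus_seq/star_results
```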
```
#!/bin/bash
#SBATCH --partition=macmanes
#SBATCH --cpus-per-task=40
#SBATCH --mem 310Gb
#SBATCH --exclude=node117,node118
#SBATCH --output hypo_data.log
ulimit -n 10000
module purge
module load anaconda/colsa
conda activate star-2.7.10b
#### Purpose: align reads to a genome and count them
### USAGE
# sbatch MapCount.job index_location reads_dir results_dir
INDEX=$1
READ_DIR=$2
RESULTS_DIR=$3
echo "index: $INDEX"
echo "Reads directory: $READ_DIR"
echo "RESULTS_DIR: $RESULTS_DIR"
# make a directory for star read mapping
mkdir -p "$RESULTS_DIR"
#####################
##### map reads #####
#####################
# list all samples to map
samples=$(basename -a ${READ_DIR}/*1.fastq.gz | sed "s/R1.fastq.gz//g")
echo Mapping these samples: $samples
# map reads with STAR
for sample in $samples
do
STAR --runMode alignReads \
--genomeDir $INDEX \
--readFilesIn ${READ_DIR}/${sample}R1.fastq.gz ${READ_DIR}/${sample}R2.fastq.gz \
--readFilesCommand zcat \
--runThreadN 40 \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 10 \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0.3 \
--quantMode GeneCounts \
--outFileNamePrefix $RESULTS_DIR/${sample} \
--outSAMtype BAM Unsorted \
--genomeLoad LoadAndKeep \
--limitBAMsortRAM 254748364800 \
--sjdbGTFtagExonParentTranscript Parent
done
#######################
##### count reads #####
#######################
conda deactivate
conda activate htseq-0.13.5
# count reads with htseq-count
echo Counting genes with htseq-count for these samples: $samples
for sample in $samples
do
htseq-count \
--stranded=reverse \
-f bam \
-t exon \
--idattr=Parent \
--additional-attr=gene \
--additional-attr=GeneID \
--additional-attr=Dbxref \
$RESULTS_DIR/${sample}Aligned.out.bam \
/mnt/lz01/macmaneslab/shared/genome/PerEre_H2_v1/PerEre_H2_v1_genomic.gff > $RESULTS_DIR/${sample}gene.counts
done
```
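Because the STAR call uses `--genomeLoad LoadAndKeep`, the index stays loaded in shared memory on the node after the job finishes. Running STAR's `Remove` mode on that same node afterwards frees it:
```
STAR --genomeDir /mnt/lz01/macmaneslab/shared/STAR_index --genomeLoad Remove
```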
### mapping data analysis
```
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "date" "sample" "reads" "unique" "multi" "toomany" "unmapped"> mappingdata.txt
# iterate over the logs oldest-first so rows come out in run order
for file in $(ls -tr *Log.final.out)
do
sample=$(echo $file | sed "s/_Log.final.out//g" | sed "s/.Log.final.out//g")
date=$(grep "Started job" $file | cut -f 2 | cut -d ' ' -f1-2 | sed "s/ //g")
reads=$(grep "Number of input reads" $file | cut -f 2)
unique=$(grep "Uniquely mapped reads %" $file | cut -f 2)
multi=$(grep "% of reads mapped to multiple loci" $file | cut -f 2)
toomany=$(grep "% of reads mapped to too many loci" $file | cut -f 2)
unmapped=$(grep "Number of reads unmapped: too short" $file | cut -f 2)
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$date" "$sample" "$reads" "$unique" "$multi" "$toomany" "$unmapped" >> mappingdata.txt
done
```
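To eyeball the resulting table with aligned columns:
```
column -t -s $'\t' mappingdata.txt
```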
### once this is done, need to edit and download count files
for now, just need the 1st and last columns of the data,
otherwise it won't import properly into DESeq2
```
mkdir -p count_data
for sample in *counts; do
# keep just the gene ID (1st column) and the count (last column), tab-separated
awk 'BEGIN{OFS="\t"} {print $1, $NF}' "$sample" > count_data/"$sample"
done
```
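A quick sanity check that every trimmed file lists the same genes in the same order (every line of output should show the same checksum):
```
for f in count_data/*counts; do cut -f1 "$f" | md5sum; done | sort -u
```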
### change names of files
to make parsing in R easier
##### Probably need to add water info here too; it may be easier to add in R, but it would be cleaner here if possible
```
cd count_data
for filename in *counts; do
[ -f "$filename" ] || continue
# chain the substitutions on the name, then rename once;
# running four separate mv commands errors out after the first rename succeeds
newname="${filename/MP/_male_pit}"
newname="${newname/FP/_female_pit}"
newname="${newname/FS/_female_sonpvn}"
newname="${newname/MS/_male_sonpvn}"
[ "$newname" = "$filename" ] || mv "$filename" "$newname"
done
```
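As a hypothetical example, a file named `042MS.gene.counts` would come out as `042_male_sonpvn.gene.counts`.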
Download to a local computer via whatever means is easiest, e.g.:
```
scp macmanes@premise.sr.unh.edu:/mnt/lz01/macmaneslab/macmanes/dani_mapping/count_data/*counts .
```
### Work in R
```
...
```
### scratch (don't use this code)
```
mkdir count_data
for samples in *counts; do
# pull the numeric GeneID out of the Dbxref column (second-to-last field, e.g. GeneID:12345,...)
awk '{print $(NF-1)}' $samples | awk -F : '{print $2}' | awk -F , '{print $1}' > temp1
awk '{print $NF}' $samples > temp2
paste temp1 temp2 > count_data/$samples
# drop the five trailing htseq-count summary rows (__no_feature, __ambiguous, etc.)
for i in 1 2 3 4 5; do sed -i '$d' count_data/$samples; done
rm temp1 temp2
done
```
also need to aggregate the count data (sum counts that share the same ID)
```
#!/bin/bash
# Aggregate a two-column TSV: sum column 2 over each unique ID in column 1.
# Usage: bash aggregate.sh input.tsv output.tsv
awk 'BEGIN {
FS = OFS = "\t"
}
{
# these count files have no header row, so process every line
data[$1] += $2
}
END {
for (id in data) {
print id, data[id]
}
}' "$1" > "$2"
```
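Saved as, say, `aggregate.sh` (the script and output names are just examples), it can be run over every trimmed count file:
```
mkdir -p aggregated
for f in count_data/*counts; do
bash aggregate.sh "$f" "aggregated/$(basename "$f")"
done
```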