---
tags: STAR, slurm script, htseq-count
---

# Hypothalamus seq

reads are in `/mnt/lz01/macmaneslab/shared/hypothalamus_seq/raw_reads`

index is here: `/mnt/lz01/macmaneslab/shared/STAR_index`

### I did this to the reads to remove the `001`, which seems to be problematic

```
for filename in *fastq.gz; do
    [ -f "$filename" ] || continue
    mv "$filename" "${filename//_001/}"
done
```

### This is a working slurm script for mapping and counting

### Usage

`sbatch MapCount.job index_location reads_dir results_dir`

For this dataset that would be, e.g., `sbatch MapCount.job /mnt/lz01/macmaneslab/shared/STAR_index /mnt/lz01/macmaneslab/shared/hypothalamus_seq/raw_reads star_results` (the results directory name is arbitrary).

current location is: `/mnt/lz01/macmaneslab/macmanes/MapCount.job`

Note 1: this is fairly substantially changed from Dani's script to make it more general.

Note 2: genome indexing is already done beforehand.

```
#!/bin/bash
#SBATCH --partition=macmanes
#SBATCH --cpus-per-task=40
#SBATCH --mem 310Gb
#SBATCH --exclude=node117,node118
#SBATCH --output hypo_data.log

ulimit -n 10000
module purge
module load anaconda/colsa
conda activate star-2.7.10b

#### Purpose: align reads to a genome and count them
### USAGE
# sbatch MapCount.job index_location reads_dir results_dir

INDEX=$1
READ_DIR=$2
RESULTS_DIR=$3

echo "Index: $INDEX"
echo "Reads directory: $READ_DIR"
echo "Results directory: $RESULTS_DIR"

# make a directory for STAR read mapping (-p: no error if it already exists)
mkdir -p "$RESULTS_DIR"

#####################
##### map reads #####
#####################

# list all samples to map, stripping the R1/R2 read suffix to get sample prefixes
samples=$(basename -a ${READ_DIR}/*1.fastq.gz | sed "s/R1.fastq.gz//g")
echo Mapping these samples: $samples

# map reads with STAR; --genomeLoad LoadAndKeep holds the index in shared
# memory so it is loaded only once across all samples
for sample in $samples
do
    STAR --runMode alignReads \
    --genomeDir $INDEX \
    --readFilesIn ${READ_DIR}/${sample}R1.fastq.gz ${READ_DIR}/${sample}R2.fastq.gz \
    --readFilesCommand zcat \
    --runThreadN 40 \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 10 \
    --outFilterScoreMinOverLread 0 \
    --outFilterMatchNminOverLread 0.3 \
    --quantMode GeneCounts \
    --outFileNamePrefix $RESULTS_DIR/${sample} \
    --outSAMtype BAM Unsorted \
    --genomeLoad LoadAndKeep \
    --limitBAMsortRAM 254748364800 \
    --sjdbGTFtagExonParentTranscript Parent
done

#######################
##### count reads #####
#######################

conda deactivate
conda activate htseq-0.13.5

# count reads with htseq-count against the GFF annotation
echo Counting genes with htseq-count for these samples: $samples
for sample in $samples
do
    htseq-count \
    --stranded=reverse \
    -f bam \
    -t exon \
    --idattr=Parent \
    --additional-attr=gene \
    --additional-attr=GeneID \
    --additional-attr=Dbxref \
    $RESULTS_DIR/${sample}Aligned.out.bam \
    /mnt/lz01/macmaneslab/shared/genome/PerEre_H2_v1/PerEre_H2_v1_genomic.gff > $RESULTS_DIR/${sample}gene.counts
done
```

## mapping data analysis

```
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "date" "sample" "reads" "unique" "multi" "toomany" "unmapped" > mappingdata.txt

# iterate over the STAR logs, oldest first
for file in $(ls -tr *Log.final.out)
do
    sample=$(echo $file | sed "s/_Log.final.out//g" | sed "s/.Log.final.out//g")
    date=$(grep "Started job" $file | cut -f 2 | cut -d ' ' -f1-2 | sed "s/ //g")
    reads=$(grep "Number of input reads" $file | cut -f 2)
    unique=$(grep "Uniquely mapped reads %" $file | cut -f 2)
    multi=$(grep "% of reads mapped to multiple loci" $file | cut -f 2)
    toomany=$(grep "% of reads mapped to too many loci" $file | cut -f 2)
    unmapped=$(grep "Number of reads unmapped: too short" $file | cut -f 2)
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "$date" "$sample" "$reads" "$unique" "$multi" "$toomany" "$unmapped" >> mappingdata.txt
done
```
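To flag problem libraries quickly, something like this works on `mappingdata.txt` (a sketch; the 70% cutoff is arbitrary, and adding 0 coerces STAR's `85.43%`-style values to numbers in awk):

```
# print any sample whose uniquely-mapped percentage is below 70
awk -F '\t' 'NR > 1 && ($4 + 0) < 70 {print $2, $4}' mappingdata.txt
```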
### once this is done, need to edit and download count files

For now, just keep the 1st and last columns of the data; otherwise it won't import properly into DESeq2.

```
mkdir count_data
for f in *counts; do
    # keep only the feature ID (column 1) and the count (last column)
    awk '{print $1 "\t" $NF}' "$f" > count_data/"$f"
done
```

### change names of files to make parsing in R easier

##### Need to add water info here, probably.. maybe it will be easier to add in R. Would be cleaner here if possible.

```
cd count_data
for filename in *counts; do
    [ -f "$filename" ] || continue
    # apply all substitutions to a copy of the name, then rename once
    newname="$filename"
    newname="${newname/MP/_male_pit}"
    newname="${newname/FP/_female_pit}"
    newname="${newname/FS/_female_sonpvn}"
    newname="${newname/MS/_male_sonpvn}"
    [ "$newname" = "$filename" ] || mv "$filename" "$newname"
done
```

Download to a local computer via whatever means is easiest, e.g.:

```
scp macmanes@premise.sr.unh.edu:/mnt/lz01/macmaneslab/macmanes/dani_mapping/count_data/*counts .
```

### Work in R

```
...
```

### scratch (don't use this code)

```
mkdir count_data
for f in *counts; do
    # pull the gene ID out of the Dbxref field (e.g. "GeneID:12345,..."),
    # pair it with the count, then strip the five htseq-count summary rows
    # (__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique)
    awk '{print $(NF-1)}' "$f" | awk -F : '{print $2}' | awk -F , '{print $1}' > temp1
    awk '{print $NF}' "$f" > temp2
    paste temp1 temp2 > count_data/"$f"
    for i in 1 2 3 4 5; do sed -i '$d' count_data/"$f"; done
    rm temp1 temp2
done
```

also need to aggregate count data

```
#!/bin/bash

# Sum the counts (column 2) for every occurrence of an ID (column 1) in a TSV
# USAGE: script.sh input.tsv output.tsv
#input_file="path/to/your/file.tsv"
#output_file="path/to/your/aggregated_file.tsv"

awk 'BEGIN { FS=OFS="\t" }
{
    if (NR > 1) {   # skip a header line; drop this test if the file has none
        data[$1] += $2
    }
}
END {
    for (id in data) {
        print id, data[id]
    }
}' "$1" > "$2"
```
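Invocation sketch, with hypothetical file names (the script isn't named above; `aggregate_counts.sh` is assumed here). awk's `for (id in data)` returns IDs in arbitrary order, so sort afterwards if order matters:

```
bash aggregate_counts.sh 101_male_pitgene.counts 101_male_pit.aggregated.counts
sort -k1,1 -o 101_male_pit.aggregated.counts 101_male_pit.aggregated.counts
```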