Treino - HackMD

# Treino - Download das Referências do NCBI: "wget", dentro do diretório refs; - Criação de um link simbólico desse genoma > genoma.fa; - Construção de um indice utilizando STAR. Para isso foi necessario a criação de um arquivo com as coordenadas de inicio e fim dos cromossomos (--sjdbFileChrStartEnd), já que o genoma não possui anotação. Abordagem sem referência de anotação. - Criando arquivo de coordenadas a partir do arquivo gff (NCBI): arquivo gff > coord_genome.txt ```bashscript= cat coord_genome.txt | sed '/##species/d' | sed '/##sequence-region/d' | cut -f 1,4,5,7 > coors_genome_cleaned.txt ``` Script index do genoma: star-index.sh ```bashscript= #!/bin/bash # ./star-idx.sh dmel-all-aox-chromosome-r6.30.fasta genoma=$1 coords=$2 mkdir -p star_index STAR --runThreadN 5 \ --runMode genomeGenerate \ --genomeFastaFiles ${genoma} \ --genomeDir ./star_index \ --sjdbOverhang 149 \ --sjdbFileChrStartEnd ${coords} --limitGenomeGenerateRAM 173586271274 # Erro antes de add --limitGenomeGenerateRAM: #EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome #SOLUTION: please specify --limitGenomeGenerateRAM not less than 163586271274 and make that much RAM available ``` - Modo de uso: ```bashscript= ./star-idx.sh genoma.fa coors_genome_cleaned.txt ``` - Script processamento, alinhamento e montagem transcritos (sem referência de anotação): - pipeRNAfcavedit.sh: Processamento de reads, alinhamento com STAR - run-pipe_edit.sh: Chama o script pipeRNAfcavedit.sh - Cuffscript: Montagem dos transcritos pelo cufflinks das amostras de forma isolada, a seguir roda o Cuffmerge que agrega todos os transcritos em uma única montagem, em seguida roda o cuffquant e cuffnorm. Primeiramente foi rodado o script run-pipe_edit.sh para as sequencias SRAs com layout paired: - Modo de uso: ```bashscript= run-pipe_edit.sh ./raw/input/ ./output/ ./refs/star_index/ &> runpipeditPE.out ``` Em seguida foi rodado o script run-pipe_edit.sh para as sequencias SRAs com layout single: - Protocolo de transferência de diretrio ```bashscrip scp -r jcardoso@143.107.143.162:/work/projects/Pedu/treino/PAIRED_out/processed/fastqc . ``` - Visualização: Entrar na pasta onde contém os arquivos html ```bash= firefox PE1KPYL_B1_T1_R1_fastqc.html firefox PE1KPYL_B1_T1_R2_fastqc.html ``` Em seguida foi feito o Cuffdiff (comandos em outra aba). A seguir: - A partir do arquivo merged.gtf criamos o transcriptoma de referencia trancriptome.fa dos transcritos de todas as amostras: `gffread merged.gtf -g /work/projects/Pedu/treino/refs/GCA_002156105.1_ASM215610v1_genomic.fna -w transcriptome.fa` ### TransDecoder: Identifica candidatos a regiões codificantes dentro de sequencias de transcritos, como aquelas geradas por montagem de novo, ou aquelas construidas baseadas em alinhamentos contra o genoma de referencia utilizando TopHat e Cufflinks. A idendificação de sequencias codificantes é baseada nos seguintes critérios: - Uma ORF (comprimento minimo) é encontrado em um transcrito. - a log-likelihood score similar to what is computed by the GeneID software is > 0. - the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames. - if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc). - a PSSM is built/trained/used to refine the start codon prediction. - optional the putative peptide has a match to a Pfam domain above the noise cutoff score. Script para predição com o transdecoder: https://hackmd.io/@lamoroso92/ByzNy5goS ### Linha de comando Transdecoder /work/projects/Pedu/programas/TransDecoder-TransDecoder-v5.5.0/TransDecoder.LongOrfs -t transcriptome.fa ### Trnity básico: Trinity --seqType fq --max_memory 100G --left reads_1.fq --right reads_2.fq --CPU 18 ## Recuperando reads que não tiveram mapeamento contra o genoma de referência Info: /work/projects/Pedu/treino/output/star_out -> diretório que contém o arquivo de alinhameto das reads contra a referência (alinhamento por biblioteca) ## comando para filtrar as reads que não tiveram alinhamento `samtools view Aligned.out.sorted.byname.bam | cut -f 1,3| grep "*" | cut -f1| nsort -u > reads_unmapped.txt` Link bitwise flag: https://broadinstitute.github.io/picard/explain-flags.html Explain SAM Flags - GitHub Pages To decode a given SAM flag value, just enter the number in the field below. The encoded properties will be listed under Summary below, to the right. broadinstitute.github.io ### outra forma de selecionar reads não mapeads (melhor) `samtools view -f 4 Aligned.out.sorted.byname.bam | cut -f1| nsort -u > unmapped.txt` ## OBSERVAÇÃO: Neste caso exite reads em que apenas 1 read do par não teve alinhamento e a outras sim, portanto exitirá uma redundância entre a contagem de reads alinhadas e não alinhadas ## Recuperando as reads não alinhadas `pullseq -i arquivo_R1.fastq -n arquivos_ids.txt > out_R1.fastq` ## EXEMPLO: Rodando para /work/projects/Pedu/treino/output/star_out/PE1KPYL_B1_T1 ``` pullseq -i /work/projects/Pedu/treino/output/processed/prinseq/PE1KPYL_B1_T1_R1.atropos.prinseq.fastq -n unmapped.txt > PE1KPYL_B1_T1_R1.atropos.prinseq.unmapped.fastq pullseq -i /work/projects/Pedu/treino/output/processed/prinseq/PE1KPYL_B1_T1_R2.atropos.prinseq.fastq -n unmapped.txt > PE1KPYL_B1_T1_R2.atropos.prinseq.unmapped.fastq pullseq -i /work/projects/Pedu/treino/output/processed/prinseq/PE1KPYL_B1_T1_R1_singletons.atropos.prinseq.fastq -n unmapped.txt > PE1KPYL_B1_T1_R1_singletons.atropos.prinseq.unmapped.fastq pullseq -i /work/projects/Pedu/treino/output/processed/prinseq/PE1KPYL_B1_T1_R2_singletons.atropos.prinseq.fastq -n unmapped.txt > PE1KPYL_B1_T1_R2_singletons.atropos.prinseq.unmapped.fastq ``` ### awk: Para usar os fastq no trinity é exigido um sufico /1 e /2 nos arquivos fastq. Para isso, segue o exemplo abaixo: ``` awk '{ if (NR%4==1) { print $1"/1" } else if (NR%4==3) { print "+" } else { print $1 } }' PETOLAP_B1_T1_R1.atropos.prinseq.unmapped.fastq > PETOLAP_B1_T1_R1.awk.fastq ``` ## Trinity: `/usr/local/bioinfo/trinityrnaseq-Trinity-v2.6.6/Trinity --seqType fq --max_memory 100G --left PEA_B2_T1_R1.pullseq.fq --right PEA_B2_T1_R2.pullseq.fq --CPU 18M` `/usr/local/bioinfo/trinityrnaseq-Trinity-v2.6.6/Trinity --seqType fq --max_memory 100G --left PEA_B2_T1_R1.pullseq.fq --right PEA_B2_T1_R2.pullseq.fq --single PEA_B2_T1_R1_singleton.pullseq.fq --single PEA_B2_T1_R2_singleton.pullseq.fq --genome_guided_bam /data/projects/Pedu/output/star_out/PEA_B2_T1/Aligned.out.sorted.bam --CPU 18M` `/usr/local/bioinfo/trinityrnaseq-Trinity-v2.6.6/Trinity --seqType fq --max_memory 100G --left PESUSAP_B1_T1_R1.awk.fastq --right PESUSAP_B1_T1_R2.awk.fastq --genome_guided_bam Aligned.out.sorted.bam --genome_guided_max_intron 1000 --CPU 18` OBS: Foi um teste, fizemos mais 2: um somente com as reads e outro somente com o genome-guided.bam. ## Pegando um transcrito da referência predito na biblioteca PESUSAP O transcrito foi escolhido baseado no alinhamento de reads da biblioteca e por ter varios exons. ``` # DIRETÓRIO: /work/projects/Pedu/treino/output/cufflinks/PESUSAP_B1_T1$ gffread transcripts.gtf -g /work/projects/Pedu/treino/refs/GCA_002156105.1_ASM215610v1_genomic.fna -w transcriptoma.fa ``` ``` echo CUFF.6.1 | pullseq -i transcriptoma.fa -N > CUFF.6.1_transcrito.fa ``` ## Formatando os arquivo da saída do trinity [Trinity-GG.fasta] para o formato NCBI (formatdb) ``` formatdb -i /work/projects/Pedu/treino/output/star_out/PESUSAP_B1_T1/trinity_out_dir/Trinity-GG.fasta -p F -o T ``` ``` blastn -db /work/projects/Pedu/treino/output/star_out/PESUSAP_B1_T1/trinity_out_dir/Trinity-GG.fasta -query CUFF.6.1_transcrito.fa -out blastn_CUFF.6.1_to_Trinity-GG.fasta.tsv -outfmt 6 -num_threads 10 ``` ### Estimação do tramanhos dos fragmentos por read: bamPEFragmentSize This tool calculates the fragment sizes for read pairs given a BAM file from paired-end sequencing.Several regions are sampled depending on the size of the genome and number of processors to estimate thesummary statistics on the fragment lengths. Properly paired reads are preferred for computation, i.e., it will only use discordant pairs if no concordant alignments overlap with a given region. The default setting simply prints the summary statistics to the screen. usage: ``` bamPEFragmentSize -p 10 -n 10000 -hist ./fragmentSize.png -T "Fragment size of PE data" --maxFragmentLength 500 -b ../../star_out/PEA_B2_T1/Aligned.out.sorted.bam &> ./bamPEFragmentSize.out ``` OBS: Não funcionou!! ### Trinity Comando para rodar os dados montagem de novo: /usr/local/bioinfo/trinityrnaseq-Trinity-v2.6.6/Trinity --seqType fq --max_memory 100G --left PEA_B2_T1_R1.pullseq.fq --right PEA_B2_T1_R2.pullseq.fq --CPU 18M ### RSeQC #### RNA_fragment_size.py Calculate fragment size for each gene/transcript. For each transcript, it will report : 1) Number of fragment that was used to estimate mean, median, std (see below). 2) mean of fragment size 3) median of fragment size 4) stdev of fragment size **Options:** --version show program’s version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input=INPUT_FILE Input BAM file -r REFGENE_BED, --refgene=REFGENE_BED Reference gene model in BED format. Must be strandard 12-column BED file. [required] -q MAP_QUAL, --mapq=MAP_QUAL Minimum mapping quality (phred scaled) for an alignment to be called “uniquely mapped”. default=30 -n FRAGMENT_NUM, --frag-num=FRAGMENT_NUM Minimum number of fragment. default=3 - Exemplo: ``` python2.7 RNA_fragment_size.py -r hg19.RefSeq.union.bed -i SRR873822_RIN10.bam > SRR873822_RIN10.fragSize head -4 SRR873822_RIN10.fragSize ``` ### **infer_experiment.py** This program is used to “guess” how RNA-seq sequencing were configured, particulary how reads were stranded for strand-specific RNA-seq data, through comparing the “strandness of reads” with the “standness of transcripts”. The “strandness of reads” is determiend from alignment, and the “standness of transcripts” is determined from annotation. For non strand-specific RNA-seq data, “strandness of reads” and “standness of transcripts” are independent. For strand-specific RNA-seq data, “strandness of reads” is largely determined by “standness of transcripts”. See below 3 examples for details. You don’t need to know the RNA sequencing protocol before mapping your reads to the reference genome. Mapping your RNA-seq reads as if they were non-strand specific, this script can “guess” how RNA-seq reads were stranded. - For pair-end RNA-seq, there are two different ways to strand reads (such as Illumina ScriptSeq protocol): > 1++,1–,2+-,2-+ read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand > 1+-,1-+,2++,2– read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand - For single-end RNA-seq, there are also two different ways to strand reads: > ++,– read mapped to ‘+’ strand indicates parental gene on ‘+’ strand read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand > +-,-+ read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand **Options:** --version show program’s version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input-file=INPUT_FILE Input alignment file in SAM or BAM format -r REFGENE_BED, --refgene=REFGENE_BED Reference gene model in bed fomat. -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE Number of reads sampled from SAM/BAM file. default=200000 -q MAP_QUAL, --mapq=MAP_QUAL Minimum mapping quality (phred scaled) for an alignment to be considered as “uniquely mapped”. default=30 Example 1: Pair-end non strand specific: `infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam` #### Output:: This is PairEnd Data Fraction of reads failed to determine: 0.0172 Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903 Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925 Interpretation: 1.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 98.28% (1 - 0.0172 = 0.9828) of reads, half can be explained by “1++,1–,2+-,2-+”, while the other half can be explained by “1+-,1-+,2++,2–”. We conclude that this is NOT a strand specific dataset because “strandness of reads” was independent of “standness of transcripts” Example 2: Pair-end strand specific: `infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam` #### Output:: This is PairEnd Data Fraction of reads failed to determine: 0.0072 Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441 Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487 Interpretation: 0.72% of total reads were mapped to genome regions that we cannot determine the “standness of transcripts” (such as regions that having both strands transcribed). For the remaining 99.28% (1 - 0.0072 = 0.9928) of reads, the vast majority was explained by “1++,1–,2+-,2-+”, suggesting a strand-specific dataset.