Fasta Input manipulation

# Fasta Input manipulation ## GC content To check GC% distribution in any given fasta file (assembly) we can run the script: `/groups/fr2020/bin/average_GC_hist_FASTA.sh` as (in your working dir with the fasta file): `/groups/fr2020/bin/average_GC_hist_FASTA.sh infile.fa 1` I will create an output GC.png with an histogram. Always, rename the output for further details "infile_CG.png" *The number "1" dictates how to bin/group the data...not really important now. ## Fasta/Assembly Summary We are using the stats script from BBTools https://jgi.doe.gov/data-and-tools/bbtools/ In the folder with your fasta file: `/bioware/bbtools-38.84/stats.sh in=contig.fa > contig.summary.txt` ## Extract fasta seqs by IDs: ####Extract IDs from assembly: You nee a text file (pattern file) with a list of IDs (>ID) tabulated: ``` $ more abulated-IDs.txt ID_1 ID_2 ID_3 ... ID_N ``` then: `seqkit grep --pattern-file Tabulated-IDs.txt Input-fasta.fa > IDs-fasta.fa` ## Count number of sequences inside a fasta file: ``` grep -c "^>" Perma827_Rotifera.fa ``` ## More about seqkit tool in https://bioinf.shenwei.me/seqkit/ ## Split fasta file based on GC% `~/bin/prinseq-lite-0.20.4/prinseq-lite.pl -max_gc 50 -fasta input.fa` Will output two files: *_good (with the seqs within the limit) and *_bad the rest of seqs. ## Fasta size selection First, let check the size distribution of a given multifasta file. like: `cat file.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'` It will output in the terminal, or you can pipe it to a file `> file-sizes.txt` Then select by size with seqkit tool: fitlering sequences with minimum 100 aa/nt length i.e > 100, `seqkit seq -m 100 test.fa` For filtering sequences with maximum of 1000 aa/nt length<=1000 `seqkit seq -M 1000 test.fa` For filtering sequences between maxiumum size 200 and minimum size of 100: `seqkit seq -m 100 -M 200 test.fa`