# RAW data study with kASA

## Generating the index

### Downloading the RefSeq database and taxonomy

Only needed if you haven't done so already. If you'd rather use a different database, feel free to.

```
mkdir database
mkdir taxonomy
cd database
curl -O -L "https://ftp.ncbi.nih.gov/refseq/release/complete/complete.[1-4275].1.genomic.fna.gz"
cd ../taxonomy
curl -O -L "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"
unzip taxdmp.zip
mkdir accession2taxid
cd accession2taxid
curl -O -L "https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_{wgs,gb}.accession2taxid.gz"
cd ../..
```

### Build

This assumes you've already installed kASA.

Change the values for `-m ...` and `-n ...` to the available RAM in GB and the number of cores, respectively. If you are short on space, delete the `--igotspace` parameter. If the temporary files should go elsewhere, change `-t <path>/`.

```
mkdir index
mkdir temp
kASA build -d index/idx -i database/ -m <RAM> -t temp/ -n <CPUs> -f taxonomy/accession2taxid/ -y taxonomy/ -u species --three --kH 12 --igotspace
```

### Shrink?

This can only be done if the line count reported by `wc -l index/idx_content.txt` is smaller than 65535. It halves the index size without loss of information.

```
kASA shrink -d index/idx -o index/idx_s -s 2
```

After that, replace `index/idx` below with `index/idx_s`.

## Identifying critters

Replace `-i <inputfolder>/` with the path to the fastq files.

If you don't need the read-per-read information, delete the `-q ...` parameter. The output format of these files can be set with `--kraken`, `--jsonl`, `--json`, or `--tsv`. Most pipelines work with the kraken output, so that is used here by default. Personally, I prefer the jsonl format. The profile will always be in csv format.

Change the values for `-m ...` and `-n ...` to the available RAM in GB and the number of cores, respectively.

`identify_multiple` processes multiple files at the same time, dividing resources between them as well as possible. This only makes sense if the index fits into RAM. If it doesn't, using `identify` could work better.

Using `--six` slows down the whole processing significantly but makes sense if the orientation of the sequences isn't known. `--one` and `--three` are valid options as well and speed things up, but they risk misidentification.

```
mkdir out
kASA identify_multiple -d index/idx -m <RAM> -t temp/ -n <CPUs> -i <inputfolder>/ -p out/ -q out/ -r -k 12 7 --kraken --six
```

## Interpret results

The scripts in my GitHub can help to organize the results or convert one format to another. It's up to you.

Consult me when it's done and we'll have a look at it together.
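
For a quick first look at a read-per-read file before we go through it together, something like the following sketch might help. It is not one of the scripts mentioned above, and the function name `summarize_kraken` is made up for illustration. It assumes the standard Kraken layout (tab-separated, `C` or `U` in the first column, the taxonomic ID in the third); please check one of your output files first, since the `--kraken` output of kASA may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough summary of a Kraken-style read-per-read file.
# ASSUMPTION: tab-separated columns, column 1 is "C" (classified) or
# "U" (unclassified), column 3 is the taxonomic ID of the assignment.
# Usage: summarize_kraken <file> [number of taxa to list, default 10]
summarize_kraken() {
    local file="$1"
    local top="${2:-10}"

    # Count classified and unclassified reads.
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END { printf "classified\t%d\nunclassified\t%d\n", c, u }
    ' "$file"

    # List the most frequent taxonomic IDs among the classified reads.
    awk -F'\t' '$1 == "C" { print $3 }' "$file" \
        | sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$top" \
        | awk '{ printf "taxid\t%s\t%d\n", $2, $1 }'
}
```

After sourcing the function, call it with the path to one of the files written to `out/` by `-q`, for example `summarize_kraken out/<file> 20`. The taxonomic IDs can then be looked up in the `names.dmp` file from the taxonomy download.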