# RAW data study with kASA
## Generating the index
### Downloading the RefSeq database and taxonomy
Skip this step if you have already downloaded them. If you want to use a different database, feel free to.
```
mkdir database
mkdir taxonomy
cd database
curl -O -L "https://ftp.ncbi.nih.gov/refseq/release/complete/complete.[1-4275].1.genomic.fna.gz"
cd ../taxonomy
curl -O -L "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"
unzip taxdmp.zip
mkdir accession2taxid
cd accession2taxid
curl -O -L "https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_{wgs,gb}.accession2taxid.gz"
cd ../..
```
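Since `curl -O` without `--fail` also saves the server's error page when a file does not exist, it can be worth testing the downloaded archives before building. The following is a minimal sketch, not part of kASA: `check_gz` is a hypothetical helper name, and it only relies on `gzip -t` to flag truncated or non-gzip files.

```shell
# check_gz (hypothetical helper, not part of kASA):
# print every .gz file in the given directory that fails gzip's integrity test.
check_gz() {
    for gz_file in "$1"/*.gz; do
        [ -e "$gz_file" ] || continue                 # no .gz files in this directory
        gzip -t "$gz_file" 2>/dev/null || echo "$gz_file"
    done
}

# Run from the directory that contains database/ and taxonomy/.
check_gz database
check_gz taxonomy/accession2taxid
```

Every path printed is a file that should be downloaded again or removed before the build.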
### Build
This assumes you've already installed kASA. Set `-m ...` to the available RAM in GB and `-n ...` to the number of cores. If you are short on space, delete the `--igotspace` parameter. If the temporary files should go elsewhere, change `-t <path>/`.
```
mkdir index
mkdir temp
kASA build -d index/idx -i database/ -m <RAM> -t temp/ -n <CPUs> -f taxonomy/accession2taxid/ -y taxonomy/ -u species --three --kH 12 --igotspace
```
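If you are unsure what to put in for `<RAM>` and `<CPUs>`, the values can be looked up on Linux roughly like this. This is a sketch that assumes GNU coreutils (`nproc`) and a kernel that reports `MemAvailable` in `/proc/meminfo`. The variable names `cpus` and `ram_gb` are only illustrative.

```shell
# Number of CPU cores available to this process (GNU coreutils).
cpus=$(nproc)

# Available memory in whole GB, taken from the MemAvailable field (given in kB).
ram_gb=$(awk '/^MemAvailable:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)

printf 'suggested kASA flags: -m %s -n %s\n' "$ram_gb" "$cpus"
```

You may want to pass a slightly lower value to `-m` so that other processes keep some memory.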
### Shrink?
This can only be done if `index/idx_content.txt` has fewer than 65535 lines (check with `wc -l index/idx_content.txt`). It halves the index size without loss of information.
```
kASA shrink -d index/idx -o index/idx_s -s 2
```
After that, replace `index/idx` below with `index/idx_s`.
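The line-count condition can also be checked automatically before shrinking. A minimal sketch, assuming the paths used above: `can_shrink` is a hypothetical helper name, and the `kASA shrink` call is the same one shown earlier.

```shell
# can_shrink (hypothetical helper): succeeds if the file has fewer than 65535 lines.
can_shrink() {
    line_count=$(wc -l < "$1")
    [ "$line_count" -lt 65535 ]
}

if [ -f index/idx_content.txt ] && can_shrink index/idx_content.txt; then
    kASA shrink -d index/idx -o index/idx_s -s 2
else
    echo "content file missing or too large, keeping index/idx"
fi
```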
## Identifying critters
Replace `-i <inputfolder>/` with the path to the fastq files.
If you don't need the per-read information, delete the `-q ...` parameter. The output format of these files can be set with `--kraken`, `--jsonl`, `--json`, or `--tsv`. Most pipelines work with the kraken output, so that is used here by default. Personally, I prefer the jsonl format. The profile will always be in csv format.
Change values for `-m ...` and `-n ...` to the available RAM in GB and the number of cores, respectively.
`identify_multiple` works by processing multiple files at the same time, dividing resources as best as possible. This only makes sense if the index fits into the RAM. If not, using `identify` could work better.
Using `--six` slows down processing significantly but makes sense if the orientation of the sequences isn't known. `--one` and `--three` are valid options as well and speed things up, but they risk misidentification.
```
mkdir out
kASA identify_multiple -d index/idx -m <RAM> -t temp/ -n <CPUs> -i <inputfolder>/ -p out/ -q out/ -r -k 12 7 --kraken --six
```
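To get a rough idea of whether the index fits into RAM, its file size can be compared with the amount of memory given to `-m`. This sketch assumes the index is stored in the single file `index/idx` and ignores the extra memory kASA needs while running, so treat the result as a hint only. `fits_in_ram` and `ram_gb` are hypothetical names.

```shell
# fits_in_ram (hypothetical helper): succeeds if the file size in bytes
# does not exceed the given amount of RAM in GB.
fits_in_ram() {
    size_bytes=$(wc -c < "$1")
    [ "$size_bytes" -le $(( $2 * 1024 * 1024 * 1024 )) ]
}

ram_gb=64   # assumption: replace with the value you pass to -m
if [ -f index/idx ] && fits_in_ram index/idx "$ram_gb"; then
    echo "index fits into RAM: identify_multiple is reasonable"
else
    echo "index missing or larger than RAM: consider identify"
fi
```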
## Interpret results
The scripts on my GitHub can help you organize the results or convert one format to another. It's up to you. Consult me when it's done and we'll have a look at it together.