# Building the index for Human read-removal with kASA

## Create conda environment and install kASA

```
conda create -n kasa -c conda-forge -c silvioweging kasa
conda activate kasa
```

## Download reference

```
curl -L -o assembly_info.txt https://ftp.ncbi.nih.gov/genomes/.vol2/refseq/assembly_summary_refseq.txt
grep "Homo sapiens" assembly_info.txt | cut -f 1,16,20 | sed -r 's/^(\S*)\s+(\S*)\s+(\S*)/\3\/\1\_\2\_genomic.fna.gz/' | xargs curl -L -o reference.fasta.gz
```

## Generate content file

```
zgrep '>' reference.fasta.gz | cut -d ' ' -f 1 | cut -d '>' -f 2 | tr '\n' ' ' | sed "s/ /;/g" | rev | cut -c2- | rev | xargs printf "Homo sapiens\t9606\t9606\t%s" > kASA_idx_homo_sapiens_content.txt
```

## Build index

Building the index uses a lot of disk space and memory if you let it. To limit RAM usage, replace `inf` in `-m inf` with a value in GB that suits you. If you are short on disk space, delete the `--igotspace` parameter. To write the temporary files somewhere else, change `-t ./` to `-t <path>/`.

```
kASA build -c kASA_idx_homo_sapiens_content.txt -d idx -i reference.fasta.gz -k 12 7 -n -1 -m inf --three --igotspace -t ./
kASA shrink -c kASA_idx_homo_sapiens_content.txt -d idx -o kASA_idx_homo_sapiens -s 2
```

## Clean up

```
rm idx*
rm assembly_info.txt
rm stxxl*
rm reference.fasta.gz
```
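
## Optional: preview the download URL

The `sed` expression in the download step is dense, so it can help to run it on a single row first. The sketch below feeds it one made-up row (the accession, assembly name and `example.org` address are placeholders, not real NCBI values) that holds only the three columns kept by `cut -f 1,16,20`. It assumes GNU `sed`, as the original command does.

```shell
# Hypothetical sample row: accession <TAB> assembly name <TAB> FTP directory.
# All three values are illustrative placeholders.
row=$(printf 'GCF_000000000.1\tExampleAsm\thttps://example.org/genomes/GCF_000000000.1_ExampleAsm')

# Same sed expression as in the download step:
# it rearranges the fields into <directory>/<accession>_<name>_genomic.fna.gz
url=$(printf '%s\n' "$row" | sed -r 's/^(\S*)\s+(\S*)\s+(\S*)/\3\/\1\_\2\_genomic.fna.gz/')

echo "$url"
```

On the real data, running the `grep | cut | sed` part of the download pipeline without the final `xargs curl` prints the URL(s) that would be fetched, which also shows whether more than one assembly matches `Homo sapiens`.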
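
## Optional: try the content file pipeline on a toy input

To see what the content file pipeline produces before running it on the full reference, the same commands can be pointed at a tiny gzipped FASTA. This is only a sketch: the two record names (`SEQ_A.1`, `SEQ_B.1`) and the file names under the temporary directory are made up for illustration, and it assumes the same Linux tools the original pipeline uses (`zgrep`, `rev`, GNU `sed`).

```shell
# Work in a throwaway directory so nothing from the real run is touched.
tmp=$(mktemp -d)

# Toy two-record FASTA with hypothetical accessions, gzipped like the real reference.
printf '>SEQ_A.1 toy record one\nACGT\n>SEQ_B.1 toy record two\nTTGA\n' | gzip > "$tmp/toy.fasta.gz"

# Same pipeline as in "Generate content file", pointed at the toy input:
# take each header's first word, drop '>', join with ';', strip the trailing ';'.
zgrep '>' "$tmp/toy.fasta.gz" | cut -d ' ' -f 1 | cut -d '>' -f 2 | tr '\n' ' ' | sed "s/ /;/g" | rev | cut -c2- | rev | xargs printf "Homo sapiens\t9606\t9606\t%s" > "$tmp/toy_content.txt"

# Report how many tab-separated fields were written and show the accession list.
summary=$(awk -F '\t' '{ print NF " fields; accessions: " $4 }' "$tmp/toy_content.txt")
echo "$summary"
```

The result is a single line with four tab-separated fields, the last of which is the `;`-joined list of sequence accessions taken from the FASTA headers.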