---
tags: GeneLab
title: Generating fasta for simulating data for human-read removal
---

# Generating fasta for simulating data for human-read removal

[toc]

## Download results
The resulting fasta holds 3 bugs (a Pseudomonas aeruginosa, a Staphylococcus epidermidis, and a Micrococcus luteus) and about 5 MB of human genome. It was created as shown below, and can be downloaded with the following:

```bash
curl -L -o mock-bug-and-human.fa.gz https://figshare.com/ndownloader/files/31454380
```

## Making

### Env
This environment was used for downloading the genomes by accession and renaming the fasta headers:

```bash
conda create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
conda activate bit
```

### Downloading genomes
Making a list of target accessions (3 randomly picked microbes and the human reference genome):

```bash
printf "GCF_000006765.1\nGCF_006094375.1\nGCF_019890915.1\nGCF_000001405.39\n" > target-accs.txt
```

Downloading:

```bash
bit-dl-ncbi-assemblies -w target-accs.txt -f fasta
```

### Subsetting human genome
Taking just over 5 MB from the start of the human genome, and removing the leading lines that are all Ns:

```bash
zcat GCF_000001405.39.fa.gz | head -n 70000 | grep -v "^N*$" > GCF_000001405.39-subset.fa
rm GCF_000001405.39.fa.gz
```

### Renaming fasta headers

```bash
gunzip *.gz

bit-rename-fasta-headers -i GCF_000006765.1.fa -w bug-Pseudomonas-aeruginosa-GCF_000006765.1 -o bug-GCF_000006765.1-for-cat.fa
bit-rename-fasta-headers -i GCF_006094375.1.fa -w bug-Staphylococcus-epidermidis-GCF_006094375.1 -o bug-GCF_006094375.1-for-cat.fa
bit-rename-fasta-headers -i GCF_019890915.1.fa -w bug-Micrococcus-luteus-GCF_019890915.1 -o bug-GCF_019890915.1-for-cat.fa
bit-rename-fasta-headers -i GCF_000001405.39-subset.fa -w human-GCF_000001405.39 -o human-GCF_000001405.39-for-cat.fa
```

### Combining

```bash
cat *-for-cat.fa > mock-bug-and-human.fa
gzip mock-bug-and-human.fa
```
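
As a quick sanity check on the combined file, the sequences and bases can be tallied per header prefix. This is a minimal sketch and not part of the original workflow: the function name `summarize_fasta` is hypothetical, and it assumes each renamed header begins with the text passed to `-w` (so the part before the first `-` is `bug` or `human`).

```bash
# Hypothetical helper: print "<prefix> <num sequences> <num bases>" (tab-separated)
# for a gzipped fasta, grouping by the header text before the first "-".
# Assumes headers start with the -w text used when renaming (e.g. "bug-..." or "human-...").
summarize_fasta() {
    gzip -dc "$1" | awk '
        # header line: drop the ">" and take everything before the first "-" as the label
        /^>/ { split(substr($0, 2), parts, "-"); label = parts[1]; seqs[label]++; next }
        # sequence line: add its length to the running total for the current label
        { bases[label] += length($0) }
        END { for (l in seqs) printf "%s\t%d\t%d\n", l, seqs[l], bases[l] }
    ' | sort
}
```

Running `summarize_fasta mock-bug-and-human.fa.gz` should then give one line for `bug` and one for `human`, which makes it easy to confirm the human portion is roughly the intended size.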