## K-mer analysis

*k-mer size of 79 on PE100 reads*
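
A read of length L contains L - k + 1 k-mers, so a 100 bp read yields 22 k-mers at k = 79. The sketch below is only a toy illustration and not part of the lab pipeline: the helper names `kmers` and `repeated_kmers` and the short test sequence are made up for this example. It lists the k-mers of a sequence and counts how many distinct k-mers occur more than once at a small and a larger k.

```bash
#!/usr/bin/env bash
# Toy illustration only: helper names and the sequence are hypothetical.

# Print every k-mer of a sequence, one per line.
kmers() {
    local seq=$1 k=$2 i
    for ((i = 0; i + k <= ${#seq}; i++)); do
        printf '%s\n' "${seq:i:k}"
    done
}

# Count the distinct k-mers that occur more than once in a sequence.
repeated_kmers() {
    kmers "$1" "$2" | sort | uniq -d | wc -l | tr -d ' '
}

toy="AACTGGAACTCCAACTGT"   # made-up sequence with a repeated AACT motif

echo "k-mers per 100 bp read at k=79: $((100 - 79 + 1))"
echo "repeated 4-mers in toy sequence: $(repeated_kmers "$toy" 4)"
echo "repeated 8-mers in toy sequence: $(repeated_kmers "$toy" 8)"
```

In this toy sequence the short k-mers collide while the longer ones are all unique, which is the effect a larger k-mer size is meant to exploit.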
## Response questions:
1. The files for the PE100 reads are 24M, and those for the PE50 reads are 14M. The PE100 files are larger because each read is 100 bp long instead of 50 bp, so they hold more sequence than the PE50 files when the number of reads is the same.
2. I think the 1 and 2 here refer to the first and second read (forward and reverse) of the same DNA fragment.
3. The contents of the two files are different.
4. Both files contain 400000 reads
5. A larger k-mer size better avoids the influence of repetitive stretches in the genome. For example, there might be many AACT segments in the genome, but it is very unlikely that a 100 bp stretch repeats many times.
6. For organisms with small or less complex genomes, short k-mers might be sufficient to accurately assemble the genome. These genomes typically have fewer repetitive sequences, reducing the risk of misassemblies with short k-mers.
7. Both k-mer size and seed size represent a trade-off between specificity and sensitivity; high specificity and high sensitivity can hardly be achieved at the same time with either the k-mer size or the BLAST seed size.
8. There are many differences between the Delta strain and the Wuhan strain: for example, at the 35th bp, the G is mutated to T.
9. Gaps are 19/29823
10. I looked at the beginning and the end of the virus genome, and I think most of the mutations accumulate in these two regions. To determine which genes these mutations are located in, I could enter the mutation positions into a virus genome browser such as the UCSC Genome Browser.
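
The read counts and the file comparison in answers 3 and 4 can be checked from the command line. The following is a minimal sketch, assuming the common FASTQ layout of exactly four lines per read; the helper name `count_fastq_reads` and the toy files are hypothetical and stand in for the real `_1` and `_2` read files.

```bash
#!/usr/bin/env bash
# Sketch only: assumes 4 lines per FASTQ record (no wrapped sequence lines).

# Print the number of reads in a FASTQ file.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Toy paired files (hypothetical names and contents) so the sketch runs anywhere.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$tmpdir/toy_1.fq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nTCAA\n+\nIIII\n' > "$tmpdir/toy_2.fq"

n1=$(count_fastq_reads "$tmpdir/toy_1.fq")
n2=$(count_fastq_reads "$tmpdir/toy_2.fq")
echo "reads in file 1: $n1, reads in file 2: $n2"

# cmp -s is silent and exits non-zero when the two files differ.
if cmp -s "$tmpdir/toy_1.fq" "$tmpdir/toy_2.fq"; then
    echo "files are identical"
else
    echo "files differ"
fi
```

Paired files are expected to have the same read count but different contents, since each file holds one end of every fragment.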
## Part 2 SPAdes:
```
#!/bin/bash
#SBATCH -J SPAdes_Assembly
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mem=32G
#SBATCH -t 48:00:00
#SBATCH -p your_partition_name
#SBATCH --output=spades_assembly_%j.out
#SBATCH --error=spades_assembly_%j.err
# Set up environment
DATA_FOLDER=/group/bit150/Lab_08/PE50_reads/
SAMPLE=OL456172.1
SPADES_OUTPUT_FOLDER=/path/to/spades_output_${SAMPLE}
MEGAHIT_ASSEMBLY=/path/to/megahit_output/final.contigs.fa
REFERENCE_BLASTDB=/path/to/Wuhan_blastdb/NC_045512.2
BLAST_OUTPUT=${SPADES_OUTPUT_FOLDER}/SPAdes_vs_Megahit_blast.txt
module load spades
module load ncbi-toolkit
spades.py -1 ${DATA_FOLDER}/${SAMPLE}_1.fq \
-2 ${DATA_FOLDER}/${SAMPLE}_2.fq \
--threads 4 \
--memory 32 \
-o ${SPADES_OUTPUT_FOLDER}
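# Compare the SPAdes contigs directly against the Megahit assembly.
# blastn does not accept -db together with -subject, so only -subject is used here;
# to search the Wuhan reference instead, swap -subject for -db ${REFERENCE_BLASTDB}.
blastn -query ${SPADES_OUTPUT_FOLDER}/contigs.fasta \
-subject ${MEGAHIT_ASSEMBLY} \
-out ${BLAST_OUTPUT}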
echo "BLAST comparison complete. Results in ${BLAST_OUTPUT}"
```
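
Before reading the BLAST output, a quick summary of each assembly can make the SPAdes and Megahit results easier to compare. The sketch below is an assumption about how one might do this, not part of the job script above: the helper name `fasta_stats` and the toy FASTA file are hypothetical, and in practice the function would be pointed at `contigs.fasta` and `final.contigs.fa`.

```bash
#!/usr/bin/env bash
# Sketch only: summarizes a FASTA file with awk.

# Print the number of contigs, total length, and longest contig in a FASTA file.
fasta_stats() {
    awk '
        /^>/ {
            # A new header closes the previous record, if any.
            if (len > 0) { total += len; if (len > longest) longest = len }
            n++; len = 0; next
        }
        { len += length($0) }   # sequence may be wrapped over several lines
        END {
            if (len > 0) { total += len; if (len > longest) longest = len }
            printf "contigs=%d total_bp=%d longest_bp=%d\n", n, total, longest
        }
    ' "$1"
}

# Toy FASTA (made-up contigs) so the sketch runs without the real assemblies.
toy_fasta=$(mktemp)
trap 'rm -f "$toy_fasta"' EXIT
printf '>contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n' > "$toy_fasta"

fasta_stats "$toy_fasta"
```

Running the same function on both assemblies gives a rough first impression of which one is more contiguous.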
## Part 3 response:
Assembling a genome from sequencing data is conceptually similar to reconstructing the text of a book from randomly shredded pieces. Both processes involve piecing together fragments to reconstruct an original sequence, whether it's the linear sequence of nucleotides in DNA or the sequential order of words and sentences in a text. This comparison illuminates both the similarities and differences in these challenges.
In both cases, the objective is to reconstruct the original order from a collection of fragments (reads in genome sequencing, and sentences or words in the book analogy). A contig in genomics, representing a continuous sequence of DNA reconstructed from overlapping reads, is akin to a reconstructed paragraph or page in the text. A scaffold, which in genomics refers to a set of contigs linked together in the correct order but with gaps in between, can be compared to chapters or sections of the book where the overall structure is known but some parts are missing. The concept of a chromosome, a complete and continuous piece of DNA containing many genes, parallels a complete book or a distinct volume in a series. Polymorphisms, which are variations in the DNA sequence among individuals, are analogous to typos or variations in different editions of the same book. The reference genome, a complete assembly used as a standard for comparison, is similar to the final, edited version of a book used as a benchmark.
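
The idea of joining overlapping fragments into a contig can be sketched in a few lines. This is a deliberately naive illustration, assuming exact overlaps and only two fragments; real assemblers such as SPAdes and Megahit work on de Bruijn graphs instead, and the helper name `merge_overlap` is made up for this example.

```bash
#!/usr/bin/env bash
# Toy illustration only: exact-match overlap merge of two fragments.

# Merge fragment b onto fragment a using the longest suffix of a
# that equals a prefix of b. Returns 1 if there is no overlap.
merge_overlap() {
    local a=$1 b=$2 n
    local max=$(( ${#a} < ${#b} ? ${#a} : ${#b} ))
    for ((n = max; n > 0; n--)); do
        if [[ "${a:${#a}-n:n}" == "${b:0:n}" ]]; then
            printf '%s\n' "$a${b:n}"
            return 0
        fi
    done
    return 1   # no overlap: the pieces stay as separate contigs
}

# Shredded text and shredded DNA are handled the same way.
merge_overlap "the quick brown" "brown fox"
merge_overlap "ACGTTG" "TTGCA"
merge_overlap "AAAA" "CCCC" || echo "no overlap found"
```

Fragments that share no overlap cannot be joined this way, which mirrors the gaps between contigs inside a scaffold.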