1. The protein I chose is the OPN1LW protein, which is involved in human color blindness. The protein sequence was obtained from the [UniProt](https://www.uniprot.org/uniprotkb/P04000/entry#sequences) website.

2. I decided to search the refseq_human_protein database with blastp and the Cdd_NCBI database with rpsblast.

3. The following is the code for the two databases:

For blastp with refseq_human_protein:

```
#!/bin/bash
module load ncbi-toolkit

# declare database and query files (note: no space before or after =)
database=/group/bit150/blast_databases/refseq_human_protein
query=/home/bit150-48/Lab_BLAST/P04000.fasta
output_file=opn1lw_blast_table.txt

# run blast
blastp \
    -db ${database} \
    -query ${query} \
    -out ${output_file} \
    -outfmt 7
```

and for rpsblast with Cdd_NCBI:

```
#!/bin/bash
module load ncbi-toolkit

# declare database and query files (note: no space before or after =)
database=/group/bit150/blast_databases/Cdd_NCBI
query=/home/bit150-48/Lab_BLAST/P04000.fasta
output_file=opn1lw_rpsblast_table.txt

# run rpsblast (searches the query against the conserved domain database)
rpsblast \
    -db ${database} \
    -query ${query} \
    -out ${output_file} \
    -outfmt 7
```

3. I got the same result from the online NCBI BLAST search and from the command-line BLAST program. I uploaded the protein sequence to NCBI, and the top hit, which has 100% identity with my uploaded sequence, is NP_064445.2, the same as the top hit in my command-line BLAST result. Therefore, I think the online NCBI database contains protein sequence information, and people who have an unknown sequence can upload it to NCBI to find the most likely matching protein.

![WeChat image_20231207101606](https://hackmd.io/_uploads/By6x2YJL6.png)

This is my BLAST result (first several lines):

```
# BLASTP 2.13.0+
# Query: sp|P04000|OPSR_HUMAN Long-wave-sensitive opsin 1 OS=Homo sapiens OX=9606 GN=OPN1LW PE=1 SV=2
# Database: /group/bit150/blast_databases/refseq_human_protein
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 518 hits found
sp|P04000|OPSR_HUMAN	NP_064445.2	100.000	364	0	0	1	364	1	364	0.0	752
sp|P04000|OPSR_HUMAN	NP_001041646.1	96.429	364	13	0	1	364	1	364	0.0	696
sp|P04000|OPSR_HUMAN	NP_000504.1	96.429	364	13	0	1	364	1	364	0.0	696
sp|P04000|OPSR_HUMAN	NP_001316996.1	96.429	364	13	0	1	364	1	364	0.0	696
```

4. Another potential database is GeneCards. GeneCards is a comprehensive database that provides detailed information about all known and predicted human genes. It aggregates data from various sources, offering insights into gene function, expression patterns, associated disorders, and potential interactions.

5. OPN1LW (long-wave-sensitive opsin 1) is a human gene that encodes a protein essential for color vision. This protein is a photopigment found in the cone cells of the retina and is responsible for the perception of long-wavelength light, which is typically seen as red. Mutations in OPN1LW can lead to color vision deficiencies such as protanomaly or protanopia, which are types of red-green color blindness.

6. To change the word size used in the search, the BLAST code can be modified to:

```
#!/bin/bash
module load ncbi-toolkit

# declare database and query files
database=/group/bit150/blast_databases/refseq_human_protein
query=/group/bit150/Lab_BLAST/mouse_protein_set1.fasta
output_file=mouse_protein_set1_blast_table.txt

# run blast with a modified word size
blastp \
    -db ${database} \
    -query ${query} \
    -out ${output_file} \
    -outfmt 7 \
    -word_size 6 # the number here could be changed from 2-7
```
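
As a follow-up to the tabular (`-outfmt 7`) result shown in answer 3, the top hit can be pulled out of the table with a short script instead of reading it by eye. This is only a minimal sketch: the temporary file is a hypothetical stand-in for `opn1lw_blast_table.txt`, it holds just the four hit rows copied from the output above, and the 95% identity cutoff is an arbitrary value chosen for illustration.

```shell
#!/bin/bash

# Hypothetical stand-in for opn1lw_blast_table.txt: a temporary file
# holding the first rows of the outfmt 7 table shown in answer 3.
table=$(mktemp)
cat > "$table" <<'EOF'
# BLASTP 2.13.0+
# Database: /group/bit150/blast_databases/refseq_human_protein
# 518 hits found
sp|P04000|OPSR_HUMAN NP_064445.2 100.000 364 0 0 1 364 1 364 0.0 752
sp|P04000|OPSR_HUMAN NP_001041646.1 96.429 364 13 0 1 364 1 364 0.0 696
sp|P04000|OPSR_HUMAN NP_000504.1 96.429 364 13 0 1 364 1 364 0.0 696
sp|P04000|OPSR_HUMAN NP_001316996.1 96.429 364 13 0 1 364 1 364 0.0 696
EOF

# Comment lines start with '#'; the first non-comment row is the best hit.
# Column 2 is the subject accession. awk's default field splitting handles
# both tabs (real BLAST output) and the spaces used in this sample.
top_hit=$(awk '!/^#/ {print $2; exit}' "$table")
echo "Top hit: $top_hit"

# Count hits whose % identity (column 3) is at least 95.
# The cutoff is an assumption for illustration, not a BLAST default.
n_high=$(awk '!/^#/ && $3 >= 95 {n++} END {print n+0}' "$table")
echo "Hits with >= 95% identity: $n_high"

rm -f "$table"
```

To run the same commands on the real result, replace `"$table"` with the path to the output file written by the blastp script.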