Looking up taxon names from NCBI sequence identifiers in the terminal

# Looking up taxon names from NCBI sequence identifiers in the terminal

###### tags: `study notes` `bash`

NCBI assigns [sequence identifiers](https://www.ncbi.nlm.nih.gov/genbank/sequenceids/) as unique IDs for sequences, and the [Entrez Molecular Sequence Database System](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) lets you work with these identifiers in bulk. NCBI also provides the [ncbi-entrez-direct](https://www.ncbi.nlm.nih.gov/books/NBK179288/) command-line tools, which make Entrez services easy to use from a terminal. The following looks up the NCBI accession number "OL800728" as an example:

```bash!
$ esearch -db nucleotide -query "OL800728"| esummary
```

This prints the following:

```bash!
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary nuccore 20201205//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20201205/esummary_nuccore.dtd">
<DocumentSummarySet status="OK">
  <DocumentSummary>
    <Id>2259973435</Id>
    <Caption>OL800728</Caption>
    <Title>Coccopigya hispida voucher MNHN-IM-2013-40587 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial</Title>
    <Extra>gi|2259973435|gb|OL800728.1|</Extra>
    <Gi>2259973435</Gi>
    <CreateDate>2022/07/02</CreateDate>
    <UpdateDate>2022/07/02</UpdateDate>
    <Flags>0</Flags>
    <TaxId>69563</TaxId>
    <Slen>624</Slen>
    <Biomol>genomic</Biomol>
    <MolType>dna</MolType>
    <Topology>linear</Topology>
    <SourceDb>insd</SourceDb>
    <SegSetSize>0</SegSetSize>
    <ProjectId>0</ProjectId>
    <Genome>mitochondrion</Genome>
    <SubType>specimen_voucher</SubType>
    <SubName>MNHN-IM-2013-40587</SubName>
    <AssemblyGi>0</AssemblyGi>
    <GeneticCode>1</GeneticCode>
    <Strand>ds</Strand>
    <Organism>Coccopigya hispida</Organism>
    <Statistics>
      <Stat type="Length" count="624"/>
      <Stat type="Length" subtype="literal" count="624"/>
      <Stat type="all" count="3"/>
      <Stat type="blob_size" count="1099050"/>
      <Stat type="cdregion" count="1"/>
      <Stat type="cdregion" subtype="CDS" count="1"/>
      <Stat type="gene" count="1"/>
      <Stat type="gene" subtype="Gene" count="1"/>
      <Stat type="org" count="1"/>
      <Stat type="pub" count="2"/>
      <Stat source="CDS" type="all" count="1"/>
      <Stat source="CDS" type="prot" count="1"/>
      <Stat source="CDS" type="prot" subtype="Prot" count="1"/>
      <Stat source="all" type="Length" count="624"/>
      <Stat source="all" type="all" count="4"/>
      <Stat source="all" type="blob_size" count="1099050"/>
      <Stat source="all" type="cdregion" count="1"/>
      <Stat source="all" type="gene" count="1"/>
      <Stat source="all" type="org" count="1"/>
      <Stat source="all" type="prot" count="1"/>
      <Stat source="all" type="pub" count="2"/>
    </Statistics>
    <Properties na="1">1</Properties>
    <OSLT indexed="yes">OL800728.1</OSLT>
    <AccessionVersion>OL800728.1</AccessionVersion>
  </DocumentSummary>
</DocumentSummarySet>
```

If we only need the taxonomy ID or the scientific name, a few Linux text-processing tools are enough:

```bash=
$ esearch -db nucleotide -query "OL800728.1"| esummary| grep TaxId| cut -d '>' -f 2| cut -d '<' -f 1
69563
$ esearch -db nucleotide -query "OL800728.1"| esummary| grep "Organism"| cut -d '>' -f 2| cut -d '<' -f 1
Coccopigya hispida
```

To process many accession numbers, save them to a text file (for example, one per line) and loop over it:

```bash!
# filename: text file holding the accession numbers, separated by whitespace or newlines
for line in $(cat filename); do
    esearch -db nucleotide -query "$line" | esummary |
        grep "Organism" | cut -d '>' -f 2 | cut -d '<' -f 1
done
```

That is roughly all it takes: with a few simple tools, you can look up sequence identifiers in bulk from a terminal at home and convert them to scientific names.

<span style="font-size:30px">🐕‍🦺</span><font color="dcdcdc">2023.07.26</font>
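
The loop above prints only the organism names, so it can be hard to tell which name belongs to which accession number. As a minimal sketch, the two helper functions below (`extract_field` and `lookup_organisms` are hypothetical names, not part of EDirect) fetch each summary once and print the accession number, TaxId, and organism on one tab-separated line. It assumes `esearch` and `esummary` are on the `PATH` and that each accession number matches a single record.

```bash
#!/usr/bin/env bash

# extract_field TAG
# Read esummary XML on stdin and print the text inside each <TAG>...</TAG>
# element. Assumes each element sits on its own line, as in esummary output.
extract_field() {
    local tag="$1"
    sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p"
}

# lookup_organisms FILE
# For every accession number in FILE, print "accession<TAB>TaxId<TAB>Organism".
lookup_organisms() {
    local acc xml
    for acc in $(cat "$1"); do
        # Query NCBI once per accession and reuse the XML for both fields.
        xml=$(esearch -db nucleotide -query "$acc" | esummary)
        printf '%s\t%s\t%s\n' "$acc" \
            "$(printf '%s\n' "$xml" | extract_field TaxId)" \
            "$(printf '%s\n' "$xml" | extract_field Organism)"
    done
}
```

After sourcing the file, something like `lookup_organisms filename > result.tsv` would write a table that can be opened in a spreadsheet. Storing the XML in a variable halves the number of requests compared with running the pipeline once per field.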