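# Automatically Scraping KEGG Data

KEGG (Kyoto Encyclopedia of Genes and Genomes) is an integrated database that combines genomic information, chemical information, and functional information about biochemical systems. It currently comprises 16 sub-databases. For example, KEGG PATHWAY contains diagrams of pathways for cellular metabolism, membrane transport, signal transduction, and so on; KEGG GENES and KEGG GENOME contain gene and genome information with partial or complete sequences; KEGG Orthology (KO) is KEGG's ortholog database, which ties the KEGG annotation systems together, links molecular networks to genomic information, and uses orthology relationships to enable functional annotation of genomes or transcriptomes across species.

https://www.genome.jp/kegg/

* Get the list of IDs.
* Use each ID as the variable part of the URL and request each page separately.

In the steps below, "species" is a placeholder for the organism name.

1. First run `split -l 100 species species_`.
2. Run `ls >list`.
3. Edit `list` and remove "species".
4. Edit the "species" in `run_grep_fa.sh`.
5. Edit the pattern `=~ />species/` in `grep_web.pl`.

The `LWP::Simple` module can fetch the content of a web page directly. Then use `split()` and regular expressions to extract the parts you need.

#### Suggestion: in the future, fetch the KEGG org_id list directly and process it

Print the header first:

```
use LWP::Simple;            # provides get()
use Time::HiRes qw(sleep);  # lets sleep() accept fractional seconds

# $input (tab-separated gene list) and $org (KEGG organism code)
# are assumed to be set earlier in the script.
open IN,  "<", $input            or die "can't open [$input]:$!";
open OUT, ">", "$input.spe_list" or die "can't open [$input.spe_list]:$!";
# One handle writes both the header and the rows.
print OUT "kegg_gene\tannotation\tEC_number\tK_number\n";
my (@header, @ola, @other);
```

Write the loop so that it pauses whenever the iteration count modulo 100 is 0, which gives one pause per 100 entries. Do not split the input into separate files; that approach is error-prone.

```
while (<IN>) {
    chomp;
    @header = split /\t/, $_;
    @other  = split /;\s/, $header[1];
    print OUT "$header[0]\t$other[0]\t$other[1]\t$other[2]\n";
    push @ola, $header[0];            # collect the gene IDs to fetch
}
close IN;
close OUT;

open R_output,  ">>", "$input.txt" or die "can't open [$input.txt]:$!";
open R1_output, ">>", "$input.fa"  or die "can't open [$input.fa]:$!";
my (@web_get, @head, @seq, $full_seq, $BIG_full_seq, $website, $seq_len);
for (my $i = 0; $i <= $#ola; $i++) {
    $website = get "http://rest.kegg.jp/get/$ola[$i]/ntseq";
    print "$i grepping $ola[$i]... \n";
    @web_get = split /\n/, $website;
    foreach (@web_get) {
        if ($_ =~ /\A>($org)/) {      # FASTA header line
            @head = split /;\s/, $_;
            print R_output "$head[0]\t$head[1]\t$head[2]\n";
        } else {
            push @seq, $_;            # sequence line
        }
    }
    $full_seq     = join '', @seq;
    $BIG_full_seq = uc $full_seq;
    $seq_len      = length $full_seq;
```

In the FASTA output, use the ID as the header for now, and build a separate lookup table that maps each ID to its seq_name.

```
    print R1_output "$head[0]\t$head[1]\t$head[2]\tlength=$seq_len\n$BIG_full_seq\n";
    @seq = (); @web_get = (); @head = ();
    sleep(0.1);                       # brief pause between requests
}
close R_output;
close R1_output;
```

Lengthen the pause:

```
sleep(15);
```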
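The note above recommends pausing whenever the loop count modulo 100 is zero instead of splitting the input into files, but the listing itself does not show that check. A minimal sketch of one way to do it follows. The names `should_pause` and `fetch_in_batches` and the options `batch`, `short_nap`, and `long_nap` are hypothetical and not part of the original script, and the fetcher is a stand-in code reference so the example runs without network access.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);   # core module; allows fractional sleeps such as 0.1 s

# True after every $batch-th request ($i is a 0-based loop index).
sub should_pause {
    my ($i, $batch) = @_;
    my $count = $i + 1;
    return $count % $batch == 0;
}

# Call $fetch (a code ref) on every ID in @$ids, napping briefly between
# requests and for longer after each full batch.
# Returns the number of long pauses taken.
sub fetch_in_batches {
    my ($ids, $fetch, %opt) = @_;
    my $batch     = $opt{batch}     // 100;   # requests per batch, as in the text
    my $short_nap = $opt{short_nap} // 0.1;   # seconds between requests
    my $long_nap  = $opt{long_nap}  // 15;    # seconds after each batch
    my $pauses    = 0;
    for my $i (0 .. $#$ids) {
        $fetch->($ids->[$i]);
        if (should_pause($i, $batch)) {
            sleep($long_nap);
            $pauses++;
        } else {
            sleep($short_nap);
        }
    }
    return $pauses;
}

# Demo with a stand-in fetcher (assumption: in the real script this code ref
# would call LWP::Simple's get() on the KEGG REST URL for the ID).
my @seen;
my $n = fetch_in_batches(
    [ map { "id$_" } 1 .. 5 ],
    sub { push @seen, $_[0] },
    batch => 2, short_nap => 0, long_nap => 0,
);
print "fetched ", scalar(@seen), " ids with $n long pauses\n";
```

With five IDs and a batch size of 2, the demo prints `fetched 5 ids with 2 long pauses`. Keeping the pause decision in its own small function makes the batch size easy to change and lets the whole ID list stay in a single input file.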