二級分析 / samtools === ###### tags: `基因體/二級分析` ###### tags: `生物資訊`, `基因體`, `二級分析`, `sam/bam` <br> [TOC] <br> ## SAMtools 介紹 - http://samtools.sourceforge.net/ - See http://htslib.org/ for the new 1.x releases of SAMtools, BCFtools, and HTSlib. - http://www.htslib.org/ - **規格 (Specifications, Spec.)**:(非專業人員勿入) - [SAM/BAM and related specifications](http://samtools.github.io/hts-specs/) ([github](https://github.com/samtools/hts-specs)) - Sequence Alignment/Map Format Specification - [SAM v1](http://samtools.github.io/hts-specs/SAMv1.pdf) - The Variant Call Format (VCF) Version 4.X Specification - [VCF v4.1](http://samtools.github.io/hts-specs/VCFv4.1.pdf) - [VCF v4.2 - 2015/10](http://samtools.github.io/hts-specs/VCFv4.2.pdf) - [VCF v4.3 - 2022/08/22](http://samtools.github.io/hts-specs/VCFv4.3.pdf) - [VCF v4.4 - 2023/01/27](http://samtools.github.io/hts-specs/VCFv4.4.pdf) <br> ## 下載軟體 - ### Ubuntu 有提供套件,可直接安裝 ``` $ sudo apt install samtools ``` - ### [Download1](http://www.htslib.org/download/) ``` $./configure --prefix=/where/to/install ... configure: error: libbzip2 development files not found ``` - [fail to find bzip2 #696](https://github.com/samtools/htslib/issues/696) - 關鍵:產生 bzlib.h, libbz2.so ``` autoconf autoheader ./configure --prefix=$INSTALL_DIR CPPFLAGS="-I$INSTALL_DIR/include" LDFLAGS="-L$INSTALL_DIR/lib -Wl,-R$INSTALL_DIR/lib" ``` - ### [Download2](https://sourceforge.net/projects/samtools/files/) - [Home / samtools / 1.11](https://sourceforge.net/projects/samtools/files/samtools/1.11/) - samtools - htslib - bcftools <br> ## [Samtools 操作說明](http://bioinfo.cs.ccu.edu.tw/wiki/doku.php?id=samtools) ### bam2sam - [bam 轉成 sam](https://www.biostars.org/p/1701/) ```samtools view -h -o out.sam in.bam``` - ```-h``` print header for the SAM output ### depth - help ``` $ samtools depth Usage: samtools depth [options] in1.bam [in2.bam [...]] Options: -a output all positions (including zero depth) -a -a (or -aa) output absolutely all positions, including unused ref. sequences -b <bed> list of positions or regions -f <list> list of input BAM filenames, one per line [null] -l <int> read length threshold (ignore reads shorter than <int>) [0] -d/-m <int> maximum coverage depth [8000] -q <int> base quality threshold [0] -Q <int> mapping quality threshold [0] -r <chr:from-to> region --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null] The output is a simple tab-separated table with three columns: reference name, position, and coverage depth. Note that positions with zero coverage may be omitted by default; see the -a option. ``` | reference name | position | coverage depth | | -------------- | -------- | -------------- | | Y | 2652036 | 1 | | Y | 2652037 | 1 | | Y | 2652038 | 1 | | Y | 2652039 | 1 | | Y | 2652040| 1 | - 產生每一核甘酸位置之覆蓋率之統計表 ```samtools depth output.bam``` - 產生染色體編號資訊 ``` samtools depth output.bam | awk '{print $1}' | uniq | sort ``` ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT hs37d5 NC_007605 GL000191.1 GL000192.1 GL000193.1 GL000194.1 GL000195.1 GL000196.1 GL000197.1 GL000198.1 GL000199.1 GL000200.1 GL000201.1 GL000202.1 GL000203.1 GL000204.1 GL000205.1 GL000206.1 GL000207.1 GL000208.1 GL000209.1 GL000210.1 GL000211.1 GL000212.1 GL000213.1 GL000214.1 GL000215.1 GL000216.1 GL000217.1 GL000218.1 GL000219.1 GL000220.1 GL000221.1 GL000222.1 GL000223.1 GL000224.1 GL000225.1 GL000226.1 GL000227.1 GL000228.1 GL000229.1 GL000230.1 GL000231.1 GL000232.1 GL000233.1 GL000234.1 GL000235.1 GL000236.1 GL000237.1 GL000238.1 GL000239.1 GL000240.1 GL000241.1 GL000242.1 GL000243.1 GL000244.1 GL000245.1 GL000246.1 GL000247.1 GL000248.1 GL000249.1 ``` - [Question: Tools To Calculate Average Coverage For A Bam File?](https://www.biostars.org/p/5165/) ``` samtools depth *bamfile* | awk '{sum+=$3} END { print "Average = ",sum/NR}' samtools depth *bamfile* | awk '{sum+=$3; sumsq+=$3*$3} END { print "Average = ",sum/NR; print "Stdev = ",sqrt(sumsq/NR - (sum/NR)**2)}' ``` - `(sum/NR)**2` 無法 work? (2022/12/07) - 改成 `(sum/NR)*(sum/NR)` or `(sum/NR)^2` | fastq dataset | average<br>coverage | Stdev | RawSize:<br>R1<br>R2<br>R1+R2 | ZipSize:<br>R1<br>R2<br>R1+R2 | raw/zip<br>ratio | date | | -------- | -------- | -------- | ---------------------- | ---------------------- | --- | ---- | | WGS-LIS-AI018A | 44.48 | 75.42 | 160.94GB<br>160.94GB<br>321.87GB | 32.89GB<br>34.29GB<br>67.18GB | 4.89x<br>4.69x<br>4.79x | | WES-EDA-013A | 17.56 | 46.07 | 11.27GB<br>10.75GB<br>22.03GB | 2.48GB<br>2.69MB<br>5.16GB | 4.55x<br>4.00x<br>4.26x | | parabricks_sample | 7.24 | 46.20 | 7.24GB<br>7.24GB<br>14.47GB | 2.42GB<br>2.61GB<br>5.03GB| 2.98x<br>2.77x<br>2.88x | | D15780_S13_L001 | 1.81 | 9.32 | 155.44MB<br>155.47MB<br>310.91MB| 58.04MB<br>63.79MB<br>121.83MB | 2.68x<br>2.44x<br>2.55x | | NA24631 | 1.50279 | 1.29443 | 7.2MiB<br>7.3MiB<br>14.6MiB | 2.1MiB<br>2.2MiB<br>4.3GB| 3.50x<br>3.30x<br>3.40x | 2022/12/07 | | HG002.hiseqx.pcr-free.30x | 33.4799 | 58.4035 | | | | 2023/03/28 | | fastq dataset | raw/coverage:<br>R1+R2 | zip/coverage:<br>R1+R2 | | ----------------- | ------------ | ------------ | | WGS-LIS-AI018A | 7.24GB/1x | 1.51GB/1x | | WES-EDA-013A | 1.25GB/1x | 0.29GB/1x | | D15780_S13_L001 | 171.77MB/1x | 67.31MB/1x | | parabricks_sample | 2.00GB/1x | 0.69GB/1x | | fastq dataset | first line | 列1長度 (含換行字元) | 列2長度 (含換行字元) | | ----------------- | ---------- | ------ | ------ | | WGS-LIS-AI018A | @A00355:26:H3WFJDSXX:1:1101:1081:1047 2:N:0:GAATTCGT+ATAGAGGC | 62 | 152 | | WES-EDA-013A | @K00236:240:H3NH2BBXY:7:1101:1255:1103 1:N:0:NTGTTCTA | 54 | 151 | | D15780_S13_L001 | @M01199:29:000000000-C54LP:1:1101:23230:1786 1:N:0:13 | 54 | 300 | | parabricks_sample | @HWI-D00127:570:HK3TJBCX2:1:1101:1141:1991 1:N:0:GGCTAC | 56 | 116 | <br>