snapATAC - HackMD

# snapATAC #### 北大松山追記(28/01/2020) ここで使用されているbed file 形式に合わせられていると思いますが、参考として ``` cat gencode.vM17.annotation.gtf | awk -F '[\t ]' 'BEGIN{OFS="\t"} $3~/gene/ {print $1,$4,$5,substr($14,2,length($14)-3)}' >gencode.vM17.gene.bed ``` gtf fileのfeature(３列目)にgene と書かれている行に対して、『seqname(１列目)　start(４列目)　end(５列目)　genename(14列目)』を出力するコマンド #### 北大４年松山追記(15/01/2020) 3.9で、gencodeを用いる箇所があります。 tutorialはcellranger-atac の version が 1.0.0なのでgencode.vM16 を用いていますが、今回のデータはcellranger-atac version 1.1.0 で作成しており、gencode.vM17を用いるのが良いと思われます。 (参照:Cell Ranger-ATAC https://support.10xgenomics.com/single-cell-atac/software/pipelines/1.1/advanced/references) (参照:GENCODE genes.gtf https://www.gencodegenes.org/mouse/release_M17.html) 3.11 の MACS2 を動かす際に、macs.options を指定する箇所がありますが、tutorial のコピー＆ペーストではcluster 1の場合と他の cluster の場合で options が異なるので、どちらかに合わせた方が良いと思われます。 ### 1. snapATAC インストール #### 1-1. conda で新しい環境を作成 ``` [localhost ~]$ conda create -n test_scatac [localhost ~]$ conda activate test_scatac ``` #### 1-2. snapATAC(python module) インストール参考URL : [https://github.com/r3fang/SnapATAC](https://github.com/r3fang/SnapATAC) ``` (test_scatac) [localhost ~]$ pip install snaptools ``` #### 1-3. snapATAC(R package) インストール ``` #まずRをインストール (test_scatac) [localhost ~]$ conda install -c r r=3.6 #必要なパッケージをインストール (test_scatac) [localhost ~]$ conda install -c r r-devtools (test_scatac) [localhost ~]$ conda install -c r r-stringi (test_scatac) [localhost ~]$ conda install -c r r-igraph (test_scatac) [localhost ~]$ conda install -c r r-rcurl (test_scatac) [localhost ~]$ conda install -c r r-ggplot2 (test_scatac) [localhost ~]$ conda install -c conda-forge r-dosnow (test_scatac) [localhost ~]$ conda install -c conda-forge r-plot3d (test_scatac) [localhost ~]$ conda install -c bioconda bioconductor-rhdf5 # Rを実行 (test_scatac) [localhost ~]$ R > library(devtools) > install_github("r3fang/SnapATAC") #途中 update を尋ねられたとき 3:None を選択する。 ``` Rstudioがあった方が便利なのでインストールした ``` conda install -c r rstudio conda install -c r/label/borked rstudio ``` ### 2.CellRanger output の .bamから .snapを生成する。参考URL : https://github.com/r3fang/SnapATAC/wiki/FAQs#can-i-run-snapatac-with-cellranger-outcome #### 2-1. extract the header file ``` (test_scatac)[localhost outs]$ samtools view possorted_bam.bam -H > Young_mouse_lung_possorted.header.sam ``` #### 2-2. create a bam file with the barcode embedded into the read name 約1時間15分かかった。 ``` (test_scatac)[localhost snap_from_bam]$ cat <( cat Young_mouse_lung_possorted.header.sam ) <( samtools view ../outs/possorted_bam.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^CB:Z:"){ td[substr($i,1,2)] = substr($i,6,length($i)-5); } }; printf "%s:%s\n", td["CB"], $0 }' ) | samtools view -bS - > Young_mouse_lung_possorted.snap.bam ``` #### 2-3. sort the bam file by read name 約30分かかった。 ``` (test_scatac)[localhost snap_from_bam]$ samtools sort -n -@ 6 -m 1G Young_mouse_lung_possorted.snap.bam -o Young_mouse_lung.snap.nsrt.bam ``` #### 2-4. generate the snap file 約1.5時間かかった。事前にUSCSから、mm10.chrom.sizes ファイルをダウンロードしておくこと。 ``` (test_scatac) [localhost snap_from_bam]$ snaptools snap-pre \ --input-file=Young_mouse_lung.snap.nsrt.bam \ --output-snap=Young_mouse_lung.snap \ --genome-name=mm10 \ --genome-size=/home/genome/mm10/mm10.chrom.sizes \ --min-mapq=30 \ --min-flen=50 \ --max-flen=1000 \ --keep-chrm=TRUE \ --keep-single=FALSE \ --keep-secondary=False \ --overwrite=True \ --max-num=20000 \ --min-cov=500 \ --verbose=True ``` #### 2-5. remove temporary files ``` $ rm Young_mouse_lung_possorted.snap.bam $ rm Young_mouse_lung_possorted.header.sam ``` #### 2-6. Cell-by-bin matrix ``` (test_scatac) [localhost snap_from_bam]$ snaptools snap-add-bmat \ --snap-file=Young_mouse_lung.snap \ --bin-size-list 1000 5000 10000 \ --verbose=True ``` ``` ===== reading the barcodes and bins ====== @AM nBinSize:3 @AM binSizeList: [1000, 5000, 10000] @AM binSize:1000 nBin:2730901 @AM binSize:5000 nBin:546206 @AM binSize:10000 nBin:273121 ``` ### 3. Analysis 参考URL : [https://github.com/r3fang/SnapATAC/blob/master/examples/10X_brain_5k/README.md](https://github.com/r3fang/SnapATAC/blob/master/examples/10X_brain_5k/README.md) #### 3-1. Barcode selection 必要なインプットファイルは、上記で生成した.snapファイルと、CellRanger output の singlecell.csv Rで実行（Rscript mouse_lung.R として保存） ``` > setwd("/home/amelieff/scATAC/scATAC_mouse_lung/test_snapATAC_mouse/") > library(SnapATAC); > x.sp = createSnap( file="Young_mouse_lung.snap", sample="Young_mouse_lung", num.cores=1 ); > barcodes = read.csv( "Young_mouse_singlecell.csv", head=TRUE ); > barcodes = barcodes[2:nrow(barcodes),]; > promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1); > UMI = log(barcodes$passed_filters+1, 10); > data = data.frame(UMI=UMI, > promoter_ratio=promoter_ratio); > barcodes$promoter_ratio = promoter_ratio; > library(viridisLite); > library(ggplot2); > p1 = ggplot( data, aes(x= UMI, y= promoter_ratio)) + geom_point(size=0.1, col="grey") + theme_classic() + ggtitle("Young mouse lung") + ylim(0, 1) + xlim(0, 6) + labs(x = "log10(UMI)", y="promoter ratio") > p1 > barcodes.sel = barcodes[which(UMI >= 3 & UMI <= 5 & promoter_ratio >= 0.15 & promoter_ratio <= 0.6),]; > rownames(barcodes.sel) = barcodes.sel$barcode; > x.sp = x.sp[which(x.sp@barcode %in% barcodes.sel$barcode),]; > x.sp@metaData = barcodes.sel[x.sp@barcode,]; > x.sp ## number of barcodes: 3595 ## number of bins: 0 ## number of genes: 0 ## number of peaks: 0 ## number of motifs: 0 ``` #### 3-2. Add cell-by-bin matrix ``` # show what bin sizes exist in .snap file > setwd("snap_from_bam/"); > showBinSizes("Young_mouse_lung.snap"); > x.sp = addBmatToSnap(x.sp, bin.size=5000, num.cores=1) ``` #### 3-3. Matrix binarization ``` > x.sp = makeBinary(x.sp, mat="bmat") ``` #### 3-4. Bin filtering ``` # First, filter out any bins overlapping with the ENCODE blacklist to prevent from potential artifacts. > system("wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz") > library(GenomicRanges) > black_list = read.table("mm10.blacklist.bed.gz") > black_list.gr = GRanges( black_list[,1], IRanges(black_list[,2], black_list[,3]) ); > idy = queryHits(findOverlaps(x.sp@feature, black_list.gr)) > if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]} > x.sp ``` ``` ## number of barcodes: 3595 ## number of bins: 546103 ## number of genes: 0 ## number of peaks: 0 ## number of motifs: 0 ``` ``` # Second, remove unwanted chromosomes. > chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))] > idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature) > if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]} > x.sp ``` ``` number of barcodes: 3595 number of bins: 545183 number of genes: 0 number of peaks: 0 number of motifs: 0 ``` ``` # Third, the bin coverage roughly obeys a log normal distribution. We remove the top 5% bins that overlap with invariant features such as promoters of the house keeping genes. > bin.cov = log10(Matrix::colSums(x.sp@bmat)+1) > hist( + bin.cov[bin.cov > 0], + xlab="log10(bin cov)", + main="log10(Bin Cov)", + col="lightblue", + xlim=c(0, 5) + ); > bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95) > idy = which(bin.cov <= bin.cutoff & bin.cov > 0); > x.sp = x.sp[, idy, mat="bmat"]; > x.sp ``` ``` number of barcodes: 3595 number of bins: 477876 number of genes: 0 number of peaks: 0 number of motifs: 0 ``` #### 3-5. Dimensionality reduction ``` > x.sp = runDiffusionMaps( obj=x.sp, input.mat="bmat", num.eigs=50 ) ``` #### 3-6. Determine significant components ``` > plotDimReductPW( obj=x.sp, eigs.dims=1:50, point.size=0.3, point.color="grey", point.shape=19, point.alpha=0.6, down.sample=5000, pdf.file.name=NULL, pdf.height=7, pdf.width=7 ) ``` #### 3-7. Graph-based clustering ``` #eigs.dims には上記のプロットの中からblobにみえ始める次元数を選択する > x.sp = runKNN( obj=x.sp, eigs.dims=1:20, k=15 ); > x.sp=runCluster( obj=x.sp, tmp.folder=tempdir(), louvain.lib="R-igraph", seed.use=10 ); > x.sp@metaData$cluster = x.sp@cluster ``` #### 3-8. Visualization ``` > x.sp = runViz( obj=x.sp, tmp.folder=tempdir(), dims=2, eigs.dims=1:20, method="Rtsne", seed.use=10 ); > par(mfrow = c(2, 2)); > plotViz( obj=x.sp, method="tsne", main="Young mouse lung", point.color=x.sp@cluster, point.size=1, point.shape=19, point.alpha=0.8, text.add=TRUE, text.size=1.5, text.color="black", text.halo.add=TRUE, text.halo.color="white", text.halo.width=0.2, down.sample=10000, legend.add=FALSE ); ``` ``` > plotFeatureSingle( obj=x.sp, feature.value=log(x.sp@metaData[,"passed_filters"]+1,10), method="tsne", main="Young mouse lung Read Depth", point.size=0.2, point.shape=19, down.sample=10000, quantiles=c(0.01, 0.99) ); ``` ※FRiP (fraction of reads in peaks) とは、「得られたピーク領域内にマップされたリード数の全マップリード数に対する割合」を表す。（参照URL：[http://rnakato.hatenablog.jp/entry/2018/03/14/180854](http://rnakato.hatenablog.jp/entry/2018/03/14/180854)） ``` > plotFeatureSingle( obj=x.sp, feature.value=x.sp@metaData$peak_region_fragments / x.sp@metaData$passed_filters, method="tsne", main="Young mouse lung FRiP", point.size=0.2, point.shape=19, down.sample=10000, quantiles=c(0.01, 0.99) # remove outliers ); ``` ``` > plotFeatureSingle( obj=x.sp, feature.value=x.sp@metaData$duplicate / x.sp@metaData$total, method="tsne", main="Young mouse lung Duplicate", point.size=0.2, point.shape=19, down.sample=10000, quantiles=c(0.01, 0.99) # remove outliers ) ``` #### 3-9. Gene based annotation ``` #アノテーションをダウンロード ##cellranger-atac version 1.0.0の場合、gencode.vM16 ##version 1.1.0 の場合、gencode.vM17を用意する > system("wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/gencode.vM16.gene.bed"); > genes = read.table("gencode.vM16.gene.bed"); > genes.gr = GRanges(genes[,1], IRanges(genes[,2], genes[,3]), name=genes[,4] ); #みたいmarker genesを指定する(下記はtutorialママ) > marker.genes = c( "Snap25", "Gad2", "Apoe", "C1qb", "Pvalb", "Vip", "Sst", "Lamp5", "Slc17a7" ); > genes.sel.gr <- genes.gr[which(genes.gr$name %in% marker.genes)]; # re-add the cell-by-bin matrix to the snap object; > x.sp = addBmatToSnap(x.sp); > x.sp = createGmatFromMat( obj=x.sp, input.mat="bmat", genes=genes.sel.gr, do.par=TRUE, num.cores=10 ); # normalize the cell-by-gene matrix > x.sp = scaleCountMatrix( obj=x.sp, cov=x.sp@metaData$passed_filters + 1, mat="gmat", method = "RPM" ); # smooth the cell-by-gene matrix > x.sp = runMagic( obj=x.sp, input.mat="gmat", step.size=3 ); > par(mfrow = c(3, 3)); > for(i in 1:9){ plotFeatureSingle( obj=x.sp, feature.value=x.sp@gmat[, marker.genes[i]], method="tsne", main=marker.genes[i], point.size=0.1, point.shape=19, down.sample=10000, quantiles=c(0, 1) )} ``` #### 3-10. Heretical clustering ``` # calculate the ensemble signals for each cluster > ensemble.ls = lapply(split(seq(length(x.sp@cluster)), x.sp@cluster), function(x){ SnapATAC::colMeans(x.sp[x,], mat="bmat"); }) # cluster using 1-cor as distance > hc = hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2"); > plotViz( obj=x.sp, method="tsne", main="10X Brain Cluster", point.color=x.sp@cluster, point.size=1, point.shape=19, point.alpha=0.8, text.add=TRUE, text.size=1.5, text.color="black", text.halo.add=TRUE, text.halo.color="white", text.halo.width=0.2, down.sample=10000, legend.add=FALSE ); > plot(hc, hang=-1, xlab="") ``` #### 3-11. Identify peaks まずcondaの環境（test_scatac）内にmacs2をインストールしておく ``` (test_scatac) [localhost ~]$ conda install -c bioconda macs2 (test_scatac) [localhost ~]$ conda install -c bioconda/label/cf201901 macs2 ``` snaptoolsとmacs2のパスを確認 ``` > system("which snaptools") # /home/amelieff/anaconda3/envs/test_scatac/bin/snaptools > system("which macs2") # /home/amelieff/anaconda3/envs/test_scatac/bin/macs2 ``` 確認したsnaptoolsとmacs2のパスを指定してmacsを実行(下記では、cluster1が対象) ※macs2のoptionについて > 1. To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both directions to smooth the pileup signals. If the wanted smoothing window is 200bps, then use `--nomodel --shift -100 --extsize 200`. > 2. For certain nucleosome-seq data, we need to pile up the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: `--nomodel --shift 37 --extsize 73`. （参照：[https://github.com/taoliu/MACS/](https://github.com/taoliu/MACS/)） ``` > runMACS( obj=x.sp[which(x.sp@cluster==1),], output.prefix="Young_mouse_lung", path.to.snaptools="/home/amelieff/anaconda3/envs/test_scatac/bin/snaptools", path.to.macs="/home/amelieff/anaconda3/envs/test_scatac/bin/macs2", gsize="mm", buffer.size=500, num.cores=5, macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR", tmp.folder=tempdir() ) ``` 各cluserで連続的にmacsを実行。 ``` # call peaks for all cluster with more than 100 cells clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)]; peaks.ls = mclapply(seq(clusters.sel), function(i){ print(clusters.sel[i]); runMACS(obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], output.prefix=paste0("Young_mouse_lung.", gsub(" ", "_", clusters.sel)[i]), path.to.snaptools="/home/amelieff/anaconda3/envs/test_scatac/bin/snaptools", path.to.macs="/home/amelieff/anaconda3/envs/test_scatac/bin/macs2", gsize="mm", # mm, hs, etc buffer.size=500, num.cores=1, macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR", tmp.folder=tempdir()); }, mc.cores=4); # assuming all .narrowPeak files in the current folder are generated from the clusters peaks.names = system("ls | grep narrowPeak", intern=TRUE); peak.gr.ls = lapply(peaks.names, function(x){ peak.df = read.table(x) GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3])) }) peak.gr = reduce(Reduce(c, peak.gr.ls)); peak.gr ``` ``` ## GRanges object with 88490 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] b'chr1' 3119781-3120324 * ## [2] b'chr1' 3440217-3440417 * ## [3] b'chr1' 3671732-3672060 * ## [4] b'chr1' 4412662-4413015 * ## [5] b'chr1' 4414992-4415192 * ## ... ... ... ... ## [88486] b'chrUn_GL456390' 21595-21891 * ## [88487] b'chrUn_GL456396' 9840-10040 * ## [88488] b'chrUn_GL456378' 25987-26187 * ## [88489] b'chrUn_GL456387' 15189-15389 * ## [88490] b'chrUn_GL456393' 5231-5443 * ## ------- ## seqinfo: 38 sequences from an unspecified genome; no seqlengths ``` #### 3-12. Create a cell-by-peak matrix ``` > peaks.df = as.data.frame(peak.gr)[,1:3]; > write.table(peaks.df,file = "peaks.combined.bed",append=FALSE, quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"), fileEncoding = "") > saveRDS(x.sp, file="Young_mouse_lung.snap.rds") ``` ``` (test_scatac) [localhost snap_from_bam]$ snaptools snap-add-pmat \ --snap-file Young_mouse_lung.snap \ --peak-file peaks.combined.bed ``` 40分かかった。エラーメッセージ: cell x peak matrix already exists, delete it using snap-del first が返ってくる場合、snap-delを実行してから再度 snap-add-pmat を実行する。 ``` (test_scatac) [localhost snap_from_bam]$ snaptools snap-del --snap-file Young_mouse_lung.snap --session-name PM ``` #### 3-13. Add cell-by-peak matrix ``` > x.sp = readRDS("Young_mouse_lung.snap.rds"); > x.sp = addPmatToSnap(x.sp); > x.sp = makeBinary(x.sp, mat="pmat"); > x.sp ``` ``` ## number of barcodes: 3595 ## number of bins: 546206 ## number of genes: 9 ## number of peaks: 88490 ## number of motifs: 0 ```