Binder – Snakemake 使用教學

文／王俐晴（中央研究院資訊科學研究所實習生）

Snakemake 是一個處理工作流程的工具，透過與其他軟體套件 (例如：samtools、BWA 等) 的整合，或單純使用 shell 和 Python 指令，建立規則 (rule) 以實現對資料的系統化分析，並產製結果報告。在功能上與 Makefile 類似，但支援的設定更多，書寫簡單，邏輯易懂，更具使用者友善性。

利用 Snakemake 來讓研究人員將其含有研究方法 (分析順序) 的 Snakefile 和相關資料上傳至資料儲存庫（如 depositar）。使其他使用者在重現研究成果或利用自身資料集進行驗證時，可透過將資料上傳至 Binder 等運算環境，並在終端機輸入簡單指令，自動執行 Snakefile 中定義的工作流程。為了達成不同的需求，Snakemake 支援引用外部資源 (例如 GitHub)，或在分析複雜的資料時，將資料的處理分為不同的階段，達到多層次的工作流程。最後，Snakemake 還能產製工作流程圖 (DAG 有向無環圖) 或成果報告 (如 .zip 或 HTML 格式)。

撰寫工作流程檔案 (Snakefile)

Rule (規則)

如下方示意，可把 rule 想像為一件捆綁在一起的工作，有時希望多個不相關的工作可以同步執行時，可以將它們設為不同的 rule，以分攤工作，平衡流程。












rule all:
    input:
        "new_report.html"

rule unzip_file:                    # rule 後接 名稱
    input:
        "resource.zip"
    output:
        "resource/tp-marriage.csv",
        "resource/tp-economy.csv"
    shell:
        "unzip {input} -d ./"

Tip

rule all 是在 $ snakemake 的時候預設的執行目標，通常代表所有工作都要執行。

Rule attribute (規則屬性)

定義工作內容，如要拿取或輸出的資料、欲執行的處理過程、分配給這項工作的資源等。

input:
- 放入這個 rule 會用到的檔案 (可以不用有使用到的檔案都放)，形成一個從其他地方拿入資料的概念。相同的檔案會放在前一個 rule 的 output 來強調順序，下一個規則屬性會提到
- 路徑名需加 " " (例如 "resource.zip")
- 多個輸出以 , 分開 (例如 "report.zip", directory("data"))
- 有放在 input 裡的檔案，在相同 rule 的其他地方使用時，可以使用 {input} 取代
- 有多個輸入檔案時，在其他地方使用需按照在 input: 中的順序，以 {input[0]},{input[1]}…類推使用。用法類似 output 屬性，可見下方 output 最後一點的說明
- 反之，若沒有在 input: 中書寫，在其他屬性中使用則得打出完整路徑 (如 workflow/Snakefile)
- 指定要拿取某 rule 的輸出做為輸入，可使用 rules.<欲拿取的 rule 名稱>.output
output:
- 放入這個 rule 會輸出的檔案 (可以不用會輸出到的檔案都放)，形成一個給其他地方資料的概念。(相同的檔案會放在下一個 rule 的 input，使下一個 rule 可以依靠此 rule 的 output 建立順序)
- 路徑名需加 " " (例如 "report.zip")
- 多個輸出以 , 分開
- 有放在 output 裡的檔案，在相同 rule 的其他屬性使用時，可以使用 {output} 取代 ; 若是有多個 output 時，則需按照在 output: 的順序以 {output[0]}, {output[1]}…類推使用 (見範例一 shell: 下方的 output[0] 與 output[1]，分別依順序對應了 output: 中的 report.zip 與 second_file_in_output_example.txt)

# 範例一

rule generate_report:
    input:
        rules.run_workflow.output
    output:
        "report.zip",
        "second_file_in_output_example.txt"
    shell:
        '''
        snakemake -s workflow/Snakefile --report {output[0]}
        rm -rf {input}
        touch {output[1]}
        '''

shell:
- 寫入所要執行的 shell 指令
- 單個指令用 " " (例如 "snakemake -s workflow/Snakefile --report {output}")
- 多個指令用 ''' ''' 來包住指令 (例如''' <..shell command..> ''')
- 指令與 Python 程式碼都要用""" """包住全部
- Python 程式碼需另外以 python -c " <..Python code..> " 包圍 (如下方範例二)

# 範例二

rule data_update:
    input:
        rules.deploy.output[0],
        rules.deploy.output[1]
    output:
        "workflow/new_snakefile"
    shell:"""
        python -c "
            import re
            # Read the file
             /* more */
            "
        mv {input[0]} {output}	# shell 指令
        """

run:
- 直接放入 Python 程式碼，不用 " "
- 加入 shell 指令用 shell("<..shell command..>")

# 範例三

run:
    shell("unzip {input} -d {output}")
    
    with open(input.report) as f:
        report_content = f.read()

    # 讀取 CSV 文件內容
    csv_content1 = open(input.m_csv).read()
    /* more */

script:
- 放入要執行的 Python 檔案

# 範例四

rule ebar_chart:
    input:
        "resource/tp-economy.csv"
    output:
        "result/bar/e-bar.html"
    script:
        "scripts/make_bar_chart.py"

log:
- 輸出日誌文件的路徑
- 路徑名需加 " "
- 在 shell 的指令候用 … 2> {log} 使用 (見範例五)
- 執行後可在此檔案看到執行紀錄
params:
- 傳遞額外的參數到 shell 指令或其他地方
- 以 {params} 使用 (如範例五: {params.extra_params} 就等同於指定取用 params: 裡面的 extra_params，也就是 --option1 value1 --option2 value2)
threads:
- 規則所需的 CPU 核心數量
resources:
- 規則所需的記憶體與磁碟空間等

# 範例五

log:
    "process_data.log"
params:
    extra_params="--option1 value1 --option2 value2"
shell:
    "process_data {input} {params.extra_params} > {output} 2> {log}"

設定文件 (config.yaml)

可將需多次使用的路徑或資源放置於設定文件，減少 Snakefile 中不必要的重複書寫。

# config.yaml

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
    C: data/samples/C.fastq

# Snakefile

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]
rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
        # "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"


configfile: "config.yaml" # config file 的路徑
    /..more../
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.sorted.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.sorted.bam.bai", sample=config["samples"])
/..more../

config["samples"][wildcards.sample] 取得 config 檔案裡 samples: 下的路徑。在 shell 中使用 {input} 時會將三個路徑都當輸入跑一遍
{sample} 是萬用字元 (wildcards)，會得到樣本名稱 (不是路徑，所以 logs/bwa_mem/ 裡面是 A, B, C 的 log 檔)
bam=expand("sorted_reads/{sample}.sorted.bam, sample=config["samples"]) 會讓 bam = sorted_reads/{A,B,C}.sorted.bam 三個檔案，所以在 shell 用 {input.bam} 時會執行到三個檔案

常用 Snakemake 命令列工具參數

您可使用以下指令執行撰寫的 Snakefile 流程：

-n (--dry-run)：模擬試跑 workflow (除錯用)。如：$ snakemake -n
-p：列出所有紀錄 (除錯用)。如：$ snakemake -p
-f：強制重新跑完整 workflow。如： $ snakemake -f
-s：指定欲執行的 Snakefile。如： $ snakemake -s new_snakefile

製作 Snakemake 報告 (report)

Snakemake 於製作 Snakemake report 時，會加入流程圖 (Workflow)、執行時間 (Statistics)、使用套件等其他相關資訊 (About)。

您並可在 input: 或 output: 中使用 report(...)，將各流程的結果加入報告。參數如下：

category: 在首頁 Result 顯示，指定文件在報告中的分類
caption: 指定文件的說明文字
labels: 文件在報告中的標籤

rule ebar_chart:
    input:
        report("resource/tp-economy.csv",
        category=lambda w: "Data Source",
        caption="report/csv2.rst")
    output:
        "result/bar/e-bar.svg"
    conda:
        "environment.yaml"
    script:
        "scripts/make_bar_chart.py"

將輸出如下圖之報告：

產製 report 的指令如下：

輸出工作流程圖 (DAG 有向無環圖)：$ snakemake --dag <目標 Snakefile 名稱> | dot -Tsvg > <輸出 DAG 圖名>.svg
輸出 report (內建 workflow, statistics, about)：$ snakemake -s 目標 Snakefile 名稱> -–report <欲產生的 report 路徑與名稱 (.zip or .html …)>

於 Binder 環境執行 Snakemake

Tip

目前若要在 Binder 中使用 Snakemake 需要另行安裝，可將如下內容的 Binder 環境設定檔 (environment.yml) 上傳為 depositar 的資源，以在開啟 Binder 時一併安裝 Snakemake 於 Binder 所提供的 JupyterLab 環境中 (也可以在開啟 Binder 後手動在 terminal 安裝)。

Tip

受 xsrf 限制，Binder 環境下無法正常瀏覽 Snakemake HTML 報告，請下載至本機瀏覽。

Important

副檔名必須為 .yml (而非 .yaml)，否則不會安裝該檔案所定義的相依套件。

environment.yml





















# 放置執行工作需要有的環境資源
name: environment               # 環境名稱

channels:
  - conda-forge
  - bioconda                    # 生物資訊套件的 channel

dependencies:                   # 函式庫
  - snakemake-minimal >=8.4.4   # 可強調版本
  - python
  - matplotlib
  - pandas
  - numpy
  - plotly
  - folium
  - pip                         # 這裡是 Conda 自身的 pip 套件
  - cwltool 

# 如果有需要從 PyPI 安裝的套件，可以這樣放在 pip 下 (conda沒有的)
# - pip:
#   - python_package_not_in_conda

範例

台北市金價與結婚人數分析資料集，包含以下內容：

environment.yml: Binder 環境設定檔
Snakefile: 工作流程檔案
report.zip: 報告說明文字
resource.zip: 原始資料
scripts.zip: 繪製金價與結婚對數長條圖的 Python 程式

壓縮檔內容 (執行時無需解壓縮)

.
├── report # 來自 report.zip
│   ├── csv1.rst
│   ├── csv2.rst
│   └── workflow.rst
├── resource # 來自 resource.zip
│   ├── tp-economy.csv
│   └── tp-marriage.csv
└── scripts # 來自 scripts.zip
    └── make_bar_chart.py

Snakefile























































































report: "report/workflow.rst"

from snakemake.utils import min_version
min_version("6.10.0")

rule all:
    input:
        "report.html"

rule unzip_file:   
    input:
        "resource.zip",
        "scripts.zip",
        "report.zip"
    output:
        "resource/tp-marriage.csv",
        "resource/tp-economy.csv",
        "scripts/make_bar_chart.py"
    shell:
        '''
        unzip {input[0]} -d ./
        unzip {input[1]} -d ./
        unzip {input[2]} -d ./
        '''

rule mbar_chart:
    input:
        report("resource/tp-marriage.csv",
        category=lambda w: "Data Source",
        caption="report/csv1.rst")
    output:
        "result/bar/m-bar.svg"
    conda:
        "environment.yaml"
    script:
        "scripts/make_bar_chart.py"

rule ebar_chart:
    input:
        report("resource/tp-economy.csv",
        category=lambda w: "Data Source",
        caption="report/csv2.rst")
    output:
        "result/bar/e-bar.svg"
    conda:
        "environment.yaml"
    script:
        "scripts/make_bar_chart.py"

rule custom_report:
    input:
        "result/bar/m-bar.svg",
        "result/bar/e-bar.svg"
    output:
        report(
            "m-bar.svg",
            category=lambda w: "Marriage Diagram",
            labels=lambda w, params: {"type": "figure", "num": "1"},
        ),
        report(
            "e-bar.svg",
            category=lambda w: "Gold Price Diagram",
            labels=lambda w, params: {"type": "figure", "num": "1"},
        ),
        "first.txt"
    shell:
        '''
        sleep $(shuf -i 1-3 -n 1)
        cp {input[0]} {output[0]}
        cp {input[1]} {output[1]}
        touch {output[2]}
    '''

rule generate_report:
    input:
        "first.txt",
        "m-bar.svg",
        "e-bar.svg",
        "resource/tp-marriage.csv",
        "resource/tp-economy.csv"
    output:
        "report.html",
    shell:
        '''
        snakemake --report {output[0]}
        rm -rf first.txt
        '''

於 Binder 服務執行此資料集，在開啟的 JupyterLab 環境所提供的終端機 (Terminal) 中，輸入 snakemake，即可執行流程。完成的長條圖位於 result 資料夾內，報告輸出於 report.html。

參考輸出

Assuming unrestricted shared filesystem usage.
host: jupyter-dataset-b9909-0s6d8kk7
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job stats:
job                count
---------------  -------
all                    1
custom_report          1
ebar_chart             1
generate_report        1
mbar_chart             1
unzip_file             1
total                  6

Select jobs to execute...
Execute 1 jobs...

(中間省略)

[Wed Feb 19 09:23:37 2025]
localrule all:
    input: report.html
    jobid: 0
    reason: Input files updated by another job: report.html
    resources: tmpdir=/tmp

[Wed Feb 19 09:23:37 2025]
Finished job 0.
6 of 6 steps (100%) done

結語

以上為簡單的 Snakemake 使用教學，因為研究過程中主要是為了協助使用者順利在 depositar 的 Binder 功能上順利使用 Snakemake，所以皆搭配了一些在 Binder 使用時的注意事項。若欲知更多深入的 Snakemake 功能，可以上 Snakemake 官方網站閱讀更詳細的說明。