
Sequence Fragment Geometry Description Language

The sequence fragment geometry description language (FGDL) is a basic grammar designed to describe the layout of information encoded in sequenced fragments. It was initially designed to support parsing the different sequencing chemistries that are common in single-cell transcriptomics. It can describe both "simple" and "complex" geometries; the definitions of these terms are outlined below.

Examples of geometries in this format

Below are examples of some common geometries (chemistries) and how they are translated into FGDL.

  • 10x Chromium v3 3' : 1{b[16]u[12]}2{r:}
  • 10x Chromium v2 3' : 1{b[16]u[10]}2{r:}
  • SciSeq3 : 1{b[9-10]f[CAGAGC]u[8]b[10]}2{r:}
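As a rough illustration of what a "simple" description such as the 10x Chromium v3 3' one encodes, the sketch below slices a read by a list of fixed-length segments. The helper `split_fixed`, the `(tag, length)` layout representation, and the example read are all hypothetical and are not part of FGDL or of any parser for it.

```python
def split_fixed(read, layout):
    """Cut `read` into consecutive fixed-length pieces.

    `layout` is a list of (tag, length) pairs, e.g. [("b", 16), ("u", 12)],
    a hand-written stand-in for the parsed form of 1{b[16]u[12]}.
    """
    out = {}
    pos = 0
    for tag, length in layout:
        if pos + length > len(read):
            raise ValueError("read is shorter than the geometry requires")
        # A tag may occur more than once in a read, so collect pieces in a list.
        out.setdefault(tag, []).append(read[pos:pos + length])
        pos += length
    return out


# Read 1 of the 10x Chromium v3 3' description: 1{b[16]u[12]}
V3_READ1 = [("b", 16), ("u", 12)]

read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"  # made-up 28-base read
parts = split_fixed(read1, V3_READ1)
print(parts["b"][0])  # ACGTACGTACGTACGT (16-base barcode)
print(parts["u"][0])  # TTTTGGGGCCCC (12-base UMI)
```

Swapping the layout for `[("b", 16), ("u", 10)]` would correspond to the v2 description, where only the UMI length differs.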

Preventing improper parsing

The grammar itself rejects descriptions that cannot be parsed unambiguously. For example, a variable-length segment cannot be followed by another variable-length segment or by an unbounded (i.e. :) segment, because there would be no way to tell where the first segment ends. The following description will therefore (properly) fail to parse.

  • 1{u[15]b[9-10]u:}2{r:}
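To make the ambiguity concrete, the sketch below enumerates every way a read could be cut under the rejected layout u[15]b[9-10]u:, and contrasts it with a ranged segment that is followed by a fixed anchor sequence, as in the SciSeq3 description. Both helper functions and the reads are made up for illustration; this is not how any FGDL parser is implemented.

```python
def candidate_splits(read, fixed_len, lo, hi):
    """Enumerate every cut of `read` as u[fixed_len] b[lo-hi] u: ."""
    splits = []
    for blen in range(lo, hi + 1):
        end = fixed_len + blen
        if end <= len(read):
            splits.append((read[:fixed_len], read[fixed_len:end], read[end:]))
    return splits


def splits_with_anchor(read, lo, hi, anchor):
    """Keep only the segment lengths after which `anchor` occurs in `read`."""
    return [read[:n] for n in range(lo, hi + 1) if read.startswith(anchor, n)]


# Rejected layout: nothing in the read says whether the barcode is 9 or 10 bases.
read = "A" * 15 + "C" * 10 + "G" * 5
for umi, barcode, rest in candidate_splits(read, 15, 9, 10):
    print(len(barcode), barcode, rest)
# 9 CCCCCCCCC CGGGGG
# 10 CCCCCCCCCC GGGGG

# b[9-10]f[CAGAGC]...: the anchor position pins down the barcode length.
read2 = "ACGTACGTA" + "CAGAGC" + "TTTTTTTT" + "GGGGGGGGGG"
print(splits_with_anchor(read2, 9, 10, "CAGAGC"))  # ['ACGTACGTA']
```

In the first case both cuts are equally valid, so a parser would have to guess; in the second, only one barcode length places the anchor correctly.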

Formal grammar

Below is the formal grammar for descriptions accepted by the FGDL. This is the actual grammar, written in the syntax of the pest parser library.

// hidden tokens
bopen = _{ "[" }
bclose = _{ "]" }
rsep = _{ "-" }
usep = _{ ":" }
dopen = _{ "{" }
dclose = _{ "}" }

read_num   =  { "1" | "2" }
single_len =  { ASCII_DIGIT+ }
len_range  =  ${ single_len ~ rsep ~ single_len }
nucstr     =  { ("A" | "C" | "G" | "T")+ }

fixed_barcode_segment = { "b" ~ bopen ~ single_len ~ bclose }
fixed_umi_segment     = { "u" ~ bopen ~ single_len ~ bclose }
fixed_seq_segment     = { "f" ~ bopen ~ nucstr ~ bclose }
fixed_read_segment    = { "r" ~ bopen ~ single_len ~ bclose }
fixed_discard_segment = { "x" ~ bopen ~ single_len ~ bclose }

ranged_barcode_segment = { "b" ~ bopen ~ len_range ~ bclose }
ranged_umi_segment     = { "u" ~ bopen ~ len_range ~ bclose }
ranged_read_segment    = { "r" ~ bopen ~ len_range ~ bclose }
ranged_discard_segment = { "x" ~ bopen ~ len_range ~ bclose }

unbounded_barcode_segment = { "b" ~ usep }
unbounded_umi_segment     = { "u" ~ usep }
unbounded_read_segment    = { "r" ~ usep }
unbounded_discard_segment = { "x" ~ usep }

fixed_segment = {
    (fixed_umi_segment | fixed_read_segment | fixed_barcode_segment | fixed_seq_segment | fixed_discard_segment)
}

ranged_segment = {
    (ranged_umi_segment | ranged_read_segment | ranged_barcode_segment | ranged_discard_segment)
}

bounded_segment = _{
    (fixed_segment | (ranged_segment ~ fixed_seq_segment) | (unbounded_segment ~ fixed_seq_segment))
}

unbounded_segment = {
    (unbounded_umi_segment | unbounded_read_segment | unbounded_barcode_segment | unbounded_discard_segment)
}

read_desc = {
    dopen ~ ((bounded_segment)+ ~ (ranged_segment | unbounded_segment)? | unbounded_segment | ranged_segment) ~ dclose
}
read_1_desc = { "1" ~ read_desc }
read_2_desc = { "2" ~ read_desc }
frag_desc = _{ SOI ~ read_1_desc ~ read_2_desc ~ EOI}
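For readers who want to experiment with the grammar without a pest toolchain, the following is a hand translation of the rules above into a Python regular expression. It is only a recognizer (it accepts or rejects a description and builds no parse tree), the constant and function names are my own, and it is an approximation written from the rules as listed here rather than anything generated by pest.

```python
import re

LEN = r"[0-9]+"                                  # single_len
FIXED_SEQ = r"(?:f\[[ACGT]+\])"                  # fixed_seq_segment
FIXED = rf"(?:[burx]\[{LEN}\]|{FIXED_SEQ})"      # fixed_segment
RANGED = rf"(?:[burx]\[{LEN}-{LEN}\])"           # ranged_segment
UNBOUNDED = r"(?:[burx]:)"                       # unbounded_segment

# bounded_segment: a fixed segment, or a ranged/unbounded segment
# that is immediately followed by a fixed anchor sequence.
BOUNDED = rf"(?:{FIXED}|{RANGED}{FIXED_SEQ}|{UNBOUNDED}{FIXED_SEQ})"

# read_desc: one or more bounded segments with an optional trailing
# ranged/unbounded segment, or a single unbounded or ranged segment.
READ_DESC = rf"\{{(?:{BOUNDED}+(?:{RANGED}|{UNBOUNDED})?|{UNBOUNDED}|{RANGED})\}}"

# frag_desc: read 1 followed by read 2; fullmatch plays the role of SOI/EOI.
FRAG_DESC = re.compile(rf"1{READ_DESC}2{READ_DESC}")


def is_valid_fgdl(desc):
    """Return True if `desc` is accepted by the regex translation."""
    return FRAG_DESC.fullmatch(desc) is not None


print(is_valid_fgdl("1{b[16]u[12]}2{r:}"))                 # True
print(is_valid_fgdl("1{b[9-10]f[CAGAGC]u[8]b[10]}2{r:}"))  # True
print(is_valid_fgdl("1{u[15]b[9-10]u:}2{r:}"))             # False
print(is_valid_fgdl("1{b[9-10]u[8]}2{r:}"))                # False
```

The last case shows that, as written, the grammar requires a ranged or unbounded segment to be either the final segment of its read or immediately followed by a fixed sequence (f[...]) segment; a fixed-length segment of another type is not enough to anchor it.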