The sequence fragment geometry description language (FGDL) is a basic grammar designed to describe the layout of information encoded in sequenced fragments. It was initially designed to support parsing the different sequencing chemistries that are common in single-cell transcriptomics. It can process both "simple" and "complex" geometries; these terms are defined below.
Below are examples of some common geometries (chemistries) and how they are translated into FGDL.
1{b[16]u[12]}2{r:}
1{b[16]u[10]}2{r:}
1{b[9-10]f[CAGAGC]u[8]b[10]}2{r:}
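To make the first description concrete, the sketch below slices a read pair according to 1{b[16]u[12]}2{r:}: a 16-base barcode followed by a 12-base UMI on read 1, and an unbounded read segment on read 2. The segment letters are taken from the grammar's rule names (b for barcode, u for UMI, r for read). The function split_fixed_read and the (name, length) layout lists are hypothetical, hand-written stand-ins for the output of a real parser, and the read sequences are made up for illustration.

```python
def split_fixed_read(seq, layout):
    """Slice `seq` into named fields given (name, length) pairs.

    `layout` is a hand-written stand-in for a parsed description.
    A length of None means "the rest of the read" (the unbounded ":" form).
    """
    fields = {}
    pos = 0
    for name, length in layout:
        end = len(seq) if length is None else pos + length
        if end > len(seq):
            raise ValueError(f"read too short for segment {name!r}")
        fields[name] = seq[pos:end]
        pos = end
    return fields


# Hand-translated from 1{b[16]u[12]}2{r:} (an assumption, not parser output).
READ1_LAYOUT = [("barcode", 16), ("umi", 12)]
READ2_LAYOUT = [("read", None)]

# Made-up reads: 16 barcode bases + 12 UMI bases, then a biological read.
read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"
read2 = "GATTACAGATTACA"

r1 = split_fixed_read(read1, READ1_LAYOUT)
r2 = split_fixed_read(read2, READ2_LAYOUT)
print(r1["barcode"], r1["umi"], r2["read"])
```

Only fixed-length and trailing unbounded segments are handled here. The third description also contains a ranged segment (b[9-10]) and a fixed anchor sequence (f[CAGAGC]), which need more than positional slicing.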
By virtue of the grammar, some things are not allowed. For example, because of the resulting ambiguity, a variable-length segment cannot be followed by another variable-length segment or by an unbounded (i.e. ":") segment. The following description, for example, will (properly) fail to parse.
1{u[15]b[9-10]u:}2{r:}
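The ambiguity can be illustrated with a small sketch. The helper resolve_ranged below is hypothetical and uses exact string matching; it is not how any particular parser resolves lengths. It lists every length in a range that is consistent with a read, optionally requiring a fixed anchor sequence immediately afterwards. With an anchor, as in b[9-10]f[CAGAGC], one length survives. Without an anchor, as in b[9-10]u:, every length in the range remains possible. The read sequences are made up.

```python
def resolve_ranged(seq, lo, hi, anchor):
    """Return the candidate lengths in [lo, hi] for a ranged segment.

    With an anchor (the f[...] sequence), keep only the lengths after which
    the anchor appears exactly. With anchor=None nothing narrows the choice.
    """
    candidates = []
    for length in range(lo, hi + 1):
        if length > len(seq):
            break
        if anchor is None or seq.startswith(anchor, length):
            candidates.append(length)
    return candidates


# b[9-10]f[CAGAGC]u[8]b[10]: the anchor pins down the first barcode's length.
read_with_9 = "AAACCCGGG" + "CAGAGC" + "TTTTTTTT" + "ACGTACGTAC"
read_with_10 = "AAACCCGGGT" + "CAGAGC" + "TTTTTTTT" + "ACGTACGTAC"
print(resolve_ranged(read_with_9, 9, 10, "CAGAGC"))   # [9]
print(resolve_ranged(read_with_10, 9, 10, "CAGAGC"))  # [10]

# b[9-10]u: has no anchor, so both lengths stay possible.
print(resolve_ranged(read_with_9, 9, 10, None))       # [9, 10]
```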
Below is the formal grammar for descriptions accepted by FGDL. It is the actual grammar, written in the syntax of the pest library.
// hidden tokens
bopen = _{ "[" }
bclose = _{ "]" }
rsep = _{ "-" }
usep = _{ ":" }
dopen = _{ "{" }
dclose = _{ "}" }
read_num = { "1" | "2" }
single_len = { ASCII_DIGIT+ }
len_range = ${ single_len ~ rsep ~ single_len }
nucstr = { ("A" | "C" | "G" | "T")+ }
fixed_barcode_segment = { "b" ~ bopen ~ single_len ~ bclose }
fixed_umi_segment = { "u" ~ bopen ~ single_len ~ bclose }
fixed_seq_segment = { "f" ~ bopen ~ nucstr ~ bclose }
fixed_read_segment = { "r" ~ bopen ~ single_len ~ bclose }
fixed_discard_segment = { "x" ~ bopen ~ single_len ~ bclose }
ranged_barcode_segment = { "b" ~ bopen ~ len_range ~ bclose }
ranged_umi_segment = { "u" ~ bopen ~ len_range ~ bclose }
ranged_read_segment = { "r" ~ bopen ~ len_range ~ bclose }
ranged_discard_segment = { "x" ~ bopen ~ len_range ~ bclose }
unbounded_barcode_segment = { "b" ~ usep }
unbounded_umi_segment = { "u" ~ usep }
unbounded_read_segment = { "r" ~ usep }
unbounded_discard_segment = { "x" ~ usep }
fixed_segment = {
  (fixed_umi_segment | fixed_read_segment | fixed_barcode_segment | fixed_seq_segment | fixed_discard_segment)
}
ranged_segment = {
  (ranged_umi_segment | ranged_read_segment | ranged_barcode_segment | ranged_discard_segment)
}
bounded_segment = _{
  (fixed_segment | (ranged_segment ~ fixed_seq_segment) | (unbounded_segment ~ fixed_seq_segment))
}
unbounded_segment = {
  (unbounded_umi_segment | unbounded_read_segment | unbounded_barcode_segment | unbounded_discard_segment)
}
read_desc = {
  dopen ~ ((bounded_segment)+ ~ (ranged_segment | unbounded_segment)? | unbounded_segment | ranged_segment) ~ dclose
}
read_1_desc = { "1" ~ read_desc }
read_2_desc = { "2" ~ read_desc }
frag_desc = _{ SOI ~ read_1_desc ~ read_2_desc ~ EOI}
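For quick experimentation outside of pest, the grammar above can be approximated as a regular expression. The sketch below is a hand translation into Python's re module, not the pest parser itself. It only answers whether a description is accepted and produces no parse tree. The names FRAG_DESC and is_valid_fgdl are hypothetical.

```python
import re

# Building blocks mirroring the grammar's rules.
LEN = r"[0-9]+"                                   # single_len
FIXED_SEQ = r"f\[[ACGT]+\]"                       # fixed_seq_segment
FIXED = rf"(?:[burx]\[{LEN}\]|{FIXED_SEQ})"       # fixed_segment
RANGED = rf"[burx]\[{LEN}-{LEN}\]"                # ranged_segment
UNBOUNDED = r"[burx]:"                            # unbounded_segment

# A ranged or unbounded segment counts as bounded only when a fixed
# anchor sequence follows it directly.
BOUNDED = rf"(?:{FIXED}|{RANGED}{FIXED_SEQ}|{UNBOUNDED}{FIXED_SEQ})"

# read_desc: bounded segments with an optional ranged/unbounded tail,
# or a single unbounded or ranged segment on its own.
READ_DESC = rf"\{{(?:{BOUNDED}+(?:{RANGED}|{UNBOUNDED})?|{UNBOUNDED}|{RANGED})\}}"

# frag_desc: read 1 description followed by read 2 description.
FRAG_DESC = re.compile(rf"1{READ_DESC}2{READ_DESC}")


def is_valid_fgdl(desc):
    """Return True if `desc` matches this regex approximation of the grammar."""
    return FRAG_DESC.fullmatch(desc) is not None


for desc in [
    "1{b[16]u[12]}2{r:}",
    "1{b[16]u[10]}2{r:}",
    "1{b[9-10]f[CAGAGC]u[8]b[10]}2{r:}",
    "1{u[15]b[9-10]u:}2{r:}",
]:
    print(desc, is_valid_fgdl(desc))
```

Under this translation the three example descriptions are accepted and the ambiguous one is rejected, because b[9-10] is followed by neither an anchor sequence nor the closing brace.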