Reduced Alignment Data (RAD) format specification

NOTE: This is a working specification, and is subject to modifications and revisions. However, to the extent possible, this specification aims to agree with and conform to the RAD files currently being produced by the tools using this format (the otherwise de facto specification).

Data types

Shorthand for the data types of fields used in this spec.

(x1, x2, … , xn) — a tuple of n data types (simply encoded as x1 followed by x2, etc.)

Each of the following data types is identified by the type id listed next to it.

b — boolean (type id 0)
u8 — unsigned 8-bit integer (type id 1)
u16 — unsigned 16-bit integer (type id 2)
u32 — unsigned 32-bit integer (type id 3)
u64 — unsigned 64-bit integer (type id 4)
f32 — 32-bit IEEE floating point number (type id 5)
f64 — 64-bit IEEE floating point number (type id 6)
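(length: type id 1, values: [type id 2]) — an array whose length field has the type given by type id 1 and whose elements have the type given by type id 2, so the length is at most the maximum value of the length type. Here "type id 1" and "type id 2" are placeholders for two type ids, not the ids 1 and 2 themselves. Type id 1 must be one of the unsigned integer types (type ids 1—4). Type id 2 may be anything other than another array (i.e. nested arrays are not yet supported). (type id 7)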
string — (u16, [u8]) where the first element is the string length and the array encodes the string’s characters. (type id 8)
u128 — unsigned 128-bit integer (type id 9)
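
As a rough illustration of how the type ids above could drive decoding, here is a minimal Python sketch. It rests on two assumptions that this spec does not state: values are little-endian, and a boolean occupies a single byte. The helper names (read_scalar, read_string, read_array) are hypothetical and are not taken from any existing tool.

```python
import struct

# Type ids from the list above, mapped to struct format codes.
# Assumptions (not stated in the spec): little-endian byte order,
# and a boolean stored as a single byte.
SCALAR_FORMATS = {
    0: "<?",  # b
    1: "<B",  # u8
    2: "<H",  # u16
    3: "<I",  # u32
    4: "<Q",  # u64
    5: "<f",  # f32
    6: "<d",  # f64
}

def read_scalar(buf, offset, type_id):
    """Decode one fixed-width value; return (value, new_offset)."""
    if type_id == 9:  # u128: struct has no 128-bit code, so use int.from_bytes
        return int.from_bytes(buf[offset:offset + 16], "little"), offset + 16
    fmt = SCALAR_FORMATS[type_id]
    (value,) = struct.unpack_from(fmt, buf, offset)
    return value, offset + struct.calcsize(fmt)

def read_string(buf, offset):
    """Decode a string (type id 8): a u16 length, then that many bytes."""
    (length,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    return bytes(buf[start:start + length]).decode("utf-8"), start + length

def read_array(buf, offset, len_type_id, elem_type_id):
    """Decode an array (type id 7) whose length has type len_type_id (1-4)."""
    if not 1 <= len_type_id <= 4:
        raise ValueError("array length type must be u8, u16, u32 or u64")
    if elem_type_id == 7:
        raise ValueError("nested arrays are not supported")
    count, offset = read_scalar(buf, offset, len_type_id)
    values = []
    for _ in range(count):
        if elem_type_id == 8:
            value, offset = read_string(buf, offset)
        else:
            value, offset = read_scalar(buf, offset, elem_type_id)
        values.append(value)
    return values, offset
```

For example, read_array(struct.pack("<B3H", 3, 10, 20, 30), 0, 1, 2) decodes a u8 length followed by three u16 values and returns ([10, 20, 30], 7).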

Overall file layout

A RAD file consists of several sections (described in more detail below). The file starts with a prelude, which consists of a header, and a collection of 3 tag description segments (describing, respectively, the file, read, and alignment level tags that this file will contain).

The prelude is followed by a section providing the values corresponding to the file tag description given in the prelude.

After the file tag value segment follows a series of record chunks, with each chunk consisting of a header followed by the number of records specified in that header. Each record holds read-level information (as specified in the read-tag description given in the prelude) and alignment-level information (as specified in the alignment-tag description given in the prelude). In addition to the specified tags, each read record has a single, mandatory field, which specifies the number of alignment records belonging to this read.

A high-level sketch of this overall structure is provided below:

[Figure: high-level sketch of the RAD file structure; image not available]

The RAD format is designed for efficient, parallel binary parsing. Record chunks are independent, and given the metadata provided by the prelude, each chunk can be parsed independently by separate threads.

Currently, the RAD format (with different tag sets, and so providing different information) is used by alevin-fry, piscem and piscem-infer for transferring and recording relevant mapping information.

Header

The header contains the following information:

isPaired (b) — Whether the mappings that follow are for paired-end fragments (as opposed to single-end fragments)
refCount (u64) — The number of reference sequences being aligned against
refNames ( (refCount, [string]) ) — An array of refCount strings, each encoded as a (u16, [u8]) pair, giving the names of the target references. The null terminator is not included in the string.
numChunks (u64) — the number of chunks in the file following the header. If this value is 0, then the number of chunks is not known or not recorded.
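
The header fields above can be read in order with a few struct calls. The following Python sketch is illustrative only and makes several assumptions that are not stated in this spec: integers are little-endian, the boolean is a single byte, names are UTF-8 compatible, and refCount is stored once (it is not repeated in front of the refNames array). RadHeader and parse_header are hypothetical names.

```python
import struct
from dataclasses import dataclass

@dataclass
class RadHeader:
    is_paired: bool
    ref_names: list
    num_chunks: int  # 0 means unknown or not recorded

def parse_header(buf, offset=0):
    """Parse the header fields in the order listed above.

    Returns (header, new_offset). Assumes little-endian integers, a
    one-byte boolean, and a refCount that is stored only once.
    """
    is_paired, ref_count = struct.unpack_from("<?Q", buf, offset)
    offset += 9  # 1 byte for isPaired + 8 bytes for refCount
    ref_names = []
    for _ in range(ref_count):
        (name_len,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        ref_names.append(bytes(buf[offset:offset + name_len]).decode("utf-8"))
        offset += name_len
    (num_chunks,) = struct.unpack_from("<Q", buf, offset)
    return RadHeader(is_paired, ref_names, num_chunks), offset + 8
```

The returned offset points at the first byte after numChunks, which is where the tag definition section begins.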

Tag definitions

A tagging system for listing optional properties that are global to the file.

There are three types of tags:

file-level tags : exist once in the file, and provide information about the file as a whole, or metadata about how to interpret other tags.
read-level tags : exist once per read, all alignments for a read share the same read-level tags.
alignment-level tags : exist once per alignment (akin to tags in a SAM/BAM file).

For tags in the latter 2 categories, all records should have all of these tags, in the same order in which they appear here.

The tag definition section of the file shall contain

Count of file-level tags (u16) — The number, $nt_{f}$, of file-level tags that will appear.
A list of $nt_{f}$ tag descriptions that specify the types of the file-level tags.
Count of read-level tags (u16) — The number, $nt_{r}$, of tags that will be used for the read records in this file.
A list of $nt_{r}$ tag descriptions that specify the types of the read-level tags.
Count of alignment-level tags (u16) — The number, $nt_{a}$, of tags that will be used for the alignment records in this file.
A list of $nt_{a}$ tag descriptions that specify the types of the alignment-level tags.

Tags definitions

A tag definition shall consist of:

Tag name (u16, [u8]) and type id (u8) — The first part of the tag definition records the tag name, and the second records the tag type. In the tag name, the u16 encodes the length of the name and the array that follows holds its characters. For every type other than an array, the type is simply a single u8 recording the type id (0—6, 8 or 9). If the tag is of type 7, the rest of the array description (type id 1 and type id 2) follows the type id. Tags of types 0—6 and 9 are fixed length, while tags of types 7 and 8 are variable length. Note: all fixed-size global tags should precede all variable-length tags.
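
To make the layout of the tag definition section concrete, here is a minimal Python sketch that reads the three description lists (file, read and alignment level). It assumes little-endian u16 counts and that, for an array tag, type id 1 and type id 2 are each stored as a single u8 directly after the type id 7. Neither assumption is stated explicitly in this spec, and TagDesc, read_tag_desc and read_tag_section are hypothetical names.

```python
import struct
from typing import NamedTuple, Optional

class TagDesc(NamedTuple):
    name: str
    type_id: int
    len_type_id: Optional[int] = None   # only set for arrays (type id 7)
    elem_type_id: Optional[int] = None  # only set for arrays (type id 7)

def read_tag_desc(buf, offset):
    """Read one tag description; return (TagDesc, new_offset)."""
    (name_len,) = struct.unpack_from("<H", buf, offset)
    offset += 2
    name = bytes(buf[offset:offset + name_len]).decode("utf-8")
    offset += name_len
    type_id = buf[offset]
    offset += 1
    if type_id == 7:
        # Assumed: the length type and element type follow as two u8 values.
        len_type_id, elem_type_id = buf[offset], buf[offset + 1]
        return TagDesc(name, type_id, len_type_id, elem_type_id), offset + 2
    return TagDesc(name, type_id), offset

def read_tag_section(buf, offset):
    """Read the file-, read- and alignment-level description lists, in that order."""
    levels = []
    for _ in range(3):
        (count,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        descs = []
        for _ in range(count):
            desc, offset = read_tag_desc(buf, offset)
            descs.append(desc)
        levels.append(descs)
    file_tags, read_tags, aln_tags = levels
    return file_tags, read_tags, aln_tags, offset
```

Keeping the three lists separate mirrors the file layout: the file-level list drives the file-level tag values, while the read and alignment lists are reused for every record in every chunk.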

Example tags (these are not required or part of the spec, but an example of what some tags might be):

Tag “ReadName” ((8, [‘R’,‘e’,‘a’,‘d’,‘N’,‘a’,‘m’,‘e’], 8)) — This tag is present if the read record stores the original name of the read. The name itself will be a “string” type encoded as (u16, [u8]).

Tag “AvgReadQuality” ((14, [‘A’,‘v’,‘g’,‘R’,‘e’,‘a’,‘d’,‘Q’,‘u’,‘a’,‘l’,‘i’,‘t’,‘y’], 5)) — The average quality score of the bases in this read.
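
For illustration, the two example tag descriptions could be serialized as follows. This is a sketch that assumes little-endian encoding (not stated in this spec). encode_tag_desc is a hypothetical helper and covers non-array tags only.

```python
import struct

def encode_tag_desc(name, type_id):
    """Encode a non-array tag description: u16 name length, name bytes, u8 type id."""
    raw = name.encode("utf-8")
    return struct.pack("<H", len(raw)) + raw + struct.pack("<B", type_id)

read_name = encode_tag_desc("ReadName", 8)          # type id 8: string
avg_quality = encode_tag_desc("AvgReadQuality", 5)  # type id 5: f32

# 2 length bytes + name bytes + 1 type-id byte
print(len(read_name), len(avg_quality))  # 11 17
```

The name length is computed from the encoded name rather than written by hand, which avoids a mismatch between the length prefix and the characters that follow.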

File-level tag values

A list of $nt_{f}$ tag values, corresponding one-to-one to the file-level tags declared in the previous section. The number of tags, and their order and types, must match what is provided in the file-tag description section in the prelude.

Chunks

After the prelude, and file-level tag values, the file consists of a series of chunks. Each chunk has a chunk-specific header of the following form:

Chunk header ((u32,u32)) — This header is simply a tuple where the first element specifies the number of bytes, B_c, occupied by the chunk, and the second element specifies the number of reads, R_c, whose records are present in the chunk. Note: The B_c field includes the number of bytes occupied by the header itself, so it counts the total number of bytes in all records present in this chunk + 8 (the size of the header encoding B_c and R_c).
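
Because B_c includes the 8-byte header, a reader can hop from one chunk header to the next without decoding any records, which is one way to hand chunks to separate worker threads. The Python sketch below assumes little-endian u32 fields and a buffer that ends right after the last chunk. chunk_offsets is a hypothetical helper name.

```python
import struct

def chunk_offsets(buf, first_chunk_offset):
    """Yield (offset, num_bytes, num_reads) for each chunk without parsing records.

    Assumes little-endian u32 fields and that buf ends right after the last chunk.
    """
    offset = first_chunk_offset
    while offset < len(buf):
        num_bytes, num_reads = struct.unpack_from("<II", buf, offset)
        if num_bytes < 8:
            raise ValueError("B_c counts its own 8-byte header, so it cannot be < 8")
        yield offset, num_bytes, num_reads
        offset += num_bytes  # B_c includes the header, so this lands on the next chunk
```

Each yielded (offset, num_bytes) pair delimits a self-contained byte range that can be parsed independently of every other chunk.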

Each record (the chunk will have R_c such records, comprising B_c - 8 bytes in total) shall consist of:

Alignment count (u32) — number of alignment records, N_A, to follow for this read.
Array of specified read-level tags for this read.

For each of the N_A alignments of this read:

The collection (in the appropriate order) of the alignment-level tags specified in the alignment-tag description of the prelude.
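
The record layout above (alignment count, then read-level tags, then N_A groups of alignment-level tags) can be sketched in Python as follows. This is a simplified illustration, not a reference parser: it assumes little-endian values, takes the read-level and alignment-level type ids as plain lists, and handles only the fixed-width type ids 0 to 6 (no arrays, strings or u128). parse_chunk is a hypothetical name.

```python
import struct

# struct codes for the fixed-width type ids 0-6 (little-endian assumed).
FMT = {0: "?", 1: "B", 2: "H", 3: "I", 4: "Q", 5: "f", 6: "d"}

def parse_chunk(buf, offset, read_type_ids, aln_type_ids):
    """Parse one chunk; return (records, offset_of_next_chunk).

    Each record is (read_tag_values, [alignment_tag_values, ...]).
    Sketch only: handles fixed-width tags (type ids 0-6).
    """
    read_fmt = "<" + "".join(FMT[t] for t in read_type_ids)
    aln_fmt = "<" + "".join(FMT[t] for t in aln_type_ids)
    read_size, aln_size = struct.calcsize(read_fmt), struct.calcsize(aln_fmt)

    chunk_start = offset
    num_bytes, num_reads = struct.unpack_from("<II", buf, offset)  # B_c, R_c
    offset += 8
    records = []
    for _ in range(num_reads):
        (num_alns,) = struct.unpack_from("<I", buf, offset)  # N_A
        offset += 4
        read_tags = struct.unpack_from(read_fmt, buf, offset)
        offset += read_size
        alns = []
        for _ in range(num_alns):
            alns.append(struct.unpack_from(aln_fmt, buf, offset))
            offset += aln_size
        records.append((read_tags, alns))
    if offset - chunk_start != num_bytes:
        raise ValueError("chunk size B_c does not match the parsed records")
    return records, offset
```

The final check compares the bytes consumed against B_c, which catches a mismatch between the tag descriptions and the data early instead of silently misreading the next chunk.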

NoorPratap

2024/06/28 20:01:35

(type id1, length, type id2, [x])

what is x, further type id 1 or type id is supposed to be 7 is not clear

2024/06/28 20:11:36

type id1 must be of type 1—4

is there a reason why array of boolean is not supported

Rob Patro

2024/07/02 04:18:30

It is supported. Type ID 1 encodes the length, while type id 2 encodes the element type. This says the length must be one of the integer types, but the array itself can be of any (non-aggregate) type — i.e. anything other than another array. For the time being, nested arrays are not supported.

2024/06/28 20:23:31

Some of the color choices make the document hard to parse

2024/07/02 04:20:36

This is just using one of the built-in themes. We can change it if you can suggest a better “dark mode” theme.
