
DefraDB Query Specification [Archived]

Status: Draft
Version: v0.2.0-beta
By: John-Alan Simmons
Created: July 28th, 2020


The purpose of this document is to specify the DefraDB Query Language only. The intent, design decisions, and other auxiliary information are a secondary objective, added only where needed to give the primary specification context. See the DefraDB Design Doc for further details on DefraDB as a whole, and on the Query Language beyond this concrete specification, such as implementation details.


Query Language Overview

The following is a succinct description of the query language, its goals, and design choices to place this document in the correct context.

The DefraDB Query Language (DQL) is a GraphQL defined API to access and query data residing inside a DefraDB node. GraphQL was chosen for its malleable nature, which provides the developer with a balance of freedom and rigidity. GraphQL is an API specification designed at Facebook as a replacement for RESTful APIs, aiming to reduce overhead, N+1 queries, and extra network round trips, and to simplify the correlation between application state and API representation. It was not initially designed as a database query language, but it has started to be adopted within the database community as an alternative to traditional SQL or more proprietary query languages like the MongoDB query language and others. Notable databases that have led the charge for GraphQL's adoption are DGraph and Fauna. These databases expose a native GraphQL API as a query language to read and write data from/to the database. Each has its own approaches and differences; most notably, DGraph is a graph database, which imposes different relational structure and data modeling requirements. Alternatively, Fauna tries to maintain the more traditional SQL relational model for its database, replicated into the GraphQL domain.

DefraDB, while using GraphQL like the other databases mentioned above, is designed as a Document Store database, which is again different from the Graph and SQL databases of DGraph and Fauna, respectively. As such, during our literature review, we kept in mind our unique requirements imposed by this model.

Our Query Language exposes every functionality of the database directly, without the need for any additional APIs. This includes reading, writing, and modifying data; describing data structures and schemas; and architecting data models via indexes and other schema-independent, application-specific requirements. The notable exception is DefraDB's PeerAPI (outlined in the DefraDB Specification), which is used to interact with other peers' databases, and with the underlying CRDTs in the case of collaborative text editing.

Our initial design relies only on the currently available GraphQL specification, which has a stable version tagged June 2018 Edition. Initially, the query language will utilize standard GraphQL schemas, with any additional directives exposed by DefraDB. In the future, we will migrate to Source Schemas, which can compile to GraphQL schemas, but include additional semantic and syntactic structures, content-addressable composability, and more. DefraDB's CRDT types will initially be automatically mapped to GraphQL types; over time, we plan to provide mechanisms to control the underlying CRDT types.

Document Model

As mentioned above, DefraDB uses a Document API model, which is distinct from traditional SQL models. The main differences stem from the normalized nature of data within SQL versus the more simplistic Document model. The benefit of the Document Model is how closely your data state is able to represent your application state, which gives rise to the notion that Document stores in the NoSQL world are more “Developer Friendly”. SQL was designed from an academic, formally defined perspective, which makes it very “Teachable” and easy for professors to present within an educational setting. In most Database 101 classes, we first learn about Relational Algebra, a strictly defined formal language for expressing relationships between data; SQL is a widespread implementation of this formal language. It allows for fine-tuned query expressions; however, it is very disconnected from both the application state and the conceptual models humans rely on to reason about data.

As a result, GraphQL is a perfect pairing for Document Model databases, as the GraphQL schema system closely resembles both application state and common Document model schemas, which results in the least number of context switches when trying to reason about how your application state persists to your datastore. The downside is the lack of expressiveness compared to SQL; however, SQL's powerful expressions are inherently required by its complex data model, and a more simplistic data model doesn't require expressions as powerful to reason about it. SQL relies on normalized data, indirection, join tables, and more, which makes constructing queries hyper-specific to SQL, and even implementation-specific between SQL options. Document Models have a more flexible system, which doesn't enforce normalization, and can more easily represent relations and embeddings between types.

DefraDB Query Language - (DQL)

High-Level Design

DefraDB's GraphQL Query Language, or DQL for short, is used to read, write, modify, and delete data residing within a DefraDB instance. The entire language is compliant with the GraphQL specification, and will attempt to use GraphQL best practices. However, since GraphQL was not originally intended as a database query language, small divergences from traditional API implementations are necessary.

Currently, all query language functionality requires a predefined data schema, using the GraphQL Schema grammar. This allows us to auto-generate all the necessary types, filters, inputs, payloads, etc., to operate the query language with complete type safety guarantees. There may be instances where "Loose Typing" is used in place of Strong Typing, due to the dynamic nature of query languages and the powerful, expressive features this enables.

Additionally, DefraDB's query language will need to cover 99.99% of developer use cases with respect to interacting with their data layer. This is a notable distinction between this query language and traditional application GraphQL API endpoints. Traditional API endpoints assume that developers have the freedom to control their data as they see fit before returning it through a GraphQL response object. With DefraDB, the binding between GraphQL input/output objects and the underlying data layer is very tight.

As an example, application GraphQL APIs often enable filtering and aggregation only with respect to their specific application domain. Instead of directly exposing a fully encompassing filter and aggregate layer, they proxy all necessary functionality through custom Query resolvers that are use case dependent. DefraDB has no such option, since it must provide full functionality and expressiveness directly to the developer. It has no concept of the domain-specific context the application resides in.

However, DQL still needs to be as developer-friendly as possible, enabling many query language features commonly expected from a data layer.

The major distinction between GraphQL and other database query languages is the explicit separation between read-only operations and write operations, represented as Queries and Mutations respectively. The only hindrance this imposes is on the use of subqueries within mutations, commonly found in SQL databases, where one might insert data based on a subquery's results. This will ideally be resolved in future versions of DQL.

DQL makes notable use of custom reserved keywords within GraphQL that exist only to further enable database query language features. For the most part, reserved keywords are prefixed by an "_" (underscore) to indicate some kind of special functionality. For example, when filtering or sorting data, the "_lt" (less than) or "_asc" (ascending) keywords are used to indicate the desired functionality or behavior.

This Query Language specification will avoid relying on implementation details that differ between deployment environments, as DefraDB is intended to operate in several very different deployment environments. If any feature is deployment dependent, or deviates between implementations, it will be noted as such. Additionally, this specification will not attempt to indicate best practices (these will be covered in another document), and only provides the direct features and references available. This is most notable when discussing indexes: DefraDB will offer many different types of indexes, each with its own pros and cons, and their use is very dependent on the application use case. For completeness, the use of indexes on schemas via directives will be discussed, as well as what features specific indexes provide, but only so far as they relate to their use in a Query/Mutation, and not how they might best be used for a particular application.

Query Block

The Query Block is the read-only GraphQL operation, designed solely to request information from the database; it cannot mutate the database's state. Query Blocks can contain multiple subqueries that are executed concurrently, unless there is some variable dependency between them (See: Variable Blocks).

Queries support most traditional database query operations like: Filtering, sorting, grouping, skipping/limiting, aggregation, and more. Many of the available query operations can be used on different GraphQL object levels, most notably on fields that have some relation or embedding to other objects.

Collections

Each and every developer-defined type is attached to a Collection. A Collection is similar to a SQL Table, which represents a group of documents with similar structures.

Each collection gets an auto-generated query field, allowing users and developers to query, filter, select, and interact with documents in several different ways.
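As a loose sketch (the Book type and fields here mirror those used in later examples), defining a type creates its collection, and the auto-generated books query field can then be used as shown throughout this section:

type Book {
    title: String
    genre: String
    description: String
}

# Auto-generated query field for the Book collection:
{
    books {
        title
        genre
        description
    }
}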

Filtering

Provided a collection of documents, we can filter on a number of fields and predicates, including compound predicates via the conditional keywords _and, _or, and _not.

Basic filtering is exposed through the filter keyword, and can be supplied as arguments to a root level field, or subfields that contain some kind of embedding or relation.

The following shows the use of the filter keyword on the root level books field. An empty filter object is equivalent to no filters at all, so it will match all objects it is applied to. In this case, it will return all books.

{
	books(filter: {}) {
		title
		genre
		description
	}
}

Some of the available filtering options depend on the available indexes on a field, but we will omit this for now.

To apply a filter to a specific field, we can specify it within the filter object.

The following only returns books with the title “A Painted House”

{
	books(filter: { title: { _eq: "A Painted House" }}) {
		title
		genre
		description
	}
}

We can apply filters to as many fields as are available. Note, each additional listed field in the filter object implies an AND conditional relation.

{
	books(filter: { title: { _eq: "A Painted House"}, genre: { _eq: "Thriller" }}) {
		title
		genre
		description
	}
}

The above query is equivalent to the statement: Return all the books with the title “A Painted House” AND whose genre is “Thriller”.

Filtering is also possible on subfields that have relational objects within them.

Given a Book object with an Author field which is a many-to-one relationship to the Author object, we can query and filter based on the Author field's values.

{
	books(filter: { genre: { _eq: "Thriller" }, author: {name: { _eq: "John Grisham"}}}) {
		title
		genre
		description
		author {
			name
			bio
		}
	}
}

This query returns all books authored by “John Grisham” with the genre “Thriller”.

Filtering from the root object, compared to from sub-objects, results in different semantics. Root filters that apply to sub-objects, like the author section of the above query, only return the root object type if there is a match on the sub-object filter. E.g., we will only return books if the author filter condition matches.

This applies to both single sub-objects and array sub-objects. Meaning, if we apply a filter on a sub-object, and that sub-object is an array, we only return the root object if at least one sub-object matches the given filter, instead of requiring every sub-object to match.

The following query will only return authors if they have at least one book whose genre is "Thriller".

{
    authors(filter: {book: {genre: {_eq: "Thriller"}}}) {
        name
        bio
    }
}

Moreover, if we include the sub-object we filter on, which is an array, in the selection set, the filter is then implicitly applied unless specified otherwise.

From the above query, let's add books to the selection set.

{
    authors(filter: {book: {genre: {_eq: "Thriller"}}}) {
        name
        bio
        books {
            title
            genre
        }
    }
}

Here, the books section will only contain books that match the root object filter, E.g.: {genre: {_eq: "Thriller"}}. If we wish to return the same authors from the above query and include all their books, we can directly add an explicit filter to the sub-object instead of just the sub-filters.

{
    authors(filter: {book: {genre: {_eq: "Thriller"}}}) {
        name
        bio
        books(filter: {}) {
            title
            genre
        }
    }
}

Here, we only return authors who have at least one book with the genre set to "Thriller". However, we then return all of that author's books, instead of just their thrillers.

Filters applied solely to sub-objects, which is only applicable for Array types, are computed independently from the root object filters.

{
	authors(filter: {name: { _eq: "John Grisham" }}) {
		name
		bio
		books(filter: { genre: { _eq: "Thriller" }}) {
			title
			genre
			description
		}
	}
}

The above query returns all authors with the name “John Grisham”, then filters and returns all the returned authors' books whose genre is “Thriller”. This is similar to the previous query, but an important distinction is that it will return all the matching Author objects regardless of the book's sub-object filter.

The first query only returns an author if they have Thriller books AND their name is “John Grisham”. The second query always returns all authors named “John Grisham”, and if they have some books with the genre “Thriller”, returns those as well.

So far, we have only looked at EXACT string matches, but we can filter on any scalar type or object field, like booleans, integers, floating points, and more. Beyond equality, we can specify Greater Than, Greater Than or Equal to, Less Than, Less Than or Equal to, and of course, EQUAL.

Let's query for all books with a rating greater than or equal to 4.

{
	books(filter: { rating: { _gte: 4 } }) {
		title
		genre
		description
	}
}

Note, our filter expression contains a new object { _gte: 4 }, instead of before, where we had a simple string value. If a field is a scalar type, and its filter is given an object value, that object's first and only key must be a comparator operator like _gte. If the filter is given a simple scalar value, like “John Grisham”, “Thriller”, or FALSE, the default operator is _eq (EQUAL).
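For example, the following two queries are equivalent, since a bare scalar value defaults to the _eq operator:

# Shorthand form:
{
	books(filter: { genre: "Thriller" }) {
		title
	}
}

# Equivalent explicit form:
{
	books(filter: { genre: { _eq: "Thriller" } }) {
		title
	}
}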

See the following table for a reference of available operators

Operator    Description
_eq         Equal to
_neq        Not Equal to
_gt         Greater Than
_gte        Greater Than or Equal to
_lt         Less Than
_lte        Less Than or Equal to
_in         In the List
_nin        Not in the List
_like       Sub-String Like
Table 1. Supported operators.

See the following table for a reference of which operators can be used on which types

Scalar Type       Operators
String            _eq, _neq, _like, _in, _nin
Integer           _eq, _neq, _gt, _gte, _lt, _lte, _in, _nin
Floating Point    _eq, _neq, _gt, _gte, _lt, _lte, _in, _nin
Boolean           _eq, _neq, _in, _nin
Binary            _eq, _neq
DateTime          _eq, _neq, _gt, _gte, _lt, _lte, _in, _nin
Table 2. Operators supported by Scalar types.
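As a brief sketch of the list and string operators above (reusing fields from earlier examples), the following matches books in either of two genres whose title starts with "A":

{
	books(filter: { genre: { _in: ["Thriller", "Mystery"] }, title: { _like: "A%" }}) {
		title
		genre
	}
}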

To use multiple filters, as we saw above, the AND predicate is implied given a list of fields to filter on; however, we can explicitly indicate which conditional we want to use. The options are _and, _or, and _not. The _and and _or conditionals accept a list of predicate filters, whereas the _not conditional accepts a single object.

Let's query all books that are part of the Thriller genre, or have a rating between 4 and 5.

{
    books(
        filter: {
            _or: [
                {genre: { _eq: "Thriller"}},
                { _and: [
                    {rating: { _gte: 4 }},
                    {rating: { _lte: 5 }},
                ]},
            ]
        }
    ) {
        title
        genre
        description
    }
}

An important thing to note about the above query is the _and conditional. Even though AND is assumed, if we have two filters on the same field, we MUST specify the _and operator, since JSON objects cannot contain duplicate fields.

Invalid:
filter: { rating: { _gte: 4 }, rating: { _lte: 5 } }
Valid:
filter: { _and: [ {rating: {_gte: 4}}, {rating: {_lte: 5}} ]}

The _not conditional accepts an object instead of an array.

Filter all objects that do not have the genre "Thriller"
filter: { _not: { genre: { _eq: "Thriller" } } }

The _not operator should only be used when the available filter operators, like _neq, do not fit the use case.

Sorting & Ordering

Sorting is an integral part of any Database and Query Language. The sorting syntax is similar to the filter syntax, in that we use objects and sub-objects, except to indicate sorting behavior instead of filtering behavior.

Let's find all books ordered by their latest published date

{
    books(sort: { published_at: DESC}) {
        title
        description
        published_at
    }
}

Here we indicated which field we wanted to sort on (published_at) and which direction we wanted to order by (descending).

Fields may be sorted either Ascending or Descending.

Sorting can be applied to multiple fields in the same query. The order in which each sort is applied matches the field order in the sort object.

Let's find all books ordered by earliest published date and then by title descending

{
    books(sort: { published_at: ASC, title: DESC }) {
        title
        genre
        description
    }
}

Additionally, we can sort on sub-object fields along with root object fields.

Let's find all books ordered by earliest published date and then by their author's birthday, latest first.

{
    books(sort: { published_at: ASC, author: { birthday: DESC }}) {
        title
        description
        published_at
        author {
            name
            birthday
        }
    }
}

Sorting on multiple fields at once is driven primarily by the first indicated sort field, called the primary field. In the above example, this would be the “published_at” date. The following sort field, called the secondary field, is used when more than one record has the same value for the primary sort field. Suppose more than two sort fields are indicated. In that case, the same behavior applies, except the primary/secondary pair shifts by one element, so the 2nd field is the primary and the 3rd is the secondary, until we reach the end of the sort fields.

If only a single sort field is given, and objects have the same value, then by default the document's identifier (DocKey) is used as the secondary sort field. This applies regardless of how many sort fields are given: so long as the DocKey is not already included in the sort fields, it is always the final tie-breaking secondary field.

A direct result of the DocKey sort behavior is that if the DocKey is included in the query's sort fields, any field listed afterward will never be evaluated, since all DocKeys are unique. Given the sort fields (published_at, id, birthday), the birthday sort field will never be evaluated, and should be removed from the list.

Sorting on sub-objects from the root object is only allowed if the sub-object is not an array. If it is an array, the sort must be applied to the object field directly instead of through the root object.

So, instead of

{
    authors(sort: { name: DESC, books: { title: ASC }}) {
        name
        books {
            title
        }
    }
}

We need

{
    authors(sort: { name: DESC }) {
        name
        books(sort: { title: ASC }) {
            title
        }
    }
}

This is because root level filters and ordering apply to the root object only. If we did allow this version, it would be ambiguous from the developer's point of view whether the ordering applied to the order of the root objects relative to their siblings, or solely to the sub-objects. Suppose we allowed it, and enforced the semantics of root level sorting on array sub-objects acting as a sorting mechanism for the root object. In that case, there is no obvious way to determine which value in the array is used for the root order.

This means that if we had the following objects in the database:

 [
     author {
         name: "John Grissam"
         books: [
            { title: "A Painted House" },
            { title: "The Guardians" }
         ]
     }
     author {
         name: "John Grissam"
         books: [
            { title: "Camino Winds" },
         ]
     }
     author {
         name: "John LeCare"
         books: [
             { title: "Tinker, Tailor, Soldier, Spy"}
         ]
     }
 ]

and the following query

{
    authors(sort: { name: DESC, books: { title: ASC }}) {
        name
        books {
            title
        }
    }
}

Since we have two authors with the same name (John Grisham), the sort object (sort: { name: DESC, books: { title: ASC }}) would suggest we sort duplicate authors using books: { title: ASC } as the secondary sort field. However, because the books field is an array of objects, there is no single title value to compare easily.

Therefore, sorting on array sub-objects from the root field is strictly not allowed.

Limiting & Pagination

After filtering and sorting a query, we can then limit and skip elements from the returned set of objects.

Let us get the top 10 rated books

{
    books(sort: {rating: DESC}, limit: 10) {
        title
        genre
        description
    }
}

Here we introduce the limit argument, which accepts the maximum number of items to return from the resulting set.

Next, we can skip elements in the set, to get the following N objects from the return set. This can be used to create a pagination system, where we have a limit of items per page, and skip through pages.

Let's get the next top 10 rated books after the previous query

{
    books(sort: {rating: DESC}, limit:10, offset: 10) {
        title
        genre
        description
    }
}

Limits and offsets can be combined to create several different pagination methods.
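For instance, assuming a fixed page size of 10, page N can be fetched by setting offset to (N - 1) * 10; this sketch requests the third page of top rated books:

{
    books(sort: {rating: DESC}, limit: 10, offset: 20) {
        title
        rating
    }
}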

Relationships

DefraDB supports a number of common relational models that an application may need. These relations are expressed through the Document Model, which has a few differences from the standard SQL model. There are no manually created "Join Tables" which track relationships. Instead, the denormalized nature of NoSQL Document objects allows us to embed and resolve relationships as needed automatically.

Relationships are defined through the Document Schemas, using a series of GraphQL directives, and inferencing. They are always defined on both sides of the relation, meaning both objects involved in the relationship.

One-to-One

The simplest relationship is a "one-to-one", which directly maps one document to another.

Let us define a one-to-one relationship between an Author and their Address

type Author {
    name: String
    address: Address @primary
}

type Address {
    number: Integer
    streetName: String
    city: String
    postal: String
    author: Author
}

Here we simply include the respective types in both objects, and DefraDB infers the relationship. This will create two different objects, each of which is independently queryable, and each of which provides field level access to its related object. The notable distinction of "one-to-one" relationships is that only the DocKey of the corresponding object is stored.
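As a loose sketch of what this enables (the authors query field and nested selection follow the same pattern as earlier query examples), the related object can be traversed from either side at the field level:

{
    authors {
        name
        address {
            city
            postal
        }
    }
}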

If we wanted to simply embed the Address within the Author type, without the internal relational system, we can include the @embed directive, which embeds it within the parent object. Objects embedded inside another using the @embed directive do not expose a query endpoint, so they can only be accessed through their parent object. Additionally, they are not assigned a DocKey.
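A minimal sketch of an embedded Address, assuming the @embed directive is placed on the embedding field (its exact placement is not pinned down in this document):

type Author {
    name: String
    address: Address @embed
}

type Address {
    number: Integer
    streetName: String
    city: String
    postal: String
}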

One-to-Many

A "one-to-many" relationship allows us to relate several of one type, to a single instance of another.

Let us define a one-to-many relationship between an author and their books. This example differs from the above relationship example because we relate the author to an array of books, instead of a single address.

type Author {
    name: String
    books: [Book]
}

type Book {
    title: String
    genre: String
    description: String
    author: Author
}

Here we define the books field within the Author type to be an array of Book types, which indicates the Author type has a relationship to many Book types. Internally, much like the one-to-one model, only the DocKeys are stored. However, in this relationship, the DocKey is only stored on one side of the relationship, specifically the child type. In this example, only the Book type keeps a reference to its associated Author DocKey.

We are investigating methods for storing a denormalized array on the parent object that is efficient to maintain (inserts, deletes), as well as providing a secondary index for one-to-many relationships. The denormalized array method should only be used for small cardinality lists, e.g., lists with a low and bounded number of elements.

Many-to-Many

Multiple Relationships

It is possible to define a collection of different relationship models. Additionally, we can define multiple relationships within a single type.

Relationships that contain unique types can simply be added to the types without issue, like the following.

type Author {
    name: String
    address: Address
    books: [Book] @relation("authored_books") @index
}

However, if we have multiple relationships using the same types, we need to annotate the differences. We will use the @relation directive to be explicit.

type Author {
    name: String
    written: [Book] @relation(name: "written_books")
    reviewed: [Book] @relation(name: "reviewed_books")
}

type Book {
    title: String
    genre: String
    author: Author @relation(name: "written_books")
    reviewedBy: Author @relation(name: "reviewed_books")
}

Here we have two relations of the same type; by default, their associations would conflict because internally type names are used to specify relations. We use the @relation directive to add a custom name to the relation. @relation can be added to any relationship, regardless of whether it's a duplicate type relationship. It exists to be explicit, and to change the default parameters of the relation.
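As a rough sketch (assuming the generated query fields follow the same pattern as earlier examples), the two named relations can then be traversed independently:

{
    authors {
        name
        written {
            title
        }
        reviewed {
            title
        }
    }
}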

Inline Arrays

In addition to traditional relationships between complex objects, DefraDB also supports simple arrays of scalars defined directly on an object. Currently, arrays of types Boolean, Int, Float, and String are supported.

For example:

type Users {
    name: String
    likedIndexes: [Boolean]
    favouriteIntegers: [Int]
    favouriteFloats: [Float]
    preferredStrings: [String]
}
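A minimal sketch of reading these inline arrays back, assuming a users query field is generated for the Users type following the usual naming pattern:

{
    users {
        name
        favouriteIntegers
        preferredStrings
    }
}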

Aliases

Often the returned structure of a query isn't in the ideal context for a given application. In this instance, we can rename fields and entire query results to better fit our use case. This is particularly useful, and sometimes necessary, when using multiple queries within a single request.

{
    topTenBooks: books(sort: {rating: DESC}, limit: 10) {
        title
        genre
        description
    }
}

Here we have renamed the books result to topTenBooks, which can be useful for semantic reasoning about the request, and for organizational purposes. It is suggested to name your queries properly in production deployments.

{
    topTenBooks: books(sort: {rating: DESC}, limit: 10) {
        title
        genre
        description
    }
    
    bottomTenBooks: books(sort: {rating: ASC}, limit: 10) {
        title
        genre
        description
    }
}

In this query we have two returned results, named topTenBooks and bottomTenBooks respectively. When dealing with multiple queries of the same type (e.g. books), it is required to alias them to distinguish one from another.

Additionally, we can alias individual fields within our returned types. Aliasing a field works the same as aliasing a query.

{
    books {
        name: title
        genre
        description
    }
}

Here we have renamed the title field to name. Unlike query aliases, field aliases are never strictly required, since name collisions are impossible within a defined query return type.

Variables

Variables are used in DefraDB to store both input placeholder values, and interim query results. DefraDB makes use of interim queries to allow for query composition, that is, using query results as inputs to other queries. This is similar to nested SQL SELECT statements. Variables can be used in place of string interpolation to create more efficient queries.

Input Variables

Input variables can be used in place of complex string interpolation to allow for operational simplicity for developers. DefraDB will natively handle the string interpolation internally, creating an easy workflow. Variables are strongly typed at the time of definition, and queries that are given incorrect variable types will return an error. Variables are defined in the query keyword, along with their types, and are indicated by a $ prefix.

query($myVar: Int) {
    books(filter: { rating: { _eq: $myVar}}) {
        title
        genre
        description
    }
}

{
    "myVar": 4
}

Here we define the variable $myVar, which is of type Int, used by the books filter on the rating field. We supply the variables as a JSON object at the end of the query, with its keys matching the defined variables; any additional keys that don't match a defined variable will result in an error.

Sub Queries

Subqueries are used to store interim query results to be used later on in the currently executing query. They can be used to logically break up a query into components or create a new set of data to filter, which would otherwise be cumbersome.

Subqueries are defined like any other additional query, except they use a special alias name indicated by the reserved variable prefix $_[sub query name]. Subqueries are not returned in the result, but their contents can be utilized in other parts of the query.

The values of the subquery are stored in the special subquery variable defined using the $_ prefix. They are managed as a map of type Map<ID, Object>, with IDs as keys and Objects as values, where the Object is the return type of the subquery. Subqueries have no concept of an ordering, since any desired ordering can be applied on the final query, so any sort input given to a subquery will return an error.

A subquery must always return at least the _id field of the returned type; any other field is optional. Unless other fields are necessary, it is best to solely use the _id field, due to the optimizations it affords the engine.

Subquery results, represented by the prefixed variable, can be used like many other arguments within the filter object. The variable can act both as an array of IDs, allowing operators like _in and _nin, and as a way to access the fields of the objects through a map operator.

** More on-map operations for subqueries to be added! **

// Select all books published by X reviewed by authors belonging to a publisher Y
{
    _authorsOfPubY: authors(filter: { written: { publishedBy: {name: { _eq: "Y"}}}}) {
        _id
    }
    
    books(filter: {publishedBy: { name: { _eq: "X" }}, reviewedBy: {_in: $_authorsOfPubY}}) {
        title
        genre
        description
    }
}

This query utilizes a subquery, defined as $_authorsOfPubY, which selects all the authors who have books published by the publisher named "Y". The results of the subquery are then used in the main query as a list of IDs to compare against the reviewedBy field using the array _in operator, which returns true if the value at reviewedBy is anywhere within $_authorsOfPubY. This query will only return the books result, as subqueries are not included in the results output.

Grouping

Grouping allows a collection of results from a query to be "grouped" into sections based on some field. These sections are called sub-groups, and are based on the equality of fields within objects, resulting in clusters of groups. Any object field may be used to group objects together. Additionally, multiple fields may be used in the group by clause to further segment the groups over multiple dimensions.

Once one or more groupBy fields have been selected using the groupBy argument, which accepts an array of length one or more, you may only access certain fields in the return object. Only the indicated groupBy fields and aggregate function results may be included in the result object. If you wish to access the individual objects of the sub-groups, a special return field called _group is available, which matches the root query type and can access any field in the object type.

Here we query for all the books whose author's name begins with 'John', then group the results by genre, and finally return the genre name along with each sub-group's titles and ratings.

{
    books(filter: {author: {name: {_like: "John%"}}}, groupBy: [genre]) {
        genre
        _group {
            title
            rating
        }
    }
}

In this example we see how the groupBy argument is provided, that it accepts an array of field names, and how the special _group field can be used to access the sub-group elements.

It's important to note that in this example the only available field from the root Book type is the groupBy field genre, along with the special _group and aggregate proxy fields.

Grouping on Multiple Fields

As mentioned, we can include any number of fields in the groupBy argument to segment the data further. These fields can then also be accessed in the return object, like so.

{
    books(filter: {author: {name: {_like: "John%"}}}, groupBy: [genre, rating]) {
        genre
        rating
        _group {
            title
            description
        }
    }
}

Objects often have related objects within their type definition, indicated by the @relation directive on the respective object. We can use the grouping system to split results over the related object's fields and the root type fields.

Like any other group query, we are limited to accessing the fields indicated by the groupBy argument. However, if we include a subtype that has a @relation directive in the groupBy list, we can access the entire relation's fields.

Only "One-to-One" and "One-to-Many" relations can be used in a groupBy argument.

Given a type definition defined as:

type Book {
    title: String
    genre: String
    rating: Float
    author: Author @relation
}

type Author {
    name: String
    written: [Book] @relation
}

We can create a group query over books and their authors like so.

{
    books(groupBy: [author]) {
        author {
            name
        }
        _group {
            title
            genre
            rating
        }
    }
}

As you can see, we can access the entire Author object in the main return object without having to use any special proxy fields.

Group operations can include any combination, single or multiple, individual field or related object, that a developer needs.

Aggregate Functions

DefraDB natively supports aggregation functions over any child collection - including groups, inline-arrays and relationships. Supported aggregate functions are: count, max, min, sum, and avg. The following table breaks down which base scalar types support which aggregate functions

Scalar Type    Aggregate Functions
String         count
Integer        count, max, min, sum, avg
Float          count, max, min, sum, avg
Boolean        count
Binary         count
DateTime       count, max, min
Table 3. Aggregate functions available on each base scalar type.

Like the special _group field, aggregate functions are defined and returned using special fields. If we had the field rating, we could access the average value of all sub-group ratings by including the special field _avg(field: { _group: rating }) in our return object.

All aggregates should be specified along with the path to the property that should be aggregated. _count only requires the name of the collection; all others targeting collections of objects also require the name of the child property. Aggregates on inline-arrays only require the name of the array. The property to be aggregated does not need to be returned in order to be aggregated.

Let us augment the previous grouped books by genre example and include an aggregate function on the sub-groups ratings.

query {
    books(filter: {author: {name: {_like: "John%"}}}, groupBy: [genre]) {
        genre
        
        _avg(field: { _group: rating })
        _count(field: _group)
        
        _group {
            title
            rating
            price
        }
    }
}

Here we group books, whose author's name begins with "John", by genre, with the average rating and count of each group.
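As a hedged sketch of aggregating over child collections outside of a groupBy, based on the note above that aggregates work over relationships and inline-arrays (the exact field paths here are assumptions):

# Count of a related collection:
{
    authors {
        name
        _count(field: books)
    }
}

# Average of an inline array:
{
    users {
        name
        _avg(field: favouriteFloats)
    }
}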

Having - Filtering & Ordering on Groups

When using group queries, it is often necessary to further filter the returned set based on the properties of aggregate function results. The having argument can be used to further filter aggregate and group results. A new system of filtering is required because the filter argument is applied before any grouping and aggregation, so it isn't able to provide the necessary functionality.

In addition to filtering using the having argument, we can still limit and order in the same way, since those operations are applied after the grouping and aggregation. Further explanation of the query execution pipeline is below.

Let us get all the books from the author John LeCare, grouped by genre; calculate the average rating of those books; select the groups whose average rating is greater than 3.5; and order them from highest to lowest.

{
    books(
        filter: { author: {name: { _eq: "John LeCare"} } },
        groupBy: [genre],
        having: { _avg: {rating: {_gt: 3.5}}},
        order: { _avg: {rating: DESC}}
    ) {
        genre
        _avg(field: { _group: rating })
        
        books: _group{
            title
            rating
            description
        }
    }
}

Here we've combined several different query functionalities, including filtering data, followed by grouping and aggregating over the filtered data, further filtering the results again, and finally ordering the return set.

The having argument functions similarly to the filter argument; it takes an object as its parameter. The difference is which fields can be included inside the having object parameter. Unlike the filter argument, only the fields from the groupBy argument and any included aggregate field functions can be used in having. Beyond that limitation, everything else functions the same, from operator use and structure to compound conditionals. This includes sub-objects that exist in the groupBy argument; their respective fields can then be included within a having argument.

Distinct

Functions

Execution Flow

Understanding the execution flow of a query can help in understanding its structure, and help developers craft their queries. Query execution is broken down into three phases: parsing, planning, and execution.

The parsing phase is the simplest; it parses the given query as a string and returns a structured AST representation. Additionally, it does some necessary semantic validation of the structure against the schema.

The planning phase analyzes the query and the given storage structure, including any additional indexes, and determines how it will execute the query. This phase is highly dependent on the deployment environment and underlying storage engine, as it attempts to exploit the available features and structure to provide optimal performance. Specific schemas will automatically create certain secondary indexes. Additionally, the developer can create custom secondary indexes to best fit their use case, and the planning phase will automatically use them when available.

Finally, the execution phase is the bulk of the request time, as it does the data scanning, filtering, and formatting. The execution phase has a deterministic process regarding the steps it takes to produce the results. This is due to the priority each argument and its parameters have for one another.

The filter argument has the highest priority, so it's executed first, breaking down the entire target collection based on its provided parameters and fields into the output result set. Next is the groupBy argument, which further chunks the result set into subgroups across potentially several dimensions, after which the aggregate phase computes its functions over the subgroups' given fields. Next is the having argument, which further filters data based on the grouped fields or aggregate results. Then the order argument structures the result set based on the ordering, either ascending or descending, of one or more field values. Finally, the limit argument (and its associated offset) restricts the number of results returned from the finalized, filtered, ordered result set.

Here is the basic ordering of functions

Filter → Grouping → Aggregate → Having → Ordering → Limiting

Here is a simplified example of the execution order:
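(The sketch below reuses the arguments from earlier sections; comments mark the phase in which each argument is applied.)

{
    books(
        filter: {author: {name: {_like: "John%"}}},   # 1. filter
        groupBy: [genre],                              # 2. grouping, then 3. aggregation
        having: { _avg: {rating: {_gt: 3.5}}},         # 4. having
        order: { _avg: {rating: DESC}},                # 5. ordering
        limit: 10                                      # 6. limiting
    ) {
        genre
        _avg(field: { _group: rating })
    }
}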

Mutation Block

Mutations are the write side of the DefraDB Query Language. Mutations rely on the query system to properly function, as updates, upserts, and deletes all require filtering and finding data before taking action.

The data and payload format that mutations use is fundamental to maintaining the designed structure of the database.

All mutation definitions are generated for each defined type in the Database. This is similar to the read query system.

Mutations are similar to SQL INSERT INTO ... or UPDATE statements.

Much like the Query system, all mutations exist inside a mutation { ... } block. Several mutations can be run at the same time, independently of one another.

Insert

Insert is used to create new documents from scratch, which involves many necessary steps to ensure all the data is structured properly and verifiable. From a developer's perspective, it's the easiest of all the mutations as it doesn't require any queries or document lookups before execution.

type Book { ... }

mutation {
    createBook(data: createBookPayload) [Book]
}

This is the general structure of an insert mutation. You call the createTYPE mutation, with the given data payload.

Payload Format

All mutations use a payload to update the data. Unlike the rest of the Query system, mutation payloads aren't typed. Instead, they use a standard JSON Serialization format.

Removing the type system from payloads allows flexibility in the system.

JSON supports all the same types as DefraDB, and it's familiar to developers, so it's an obvious choice for us.

Here's an example with a full type and payload

type Book {
    title: String
    description: String
    rating: Float
}

mutation {
    createBook(data: "{
        'title': 'Painted House',
        'description': 'The story begins as Luke Chandler ...',
        'rating': 4.9
    }") {
        title
        description
        rating
    }
}

Here we can see a simple example of creating a Book using an insert mutation. Additionally, we can see that, much like the Query functions, we can select the fields we want to be returned.

The generated insert mutation returns the same type it creates, in this case, a Book type. So we can easily include all the fields as a selection set so that we can return them.

More specifically, the return type is of type [Book]. So we can create and return multiple books at once.

Update

Updates are distinct from Inserts in several ways. Firstly, an update relies on a query to select the correct document or documents to update. Secondly, it uses a different payload system.

Update filters use the same format and types as the Query system, so they are easily transferable.

Here's the structure of the generated update mutation for a Book type.

mutation {
    update_book(id: ID, filter: BookFilterArg, data: updateBookPayload) [Book]
}

See above for the structure and syntax of the filter query. You can also see an additional id argument that will supersede the filter; this makes it easy to update a single document by a given ID.

More important than the update filter is the update payload. Currently, all update payloads use the JSON Merge Patch system.

JSON Merge Patch is very similar to a traditional JSON object, with a few semantic differences that are important for updates. Most significant is how to remove or delete a field value in a document: to remove a field with JSON Merge Patch, we provide a null value in the JSON object.

Here's an example.

{
    "name": "John",
    "rating": nil
}

This Merge Patch sets the name field to "John" and deletes the rating field value.

Once we create our update and select which document(s) to update, we can query the new state of all documents affected by the mutation. This is because our update mutation returns the type it mutates.

Here's a basic example.

mutation {
    update_book(id: '123', data: "{'name': 'John'}") {
        _key
        name
    }
}

Here we can see that, after applying the mutation, we return the _key and name fields. We can return any field from the document, not just the ones updated. We can even return and filter on related types.

Beyond updating by an ID or IDs, we can use a query filter to select which documents to apply our update to. This filter works the same as in queries.

mutation {
    update_book(filter: {rating: {_lte: 1.0}}, data: "{'rating': 1.5}") {
        _key
        rating
        name
    }
}

Here, we select all documents with a rating less than or equal to 1.0, update the rating value to 1.5, and return all the affected documents _key, rating, and name fields.

For additional filter details, see the above Query Block section.

Upsert

Delete

Delete mutations allow developers and users to remove objects from collections. You can delete using specific Document Keys, a list of doc keys, or a filter statement.

The document selection interface is identical to the Update system. Much like the update system, we can return the fields of the deleted documents.

Here's the structure of the generated delete mutation for a Book type.

mutation {
    delete_book(id: ID, ids: [ID], filter: BookFilterArg) [Book]
}

Here we can delete a document with ID '123':

mutation {
    delete_user(id: '123') {
        _key
        name
    }
}

This will delete the specific document, and return the _key and name for the deleted document.
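A minimal sketch of deleting several documents at once using the ids argument from the structure above (the specific keys are illustrative):

mutation {
    delete_user(ids: ['123', '456']) {
        _key
        name
    }
}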

DefraDB currently uses a Hard Delete system, which means that when a document is deleted, it is completely removed from the database.

Similar to the Update system, you can use a filter to select which documents to delete.

mutation {
    delete_user(filter: {rating: {_gt: 3}}) {
        _key
        name
    }
}

Here we delete all the matching documents (documents with a rating greater than 3).

Database API

So far, all the queries and mutations have been specific to the stored and managed developer- or user-created objects. However, that is only one aspect of DefraDB's GraphQL API. The other part is the auxiliary APIs, which include MerkleCRDT traversal, Schema Management, and more.

MerkleCRDTs

Internally, all objects in DefraDB are stored in some kind of MerkleCRDT (See DefraDB Tech Specs for more details). These MerkleCRDTs are represented as a series of small updates, or Deltas, connected in a MerkleDAG. The MerkleDAG is a Merklized version of a DAG (Directed Acyclic Graph), which means each node in the DAG references a parent node through some kind of Content Identifier (CID).


Here is an example structure of a MerkleDAG.

The Head CID represents the "current" or "latest" state of a MerkleDAG.

DefraDB allows users and developers to query, traverse, and validate the DAG structure, allowing for self-verifying data structures.

In the DefraDB Database API, the DAG nodes are represented as Commit, CommitLink, and Delta types. They are defined as follows:

// Commit is an individual node in a CRDTs MerkleDAG
type Commit {
    cid: String            // cid is the Content Identifier of this commit
    height: Int            // height is the incremental version of the current commit
    delta: Delta           // delta is the delta-state update generated by a CRDT mutation 
    previous: [Commit]     // previous is the links to the previous node in the MerkleDAG
    links: [CommitLink]    // links are any additional commits this commit may reference.
}

// CommitLink is a named link to a commit
type CommitLink {
    name: String           // name is the name of the CommitLink
    commit: Commit         // commit is the linked commit
}

// Delta is the differential state change from one node to another
type Delta {
    payload: String        // payload is a base64 encoded byte-array.
}

We can query for the latest commit of an object (with id: '123') like so:

query {
    latestCommit(docid: "123") {
        cid
        height
        delta {
            payload
        }
    }
}

We can query for all the commits of an object (with id: '123') like so:

query {
    allCommits(docid: "123") {
        cid
        height
        delta {
            payload
        }
    }
}

We can query for a specific commit:

query {
    commit(cid: 'Qm123') {
        cid
        height
        delta {
            payload
        }
    }
}

In addition to using the Commit specific queries, we can also include some commit version sub-fields in our object queries.

query {
    user {
        _key
        name
        age
        
        _version {
            cid
            height
        }
    }
}

This example shows how we can query for the additional _version field that's generated automatically for each added schema type. The _version field executes the same query as latestCommit.

Both _version and latestCommit return an array of Commit types. This is because the HEAD of the MerkleDAG can actually point to more than one DAG node, which is caused by two concurrent updates to the DAG at the same height. More often than not, the DAG will only have a single head, but it's important to understand that it can have multiple.