Understanding the Datalog Bottom-Up Algorithm

# Understanding the Datalog Bottom-Up Algorithm In this document, we will briefly dive into how our implementation of Datalog works. We will begin with a brief introduction of its major syntactical and semantic elements and then explore how they are combined to do increasingly interesting things. ## Reading and Writing Datalog The primary semantic units of Datalog, that is, the individual things which meaningfully make up a Datalog program, are terms, literals (or atoms), and clauses (or rules). **Terms** are either *constants* of type `'T`, where `'T` is often the string type, or *variables*, which hold no information themselves aside from an identifying integer. The identifying integer is used in place of differing variable names, so instead of e.g. `parent(X, Y)`, one might see `parent(X0, X1)` in the output of our implementation. However, the use of differing variable names in program input is completely acceptable. **Literals**, or atoms, are collections of terms. Literals are often used for representing *facts*. For example, `parent(john, gracie)` is a *fact*, but its representation is merely a collection of constant terms: `[| Const "parent" ; Const "john" ; Const "gracie" |]`. It is implicitly understood that the first term in a literal, known as the *head*, provides a name or description for the kind of fact that the literal describes. **Clauses**, or rules, are collections of literals. Clauses are typically used to designate that when a collection of facts hold for a term or a set of terms, a new fact may be assumed to be true about said terms. For example, take the following set of clauses: ```prolog ancestor(X, Y) :- parent(X, Y). ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y). ``` This clause, defined recursively with a base case, designates the following: A person `X` is an ancestor of the person `Y` if either sets of the following facts are known: 1. That `X` is a parent to `Y`. 2. That `X` is a parent to some `Z` who is themselves an ancestor to `Y`. If we find that some person `X` does match the above criteria with respect to a person `Y`, a new fact is known about the person `X` as well as about the person `Y`, and that brings us to the question: what can we do with all these declarations and inferences of facts? ## Pattern-matching, Querying, and Solving Goals At its core, the point of Datalog is to provide an interface wherein a number of facts and rules can be stated outright and added to an internal "database" which can then be queried arbitrarily. Consider the following example file included in the original OCaml implementation: ```prolog mother(claudette, ann). mother(jeannette, bill). father(john, ann). father(john, bill). father('jean-jacques', alphonse). father(alphonse, mireille). mother(mireille, john). % married a Hollywood actor! father(brad, john). parent(X,Y) :- mother(X,Y). parent(X,Y) :- father(X,Y). ancestor(X,Y) :- parent(X,Y). ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y). ``` Now, we have enumerated many a fact in the form of declarations of who is whose father or mother, as well as a couple of rules designating further relationships that can be inferred through the stated facts. Now suppose we want to have our implementation enumerate to *us* some things that it ought to be able to infer from our facts and rules. For example, say we'd like to know to whom Brad is an ancestor. We may query our implementation like so (from within the folder `src/bottom_up_cli`): `dotnet run small.pl --pattern "ancestor(brad, X)"`, which returns to us the following result: ``` % parse file small.pl.% process 12 clauses. % computing fixpoint.... % done.. % facts matching pattern ancestor(brad, X0): ancestor(brad, john) ancestor(brad, ann) ancestor(brad, bill) ``` Indeed, Brad is a father to John, who is himself a father to Ann and Bill. Thus, our implementation has correctly shown us that Brad is an ancestor to John, Ann and Bill. This gives us a rough idea of what basic interaction with a Datalog fact database looks like. There are some additional interfaces that provide even more interesting ways to seek information from our database. Consider, for example, the query functionality which allows us to express a query as specific as: "Find me some person X and some person Y such that X is an ancestor to John, X is a father to Y, and Y is not a mother." This query would be passed on to the program as follows: `dotnet run small.pl --query "(X,Y) :- ancestor(X,john), father(X,Y), not mother(Y,Z)"`. Let's quickly break down the query to see how it matches up to the description provided: * `(X, Y) :-` reads "Find X and Y such that ..." * `ancestor(X, john)` states that X must be an ancestor to John * `father(X, Y)` states that X must be a father to Y * `not mother(Y, Z)` states that Y must not be a mother to Z, who could be anyone. Thus, Y must not be a mother in general. The result is the following: ``` % parse file small.pl.% process 12 clauses. % computing fixpoint.... % done.. % query plan: ... // Removed for brevity % query answer: Const "brad" , Const "john" ; Const "'jean-jacques'" , Const "alphonse" ; ] ``` The implementation correctly finds that the two pairs of individuals matching the query are Brad and John, and Jean-Jacques and Alphonse. ### Goals and Goal Handlers The last interface we will talk about before moving on to implementation details is that of goals. Take, for example, the following example file from the original implementation: ```prolog reachable(X,Y) :- edge(X,Y). reachable(X,Y) :- edge(X,Z), reachable(Z,Y). increasing(X,Y) :- edge(X,Y), lt(X,Y). increasing(X,Y) :- edge(X,Z), lt(X,Z), increasing(Z,Y). edge(0, 1). edge(1, 2). edge(2, 3). edge(3, 4). edge(4, 5). edge(5, 6). edge(6, 7). edge(7, 8). edge(8, 9). edge(9, 10). edge(10, 0). ``` Note that the `increasing` clause makes use of a fact called `lt` or "less than", but that this fact is *never* explicitly declared with respect to any of the numbers in the graph. The bottom up algorithm has support for "goals". A goal is a literal which may be ground (only containing constants) or non-ground (containing variables) that represents one or more facts in the database which can only be found to be true through some additional computation that falls outside of the confines of simple matching via clauses. In this case, there is no simple relation that can be used to define the relationship described by `lt`. At some level, we need to be able to perform the computation comparing two numbers to see if one is less than the other, and that simply cannot be described from within an input file to our program. Goals allow us to solve this issue by providing support for "handlers": functions that run for every discovered goal (note that goals can often lead to the discovery of sub-goals that help fulfill the original goal) and can help solve the goal. Take, for example, the following goal handler defined within our implementation: ```fsharp let handleGoal (db: DLogic.Database<string>) lit = match (DLogic.Datalog.open_literal lit) with | "lt", [ DLogic.Const a; DLogic.Const b ] when compare (int a) (int b) < 0 -> db.AddFact lit | "le", [ DLogic.Const a; DLogic.Const b ] when compare (int a) (int b) <= 0 -> db.AddFact lit | "equal", [ DLogic.Const a; DLogic.Const b ] when (int a) = (int b) -> db.AddFact lit | _ -> () ``` This handler essentially says "If the goal (or sub-goal) states that `a` is less than `b` and `a` *is* in fact less than b, add the fact `lt(a, b)` to the database, and so forth." This ensures that even though the fact `lt(1, 2)` for example is not explicitly stated in the input file, it will eventually be added as a fact to the database, and can thus be used in the following invocation: `dotnet run graph10.pl --goal "increasing(X,Y)" --pattern "increasing(3,X)"` which produces the following output: ``` % parse file graph10.pl.% process 15 clauses. % computing fixpoint.... % done.. % facts matching pattern increasing(3, X0): increasing(3, 4) increasing(3, 5) increasing(3, 6) increasing(3, 7) increasing(3, 8) increasing(3, 9) increasing(3, 10) ``` By declaring that our goal is to find out whether the clause `increasing(X,Y)` holds for any `X` and `Y`, we ensure that the algorithm only bothers with finding out whether some number `X` is less than `Y` in as many cases as is necessary to answer any query about whether `increasing(X,Y)` holds. To clarify what exactly this entails, suppose we issued the following invocation: `dotnet run graph10.pl --goal "increasing(X,Y)" --pattern "lt(1, X)"`. This should correctly show us the following output: ``` % parse file graph10.pl.% process 15 clauses. % computing fixpoint.... % done.. % facts matching pattern lt(1, X0): lt(1, 2) ``` Note that although, e.g., `lt(1, 3)` is technically true, it wasn't necessary to determine this fact in order to answer any potential queries about whether `increasing(X,Y)` holds, based on the set of facts already in the database. Note that the definition of `increasing` only cares whether `X` is less than `Y` or `Z` if `X` forms an edge with `Y` or `Z`. Since 1 never forms an edge with 3, and our goal is to be able to answer queries about the clause `increasing`, the fact `lt(1,3)` is never added to the database. Similarly, we note that if we explicitly make our goal, e.g., `increasing(3,X)` as in the following invocation: `dotnet run graph10.pl --goal "increasing(3,X)" --pattern "lt(1,X)`, then the fact `lt(1,2)` is actually never discovered, as it was completely unnecessary to fulfilling the goal. We see that this is the case in the following output: ``` % parse file graph10.pl.% process 15 clauses. % computing fixpoint.... % done.. % facts matching pattern lt(1, X0): ```