A Datalog rule is similar to a view definition in relational database systems. It states that if certain tuples exist in specific relations, then an additional tuple is assumed to exist in a “virtual” or derived relation. It can also be considered an inference rule for deducing new facts from existing ones. A rule has a head followed by a body, separated by the symbol :–, which is intended to resemble a left-pointing implication arrow. The head indicates the tuple being defined, and the body is a comma-separated list of facts, in which the commas are interpreted as “and”. A rule generally contains one or more logical variables, which begin with uppercase letters.2 Note that, under these conventions, First
would be a variable, while 'First'
a constant symbol. The rule asserts that the implication holds for all possible substitutions of the logical variables. For example, suppose we want to define a relation with certain information about people who received a Ph.D. in the area of Computer Science. We could use the following Datalog rule:
One way to understand a Datalog rule is as standing for all ground instances of the rule obtained by consistently substituting constants for the logical variables. Two such instances of the rule above are
The effect of a rule is to extend the database by the fact in the head for every such ground instance whenever all the facts in the body have been previously established—either because they are given explicitly as facts or because they were obtained from an instance of some rule using the same reasoning. Considering the two instances above, the first establishes the fact
as all the facts in the body are known to be true. However, the second rule instance does not establish the fact
as the fact person(jbf,'Joyce','Hansen')
is not one listed in this example program (nor can it be established by the rules above).
Another way to regard this rule is as a view definition, which we could write in SQL3 as
Note that there is a difference in Datalog and SQL in referring to attributes: Datalog relies on positional notation, while SQL has explicit naming. Also, in SQL variables range over tuples while in Datalog they range over individual domain elements. For those familiar with relational database theory, the difference is essentially that between the domain and tuple relational calculi.
Datalog allows multiple rules with the same predicate in the head, in which case the database is extended by the facts established by any of the rules. For example, if we wanted to derive all people in an adjacent generation to a given person (that is, the person’s direct advisors or advisees), we could use the two rules
We can view a Datalog program as describing a database, with some tuples given by facts and others established by rules. There are various conventions for expressing a query and the result against that database. For the moment, we will use a single predicate pattern, with zero or more variables, preceded by ?-
, as a query. The result of the query will be the set of all matching facts in the database specified by the Datalog program. Thus, if our program is all the previous facts and rules, then the query
has the result set
where the first two tuples are established by the first adjacent
rule, and the other two by the second. A query with no variables functions as a membership test, returning a singleton set if the corresponding tuple is in the database of the Datalog program, and an empty set if not. So the query
returns the singleton answer
but the query
returns an empty result.
While the two rules for adjacent
can be captured by a view using UNION
in SQL, Datalog can express virtual relations that SQL cannot, via recursion in rules.4 Suppose we want to consider two people “related” if they have a common academic ancestor. We can describe this relation with three rules:
We see that two of the rules establish related
-facts recursively, from other related
-facts. So, for example, we can use the following instance of the first rule
to establish the fact related(lvk, ks)
, then use that fact with the following instance of the second rule
to establish the fact related(jmy, kv)
. Note that while a newly established related
-fact might be able to establish a further related
-fact, the process must eventually stop. Every related
-fact involves person IDs from the original advised
-facts, and there are a finite number of pairs of such IDs.
More generally, we view the meaning of a Datalog program P consisting of facts and rules as the database of ground facts obtained starting with the original facts and applying the rules to derive additional facts until there are no changes. In this context it is useful to view P as a transformation (sometimes called the immediate consequence operator) that takes a set of ground facts as input and augments them with the additional facts that can be established with a single application of any fact or rule in P. If we start with an empty input, the first “application” of P adds all the facts in P. A second application adds all facts that can be established with one rule application, a third application adds facts established with two rule applications, and so on. Eventually this process converges when some application of P adds no new facts. Thus, we can view the meaning of P as the least fixed point [Lloyd 1993, Van Emden and Kowalski 1976] under application of P’s rules and facts. It turns out that, for Datalog, that the meaning for P is also the least model for P, in the sense that it is the minimal set of facts that includes all the facts for P and makes every ground instance of a rule in P true. (We will see later that this equivalence must be adjusted in order to handle negation.)
We note that the process of repeated applications of P as described above is actually a viable procedure for computing the meaning of a Datalog program, and is referred to as bottom-up evaluation. A query is then matched against the final set of established facts to find its answers. An alternative approach is top-down evaluation,