CHAPTER 2
Statistical and Relational AI Representations
Artificial intelligence (AI) is the study of computational agents that act intelligently [Russell and Norvig, 2010, Poole and Mackworth, 2010] and, although it has drawn on many research methodologies, AI research arguably builds on two formal foundations: probability and logic.
The basic argument for probability as a foundation of AI is that agents that act under uncertainty are gambling, and probability is the calculus of gambling in that agents who do not use probability will lose to those that do use it (see [Talbott, 2008], for an overview). While there are a number of interpretations of probability, the most suitable for the present book is a Bayesian or subjective view of probability: our agents do not encounter generic events, but have to make decisions in particular circumstances, and only have access to their percepts (observations) and their beliefs. Probability is the calculus of how beliefs are updated based on observations (evidence).
The basic argument for first-order logic is that an agent needs to reason about individuals, properties of individuals, and relations among these individuals, and at least be able to represent conjunctions, implications, and quantification. These requirements arise from designing languages that allow one to easily and compactly represent the necessary knowledge for a non-trivial task. First-order logic has become widely used in the theory of knowledge representation. It assumes that the world consists of individuals, among which various relations hold or do not hold. Logic provides a semantics, a specification of meaning, which agent designers can then choose to implement in various ways.
Both probability and predicate calculus can be seen as extending propositional logic, one by adding measures over the worlds, and the other by extending the propositional calculus to include individuals and relations. Since statistical relational AI (StarAI) aims to seamlessly combine these two strands of research, we first introduce each of them separately before presenting their possible integrations.
2.1 PROBABILISTIC GRAPHICAL MODELS
The semantics of probability [Halpern, 1990, 2003] can be defined in terms of possible worlds and random variables (although they are neither random nor variable). We can either define random variables in terms of possible worlds (a random variable is a function on possible worlds) or define possible worlds in terms of random variables (a possible world is an assignment of a value to every random variable). Either way, a random variable has a value in every world. A random variable having a particular value is a proposition. Probability is defined in terms of a non-negative measure over sets of possible worlds that satisfies some intuitive axioms (see, e.g., [Halpern, 2003]). This means that a probabilistic graphical model defines a probability distribution over its possible worlds, or equivalently a joint probability distribution over the propositions, or the assignments of values to the random variables.
In Bayesian probability, we make explicit assumptions and the conclusions are consequences of the specified knowledge and assumptions. The assumptions are about the space of possible hypotheses and the prior probabilities. One particular explicit assumption is the assumption of conditional independence. Graphical models [Pearl, 1988, Lauritzen, 1996, Koller and Friedman, 2009] provide representations of such conditional independence of random variables in terms of graphs. A Bayesian network [Pearl, 1988, Darwiche, 2009] is an acyclic directed graphical model of probabilistic dependence where the nodes are random variables. A Bayesian network encapsulates a directed form of independence between the random variables in the graph: a variable is conditionally independent of its non-descendants given its parents. This has turned out to be a very useful assumption in practice. A Bayesian network requires a representation of the conditional probability of each variable given its parents.
Undirected graphical models, or Markov networks [Pearl, 1988], are defined in terms of factors (or potentials) among random variables. A factor represents a function of a set of random variables; for example, a factor on variables {X, Y, Z} is a function that, given a value for each variable, returns a non-negative number. The nodes in a Markov network correspond to random variables, and the variables in a factor are neighbors of each other in the graph. These models encapsulate the assumption that a variable is independent of other variables given all of its neighbors. The next two sections give more details.
2.1.1 BAYESIAN NETWORKS
Formally, a Bayesian network (or belief network) is an augmented, directed acyclic graph (DAG), where each node corresponds to a random variable Xi and each edge indicates a direct influence among the random variables. The influence for each variable Xi is quantified with a conditional probability distribution (CPD) P(Xi | Pa(Xi)), where Pa(Xi) are the parents of Xi in the graph. The Bayesian network represents a set of independence assumptions: a variable is independent of the other variables that are not its descendants given its parents.
Example 2.1 As an example of a Bayesian network, consider Judea Pearl’s famous burglary alarm network. The Bayesian network in Fig. 2.1 contains the random variables burglary, earthquake, mary_calls, john_calls, and alarm. The network specifies that burglary and earthquake are independent, and that john_calls is independent of the other variables given alarm. Assume that the random variables are all Boolean, i.e., their range is {true, false}; then the Bayesian network defines a probability distribution over truth-assignments to {alarm, earthquake, mary_calls, john_calls, burglary}. To do so, it makes use of the CPDs associated with each of the nodes, which are specified in Table 2.1. They include the CPDs P(alarm | burglary, earthquake), P(earthquake), etc.
Figure 2.1: A Bayesian network for burglary alarms (adapted from Judea Pearl). Nodes denote random variables and edges denote direct influences among the random variables.
Table 2.1: Conditional probability distributions associated with the nodes in the burglary alarm network, cf. Fig. 2.1
The Bayesian network thus has two components: a qualitative one—the directed acyclic graph—and a quantitative one—the conditional probability distributions. Together they specify the joint probability distribution.
Because of the conditional independence assumption, we can write down the joint probability distribution as

P(X1, …, Xn) = ∏i P(Xi | Pa(Xi)).
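To make this factorization concrete, here is a small Python sketch that encodes the alarm network as a parent structure with conditional probability tables and evaluates the joint probability of one complete assignment. The numbers are illustrative placeholders, not the entries of Table 2.1.

```python
# A minimal sketch of the burglary alarm Bayesian network.
# The CPT numbers below are illustrative placeholders, not Table 2.1.

parents = {
    "burglary": [],
    "earthquake": [],
    "alarm": ["burglary", "earthquake"],
    "john_calls": ["alarm"],
    "mary_calls": ["alarm"],
}

# cpt[X] maps a tuple of parent values to P(X = true | parents).
cpt = {
    "burglary": {(): 0.01},
    "earthquake": {(): 0.02},
    "alarm": {(True, True): 0.95, (True, False): 0.94,
              (False, True): 0.29, (False, False): 0.001},
    "john_calls": {(True,): 0.90, (False,): 0.05},
    "mary_calls": {(True,): 0.70, (False,): 0.01},
}

def joint_probability(assignment):
    """P(x1, ..., xn) = product over i of P(xi | parents(xi))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        p_true = cpt[var][parent_values]
        prob *= p_true if value else 1.0 - p_true
    return prob

world = {"burglary": False, "earthquake": False,
         "alarm": True, "john_calls": True, "mary_calls": False}
print(joint_probability(world))
```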
The independence assumption, which states that each variable is independent of its non-descendants given its parents, implies other independencies. These are axiomatized by d-separation [Pearl, 1988], which allows one to infer all independencies that are implied by the independence assumptions of Bayesian networks.
2.1.2 MARKOV NETWORKS AND FACTOR GRAPHS
Conceptually, a Markov random field [Pearl, 1988] is an undirected analogue of a Bayesian network. A Markov network is defined in terms of a set of discrete-valued random variables, X = {X1, X2, …, Xn} and a set of factors {f1, …, fm}, where a factor is a non-negative function of a subset of the variables.
A Markov network represents the joint distribution over X as proportional to the product of the factors, i.e.,

P(X = x) = (1/Z) ∏k fk(xk),

where fk(Xk) is a factor on Xk ⊆ X, xk is x projected onto Xk, and Z = ∑x ∏k fk(xk) is a normalization constant known as the partition function.
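As a concrete illustration, the following Python sketch builds a tiny Markov network over two Boolean variables, computes the partition function Z by brute-force enumeration of the possible worlds, and prints the resulting distribution. The factor values are made up for the example.

```python
from itertools import product

# A toy Markov network over two Boolean variables X and Y with two factors.
# The factor values are made up for illustration.
variables = ["X", "Y"]

def f1(x):          # factor on {X}
    return 2.0 if x else 1.0

def f2(x, y):       # factor on {X, Y}: prefers X and Y to agree
    return 3.0 if x == y else 1.0

def unnormalized(assignment):
    x, y = assignment["X"], assignment["Y"]
    return f1(x) * f2(x, y)

# The partition function Z sums the product of factors over all possible worlds.
worlds = [dict(zip(variables, values))
          for values in product([False, True], repeat=len(variables))]
Z = sum(unnormalized(w) for w in worlds)

for w in worlds:
    print(w, unnormalized(w) / Z)
```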
As long as the factors are all strictly positive (they always return a positive number), the distribution can be represented as a log-linear model:

P(X = x) = (1/Z) exp(∑k wk gk(x)),

where each gk(·) is a function of (a subset of) the configuration x and the wk are real-valued weights; for instance, taking gk = log fk and wk = 1 recovers the product form above.
A Markov network is a graphical representation of a Markov random field where the nodes are the random variables and there is an arc between any two variables that are in a factor together.
As for Bayesian networks, the structure of a Markov network encodes conditional independencies. Basically, two nodes X and Y in a network are conditionally independent given variables Z1, …, Zn (i.e., P(X | Y, Z1, …, Zn) = P(X | Z1, …, Zn)) if and only if all paths from X to Y contain one of the variables Zi. The Markov network for the alarm example is illustrated in Fig. 2.2.
A Bayesian network can be seen as a Markov random field, where each conditional probability is a factor. Thus any inference algorithm developed for Markov networks can be applied to Bayesian networks. Many of the inference algorithms are designed for Markov networks for this reason. However, Bayesian networks allow for specialized algorithms, such as recursively pruning variables that are not observed, not queried, and have no children. Any Markov network can also be represented as a Bayesian network by scaling each factor to be in the range [0, 1] (by multiplying by a constant), creating a new variable tk for each factor fk and defining P(tk = true | xk) = fk(xk), and conditioning on tk = true.
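To see why the second construction works, suppose each original variable Xi is additionally given a (say uniform) prior P0(Xi) in the constructed Bayesian network, a detail not spelled out above, and let ck be the scaling constant used for factor fk. Then

P(x | t1 = true, …, tm = true) ∝ ∏i P0(xi) ∏k P(tk = true | xk) = ∏i P0(xi) ∏k ck fk(xk) ∝ ∏k fk(xk),

so conditioning on all tk = true recovers exactly the distribution of the Markov random field.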
An alternative way to depict a Markov random field is in terms of a factor graph. A factor graph is a bipartite graph, which contains a variable node for each random variable Xi and a factor node for each factor fk. There is an edge between a variable node and a factor node if and only if the variable appears in the factor. This is often the preferred representation for describing probabilistic inference methods that send messages between variables and factors. The factor graph for the alarm Bayesian network is shown in Fig. 2.3.
Figure 2.2: The Markov network for the burglary alarm network.
Figure 2.3: The factor graph representing the burglary alarm Bayesian network.
A Markov network can be obtained from a factor graph by adding arcs between all of the neighbors of each factor node and dropping the factor nodes. A factor graph can be obtained from a Markov network by creating a factor for each clique in the Markov network. However, mapping to a Markov network loses information. For example, the Markov random field with a single factor f0(X, Y, Z) has the same Markov network as the Markov random field with factors {f1(X, Y), f2(Y, Z), f3(X, Z)}, but they have different factor graphs. The network with the single factor can represent more distributions than the one with only pairwise factors, e.g., the distribution in which any two of the variables are independent of each other but the three are constrained to have even parity.
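The parity example can be checked directly. The Python sketch below enumerates the even-parity distribution over three Boolean variables and verifies that every pairwise marginal is uniform, so any two variables are independent even though the three are jointly constrained; the encoding is just one possible illustration.

```python
from itertools import product

# Even-parity distribution over three Boolean variables:
# uniform over the assignments with an even number of True values.
worlds = [w for w in product([False, True], repeat=3) if sum(w) % 2 == 0]
p = {w: 1.0 / len(worlds) for w in worlds}   # 4 worlds, each with probability 1/4

def marginal(i, j):
    """Joint marginal distribution of variables i and j."""
    m = {}
    for w, pw in p.items():
        m[(w[i], w[j])] = m.get((w[i], w[j]), 0.0) + pw
    return m

# Each pairwise marginal is uniform (0.25 for every value combination),
# so any two variables look independent, yet the third is fully determined
# by the other two -- no product of pairwise factors can represent this.
print(marginal(0, 1))
print(marginal(0, 2))
print(marginal(1, 2))
```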
Figure 2.4: A logic program defining grandparent.
Figure 2.5: A logic program defining nat.
2.2 FIRST-ORDER LOGIC AND LOGIC PROGRAMMING
The probabilistic models introduced in the previous section have a fixed finite number of random variables. Boolean random variables correspond to propositions in logic, and propositions can also represent more complex random variables. First-order logic allows for an unbounded number of propositions by including logical variables, constants, and functions. The same ideas can be used for random variables. To understand this, we now introduce first-order logic and its directed variant, logic programs.
To introduce logic and logic programs, consider Figs. 2.4 and 2.5, which contain two programs, grandparent and nat. In these programs grandparent/2, parent/2, and nat/1 are predicates (with their arity, the number of arguments, listed explicitly). The strings jef, paul, ann, and 0 are constants, whereas X, Y, and Z are logical variables, which we will just call variables when there is no confusion with random variables. All constants and variables are also terms. In addition, there exist structured terms such as s(X), which contains the functor s/1 of arity 1 and the term X. Constants are often considered as functors of arity 0. A first-order alphabet Σ is a set of predicate symbols, constant symbols, and functor symbols.
Atoms are predicate symbols followed by the necessary number of terms, e.g., parent(jef, paul), nat(s(X)), parent(X, Z), etc. A literal is either an atom, e.g., nat(s(X)) (called a positive literal), or the negation of an atom, e.g., ¬nat(s(X)) (called a negative literal).
First-order logical formulas are recursively constructed from atoms using logical connectives (¬, ⋁ and ⋀) and quantifiers (∀ and ∃). If f1 and f2 are formulas then the following are formulas, too:
• ¬f1 (negation), which is true iff f1 is not true;
• f1 ⋀ f2 (conjunction), which is true iff both f1 and f2 are true;
• f1 ⋁ f2 (disjunction), which is true iff either f1 or f2 is true;
• f1 → f2, or f2 ← f1 (implication), which is true iff f1 is false or f2 is true;
• f1 ↔ f2 (equivalence), which is shorthand for (f1 → f2) ∧ (f2 → f1);
• ∀X f1 (universal quantification), which is true iff f1 is true for every individual that X could refer to; and
• ∃X f1 (existential quantification), which is true iff f1 is true for at least one individual.
An important subset of first-order logic is that of clausal logic. A clausal theory consists of a set of clauses. A clause is a formula of the form

A1 ∨ … ∨ An ← B1 ∧ … ∧ Bm

where the Aj and the Bi are atoms. All variables in clauses are understood to be universally quantified. Such a clause is logically equivalent to

A1 ∨ … ∨ An ∨ ¬B1 ∨ … ∨ ¬Bm.
Example 2.2 Two example clauses are:

male(X) ∨ female(X) ← human(X)
false ← male(X) ∧ female(X)

The first clause states that, for all X, if X is human, then X is either male or female. The second clause states that, for all X, X cannot be both male and female, that is, no one is both male and female. Here, false is an atom that is always false; equivalently, the head of the second clause can be seen as the empty disjunction (n = 0).
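As a small illustration of how such universally quantified clauses are evaluated, the following Python sketch checks both clauses against a made-up interpretation over two constants a and b; the set of true atoms is invented for the example.

```python
# A made-up Herbrand-style interpretation over the constants a and b:
# the set of ground atoms taken to be true.
domain = ["a", "b"]
true_atoms = {("human", "a"), ("human", "b"), ("male", "a"), ("female", "b")}

def holds(pred, const):
    return (pred, const) in true_atoms

# for all X: human(X) -> male(X) v female(X)
clause1 = all((not holds("human", x)) or holds("male", x) or holds("female", x)
              for x in domain)

# for all X: not (male(X) and female(X))
clause2 = all(not (holds("male", x) and holds("female", x)) for x in domain)

print(clause1, clause2)   # True True: this interpretation satisfies both clauses
```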
A practical subset of clausal logic is that of logic programs, a subset that contains no disjunctive uncertainty: what is given up is the ability to state that a ∨ b is true without saying which one of a or b is true. Logic programs are of interest because they have an intuitive declarative and procedural interpretation [Kowalski, 2014].
A definite clause is a clause with exactly one atom in the conclusion part (that is, n = 1). Following the Prolog tradition, definite clauses are usually written with the conjunction represented by a comma and “if” represented by “:-”:

A :- B1, …, Bm.

A is the head of the clause, and B1, …, Bm is the body of the clause. If m = 0 the implication “:-” can be omitted and the clause is called a fact. A logic program is a set of definite clauses.
Example 2.3 The definite clause

grandparent(X, Y) :- parent(X, Z), parent(Z, Y)

can be read as: for all X, for all Y, and for all Z: X is a grandparent of Y if X is a parent of Z and Z is a parent of Y. grandparent(X, Y) is the head of the clause, and parent(X, Z), parent(Z, Y) is the body of the clause.
Example 2.4 Figure 2.4 shows a logic program defining grandparent/2 with two parent/2 facts and Fig. 2.5 a program defining nat/1.
Reasoning with logic involves manipulating formulas as data structures. An expression is a term, atom, conjunction, or clause. The set of logical variables in an expression E is denoted as Var(E). For example, in the clause c defined in Example 2.3, Var(c) = {X, Y, Z}. An expression E is ground when no variable occurs in E, i.e., Var(E) = ∅. To connect variables with (structured) ground terms, we make use of substitutions.
A substitution is of the form θ = {V1/t1, …, Vn/tn}, e.g., {Y/ann}, where the Vi are distinct logical variables and the ti are terms. Applying a substitution θ to an expression e yields the expression eθ, in which all occurrences of the variables Vi are simultaneously replaced by the corresponding terms ti. For example, if c is the definite clause of Example 2.3 above, c{Y/ann} is

grandparent(X, ann) :- parent(X, Z), parent(Z, ann).
A substitution θ is called a unifier for two expressions e1 and e2 if e1θ = e2θ, that is, if the expressions become identical after applying the substitution. For instance, the substitution {X/jef, Y/ann} is a unifier for parent(X, ann) and parent(jef, Y).
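The following Python sketch shows substitution application and a simple unification procedure; the tuple encoding of terms is an assumption of the sketch, not notation from the book, and the occurs check is omitted.

```python
# Terms are encoded as: variables = strings starting with an uppercase letter,
# constants = lowercase strings, and atoms/compound terms = tuples
# (functor, arg1, ..., argn). This encoding is an assumption of the sketch.

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def walk(term, theta):
    """Follow variable bindings in theta until an unbound term is reached."""
    while is_var(term) and term in theta:
        term = theta[term]
    return term

def apply_subst(term, theta):
    """Apply substitution theta (a dict Var -> term) to a term or atom."""
    term = walk(term, theta)
    if isinstance(term, tuple):
        return (term[0],) + tuple(apply_subst(a, theta) for a in term[1:])
    return term

def unify(t1, t2, theta=None):
    """Return a unifier extending theta, or None if the terms do not unify.
    The occurs check is omitted for brevity."""
    if theta is None:
        theta = {}
    t1, t2 = walk(t1, theta), walk(t2, theta)
    if t1 == t2:
        return theta
    if is_var(t1):
        return {**theta, t1: t2}
    if is_var(t2):
        return {**theta, t2: t1}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a1, a2 in zip(t1[1:], t2[1:]):
            theta = unify(a1, a2, theta)
            if theta is None:
                return None
        return theta
    return None

theta = unify(("parent", "X", "ann"), ("parent", "jef", "Y"))
print(theta)                                        # {'X': 'jef', 'Y': 'ann'}
print(apply_subst(("parent", "X", "ann"), theta))   # ('parent', 'jef', 'ann')
```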
It may seem that it is impossible to reason about clauses where variables can represent anything, e.g., numbers, people, or symbols. It turns out that if a set of clauses has a model where the variables can represent anything, it has a model where the variables denote symbolic objects in the Herbrand base. The Herbrand base of a set of clauses P, denoted as hb(P), is the set of all ground atoms constructed with the predicate, constant and function symbols in the alphabet of P (and inventing a new constant if P doesn’t contain any constants).
Example 2.5 The Herbrand bases of the grandparent and nat logic programs in Figs. 2.4 and 2.5 are

hb(grandparent) = {parent(x, y), grandparent(x, y) | x, y ∈ {jef, paul, ann}},
hb(nat) = {nat(0), nat(s(0)), nat(s(s(0))), …}.

hb(grandparent) has 18 elements and hb(nat) is infinite.
When considering the constants a and b, the clausal theory of Example 2.2 has the following Herbrand base:

{human(a), human(b), male(a), male(b), female(a), female(b)}.
A Herbrand interpretation for a set of clauses P is a subset of hb(P). A Herbrand interpretation I represents the possible world in which all of the elements of I are true and the elements of hb(P) \ I are false. A Herbrand interpretation I is a model of a clause c if and only if, for all substitutions θ that ground all variables in c such that body(c)θ ⊆ I holds, it also holds that at least one atom of head(c)θ is in I. An interpretation I is a model of a set of clauses P if I is a model of all clauses in P. A clausal theory P entails a clausal theory P′, denoted P ⊨ P′, if and only if each model of P is also a model of P′.
Example 2.6 The following two interpretations are models for Example 2.2, which shows that a set of clauses can have more than one model:
These models are also minimal w.r.t. set inclusion.
While clausal theories may have no, one, or more minimal models, a definite clause program has a unique least Herbrand model. The least Herbrand model LH(P) consists of all ground atoms f ∈ hb(P) such that P logically entails f, i.e., P ⊨ f. It is also a minimal model. All ground atoms in the least Herbrand model are provable.
Example 2.7 The least Herbrand models of our programs from Example 2.5 are as follows: LH(grandparent) contains the parent/2 facts of Fig. 2.4 together with the grandparent/2 atoms they entail, and LH(nat) = hb(nat) = {nat(0), nat(s(0)), nat(s(s(0))), …}. That is, every atom in the Herbrand base of nat is provable.
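For a function-free program such as grandparent, the least Herbrand model can be computed by naive forward chaining: repeatedly add the head of every ground rule instance whose body atoms are already in the model, until nothing changes. The Python sketch below does this for the grandparent rule with two hypothetical parent/2 facts standing in for the facts of Fig. 2.4.

```python
from itertools import product

# The grandparent rule, plus two hypothetical parent/2 facts standing in for
# the facts of Fig. 2.4. Atoms are tuples (predicate, arg1, ..., argn) and
# variables are uppercase strings.
facts = {("parent", "jef", "paul"), ("parent", "paul", "ann")}
rules = [
    # head                       body
    (("grandparent", "X", "Y"), [("parent", "X", "Z"), ("parent", "Z", "Y")]),
]
constants = ["jef", "paul", "ann"]

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(a, a) for a in atom[1:])

def least_herbrand_model(facts, rules, constants):
    """Naive bottom-up computation for a function-free (Datalog) program."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            variables = sorted({a for atom in [head] + body
                                for a in atom[1:] if is_var(a)})
            for values in product(constants, repeat=len(variables)):
                theta = dict(zip(variables, values))
                if all(substitute(atom, theta) in model for atom in body):
                    ground_head = substitute(head, theta)
                    if ground_head not in model:
                        model.add(ground_head)
                        changed = True
    return model

print(least_herbrand_model(facts, rules, constants))
# With these facts: the two parent atoms plus ('grandparent', 'jef', 'ann').
```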
In Part II on inference, we will discuss how to use clausal logic for inference and reasoning. This will also include the computation of the least Herbrand model of a logic program. For a detailed introduction to logic programming, see Lloyd [1989]; for a gentler introduction, see Flach [1994]; and for a detailed discussion of Prolog, see Sterling and Shapiro [1986].