Читать книгу Statistical Relational Artificial Intelligence - Luc De Raedt - Страница 12
ОглавлениеCHAPTER 3
Relational Probabilistic Representations
Probability theory can be seen as extending the propositional calculus to include uncertainty; we can ask for the probability of a proposition conditioned on a proposition. Likewise, the (first-order) predicate calculus can be seen as extending propositional calculus to include relations (represented by predicate symbols), individuals, and logical variables that can be universally or existentially quantified. Relational probabilistic representations can be seen as extending the predicate calculus to include probabilities, or extending probabilistic models to include relations, individuals, and variables.
Probability has seen a resurgence in AI due mainly to exploiting the independency inherent in graphical models, and most relational probabilistic models are built from these graphical models. In relational probabilistic models, the random variables are built from individuals and relationships among individuals. Just as the first-order predicate calculus extends the propositional calculus to reason about individuals and relations, relational probabilistic models extend probabilistic models, such as Bayesian networks and Markov networks, to allow reasoning about individuals and relations.
With such models, a reasoning agent observes individuals, their properties, and relations among them, and conditioning on these allows probabilistic predictions about individuals, properties and relationships. We often want to build the models before we know which individuals exist in a population so that the models can be applied to diverse populations. When we condition on observations of particular individuals and relations among particular individuals, we can learn something about those individuals, which can then be used for subsequent reasoning and learning. In the open-universe case, we can be unsure about whether two descriptions or names refer to the same individual, whether an individual that fits a description actually exists or how many individuals there are.
These are key properties of StarAI domains that we have encountered already in the examples used in the introduction.
Example 3.1 Education modeling In an educational domain (see, e.g., Fig. 1.2), the individuals may be students, courses, tests, or particular questions on tests. We want to condition on the observations of what tests the students took, what answers they provided, and any other pertinent information to predict what they understand.
Example 3.2 Medical diagnosis In a medical diagnosis system (see, e.g., Fig. 1.3), the individuals may be patients, treatments, drugs, tests, particular lumps on a person’s body, etc. A system can condition on a patient’s electronic health record to predict the outcome of some treatment. Having a relational model is sensible since electronic health records are not summaries, but contain a history of physical observations, treatments, tests, etc. They differ greatly in the amount of detail they contain per person.
In both of these examples the models are not about a particular patient or student, but are models that can be applied to any patient or student. The set of observations is not fixed, as for instance an electronic health record may contain an unbounded number of observations and test results about multiple lumps that may have appeared (and disappeared) on a patient.
The main property relational models exploited is that of exchangeability: those individuals about which we have the same information should be treated identically. This is the idea behind (universally quantified) variables; some statements are true for all individuals. In terms of probabilistic models, this means we can exchange the constants in any grounding, and still get the same probabilities for any proposition. This implies that before we know anything particular about any of the individuals, they all should share their probabilistic parameters. It also provides a form of symmetry that can be exploited for representations, inference and learning. De Finetti [1974] shows how distributions on exchangeable random variables can be represented in terms of directed models with latent variables. Note that exchangeability of individuals corresponds to exchangeability of random variables only for properties (relations on single individuals), but not for properties of multiple individuals (but see [Kallenberg, 2005] for results in such situations).
Over the years, a multitude of different languages and formalisms for probabilistic relational modeling have been devised. However, there are also some general design principles underlying this “alphabet soup,” which we will discuss first.
3.1 A GENERAL VIEW: PARAMETERIZED PROBABILISTIC MODELS
Relational probabilistic models are typically defined in terms of parameterized random variables [Poole, 2003], which are often drawn in terms of plates [Buntine, 1994, Jordan, 2010]. A parameterized random variable corresponds to a predicate or a function symbol in logic.
We use a first-order alphabet consisting of logical variables, constants, function symbols, and predicate symbols. We assume that the logical variables are typed, where the domain of the type (i.e., the set of individuals of the type), is called the population. Recall from Section 2.2 that a term is either a logical variable, a constant or of the form f(t1, …, tk), where f is a function symbol and each ti is a term. We treat constants as function symbols with no arguments (and so are not treated specially). Each function has a range, which is a set of values. In the following we treat relations as Boolean functions (with range {true, false}).
A parameterized random variable (PRV) is of the form f(t1, …, tk) where each ti is a term and f is a function (or predicate) symbol. A random variable is a parameterized random variable which does not contain a logical variable. An atom is of the form f(t1, …, tk) = v where v is in the range of f. When the range of f is {True, False}, (i.e., when f is a predicate symbol), we write f(t1, …, tk) = True as f(t1, …, tk), and f(t1, …, tk) = False as ¬f(t1, …, tk). We can build a formula from relations using the standard logical connectives. The grounding of a PRV is the set of random variables obtained by uniformly replacing each logical variable by each individual in the population corresponding to the type of the logical variable.
A lifted graphical model, also-called a template-based model [Koller and Friedman, 2009], is a graphical model with parameterized random variables as nodes, and a set of factors among the parameterized random variables, called parameterized factor. A lifted model, together with an assignment of a population to each logical variable means its grounding: the graphical model where the set of random variables is the set of groundings of the PRVs in the model, and the factor in the ground model is the grounding of the corresponding factor of the lifted model. The details differ in the various representation languages, because what is allowed as factors varies, but the issues can be highlighted by a few examples.
Example 3.3 Consider a model of the performance of students in courses. With such a model, we could, for example, condition on the data of Fig. 1.2 to predict how students s3 and s4 will perform in course c4. Suppose we have the types: student and course. The population of student is the set of all students, and the population of course is the set of all courses. The parameterized random variable grade(S, C) could denote the grade of student S in course C. For a particular student sam, and course cs101, the instance grade(sam, cs101) is a random variable denoting the grade of Sam in the course cs101. The range of grade could be the set of possible grades, say {a, b, c, d, f}. Similarly, int(S) could be a parameterized random variable denoting the intelligence of student S. The PRV diff(C) could represent the difficulty of course c.
If there are n students and m courses, the grounding contains nm random variables that are instances of grade(S, C), n instances of int(S) and m instance of diff(C). Thus, there are nm + n + m random variables in the grounding.
Figure 3.1 gives a plate representation of a directed model to predict the grades of students in courses. In this figure, s is a logical variable that denotes a student and c is a logical variable that denotes a course. Note that this figure redundantly includes the logical variables in the plates as well as arguments to the parameterized random variables. Such parameterized models represent their grounding, where there is an instance of each random variable for each assignment of an individual to a logical variable. The factors of the ground network are the groundings of the corresponding factors of the relational model. Figure 3.2 shows such a grounding where there are three students Sam, Chris, and Kim and two courses (c1 and c2).
Note the conditional independence assumptions in this example (derived from the independence inherent in the underlying Bayesian network): the intelligence of the students are independent of each other given no observations. The difficulty of the courses are independent of each other, given no observations. The grades are independent given the intelligence and the difficulty. Given no observations, a pair of grades that share a student or a course are dependent on each other, as they have a common parent, but a pair of grades about different students and courses are independent. Given observations on grades, the intelligence variables and the difficulty variables can become interdependent.
Figure 3.1: Plate representation of the grades model.
Figure 3.2: Grounding of the grades model for 3 people (Sam, Chris, and Kim) and 2 courses (c1 and c2).
The basic principle used by these models is that of parameter sharing: the instances of the parameterized factors created by substituting constants for logical variables share the same probabilistic parameters. This is a consequence of exchangeability: the variables can denote any individual, and so need to be the same for each instance (before we have observed anything to distinguish the individuals). The various languages differ in how to specify the conditional probabilities of the parameterized random variables given their parents, or the other parameters of the probabilistic model.
Often in plate models [Buntine, 1994, Jordan, 2010] the numerical parameters are made explicit, to emphasize that the probabilistic parameters do not depend on the individuals.
Figure 3.3: Plate representation of the grades model, with shared parameters explicit.
Example 3.4 Figure 3.3 shows the plate model of Fig. 3.1 with the numerical parameters made explicit. Parameter θi specifies the prior probability of int(S) that is shared among the students, parameter θd specifies the prior probability of diff(C), and θg specifies the numerical parameters of the conditional probability P(grade(S, C) | int(S), diff(C)). The figure makes explicit that these parameters do not depend on the individuals.
There are some issues that relational models must face, beyond the non-relational case. We highlight one such issue now.
Suppose there is a directed model where for some parameterized random variable X, the parents of X include a logical variable that does not appear in X. In the grounding, X has an unbounded number of parents, and we need some way to represent the conditional probability of a variable given an unbounded number of parents; such methods are called aggregation functions. While aggregation does occur in the non-relational case, it cannot be avoided in the directed relational case.
Example 3.5 Consider the problem of determining guilt in a shooting. Figure 3.4 depicts an example where someone shot person Y if there exists a person X who shot Y. In this model, someone_shot(Y) has the number of parents equal to the population size. In this case, the appropriate aggregator is a logical “or.” The PRV someone_shot(Y) has the meaning ∃X shot(X, Y), and it may be appropriate to write it in this way. Whether X shot Y depends on whether X had motive, opportunity and a gun.
The instances of shot(X, Y) for each X are dependent because they share a common ancestor, has_gun(X). The instances of shot(X, Y) are dependent for some Y if someone_shot(Y) is observed.
Figure 3.4: Plate representation of determining guilt from Example 3.5.
Often there is much more structure about how the instances interact than can be represented in a model that assumes that the instances are only dependent given shared ancestors or observed shared descendants, as in the following two examples.
Example 3.6 Consider the case of diagnosing students’ performance in adding multi-digit numbers of the form
A student, given values for the digits xi and yi, provides values for the zi digits. The aim is to determine whether the students have mastered addition from observations of their performance.
Whether a student gets the correct carry depends on the previous x, y and carry, and whether the student knows how to carry. This dependency can be seen in Fig. 3.5. Here x(D, P) is a parameterized random variable representing digit D of the first summand for problem P. There is a random variable in the grounding for each digit D and each problem P. A ground instance, such as x(3, problem57), is a random variable that may represent the third digit from the right of the topmost number of problem 57. Similarly, there is a z-variable for each digit D, problem P, student S, and time T. The plate notation can be read as duplicating the random variable for each tuple of individual the plate is parameterized by. Whether the student knows how to carry or knows how to add depends on the student and the time, but not on the digit or the problem.
For each problem, student and time, the carry for digit D, namely c(D, P, S, T), depends, in part, on c(D − 1, P, S, T), that is, on the carry from the previous digit. There is a special case for the first digit. Similarly, whether a student knows how to add or carry at any time depends on whether they knew how to add or carry at the previous time. This is depicted by cycles in the plate notation, but it results in an acyclic grounding. The plate notation does not convey all of the information about the dependencies.
Figure 3.5: Belief network with plates for multidigit addition.
Example 3.7 Consider determining the probability that two authors are collaborators, which may depend on whether they have written papers in common, or even whether they have written papers apart from each other. That is, collaborates(A1, A2) may depend on the existence of, and properties of any paper P such that author(A1, P) ∧ author(A2, P) is true. A representation has to be able to specify what happens when there is more than one paper that they are co-authors of. The existential quantification used in Example 3.5 does not seem to be appropriate here. It may also depend on other co-authors who may have collaborated with each of them. Any such rules are only applicable if A1 ≠ A2, that is if A1 and A2 are different people. Examples such as these require more sophisticated languages than the plate models specified above.
The first such languages either had explicit languages for constructing the ground network [Breese, 1992, Goldman and Charniak, 1990] or defined templates for prior and conditional probabilities [Horsch and Poole, 1990] directly in terms of tables, and required a combination function (such as noisy-and or noisy-or) when the number of parents depends on the number of individuals. These template models turn out not to be very flexible as they cannot represent the subtleties involved in how one random variable can depend on others.
To represent such examples, it is useful to be able to specify how the logical variables interact, as is done in logic programs. The independent choice logic (ICL) [Poole, 1997, 2008] (originally called probabilistic Horn abduction (PHA) [Poole, 1991, 1993b]) allows for arbitrary (acyclic) logic programs (including negation as failure) to be used to represent the dependency. The conditional probability parameters are represented as independent probabilistic inputs to the logic program. A logic program that represents Example 3.6 is given in Chapter 14 of [Poole and Mackworth, 2010]. This mix of probabilities and logic programs also forms the foundations for PRISM [Sato and Kameya, 1997, 2008], which has concentrated on learning, and for Problog [De Raedt et al., 2007], a project to build an efficient and flexible language.
Other languages are based on entity-relationship models [Friedman et al., 1999, Heckerman et al., 2004], fragments of first-order logic [Muggleton, 1996, Kersting and De Raedt, 2007, Laskey, 2008], or even in terms of programming languages such as in IBAL [Pfeffer, 2007] or Church [Goodman et al., 2008]. Undirected models, exemplified by Markov logic networks (MLNs) [Richardson and Domingos, 2006], have a similar notion of parameterized random variables, but the probabilities are represented as weights of first-order clauses. The meaning of these relational models is represented by their grounding, but for MLNs the grounding is a Markov network rather than a Bayesian network (see Pearl [1988] for a discussion of the relationship between Bayesian and Markov networks). Yet other languages are extensions of the maximum entropy principle to the relational setting [Kern-Isberner and Thimm, 2010, Beierle et al., 2015] or incorporate probabilistic reasoning about similarities and relational structures [Bröcheler et al., 2010, Bach et al., 2013].
All of the representations are required to be finite (for otherwise they cannot be written down), but the grounding does not need to be finite. The logic-programming based languages such as ICL and Problog allow function symbols which means the grounding is infinite. MLNs have been extended to infinite domains [Singla and Domingos, 2007]. So-called nonparametric Bayesian approaches such as NP-BLOG [Carbonetto et al., 2005], infinite (hidden) relational models [Kemp et al., 2006, Xu et al., 2006], and relational Gaussian processes [Chu et al., 2006, Yu et al., 2006, Yu and Chu, 2007, Silva et al., 2007, Xu et al., 2009b] also allow for an unbounded number of individuals. These models use stochastic processes to specify how the random variables and their associated parameters are generated.
3.2 TWO EXAMPLE REPRESENTATIONS: MARKOV LOGIC AND PROBLOG
We now review two relational probabilistic model formalisms, which are probably most representative of the two main streams in StarAI: Markov logic and ProbLog. Markov logic [Richardson and Domingos, 2006] upgrades Markov network toward first-order logic, whereas ProbLog [De Raedt et al., 2007] is a probabilistic Prolog based on what has been called the distribution semantics [Poole, 1993b, Sato, 1995]. While Markov logic is a typical example of an undirected knowledge-based model construction, ProbLog is a directed model that can be seen as a relational probabilistic programming language. Furthermore, while Markov Logic is an example of an approach that extended a probabilistic graphical model with logic, ProbLog extends a logic programming language with probabilistic primitives. For implementations of Markov Logic and Problog as well as further materials on how to use them, see http://alchemy.cs.washington.edu/
and https://dtai.cs.kuleuven.be/problog/
.
Both representations induce a probability distribution over possible worlds, where a world corresponds to a Herbrand model. The probability of a formula is the sum of the measures of the possible worlds in which it is true.
3.2.1 UNDIRECTED RELATIONAL MODEL: MARKOV LOGIC
Markov logic combines first-order logic with Markov networks. A Markov logic network consists of set of weighted first-order formulae. The probability of a world is proportional to the exponential of the sum of the formulae that are true in the world.
The idea is to view logical formulae as soft constraints on the set of possible worlds. A positive weight on a formula increases the probability of the worlds that satisfy the formula, and a negative weight on a formula decreases the probability of the worlds that satisfy the formula.
Example 3.8 Consider the following example (adopted from [Richardson and Domingos, 2006]). Friends & Smokers is a small Markov logic network that computes the probability of a person having lung cancer on the basis of her friends smoking. This can be encoded using the following weighted formulas: