Semantic Web for the Working Ontologist - Dean Allemang

3 RDF—The Basis of the Semantic Web

Resource Description Framework (RDF), Resource Description Framework Schema (RDFS), and Web Ontology Language (OWL) are the basic representation languages of the Semantic Web, with RDF serving as the foundation. RDF addresses one fundamental issue in the Semantic Web: managing distributed data. All other Semantic Web standards build on this foundation of distributed data. RDF relies heavily on the infrastructure of the Web, using many of its familiar and proven features, while extending them to provide a foundation for a distributed network of data. The resulting paradigm of linked data on the Web will be explained in detail in Chapter 5.

The Web that we are accustomed to is made up of hypertext documents that are linked to one another. Any connection between a document and the thing(s) in the world it describes is made only by the person who reads it. There could be a link from a document about Shakespeare to a document about Stratford-upon-Avon, but there is no notion of an entity that is Shakespeare, nor any way to link it to the thing that is Stratford.

In the Semantic Web we refer to the things in the world as resources; a resource can be anything that someone might want to talk about. Shakespeare, Stratford, “the value of X,” and “all the cows in Texas” are all examples of things someone might talk about and that can be resources in the Semantic Web. This is admittedly a pretty odd use of the word “resource,” but alternatives like “entity” or “thing,” which might be more accurate, have their own issues. In any case, resource is the word used in the Semantic Web standards. In fact, the name of the base technology in the Semantic Web (RDF) uses this word in an essential way: RDF stands for Resource Description Framework.

In a web of information, anyone can contribute to our knowledge about a resource. It was this aspect of the current Web that allowed it to grow at such an unprecedented rate. To implement the Semantic Web, we need a model of data that allows information to be distributed over the Web.

3.1 Distributing Data Across the Web

Data are most typically represented in tabular form, in which each row represents some item we are describing, and each column represents some property of those items. The cells in the table are the particular values for those properties. Table 3.1 shows a sample of some data about works completed around the time of Shakespeare.

Let’s consider a few different strategies for how these data could be distributed over the Web. In all of these strategies, some part of the data will be represented on one computer, while other parts will be represented on another. Figure 3.1 shows one strategy for distributing information over many machines. Each networked machine is responsible for maintaining the information about one or more complete rows from the table. Any query about an entity can be answered by the machine that stores its corresponding row. One machine is responsible for information about Sonnet 78 and Edward II, whereas another is responsible for information about As You Like It.

This distribution solution provides considerable flexibility, since the machines can share the load of representing information about several individuals. But because it is a distributed representation of data, it requires some coordination between the servers. In particular, each server must share information about the columns. Does the second column on one server correspond to the same information as the second column on another server? This is not an insurmountable problem, and, in fact, it is a fundamental problem of data distribution. There must be some agreed-on coordination between the servers. In this example, the servers must be able, in a global way, to indicate which property each column corresponds to.

Figure 3.2 shows another strategy, in which each server is responsible for one or more complete columns from the original table. In this example, one server is responsible for the publication dates and medium, and another server is responsible for titles. This solution is flexible in a different way from the solution of Figure 3.1. The solution in Figure 3.2 allows each machine to be responsible for one kind of information. If we are not interested in the dates of publication, we needn’t consider information from that server. If we want to specify something new about the entities (say, how many pages the manuscript is), we can add a new server with that information without disrupting the others.

This solution is similar to the solution in Figure 3.1 in that it requires some coordination between the servers. In this case, the coordination has to do with the identities of the entities to be described. How do I know that row 3 on one server refers to the same entity as row 3 on another server? This solution requires a global identifier for the entities being described.

Table 3.1 Tabular data about Elizabethan literature and music.



Figure 3.1 Distributing data across the Web, row by row.

The strategy outlined in Figure 3.3 is a combination of the previous two strategies, in which information is neither distributed row by row nor column by column but instead is distributed cell by cell. Each machine is responsible for some number of cells in the table. This system combines the flexibility of both of the previous strategies. Two servers can share the description of a single entity (in the figure, the year and title of Hamlet are stored separately), and they can share the use of a particular property (in Figure 3.3, the Medium values of rows 6 and 7 are stored on different servers).


Figure 3.2 Distributing data across the Web, column by column.


Figure 3.3 Distributing data across the Web, cell by cell.

This flexibility is required if we want our data distribution system to really support the AAA slogan that “Anyone can say Anything about Any topic.” If we take the AAA slogan seriously, any server needs to be able to make a statement about any entity (as is the case in Figure 3.2), but also any server needs to be able to specify any property of an entity (as is the case in Figure 3.1). The solution in Figure 3.3 has both of these benefits.

Table 3.2 Sample triples.

Subject Predicate Object
Row 7 Medium Poem
Row 2 Title Hamlet
Row 2 Year 1604
Row 4 Author Shakespeare
Row 6 Medium Play

Table 3.3 Sample triples.

Subject Predicate Object
Shakespeare wrote King Lear
Shakespeare wrote Macbeth
Anne Hathaway married Shakespeare
Shakespeare livedIn Stratford
Stratford isIn England
Macbeth setIn Scotland
England partOf UK
Scotland partOf UK

But this solution also combines the costs of the other two strategies. Not only do we now need a global reference for the column headings, but we also need a global reference for the rows. In fact, each cell has to be represented with three values: a global reference for the row, a global reference for the column, and the value in the cell itself. This third strategy is the strategy taken by RDF. We will see how RDF resolves the issue of global reference later in this chapter, but for now, we will focus on how a table cell is represented and managed in RDF.

Since a cell is represented with three values, the basic building block for RDF is called the triple. The identifier for the row is called the subject of the triple (following the notion from elementary grammar, since the subject is the thing that a statement is about). The identifier for the column is called the predicate of the triple (since columns specify properties of the entities in the rows). The value in the cell is called the object of the triple. Table 3.2 shows the triples in Figure 3.3 as subject, predicate, and object.
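As a minimal sketch of this idea, we can hold each cell from Table 3.2 as a three-element tuple. Python is our choice for illustration only, and the helper name describe is hypothetical; nothing here is part of the RDF standard.

```python
# Each table cell becomes one (subject, predicate, object) triple:
# the row reference, the column reference, and the cell value.
# The data are copied from Table 3.2.
cells = [
    ("Row 7", "Medium", "Poem"),
    ("Row 2", "Title", "Hamlet"),
    ("Row 2", "Year", 1604),
    ("Row 4", "Author", "Shakespeare"),
    ("Row 6", "Medium", "Play"),
]


def describe(subject, triples):
    """Gather every property known about one row, wherever it was stored."""
    return {p: o for s, p, o in triples if s == subject}


row2 = describe("Row 2", cells)
print(row2)
```

Because every triple carries its own row and column reference, the two facts about Row 2 can be reassembled even if they came from different servers.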

Triples become more interesting when more than one triple refers to the same entity, such as in Table 3.3. When more than one triple refers to the same thing, sometimes it is convenient to view the triples as a directed graph in which each triple is an edge from its subject to its object, with the predicate as the label on the edge, as shown in Figure 3.4. The graph visualization in Figure 3.4 expresses the same information presented in Table 3.3, but everything we know about Shakespeare (either as subject or object) is displayed at a single node.
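The graph view can be sketched with two small indexes, one for edges leaving a node and one for edges arriving at it. This is an illustrative Python sketch using the triples of Table 3.3; the names outgoing and incoming are our own assumptions.

```python
from collections import defaultdict

# The eight triples of Table 3.3.
triples = [
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Anne Hathaway", "married", "Shakespeare"),
    ("Shakespeare", "livedIn", "Stratford"),
    ("Stratford", "isIn", "England"),
    ("Macbeth", "setIn", "Scotland"),
    ("England", "partOf", "UK"),
    ("Scotland", "partOf", "UK"),
]

# Each triple is a directed edge from subject to object, labeled by the predicate.
outgoing = defaultdict(list)  # node -> [(predicate, object), ...]
incoming = defaultdict(list)  # node -> [(predicate, subject), ...]
for s, p, o in triples:
    outgoing[s].append((p, o))
    incoming[o].append((p, s))

# Every distinct subject or object is a node in the graph.
nodes = set(outgoing) | set(incoming)

# Everything we know about Shakespeare, as subject or as object, sits at one node.
print(outgoing["Shakespeare"])
print(incoming["Shakespeare"])
```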


Figure 3.4 Graph display of triples from Table 3.3. Eight triples appear as eight labeled edges.

3.2 Merging Data from Multiple Sources

We started off describing RDF as a way to distribute data over several sources. But when we want to use that data, we will need to merge those sources back together again. One value of the triples representation is the ease with which this kind of merger can be accomplished. Since information is represented simply as triples, merging information from two graphs is as simple as forming the graph of all of the triples from each individual graph, taken together. Let’s see how this is accomplished in RDF.

Suppose that we had another source of information that was relevant to our example from Table 3.3—that is, a list of plays that Shakespeare wrote or a list of parts of the United Kingdom (UK). These would be represented as triples as in Tables 3.4 and 3.5. Each of these can also be shown as a graph, just as in the original table, as shown in Figure 3.5.

What happens when we merge together the information from these three sources? We simply get the graph of all the triples that show up in Figures 3.4 and 3.5. Merging graphs like those in Figures 3.4 and 3.5 to create a combined graph like the one shown in Figure 3.6 is a straightforward process—but only when it is known which nodes in each of the source graphs match.
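If we treat each graph as a set of triples, the merge is just a set union. The following Python sketch uses the data of Tables 3.3, 3.4, and 3.5, and it assumes that the labels match across sources; in particular, it assumes that "UK" and "The UK" name the same node, which is exactly the kind of agreement the merge depends on.

```python
# Graph from Table 3.3.
table_3_3 = {
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Anne Hathaway", "married", "Shakespeare"),
    ("Shakespeare", "livedIn", "Stratford"),
    ("Stratford", "isIn", "England"),
    ("Macbeth", "setIn", "Scotland"),
    ("England", "partOf", "UK"),
    ("Scotland", "partOf", "UK"),
}

# Graph from Table 3.4 (assumption: "The UK" is written as "UK" so labels match).
table_3_4 = {
    (part, "partOf", "UK")
    for part in ["Scotland", "England", "Wales", "Northern Ireland",
                 "Channel Islands", "Isle of Man"]
}

# Graph from Table 3.5.
table_3_5 = {
    ("Shakespeare", "wrote", play)
    for play in ["As You Like It", "Henry V", "Love's Labour's Lost",
                 "Measure for Measure", "Twelfth Night", "The Winter's Tale",
                 "Hamlet", "Othello"]
}

# Merging is the union of the triple sets; a triple stated twice appears once.
merged = table_3_3 | table_3_4 | table_3_5
stated_twice = table_3_3 & table_3_4

print(len(merged), sorted(stated_twice))
```

Had one source kept the label "The UK", the union would still succeed, but the two spellings would remain separate nodes in the merged graph.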

3.3 Namespaces, URIs, and Identity

The essence of the merge process comes down to answering the question “When is a node in one graph the same node as a node in another graph?” In RDF, this issue is resolved through the use of Uniform Resource Identifiers (URIs).

In the figures so far, we have labeled the nodes and edges in the graphs with simple names like Shakespeare or Wales. On the Semantic Web, this is not sufficient information to determine whether two nodes are really the same. Why not? Isn’t there just one thing in the universe that everyone agrees to refer to as Shakespeare? When referring to agreement on the Web, never say, “everyone.” Somewhere, someone will refer not to the historical Shakespeare but to the title character of the feature film Shakespeare in Love, which bears very little resemblance to the historical figure. And “Shakespeare” is one of the more stable concepts to appear on the Web; consider the range of referents for a name like “Washington” or “Bordeaux.” To merge graphs in a Semantic Web setting, we have to be more specific: In what sense do we mean the word Shakespeare?

Table 3.4 Triples about the parts of the UK.

Subject Predicate Object
Scotland partOf The UK
England partOf The UK
Wales partOf The UK
Northern Ireland partOf The UK
Channel Islands partOf The UK
Isle of Man partOf The UK

Table 3.5 Triples about Shakespeare’s plays.

Subject Predicate Object
Shakespeare wrote As You Like It
Shakespeare wrote Henry V
Shakespeare wrote Love’s Labour’s Lost
Shakespeare wrote Measure for Measure
Shakespeare wrote Twelfth Night
Shakespeare wrote The Winter’s Tale
Shakespeare wrote Hamlet
Shakespeare wrote Othello

RDF borrows its solution to this problem from foundational Web technology, in particular, the URI. The basic syntax and format of a URI are familiar even to casual users of the Web today because of the special, but typical, case of the URL (for example, http://www.workingontologist.org/Examples/Chapter3/Shakespeare#Shakespeare). But the significance of the URI as a global identifier for a Web resource is often not appreciated. A URI provides a global identification for a resource that is common across the Web. This stipulation is not particular to the Semantic Web; it applies to the Web in general, where global naming leads to global network effects. Of course, in the jungle that is the Web, we can’t expect that every data source that refers to Shakespeare will use the same URI. In Chapter 5 we will explore when and why we might use different URIs for the same individual, and what capabilities the Semantic Web provides to manage them.


Figure 3.5 Graphic representation of triples describing (a) Shakespeare’s plays and (b) parts of the UK.


Figure 3.6 Combined graph of all triples about Shakespeare and the UK.

URIs and URLs look exactly the same, and, in fact, a URL is just a special case of the URI. Why does the Web have both of these ideas? Simplifying somewhat, the URI is an identifier with global (i.e., “World Wide” in the “World Wide Web” sense) scope. Any two Web applications in the world can refer to the same thing by referencing the same URI. But the syntax of the URI makes it possible to “dereference” it, that is, to use all the information in the URI (which specifies things like server name, protocol, port number, file name, etc.) to locate a file (or a location in a file) on the Web. This dereferencing succeeds if all these parts work; the protocol locates the specified server running on the specified port and so on. When this is the case, we can say that the URI is not just a URI, but an effective HTTP URI. From the point of view of modeling, the distinction is not important. But from the point of view of having a model on the Semantic Web, the fact that a URI can potentially be dereferenced allows the models to participate in a global Web infrastructure as we will see in Chapter 5.

The URI can be generalized further as an Internationalized Resource Identifier, or IRI. The IRI is a generalization of the URI that uses all the character representations for languages on the Web, so an IRI can include characters with accents or indeed characters from any language that has a standard web encoding.

RDF applies the notion of the URI to resolve the identity problem in graph merging. The application is quite simple: A node from one graph is merged with a node from another graph exactly if they have the same URI. On the one hand, this may seem disingenuous, “solving” the problem of node identity by relying on another standard to solve it. On the other hand, since issues of identity appear in the Web in general and not just in the Semantic Web, it would be foolish not to use the same strategy to resolve the issue in both cases.

Expressing URIs in print

URIs work very well for expressing identity on the World Wide Web, but they are typically a bit of a pain to write out in detail when expressing models, especially in print. So for the examples in this book, we use a simplified version of a URI abbreviation scheme called CURIEs (standing for Compact URI). In its simplest form, a URI expressed as a CURIE has two parts: a namespace and an identifier, written with a colon between. So the CURIE representation for the identifier England in the namespace geo is simply geo:England. The RDF standard syntaxes include elaborate rules that allow programmers to map namespaces to other URI representations (such as the familiar http:// notation). For the examples in this book, we will use the simple CURIE form for all URIs. It is important, however, to note that CURIEs are not global identifiers on the Web; only fully qualified URIs (for example, http://www.WorkingOntologist.org/Examples/Chapter3/Shakespeare#Shakespeare) are global Web names. Thus, any representation of a CURIE must, in principle, be accompanied by a declaration of the namespace correspondence.

It is customary on the Web in general to insist that URIs contain no embedded spaces. For example, an identifier “part of” is typically not used in the Web. Instead, we follow the InterCap convention (sometimes called CamelCase), whereby names that are made up of multiple words are transformed into identifiers without spaces by capitalizing each word. Thus, “part of” becomes partOf, “Great Britain” becomes GreatBritain, “Measure for Measure” becomes MeasureForMeasure, and so on.
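The InterCap convention is simple enough to sketch in a few lines of Python. The function name inter_cap and its lower_first flag are hypothetical; the flag reflects the fact that the examples above begin a property name such as partOf with a lowercase letter while names such as GreatBritain begin with a capital.

```python
import re


def inter_cap(name, lower_first=False):
    """Join the words of a name into a single identifier without spaces."""
    # Strip punctuation such as apostrophes, then drop any words left empty.
    words = [re.sub(r"[^0-9A-Za-z]", "", word) for word in name.split()]
    words = [word for word in words if word]
    # Capitalize the first letter of each word and leave the rest untouched.
    joined = "".join(word[0].upper() + word[1:] for word in words)
    if lower_first and joined:
        joined = joined[0].lower() + joined[1:]
    return joined


print(inter_cap("part of", lower_first=True))
print(inter_cap("Great Britain"))
print(inter_cap("Measure for Measure"))
```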

There is no limitation on the use of multiple namespaces in a single source of data, or even in a single triple. Selection of namespaces is entirely unrestricted as far as the data model and standards are concerned. It is common practice, however, to refer to related identifiers in a single namespace. For instance, all of the literary or geographical information from Table 3.5 or Table 3.4 would be placed into one namespace per table, with a suggestive name—say, lit or geo—respectively. Strictly speaking, these names correspond to fully qualified URIs—for example, lit stands for http://www.WorkingOntologist.com/Examples/Chapter3/Shakespeare#, and geo stands for http://www.WorkingOntologist.com/Examples/Chapter3/geography#.
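To make the correspondence between a CURIE and its fully qualified URI concrete, here is a small Python sketch. The prefix bindings are the ones given above for lit and geo; the function names expand and same_node are our own, and real RDF toolkits provide their own mechanisms for this.

```python
# Prefix bindings as stated in the text for the lit and geo namespaces.
PREFIXES = {
    "lit": "http://www.WorkingOntologist.com/Examples/Chapter3/Shakespeare#",
    "geo": "http://www.WorkingOntologist.com/Examples/Chapter3/geography#",
}


def expand(curie, prefixes=PREFIXES):
    """Turn a CURIE such as geo:England into a fully qualified URI."""
    prefix, _, local_name = curie.partition(":")
    if prefix not in prefixes:
        # A CURIE without a declared namespace is not a global name.
        raise KeyError(f"no namespace declared for prefix {prefix!r}")
    return prefixes[prefix] + local_name


def same_node(curie_a, curie_b, prefixes=PREFIXES):
    """Two nodes are merged exactly if their full URIs are identical."""
    return expand(curie_a, prefixes) == expand(curie_b, prefixes)


print(expand("geo:England"))
print(same_node("lit:Stratford", "geo:Stratford"))
```

The second call shows why the namespace matters: the same local name in two namespaces yields two different URIs, so the nodes are not merged.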

For the purposes of explaining modeling on the Semantic Web, the detailed URIs behind the CURIEs are not important, so for the most part, we will omit these bindings from now on. In many examples, we will take this notion of abbreviation one step further; in the cases when we use a single namespace throughout one example, we will assume there is a default namespace declaration that allows us to refer to URIs simply with a symbolic name preceded by a colon (:), such as :Shakespeare, :JamesDean, :Researcher.

Using CURIEs, our triple sets now look as shown in Tables 3.6 and 3.7. Compare Table 3.6 with Table 3.5, and compare Table 3.7 with Table 3.4. But it isn’t always that simple; some triples will have to use identifiers with different namespaces, as in the example in Table 3.8, which was taken from Table 3.3.

In Table 3.8, we introduced a new namespace, bio:, without specifying the actual URI to which it corresponds. For this model to participate on the Web, this information must be filled in. But from the point of view of modeling, this detail is unimportant. For the rest of this book, we will assume that the prefixes of all CURIEs are defined, even if that definition has not been specified explicitly in print.

Standard namespaces

Using the URI as a standard for global identifiers allows for a worldwide reference for any symbol. This means that we can tell when any two people anywhere in the world are referring to the same thing.

This property of the URI provides a simple way for a standards organization (like the World Wide Web Consortium [W3C]) to specify the meaning of certain terms in the standard. As we will see in coming chapters, the W3C standards provide definitions for terms such as type, subClassOf, Class, inverseOf, and so forth. But these standards are intended to apply globally across the Semantic Web, so the standards refer to these reserved words in the same way as they refer to any other resource on the Semantic Web, as URIs.

Table 3.6 Plays of Shakespeare with CURIEs.

Subject Predicate Object
lit:Shakespeare lit:wrote lit:AsYouLikeIt
lit:Shakespeare lit:wrote lit:HenryV
lit:Shakespeare lit:wrote lit:LovesLaboursLost
lit:Shakespeare lit:wrote lit:MeasureForMeasure
lit:Shakespeare lit:wrote lit:TwelfthNight
lit:Shakespeare lit:wrote lit:WintersTale
lit:Shakespeare lit:wrote lit:Hamlet
lit:Shakespeare lit:wrote lit:Othello

Table 3.7 Geographical names with CURIEs.

Subject Predicate Object
geo:Scotland geo:partOf geo:UK
geo:England geo:partOf geo:UK
geo:Wales geo:partOf geo:UK
geo:NorthernIreland geo:partOf geo:UK
geo:ChannelIslands geo:partOf geo:UK
geo:IsleOfMan geo:partOf geo:UK

Table 3.8 Triples referring to URIs with a variety of namespaces.

Subject Predicate Object
lit:Shakespeare lit:wrote lit:KingLear
lit:Shakespeare lit:wrote lit:Macbeth
bio:AnneHathaway bio:married lit:Shakespeare
bio:AnneHathaway bio:livedWith lit:Shakespeare
lit:Shakespeare bio:livedIn geo:Stratford
geo:Stratford geo:isIn geo:England
geo:England geo:partOf geo:UK
geo:Scotland geo:partOf geo:UK

The W3C has defined a number of standard namespaces for use with Web technologies, including xsd: for XML schema definition; xmlns: for XML namespaces; and so on. The Semantic Web is handled in exactly the same way, with namespace definitions for the major layers of the Semantic Web. Following standard practice with the W3C, we will use CURIEs to refer to these terms, using the following definitions for the standard namespaces.

rdf: Indicates identifiers used in RDF. The set of identifiers defined in the standard is quite small and is used to define types and properties in RDF. The global URI for the rdf namespace is http://www.w3.org/1999/02/22-rdf-syntax-ns#.

Table 3.9 Using rdf:type to describe playwrights.

Subject Predicate Object
lit:Shakespeare rdf:type lit:Playwright
lit:Ibsen rdf:type lit:Playwright
lit:Simon rdf:type lit:Playwright
lit:Miller rdf:type lit:Playwright
lit:Marlowe rdf:type lit:Playwright
lit:Wilder rdf:type lit:Playwright

rdfs: Indicates identifiers used for the RDFS. The scope and semantics of the symbols in this namespace are the topics of future chapters. The global URI for the rdfs namespace is http://www.w3.org/2000/01/rdf-schema#.

skos: Indicates identifiers used for the Simple Knowledge Organization System (SKOS), a schema for distributed management of vocabularies on the Web. Chapter 11 provides a detailed discussion on SKOS and its use. The global URI for SKOS is http://www.w3.org/2004/02/skos/core#.

owl: Indicates identifiers used for OWL. The scope and semantics of the symbols in this namespace are the topics of future chapters. The global URI for the OWL namespace is http://www.w3.org/2002/07/owl#.

These URIs provide a good example of the interaction between a URI and a URL. For modeling purposes, any URI in one of these namespaces (for example, http://www.w3.org/2000/01/rdf-schema#subClassOf, or rdfs:subClassOf for short) refers to a particular term that the W3C makes some statements about in the RDFS standard. But the term can also be dereferenced—that is, if we look at the server www.w3.org, there is a page at the location 2000/01/rdf-schema with an entry about subClassOf, giving supplemental information about this resource. From the point of view of modeling, it is not necessary that it be possible to dereference this URI, but from the point of view of Web integration, it is critical that it is. The underlying standards and principles to weave such a web of linked data will be detailed in Chapter 5.

3.4 Identifiers in the RDF Namespace

The RDF data model specifies the notion of triples and the idea of merging sets of triples as just shown. With the introduction of namespaces, RDF uses the infrastructure of the Web to represent agreements on how to refer to a particular entity. The RDF standard itself takes advantage of the namespace infrastructure to define a small number of standard identifiers in a namespace defined in the standard, a namespace called rdf.

Table 3.10 Defining types of names.

Subject Predicate Object
lit:Playwright rdf:type bus:Profession
bus:Profession rdf:type hr:Compensation

Table 3.11 rdf:Property assertions for Tables 3.5 to 3.8.

Subject Predicate Object
lit:wrote rdf:type rdf:Property
geo:partOf rdf:type rdf:Property
bio:married rdf:type rdf:Property
bio:livedIn rdf:type rdf:Property
bio:livedWith rdf:type rdf:Property
geo:isIn rdf:type rdf:Property

rdf:type is a property that provides an elementary typing system in RDF. For example, we can express the relationship between several playwrights using type information, as shown in Table 3.9. The subject of rdf:type in these triples can be any identifier, and the object is understood to be a type. There is no restriction on the usage of rdf:type with types; types can have types ad infinitum, as shown in Table 3.10.
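A short Python sketch can show how rdf:type triples are read. It uses the data of Tables 3.9 and 3.10; the helper names instances_of, types_of, and type_chain are hypothetical and serve only to illustrate that a type can itself have a type.

```python
# rdf:type triples from Table 3.9 (playwrights) and Table 3.10 (types of types).
TYPE_TRIPLES = [
    ("lit:Shakespeare", "rdf:type", "lit:Playwright"),
    ("lit:Ibsen", "rdf:type", "lit:Playwright"),
    ("lit:Simon", "rdf:type", "lit:Playwright"),
    ("lit:Miller", "rdf:type", "lit:Playwright"),
    ("lit:Marlowe", "rdf:type", "lit:Playwright"),
    ("lit:Wilder", "rdf:type", "lit:Playwright"),
    ("lit:Playwright", "rdf:type", "bus:Profession"),
    ("bus:Profession", "rdf:type", "hr:Compensation"),
]


def instances_of(type_name, triples=TYPE_TRIPLES):
    """Subjects that are directly declared to have the given type."""
    return [s for s, p, o in triples if p == "rdf:type" and o == type_name]


def types_of(resource, triples=TYPE_TRIPLES):
    """Types directly declared for the given resource."""
    return [o for s, p, o in triples if s == resource and p == "rdf:type"]


def type_chain(resource, triples=TYPE_TRIPLES):
    """Follow rdf:type links upward: the type, the type of the type, and so on."""
    chain, seen, frontier = [], {resource}, [resource]
    while frontier:
        current = frontier.pop(0)
        for declared in types_of(current, triples):
            if declared not in seen:  # guard against cycles in the data
                seen.add(declared)
                chain.append(declared)
                frontier.append(declared)
    return chain


print(instances_of("lit:Playwright"))
print(type_chain("lit:Shakespeare"))
```

Note that type_chain only walks the links that are stated; it does not claim that Shakespeare is a bus:Profession, which RDF itself does not infer from these triples.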

When we read a triple out loud (or just to ourselves), it is understandably tempting to read it (in English, anyway) in subject/predicate/object order so that the first triple in Table 3.9 would read, “Shakespeare type Playwright.” Unfortunately, this is pretty fractured syntax no matter how you inflect it. It would be better to have something like “Shakespeare has type Playwright” or maybe “The type of Shakespeare is Playwright.”

This issue really has to do with the choice of name for the rdf:type resource; if it had been called rdf:isInstanceOf instead, it would have been much easier to read out loud in English. But since we never have control over how other entities (in this case, the W3C) choose their names, we don’t have the luxury of changing these names. When we read out loud, we just have to take some liberties in adding in connecting words. So this triple can be pronounced, “Shakespeare [has] type Playwright,” adding in the “has” (or sometimes, the word “is” works better) to make the sentence into somewhat correct English. Later in this chapter, we’ll see the Turtle syntax for writing RDF, in which a shortcut has been introduced for this particular case: the keyword “a” can be used instead of rdf:type, which makes the reading even easier: “Shakespeare [is] a Playwright.”

rdf:Property is an identifier that is used as a type in RDF to indicate when another identifier is to be used as a predicate rather than as a subject or an object. We can declare all the identifiers we have used as predicates so far in this chapter as shown in Table 3.11.
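The declarations in Table 3.11 can be derived mechanically from the triples that use the predicates. The Python sketch below does this for the triples of Table 3.8; the variable names are our own, and the approach is only an illustration of what such declarations say.

```python
# Triples from Table 3.8, written with CURIEs.
table_3_8 = [
    ("lit:Shakespeare", "lit:wrote", "lit:KingLear"),
    ("lit:Shakespeare", "lit:wrote", "lit:MacBeth"),
    ("bio:AnneHathaway", "bio:married", "lit:Shakespeare"),
    ("bio:AnneHathaway", "bio:livedWith", "lit:Shakespeare"),
    ("lit:Shakespeare", "bio:livedIn", "geo:Stratford"),
    ("geo:Stratford", "geo:isIn", "geo:England"),
    ("geo:England", "geo:partOf", "geo:UK"),
    ("geo:Scotland", "geo:partOf", "geo:UK"),
]

# Every identifier that appears in the predicate position is used as a property.
predicates = sorted({p for _, p, _ in table_3_8})

# Declare each one with a triple of its own, as in Table 3.11.
property_triples = [(p, "rdf:type", "rdf:Property") for p in predicates]

for triple in property_triples:
    print(*triple)
```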

3.5 CHALLENGES: RDF and Tabular Data

We began this chapter by motivating RDF as a way to distribute data over the Web—in particular, tabular data. Now that we have all of the detailed mechanisms of RDF (including namespaces and triples) in place, we can revisit tabular data and show how to represent it consistently in RDF.

Challenge 1

Given a table from a relational database, describing products, suppliers, and stocking information about the products (see Table 3.12), produce an RDF graph that reflects its contents in such a way that the information intent is preserved but the data are now amenable to RDF operations like merging and RDF query.

Solution

Each row in the table describes a single entity, all of the same type. That type is given by the name of the table itself, Product. We know certain information about each of these items, based on the columns in the table itself, such as the model number, the division, and so on. We want to represent these data in RDF.

Since each row represents a distinct entity, each row will have a distinct URI. Fortunately, the need for unique identifiers is just as present in the database as it is in the Semantic Web, so there is a (locally) unique identifier available: the primary table key, in this case the column called ID. For the Semantic Web, we need a globally unique identifier. The simplest way to form such an identifier is by having a single URI for the database itself (perhaps even a URL if the database is on the Web). We use that URI as the namespace for all the identifiers in the database. We will discuss the minting of URIs in more detail in Chapter 5. Since this is a database for a manufacturing company, let’s call that namespace mfg:.

Then we can create an identifier for each line by concatenating the table name “Product” with the unique key and expressing this identifier in the mfg: namespace, resulting in identifiers mfg:Product1, mfg:Product2, and so on.

Each row in the table says several things about that item—namely, its model number, its division, and so on. To represent this in RDF, each of these will be a property that will describe the Products. But just as is the case for the unique identifiers for the rows, we need to have global unique identifiers for these properties. We can use the same namespace as we did for the individuals, but since two tables could have the same column name (but they aren’t the same properties!), we need to combine the table name and the column name. This results in properties like mfg:Product_ModelNo, mfg:Product_Division, and so on.

Table 3.12 Sample tabular data for triples.


With these conventions in place, we can now express all the information in the table as triples. There will be one triple per cell in the table; that is, for n rows and c columns, there will be n × c triples. The data shown in Table 3.12 have 7 columns and 9 rows, so there are 63 triples, some of which are shown in Table 3.13.

The triples in the table are a bit different from the triples we have seen so far. Although the subject and predicate of these triples are RDF resources (complete with CURIE namespaces!), the objects are not resources but literal data—that is, strings, integers, and so forth. This should come as no surprise, since, after all, RDF is a data representation system. RDF borrows from XML all the literal data types as possible values for the object of a triple; in this case, the types of all data are strings or integers.

The usual interpretation of a table is that each row in the table corresponds to one individual and that the type of these individuals corresponds to the name of the table. In Table 3.12, each row corresponds to a Product. We can represent this in RDF by adding one triple per row that specifies the type of the individual described by each row, as shown in Table 3.14.
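The conventions of this solution can be collected into one short Python sketch. It is an illustration under stated assumptions, not a standard mapping: the function name table_to_triples is hypothetical, and only the two rows whose values appear in Table 3.13 are included, since the rest of the table's contents are not reproduced here.

```python
TABLE = "Product"
COLUMNS = ["ID", "ModelNo", "Division", "Product_Line",
           "Manufacture_Location", "SKU", "Available"]
# Only the two rows whose cell values are listed in Table 3.13.
ROWS = [
    [1, "ZX-3", "Manufacturing support", "Paper machine", "Sacramento", "FB3524", 23],
    [2, "ZX-3P", "Manufacturing support", "Paper machine", "Sacramento", "KD5243", 4],
]


def table_to_triples(table, columns, rows, key="ID", namespace="mfg"):
    """Emit one rdf:type triple per row and one triple per cell."""
    key_index = columns.index(key)
    type_triples, cell_triples = [], []
    for row in rows:
        # Row identifier: table name plus primary key, in the database namespace.
        subject = f"{namespace}:{table}{row[key_index]}"
        type_triples.append((subject, "rdf:type", f"{namespace}:{table}"))
        for column, value in zip(columns, row):
            # Property identifier: table name plus column name.
            predicate = f"{namespace}:{table}_{column}"
            cell_triples.append((subject, predicate, value))  # object is a literal
    return type_triples, cell_triples


type_triples, cell_triples = table_to_triples(TABLE, COLUMNS, ROWS)
print(len(cell_triples), "cell triples for", len(ROWS), "rows")
```

With all 9 rows of the table loaded, the same function would yield the 9 × 7 cell triples counted above, plus one type triple per row.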

Table 3.13 Triples representing some of the data in Table 3.12.

Subject Predicate Object
mfg:Product1 mfg:Product_ID 1
mfg:Product1 mfg:Product_ModelNo ZX-3
mfg:Product1 mfg:Product_Division Manufacturing support
mfg:Product1 mfg:Product_Product_Line Paper machine
mfg:Product1 mfg:Product_Manufacture_Location Sacramento
mfg:Product1 mfg:Product_SKU FB3524
mfg:Product1 mfg:Product_Available 23
mfg:Product2 mfg:Product_ID 2
mfg:Product2 mfg:Product_ModelNo ZX-3P
mfg:Product2 mfg:Product_Division Manufacturing support
mfg:Product2 mfg:Product_Product_Line Paper machine
mfg:Product2 mfg:Product_Manufacture_Location Sacramento
mfg:Product2 mfg:Product_SKU KD5243
mfg:Product2 mfg:Product_Available 4

Table 3.14 Triples representing type information from Table 3.12.

Subject Predicate Object
mfg:Product1 rdf:type mfg:Product
mfg:Product2 rdf:type mfg:Product
mfg:Product3 rdf:type mfg:Product
mfg:Product4 rdf:type mfg:Product
mfg:Product5 rdf:type mfg:Product
mfg:Product6 rdf:type mfg:Product
mfg:Product7 rdf:type mfg:Product
mfg:Product8 rdf:type mfg:Product
mfg:Product9 rdf:type mfg:Product

The full complement of triples from the translation of the information in Table 3.12 is shown in Figure 3.7. The types (i.e., where the predicate is rdf:type, and the object is the class mfg:Product) are shown as links in the graph; triples in which the object is a literal datum are shown (for the sake of compactness in the figure) within a box labeled by their common subject.

3.6 Higher-Order Relationships

It is not unusual for someone who is building a model in RDF for the first time to feel a bit limited by the simple subject/predicate/object form of the RDF triple.


Figure 3.7 Graphical version of the tabular data from Table 3.12.

They don’t want to just say that Shakespeare wrote Hamlet, but they want to qualify this statement and say that Shakespeare wrote Hamlet in 1604 or that Wikipedia states that Shakespeare wrote Hamlet in 1604. In general, these are cases in which it is, or at least seems, desirable to make a statement about another statement. This process is called reification. Reification is not a problem specific to Semantic Web modeling; the same issue arises in other data modeling contexts like relational databases and object systems. In fact, one approach to reification in the Semantic Web is to simply borrow the standard solution that is commonly used in relational database schemas, using the conventional mapping from relational tables to RDF given in the preceding challenge. In a relational database table, it is possible to simply create a table with more columns to add additional information about a triple. So the statement Shakespeare wrote Hamlet is expressed (as in Table 3.1) in a single row of a table, where there is a column for the author of a work and another column for its title. Any further information about this event is recorded in another column (again, just as in Table 3.1). When this is converted to RDF according to the example in Challenge 1, the row is represented by a number of triples, one triple per column in the table. The subject of all of these triples is the same: a single resource that corresponds to the row in the table.

An example of this can be seen in Table 3.13, where several triples share the same subject, with one triple for each column in the table. This approach to reification has a strong pedigree in relational modeling, and it has worked well for a wide range of modeling applications. It can be applied in RDF even when the data have not been imported from tabular form. That is, the statement Shakespeare wrote Hamlet in 1601 (disagreeing with the statement in Table 3.2) can be expressed with these three triples:

Subject Predicate Object
bio:n1 bio:author lit:Shakespeare
bio:n1 bio:title “Hamlet”
bio:n1 bio:publicationDate 1601
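The row-to-triples mapping behind these three triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name row_to_triples and the dictionary layout are hypothetical, and resources are written as plain CURIE strings rather than handled by an RDF library.

```python
def row_to_triples(row_id, row, prefix="bio"):
    """Return one (subject, predicate, object) triple per column of a table row.

    The row itself becomes the shared subject; each column name becomes a
    predicate in the same (assumed) namespace.
    """
    subject = f"{prefix}:{row_id}"
    return [(subject, f"{prefix}:{column}", value) for column, value in row.items()]


# One row of a hypothetical table of works; the values mirror the triples above.
row = {"author": "lit:Shakespeare", "title": '"Hamlet"', "publicationDate": 1601}
triples = row_to_triples("n1", row)

for s, p, o in triples:
    print(s, p, o)
```

Adding a further qualification (a publisher, say) is then just another column, and therefore another triple with the same subject.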

This approach works well for examples like Shakespeare wrote Hamlet in 1601, in which we want to express more information about some event or statement. It doesn’t work so well in cases like Wikipedia says Shakespeare wrote Hamlet, in which we are expressing information about the statement itself, Shakespeare wrote Hamlet. This kind of metadata about statements often takes the form of provenance (information about the source of a statement, as in this example), likelihood (expressed in some quantitative form like probability, such as It is 90 percent probable that Shakespeare wrote Hamlet), context (specific information about a project setting in which a statement holds, such as Kenneth Branagh played Hamlet in the movie), or time frame (Hamlet plays on Broadway January 11 through March 12). In such cases, it is useful to explicitly make a statement about a statement. This process, called explicit reification, is supported by the W3C RDF standard with three resources called rdf:subject, rdf:predicate, and rdf:object.

Let’s take the example of Wikipedia says Shakespeare wrote Hamlet. Using the RDF standard, we can refer to a triple as follows:

Subject Predicate Object
q:n1 rdf:subject lit:Shakespeare
q:n1 rdf:predicate lit:wrote
q:n1 rdf:object lit:Hamlet

Then we can express the relation of Wikipedia to this statement as follows:

Subject Predicate Object
web:Wikipedia m:says q:n1

Notice that just because we have asserted the reification triples about q:n1, it is not necessarily the case that we have also asserted the triple itself:

Subject Predicate Object
lit:Shakespeare lit:wrote lit:Hamlet

This is as it should be; after all, if an application does not trust information from Wikipedia, then it should not behave as though that triple has been asserted. An application that does trust Wikipedia will want to behave as though it had.
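The difference between describing a triple and asserting it can be made concrete with a small Python sketch. It assumes triples are plain tuples of CURIE strings; the helper names reify and statements_from are hypothetical and are not part of any RDF library.

```python
def reify(statement_id, triple):
    """Describe a triple with rdf:subject, rdf:predicate, and rdf:object.

    The returned triples talk about the statement; they do not assert it.
    """
    s, p, o = triple
    return {
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    }


def statements_from(graph, source, says="m:says"):
    """Rebuild the triples that `source` is recorded as saying."""
    rebuilt = set()
    for subj, pred, stmt in graph:
        if subj == source and pred == says:
            parts = {p: o for s, p, o in graph if s == stmt}
            rebuilt.add((parts["rdf:subject"], parts["rdf:predicate"], parts["rdf:object"]))
    return rebuilt


claim = ("lit:Shakespeare", "lit:wrote", "lit:Hamlet")
graph = reify("q:n1", claim) | {("web:Wikipedia", "m:says", "q:n1")}

print(claim in graph)    # False: the reification triples do not assert the claim

# An application that trusts Wikipedia adds the rebuilt statements itself.
trusted = graph | statements_from(graph, "web:Wikipedia")
print(claim in trusted)  # True
```

The decision to trust a source stays with the application; the data only records who said what.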

3.7 Naming RDF Graphs

So far, we have seen how a collection of triples can be considered as a graph, either for display purposes (as in many of the figures in this chapter) or, as we will see in Chapter 6, for querying. But we haven’t been very specific about what exactly we mean by a graph.

Informally, a graph is a diagram with nodes and edges. In RDF, this corresponds directly to a set of triples. When the same URI is used in many triples (as in, for example, Figure 3.7), the drawing of the graph is highly connected.

From a more formal point of view in RDF, a graph is simply a set of triples. The triples might be highly connected or not connected at all; either way, a graph is just a set of triples.

When we manage data sets, we might just refer to all the triples in our data, as we have done with all the examples in this chapter so far. For most situations, this is fine. But we might want to single out a set of triples (i.e., a graph) and give that a name. Since this is the Web, that name will be in the form of a URI. The RDF standard provides a means for doing this—it is called the named graph.

The idea of a named graph is quite simple; we refer to a set of triples with a name, which itself is a URI.

Why would we want to name a graph? There are a few basic use cases:

One file, one graph. So far, we have seen examples of how we can extract RDF data from spreadsheets. We can extract RDF data from other sources as well, and indeed, we can create data natively as RDF. In the next section, we’ll see how to write down RDF data into a plain text file. When we load this data into an RDF data store, we might want to keep data from different sources separate. A convenient way to do this is to put all the data from one source into a single named graph. The name of the graph (as a URI) can even give information as to where we can find that source.

Reification. In Section 3.6, we saw the need for higher-order relationships, in which we want to make statements about statements. Named graphs provide another way to accomplish this. We put a set of triples about which we want to make some statement into a named graph, and make the statement about that graph.

Context. Sometimes when we have a set of triples, we would like to consider them in some context; for example, earlier we considered the fact Kenneth Branagh played Hamlet in the movie. In this example, in the movie (where by the movie we are referring specifically to https://www.imdb.com/title/tt0116477/) represents a context for the assertion Kenneth Branagh played Hamlet.

As an example of reification with named graphs, let’s return to the statement, Wikipedia says Shakespeare wrote Hamlet. Suppose we start with the single triple stating that fact:

Subject Predicate Object
lit:Shakespeare lit:wrote lit:Hamlet

Now, let’s add a column to this table to specify which named graph this is in. Furthermore, we’ll just use the URI https://www.wikipedia.org/ for Wikipedia (since that’s the URL for Wikipedia itself). Then we have

Subject Predicate Object Graph
lit:Shakespeare lit:wrote lit:Hamlet https://www.wikipedia.org/

This is a bit of a degenerate example, since we have a graph that contains a single triple, but there is no reason not to have graphs this small. Of course, there are a lot of other facts that are in the Wikipedia graph. In fact, there is a resource on the Web called DBpedia that does just this: it makes the data of Wikipedia available as RDF data. We describe it in detail in Chapter 5.

Named graphs are a simple extension to the RDF formalism, and really don’t change any of the basics; RDF still links one named resource to another, where each name is global in scope (i.e., on the Web). Named graphs simply allow us to manage sets of these links, and to name them as well. Sometimes when we are using named graphs, we refer to quads instead of triples; this is because it is possible to represent a triple and its graph as a four-tuple (as shown in the table above). The fourth entry in the quad is usually called the graph (as it is here), but is sometimes referred to as the context, anticipating a particular use for the named graph.
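A minimal sketch of how a store might keep quads, assuming triples are tuples of strings and graph names are URI strings. The Dataset class, the second graph name, and the lit:Play triple are hypothetical and serve only to show two graphs side by side.

```python
from collections import defaultdict


class Dataset:
    """A toy quad store: each graph name (a URI string) maps to a set of triples."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, s, p, o, graph):
        self._graphs[graph].add((s, p, o))

    def triples(self, graph=None):
        """Triples of one named graph, or of all graphs when no name is given."""
        if graph is not None:
            return set(self._graphs.get(graph, ()))
        return set().union(*self._graphs.values())

    def quads(self):
        """Every triple together with the name of the graph that holds it."""
        return sorted((s, p, o, g) for g, ts in self._graphs.items() for s, p, o in ts)


WIKIPEDIA = "https://www.wikipedia.org/"
LOCAL = "http://example.org/graphs/local"  # hypothetical second graph

data = Dataset()
data.add("lit:Shakespeare", "lit:wrote", "lit:Hamlet", WIKIPEDIA)
data.add("lit:Hamlet", "rdf:type", "lit:Play", LOCAL)  # hypothetical triple

for quad in data.quads():
    print(quad)
```

Asking for data.triples(WIKIPEDIA) returns only what came from that source, while data.triples() treats the whole data set as one graph.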

3.8 Alternatives for Serialization

So far, we have expressed RDF triples in subject/predicate/object tabular form or as graphs of boxes and arrows. Although these are simple and apparent forms to display triples, they aren’t always the most compact forms, or even the most human-friendly forms, to see the relations between entities.

The issue of representing RDF in text doesn’t only arise in books and documents about RDF; it also arises when we want to publish data in RDF on the Web. In response to this need, there are multiple ways of expressing RDF in textual form.

One might wonder why we have so many different ways to express RDF, and how they differ. It is useful to compare different serializations to different ways to write the same language; in English and other European languages, the same sentence can be printed or written in cursive script. These don’t look at all alike, and there are good reasons for why we might use one instead of the other in any particular situation. But we can copy a message from cursive to print without any loss of content. The same is true with the serializations; we can express the same triples in one serialization or the other, depending on taste, expediency, availability of tools, and so on.

N-Triples

The simplest form is called N-Triples and corresponds most directly to the raw RDF triples. It refers to resources using their fully unabbreviated URIs. Each URI is written between angle brackets (< and >). Three resources are expressed in subject/predicate/object order, followed by a period (.). For example, if the namespace mfg corresponds to http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#, then the first triple from Table 3.14 is written in N-Triples as follows:

<http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#Product1>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#Product> .

It is difficult to print N-Triples on a page in a book; the serialization does not allow for new lines within a triple (as we had to do here, to fit it on the page). An actual N-Triples file has the whole triple on a single line. The advantages of N-Triples are that they are easy to read from a file (parse) and to write into a file for importing and exporting.
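Because N-Triples uses no abbreviations, writing a triple amounts to expanding each CURIE to its full URI and joining the three terms on one line. The following Python sketch shows this under some assumptions: the helper names expand and to_ntriples_line are hypothetical, and only URI terms are handled (literals are left out).

```python
PREFIXES = {
    "mfg": "http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(curie, prefixes=PREFIXES):
    """Replace the namespace identifier of a CURIE with its full URI."""
    prefix, _, local = curie.partition(":")
    return prefixes[prefix] + local


def to_ntriples_line(s, p, o, prefixes=PREFIXES):
    """Write one triple of URI resources as a single N-Triples line."""
    return " ".join(f"<{expand(term, prefixes)}>" for term in (s, p, o)) + " ."


line = to_ntriples_line("mfg:Product1", "rdf:type", "mfg:Product")
print(line)
```

The whole triple comes out on a single line, which is what makes the format easy to parse and awkward to print.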

Turtle/N3

In this book, we use a more compact serialization of RDF called Turtle, which is itself a subset of a syntax called N3. Turtle combines the apparent display of triples from N-Triples with the terseness of CURIEs. We will introduce Turtle in this section and describe just the subset required for the current examples. We will describe more of the language as needed for later examples. For a full description of Turtle, see the W3C Recommendation [Carothers and Prud’hommeaux 2014].

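Since Turtle uses CURIEs, there must be a binding between the (local) CURIEs and the (global) URIs. Hence, Turtle begins with a preamble in which these bindings are defined; for example, we can define the CURIEs needed in the Challenge example with the following preamble:

@prefix mfg: <http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .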

Once the local CURIEs have been defined, Turtle provides a simple way to express a triple by listing three resources, using CURIE abbreviations, in subject/predicate/object order, followed by a period, such as the following:

mfg:Product1 rdf:type mfg:Product .

The final period can come directly after the resource for the object, but we often put a space in front of it, to make it stand out visually. This space is optional.

It is quite common (especially after importing tabular data) to have several triples that share a common subject. Turtle provides for a compact representation of such data. It begins with the first triple in subject/predicate/object order, as before; but instead of terminating with a period, it uses a semicolon (;) to indicate that another triple with the same subject follows. For that triple, only the predicate and object need to be specified (since it is the same subject from before). The information in Tables 3.13 and 3.14 about Product1 and Product2 appears in Turtle as follows:



When there are several triples that share both subject and predicate, Turtle provides a compact way to express this as well so that neither the subject nor the predicate needs to be repeated. Turtle uses a comma (,) to separate the objects. So the fact that Shakespeare had three children named Susanna, Judith, and Hamnet can be expressed as follows:


There are actually three triples represented here—namely:


Turtle provides some abbreviations to improve terseness and readability; in this book, we use just a few of these. One of the most widely used abbreviations is to use the word a to mean rdf:type. The motivation for this is that in common speech, we are likely to say, “Product1 is a Product” or “Shakespeare is a playwright” for the triples,


respectively. Thus we will usually write instead:
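The three abbreviations described in this section (semicolons for a shared subject, commas for a shared subject and predicate, and a for rdf:type) can be sketched as a small Python formatter. This is an illustration rather than a full Turtle writer: the function to_turtle is hypothetical, the prefix declarations are omitted, and the ex: names (ex:label, ex:hasChild, and the children's identifiers) are assumed for the example.

```python
def to_turtle(triples):
    """Group triples by subject and predicate and apply Turtle's abbreviations."""
    grouped = {}
    for s, p, o in triples:
        p = "a" if p == "rdf:type" else p  # "a" abbreviates rdf:type
        grouped.setdefault(s, {}).setdefault(p, []).append(o)

    blocks = []
    for s, predicates in grouped.items():
        # Commas separate objects that share both subject and predicate.
        lines = [f"{p} {', '.join(objects)}" for p, objects in predicates.items()]
        # Semicolons separate predicate/object pairs that share a subject.
        blocks.append(f"{s} " + " ;\n    ".join(lines) + " .")
    return "\n".join(blocks)


triples = [
    ("mfg:Product1", "rdf:type", "mfg:Product"),
    ("mfg:Product1", "ex:label", '"Product one"'),
    ("lit:Shakespeare", "ex:hasChild", "ex:Susanna"),
    ("lit:Shakespeare", "ex:hasChild", "ex:Judith"),
    ("lit:Shakespeare", "ex:hasChild", "ex:Hamnet"),
]
turtle = to_turtle(triples)
print(turtle)
```

Five triples go in and two Turtle statements come out; expanding the abbreviations again would give back the same five triples.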


RDF/XML

While Turtle is convenient for human consumption and is more compact for the printed page, many Web infrastructures are accustomed to representing information in HTML or, more generally, XML. For this reason, the W3C historically started by recommending the use of an XML serialization of RDF called RDF/XML. The information about Product1 and Product2 just shown looks as follows in RDF/XML.

In this example, the subjects (Product1 and Product2) are referenced using the XML attribute rdf:about; the triples with each of these as subjects appear as subelements within these definitions. The complete details of the RDF/XML syntax are beyond the scope of this discussion and can be found in the W3C Recommendation [Schreiber and Gandon 2014].


The same information is contained in the RDF/XML form as in the Turtle, including the declarations of the CURIEs for mfg: and rdf:. RDF/XML includes a number of rules for determining the fully qualified URI of a resource mentioned in an RDF/XML document. These details are quite involved and will not be used for the examples in this book.

JSON-LD

A more modern way to pass data from one component to another in a Web application is using JSON, the JavaScript Object Notation [Bray 2014]. In order to make linked data in RDF more available to applications that use JSON, the W3C has recommended JSON-LD, JSON for Linked Data [Kellogg et al. 2014]. There is a direct correspondence between JSON-LD and RDF triples, making JSON-LD another serialization format for RDF.

One of the motivations for having a JSON-based serialization for RDF is that developers who are accustomed to JSON but are not familiar with graph data or distributed data can build applications purely in JSON, which are nevertheless compatible with linked data.

The information about Product1 and Product2 looks as follows in JSON-LD:


The document is organized as a @graph and a @context; the context is like the prefix declarations in Turtle, in that it defines namespace abbreviations and their expansions. The graph section describes the data, organizing it into object structures as much as possible. Each object has an @id, which defines the URI of the resource (in triple terms, the subject of each triple). The rest of the object structure is in the same form as a JSON object, with the fields corresponding to predicates and the values corresponding to objects.

Optionally, each object can have a @type declaration, which corresponds to the rdf:type predicate in triples. In this case, the value is expected to be another resource, and is interpreted as such.

JSON-LD has a provision for referring to other objects as well, by using a JSON Object syntax, specifying the identity of a referred object with @id. So, if we were to say that Product1 is a part of Product2, we could say


JSON-LD provides a valuable way to exchange graph data from one application to another, while staying entirely in a conventional JavaScript environment. Its consistency with RDF allows these applications to smoothly integrate into a web of distributed data.
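The structure described in this section (@context, @graph, @id, @type, and a reference to another object) can be sketched with ordinary Python dictionaries and the standard json module. This is not a JSON-LD processor: the to_triples helper is hypothetical and reads only the simple shapes used here, and the property mfg:partOf is an assumed name for the part-of relationship.

```python
import json

MFG = "http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#"

document = {
    "@context": {"mfg": MFG},
    "@graph": [
        {
            "@id": "mfg:Product1",
            "@type": "mfg:Product",
            "mfg:partOf": {"@id": "mfg:Product2"},  # reference to another object
        },
        {"@id": "mfg:Product2", "@type": "mfg:Product"},
    ],
}


def to_triples(doc):
    """Read the objects in @graph back as (subject, predicate, object) triples."""
    triples = []
    for node in doc["@graph"]:
        subject = node["@id"]
        for key, value in node.items():
            if key == "@id":
                continue
            if key == "@type":
                triples.append((subject, "rdf:type", value))
            elif isinstance(value, dict) and "@id" in value:
                triples.append((subject, key, value["@id"]))
            else:
                triples.append((subject, key, json.dumps(value)))  # literal value
    return triples


print(json.dumps(document, indent=2))
for triple in to_triples(document):
    print(triple)
```

An application that knows nothing about RDF can treat the document as plain JSON, while an RDF-aware consumer can read the same document as three triples.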

3.9 Blank Nodes

So far, we have described how RDF can represent sets of triples, in which each subject, predicate, and object is either a resource or (in the case of the object of a triple) a literal data value. Each resource is given an identity according to the Web standard for identity, the URI. RDF also allows for resources that do not have any Web identity at all. But why would we want to represent a resource that has no identity on the Web?

Sometimes we know that something exists, and we even know some things about it, but we don’t know its identity. For instance, suppose we want to represent the conjecture that Shakespeare had a mistress, whose identity remains unknown. But we know a few things about her; she was a woman, she lived in England, and she was the inspiration for Sonnet 78.

It is simple enough to express these statements in RDF, but we need an identifier for the mistress. In Turtle, we could express them as follows:


But if we don’t want to have an identifier for the mistress, how can we proceed? RDF allows for a blank node, or bnode for short, for such a situation. If we were to indicate a bnode with a ?, the triples would look as follows:


The use of the bnode in RDF can essentially be interpreted as a logical statement, “there exists.” That is, in these statements we assert “there exists a woman, who lived in England, who was the inspiration for ‘Sonnet78’.”

But this notation (which does not constitute a valid Turtle expression) has a problem: If there is more than one blank node, how do we know which ? references which node? For this reason, Turtle instead includes a compact and unambiguous notation for describing blank nodes. A blank node is indicated by putting all the triples of which it is a subject between square brackets ([ and ]), so:


It is customary, though not required, to leave blank space after the opening bracket to indicate that we are acting as if there were a subject for these triples, even though none is specified.

We can refer to this blank node in other triples by including the entire bracketed sequence in place of the blank node. Furthermore, the abbreviation of a for rdf:type is particularly useful in this context. Thus, our entire statement about the mistress who inspired “Sonnet 78” looks as follows in Turtle:


This expression of RDF can be read almost directly as plain English: that is, “Sonnet78 has [as] inspiration a Woman [who] lived in England.” The identity of the woman is indeterminate. The use of the bracket notation for blank nodes will become particularly important when we come to describe OWL, the Web Ontology Language, since it makes very particular use of bnodes. RDF allows blank nodes in many circumstances, but apart from their specific use in OWL, their use is generally discouraged.
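One way to see how an implementation avoids the ambiguity of a bare ? is to give every blank node a label that is meaningful only inside one file or store, written with the _: syntax. The Python sketch below does this under stated assumptions: the BNodeFactory class is hypothetical, and the ex: names and lit:Sonnet78 are assumed identifiers for the mistress example.

```python
import itertools


class BNodeFactory:
    """Hand out blank node labels that are unique within one data set."""

    def __init__(self):
        self._counter = itertools.count()

    def new(self):
        return f"_:b{next(self._counter)}"


def is_blank(term):
    """Blank node labels start with "_:" and have no meaning outside the data set."""
    return term.startswith("_:")


bnodes = BNodeFactory()
woman = bnodes.new()    # "there exists" some resource, with no URI
another = bnodes.new()  # a second, distinct anonymous resource

triples = [
    (woman, "rdf:type", "ex:Woman"),
    (woman, "ex:livedIn", "ex:England"),
    ("lit:Sonnet78", "ex:hasInspiration", woman),
]

for s, p, o in triples:
    print(s, p, o)
```

The bracket notation in Turtle achieves the same effect without showing any label at all; a parser generates the labels when it reads the file.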

Ordered information in RDF

The children of Shakespeare appear in a certain order on the printed page, but from the point of view of RDF, they are in no order at all; there are just three triples, one describing the relationship between Shakespeare and each of his children. What if we do want to specify an ordering? How would we do it in RDF?

RDF provides a facility for ordering elements in a list format. An ordered list can be expressed quite easily in Turtle as follows:


This translates into the following triples, where _:a, _:b, and _:c are bnodes, and the order is indicated using two reserved properties in RDF called rdf:first and rdf:rest. The list is terminated with a reference to the resource rdf:nil:



This rendition preserves the ordering of the objects but at a cost of considerable complexity of representation. Fortunately, the Turtle representation is quite compact, so it is not usually necessary to remember the details of the RDF triples behind it.
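The expansion of an ordered list into rdf:first and rdf:rest triples can be sketched as follows. The function expand_list is hypothetical, the blank node labels are generated as _:b0, _:b1, and so on, and the ex: names for the property and the children are assumptions made for the example.

```python
def expand_list(items):
    """Return (head, triples) for an RDF list holding `items` in order.

    Each list cell is a blank node with an rdf:first (the item) and an
    rdf:rest (the next cell); the last cell points to rdf:nil.
    """
    if not items:
        return "rdf:nil", []
    cells = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, (cell, item) in enumerate(zip(cells, items)):
        rest = cells[i + 1] if i + 1 < len(cells) else "rdf:nil"
        triples.append((cell, "rdf:first", item))
        triples.append((cell, "rdf:rest", rest))
    return cells[0], triples


head, cells = expand_list(["ex:Susanna", "ex:Judith", "ex:Hamnet"])
triples = [("lit:Shakespeare", "ex:hasChild", head)] + cells

for s, p, o in triples:
    print(s, p, o)
```

A three-element list costs seven triples in all, which is the complexity that the compact Turtle notation hides.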

N-Quads

So far, we have talked about how to serialize triples. But what if we want to serialize triples in the context of one or more named graphs? The W3C provides simple extensions of the serializations for triples for use with named graphs. The simplest is called N-Quads.

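Like N-Triples, N-Quads uses no CURIEs and no prefixes. A triple is written in the form of a quad, that is, with Subject, Predicate, Object and Graph, in that order. So, to extend our example from N-Triples, if we were to say that Product1 is a Product in graph http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture, we would simply write

<http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#Product1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture#Product> <http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture> .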

This is a very long line indeed (even longer than it was in N-Triples), so it doesn’t show up well in a book, but it is simple to write and parse: no shortcuts, no prefixes, just subject, predicate, object, graph, period.

TriG

TriG is an extension of the Turtle format from Section 3.8. It includes all of Turtle’s abbreviations for namespaces and its elisions with semicolons and commas. The main difference is that it is possible to specify the URI of the graph to which a set of triples belongs. This is done simply by putting all the triples in a graph between braces ({ and }) and prefixing them with the name of the graph.

If we take the example from Section 3.8 about manufacturing products, and put them into a graph with name <http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture>, this can be expressed in TriG as follows:


In this way, we can express several graphs in a single file.
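The wrapping that TriG adds around Turtle can be sketched as a small Python function. The name to_trig is hypothetical, and the statements passed in are assumed to be valid Turtle already; the sketch only adds the prefix declarations and the braces with the graph name in front.

```python
MFG_GRAPH = "http://www.WorkingOntologist.org/Examples/Chapter3/Manufacture"


def to_trig(graph_uri, prefixes, turtle_statements):
    """Wrap Turtle statements in a named graph block, preceded by prefix declarations."""
    header = "\n".join(f"@prefix {name}: <{uri}> ." for name, uri in prefixes.items())
    body = "\n".join("    " + statement for statement in turtle_statements)
    return f"{header}\n\n<{graph_uri}> {{\n{body}\n}}"


document = to_trig(
    MFG_GRAPH,
    {"mfg": MFG_GRAPH + "#"},
    ["mfg:Product1 a mfg:Product ."],
)
print(document)
```

A file with several graphs would simply contain several such blocks after the shared prefix declarations.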

3.10 Summary

RDF is a simple standard; its job is to model distributed data on the Web. It accomplishes this job, and nothing more. If you want to model data in a distributed way, you can either use RDF or you will re-invent it; there isn’t very much to it. The basic idea of RDF is that, in a distributed setting, you need a global identifier for anything you refer to. If you want to connect that to some other thing, you will need a name for that. And if you want to connect them, you’ll need a name for the connection. All of these names have to be global.

The hypertext Web has already given us the global identifiers; we’ll see the details of how the Web infrastructure processes these identifiers, but that isn’t part of the Semantic Web, that’s part of the Web we all use every day. RDF doesn’t solve the distributed identity problem; the Web solved that, and RDF re-uses that solution.

Given this, the very minimum required for distributed representation of data on the Web is a way to connect one Web identifier (i.e., a URI) to another, with a link that is also named with a URI. This is the basis of the RDF triple. Everything else is just plumbing. If you are distributing data, you’ll want a way to store it (those are the serializations). You’ll want a way to convert data from non-distributed forms (like tables) into distributed form. RDF simply provides the infrastructure to deal with the simplest way of representing distributed information.

As a data model, RDF provides a clear specification of what has to happen to merge information from multiple sources. It does not provide algorithms or technology to implement those processes. These technologies are the topic of subsequent chapters.

Fundamental concepts

The following fundamental concepts were introduced in this chapter.

RDF (Resource Description Framework)—This distributes data on the Web.

Triple—The fundamental data structure of RDF. A triple is made up of a subject, predicate, and object.

Graph—A nodes-and-links structural view of RDF data.

Merging—The process of treating two graphs as if they were one.

URI (Uniform Resource Identifier)—A generalization of the URL (Uniform Resource Locator), which is the global name on the Web.

namespace—A set of names that belongs to a single authority. Namespaces allow different agents to use the same word in different ways.

CURIE—An abbreviated version of a URI, it is made up of a namespace identifier and a name, separated by a colon.

rdf:type—The relationship between an instance and its type.

rdf:Property—The type of any property in RDF.

Reification—The practice of making a statement about another statement. It is done in RDF using rdf:subject, rdf:predicate, and rdf:object.

N-Triples, Turtle, RDF/XML—The serialization syntaxes for RDF.

Blank nodes—RDF nodes that have no URI and thus cannot be referenced globally. They are used to stand in for anonymous entities.

1. We are primarily discussing files here, but a URI can refer to other resources. The Wikipedia article on URIs includes more than 50 different resource types that can be referenced by URIs—see http://en.wikipedia.org/wiki/URI_scheme.
