Normalization
The last topic in this chapter is perhaps the toughest of all. Geodatabases that support data maintenance benefit from normalization, which is the process of removing data redundancies and dependencies. A database with data redundancies and dependencies is denormalized. You have already seen examples of normalization, such as the conversion of a many-to-many relationship to a pair of one-to-many relationships. The use of foreign keys is also part of normalization. This section will describe the breadth of normalization and why it is important.
Each table included in a geodatabase has a primary key, OBJECTID, which uniquely identifies each row and thus each member of the class. You cannot rely on record numbers, as with sequential flat files, or on the row numbers that appear in spreadsheets, because there is no inherent ordering of rows in a relational database; typically, each new or modified row is simply appended to the end of the table. Primary keys can be simple (one field) or composite (multiple fields), but the result must always be unique: no two rows can have the same primary key value. You do not want to use a manually entered attribute as a primary key, because people make mistakes and can enter the same value twice. Instead, let the computer create the values in your primary key through such RDBMS functions as integer counters and date/time functions.
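To make this concrete, here is a minimal sketch using SQLite through Python rather than any particular geodatabase API; the table and column names are hypothetical, and a real geodatabase generates OBJECTID for you:

import sqlite3

# In-memory database for illustration; a real geodatabase manages OBJECTID itself.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE RoadSegment (
        OBJECTID  INTEGER PRIMARY KEY AUTOINCREMENT,  -- system-generated, never typed by hand
        RouteName TEXT,
        Lanes     INTEGER
    )
""")
conn.execute("INSERT INTO RoadSegment (RouteName, Lanes) VALUES ('Main St', 2)")
conn.execute("INSERT INTO RoadSegment (RouteName, Lanes) VALUES ('Elm St', 4)")

# Each row received a unique OBJECTID without anyone having to supply one.
for row in conn.execute("SELECT OBJECTID, RouteName, Lanes FROM RoadSegment"):
    print(row)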
In addition to a primary key, a relational table can include one or more foreign keys. A foreign key in one table points to a related record in another table. It does this by having the same values in both tables. Say, for example, that you have a roadway inventory that includes a functional classification column. Rather than store the full name for each functional class, you will likely store the two-digit coded value in the inventory table, and put the coded values and their meanings in a functional class domain. Functional class is a foreign key. There will be lots of road segments with any given functional class, but each segment record must contain only one. Therefore, this is a one-to-many relationship: one functional class to many roads.
By the way, a foreign key always points to a candidate primary key, a column whose values are unique and that could therefore serve as the primary key. This means you do not have to reference the real primary key to enforce the relationship, just a column that could serve as one. (Whatever you reference must be unique; you cannot take any chances there.) Using functional class as an example, the coded value assigned to each class is unique within the domain, which makes it a candidate primary key. You could store either the actual primary key or the coded value in the roadway inventory table. You will usually choose to store the coded value, since it tells the knowledgeable user something without a trip to the functional class domain to see what it means.
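A similar sketch for the functional class example; the codes, descriptions, and table names are placeholders rather than values from any agency's actual domain:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Domain (lookup) table: the coded value is unique, so it is a candidate key.
    CREATE TABLE FunctionalClass (
        FuncClassCode TEXT PRIMARY KEY,
        Description   TEXT
    );

    -- Inventory table stores only the code; one functional class to many road segments.
    CREATE TABLE RoadInventory (
        OBJECTID      INTEGER PRIMARY KEY AUTOINCREMENT,
        RouteName     TEXT,
        FuncClassCode TEXT REFERENCES FunctionalClass (FuncClassCode)
    );

    INSERT INTO FunctionalClass VALUES ('01', 'Interstate'), ('14', 'Urban minor arterial');
    INSERT INTO RoadInventory (RouteName, FuncClassCode) VALUES ('I-95', '01'), ('Oak Ave', '14');
""")

# Resolve the coded value through the domain only when the full description is needed.
query = """
    SELECT r.RouteName, f.Description
    FROM RoadInventory r JOIN FunctionalClass f USING (FuncClassCode)
"""
for row in conn.execute(query):
    print(row)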
Most spatial databases are normalized to the extent that they store information about discrete features, such as a parcel, a house, or a lake. Transportation data modelers are not so lucky. Linear transportation facilities are not really discrete. Agencies impose various segmentation methods to make linear facilities discrete, but in the process create denormalized data structures, such as when a street is subdivided into block-length segments. Denormalized data structures are fine, almost required, for mapping, but they are very difficult to maintain if you are trying to store a lot of data. The problem is data redundancy and the resulting need to change the same piece of data in several places. If the entire street has an asphalt surface, then you will need to repeat that information for each segment.
Data dependencies and redundancies are inefficient, but data redundancies can be downright dangerous, because they can undermine database integrity. Put the same information in more than one place, and you will likely find that those places eventually hold different values. Store a road's speed limit in several places, and when the limit changes you must find every one of those places and change them all at once; otherwise, some users will get a different answer depending on which record they read. Normalization is a big deal with large datasets.
Normalization eliminates redundancies and dependencies so that each piece of data is in only one place for editing. There are five cumulative forms of normalization:
• First Order (First Normal Form, or 1NF) = Attribute domains consist of only scalar values (field contains only a single value) and each row has the same columns
• Second Order (2NF) = 1NF plus every nonkey attribute is irreducibly dependent on the primary key
• Third Order (3NF) = 2NF plus every nonkey attribute is nontransitively dependent on the primary key
• Fourth Order (4NF) = 3NF plus the table cannot contain two or more independent multivalued facts
• Fifth Order (5NF) = 4NF plus a symmetry constraint
Do not worry about trying to figure out what all this jargon means. An example will make it simple. If you are a computer scientist, you can skip ahead to the end of the chapter.
The example involves an employee database for a national company with offices in multiple cities. The primary key is EmployeeID. The Employee table stores information about each employee, such as where they work.
It is your job to ensure a normalized database design for the Employee table. First Normal Form (1NF) is easy to achieve. You just have to make sure that each row includes the same columns and that every column contains only scalar values. A scalar value is atomic, or indivisible: each field holds a single value, not a list. Here it means you cannot pile a bunch of different people into one row. You get one employee ID, one department, one building, and one city for each employee in the table. There are two skill fields, but each holds a single value. 1NF: Mission accomplished.
Table 3.1a Employee table
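As a rough illustration in code, a 1NF Employee table might look like the following; the column set follows the description above, while the buildings, cities, and skills are invented:

# 1NF: every row has the same columns, and every field holds a single, atomic value.
employees_1nf = [
    {"EmployeeID": 101, "Dept": "shipping",  "Building": "Annex", "City": "Chicago",
     "Skill1": "forklift", "Skill2": "packing"},
    {"EmployeeID": 104, "Dept": "personnel", "Building": "Tower", "City": "Atlanta",
     "Skill1": "records", "Skill2": None},
    {"EmployeeID": 102, "Dept": "sales",     "Building": "Tower", "City": "Atlanta",
     "Skill1": "Spanish", "Skill2": "forklift"},
]

# The seed of later trouble: Building and City repeat for every employee in the same department.
for row in employees_1nf:
    print(row["EmployeeID"], row["Dept"], row["Building"], row["City"])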
The Employee table is now in 1NF. The next step, reaching 2NF, removes nonkey dependencies, which means eliminating fields that depend on each other rather than on the thing the table is about. The Employee table is about employees, so reaching 2NF means removing fields that do not describe an employee. In this case, the city in which the building is located is determined by the building, not the employee. (One city to a building.) So you can create a lookup table for buildings that lists each building name and the city it is in.
Table 3.1b Employee table
Table 3.1c Building table
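Continuing the illustration, the 2NF step drops City from the Employee table and adds a Building lookup; the particular buildings and cities remain invented:

# 2NF: City is determined by Building, not by the employee, so it moves to a lookup table.
employees_2nf = [
    {"EmployeeID": 101, "Dept": "shipping",  "Building": "Annex",
     "Skill1": "forklift", "Skill2": "packing"},
    {"EmployeeID": 104, "Dept": "personnel", "Building": "Tower",
     "Skill1": "records", "Skill2": None},
]

# Building table: one row per building, so each city is stored exactly once.
buildings = {"Annex": "Chicago", "Tower": "Atlanta"}

for emp in employees_2nf:
    print(emp["EmployeeID"], emp["Building"], buildings[emp["Building"]])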
Achieving 3NF requires more work. The City field is unique to each building; that is, building determines city. It turns out that each department is contained in one and only one building, which makes building dependent on department and reduces city to a secondary dependency. Reaching 3NF involves eliminating secondary dependencies, in which the value of one attribute depends on another nonkey attribute. The Building field is such a case: its value is determined by Dept, not by EmployeeID. A Department table needs to be added to the design.
Table 3.1d Employee table
Table 3.1e Department table
Table 3.1f Building table
Incidentally, you cannot simply turn the Building table into a Department table. That would cause the Building table to violate 3NF. The City field would have a secondary dependency on DeptID.
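In the same vein, a 3NF version moves Building into a Department table, since Dept rather than EmployeeID determines it; names are again illustrative:

# 3NF: Building depends on Dept, so the Employee table keeps only what describes the employee.
employees_3nf = [
    {"EmployeeID": 101, "Dept": "shipping",  "Skill1": "forklift", "Skill2": "packing"},
    {"EmployeeID": 104, "Dept": "personnel", "Skill1": "records", "Skill2": None},
]

departments = {"shipping": "Annex", "personnel": "Tower", "sales": "Tower"}  # Dept -> Building
buildings = {"Annex": "Chicago", "Tower": "Atlanta"}                         # Building -> City

# An employee's city is reached by walking the chain of one-to-many relationships.
for emp in employees_3nf:
    building = departments[emp["Dept"]]
    print(emp["EmployeeID"], emp["Dept"], building, buildings[building])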
4NF removes redundant columns, like the two employee skill columns in the Employee table. The original design with two skill columns would force you to look in both columns to see if an employee had a particular skill. You need to work smarter, not harder, and move employee skills to another table where you can have multiple records for each employee, one row for each skill an employee possesses. The database design shown below is in 4NF. You now have four tables, one for assigning employees to departments, one for identifying the building in which a department is located, one for saying where a building is located, and one for describing the special skills each employee possesses. The database design is now fully normalized because there are no redundancies or secondary dependencies.
Table 3.1g Employee table

EmployeeID | Dept
101 | shipping
104 | personnel
102 | sales
106 | shipping
Table 3.1h Department table
Table 3.1i Building table
Table 3.1j EmployeeSkill table
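A sketch of the resulting 4NF design, with the repeating skill columns replaced by an EmployeeSkill table that holds one row per employee per skill; the skills listed are placeholders:

# 4NF: one row per employee per skill, so any skill search looks in a single column.
employees = [
    {"EmployeeID": 101, "Dept": "shipping"},
    {"EmployeeID": 104, "Dept": "personnel"},
    {"EmployeeID": 102, "Dept": "sales"},
    {"EmployeeID": 106, "Dept": "shipping"},
]
departments = {"shipping": "Annex", "personnel": "Tower", "sales": "Tower"}
buildings = {"Annex": "Chicago", "Tower": "Atlanta"}
employee_skills = [
    {"EmployeeID": 101, "Skill": "forklift"},
    {"EmployeeID": 101, "Skill": "packing"},
    {"EmployeeID": 102, "Skill": "forklift"},
    {"EmployeeID": 104, "Skill": "records"},
]

# Who can drive a forklift? One column to check, however many skills an employee has.
forklift_operators = [s["EmployeeID"] for s in employee_skills if s["Skill"] == "forklift"]
print(forklift_operators)  # [101, 102]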
Normalization puts each piece of information in one place. If you move the Personnel Department to a new building in Chicago, updating the entire list of employee locations requires only a single change in the Department table. With the original design, you would have needed to change the record for each employee in that department.
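The payoff can be shown in a single line of the sketch: relocating the Personnel Department to a hypothetical new Chicago building touches one entry, and every employee lookup immediately reflects the change:

departments = {"shipping": "Annex", "personnel": "Tower", "sales": "Tower"}  # Dept -> Building
buildings = {"Annex": "Chicago", "Tower": "Atlanta", "Lakeside": "Chicago"}  # Building -> City

# One edit in the Department table relocates every Personnel employee at once.
departments["personnel"] = "Lakeside"
print(departments["personnel"], buildings[departments["personnel"]])  # Lakeside Chicago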
Linear transportation facilities are segmented to provide a one-to-one relationship between attribute records and geometry. Such a segmented transportation database is not normalized in that changing a single attribute will often require that multiple records be updated. Consider the following example Street table rows, where SegmentID is the primary key.
Table 3.2a STREET table
There are four rows with information about Sesame Street, which is segmented according to address block. The STREET table has redundancies. The problem this time is with rows, not columns. There are multiple rows with exactly the same values in the Name, Speed, Lanes, and Pave_Cond columns. There must be only one row for each unique piece of information for the database to be normalized. The values in Speed, Lanes, and Pave_Cond are dependent on street name and address range; i.e., they are nonkey dependencies. Normalizing this database will require creating separate tables for each linear attribute that may span multiple street segments.
Table 3.2b STREET table
Table 3.2c SPEED_LIMIT table
Table 3.2d NUMBER_OF_LANES table
Table 3.2e PAVEMENT_CONDITION table
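A hedged sketch of that separate-table approach; the address breaks, speed limits, lane counts, and condition ratings are invented for illustration:

# Each linear attribute gets its own table, keyed by street name and address range,
# so a value that spans several blocks is stored only once. All values are invented.
street = [
    {"SegmentID": 1, "Name": "Sesame Street", "From_Addr": 100, "To_Addr": 199},
    {"SegmentID": 2, "Name": "Sesame Street", "From_Addr": 200, "To_Addr": 299},
    {"SegmentID": 3, "Name": "Sesame Street", "From_Addr": 300, "To_Addr": 399},
    {"SegmentID": 4, "Name": "Sesame Street", "From_Addr": 400, "To_Addr": 499},
]
speed_limit = [
    {"Name": "Sesame Street", "From_Addr": 100, "To_Addr": 299, "Speed": 30},
    {"Name": "Sesame Street", "From_Addr": 300, "To_Addr": 499, "Speed": 25},
]
number_of_lanes = [
    {"Name": "Sesame Street", "From_Addr": 100, "To_Addr": 399, "Lanes": 2},
    {"Name": "Sesame Street", "From_Addr": 400, "To_Addr": 499, "Lanes": 4},
]
pavement_condition = [
    {"Name": "Sesame Street", "From_Addr": 100, "To_Addr": 199, "Pave_Cond": "good"},
    {"Name": "Sesame Street", "From_Addr": 200, "To_Addr": 499, "Pave_Cond": "fair"},
]

# Changing the speed limit on the upper half of the street is now a one-record edit.
speed_limit[1]["Speed"] = 35
print(speed_limit[1])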
There is a little more work to do. Look at the SegmentID column. Each table has its own SegmentID sequence. Imagine this database actually being composed of shapefiles or feature classes rather than relational tables. Instead of having four nonoverlapping, block-length street segments to define the four blocks of Sesame Street, there are now 10 overlapping street segments. What a maintenance nightmare!
The whole concept of linear measurement systems was created to provide some degree of normalization for transportation databases. An event table database design using linear measures would have this structure:
Table 3.3a STREET table
Table 3.3b LINEAR_EVENT table
This design is fully normalized because you only need to go to one record to change any piece of information. Of course, some changes may require writing additional records, as when you subdivide an existing linear event or create a new one, but finding and changing existing data is easy. Turn the STREET table into a feature class, add dynamic segmentation, and you have the traditional route-milelog method of managing data used at many state DOTs.
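A minimal sketch of the event-table idea, storing each attribute as a from-measure/to-measure event along the route; the measures, attribute names, and structure of the LINEAR_EVENT table are assumptions for illustration:

# Linear referencing: one STREET row per route, and attributes stored as events
# located by measures along the route. Measures, values, and column names are invented.
street = [{"StreetID": "SS-1", "Name": "Sesame Street", "Length_mi": 0.48}]

linear_events = [
    {"StreetID": "SS-1", "From_Meas": 0.00, "To_Meas": 0.48, "Attribute": "Speed",     "Value": 30},
    {"StreetID": "SS-1", "From_Meas": 0.00, "To_Meas": 0.48, "Attribute": "Lanes",     "Value": 2},
    {"StreetID": "SS-1", "From_Meas": 0.00, "To_Meas": 0.25, "Attribute": "Pave_Cond", "Value": "good"},
    {"StreetID": "SS-1", "From_Meas": 0.25, "To_Meas": 0.48, "Attribute": "Pave_Cond", "Value": "fair"},
]

# What applies at milepoint 0.30? Select the events whose measure range contains it.
for event in linear_events:
    if event["From_Meas"] <= 0.30 < event["To_Meas"]:
        print(event["Attribute"], event["Value"])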
The design is fully normalized, but the fifth form of normalization has yet to be used, and 5NF cannot always be achieved. This step actually eliminates stored data altogether: you rely on relationships to reconstruct the data you dropped.
Suppose you have assigned each customer of your business to one salesman. One customer has one salesman, and all purchases by that customer are credited to that salesman. But you also need to know what each customer purchased, so you can ship it and keep track of orders when the customer calls. This might lead you to believe that you need to keep sales records by both salesman and customer. 5NF says you only need to store each purchase once, under the customer; because you also store, once, which salesman serves that customer, sales by salesman can be reconstructed through that relationship. (You cannot, by the way, store the sales data only by the salesman, since each salesman has more than one customer.)
Table 3.4a CUSTOMER table
Table 3.4b SALESMAN table
Table 3.4c ORDER table
Table 3.4d ORDER_ITEM table
Using this structure, you can find the items sold by a particular salesman, or to a specific customer, by querying the ORDER table and the tables related to it. You can then perform operations such as calculating a salesman's commission by multiplying the total value of that salesman's sales by the commission rate. You could also track the sales of specific items by salesman for special promotional incentives. This database is in 5NF.
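A sketch of how the reconstruction works in such a design, again using SQLite through Python; the customers, salesmen, items, and commission rate are all invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER   (CustomerID INTEGER PRIMARY KEY, Name TEXT, SalesmanID INTEGER);
    CREATE TABLE SALESMAN   (SalesmanID INTEGER PRIMARY KEY, Name TEXT, CommissionRate REAL);
    CREATE TABLE "ORDER"    (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);  -- ORDER is reserved, so it is quoted
    CREATE TABLE ORDER_ITEM (OrderID INTEGER, Item TEXT, Amount REAL);

    INSERT INTO SALESMAN   VALUES (1, 'Jones', 0.05);
    INSERT INTO CUSTOMER   VALUES (10, 'Acme', 1), (11, 'Zenith', 1);
    INSERT INTO "ORDER"    VALUES (100, 10), (101, 11);
    INSERT INTO ORDER_ITEM VALUES (100, 'widget', 250.0), (101, 'gadget', 400.0);
""")

# Sales by salesman are never stored twice; they are reconstructed through the customer assignment.
query = """
    SELECT s.Name, SUM(oi.Amount) AS total_sales, SUM(oi.Amount) * s.CommissionRate AS commission
    FROM SALESMAN s
    JOIN CUSTOMER c    ON c.SalesmanID = s.SalesmanID
    JOIN "ORDER" o     ON o.CustomerID = c.CustomerID
    JOIN ORDER_ITEM oi ON oi.OrderID = o.OrderID
    GROUP BY s.SalesmanID
"""
for row in conn.execute(query):
    print(row)  # ('Jones', 650.0, 32.5)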