Managing Data Quality, Tim King
The data asset
Table 1.1, using children’s toy bricks, illustrates how to use these data quality dimensions to identify appropriate requirements for data.
Table 1.1 An example data set
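| ID | Type | Length | Width | Height | Colour | Studs | Purchase Date | Cost |
|---|---|---|---|---|---|---|---|---|
| 010 | Wood | 59.5 | 29.0 | 29.0 | Yellow | - | | |
| 012 | Wood | 59.5 | 28.9 | 28.9 | | N/A | 01-09-2001 | £8.42 |
| 014 | Plastic | 79.8 | 31.8 | 9.6 | Black | 10 × 4 | | |
| 015 | Plastic | 31.8 | 15.8 | 11.4 | Blue | 4 × 2 | 12-23-91 | £2 |
| 044 | Plastic | 47.8 | 7.8 | 9.6 | Grey | 6 × 1 | 27/4/14 | £7.12 |
| 045 | Wood | 60.0 | 29.5 | 28.6 | Yellow | | 15/7/15 | £4.21 |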
Accuracy: Whether the data reflect the real objects they represent. For example, looking at the records in Table 1.1, by inspecting the real object (the brick itself) we can confirm that brick 045 is a yellow wooden block with the dimensions L 60 × W 29.5 × H 28.6. If the real object turns out to be a green brick or to have different dimensions from those in the table, then the data are inaccurate.
Completeness: Whether all relevant items are recorded and all their attributes are populated. For example, the attributes for brick 010 are not complete. Similarly, if the toy box contains a brick 017, the list of bricks is not complete.
Consistency: Whether an entity recorded in more than one data store is comparable across data stores. For example, brick 012 has a purchase date of 01-09-2001, but in the purchasing system the transaction date is 04-12-2001. If that’s the case, then the data are inconsistent.
Validity: Whether data conform to the specified format. For example, the Purchase Date field contains many different date formats; which is the valid format?
Timeliness: Whether data are up to date and are available to users in a timely manner. For example, the entry for brick 045 could have been added two months after the purchase date, which is slower than the required update frequency. Additionally, if bricks are being purchased daily, then an absence of new data could indicate that the data update process has failed.
Uniqueness: Whether a single representation exists for each physical entity. For example, no ID appears twice in the table, so it is likely that all entries for these bricks are unique.
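Three of these dimensions (completeness, validity and uniqueness) can be checked mechanically. The following is a minimal Python sketch, not a definitive implementation. It uses a subset of the columns of Table 1.1, with `None` standing for an empty cell. The function names and the choice of DD-MM-YYYY as the required date format are assumptions made for illustration; the text does not state which format is valid.

```python
from collections import Counter
from datetime import datetime

# A subset of the columns of Table 1.1; None marks a cell that is empty there.
BRICKS = [
    {"id": "010", "type": "Wood", "purchase_date": None, "cost": None},
    {"id": "012", "type": "Wood", "purchase_date": "01-09-2001", "cost": "£8.42"},
    {"id": "014", "type": "Plastic", "purchase_date": None, "cost": None},
    {"id": "015", "type": "Plastic", "purchase_date": "12-23-91", "cost": "£2"},
    {"id": "044", "type": "Plastic", "purchase_date": "27/4/14", "cost": "£7.12"},
    {"id": "045", "type": "Wood", "purchase_date": "15/7/15", "cost": "£4.21"},
]

# Assumption: the data specification requires DD-MM-YYYY.
# The source text does not say which date format is the valid one.
REQUIRED_DATE_FORMAT = "%d-%m-%Y"


def incomplete_ids(rows):
    """Completeness: IDs of records with at least one unpopulated attribute."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def invalid_date_ids(rows, fmt=REQUIRED_DATE_FORMAT):
    """Validity: IDs whose populated purchase date does not parse with fmt."""
    bad = []
    for r in rows:
        value = r["purchase_date"]
        if value is None:
            continue  # a missing value is a completeness issue, not a validity one
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(r["id"])
    return bad


def duplicate_ids(rows):
    """Uniqueness: IDs that appear in more than one record."""
    counts = Counter(r["id"] for r in rows)
    return [brick_id for brick_id, n in counts.items() if n > 1]


print(incomplete_ids(BRICKS))    # ['010', '014']
print(invalid_date_ids(BRICKS))  # ['015', '044', '045']
print(duplicate_ids(BRICKS))     # []
```

Under the assumed format only brick 012 has a valid purchase date; a different required format would give a different result. The completeness check covers only unpopulated attributes: it cannot detect a brick, such as 017, that is in the toy box but missing from the list.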
This example analysis is the starting point for data quality, but further work would need to be done to provide a complete technical approach to ensure data are fit for purpose. This involves generating an explicit data specification to capture all the identified requirements and a set of tests to ensure the data meet these requirements. These tests vary from simple (e.g. comparing the content of a data set to the formal definition in the data specification of the required syntax) to complex (e.g. identifying if, for all