Managing Data Quality - Tim King

The data asset

Table 1.1, using children’s toy bricks, illustrates how to use these data quality dimensions to identify appropriate requirements for data.



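Table 1.1 An example data set

| ID | Type | Length | Width | Height | Colour | Studs | Purchase Date | Cost |
|---|---|---|---|---|---|---|---|---|
| 010 | Wood | 59.5 | 29.0 | 29.0 | Yellow | - | | |
| 012 | Wood | 59.5 | 28.9 | 28.9 | | N/A | 01-09-2001 | £8.42 |
| 014 | Plastic | 79.8 | 31.8 | 9.6 | Black | 10 × 4 | | |
| 015 | Plastic | 31.8 | 15.8 | 11.4 | Blue | 4 × 2 | 12-23-91 | £2 |
| 044 | Plastic | 47.8 | 7.8 | 9.6 | Grey | 6 × 1 | 27/4/14 | £7.12 |
| 045 | Wood | 60.0 | 29.5 | 28.6 | Yellow | | 15/7/15 | £4.21 |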

Accuracy: Whether the data reflect the real object they represent. For example, looking at the records in Table 1.1, by inspecting the real object (the bricks) we can confirm that brick 045 is a yellow wooden block with the dimensions L 60 × W 29.5 × H 28.6. If the real object turns out to be a green brick or to have different dimensions from those in the table, then the data are inaccurate.
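
An accuracy check of this kind could be sketched as follows. This is only an illustration: the `observed` values, the tolerance of 0.1 and the function name `inaccurate_fields` are assumptions made for the example, not something the table or text specifies.

```python
import math

# Record for brick 045 as it appears in Table 1.1.
recorded = {"Type": "Wood", "Length": 60.0, "Width": 29.5,
            "Height": 28.6, "Colour": "Yellow"}

def inaccurate_fields(recorded, observed, tolerance=0.1):
    """Return the fields whose recorded value disagrees with the real object."""
    mismatches = []
    for field, recorded_value in recorded.items():
        observed_value = observed[field]
        if isinstance(recorded_value, float):
            # Measurements are compared within a tolerance (assumed value).
            ok = math.isclose(recorded_value, observed_value, abs_tol=tolerance)
        else:
            ok = recorded_value == observed_value
        if not ok:
            mismatches.append(field)
    return mismatches

# Hypothetical inspection result: the real brick turns out to be green.
observed = dict(recorded, Colour="Green")
print(inaccurate_fields(recorded, observed))  # ['Colour']
```

Accuracy can only be tested against the real object or a trusted reference, which is why the sketch needs an `observed` record in addition to the stored one.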

Completeness: Whether all relevant items are recorded and all their attributes are populated. For example, the attributes for brick 010 are not complete. Similarly, if the toy box contains a brick 017, the list of bricks is not complete.
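
Both aspects of completeness, unpopulated attributes and unrecorded items, can be sketched in a few lines. The example assumes that the unpopulated attributes of brick 010 are Purchase Date and Cost, and it uses `None` to stand for an empty cell; the function names and the list of IDs found in the toy box are hypothetical.

```python
# A few attributes from Table 1.1; None marks a cell with no value
# (assumption: brick 010 lacks a Purchase Date and a Cost).
bricks = {
    "010": {"Type": "Wood", "Colour": "Yellow",
            "Purchase Date": None, "Cost": None},
    "045": {"Type": "Wood", "Colour": "Yellow",
            "Purchase Date": "15/7/15", "Cost": "£4.21"},
}

def missing_attributes(records):
    """Map each ID to the list of its attributes that have no value."""
    result = {}
    for brick_id, attrs in records.items():
        empty = [name for name, value in attrs.items() if value is None]
        if empty:
            result[brick_id] = empty
    return result

def missing_records(records, ids_in_toy_box):
    """Return IDs of physical bricks that have no record at all."""
    return sorted(set(ids_in_toy_box) - set(records))

print(missing_attributes(bricks))                      # {'010': ['Purchase Date', 'Cost']}
print(missing_records(bricks, ["010", "017", "045"]))  # ['017']
```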

Consistency: Whether an entity recorded in more than one data store is comparable across data stores. For example, brick 012 has a purchase date of 01-09-2001, but in the purchasing system the transaction date is 04-12-2001. If that’s the case, then the data are inconsistent.
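
A consistency test compares the same attribute of the same entity across data stores. The sketch below uses the two dates quoted above; the dictionary names and the function `inconsistent_ids` are illustrative assumptions.

```python
# Purchase date of brick 012 as held in two data stores (values from the text).
brick_table = {"012": "01-09-2001"}
purchasing_system = {"012": "04-12-2001"}

def inconsistent_ids(store_a, store_b):
    """Return IDs present in both stores whose values differ."""
    shared = store_a.keys() & store_b.keys()
    return sorted(i for i in shared if store_a[i] != store_b[i])

print(inconsistent_ids(brick_table, purchasing_system))  # ['012']
```

Comparing the raw strings only works if both stores write dates in the same format; otherwise the values would need to be converted to a common representation before the comparison.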

Validity: Whether data conform to the specified format. For example, the Purchase Date field contains many different date formats; which is the valid format?
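
Once a format has been specified, a validity test is straightforward. The sketch below assumes, purely for illustration, that the specification requires dates written as DD-MM-YYYY; the text does not say which format is the valid one.

```python
import re
from datetime import datetime

# Purchase Date values as they appear in Table 1.1.
purchase_dates = ["01-09-2001", "12-23-91", "27/4/14", "15/7/15"]

# Hypothetical specification: two-digit day, two-digit month, four-digit year.
PATTERN = re.compile(r"\d{2}-\d{2}-\d{4}")

def is_valid(value):
    """True if the value matches the assumed syntax and is a real calendar date."""
    if not PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%d-%m-%Y")
    except ValueError:
        return False
    return True

for value in purchase_dates:
    print(value, is_valid(value))
```

Under this assumed specification only 01-09-2001 passes. Note that a valid value is not necessarily an accurate one: the test cannot tell whether 01-09-2001 was meant as 1 September or 9 January.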

Timeliness: Whether data are up to date and are available to users in a timely manner. For example, the entry for brick 045 could have been added two months after the purchase date, which is slower than the required update frequency. Additionally, if bricks are being purchased daily, then an absence of new data could indicate that the data update process has failed.
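
Both timeliness checks reduce to comparing dates against an agreed limit. In the sketch below, the entry date, the seven-day limit and the one-day feed interval are hypothetical, and 15/7/15 is read as 15 July 2015.

```python
from datetime import date, timedelta

def is_late(purchase_date, entry_date, max_lag=timedelta(days=7)):
    """True if the record was entered later than the allowed lag after purchase."""
    return entry_date - purchase_date > max_lag

def feed_looks_stalled(latest_entry, today, expected_interval=timedelta(days=1)):
    """True if no new record has arrived within the expected interval."""
    return today - latest_entry > expected_interval

# Brick 045: purchased 15 July 2015, entered two months later (assumed date).
print(is_late(date(2015, 7, 15), date(2015, 9, 15)))              # True
# Bricks are bought daily, but the newest entry is five days old.
print(feed_looks_stalled(date(2015, 9, 15), date(2015, 9, 20)))   # True
```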

Uniqueness: Whether a single representation exists for each physical entity. For example, no ID appears twice in the table, so it is likely that all entries for these bricks are unique.
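
The check on repeated IDs can be sketched as below; the function name `duplicate_ids` is an assumption made for the example.

```python
from collections import Counter

# IDs as listed in Table 1.1.
ids = ["010", "012", "014", "015", "044", "045"]

def duplicate_ids(ids):
    """Return every ID that appears more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)

print(duplicate_ids(ids))  # []
```

An empty result shows only that no ID is repeated. It would not reveal a single physical brick recorded twice under different IDs, which is why the conclusion above is "likely" rather than certain.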

This example analysis is the starting point for data quality, but further work would need to be done to provide a complete technical approach to ensure data are fit for purpose. This involves generating an explicit data specification to capture all the identified requirements and a set of tests to ensure the data meet these requirements. These tests vary from simple (e.g. comparing the content of a data set to the formal definition in the data specification of the required syntax) to complex (e.g. identifying if, for all
