Enterprise AI For Dummies - Zachary Jarvinen

Cleaning the data

What’s worse than no data? Dirty data. Dirty data is poorly structured, poorly formatted, inaccurate, or incomplete.

For example, you might expect it would be easy for a system to scan a document and extract a date, until you reflect that Microsoft Excel alone has 17 different date formats, as shown in Figure 3-6.
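To make the date problem concrete, here is a minimal Python sketch of one common approach: try a short list of known formats in turn and flag anything that matches none of them. The format list and the name `parse_date` are assumptions chosen for illustration, not something the book prescribes, and the list covers only a handful of the formats a real source might contain.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats to try, in priority order.
# Month names (%b, %B) are matched using the current locale (English by default).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2015-10-01
    "%m/%d/%Y",   # 10/01/2015 (US order: month first)
    "%d-%b-%Y",   # 01-Oct-2015
    "%B %d, %Y",  # October 1, 2015
]


def parse_date(text: str) -> Optional[date]:
    """Return the first successful parse, or None if no format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # wrong format or impossible value; try the next one
    return None


if __name__ == "__main__":
    for raw in ["2015-10-01", "10/01/2015", "01-Oct-2015", "47/01/2015"]:
        print(raw, "->", parse_date(raw))
```

A value such as 03/04/2015 is ambiguous between month-first and day-first conventions. This sketch simply takes whichever format appears first in the list, so the order has to reflect what you know about the source.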

Table 3-3 shows several ways that dirty data can manifest.


FIGURE 3-6: Microsoft Excel supports 17 date formats.

TABLE 3-3 Types of Dirty Data

Type            Example
Incomplete      Empty or null values (the most prevalent type of bad data)
Incorrect       A date with a 47 in the month or day position
Inaccurate      A date with a valid month value (1-12) but the wrong month
Inconsistent    Different formats or terms for the same meaning
Duplicate       Two or more occurrences of the same record
Rule violation  Starting date falls after ending date
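Most of the types in Table 3-3 can be detected mechanically. The Python sketch below shows one way such checks might look. The sample records, the field names (`id`, `state`, `start`, `end`), and the `CANONICAL_STATES` set are all hypothetical, invented only for this illustration. Inaccurate values (a valid but wrong month) are deliberately not covered, because finding them requires comparison against a trusted reference source.

```python
from datetime import date

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "state": "CA", "start": "2015-10-01", "end": "2015-10-05"},
    {"id": 2, "state": "California", "start": "2015-10-01", "end": "2015-10-05"},
    {"id": 3, "state": "", "start": "2015-10-01", "end": "2015-10-05"},
    {"id": 4, "state": "CA", "start": "2015-47-01", "end": "2015-10-05"},
    {"id": 5, "state": "CA", "start": "2015-10-09", "end": "2015-10-05"},
    {"id": 1, "state": "CA", "start": "2015-10-01", "end": "2015-10-05"},
]

CANONICAL_STATES = {"CA"}  # assumed set of accepted spellings


def parse_iso(value):
    """Parse a YYYY-MM-DD string; return None if it is missing or invalid."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def find_issues(rows):
    """Return a list of (row index, issue type) pairs."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:  # Duplicate: identical record already seen
            issues.append((i, "duplicate"))
            continue
        seen.add(key)
        if any(v is None or v == "" for v in row.values()):
            issues.append((i, "incomplete"))  # empty or null value
        state = row.get("state")
        if state and state not in CANONICAL_STATES:
            issues.append((i, "inconsistent"))  # different term, same meaning
        start, end = parse_iso(row.get("start")), parse_iso(row.get("end"))
        for name, parsed in (("start", start), ("end", end)):
            if row.get(name) and parsed is None:
                issues.append((i, "incorrect"))  # e.g. a 47 in the month position
        if start and end and start > end:
            issues.append((i, "rule violation"))  # start falls after end
    return issues


if __name__ == "__main__":
    for index, issue in find_issues(records):
        print(f"row {index}: {issue}")
```

In this sketch a duplicate means an exact copy of an earlier record. Near-duplicates, such as the same customer entered with two spellings, need fuzzy matching and are out of scope here.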

Why is dirty data worse? Because it costs you more.

For most companies, bad data costs from 15 to 25 percent of revenue as workers research a valid source, correct errors, and deal with the complications that result from relying on bad data.

The solution is to focus on data, not models. Not surprisingly, in a recent CrowdFlower survey, data scientists said their two most time-consuming tasks were cleaning and organizing data (60 percent) and collecting datasets (19 percent). In the same survey, they also named those two tasks as the least enjoyable parts of their job: cleaning and organizing data (57 percent) and collecting datasets (21 percent).

Here’s a particularly trenchant example of the importance of data quality. In 2015, the International Classification of Diseases, Ninth Revision (ICD-9) coding system used for medical claims was replaced by the more robust, more specific ICD-10 system. ICD-10 provides a higher level of specificity that includes diagnoses, symptoms, site, severity, and treatments. Health providers had the option to use the simpler unspecified ICD-9 codes during the first year as they learned and became accustomed to the more complex system.

During the one-year grace period, many providers just continued to use the ICD-9 codes rather than transition to the more accurate ICD-10 codes, and their automated claims submissions reflected the less specific data. Claim denials increased, which meant more work for the providers who had to retroactively collect supporting documentation to appeal the denial or face loss of revenue. If they had submitted the claims with the more accurate, although a bit more complex, ICD-10 codes, the extra work wouldn’t have been necessary.

You can take this anecdote a step further. Imagine that a few years later, the facility that didn’t upgrade to ICD-10 codes decides to transition to an AI-enabled medical records system to not only streamline document intake, but also serve as a database for medical history and diagnosis. The facility loses all the potential benefit of diagnostic insights from the history for the “dark year.”

An Alegion study found that two of the top three problems with training data relate to dirty data.
