Data capture: The second place to look is the data entering your organization. It can arrive in many forms, but you can use data extraction, metadata extraction, and categorization to supplement data. For example, you can run paper documents, whether handwritten or printed, through an optical character recognition system to digitize them in preparation for processing. Then they can join the rest of the digital data, such as emails, PDF files, Word documents, images, voice mail messages, videos, and other formats to be classified and populate the data store that will feed your AI insights.
Data as a service (DaaS): If there are still holes in your data requirements, you can turn to third-party data for purchase, either commercial datasets from providers such as AccuWeather or public datasets such as data.gov and Kaggle.com. Broadening your datasets can increase the insights lurking in your own data.
Cleaning the data
What’s worse than no data? Dirty data. Dirty data is poorly structured, poorly formatted, inaccurate, or incomplete.
For example, you might expect it would be easy for a system to scan a document and extract a date — until you reflect that Microsoft Excel alone has 17 different date formats, as shown in Figure 3-6.
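To see why date extraction is harder than it looks, here is a minimal sketch of trying several formats in turn. The format list is illustrative only, not Excel's actual set of 17:

```python
from datetime import datetime

# A few of the many date formats a scanned document might contain
# (illustrative examples, not Excel's actual list of 17).
DATE_FORMATS = [
    "%m/%d/%Y",    # 03/14/2025
    "%d-%b-%y",    # 14-Mar-25
    "%B %d, %Y",   # March 14, 2025
    "%Y-%m-%d",    # 2025-03-14
]

def parse_date(text):
    """Try each known format until one matches; return None if none do."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Every format the list misses becomes an extraction failure — one reason digitized documents so often yield incomplete data.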
Table 3-3 shows several ways that dirty data can manifest.
FIGURE 3-6: Microsoft Excel supports 17 date formats.
TABLE 3-3 Types of Dirty Data
Type | Example
Incomplete | Empty or null values — the most prevalent type of bad data |
Incorrect | A date with a 47 in the month or day position |
Inaccurate | A date with a valid month value (1-12) but the wrong month
Inconsistent | Different formats or terms for the same meaning |
Duplicate | One or more occurrences of the same record |
Rule violation | Starting date falls after ending date |
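The checks in Table 3-3 can be automated. Below is a hypothetical validation sketch; the field names (`name`, `month`, `day`, `start`, `end`) are invented for illustration:

```python
# Hypothetical record-validation sketch covering several dirty-data
# types from Table 3-3; field names are invented for illustration.
def find_problems(record, seen_keys):
    problems = []
    # Incomplete: empty or null values
    if not record.get("name"):
        problems.append("incomplete")
    # Incorrect: month or day outside the valid range
    month, day = record.get("month", 0), record.get("day", 0)
    if not (1 <= month <= 12) or not (1 <= day <= 31):
        problems.append("incorrect")
    # Duplicate: same record seen before
    key = (record.get("name"), record.get("month"), record.get("day"))
    if key in seen_keys:
        problems.append("duplicate")
    seen_keys.add(key)
    # Rule violation: starting date falls after ending date
    if record.get("start") and record.get("end") and record["start"] > record["end"]:
        problems.append("rule violation")
    return problems
```

Catching these problems at intake is far cheaper than letting them propagate into the data store that feeds your AI.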
For most companies, bad data costs from 15 to 25 percent of revenue as workers research a valid source, correct errors, and deal with the complications that result from relying on bad data.
The solution is to focus on data, not models. Not surprisingly, in a recent CrowdFlower survey, data scientists said their top two time-consuming tasks were cleaning and organizing data (60 percent) and collecting datasets (19 percent). In the same survey, they identified the least enjoyable parts of their job as cleaning and organizing data (57 percent) and collecting datasets (21 percent).
Here’s a particularly trenchant example of the importance of data quality. In 2015, the International Classification of Diseases, Ninth Revision (ICD-9) coding system used for medical claims was replaced by the more robust, more specific ICD-10 system. ICD-10 provides a higher level of specificity that includes diagnoses, symptoms, site, severity, and treatments. Health providers had the option to use the simpler unspecified ICD-9 codes during the first year as they learned and became accustomed to the more complex system.
During the one-year grace period, many providers just continued to use the ICD-9 codes rather than transition to the more accurate ICD-10 codes, and their automated claims submissions reflected the less specific data. Claim denials increased, which meant more work for the providers who had to retroactively collect supporting documentation to appeal the denial or face loss of revenue. If they had submitted the claims with the more accurate, although a bit more complex, ICD-10 codes, the extra work wouldn’t have been necessary.
You can take this anecdote a step further. Imagine that a few years later, the facility that didn’t upgrade to ICD-10 codes decides to transition to an AI-enabled medical records system to not only streamline document intake, but also serve as a database for medical history and diagnosis. The facility loses all the potential benefit of diagnostic insights from the history for that “dark year.”
An Alegion study found that two of the top three problems with training data relate to dirty data.
Defining Use Cases
“I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”
— Abraham Maslow, The Psychology of Science: A Reconnaissance, 1966
To hear some tell it, AI is the panacea to solve all the problems of the world; to hear others tell it, AI will lead to the singularity and the destruction of human civilization. As usual, the truth lies somewhere in between.
As with any tool, AI does some things well and other things not so well. If you’re upholstering a chair, a tack hammer is best, but if you’re putting up a circus tent, you need a bigger hammer.
As twentieth-century psychologist Abraham Maslow pointed out, the problem arises when you start with a tool and try to use it on every challenge or obstacle you encounter. The better approach is to identify a desired outcome and select the tool that possesses capabilities suited to achieving the desired outcome. So what is AI good at?
A → B
As computer scientist Andrew Ng points out, “Despite AI’s breadth of impact, the types of AI being deployed are still extremely limited. Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B).”
Often, response B consists of nothing more than “Is this X or not X?” For example:
Is this transaction nominal or anomalous?
Is this document a patient case history?
Does this image contain a human face?
This capability is known as supervised learning, in which AI learns the relationship between A and B by processing massive amounts of data, guided by humans to establish the rules that govern the decisions. (See Chapter 1 for more information.)
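The A → B pattern can be sketched in a few lines. The nearest-centroid rule below stands in for a real learning algorithm, and the toy transaction data is invented for illustration:

```python
# Minimal illustration of supervised A -> B learning: the "model" learns
# from labeled examples (A, B) and then answers "is this X or not X?"
# for new inputs. A nearest-centroid rule stands in for a real learner.

def train(examples):
    """examples: list of (features, label) pairs with labels True/False."""
    centroids = {}
    for label in (True, False):
        points = [f for f, l in examples if l == label]
        dim = len(points[0])
        centroids[label] = [sum(p[i] for p in points) / len(points)
                            for i in range(dim)]
    return centroids

def predict(centroids, features):
    """Return the label whose centroid is closest to the input (B from A)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], features))

# Toy data: transactions as (amount, risk score), labeled anomalous (True)
# or nominal (False).
training_data = [
    ((10.0, 1.0), False), ((12.0, 2.0), False),
    ((150.0, 9.0), True), ((180.0, 8.0), True),
]
model = train(training_data)
```

The essential point survives even in this toy: the mapping from A to B is learned from labeled examples rather than hand-coded.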
In other cases, B is a transformation of A, such as when AI is used for transcribing or translating a passage. This capability is known as natural-language processing.
Good use cases
A good use case for AI relies on the core enablers of AI as a tool — big data, digitalization, and well-defined classification and rules.
A 2019 IDC guide on worldwide AI spending through 2023 indicated that the top three use cases are automated customer service agents, automated threat intelligence and prevention systems, and sales process recommendation and automation; the use cases with the biggest growth are human resource automation