With AI and machine learning comes an assumption that the more clean data you have, the more accurate your predictions become. But this also assumes you have the horsepower to process and analyze that data quickly, at scale, without dimming the city’s lights. To be effective at customer analysis, AI solutions must process immense amounts of data efficiently and scale to meet increasing volumes of data over time as it is collected and persisted.
Table 1-2 compares and contrasts the properties and uses of data mining versus text mining.
TABLE 1-2 Data Mining Versus Text Mining
Aspect | Data Mining | Text Mining |
Overview | Data mining searches for patterns and relationships in structured data. | Text mining transforms unstructured textual data into structured information to enable data analysis. |
Data Type | Structured data from large datasets is found in systems such as databases, spreadsheets, ERP, and accounting applications. | Unstructured textual data is found in emails, documents, presentations, videos, file shares, social media, and the Internet. |
Data Retrieval | Structured data is homogeneous and organized, making it easy to retrieve. | Unstructured textual data comes in many different formats and content types located in a more diverse range of applications and systems. |
Data Preparation | Structured data is formal and formatted, facilitating the process of ingesting data into analytical models. | Linguistic and statistical techniques (including NLP keywording and meta-tagging) must be applied to turn unstructured data into usable structured data. |
Taxonomy | There is no need to create an overriding taxonomy. | A global taxonomy must be applied to organize the data into a common framework. |
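To make the "Data Preparation" row concrete, here is a minimal sketch of keywording: it turns a free-text message into a structured record of keyword counts. The stop-word list, the sample email, and the function name extract_keywords are all made up for illustration; a real text-mining pipeline would rely on a proper NLP library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "at", "are"}

def extract_keywords(text, top_n=3):
    """Turn free text into a structured list of (keyword, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up unstructured input.
email = "The elephant exhibit is open. Elephant feeding is at noon."

# Structured output: a record that a data-mining tool could ingest.
record = {"source": "email", "keywords": extract_keywords(email)}
print(record)
```

The resulting record has fixed fields, which is what lets it sit alongside rows from a database or spreadsheet.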
Machine learning
Machine learning (ML), a subset of artificial intelligence, enables systems to learn from historical data to achieve a desired outcome. It powers targeted ads, personalized content, song recommendations, predictive maintenance activities, and virtual assistants.
ML mimics human learning by absorbing information. Humans learn by reading, watching, listening, and doing. ML learns by processing historical data. For example, a human’s knowledge of elephants is based on historical experience, such as going to the zoo, riding an elephant, watching a documentary, and reading a book. ML gains knowledge of elephants by processing text and images.
The learning phase consists of these steps:
1 Sample historical data (machine activity, customer attributes, and transactions).
2 Apply an algorithm to historical data to learn key patterns and trends.
3 Generate a model or set of rules or instructions.
The prediction phase consists of these steps:
1 Load the existing model.
2 Apply the model to new data.
3 Predict the likelihood of an outcome (for example, customer churn).
The output of the prediction phase feeds back into the input of the learning phase to refine the model.
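The two phases and the feedback loop can be sketched in a few lines. This is only an illustration under simplifying assumptions: the "algorithm" is a plain churn rate per subscription plan, the model is stored as JSON, and the records, field names, and the learn and predict functions are all hypothetical.

```python
import json
from collections import defaultdict

def learn(history):
    """Learning phase: derive a churn rate per plan from historical records."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in history:
        totals[row["plan"]] += 1
        churned[row["plan"]] += row["churned"]
    return {plan: churned[plan] / totals[plan] for plan in totals}

def predict(model, customer):
    """Prediction phase: apply the model to a new customer record."""
    # 0.5 for an unseen plan is an arbitrary "no information" default.
    return model.get(customer["plan"], 0.5)

# Made-up historical data (1 = customer churned, 0 = customer stayed).
history = [
    {"plan": "monthly", "churned": 1},
    {"plan": "monthly", "churned": 1},
    {"plan": "monthly", "churned": 0},
    {"plan": "annual", "churned": 0},
]

# Learning: sample the data, apply the algorithm, generate and store a model.
saved_model = json.dumps(learn(history))

# Prediction: load the existing model, apply it to new data, predict churn.
model = json.loads(saved_model)
risk = predict(model, {"plan": "monthly"})
print(f"Estimated churn risk: {risk:.2f}")

# Feedback: the observed outcome joins the history and the model is relearned.
history.append({"plan": "monthly", "churned": 0})
model = learn(history)
```

After the feedback step, the monthly churn rate in the refreshed model reflects four customers instead of three, which is the refinement the loop is meant to provide.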
Learning
For the purposes of ML, historical data is called training data. In the case of text mining, the system uses OCR and NLP to process text. For images, the system uses computer vision techniques for detection, recognition, and identification to process the image.
The algorithm processes the data to detect key patterns and trends and correlate them to labels. For example, if you’re doing text mining, the algorithm might notice certain words being associated with elephants, such as large, gray, tusk, and trunk, and associate those with the label “elephant.” Later, in the prediction phase, when the algorithm sees a significant number of these terms, it calculates the probability that the passage is talking about an elephant.
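One common way to associate words with labels is a naive Bayes classifier, sketched below. The text does not say which algorithm is used, so treat this as one possible choice; the four training passages and the labels "elephant" and "other" are invented for the example.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count how often each word appears under each label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def probability(model, text, label):
    """Probability that the passage belongs to `label` (naive Bayes)."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    log_scores = {}
    for lbl, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[lbl] / sum(doc_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        log_scores[lbl] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {lbl: math.exp(s - top) for lbl, s in log_scores.items()}
    return weights[label] / sum(weights.values())

# Made-up training data.
training_data = [
    ("large gray animal with tusk and trunk", "elephant"),
    ("gray trunk sprays water", "elephant"),
    ("small orange cat with whiskers", "other"),
    ("red car with large trunk", "other"),
]

model = train(training_data)
p_elephant = probability(model, "large gray trunk", "elephant")
print(f"Probability the passage is about an elephant: {p_elephant:.0%}")
```

Note that "trunk" also appears in a passage about a car, so no single word decides the label; the score comes from the combined evidence of all the terms.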
In the learning phase, the system applies statistical techniques or algorithms to the historical data to generate a machine-learning model. You can think of the model as a set of rules or instructions (similar to steps in a recipe) that one must follow to make a business decision.
For example, to approve a loan application, a loan officer considers income, age, net worth, and many other factors. Each attribute of the application is a rule or factor that the officer must evaluate to approve or reject the loan. Machine-learning techniques follow a similar process, comparing various attributes, historical decisions, and the outcomes of similar applicants to estimate the creditworthiness of the new applicant. Table 1-3 shows how machine learning is like a recipe.
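The idea of judging a new applicant by the outcomes of similar past applicants can be sketched with a nearest-neighbor lookup. This is one possible technique, not necessarily the one a lender would use, and the applicant records and the function name estimate_creditworthiness are invented for illustration.

```python
import math

# Made-up past applications:
# (income in $1,000s, age, net worth in $1,000s), and whether the loan was repaid.
past_applications = [
    ((85, 45, 300), True),
    ((90, 50, 400), True),
    ((70, 38, 150), True),
    ((30, 22, 5), False),
    ((25, 30, 10), False),
    ((40, 27, 20), False),
]

def estimate_creditworthiness(applicant, history, k=3):
    """Share of the k most similar past applicants who repaid their loans."""
    # Similarity here is plain Euclidean distance on raw attributes; a real
    # system would scale the attributes so no single one dominates.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], applicant))[:k]
    return sum(repaid for _, repaid in nearest) / k

strong = estimate_creditworthiness((80, 40, 250), past_applications)
weak = estimate_creditworthiness((35, 25, 15), past_applications)
print(f"Strong applicant score: {strong:.2f}, weak applicant score: {weak:.2f}")
```

The score is a fraction between 0 and 1, so it can be read as an estimate rather than a yes-or-no decision.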
Prediction
In the prediction phase, the system uses the model to process new data (not historical data), detect patterns and trends, and attempt to match them to patterns from the training data.
TABLE 1-3 Machine Learning as a Recipe
Aspect | Machine Learning | Recipe |
Task | An algorithm is a step-by-step instruction set or formula for solving a problem or completing a task. | Thaw the chicken. Season the chicken. Bake the chicken at 350°F. |
Objective | Minimize errors (loss function) to attain the best approach to solve a task. | Minimize the number of ingredients and steps required to prepare a tasty dish. |
Insight/result | The algorithm learns from errors, finds the best approach, and generates insights and rules used to make predictions. | Learn from your mistakes the next time you attempt the recipe. |
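The "Objective" row of Table 1-3 can be illustrated with a small sketch of loss minimization. It fits a straight line to four made-up points by gradient descent, using mean squared error as the loss function; the learning rate and number of passes are assumptions chosen for this toy data.

```python
def mean_squared_error(w, b, xs, ys):
    """Loss function: average squared gap between prediction and truth."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Learn slope w and intercept b by repeatedly reducing the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient: this is the "learn from errors" part.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up points that lie on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

initial_loss = mean_squared_error(0.0, 0.0, xs, ys)
w, b = fit_line(xs, ys)
final_loss = mean_squared_error(w, b, xs, ys)
print(f"w={w:.3f}, b={b:.3f}, loss {initial_loss:.2f} -> {final_loss:.6f}")
```

Each pass measures the error of the current guess and nudges the rule toward a better one, much like adjusting a recipe after each attempt.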
For example, if you process a brochure for the San Diego Zoo using the model, it would recognize the content about elephants and add the tag “elephant” to the document along with a score. The result is a prediction in the form of the percentage probability that the document contains information about elephants. Basically, the model makes a data-driven guess.
In AI and data science, execution is not just implementing a plan. The methodology establishes an iterative process of learning, discovering, and then acting on new information, as opposed to the more traditional IT model of formulating a plan or idea and then rolling it out as planned.
Auto-classification
Auto-classification is a machine-learning technique that