One way to do that is a bar chart, either vertical or horizontal. One axis displays the collection of categories or ranges and the other the quantity, rank, or another metric.
Another way is to have the X axis represent one variable and the Y axis represent a different variable, and then plot the data points. The data points can even use bubble size to represent a third variable, packing a lot of information into a simple visualization that can be read at a glance. Figure 3-1 shows the number of page visits on the X axis, the duration of the visit on the Y axis, and income band as the size of the bubble.
Composition
A composition visualization drills down into the parts that make up a single number.
For example, you may know the total number of employees across industries, but there is more information buried in the data.
FIGURE 3-1: Comparison: Total page visits by mean duration of visit.
The top part of Figure 3-2 shows a bar chart with the various categories of employees on one axis and the number of employees on the other axis to help you understand the composition of the workforce across industries. The bottom part of Figure 3-2 shows a donut chart breaking down the percentage of revenue per market segment.
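The percentages behind a composition chart such as a donut chart come from dividing each part by the total. The following is a minimal sketch of that arithmetic; the function name `composition_shares` and the revenue figures are hypothetical and are not taken from Figure 3-2.

```python
def composition_shares(totals):
    """Return each category's share of the overall total as a percentage."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total must be non-zero to compute shares")
    return {name: 100 * value / grand_total for name, value in totals.items()}


# Hypothetical revenue per market segment (illustrative values only).
revenue_by_segment = {"Enterprise": 500, "Mid-market": 300, "Small business": 200}

shares = composition_shares(revenue_by_segment)
for segment, pct in shares.items():
    print(f"{segment}: {pct:.1f}%")
```

The resulting shares always sum to 100 percent, which is what lets a donut or stacked bar chart show how a single total breaks down.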
Distribution
A distribution visualization conveys how the data points fall across categories, locations, or other groupings. For example, you can show a table of counties in alphabetical order and the number of startups for each county, but it is hard to get a sense of how they relate geographically. Figure 3-3 uses a heat map, which represents values through colors, often with gradations of shades or tints indicating the values of adjacent numbers.
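One way to think about the color gradations in a heat map is as buckets: each value is assigned a shade level according to where it falls between the smallest and largest values. The sketch below assumes equal-width buckets, which is only one of several possible schemes; the function name `shade_levels` and the county figures are hypothetical.

```python
def shade_levels(values, levels=5):
    """Map each value to a shade index from 0 (lightest) to levels - 1 (darkest).

    Assumes equal-width buckets between the minimum and maximum value.
    """
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    result = {}
    for key, value in values.items():
        if span == 0:
            # All values are identical, so everything gets the lightest shade.
            result[key] = 0
        else:
            # Scale to 0..levels and clamp the maximum into the top bucket.
            result[key] = min(int((value - lo) / span * levels), levels - 1)
    return result


# Hypothetical startup counts per county (illustrative values only).
startups_by_county = {"Adams": 2, "Baker": 10, "Clark": 50, "Dane": 26}

shades = shade_levels(startups_by_county)
for county, shade in shades.items():
    print(f"{county}: shade {shade}")
```

A mapping tool would then translate each shade index into a tint of the chosen color and fill the county's region with it.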
Relationship
A relationship visualization reveals how two or more variables affect each other. You can show relationships through a variety of methods.
FIGURE 3-2: Composition: Employees per industry (top), revenue per market segment (bottom).
The simplest is a line graph showing how one variable rises or falls along the Y axis as it moves through degrees of another variable on the X axis. A scatter plot is useful when the data is not linear, such as representing the height and weight of a population, where there are multiple instances of weight for each height. A bubble chart is a scatter plot using the size of the bubble to represent a third variable.
You can also indicate a relationship by graphing two variables on one axis across a third variable on the other axis. Figure 3-4 graphs call center wait times and customer satisfaction scores across time.
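A chart suggests a relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation, a standard measure of linear association that the text itself does not name, for two series. The wait times and satisfaction scores are hypothetical and are not the data behind Figure 3-4.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return covariance / (spread_x * spread_y)


# Hypothetical weekly averages (illustrative values only).
wait_minutes = [2, 4, 6, 8, 10]
satisfaction = [9, 8, 6, 5, 3]

r = pearson(wait_minutes, satisfaction)
print(f"correlation: {r:.2f}")
```

A value near -1 means satisfaction tends to fall as wait time rises, a value near +1 means the two rise together, and a value near 0 means there is no clear linear relationship.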
FIGURE 3-3: Distribution: Startups per county.
FIGURE 3-4: Relationship: Call center wait time versus satisfaction score.
Digesting Data
Of the three pillars of AI (processing power, scalable storage, and big data), the third presents the biggest challenge: how to get it, how to validate it, and how to process it.
Figure 3-5 shows the pyramid of critical success factors for AI and analytics. Four of the six layers relate to data, focusing on relevance, accessibility, usability, completeness, and data-based conclusions.
FIGURE 3-5: Pyramid of critical success factors for AI and analytics.
Table 3-2 describes critical questions to answer at each layer.
TABLE 3-2 Pyramid of Critical Success Factors for AI and Analytics
Element | Questions |
AI | How will you address analytical deployment, governance, and operations? |
Experimentation ML | Does machine learning add business value? How do you define success? |
BI / Analytics | What is the story your data is telling? What conclusions can you make from this information? |
Explore and Enrich | Can the data be used meaningfully? Are you missing any data or features? |
Data Access | Is the data accessible and usable (analysis-ready)? Is the data flow reliable? |
Data Collection | Do you have data relevant to your business goals? |
Identifying data sources
Before you start, you should perform a data audit to determine what data you already have and identify gaps in your data that you must fill to accomplish your business goals.
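At its simplest, the gap-finding part of a data audit is a comparison between the data you need and the data your sources already provide. The following is a minimal sketch of that comparison; the function name `audit_gaps`, the source names, and the field names are all hypothetical.

```python
def audit_gaps(required, available):
    """Return the required fields that no available source provides, sorted."""
    covered = set()
    for fields in available.values():
        covered.update(fields)
    return sorted(set(required) - covered)


# Hypothetical fields needed to meet a business goal.
required_fields = {"customer_id", "purchase_history", "support_tickets", "web_visits"}

# Hypothetical inventory of existing sources and the fields each one holds.
available_sources = {
    "crm": {"customer_id", "purchase_history"},
    "helpdesk": {"customer_id", "support_tickets"},
}

gaps = audit_gaps(required_fields, available_sources)
print(gaps)  # ['web_visits']
```

A real audit would also look at data quality, ownership, and access rights, but a list of uncovered fields is a reasonable starting point for deciding which sources to pursue.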
As mentioned in Chapter 1, for the enterprise, data falls into two categories: structured data (databases and spreadsheets) and unstructured data (email, text messages, voice mail, social media, connected sensors, and so on). Potential sources for data include:
Internal data: The first place to look is the IT department, but depending on the organization, you may not find everything you need in one place. The most common challenges associated with big data aren’t analytics problems; they are information integration problems. To reap the benefits of big data,