FIGURE 2-6: Migrating your data warehouse into your new data lake.
Your old data warehouse contents were likely stored in a dimensional model such as a star schema or a snowflake schema. Inside a data lake, the equivalent models might also be dimensional. Alternatively, you could be using a columnar database such as Amazon Redshift. You can still use a visualization tool such as Tableau or a classic business intelligence tool such as MicroStrategy, but your database design will differ from your old data warehouse.
Resettling a data warehouse into your data lake environment
Suppose you and your team actually did a fantastic job architecting and building your data warehouse. You did your work and deployed the data warehouse only a few years ago, using fairly modern technology. To put it simply, your data warehouse just isn’t ready for retirement. But you still want to build a data lake to take advantage of modern big data technology. What should you do in this case?
Just as with a solidly built data mart, you can sort of “forklift” a well-architected data warehouse into your data lake environment. You’ll still have to do some rewiring of data feeds, and you’ll be adding complexity to your overall analytical data architecture. But there’s no sense in exiling a solidly built data warehouse into oblivion if it can still deliver value for you for a while to come.
Aligning Data with Decision Making
You don’t set out to build a data lake just to stuff tons of data into a modern big data environment. You build a data lake to support analytics throughout your enterprise. And the reason for your organization’s analytics is to deliver data-driven insights, with the emphasis on the term data-driven.
For better or for worse, the term analytics means different things to different people. As you set out to build your data lake, you need to understand what analytics means to your organization.
Deciding what your organization wants out of analytics
You should think of analytics as a continuum of questions that you ask about some particular function or business process within your organization, with the answers coming from your data:
What happened?
Why did it happen?
What’s happening right now?
What’s likely to happen?
What’s something interesting and important out of this mountain of data?
What are our options?
What should we do?
Your data lake needs to support the entire analytics continuum in all corners of your organization.
Suppose that Jan, your company’s CPO, is incredibly pleased with the work that Raul’s team did to have your data lake support machine learning models for the evaluation cycle. So, she asks Raul to expand the HR organization’s usage of analytics that are enabled by the data lake. Raul sits down with his analysts, Julia and Dhiraj, to create a master list of analytical questions that should be considered for implementation.
Raul’s team has the easiest time with “What happened?” types of questions, because these are what your company’s data warehouse and data marts have been producing for years. Now, though, your data marts and data warehouse will either be retired or incorporated into the data lake environment, so your data lake can take over this mission and serve up the data to answer questions along the lines of:
Which employees have consistently been rated in the top quintile in each department during the past three years?
Which employees have received the largest percentage salary increases during all evaluation periods during the past five years?
How many new employees were hired in each of the past three years?
How many employees left during each of those years? How many of those resigned? How many were involuntarily terminated? How many retired?
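As a rough illustration, the hiring and departure questions above come down to grouping and counting records. The sketch below is not a data lake implementation. It assumes a small, hypothetical list of employee records with made-up field names (hire_year, exit_year, exit_reason), standing in for whatever HR data your data lake actually serves up.

```python
from collections import Counter

# Hypothetical employee records; field names and values are illustrative only.
employees = [
    {"id": 1, "hire_year": 2021, "exit_year": None, "exit_reason": None},
    {"id": 2, "hire_year": 2021, "exit_year": 2023, "exit_reason": "resigned"},
    {"id": 3, "hire_year": 2022, "exit_year": 2023, "exit_reason": "retired"},
    {"id": 4, "hire_year": 2023, "exit_year": None, "exit_reason": None},
    {"id": 5, "hire_year": 2022, "exit_year": 2022, "exit_reason": "terminated"},
]


def hires_per_year(records):
    """Count how many employees were hired in each year."""
    return Counter(r["hire_year"] for r in records)


def exits_by_year_and_reason(records):
    """Count departures, keyed by (year, reason); current employees are skipped."""
    return Counter(
        (r["exit_year"], r["exit_reason"])
        for r in records
        if r["exit_year"] is not None
    )


hires = hires_per_year(employees)
exits = exits_by_year_and_reason(employees)

for year in sorted(hires):
    print(year, "hires:", hires[year])
for (year, reason), count in sorted(exits.items()):
    print(year, reason, count)
```

In practice, the same grouping would more likely be expressed as a query against the data lake than as a loop over in-memory records, but the logic (filter, group, count) is the same.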
Because your company’s executives are somewhat on the formal side, your list of “What happened?” questions will be categorized under the label descriptive analytics. In other words, your data lake will be producing analytics that describe something that happened in the past (which might be the very recent past, several years ago, or perhaps even farther back). This is mostly what your existing data warehouse and data marts do today; your data lake now takes over that job.
You also need the data lake to help you dig into the reasons something happened. For example, your descriptive analytics tell you that the number of employees who voluntarily resigned from the company last year was 25 percent above the yearly average for the previous five years. Inquiring minds want to know why!
Diagnostic analytics help you dig into the “why” factor for what your descriptive analytics tell you, and — congratulations! — your data lake will take on another assignment. In this case, you can be sure that Jan, your CPO, will be digging for answers now that she’s clued in to the increase in employee turnover.
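The turnover figure that kicks off this kind of diagnostic work is simple arithmetic: compare last year's count against the average of the previous five years. The sketch below uses invented resignation counts, chosen only so that the result matches the 25 percent in the example; the function name is hypothetical as well.

```python
from statistics import mean


def percent_above_average(current, history):
    """Return how far `current` sits above the mean of `history`, in percent."""
    baseline = mean(history)
    return (current - baseline) / baseline * 100


# Hypothetical voluntary-resignation counts (illustrative numbers only).
previous_five_years = [38, 42, 40, 41, 39]  # yearly average is 40
last_year = 50

change = percent_above_average(last_year, previous_five_years)
print(f"Resignations were {change:.0f} percent above the five-year average")
```

A number like this only tells you that something changed. Answering why still means slicing the same data by department, tenure, manager, pay band, and so on.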
Raul is well aware that, although insight into past results is an important part of your company’s analytics continuum, Jan and the other executives — as well as many others at all levels of your organization — also need deep insights into what’s happening right now. Before working in HR, Raul used to be in the supply chain organization. His specialty there was providing up-to-the-minute, near-real-time reports and visualizations for logistics and transportation throughout the entire supply chain.
This special variation of descriptive analytics — basically, factually describing what’s happening right now — may have some applicability to HR, though probably less so than over in the supply chain organization. Still, Raul makes a note to dive into these types of questions.
Jan, Raul, Julia, Dhiraj, Tamara, and most everyone else in HR know with absolute certainty that predictive analytics need to be a critical capability when the data lake functionality is built out. Even though predictive analytics aren’t exactly a crystal ball with guaranteed predictions, the sophisticated models can ingest data and tell the HR team and others what’s likely to happen. This way, the data lake can help provide insights such as the following:
Which employees are most at risk of resigning in the next year?
Which employees with less than three years of experience are most likely to become top performers in their next jobs?
Which employees with between 10 and 15 years of experience are most likely to underperform during the rest of this fiscal year?
Who are the top 50 nonmanagerial employees most likely to succeed as managers?
Predictive analytics generally fall under the category of data mining. Another form of data mining is digging into mountains of data, seeking interesting and important patterns and other insights that otherwise may remain hidden. Discovery analytics help you mine your data to answer questions such as the following:
Have any of your employees exhibited behavior that may indicate inappropriate or illegal activities, such as expense account fraud?
Is there anything going on in the company that can legally expose the