created at a rate of more than 1.7MB every second for every person (www.domo.com/solution/data-never-sleeps-6). That equates to approximately 154,000,000,000,000 punched cards. By coupling the volume of data with the capacity to meaningfully process that data, data can be used at scale for much more than simple record keeping.
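
      As a rough back-of-envelope check, the short Python sketch below reproduces the punched-card figure under two assumptions that are not stated above: a classic 80-column punched card holds 80 bytes of data, and the per-person rate is aggregated over a global population of roughly 7.3 billion.

# Back-of-envelope check of the punched-card comparison.
# Assumptions (not from the text): an 80-column card holds 80 bytes,
# and the 1.7 MB/s per-person rate is aggregated over ~7.3 billion people.
bytes_per_second_per_person = 1.7e6   # 1.7 MB per second per person
world_population = 7.3e9
bytes_per_card = 80

cards_per_second = bytes_per_second_per_person * world_population / bytes_per_card
print(f"{cards_per_second:,.0f} punched cards per second")
# Prints roughly 155,125,000,000,000 -- on the order of the 154 trillion cited.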

      Clearly, our world is firmly in the age of big data. Enterprises are scrambling to integrate advanced analytics capabilities such as artificial intelligence and machine learning in order to get the most from their data. Drawing out insights to improve business performance in the marketplace is nothing less than mandatory. Recent data management concepts such as the data lake have emerged to help guide enterprises in storing and managing data. In many ways, the data lake was a stark contrast to its forerunner, the enterprise data warehouse (EDW). Typically, the EDW accepted only data that had already been deemed useful, and its content was organized in a highly systematic way.

      When misused, a data lake serves as nothing more than a hoarding ground for terabytes and petabytes of unstructured and unprocessed data, much of it never to be used. However, a data lake can be meaningfully leveraged for the benefit of advanced analytics and machine learning models.

      But are data warehouses and data lakes serving their intended purpose? More pointedly, are enterprises realizing the business-side benefit of having a place to hoard data?

      The global research and advisory firm Gartner has provided sobering analysis. It has estimated that more than half of the enterprise data warehouses that were attempted have been failures and that the new data lake has fared even worse. At one time, Gartner analysts projected that the failure rate of data lakes might reach as high as 60 percent (blogs.gartner.com/nick-heudecker/big-data-challenges-move-from-tech-to-the-organization). However, Gartner has now dismissed that number as being too conservative. Actual failure rates are thought to be much closer to 85 percent (www.infoworld.com/article/3393467/4-reasons-big-data-projects-failand-4-ways-to-succeed.html).

      Why have initiatives such as the EDW and the data lake failed so spectacularly? The short answer is that developing a proper information architecture isn't simple.

      For much the same reason that the EDW failed, many of the approaches taken by data scientists have failed to recognize the following considerations:

       The nature of the enterprise

       The business of the organization

       The stochastic and potentially gargantuan nature of change

       The importance of data quality

       How different techniques applied to schema design and information architecture can affect the organization's readiness for change

      Analysis reveals that the higher failure rate for data lakes and big data initiatives has been attributed not to technology itself but, rather, to how the technologists have applied the technology (datazuum.com/5-data-actions-2018/).

      These facets quickly become self-evident in conversations with our enterprise clients. In discussing data warehousing and data lakes, the conversation often involves answers such as, “Which one? We have many of each.” It often happens that a department within an organization needs a repository for its data, but its requirements are not satisfied by previous data storage efforts. So instead of attempting to reform or update older data warehouses or lakes, the department creates a new data store. The result is a hodgepodge of data storage solutions that don't always play well together, which leads to lost opportunities for data analysis.

      Obviously, new technologies can provide many tangible benefits, but those benefits cannot be realized unless the technologies are deployed and managed with care. Unlike designing a building in traditional architecture, designing an information architecture is not a set-it-and-forget-it proposition.

      While your organization can control how it ingests data, it can't always control how the data it needs changes over time. Organizations tend to be fragile in that they can break when circumstances change. Only flexible, adaptive information architectures can adjust to new environmental conditions. Designing and deploying solutions against a moving target is difficult, but the challenge is not insurmountable.

      The glib assertion that garbage in equals garbage out is treated as passé by many IT professionals. In truth, garbage data has plagued analytics and decision-making for decades, and mismanaged data and inconsistent representations will remain a red flag for every AI project you undertake.

      The level of data quality demanded by machine learning and deep learning can be significant. Like a coin with two sides, low data quality can have two separate and equally devastating impacts. On the one hand, low-quality historical data can distort the training of a predictive model. On the other, low-quality new data can distort the model's predictions and negatively impact decision-making.
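
      As a minimal sketch of both failure modes, the Python example below (assuming scikit-learn and NumPy are available; the dataset and model are purely illustrative, not anything prescribed in this book) relabels a share of the positive training examples to mimic low-quality historical data, and corrupts the features arriving at scoring time to mimic low-quality new data, comparing each case against a clean baseline.

# Minimal sketch of two data-quality failure modes on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Purely synthetic stand-in for an enterprise dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: clean training data scored against clean new data.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("clean train, clean test:", accuracy_score(y_test, clean_model.predict(X_test)))

# Failure mode 1: low-quality historical data.
# Relabel 40% of the positive training examples as negative, biasing the trained model.
noisy_labels = y_train.copy()
flip = (noisy_labels == 1) & (rng.random(len(noisy_labels)) < 0.40)
noisy_labels[flip] = 0
noisy_model = LogisticRegression(max_iter=1000).fit(X_train, noisy_labels)
print("noisy train, clean test:", accuracy_score(y_test, noisy_model.predict(X_test)))

# Failure mode 2: low-quality new data.
# The model was trained on clean data, but the features arriving at scoring time are corrupted.
X_test_corrupted = X_test + rng.normal(scale=3.0, size=X_test.shape)
print("clean train, noisy test:", accuracy_score(y_test, clean_model.predict(X_test_corrupted)))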

      As a sharable resource, data is exposed across your organization through layers of services that can behave like a virus when the level of data quality is poor—unilaterally affecting all those who touch the data. Therefore, an information architecture for artificial intelligence must be able to mitigate traditional issues associated with data quality, foster the movement of data, and, when necessary, provide isolation.

      We'll begin in Chapter 1, “Climbing the AI Ladder,” with a discussion of the AI Ladder, an illustrative device developed by IBM to demonstrate the steps, or rungs, an organization must climb to realize sustainable benefits with the use of AI. From there, Chapter 2, “Framing Part I: Considerations for Organizations Using AI,” and Chapter 3, “Framing Part II: Considerations for Working with Data and AI,” cover an array of considerations data scientists and IT leaders must be aware of as they traverse their way up the ladder.

      In Chapter 4, “A Look Back on Analytics: More Than One Hammer,” and Chapter 5, “A Look Forward on Analytics: Not Everything Can Be a Nail,” we'll explore some recent history: data warehouses and how they've given way to data lakes. We'll discuss how data lakes must be designed in terms of topography and topology. This will flow into a deeper dive into data ingestion, governance, storage, processing, access, management, and monitoring.

      In Chapter 6, “Addressing Operational Disciplines on the AI Ladder,” we'll discuss how DevOps, DataOps, and MLOps can enable an organization to better use its data in real time. In Chapter 7, “Maximizing the Use of Your Data: Being Value Driven,” we'll delve into the elements of data governance and integrated data management. We'll cover the data value chain and the need for data to be accessible and discoverable in order for the data scientist to determine the data's value.

      Chapter 8, “Valuing Data with Statistical Analysis and Enabling Meaningful Access,” introduces