For many organisations, the entities that data represent have existed through multiple data stores and software systems; for example:
an individual born in 1950 will have had their personal details, careers, financial records and so on, stored in multiple systems over the course of their lives;
infrastructure assets such as railways, bridges and buildings can be more than 100 years old (with an expectation that they will continue to provide useful service
for many more years) with data and records about them having been stored in multiple systems.
This book is ‘technology agnostic’, so is not tied to any one particular technology or software system. It details approaches to managing data quality that will stand the test of time regardless of future technology changes and evolving organisational requirements for data.
Intended audience
We intend this book to be both a reference to be read (and reread) in its entirety and a source of advice and anecdotes that can be ‘dipped into’ when required. It is written for data managers and the practitioners, supporters and sponsors involved in data quality initiatives.
It is also written for students and lecturers in both computer science and business/management courses who have an interest in, or reliance on, effective data exploitation.
Reference to other works
This book does not attempt to be a definitive guide to all possible data-related activities, many of which are already described in other authoritative works. Instead, we focus on the challenges of managing enterprise data quality and on ways to refine the management system of an organisation to take adequate account of data. Where a subject already has authoritative and well-regarded reference material, we refer to those sources.
Part I
The challenge of enterprise data
This first part of the book will help you to understand better the nature of the data asset and why it can be difficult to manage, particularly in an enterprise or organisational context. Generic behaviours of people relating to data will be explored to help understand how people can affect data quality. Finally, some real-life examples and case studies of data quality problems will be used to help you understand some of the impacts of data that have poor quality.
This chapter describes the differences between data and information, and how these relate to most business activities. We then consider the nature of the data asset and the generic life cycles of data and explain what is meant by the term ‘data quality’. Finally, we introduce the objectives of data quality management.
What ARE data?
Before going much further, there are some key terms and concepts that need to be defined and clarified to help ensure consistent understanding as you read this book.
The title of this book is Managing Data Quality, and three related terms appear together so often in discussions of the impact of computer technology on organisations that they need to be clarified: data, information and knowledge.
When you have more than one data professional in a room, it is likely that there will be fierce debate about these terms. Even the ISO Online Browsing Platform¹ (a place where all ISO definitions are gathered together) has numerous different definitions for these terms.
As the subject of this book is data, we can establish a solid foundation for our understanding by referring to the definition for data in ISO 8000-2:
Data: ‘reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing’.
In the case of definitions for information and knowledge, making a choice is more controversial, not least because potential definitions often use the other two terms and any single collection of definitions becomes recursive. However, we believe the following key observations provide sufficient understanding to read the remainder of this book (while we leave more detailed discussion to others):
Use of the term ‘information’ suggests richness of meaning, and typically takes an end-user view of the value of data to organisations in enabling decision making.
1 The data asset
1 https://www.iso.org/obp/ui/
Use of the term ‘knowledge’ suggests an understanding acquired through experience or education, putting knowledge outside the scope of this book; for example, it doesn’t matter how many books you read about cycling: it is only when you have ridden a bike that you have knowledge of how to cycle!
Another complication is use of the terms ‘structured data’ and ‘unstructured data’. These terms have been a handy tool for marketing teams who are promoting particular software functionality (typically to extract meaning from unstructured data), but the two terms hide the reality that no data set in digital form is either fully structured or fully unstructured.
Structured data contain explicit, discrete elements (e.g. the tables, columns and keys within a relational database or the tags within an XML file) to represent meaning. These elements enable automation to generate insight and foresight from the meaning (e.g. being able to identify all the children in a hospital database by filtering the rows where age is less than 18).
Unstructured data are fundamentally text and images, which provide meaning in a way that requires either human expertise or artificial intelligence methods to process the meaning (e.g. a doctor reviews the medical scan that is the content of an image file).
In these examples, though, the database will typically also include unstructured elements (e.g. a free-text field to capture observational notes) and the digital file of the MRI scan will also include structured data in the form of metadata (e.g. the creation date) to support management of all the images.
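The contrast between structured and unstructured elements in the same data set can be illustrated with a minimal sketch. The table, column names and rows below are invented for illustration and do not come from any real hospital system; SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical hospital table held in memory; all names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients ("
    "patient_id INTEGER PRIMARY KEY, age INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO patients (patient_id, age, notes) VALUES (?, ?, ?)",
    [
        (1, 7, "Complains of ear ache since the weekend."),
        (2, 42, "Follow-up after knee surgery; healing well."),
        (3, 16, "Sprained wrist while playing football."),
    ],
)

# Structured element: the age column supports an exact, repeatable filter.
children = conn.execute(
    "SELECT patient_id, age FROM patients WHERE age < 18 ORDER BY patient_id"
).fetchall()

# Unstructured element: the free-text notes can only be searched by keyword.
# Patient 3 clearly has an injury, but the word itself never appears,
# so this naive search finds nothing.
injury_hits = conn.execute(
    "SELECT patient_id FROM patients WHERE notes LIKE '%injury%'"
).fetchall()

print(children)
print(injury_hits)
```

The age filter gives the same answer every time without any interpretation, whereas drawing meaning from the notes column would need a human reader or a more capable text-processing method than a keyword match.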
Furthermore, a spreadsheet is essentially semi-structured, sitting somewhere between a database and an image file, because the rows and columns provide some structure but without the full richness of a relational database or an XML file.
In summary, no data set is ever entirely structured or unstructured. Structure is definitely important to data quality, though, because it captures a more precise, controllable set of requirements for the data. Requirements for unstructured data are less easy to enforce by definitive, repeatable computer-based algorithms.
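As a rough sketch of why structure makes requirements easier to enforce, the following example checks one record against a few explicit rules. The field names, the age limit of 120 and the function check_record are all assumptions made for illustration, not rules taken from any standard.

```python
from datetime import date


def check_record(record: dict, today: date) -> list[str]:
    """Return the rule violations found in one hypothetical patient record."""
    problems = []

    # Structured field: a precise, repeatable rule is easy to state and test.
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 120:
        problems.append("age must be a whole number between 0 and 120")

    # Structured field: the admission date cannot lie in the future.
    admitted = record.get("admitted")
    if not isinstance(admitted, date) or admitted > today:
        problems.append("admitted must be a date that is not in the future")

    # Unstructured field: we can check that notes exist, but not that
    # they are accurate or meaningful.
    notes = record.get("notes")
    if not isinstance(notes, str) or not notes.strip():
        problems.append("notes must not be empty")

    return problems


today = date(2024, 1, 1)
good = {"age": 7, "admitted": date(2023, 5, 2), "notes": "Complains of ear ache."}
bad = {"age": -3, "admitted": date(2030, 1, 1), "notes": "asdf"}

print(check_record(good, today))
print(check_record(bad, today))
```

In this sketch the second record fails the age and date rules, yet its meaningless note passes, because the only rule that can be applied automatically to the free text is that it is not empty.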
Data as part of business activities
Any business activity should support the strategy of the organisation (and may have some part to play in developing this strategy). There should be governance in place to ensure that there is suitable senior or executive control and monitoring of this activity. Business activity in this context is not just applicable to commercial organisations, but refers to the activity by which any organisation delivers its core mission. Figure 1.1 illustrates this relationship.
The four core components of a typical business activity are:
The