Completeness: The quality of having data records stored for all entities and of having all attributes for an entity populated.
Consistency: The ability to correctly link data relating to the same entity across different data sets.
Data: Reinterpretable representation of information in a formalized manner suitable for communication, interpretation or processing (ISO 8000-2:2020).
Data custodian: See Data steward.
Data governance: Development and enforcement of policies related to the management of data (ISO 8000-2:2020).
Data management: The activities of defining, creating, storing, maintaining and providing access to data and associated processes in one or more information systems (ISO/IEC TR 10032:2003).
Data owner: An individual who is accountable for a data asset.
Data quality: Degree to which a set of inherent characteristics of data fulfils requirements (ISO 8000-2:2020).
Data quality criteria: Specific tests that can be applied to data in order to understand the nature of their quality. This can also include the methods to be used in assessing quality. An illustrative sketch of such tests follows this glossary.
Data quality management: Coordinated activities to direct and control an organisation with regard to data quality (ISO 8000-2:2020).
Data set: Logically meaningful grouping of data (ISO 8000-2:2020).
Data steward: Person or organisation delegated the responsibility for managing a specific set of data resources (ISO 8000-2:2020).
ISO 8000: The multi-part ISO standard for data quality.
ISO 9000: The family of standards addressing various aspects of quality management, providing guidance and tools for companies and organisations that want to ensure their products and services consistently meet customers’ requirements and that quality is consistently improved.
Metadata: Data defining and describing other data (ISO 8000-2:2020).
Precision: Degree of specificity for a data entry (ISO/IEC 11179-3:2013 - modified).
Structured data: In a data set, the meaning covered by explicit elements of the data (e.g. the tables, columns and keys within a relational database or the tags within an XML file).
Timeliness: A measure of how current a data item is.
Uniqueness: A measure of whether an entity has a single data entry relating to it within a data set.
Unstructured data: In a data set, levels of meaning that are not covered by structural elements of the data (e.g. the characteristics of the brain in the image of a diagnostic medical scan).
Validity: Conformance of data to rules defining the syntax and structure of data.
Value: Numbers or characters stored in a data field to represent an entity or activity.
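Several of the dimensions defined above, such as completeness, uniqueness and validity, can be measured by applying simple tests to a data set. The following minimal sketch (in Python) illustrates one way such data quality criteria might be expressed; the record structure, field names and the simplified postcode rule are hypothetical examples chosen for illustration, not definitions drawn from ISO 8000 or from this book.

# Illustrative data quality checks against a small, hypothetical customer data set.
# Field names, rules and thresholds are examples only.

import re

records = [
    {"customer_id": "C001", "name": "Ann Lee", "postcode": "SW1A 1AA"},
    {"customer_id": "C002", "name": "", "postcode": "B33 8TH"},
    {"customer_id": "C002", "name": "Raj Patel", "postcode": "not known"},
]

# Deliberately simplified syntax rule for a UK postcode.
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")

def completeness(records, attribute):
    """Proportion of records in which the attribute is populated."""
    populated = sum(1 for r in records if r.get(attribute, "").strip())
    return populated / len(records)

def uniqueness(records, attribute):
    """Proportion of records whose value for the attribute occurs only once."""
    values = [r.get(attribute) for r in records]
    unique = sum(1 for v in values if values.count(v) == 1)
    return unique / len(records)

def validity(records, attribute, pattern):
    """Proportion of records whose attribute conforms to a syntax rule."""
    valid = sum(1 for r in records if pattern.match(r.get(attribute, "")))
    return valid / len(records)

print(f"Completeness of name:      {completeness(records, 'name'):.0%}")
print(f"Uniqueness of customer_id: {uniqueness(records, 'customer_id'):.0%}")
print(f"Validity of postcode:      {validity(records, 'postcode', POSTCODE_PATTERN):.0%}")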
PREFACE
Data are all around us1; the volume of data is growing at exponential rates and our lives are increasingly being supported and enabled by the exploitation of data. Despite this, many organisations struggle to effectively manage data and the quality of these data.
The reliance of organisations (and society) on data is a relatively new phenomenon; the techniques to manage data effectively are still developing and wider awareness of these approaches is generally low.
This book is titled Managing Data Quality: A Practical Guide very deliberately; its focus is to provide you with an understanding of how to manage data quality, and practical guidance to achieve this.
Enterprise data quality
This book does not just examine quality issues in individual databases or data stores. Instead, we also look at the wider set of issues arising in a typical organisation where there are multiple data stores that are not always formally managed, have been developed at different times, are constrained by different software tools and will be inputs and outputs of many different business processes.
Managing this landscape of different data stores is complex enough when there is a lack of agreement over which is the most trusted, or ‘master’, data source. This complexity increases, however, in most organisations where a large amount of data are also gathered, stored and manipulated in user-created spreadsheets and databases that often exist ‘below the radar’ of corporate governance approaches and controls.
Depending on the organisational context, this chaotic landscape presents a range of risks (and issues) to the organisation, which might be financial, regulatory, commercial, legal or reputational. These risks and issues could be significant. Standing still is almost certainly not a viable option.
Keith Gordon’s book, Principles of Data Management (2013), also takes an enterprise view of data. Keith’s book was published before ISO 8000-61, the international standard that specifies a process reference model for data quality management. This process model is the basis for our approach to enterprise data quality.
1 Please note, some readers will generally use the word ‘data’ in the singular; the BCS convention is to use this word as plural.
From the perspective of the enterprise as a whole, therefore, managing data quality effectively can be such a large task that it either never gets started or is viewed as so expensive that it eats up budget that could be better used elsewhere.
This book will help you to overcome this perceived complexity with practical solutions, by understanding:
• the nature of the data asset and why it can be difficult to manage;
• the impact of people and behaviours on data quality;
• the ISO 8000-61 framework and how it defines approaches to data quality management;
• how to develop strategies for change that are relevant to your organisation.
Data and changing technology
Over the many years since computing first became a commercial activity, there have been numerous changes in technology. At the highest level has been the progression from mainframes to personal computers, client/server systems, network computers, cloud computing, the Internet of Things (IoT) and big data analytics. Within each of these broad categories, technologies and approaches have continually evolved. Each evolutionary step is often sold on the basis of overcoming the shortcomings of the previous technology. Today’s latest technology, likewise, will be replaced in the future as new user requirements are discovered and improved technological approaches are developed.
Throughout all these changes in technology, data should have been a constant factor. They should have been migrated without loss of meaning from the old technology to the new, so as to sustain the effective delivery of organisational outcomes. However, data migration projects have historically been high risk and likely to fail in terms of time, cost or quality. For example, data can be lost or corrupted as part of the migration process, with such problems possibly affecting significant volumes of data. Similarly, changing data requirements over time can mean that older data structures are no longer fully understood, resulting in corruption during migration. Data migration approaches,