The stages in this life cycle are as follows.
Specify: The activity of detailing data requirements so that data providers understand what is required. For some data, the organisation is not able to impose a specification on external providers but, by identifying formal requirements, the organisation would at least be able to identify issues upon receipt of the data.
Signal/data acquisition: Structured data can arise from signals in physical assets (e.g. a temperature reading being recorded every 10 seconds) or can be generated by operational control systems.
Purchase: Specialist companies can, for example, provide data on population demographics, derive industry-wide market analysis or model future projected demand for a service.
Data entry: Much data will arise from some form of data entry, either specifically as a data population activity (perhaps manually entered) or arising from a business activity as part of the process being undertaken.
Store: Once you have acquired data, they will need to be stored and kept ready for immediate use.
Utilise/exploit: The activity of using data to support business processes, decision making or analysis is where the benefits arising from the data can be delivered for the organisation. This is, however, also the point where poor data quality management can compromise the potential benefits that could be delivered.
Assess quality: A part of data exploitation should be an assessment of the quality of the data. When undertaking data analysis, for example, it may become apparent that a particular segment of the data is only partially complete. This knowledge should inform the analysis process, but it is also a trigger for the next step.
Improve data: Greater awareness of the quality of existing data or changes to business requirements can be the trigger to gather new data or improve existing data.
Synthesis: The activity of data exploitation can create new, synthesised data that warrant storage for future utilisation. For instance, this could be performance statistics for each day, which are stored to enable time-series analysis. Forms of synthesis can include inference and extrapolation to allow missing data to be determined; for example, estimating the age of a main water supply pipe based on the age of the properties on a particular street.
Archive: Some data are no longer required for immediate access but need to be retained for legal or regulatory compliance purposes. Various offline storage methods can be used to keep such data, accepting that there could be some delay between requesting access to the data and their becoming available.
Delete: Ultimately, some data will have no further purpose or benefit, so can be considered for permanent deletion. An example could be the full audit trail for all transactions on a system, which will no longer be required many years after the transactions occurred.
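The assess, improve and synthesis stages can be illustrated with a minimal Python sketch. It measures completeness per segment and then infers a missing pipe installation year from the build years of properties on the same street, in the spirit of the water main example above. The record layout, the field names (`street`, `install_year`, `build_year`), the sample values and the choice of the median are all assumptions made for illustration. Inferred values are flagged so that they are not mistaken for recorded ones.

```python
from statistics import median

# Hypothetical pipe records; None marks a missing installation year.
pipes = [
    {"id": "P1", "street": "Mill Lane", "install_year": 1962},
    {"id": "P2", "street": "Mill Lane", "install_year": None},
    {"id": "P3", "street": "Station Road", "install_year": None},
    {"id": "P4", "street": "Station Road", "install_year": None},
]

# Hypothetical property records used as the basis for inference.
properties = [
    {"street": "Mill Lane", "build_year": 1958},
    {"street": "Mill Lane", "build_year": 1960},
    {"street": "Mill Lane", "build_year": 1965},
    {"street": "Station Road", "build_year": 1934},
    {"street": "Station Road", "build_year": 1936},
]


def completeness_by_segment(records, segment_key, field):
    """Return, per segment, the fraction of records where `field` is populated."""
    totals, filled = {}, {}
    for rec in records:
        seg = rec[segment_key]
        totals[seg] = totals.get(seg, 0) + 1
        if rec[field] is not None:
            filled[seg] = filled.get(seg, 0) + 1
    return {seg: filled.get(seg, 0) / totals[seg] for seg in totals}


def infer_install_years(pipes, properties):
    """Fill missing installation years with the median build year of the street.

    Returns new records rather than modifying the originals; each record
    carries an `inferred` flag recording whether the year was estimated.
    """
    years_by_street = {}
    for prop in properties:
        years_by_street.setdefault(prop["street"], []).append(prop["build_year"])

    result = []
    for pipe in pipes:
        rec = dict(pipe, inferred=False)
        street_years = years_by_street.get(pipe["street"])
        if rec["install_year"] is None and street_years:
            rec["install_year"] = int(median(street_years))
            rec["inferred"] = True
        result.append(rec)
    return result


# Assess: which streets have incomplete installation data?
completeness = completeness_by_segment(pipes, "street", "install_year")

# Improve/synthesise: estimate the gaps and keep the estimates identifiable.
improved = infer_install_years(pipes, properties)
```

Keeping the `inferred` flag alongside each value means that later analysis can treat estimated and recorded years differently, and that the estimates can be replaced if better data are gathered.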
There are many types of document that can exist in an organisation, with varying levels of importance and differing requirements for retention. These can include:
organisational policies, strategies and standards requiring formal approval and version control;
contracts and legal documents requiring retention until all possible consequences have been exhausted;
design, construction and maintenance documents requiring retention until the physical asset no longer exists;
personnel records requiring retention in line with legal and regulatory stipulations;
project and team working documents requiring less rigorous control and management, but that are useful for day-to-day activity within the organisation.
The life cycle for documents (which can be referred to as semi-structured data) differs in a number of areas, particularly for documents stored in a formal electronic document management system (EDMS). It consists of eight stages, as shown in Figure 1.3.
These life cycle stages are as follows:
Create: When a text-based document is created and stored in a document management system, a range of metadata will also be stored about the document; for example, the author, creation date and security classification.
Managing Data Quality
Review and approve: Before a document can be published, it will typically need to undergo a review and approval process to confirm that it is of the quality required for publication.
Store: As with structured data, large volumes of documents will need to be stored with appropriate security settings ready for use by staff across the organisation.
Publish and distribute: In order for documents to deliver value to your organisation, they must be available and accessible to the relevant people. There will also probably need to be some way to notify staff of the availability of key documents.
Update: At some point in the life of a document there will be a need to update it, perhaps to reflect changes in organisational structure, new processes and so on. This will entail someone creating a new revision (version) of the document and then submitting it for review and approval.
Supersede: Once the new revision (version) of a document has been approved, then the old (previous) version will need to be marked as ‘not current’ or ‘superseded’. Suitable processes will need to be in place to ensure that any hard copy versions of documents are disposed of and replaced with the current version.
Retire: At some point, superseded versions of documents need to be retired so that they are still retained for evidential purposes, but not visible to general users. This is similar to the archive stage for general data.
Dispose: This can involve deletion for electronic documents and a suitable destruction method for hard copies. If the document is sensitive (for security, commercial or intellectual property reasons), then shredding or secure disposal will be required for hard copies.
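The eight stages can be read as states that a single revision of a document passes through. The following is a minimal Python sketch under that assumption. The `Stage` names follow the list above, but the `DocumentRevision` class and the rule that a revision may only move to the next stage in the listed order are illustrative simplifications, not taken from any particular EDMS.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Life cycle stages, numbered in the order listed above."""
    CREATE = 1
    REVIEW_AND_APPROVE = 2
    STORE = 3
    PUBLISH_AND_DISTRIBUTE = 4
    UPDATE = 5
    SUPERSEDE = 6
    RETIRE = 7
    DISPOSE = 8


class DocumentRevision:
    """One revision of a document (hypothetical class, for illustration).

    Assumption: a revision moves through the stages strictly in order.
    A real system would probably allow other paths, such as a review
    that sends the document back to its author.
    """

    def __init__(self, title, revision=1):
        self.title = title
        self.revision = revision
        self.stage = Stage.CREATE
        self.history = [Stage.CREATE]

    def advance(self, target):
        """Move to `target` only if it is the next stage in the sequence."""
        if int(target) != int(self.stage) + 1:
            raise ValueError(
                f"cannot move from {self.stage.name} to {Stage(target).name}"
            )
        self.stage = Stage(target)
        self.history.append(self.stage)


doc = DocumentRevision("Data quality policy")
doc.advance(Stage.REVIEW_AND_APPROVE)
doc.advance(Stage.STORE)

try:
    doc.advance(Stage.DISPOSE)  # skipping stages is rejected
except ValueError as err:
    rejected = str(err)
```

Recording each stage in `history` gives a simple audit trail of how a revision reached its current state, which may be useful when superseded versions are retained for evidential purposes.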
Within your organisation there could be variations in the life cycles that have been defined here and the names of the different stages. They will, however, probably be broadly similar to the life cycles detailed above.
Figure 1.3 A typical life cycle for documents
Semi-structured data (such as documents and social media feeds) present some additional challenges from a data quality perspective. These data entities will include metadata that provide clarity on items such as title, date created, the user ID of the creator, version number and so on. They will also include data that have no predictable pattern to the structure. The ‘body’ of a document or message,
The data asset
What is data quality?
The fundamental effect of data quality is the right data being available at the right time to the right users, to make the right decision and achieve the right outcome.
This can be extended by considering that good quality data are safe, legal and processed fairly, correctly and securely.
Whilst ‘perfect’ data quality appears desirable, the reality is that organisations are unlikely ever to have the time, resources, budget or need for ‘perfect’ data. You therefore need to accept that your data are not perfect and probably never will be. Accepting this, you need to be able to understand and describe the nature of your data quality.
If someone states that ‘the weather is bad’, for example, this has little meaning without stating whether it is too hot or too cold, too wet or too dry, too windy or too still and so on. For some people, the weather could be