1.8.1 Big Data Generation
Data generation is the first phase of the big data life cycle. The scale of data generated from diversified sources is steadily expanding. The sources of this large volume of data were discussed in Section 1.5, “Sources of Big Data.”
Figure 1.10 Big data life cycle.
1.8.2 Data Aggregation
The data aggregation phase of the big data life cycle involves collecting the raw data, transmitting it to the storage platform, and preprocessing it. Data acquisition in the big data world means acquiring high‐volume data that arrives at an ever‐increasing pace. The raw data thus collected is transmitted to a suitable storage infrastructure that supports processing and various analytical applications. Preprocessing involves data cleansing, data integration, data transformation, and data reduction to make the data reliable, error free, consistent, and accurate. The gathered data may contain redundancies, which occupy storage space and increase storage cost; data preprocessing can remove them. Also, much of the gathered data may be unrelated to the analysis objective, so it needs to be compressed during preprocessing. Efficient data preprocessing is therefore indispensable for cost‐effective and efficient data storage. The preprocessed data is then passed on for purposes such as data modeling and data analytics.
1.8.3 Data Preprocessing
Data preprocessing is an important step performed on raw data to transform it into an understandable format and to provide access to consistent and accurate data. Data generated from multiple sources is often erroneous, incomplete, and inconsistent because of its massive volume and heterogeneous origins, and it is pointless to store useless and dirty data. Additionally, some analytical applications depend critically on high‐quality data. Hence, systematic data preprocessing is essential for effective, efficient, and accurate data analysis. The quality of the source data is affected by various factors. For instance, the data may contain errors, such as a salary field with a negative value (e.g., salary = −2000), which arise from transmission errors, typos, or intentionally wrong entries by users who do not wish to disclose their personal information. Incompleteness means that a record lacks a value for an attribute of interest (e.g., Education = “”), which may result from a field that is not applicable or from software errors. Inconsistency refers to discrepancies in the data; for example, the date of birth and the age in a record may not agree. Inconsistencies arise when data is collected from different sources, because of differences in naming conventions between countries and differences in input format (e.g., a date field entered as DD/MM but interpreted as MM/DD). Data sources often hold redundant data in different forms, so duplicates also have to be removed during preprocessing to make the data meaningful and error free. Data preprocessing involves the following steps:
1 Data integration
2 Data cleaning
3 Data reduction
4 Data transformation
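The three kinds of quality problems described above (errors, incompleteness, and inconsistency) can be detected with simple rule‐based checks. The following is a minimal sketch in Python, not a complete validation framework. The field names (salary, education, date_of_birth, age), the function name find_quality_issues, and the fixed reference date are assumptions made for illustration only.

```python
from datetime import date


def find_quality_issues(record, today):
    """Return a list of data-quality issues found in one record (a dict)."""
    issues = []

    # Error: a salary can never be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("error: negative salary")

    # Incompleteness: an attribute of interest is missing or empty.
    if not record.get("education"):
        issues.append("incomplete: education missing")

    # Inconsistency: the stated age disagrees with the date of birth.
    dob = record.get("date_of_birth")
    age = record.get("age")
    if dob is not None and age is not None:
        # Subtract one year if the birthday has not yet occurred this year.
        had_birthday = (today.month, today.day) >= (dob.month, dob.day)
        derived_age = today.year - dob.year - (0 if had_birthday else 1)
        if derived_age != age:
            issues.append("inconsistent: age does not match date of birth")

    return issues


# Hypothetical record that exhibits all three problems at once.
dirty = {
    "salary": -2000,
    "education": "",
    "date_of_birth": date(1990, 5, 17),
    "age": 40,
}
issues = find_quality_issues(dirty, today=date(2024, 1, 1))
for issue in issues:
    print(issue)
```

Passing the reference date as a parameter, rather than reading the system clock inside the function, keeps the check reproducible when the same data set is re‐examined later.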
1.8.3.1 Data Integration
Data integration involves combining data from different sources to give end users a unified view of the data. Integration poses several challenges. For example, when data is extracted from a person's profile, the order of the first name and the family name differs between cultures, so the two fields may be matched incorrectly. Data redundancies also often occur when data from multiple sources is integrated. Figure 1.11 illustrates that diversified sources such as organizations, smartphones, personal computers, satellites, and sensors generate disparate data such as e‐mails, employee details, WhatsApp chat messages, social media posts, online transactions, satellite images, and sensory data. These different types of structured, unstructured, and semi‐structured data have to be integrated and presented as unified data for data cleansing, data modeling, data warehousing, and extract, transform, and load (ETL) operations.
Figure 1.11 Data integration.
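As a rough illustration of the name‐ordering and redundancy problems mentioned above, the sketch below maps records from two sources onto one schema and then merges them. Everything in it is assumed for the example: the two sources ("HR" and "CRM"), their field layouts, the "Family, Given" name convention of the second source, and the choice of the e‐mail address as the matching key.

```python
def normalize_hr(row):
    # Hypothetical HR source: first and family name are separate fields.
    return {
        "email": row["email"].strip().lower(),
        "first_name": row["first_name"].strip(),
        "family_name": row["family_name"].strip(),
    }


def normalize_crm(row):
    # Hypothetical CRM source: a single "name" field, family name first,
    # written as "Family, Given". The order is swapped into the common schema.
    family, first = [part.strip() for part in row["name"].split(",", 1)]
    return {
        "email": row["email"].strip().lower(),
        "first_name": first,
        "family_name": family,
        "phone": row.get("phone", ""),
    }


def integrate(*sources):
    """Merge normalized records by e-mail; earlier sources take precedence."""
    unified = {}
    for records in sources:
        for rec in records:
            merged = unified.setdefault(rec["email"], {})
            for field, value in rec.items():
                # Fill a field only if it is still empty in the merged record.
                if value and not merged.get(field):
                    merged[field] = value
    return list(unified.values())


hr_rows = [
    {"email": "A.Kumar@example.com ", "first_name": "Anil", "family_name": "Kumar"},
]
crm_rows = [
    {"email": "a.kumar@example.com", "name": "Kumar, Anil", "phone": "555-0100"},
    {"email": "m.lee@example.com", "name": "Lee, Min", "phone": ""},
]

unified = integrate(
    [normalize_hr(r) for r in hr_rows],
    [normalize_crm(r) for r in crm_rows],
)
for person in unified:
    print(person)
```

The person who appears in both sources ends up as a single record that carries the phone number known only to the second source, which is the unified view that integration aims for. Real integration usually needs fuzzier matching than an exact key.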
1.8.3.2 Data Cleaning
The data‐cleaning process fills in missing values, corrects errors and inconsistencies, and removes redundancy to improve data quality. The more heterogeneous the data sources, the dirtier the data tends to be, and the more cleaning steps may be needed. Data cleaning involves several steps: identifying the error, correcting the error or deleting the erroneous data, and documenting the error type. Detecting the types of errors and inconsistencies present requires a detailed analysis of the data. Data redundancy is the repetition of data, which increases storage cost and transmission expenses and decreases data accuracy and reliability. Redundancy is handled by techniques such as redundancy detection and data compression. Missing values can be filled in manually, but this is tedious, time‐consuming, and impractical for massive volumes of data. A global constant can be used to fill in all the missing values, but this creates issues when the data is integrated, so it is not a foolproof method. Noisy data can be handled by four methods, namely, regression, clustering, binning, and manual inspection.
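Three of the cleaning operations named above can be sketched in a few lines of Python: removing duplicate records, filling missing values with a global constant, and smoothing noisy values by binning. The function names and the sample values are assumptions for illustration. The binning variant shown (equal‐frequency bins, each value replaced by its bin mean) is one common choice among several.

```python
def remove_duplicates(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # order-independent, hashable key
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def fill_missing(values, constant):
    """Replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    each value with the mean of its bin. Returns the smoothed values
    in sorted order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


rows = [
    {"id": 1, "city": "Pune"},
    {"city": "Pune", "id": 1},  # same record, different field order
    {"id": 2, "city": "Oslo"},
]
unique_rows = remove_duplicates(rows)

filled = fill_missing([12, None, 7, None], constant=0)

prices = [21, 4, 28, 15, 24, 8, 34, 21, 25]
smoothed = smooth_by_bin_means(prices, bin_size=3)

print(unique_rows)
print(filled)    # [12, 0, 7, 0]
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

The filled list also shows why a global constant is not foolproof: after filling, a genuine measurement of 0 can no longer be told apart from a value that was missing.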
1.8.3.3 Data Reduction
Processing a massive volume of data may take so long that data analysis becomes impractical or infeasible. Data reduction means reducing either the volume of the data or its dimension, that is, the number of attributes. Data reduction techniques make it possible to analyze the data in a reduced form without losing the integrity of the original data while still yielding quality outputs. Data reduction techniques include data compression, dimensionality reduction, and numerosity reduction. Data compression techniques are applied to obtain a compressed or reduced representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the reduction is called lossless. If, on the other hand, only an approximation of the original data can be reconstructed, the reduction is called lossy. Dimensionality reduction is the reduction of the number of attributes. Its techniques include wavelet transforms, in which the original data is projected onto a smaller space, and attribute subset selection, which removes irrelevant or redundant attributes. Numerosity reduction reduces the data volume by replacing the data with a smaller alternative representation. It is implemented using parametric and nonparametric methods. In parametric methods, only the parameters of a model fitted to the data are stored instead of the actual data. Nonparametric methods store reduced representations of the original data.
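Two of these ideas can be made concrete with a short sketch: lossless compression, where the original data is recovered exactly, and parametric numerosity reduction, where only model parameters are kept. The sketch uses zlib from the Python standard library for compression and a least‐squares straight line as the parametric model. The sample data and the helper name fit_line are assumptions for illustration, and a straight line is only one of many possible models.

```python
import zlib

# Lossless data compression: the round trip restores every byte.
raw = ("sensor_id=17,status=OK;" * 200).encode("utf-8")
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Parametric numerosity reduction: store two parameters instead of
# all the y values, then regenerate approximate values on demand.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]  # hypothetical, perfectly linear data
slope, intercept = fit_line(xs, ys)
approx = [slope * x + intercept for x in xs]
print(slope, intercept)
```

Because the sample data here lies exactly on a line, the regenerated values match the originals. With real, noisy data the model returns only an approximation, which makes parametric reduction a lossy technique in the sense defined above.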
1.8.3.4 Data Transformation
Data transformation refers to transforming or consolidating the data into an appropriate format and converting them into logical and meaningful information