1 Introduction to the World of Big Data
CHAPTER OBJECTIVE
This chapter introduces big data and defines what the term actually means. The limitations of traditional databases, which led to the evolution of big data, are explained, and insight into key big data concepts is provided. A comparative study of big data and traditional databases gives a clear picture of the drawbacks of the traditional database and the advantages of big data. The three Vs of big data (volume, velocity, and variety) that distinguish it from the traditional database are explained. With the evolution of big data, we are no longer limited to structured data. The different types of human- and machine-generated data that big data systems can handle, namely structured, semi-structured, and unstructured, are explained, and the various sources contributing to this massive volume of data are described. The chapter then walks through the stages of the big data life cycle, from data generation through acquisition, preprocessing, integration, cleaning, transformation, and analysis to visualization in support of business decisions. Finally, this chapter sheds light on the various challenges of big data arising from its heterogeneity, volume, velocity, and more.
1.1 Understanding Big Data
With the rapid growth of Internet users, there has been an exponential growth in the data being generated. The data comes from the millions of messages we exchange via WhatsApp, Facebook, or Twitter, from the trillions of photos taken, and from the hours upon hours of video uploaded to YouTube every single minute. According to a recent survey, 2.5 quintillion (2 500 000 000 000 000 000, or 2.5 × 10¹⁸) bytes of data are generated every day. This enormous amount of data is referred to as "big data." Big data does not merely mean that the data sets are too large; it is a blanket term for data that are too large in size, complex in nature, possibly structured or unstructured, and arriving at high velocity as well. Of the data available today, 80 percent has been generated in the last few years, and the growth of big data is fueled by the fact that more data are generated in every corner of the world, all of which needs to be captured.
Capturing this massive data yields only meager value unless this IT value is transformed into business value. Managing and analyzing data have always been beneficial to organizations; converting the data into valuable business insights, on the other hand, has always been the greatest challenge. Data scientists struggled to find pragmatic techniques to analyze the captured data; the data has to be managed at an appropriate speed and time to derive valuable insight from it. The data became so complex that it was difficult to process using traditional database management systems, which triggered the evolution of the big data era. Additionally, there were constraints on the amount of data that traditional databases could handle: as the size of the data increased, either performance decreased and latency increased, or it became expensive to add additional memory units. All these limitations have been overcome by big data technologies, which let us capture, store, process, and analyze data in a distributed environment. Examples of big data technologies are Hadoop, a framework for the overall big data process; the Hadoop Distributed File System (HDFS) for storage on a distributed cluster; and MapReduce for processing, the idea of which is sketched below.
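To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. The sample documents are invented, and this is only an illustration of the programming model: real Hadoop distributes these same map, shuffle, and reduce phases across a cluster of machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts to get a total per word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data arrives fast"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'is': 1, 'arrives': 1, 'fast': 1}
```

The key design point is that each map task and each reduce task is independent of the others, which is what lets a framework like Hadoop run thousands of them in parallel on different machines.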
1.2 Evolution of Big Data
The first documented appearance of the term big data was in a 1997 paper by NASA scientists describing the problems they faced in visualizing large data sets, a captivating challenge for data scientists: the data sets were large enough to tax the available memory resources. This problem was termed big data. Big data as a broader concept was first put forward by the noted consultancy McKinsey, and its three dimensions, namely volume, velocity, and variety, were defined by the analyst Doug Laney. The processing life cycle of big data can be categorized into acquisition, preprocessing, storage and management, privacy and security, analysis, and visualization; a toy sketch of such a pipeline follows.
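As a rough illustration of how these life cycle stages chain together, the following Python sketch stands in for each stage with a stub function. Every function body and data value here is hypothetical and only hints at what a real system would do at that stage.

```python
def acquire():
    # Acquisition: pull raw records from some source (invented sample data).
    return ['  42 ', 'n/a', ' 17', '8 ']

def preprocess(raw):
    # Preprocessing/cleaning: trim whitespace and drop unusable records.
    return [r.strip() for r in raw if r.strip().isdigit()]

def transform(clean):
    # Transformation: cast records into an analyzable type.
    return [int(r) for r in clean]

def analyze(values):
    # Analysis: derive a summary that can inform a business decision.
    return {'count': len(values), 'mean': sum(values) / len(values)}

print(analyze(transform(preprocess(acquire()))))
# {'count': 3, 'mean': 22.333333333333332}
```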
The broader term big data encompasses everything from web data, such as clickstream data, to the health data of patients, genomic data from biological research, and so forth.
Figure 1.1 shows the evolution of big data. The growth of the data over the years has been massive: it was just 600 MB in the 1950s but had grown by 2010 to 100 petabytes, which is equal to 100 000 000 000 MB.
Figure 1.1 Evolution of Big Data.
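The unit arithmetic quoted above is easy to verify. The snippet below assumes decimal (SI) units, i.e. 1 PB = 10^15 bytes and 1 MB = 10^6 bytes:

```python
PB, MB = 10**15, 10**6   # decimal (SI) definitions of a petabyte and a megabyte

print(100 * PB // MB)    # 100 PB expressed in MB -> 100000000000
print(2.5e18 / 10**18)   # 2.5 quintillion bytes = 2.5 exabytes per day -> 2.5
```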
1.3 Failure of Traditional Database in Handling Big Data
Relational database management systems (RDBMS) were, until recently, the most prevalent storage medium for the data generated by organizations, and a large number of vendors provide such database systems. These RDBMS were devised to store data that were beyond the storage capacity of a single computer. The inception of a new technology is always due to limitations in the older technologies and the necessity to overcome them. The limitations of the traditional database in handling big data are listed below.
The exponential increase in data volume, which now scales to terabytes and petabytes, turned out to be a challenge for the RDBMS in handling such a massive volume of data.
To address this issue, RDBMS vendors increased the number of processors and added more memory units, which in turn increased the cost.
Almost 80% of the data captured is in semi-structured and unstructured formats, which RDBMS could not deal with, as the sketch after this list illustrates.
RDBMS could not capture data coming in at high velocity.
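To illustrate the semi-structured data problem from the list above, consider two JSON records from the same hypothetical feed; the field names and values are invented. A relational table must declare every column up front, whereas big data systems typically interpret each record's structure at read time:

```python
import json

# Two records from the same feed: the second carries fields the first lacks,
# so no single fixed column set fits both without a schema change.
records = [
    '{"user": "alice", "msg": "hello"}',
    '{"user": "bob", "msg": "hi", "geo": {"lat": 48.1, "lon": 11.6}, "tags": ["greeting"]}',
]

for line in records:
    rec = json.loads(line)
    # Missing fields are handled at read time rather than at schema-design time.
    print(rec["user"], rec.get("geo", "no location"), rec.get("tags", []))
```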
Table 1.1 shows the differences in the attributes of RDBMS and big data.
1.3.1 Data Mining vs. Big Data
Table 1.2 shows a comparison between data mining and big data.
Table 1.1 Differences in the attributes of big data and RDBMS.
| Attributes    | RDBMS                  | Big data                         |
|---------------|------------------------|----------------------------------|
| Data volume   | gigabytes to terabytes | petabytes to zettabytes          |
| Organization  | centralized            | distributed                      |
| Data type     | structured             | unstructured and semi-structured |
| Hardware type | high-end model         | commodity hardware               |
| Updates       | read/write many times  | write once, read many times      |
| Schema        | static                 | dynamic                          |