S. No. | Data mining | Big data |
---|---|---|
1) | Data mining is the process of discovering the underlying knowledge from the data sets. | Big data refers to massive volume of data characterized by volume, velocity, and variety. |
2) | Structured data retrieved from spreadsheets, relational databases, etc. | Structured, unstructured, or semi‐structured data retrieved from non‐relational databases, such as NoSQL databases. |
3) | Data mining is capable of processing large data sets, but the data processing costs are high. | Big data tools and technologies are capable of storing and processing large volumes of data at a comparatively lower cost. |
4) | Data mining can process only data sets that range from gigabytes to terabytes. | Big data technology is capable of storing and processing data that range from petabytes to zettabytes. |
1.4 3 Vs of Big Data
Big data is distinguished by several characteristic dimensions, which are illustrated in Figure 1.2. The first dimension is the size of the data, referred to by the term “volume” in big data technology. Data size grows in part because cluster storage built on commodity hardware has made storing it cost effective. Commodity hardware is low‐cost, low‐performance, low‐specification hardware with no distinctive features. The second dimension is variety, which describes the heterogeneity of the data: big data accepts all data types, whether structured, unstructured, or a mix of both. The third dimension is velocity, which relates to the rate at which data is generated and processed to derive the desired value from the raw, unprocessed data. The complexity of the captured data poses both a new opportunity and a challenge for today’s information technology era.
Figure 1.2 3 Vs of big data.
1.4.1 Volume
The data generated and processed by big data systems grow at an ever‐increasing pace. Volume grows exponentially because business enterprises continuously capture data in order to build better business solutions. Big data volume ranges from terabytes to zettabytes (1024 GB = 1 terabyte; 1024 TB = 1 petabyte; 1024 PB = 1 exabyte; 1024 EB = 1 zettabyte; 1024 ZB = 1 yottabyte). Capturing this massive amount of data is cited as an extraordinary opportunity to achieve finer customer service and better business advantage. The ever‐increasing data volume demands highly scalable and reliable storage. The major sources contributing to this tremendous growth in volume are social media, point of sale (POS) transactions, online banking, GPS sensors, and sensors in vehicles. Facebook generates approximately 500 terabytes of data per day. Data are generated every time a link on a website is clicked, an item is purchased online, or a video is uploaded to YouTube.
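The unit ladder above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the factor of 1024 per step given in the text; the helper name `convert` and the 365‐day projection are illustrative assumptions, not figures from the source.

```python
# Storage units in the order given in the text; each step is a factor of 1024.
UNITS = ["GB", "TB", "PB", "EB", "ZB", "YB"]

def convert(value, from_unit, to_unit):
    """Convert a storage size between two of the units listed in UNITS."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * (1024 ** steps)

# The text's figure of roughly 500 TB per day, projected over 365 days
# (the projection assumes a constant daily rate, which is a simplification).
daily_tb = 500
yearly_tb = daily_tb * 365
yearly_pb = convert(yearly_tb, "TB", "PB")
print(f"{yearly_tb} TB per year is about {yearly_pb:.1f} PB")
```

Under these assumptions, a constant 500 TB per day accumulates to roughly 178 PB in a year, which shows why storage at this scale has to be planned in petabytes rather than terabytes.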
1.4.2 Velocity
With the dramatic increase in the volume of data, the speed at which data is generated has also surged. The term “velocity” refers not only to the speed at which data is generated but also to the rate at which it is processed and analyzed. In the big data era, massive amounts of data are generated at high velocity, and sometimes the data arrive so fast that they are difficult to capture, yet they still need to be analyzed. Figure 1.3 illustrates the data generated at high velocity in 60 seconds: 3.3 million Facebook posts, 450 thousand tweets, 400 hours of video uploads, and 3.1 million Google searches.
Figure 1.3 High‐velocity data sets generated online in 60 seconds.
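To get a feel for these rates, the per‐minute counts quoted for Figure 1.3 can be scaled to per‐second and per‐day figures. This is a minimal sketch that assumes the rates stay constant over time; the function names are illustrative, and the video figure is left out because it is a duration rather than a count of events.

```python
# Per-minute counts quoted in the text for Figure 1.3.
per_minute = {
    "Facebook posts": 3_300_000,
    "tweets": 450_000,
    "Google searches": 3_100_000,
}

def per_second(count_per_minute):
    """Average number of events per second."""
    return count_per_minute / 60

def per_day(count_per_minute):
    """Number of events in 24 hours, assuming a constant rate."""
    return count_per_minute * 60 * 24

for name, count in per_minute.items():
    print(f"{name}: {per_second(count):,.0f} per second, {per_day(count):,} per day")
```

At 3.3 million posts per minute, a system has to absorb about 55,000 posts every second on average, which is the practical meaning of high velocity.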
1.4.3 Variety
Variety refers to the formats of data that big data supports. Data arrives in structured, semi‐structured, and unstructured formats. Structured data is the data processed by traditional database management systems, where the data are organized in tables, such as employee details or bank customer details. Semi‐structured data combines features of structured and unstructured data; XML is an example. XML data is semi‐structured because it does not fit the formal data model (the table) associated with traditional databases; instead, it contains tags that organize fields within the data. Unstructured data is data with no definite structure, such as e‐mail messages, photos, and web pages. The data that arrive from Facebook, Twitter feeds, sensors in vehicles, and black boxes of airplanes are largely unstructured, and traditional databases cannot process them; this is where big data comes into the picture. Figure 1.4 represents the different data types.
Figure 1.4 Big data—data variety.
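The three formats can be contrasted with a small example. The sketch below uses only the Python standard library; the employee records and the e‐mail text are hypothetical data invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields (hypothetical data).
csv_text = "emp_id,name,dept\n101,Asha,Finance\n102,Ravi,Sales\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: tags label the fields, but records may differ in shape.
xml_text = """
<employees>
  <employee id="101"><name>Asha</name><dept>Finance</dept></employee>
  <employee id="102"><name>Ravi</name><phone>555-0100</phone></employee>
</employees>
"""
root = ET.fromstring(xml_text.strip())
records = [
    {"id": e.get("id"), **{child.tag: child.text for child in e}}
    for e in root.findall("employee")
]

# Unstructured: free text with no field markers at all.
email_body = "Hi team, Ravi from Sales will join the Finance review on Monday."

print(rows[0]["dept"])          # a column lookup works for every row
print(records[1])               # this record has a phone but no dept
print(len(email_body.split()))  # only generic text operations apply
```

The CSV rows can be queried by column name, the XML records have to be inspected tag by tag because their shape varies, and the e‐mail text offers no fields at all without further processing.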
1.5 Sources of Big Data
Multiple disparate data sources are responsible for the tremendous increase in the volume of big data. Much of this growth can be attributed to the digitization of almost everything around the globe. Paying e‐bills, shopping online, communicating through social media, exchanging e‐mail within organizations, and representing organizational data digitally are some examples of this digitization.
Sensors: Sensors that contribute to the large volume of big data are listed below.
- Accelerometer sensors installed in mobile devices to sense vibrations and other movements.
- Proximity sensors used in public places to detect the presence of objects without physical contact with the objects.
- Sensors in vehicles and medical devices.
Health care: The major sources of big data in health care are:
- Electronic Health Records (EHRs) collect and display patient information such as past medical history, prescriptions by the medical practitioners, and laboratory test results.
- Patient portals permit patients to access their personal medical records saved in EHRs.
- Clinical data repositories aggregate individual patient records from various clinical sources and consolidate them to give a unified view of patient history.
Black box: Data are generated by the black box in airplanes, helicopters, and jets. The black box captures the activities of flight, flight crew announcements, and aircraft performance information.
Figure 1.5 Sources of big data.
Web data: Data generated when a link on a website is clicked are captured by online retailers. These data are used to perform click stream analysis, which analyzes customer interests and buying patterns in order to generate recommendations based on those interests and to post relevant advertisements to consumers.
Organizational data: E‐mail transactions and documents that are generated within the organizations together contribute to the organizational data.
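The click stream analysis mentioned under Web data can be illustrated with a toy example. The sketch below is not a retailer's actual method; it assumes a hypothetical click log of (customer, product category) pairs and simply reports each customer's most‐clicked category as a stand‐in for inferred interest.

```python
from collections import Counter

# Hypothetical click log: one (customer_id, product_category) pair per click.
clicks = [
    ("c1", "books"), ("c1", "books"), ("c1", "laptops"),
    ("c2", "shoes"), ("c2", "shoes"), ("c2", "books"),
]

def top_interest(click_log):
    """Return each customer's most-clicked category."""
    per_customer = {}
    for customer, category in click_log:
        # Keep one Counter of category clicks per customer.
        per_customer.setdefault(customer, Counter())[category] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_customer.items()}

interests = top_interest(clicks)
print(interests)
```

A real system would work on far larger logs and weigh recency, purchases, and session context, but the counting step shows how raw clicks become a signal for recommendations and advertisements.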