algorithms to combine machine learning approaches to achieve results that are better than would be available from any single machine learning method on its own.

      

Visit the companion website to this book (https://businessgrowth.ai/) to get a quick-start guide to selecting the best deep learning network for your most immediate needs.

      FIGURE 3-3: Machine learning algorithms can be broken down by function.

      FIGURE 3-4: Neural networks are connected layers of artificial neural units.

      FIGURE 3-5: A deep learning network is a neural network with more than one hidden layer.

      Using Spark to generate real-time big data analytics

      Apache Spark is an in-memory distributed computing framework that you can use to deploy machine learning algorithms on big data sources and generate near-real-time analytics from streaming data. Whew!

      

In-memory refers to processing data within the computer’s memory, without reading and writing the computational results to disk. In-memory computing delivers results much faster, but the amount of data it can handle per processing interval is limited by the size of available memory.

      Because it processes data in microbatches with 3-second cycle times, you can use Spark to significantly decrease time-to-insight in cases where time is of the essence. It can run on data that sits in a wide variety of storage architectures, including Hadoop HDFS, Amazon Redshift, MongoDB, Cassandra, Solr, and AWS. Spark is composed of the following submodules:

       Spark SQL: You use this module to work with and query structured data from within Spark by using its built-in SQL package, Spark SQL. You can also query structured data using Hive, but then you’d write the queries in the HiveQL language and run them on the Spark processing engine. (A minimal sketch combining Spark SQL and MLlib appears after this list.)

       GraphX: The GraphX library is how you store and process graph (network) data from within Spark.

       Streaming: The Streaming module is where the big data processing takes place. This module basically breaks a continuously streaming data source into much smaller data streams, called DStreams (discretized streams). Because the DStreams are small, these batch cycles can be completed within three seconds, which is why it’s called microbatch processing. (A minimal streaming sketch appears at the end of this section.)

       MLlib: The MLlib submodule is where you analyze data, generate statistics, and deploy machine learning algorithms from within the Spark environment. MLlib has APIs for Java, Scala, Python, and R. It lets data professionals build machine learning models in Python or R that pull data directly from the requisite data storage repository, whether that sits on-premises, in a cloud, or even in a multicloud environment. This helps reduce the reliance that data scientists sometimes have on data engineers. Furthermore, in-memory computations in Spark can run up to 100 times faster than the same jobs in the traditional MapReduce framework.
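      To make the Spark SQL and MLlib ideas concrete, here’s a minimal PySpark sketch. The customers.csv file, its column names, and the 0/1 churned label are hypothetical stand-ins for your own data; the Spark calls themselves are standard PySpark API.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Start a local Spark session (point .master() at a cluster URL in production)
spark = SparkSession.builder.appName("SparkSketch").master("local[*]").getOrCreate()

# Spark SQL: load a (hypothetical) CSV file, register it as a view, query it
df = spark.read.csv("customers.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("customers")
adults = spark.sql("SELECT age, income, churned FROM customers WHERE age >= 18")

# MLlib: assemble the feature columns into a vector and fit a logistic
# regression; assumes churned is already a numeric 0/1 column
features = VectorAssembler(inputCols=["age", "income"], outputCol="features")
train = features.transform(adults).withColumnRenamed("churned", "label")
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()

      Notice that the same DataFrame flows straight from the SQL query into model training without ever leaving Spark, which is exactly the workflow the MLlib bullet describes.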

      You can deploy Spark on-premises by downloading the open-source framework from the Apache Spark website, at http://spark.apache.org/downloads.html. Another option is to run Spark in the cloud via the Databricks service, at https://databricks.com.
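      If you want to see the Streaming module’s microbatch model in action, here’s a minimal sketch using Spark’s classic DStream API, which matches the description earlier in this section. The localhost socket on port 9999 is a placeholder source; note also that newer Spark releases steer you toward the Structured Streaming API for the same job.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Local Spark context with two threads: one receives data, one processes it
sc = SparkContext("local[2]", "MicrobatchSketch")

# A 3-second batch interval: each DStream microbatch collects three seconds'
# worth of records before Spark processes them
ssc = StreamingContext(sc, 3)

# Placeholder source: text lines arriving on a local TCP socket
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each microbatch's word counts as it completes

ssc.start()             # begin receiving and processing microbatches
ssc.awaitTermination()  # run until the stream is stopped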

      Math, Probability, and Statistical Modeling

      IN THIS CHAPTER

      

Introducing the core basics of statistical probability

      

Quantifying correlation

      

Reducing dataset dimensionality

      

Building decision models with multiple criteria decision-making

      

Diving into regression methods

      

Detecting outliers

      

Talking about time series analysis

      Math and statistics are not the scary monsters that many people make them out to be. In data science, the need for these quantitative methods is simply a fact of life — and nothing to get alarmed over. Although you must have a handle on the math and statistics that are necessary to solve a problem, you don’t need to go study for degrees in those fields.

      Contrary to what many pure statisticians would have you believe, the data science field isn’t the same as the statistics field. Data scientists have substantive knowledge in one field or several fields, and they use statistics, math, coding, and strong communication skills to help them discover, understand, and communicate data insights that lie within raw datasets related to their field of expertise. Statistics is a vital component of this formula, but not more vital than the others. In this chapter, I introduce you to the basic ideas behind probability, correlation analysis, dimensionality reduction, decision modeling, regression analysis, outlier detection, and time series analysis.