Probability is one of the most fundamental concepts in statistics. To even get started making sense of your data by using statistics, you need to be able to identify something as basic as whether you’re looking at descriptive or inferential statistics. You also need a firm grasp of the basics of probability distribution. The following sections cover these concepts and more.
A statistic is a result that’s derived from performing a mathematical operation on numerical data. In general, you use statistics in decision-making. Statistics come in two flavors:
Descriptive: Descriptive statistics provide a description that illuminates some characteristic of a numerical dataset, including dataset distribution, central tendency (such as mean, min, or max), and dispersion (as in standard deviation and variance). For clarification, the mean of a data set is the average value of its data points, its min is the minimum value of its data points and the max is the maximum value. Descriptive statistics are not meant to illustrate any causal claims. Descriptive statistics can highlight relationships between X and Y, but they do not posit that X causes Y.
Inferential: Rather than focus on pertinent descriptions of a dataset, inferential statistics carve out a smaller section of the dataset and attempt to deduce significant information about the larger dataset. Unlike descriptive statistics, inferential methods, such as regression analysis, DO try to predict by studying causation. Use this type of statistics to derive information about a real-world measure in which you’re interested.
It’s true that descriptive statistics describe the characteristics of a numerical dataset, but that doesn’t tell you why you should care. In fact, most data scientists are interested in descriptive statistics only because of what they reveal about the real-world measures they describe. For example, a descriptive statistic is often associated with a degree of accuracy, indicating the statistic’s value as an estimate of the real-world measure.
To better understand this concept, imagine that a business owner wants to estimate the upcoming quarter’s profits. The owner might take an average of the past few quarters’ profits to use as an estimate of how much profit they’ll make during the next quarter. But if the previous quarters’ profits varied widely, a descriptive statistic that estimated the variation of this predicted profit value (the amount by which this dollar estimate could differ from the actual profits earned) would indicate just how far the predicted value could be from the actual one. (Not bad information to have, right?)
You can use descriptive statistics in many ways — to detect outliers, for example, or to plan for feature preprocessing requirements or to quickly identify which features you may want — or not want — to use in an analysis.
Like descriptive statistics, inferential statistics are used to reveal something about a real-world measure. Inferential statistics do this by providing information about a small data selection, so you can use this information to infer something about the larger dataset from which it was taken. In statistics, this smaller data selection is known as a sample, and the larger, complete dataset from which the sample is taken is called the population.
If your dataset is too big to analyze in its entirety, pull a smaller sample of this dataset, analyze it, and then make inferences about the entire dataset based on what you learn from analyzing the sample. You can also use inferential statistics in situations where you simply can’t afford to collect data for the entire population. In this case, you’d use the data you do have to make inferences about the population at large. At other times, you may find yourself in situations where complete information for the population isn’t available. In these cases, you can use inferential statistics to estimate values for the missing data based on what you learn from analyzing the data that’s available.
For an inference to be valid, you must select your sample carefully so that you form a true representation of the population. Even if your sample is representative, the numbers in the sample dataset will always exhibit some noise — random variation, in other words — indicating that the sample statistic isn’t exactly identical to its corresponding population statistic. For example, if you’re constructing a sample of data based on the demographic makeup of Chicago’s population, you would want to ensure that proportions of racial/ethnic groups in your sample match up to proportions in the population overall.
Probability distributions
Imagine that you’ve just rolled into Las Vegas and settled into your favorite roulette table over at the Bellagio. When the roulette wheel spins off, you intuitively understand that there is an equal chance that the ball will fall into any of the slots of the cylinder on the wheel. The slot where the ball lands is totally random, and the probability, or likelihood, of the ball landing in any one slot over another is the same. Because the ball can land in any slot, with equal probability, there is an equal probability distribution, or a uniform probability distribution — the ball has an equal probability of landing in any of the slots in the wheel.
But the slots of the roulette wheel aren’t all the same — the wheel has 18 black slots and 20 slots that are either red or green.
Because of this arrangement, the probability that your ball will land on a black slot is 47.4%.
Your net winnings here can be considered a random variable, which is a measure of a trait or value associated with an object, a person, or a place (something in the real world) that is unpredictable. Because this trait or value is unpredictable, however, doesn’t mean that you know nothing about it. What’s more, you can use what you do know about this thing to help you in your decision-making. Keep reading to find out how.
A weighted average is an average value of a measure over a very large number of data points. If you take a weighted average of your winnings (your random variable) across the probability distribution, this would yield an expectation value — an expected value for your net winnings over a successive number of bets. (An expectation can also be thought of as the best guess, if you had to guess.) To describe it more formally, an expectation is a weighted average of some measure associated with a random variable. If your goal is to model an unpredictable variable so that you can make data-informed decisions based on what you know about its probability in a population, you can use random variables and probability distributions to do this.
When considering the probability of an event, you must know what other events are possible. Always define the set of events as mutually exclusive — only one can occur at a time. (Think of the six possible results of rolling a die.) Probability has these two important characteristics:
The probability of any single event never goes below 0.0 or exceeds 1.0.
The probability of all events always sums to exactly 1.0.
Probability distribution is classified per these two types:
Discrete: A random variable where values can be counted by groupings
Continuous: A random variable that assigns probabilities to a range of value
To understand discrete and continuous distribution, think of two variables from a dataset describing cars. A “color” variable would have a discrete distribution because cars have only a limited range of colors (black, red, or blue, for example). The observations would be countable per the color grouping. A variable describing cars’ miles per gallon, or mpg, would have a continuous distribution because each car could have its