3 Measure a response for each group. The outcome of interest has to be chosen before the experiment is conducted, and the method for performing the measurement has to be decided in advance, too.
4 Compare group responses to determine which treatment is better. This is accomplished with a statistical test, which can be something as simple as a two‐sample test of means. Whatever test is applied, it will tell us whether the difference between the groups could be just random or is more likely due to some systematic effect (a sketch of this step follows below).
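As a concrete illustration of step 4, here is a minimal Python sketch of a two‐sample test of means. The data are simulated, and the group names, sample sizes, and effect size are hypothetical, chosen only to make the example runnable:

```python
import numpy as np
from scipy import stats

# Simulated responses for two groups (hypothetical numbers, for
# illustration only): the treatment group gets a slightly higher mean.
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.4, scale=2.0, size=500)

# Two-sample t-test of means: is the observed difference larger than
# what random variation alone would plausibly produce?
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"difference in means: {treatment.mean() - control.mean():.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p‐value says the difference is unlikely to be pure chance; it does not, by itself, say the difference is large enough to matter commercially.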
We know that observational data can give only correlations, not causation. With a well‐designed experiment, we can answer causal questions: Does changing X cause Y to change?
[Yahoo did a study] to assess whether display ads for a brand, shown on Yahoo sites, can increase searches for the brand name or related keywords. The observational part of the study estimated that ads increased the number of searches by 871 percent to 1,198 percent. But when Yahoo ran a controlled experiment, the increase was only 5.4 percent. If not for the control, the company might have concluded that the ads had a huge impact and wouldn't have realized that the increase in searches was due to other variables that changed during the observation period.
1.4.2 Big Three of Causality
The “Big Three” criteria for being able to make causal inference are as follows:
1 When X changed, Y also changed. If X changes and Y doesn't change, then we cannot assert that X causes Y (sometimes this is useful information).
2 X happened before Y. If X happens after Y, then X cannot cause Y. This issue arises sometimes in marketing research, where a commercial is shown one day and sales on that same day are measured. How can we know that today's sales weren't affected by something that happened yesterday?
3 Nothing else besides X changed systematically. If variables W and Z change at the same time that X changes – not every time, but often enough – then we cannot rule out the possibility that W and Z are causing the changes in Y. Observational data cannot rule out this possibility. The random treatment assignment of an experiment can rule this out.
Experimentation is the art of making sure these criteria are met so that valid causal statements can be made. Much more will be said about this in the next two chapters. The problem with observational data is that at least one of the three criteria is always missing, usually the third.
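To see why random assignment addresses the third criterion, consider a minimal Python sketch (simulated data; the variable names and numbers are hypothetical). Because each unit is assigned by a coin flip, a lurking variable such as age ends up balanced across the groups and cannot change systematically with the treatment:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
age = rng.normal(loc=40, scale=12, size=n)  # a potential lurking variable
treated = rng.random(n) < 0.5               # coin-flip treatment assignment

# Randomization balances the lurking variable across the two groups,
# so any difference in outcomes cannot be attributed to it.
print(f"mean age, treatment: {age[treated].mean():.2f}")
print(f"mean age, control:   {age[~treated].mean():.2f}")
```

The same argument applies to every lurking variable at once, measured or not, which is what makes randomization so powerful.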
You should consider performing an experiment when you have many items on which to experiment (i.e. “experimental units”), when the outcomes for these units can be measured easily, and when you have control over the treatments. In manufacturing, for example, many items come off the assembly line, so there is an abundance of experimental units. Measurement is typically easy: Does the item work, or how well does it work? And it is often easy to apply treatments to some units and not to others.
In digital marketing, experimental units are available in very large quantities – think website visitors. However, measurement can sometimes be problematic: What constitutes a “successful” website visit? Is success an immediate purchase or a purchase three weeks later? How can you tell that the same visitor returned three weeks later? Control of treatments can also be difficult. We can put an ad on a webpage that a customer visited, but how do we know he actually saw it? In the case of television, we can run a commercial, but how do we know who actually saw it? How do we know which of the people who bought our product this week saw the ad? These are some of the problems we will deal with in later chapters.
1.4.3 Most Experiments Fail
It is important to remember that the purpose of an experiment is to test an idea, not to prove it, and that most experiments fail! This may sound depressing, but it is hugely effective if you can create a process that allows bad ideas to fail quickly and with minimal investment:
“[Our company has] tested over 150 000 ideas in over 13 000 MVT [multivariate testing] projects during the past 22 years. Of all the business improvement ideas that were tested, only about 25 percent (one in four) actually produced improved results; 53 percent (about half) made no difference (and were a waste of everybody's time); and 22 percent (that would have been implemented otherwise) actually hurt the results they were intended to help” (Holland and Cochran, 2005, p. 21).
“Netflix considers 90% of what they try to be wrong” (Moran, 2007, p. 240).
“I have observed the results of thousands of business experiments. The two most important substantive observations across them are stark. First, innovative ideas rarely work. When a new business program is proposed, it typically fails to increase shareholder value versus the previous best alternative” (Manzi, 2012, p. 13).
Writing of the credit card company Capital One (Goncalves, 2008, p. 27): “We run thirty thousand tests a year, and they all compete against each other on the basis of economic results. The vast majority of these experiments fail, but the ones that succeed can hit very big[.]”
“Given a ten percent chance of a 100 times payoff, you should take that bet every time. But you're still going to be wrong nine times out of ten.” Amazon CEO Jeff Bezos wrote this in his 2016 letter to shareholders.
“Economic development builds on business experiments. Many, perhaps most experiments fail” (Eliasson, 2010, p. 117).
You are not going to get useful results from most of the experiments that you conduct. But, as Thomas Edison said of inventing the lightbulb, “I have not failed. I've just found 10 000 ways that didn't work.” Failed experiments are not worthless; they can contain much useful information: Why didn't the experiment work? Did we rely on false assumptions? Was the design faulty? Everything learned from a failed experiment can help make the next experiment better.
When dealing with human subjects, where effect sizes are small and there is a lot of noise, there can be a tendency toward false positives (especially when sample sizes are small!), so follow‐up experiments are important to confirm that a discovered effect really exists. A small simulation below illustrates the danger.
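To make the danger concrete, here is a minimal Python sketch (simulated data; the sample sizes are hypothetical). Every experiment below compares two groups drawn from the same distribution, so the treatment truly does nothing, yet about 1 in 20 experiments will still look “significant” at the conventional 5 percent level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 1_000, 20  # many small, noisy experiments

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0, scale=1, size=n_per_group)
    b = rng.normal(loc=0, scale=1, size=n_per_group)  # no real effect
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# Roughly 5% of the null experiments look "significant" by chance alone.
print(f"false positive rate: {false_positives / n_experiments:.3f}")
```

A follow‐up experiment is unlikely to repeat the same accident, which is why replication is the natural safeguard against such chance findings.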
Even with large samples, it is best to make sure that a discovered effect really exists. In webpage development, an experiment to optimize a webpage might prove fruitful, yet the improvement will not immediately be rolled out to all users. Instead, it might be rolled out to 5% of users to guard against the possibility that some unforeseen development might render the improvement futile or, worse, harmful. Only after it has been deemed successful with the 5% sample will it be rolled out to all users.
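One common way to implement such a staged rollout is to hash a stable user identifier into a bucket, so that the same user sees the same version on every visit. The following Python sketch is illustrative only; the function name, the 5 percent threshold, and the ID format are assumptions, not a description of any particular company's system:

```python
import hashlib

ROLLOUT_PERCENT = 5  # expose the improved page to 5% of users first

def in_rollout(user_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically map a user ID to a bucket in 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# The same user always lands in the same bucket, visit after visit.
print(in_rollout("visitor-12345"))
```

If the 5% group performs as expected, the threshold can simply be raised; because the hash is deterministic, users already in the rollout stay in it as it expands.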
Exercises
1.4.1 Suppose the company in the invoice example billed quarterly and had, on average, $10 million in accounts receivable each quarter. If short‐term money costs 6%, how much does the company save?
1.4.2 Give an example of a business hypothesis, e.g. we think that raising price from $2 to $2.25 won't cost us sales. Describe an experiment to test your hypothesis. What data need to be collected? How should the data be collected?
1.4.3 Find an example of a business experiment reported in the popular business literature, e.g. Forbes or The Wall Street Journal.
1.5 Improving Website Designs
One of the most popular types of experiment in business is the A/B website test. For example, Figure 1.4 shows two different versions