Some time ago we conducted an experiment to measure the effect of a marketing campaign on customer purchases in a mobile app. A customer would receive a marketing message whose call to action was an in-app product purchase. The campaign aimed to lift the total number of in-app purchases as well as the average purchase amount per customer.

To measure the impact of the campaign, we designed a randomized controlled experiment splitting our customers into two groups: treatment and control. Customers in the treatment group received a marketing message. Customers in the control group did not. This was a classic A/B test. The rationale behind A/B testing is that random assignment of treatments balances all other factors across the groups, so any difference in the measured outcome can be attributed to the treatment.

Stratification as a first step

We needed the customers assigned to the treatment group to be similar to the customers assigned to the control group. Otherwise, any differences we observed between the groups could be attributed to differences in customer backgrounds rather than to the campaign. Simply trusting random assignment to cover all of that background variation assumes quite a lot: that most participants in our experiment come from the same place, speak the same language, are about the same age, and even earn roughly the same monthly income.

This is where the word randomized in randomized controlled experiments is important. In theory, if we assign users to groups at random, we are likely to end up with similar groups. But our experiment was not a theoretical exercise, and our base of only a few thousand customers made it necessary to take some precautions. We first divided the customers into four segments based on their purchase frequency and purchase type. We then randomly assigned the users within each segment to the treatment and control groups. This process is called stratification. Comparing results between the treatment and control groups only within each stratum takes some of the burden off randomization.
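To make the mechanics concrete, here is a minimal sketch of stratified assignment in Python. It is not our production code; the segment labels and the 50/50 split are illustrative assumptions. The idea is simply to shuffle the customers within each segment and split each shuffled list between treatment and control.

```python
import random
from collections import defaultdict

def stratified_assignment(customers, seed=0):
    """Randomly split customers into treatment and control within each segment.

    customers: iterable of (customer_id, segment) pairs.
    Returns a dict mapping customer_id -> "treatment" or "control".
    """
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for customer_id, segment in customers:
        by_segment[segment].append(customer_id)

    assignment = {}
    for ids in by_segment.values():
        rng.shuffle(ids)            # randomize order within the segment
        half = len(ids) // 2
        for cid in ids[:half]:
            assignment[cid] = "treatment"
        for cid in ids[half:]:
            assignment[cid] = "control"
    return assignment

# Illustrative usage with made-up segments (purchase frequency x purchase type):
customers = [
    (1, "frequent-digital"), (2, "frequent-digital"),
    (3, "frequent-physical"), (4, "frequent-physical"),
    (5, "occasional-digital"), (6, "occasional-digital"),
    (7, "occasional-physical"), (8, "occasional-physical"),
]
groups = stratified_assignment(customers)
```

Because the split happens inside each segment, both groups end up with the same mix of segments by construction, which is exactly the burden we wanted to take off pure randomization.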

Detecting misleading results

Having taken these basic precautions, we ran the experiment, collected the data, and analyzed the results. We discovered that users in the treatment group, who received the campaign, did not make more purchases, but did spend an average of 14% more on their purchases than the control group. Those looked like really valuable results.

While we were preparing to communicate the results to the marketing owner of the campaign, the discrepancy between the increase in purchase amount and the lack of change in the number of purchases felt suspicious. We looked deeper. One of the first checks we performed compared the amount spent on purchases between the groups before the experiment began. If a difference already existed before the experiment and any exposure to the campaign, then the campaign could not have caused the observed difference. Sure enough, customers in the treatment group had spent 12% more on their purchases than the control group even before the experiment started. Our findings started to look a lot less meaningful.
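The check itself is simple; the hard part is remembering to run it. Here is a rough sketch of the idea in Python, using invented numbers rather than our actual data: compute each group’s average spend in the window before the campaign and look at the relative gap.

```python
# Hypothetical inputs: the group assignment and each customer's total spend
# in the window *before* the campaign. All numbers are invented.
assignment = {1: "treatment", 2: "control", 3: "treatment", 4: "control"}
pre_campaign_spend = {1: 30.0, 2: 25.0, 3: 26.0, 4: 25.0}

def group_means(assignment, spend):
    """Average spend per group for the customers in the experiment."""
    totals = {"treatment": [], "control": []}
    for customer_id, group in assignment.items():
        totals[group].append(spend.get(customer_id, 0.0))
    return {group: sum(values) / len(values) for group, values in totals.items()}

pre = group_means(assignment, pre_campaign_spend)
gap = (pre["treatment"] - pre["control"]) / pre["control"]
print(f"Pre-campaign gap: {gap:+.0%}")  # a large gap means the groups were never comparable
```

Running the same comparison on a period before the experiment acts as a sanity check on the split itself: if the “effect” is already there before anyone saw a message, the campaign cannot be what caused it.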

Reality is complex and that makes experiments difficult

Going into this experiment, we were aware that customer behavior can bias group assignment in ways that randomization cannot magically solve. We took measures to mitigate that problem. We still failed to generate a design that could separate the treatment effect signal from the background noise. Randomization is not enough to ensure trustworthy results. Even stratification and randomization combined fail pretty easily. We built a design based on two aspects of customer behavior. Reality is much more complex than that.

No matter how many customer segments you can think of, there will always be alternative ways to segment, and segments you don’t even think of.

One last detail about our experiment: we were interested in the spread of purchase amounts across users in each of the treatment and control groups, hoping to understand what drove the differences. We found that the median purchase amounts in the treatment and control groups were very similar, but, simply by chance, the treatment group contained a few outlier customers who made substantially larger purchases. Those outliers pushed the average upwards. There weren’t many of them. It doesn’t take much to undermine a conscientious experimental design.
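A toy example shows how little it takes. The numbers below are invented, not our customer data: two groups with the same typical purchase amount, except that one of them contains two unusually large purchases.

```python
import statistics

control   = [10, 11, 12, 12, 13, 14, 15]
treatment = [10, 11, 12, 12, 13, 90, 120]   # same middle, two outliers at the top

print(statistics.median(control), statistics.median(treatment))  # 12 vs 12
print(round(statistics.mean(control), 1), round(statistics.mean(treatment), 1))  # 12.4 vs 36.9
```

Two customers out of seven nearly triple the group’s average while leaving the median untouched, which is why comparing only the means told us so little.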

Let aampe help you avoid expensive mistakes

We could have easily reported false results to the marketing owner. Maybe that one mistake would make a big difference to the business, but maybe it wouldn’t. Think of how many A/B tests may hide such biases. Over time, those little mistakes pile up.

Experiments are hard. Randomization isn’t magic. User segments are as likely to hide important variation as to expose it. Unless you’re dealing with extremely large numbers of participants, relying on random assignment and simple segmentation means you can be almost certain that many of your experiments will lie to you.

It doesn’t have to be that way. Aampe takes a smarter approach to experimental design, assigning treatments at a level of granularity that no amount of segmentation could match. Experiments are hard, but that doesn’t mean you have to do all the heavy lifting yourself.