An A/B test is perhaps the most basic form of an experiment: take a group of people, randomly split them into two subgroups, send a different message to each group, and look at the difference in response. A/B tests are easy and widely available. They are also a really easy way to get misleading, and sometimes flat-out wrong, results. This naive approach of just assigning treatments to groups and looking at differences in results is an example of what we call unconditioned experiments.
When we introduced Aampe, we wrote that if you want to see how much a particular difference in your messaging impacts consumer behavior, an experiment structures your data collection so you can differentiate signal from noise. Unless you’re dealing with incredibly large sample sizes, results from unconditioned experiments cannot be trusted without a lot of subsequent analysis and caveats. Conditioned experiments generate results that can be trusted. Aampe conditions every experiment it runs. In this post, we explain how it does that.
Design of experiments
First, let’s speak very generally:
An experiment exposes a group of participants (or any unit of measurement: plants, people, pieces of equipment, etc.) to an altered condition, and observes how the exposed participants respond differently from participants not exposed to the condition. These altered conditions are called treatments. If you randomly assign which unit gets exposed to each treatment, you can generally be more confident that differences in treatment caused the differences in response.
However, random assignment on its own usually isn't enough, because other factors may influence a response. If most of the units assigned a treatment are also disproportionately exposed to some other influence, it becomes easy to mistakenly think there is a difference in response to the treatment when there was actually just a difference in baseline conditions. These potentially distorting factors are called confounders. The traditional way to deal with confounders is to make sure that each treatment is spread across the different levels of each potential confounder, so at the very least the interaction of a confounder and a treatment can be explicitly measured.
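To make the risk concrete, here is a minimal simulation sketch. It assumes a hypothetical group of 40 participants, half of whom share some trait (labelled "heavy user" purely for illustration), and counts how often a plain random split leaves the two arms at least 20 percentage points apart on that trait. The function name, the trait, and every number are assumptions chosen for the example.

```python
import random

def imbalance_rate(n_participants=40, n_trials=2000, threshold_points=20, seed=1):
    """Share of simple randomizations that leave a known trait unevenly split."""
    rng = random.Random(seed)
    # Hypothetical binary trait: the first half of participants are "heavy users".
    heavy_user = [i < n_participants // 2 for i in range(n_participants)]
    n_heavy = sum(heavy_user)
    arm_size = n_participants // 2
    lopsided = 0
    for _ in range(n_trials):
        order = list(range(n_participants))
        rng.shuffle(order)  # simple random assignment, no conditioning
        heavy_treated = sum(heavy_user[i] for i in order[:arm_size])
        heavy_control = n_heavy - heavy_treated
        # Compare in whole participants to avoid floating-point edge cases.
        if abs(heavy_treated - heavy_control) * 100 >= threshold_points * arm_size:
            lopsided += 1
    return lopsided / n_trials

rate = imbalance_rate()
print(f"Arms differ by 20+ points on the trait in {rate:.0%} of randomizations")
```

In this toy setup the rate should come out at roughly a third, and that is for a single trait we happen to know about. Larger samples shrink the problem but do not remove it.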
For example: say you want to test the effectiveness of a new fertilizer. You might test it at two different farms, each of which grows two different crops - say, soybeans and wheat. You'd want to make sure the fertilizer was administered to only part of the soybeans and part of the wheat at both farms. That way you can see if fertilized crops respond differently from unfertilized crops, but also account for the possibility that soybeans react differently than wheat, or that conditions at Farm A cause a different reaction than conditions at Farm B, or that wheat at Farm A reacts differently than wheat at Farm B, and so on. If you design your experiment to spread all treatments across all these different confounders, and you find a difference in treatment still tends to correspond to a difference in response, then you can feel more confident that the response you see is actually due to the treatment.
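The fertilizer design can be sketched in a few lines. The example below assumes each farm has four plots of each crop (the plot records and the `assign_within_blocks` helper are made up for illustration) and randomizes fertilizer versus control separately inside every farm-and-crop block.

```python
import random
from collections import defaultdict

def assign_within_blocks(plots, seed=0):
    """Randomly assign 'fertilizer' or 'control' inside every (farm, crop) block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for plot in plots:
        blocks[(plot["farm"], plot["crop"])].append(plot["id"])
    assignment = {}
    for key in sorted(blocks):
        ids = blocks[key][:]
        rng.shuffle(ids)  # randomize order within the block only
        half = len(ids) // 2
        for position, plot_id in enumerate(ids):
            assignment[plot_id] = "fertilizer" if position < half else "control"
    return assignment

# Hypothetical layout: 2 farms x 2 crops x 4 plots each.
plots = [
    {"id": f"{farm}-{crop}-{n}", "farm": farm, "crop": crop}
    for farm in ("A", "B")
    for crop in ("soybeans", "wheat")
    for n in range(4)
]
assignment = assign_within_blocks(plots)

# Tally treatments per block to confirm every block sees both conditions.
counts = defaultdict(lambda: {"fertilizer": 0, "control": 0})
for plot in plots:
    counts[(plot["farm"], plot["crop"])][assignment[plot["id"]]] += 1
for block in sorted(counts):
    print(block, counts[block])
```

Because the shuffle happens inside each block, no farm or crop can end up all fertilized or all unfertilized by chance.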
Human behavior is complicated
When running experiments that involve people, especially when those experiments are conducted “in the wild” rather than in a laboratory, it gets a lot harder to design experiments that account for potential confounders. There are four reasons:
- There are usually far too many possible confounders to realistically expect to spread all treatments across every possible combination.
- A lot of the real potential confounders are unknown, so it never occurs to anyone to measure them.
- In many cases, even if we know a potential confounder exists, we don’t have a way to measure it, or we can’t ethically incorporate it into the experiment.
- Most potential confounders overlap - a difference in one thing, like a person’s income, often corresponds to differences in other things, like where a person lives. So incorporating even a small number of known, measurable, ethical confounders into the design is anything but straightforward.
So go back to our fertilizer example: imagine Farm A and Farm B both grow soybeans and wheat, but some parts of the crops are tended by both farms, and some areas of both farms have soybeans and wheat growing in the exact same place, intermingled among one another. Also, there are some places where there are other crops besides soybeans and wheat growing among one or both of those crops, but there’s no way to know where. Also, some parts of some of the crops may be tended by Farm C, Farm D, or Farm E, but as the person conducting the experiment, you don’t even know those farms exist.
That’s not an incredibly realistic scenario for a fertilizer experiment, but it’s a fairly optimistic view of what a human behavioral experiment faces. Humans are complicated: their behavior is influenced by a whole lot of conditions, and those conditions change often.
Conditioning: latent-space transformation and matched assignment
The whole point of the traditional design of experiments - dividing units into lots of smaller groups representing different combinations of potential confounders, and then assigning treatments within those groups - is to mitigate, or at least make measurable, the extent to which confounders distort the results of the experiment. That is our goal with human behavioral experiments, but because of the difficulties inherent in those types of experiments, we have to take a different, somewhat more complicated, path to incorporate our confounders.
We start with dimensionality reduction. These types of algorithms take every bit of data we might have about the experiment participants, and transform it into a series of “components”. Each component represents a different pattern of shared variation in our data. So, to return to the example we used earlier, if income and location are highly correlated, our latent space might contain a component that represents the part of income and location that varies together. If other variables were also correlated with income and location, those would be included in that same component. That’s why these components are called a latent space - they represent the variation that lies hidden underneath the more human-interpretable metrics.
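As an illustration only, the sketch below uses principal component analysis, one common dimensionality reduction technique; the post does not say which algorithm Aampe uses, so treat PCA here as a stand-in. The data is synthetic: a made-up `location_cost` variable tracks `income`, while `app_sessions` is unrelated. The first component is found with plain power iteration so the example needs nothing beyond the standard library.

```python
import math
import random

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Correlation matrix of already-standardized columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def first_component(matrix, iterations=200):
    """Power iteration: repeatedly multiply a vector by the matrix and renormalize."""
    v = [1.0 / (j + 1) for j in range(len(matrix))]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rng = random.Random(42)
n = 500
income = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical data: location cost tracks income, app sessions do not.
location_cost = [0.8 * x + 0.6 * rng.gauss(0, 1) for x in income]
app_sessions = [rng.gauss(0, 1) for _ in range(n)]

names = ["income", "location_cost", "app_sessions"]
columns = [standardize(c) for c in (income, location_cost, app_sessions)]
loadings = first_component(correlation_matrix(columns))
# One latent score per participant: a weighted sum of their standardized values.
scores = [sum(l * col[i] for l, col in zip(loadings, columns)) for i in range(n)]

for name, loading in zip(names, loadings):
    print(f"{name:>14}: {loading:+.2f}")
```

The two correlated variables should receive large, similar loadings and the unrelated one a loading near zero, which is the "varies together" pattern described above.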
The form of latent space transformation we use at Aampe transforms all of the potential confounding factors into components that are largely uncorrelated with each other. So it reduces the problem of confounders that overlap. It also heightens our ability to capture confounders we don’t know about - if we know about confounders A, B, C, and D, but don’t know about confounder E, but confounder E is heavily correlated with confounders B and D, then the component that encapsulates the shared variation of those two confounders will also, to some extent, be an indirect measure of confounder E.
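A deliberately small sketch of both claims, under stated assumptions: two measured confounders (B and D) are each driven by an unmeasured confounder E, and all data is synthetic. For exactly two standardized variables, the principal components are the scaled sum and the scaled difference, so no general-purpose decomposition is needed. This is a simplified special case, not the transformation Aampe runs in production.

```python
import math
import random

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]

def pearson(a, b):
    """Pearson correlation between two equal-length columns."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / (len(a) - 1)

rng = random.Random(3)
n = 2000
# Hypothetical hidden confounder E: it shapes B and D but is never recorded.
confounder_e = [rng.gauss(0, 1) for _ in range(n)]
confounder_b = [e + 0.5 * rng.gauss(0, 1) for e in confounder_e]
confounder_d = [e + 0.5 * rng.gauss(0, 1) for e in confounder_e]

z_b, z_d = standardize(confounder_b), standardize(confounder_d)
# For two standardized variables the components are the sum and the difference.
shared = [(b + d) / math.sqrt(2) for b, d in zip(z_b, z_d)]
contrast = [(b - d) / math.sqrt(2) for b, d in zip(z_b, z_d)]

print(f"corr(shared, contrast) = {pearson(shared, contrast):+.3f}")
print(f"corr(B, hidden E)      = {pearson(confounder_b, confounder_e):+.3f}")
print(f"corr(shared, hidden E) = {pearson(shared, confounder_e):+.3f}")
```

The two components are uncorrelated with each other by construction, and the shared component should track the hidden confounder more closely than either measured variable does alone.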
Our latent space transformation gives us the ability to create a series of scores for any experiment participant that represent all of the things we know about that participant, independent of their participation in the experiment itself. That allows us to do two important things. First, we use everything we know before the experiment as a baseline when evaluating results. Second, that baseline de-noising gets better over time, as each experiment’s results get incorporated into the next experiment’s baseline.
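One way to see why a pre-experiment baseline helps is a covariate adjustment in the style of CUPED, a published variance-reduction technique. It is swapped in here as an illustration and is not claimed to be Aampe's method. The sketch assumes a single made-up baseline score per participant and a true treatment effect of 0.5, then repeats a simulated experiment many times to compare how much the naive and the adjusted estimates wobble.

```python
import random
import statistics

TRUE_EFFECT = 0.5  # assumed effect size for the simulation

def mean_difference(values, treated):
    """Treated-group mean minus control-group mean."""
    treated_values = [v for v, t in zip(values, treated) if t]
    control_values = [v for v, t in zip(values, treated) if not t]
    return statistics.mean(treated_values) - statistics.mean(control_values)

def remove_baseline(outcomes, baseline):
    """Subtract the part of each outcome that the baseline score predicts."""
    mean_x = statistics.mean(baseline)
    mean_y = statistics.mean(outcomes)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, outcomes))
    variance = sum((x - mean_x) ** 2 for x in baseline)
    theta = covariance / variance
    return [y - theta * (x - mean_x) for x, y in zip(baseline, outcomes)]

def simulate_once(rng, n=200):
    baseline = [rng.gauss(0, 1) for _ in range(n)]  # known before the experiment
    treated = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(treated)
    # Outcome depends far more on the baseline than on the treatment.
    outcomes = [TRUE_EFFECT * t + 2.0 * x + rng.gauss(0, 1)
                for x, t in zip(baseline, treated)]
    naive = mean_difference(outcomes, treated)
    adjusted = mean_difference(remove_baseline(outcomes, baseline), treated)
    return naive, adjusted

rng = random.Random(7)
results = [simulate_once(rng) for _ in range(300)]
naive_estimates = [r[0] for r in results]
adjusted_estimates = [r[1] for r in results]

print(f"naive    spread: {statistics.stdev(naive_estimates):.3f}")
print(f"adjusted spread: {statistics.stdev(adjusted_estimates):.3f}")
```

Both estimators should center on the true effect, but the adjusted one should be noticeably tighter, because the variation explained by the baseline no longer masquerades as noise.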
We can only conduct that kind of after-experiment analysis, however, if we have ensured that we assigned all possible treatments, including controls, across all the major combinations of latent components. For example, if we only assign treatments to participants who have a high score on the first component, or to all participants who score low on the first component but high on the second, we won’t have the data to get an accurate measure of those components’ influences on responses, because our treatments won’t be spread out across units that capture those differences in background information.
We can solve that problem by clustering our data as represented in the latent space. If we have a simple experiment - just a treatment and a control - then we want clusters that have no fewer than two participants in them. If we have two alternative treatments and a control, we want clusters with no fewer than three participants. Then we randomly assign participants to treatments within each cluster. The process is conceptually similar to Coarsened Exact Matching, a procedure often used to enable causal inference in observational data.
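The cluster-then-assign step might look something like the sketch below. The greedy nearest-neighbour grouping is a simple stand-in for whatever clustering algorithm is actually used, and the participant IDs, two-component latent scores, and function names are all hypothetical. The only properties it tries to preserve from the description above are that every cluster has at least as many participants as there are treatments, and that treatments are randomized within each cluster.

```python
import math
import random

def greedy_clusters(scores, min_size):
    """Group each seed participant with its nearest unassigned neighbours."""
    if len(scores) < min_size:
        raise ValueError("need at least min_size participants")
    unassigned = dict(scores)
    clusters = []
    while len(unassigned) >= min_size:
        seed = min(unassigned)  # deterministic choice of the next seed
        seed_point = unassigned.pop(seed)
        nearest = sorted(
            unassigned, key=lambda pid: math.dist(unassigned[pid], seed_point)
        )[: min_size - 1]
        for pid in nearest:
            del unassigned[pid]
        clusters.append([seed] + nearest)
    # Leftovers (fewer than min_size) join the cluster with the closest seed.
    for pid, point in unassigned.items():
        closest = min(clusters, key=lambda c: math.dist(scores[c[0]], point))
        closest.append(pid)
    return clusters

def assign_within_clusters(clusters, treatments, rng):
    """Shuffle each cluster, then deal treatments out in rotation."""
    assignment = {}
    for cluster in clusters:
        members = cluster[:]
        rng.shuffle(members)
        for position, pid in enumerate(members):
            assignment[pid] = treatments[position % len(treatments)]
    return assignment

rng = random.Random(11)
# Hypothetical latent scores: two components per participant.
scores = {f"user{i:02d}": (rng.gauss(0, 1), rng.gauss(0, 1)) for i in range(11)}
treatments = ["control", "message_a", "message_b"]

clusters = greedy_clusters(scores, min_size=len(treatments))
assignment = assign_within_clusters(clusters, treatments, rng)

for cluster in clusters:
    print([(pid, assignment[pid]) for pid in cluster])
```

Dealing treatments in rotation after a shuffle guarantees that every cluster contains every treatment at least once, so each region of the latent space contributes to every comparison.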
Conditioned experiments are trustworthy experiments
Conditioning ensures that treatments are assigned in a way that minimizes the chances that any individual treatment will get disproportionately assigned to a biased subset of participants, and it also sets you up to explicitly measure that bias after you get results. And the latent space transformation makes that measurement easy to the point that it can be entirely automated. It allows you to cope with however many confounders your participants want to throw at you, handles confounder overlap, and mitigates the biasing impact of unmeasurable and unknown confounders. If you are not conditioning your experiments, you are setting your experiments up to lie to you.