Data Science

A walk through our conditioned experimentation process

How we design our experiments, step by step, with illustrations.

Schaun Wheeler
April 22, 2021

We explained our approach to designing “conditioned” experiments in a previous post. Briefly: humans are complicated, so if you want experiment results that don’t lie, you have to assign your treatments in a way that takes that complexity into account. The traditional approach of just randomly assigning treatments only works if you have a huge sample size - and even then, it is fragile. The traditional workaround of assigning treatments within a few segments doesn’t work here either: there are too many possible segments and too many ways those segments overlap.

In this post, I’m going to illustrate that complexity - and how we deal with it - using a real-but-anonymized dataset. If you want to see it at work on your own dataset, we have a tool that lets you do that.

A typical dataset is really messy

So let’s start by looking at the dataset (scroll vertically and horizontally to take a look). This is just the first 50 rows. The full dataset contains around 27,000 records, each one representing a different customer.

As you can see, this looks like some pretty standard metrics for a company with multiple stores and a robust e-commerce presence. For each customer, they have data points like number and type of transactions, revenue, and similar information. Aampe doesn’t need any of this information to run an experiment, but the more information you can provide, the better we can set up your experiments for success. The point is to bring whatever data you happen to have - we’ll take care of the rest.

So now let’s take a look at the distributions of all of the data (click on tabs to see additional columns):

The bigger and darker the square, the higher the correlation between variables. That line of big, dark squares along the diagonal is each column’s correlation with itself. You can see - hover over a square to see the specific value - that quite a few variables co-vary with other variables. (Incidentally, we used the absolute value of Pearson’s coefficient to measure correlation between two continuous variables, Theil’s U - also called the Uncertainty Coefficient, based on conditional entropy - for correlations between two categorical variables, and the Correlation Ratio for correlations between continuous and categorical variables. All of the correlations run from 0.0 to 1.0.)
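
For readers who want to try the three measures named above on their own columns, here is a minimal sketch in plain Python. The function names are my own for illustration, and this is not necessarily how our pipeline computes them. One assumption to flag: I return the unsquared Correlation Ratio (eta), though some tools report eta squared instead.

```python
import math
from collections import Counter, defaultdict


def abs_pearson(x, y):
    """Absolute Pearson correlation between two continuous sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return abs(cov) / math.sqrt(var_x * var_y)


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def theils_u(x, y):
    """Uncertainty coefficient U(x|y): share of x's entropy removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x has a single level, so it is trivially predictable
    groups = defaultdict(list)
    for a, b in zip(x, y):
        groups[b].append(a)
    # Conditional entropy H(x|y): entropy of x within each level of y, weighted.
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a continuous variable."""
    grand_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0
```

Note that Theil’s U is asymmetric: `theils_u(x, y)` and `theils_u(y, x)` can differ, so a heatmap built with it will not necessarily be symmetric around the diagonal.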

So let’s summarize where we’re at:

  • A typical customer dataset has a lot of variables.
  • Non-categorical variables need to be turned into categoricals to be able to create cohorts. The simplest way is to split them down the middle to turn them into two-category binaries, but that’s not necessarily a good way to do things.
  • Categorical variables often have a lot of levels - sometimes hundreds.
  • Missing values add even more levels that need to be taken into consideration.
  • For an average dataset, if you create treatment cohorts out of all combinations of all variables, you’ll end up with more cohorts than you can reasonably fill, and most of those will have too few people to receive at least one of each treatment.
  • Even if none of the above problems existed, most variables correlate with others - often substantially - which means that any treatment cohorts you create will over-represent certain parts of the customer variation more than others without you knowing it.
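
To make the "more cohorts than you can reasonably fill" point concrete, here is a small back-of-the-envelope sketch. The variable names and level counts are hypothetical, chosen only for illustration; the only number taken from the dataset above is the roughly 27,000 customers.

```python
import math

# Hypothetical level counts per variable (illustrative, not from the real dataset).
# The three "*_high" variables are continuous metrics split down the middle.
levels = {
    "store": 40,
    "channel": 3,
    "loyalty_tier": 5,
    "revenue_high": 2,
    "orders_high": 2,
    "returns_high": 2,
}


def cohort_count(levels_per_variable, missing_adds_level=False):
    """Number of cohorts if every combination of levels is its own cohort."""
    extra = 1 if missing_adds_level else 0  # "missing" becomes one more level
    return math.prod(n + extra for n in levels_per_variable.values())


n_customers = 27_000
without_missing = cohort_count(levels)                        # 4800
with_missing = cohort_count(levels, missing_adds_level=True)  # 26568

print(n_customers / without_missing)  # average customers per cohort
print(n_customers / with_missing)     # barely one customer per cohort
```

Even with only six variables, letting "missing" count as a level pushes the number of possible cohorts to nearly the number of customers, and real customers are never spread evenly across combinations.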

A latent-space transformation simplifies some things

The major challenge of a raw dataset is that it packs all of the information about customers in a way that makes sense to the business but doesn’t necessarily make sense for analysis. So we re-pack. We want to get rid of all of those complicated categorical variables and missing values, and create continuous measures that aren’t highly correlated with one another. This is what is known as a dimensionality-reduction problem. Remember, even though the original dataset had only a couple dozen variables, all of the missing values and category levels resulted in a couple thousand different dimensions. We can do better than that.
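
As a rough illustration of how a handful of variables turns into many dimensions, here is a sketch that gives every category level, and every "missing", its own indicator column. The records and field names are made up, and this is just one common way to count dimensions, not a description of our internal encoding.

```python
def indicator_columns(records):
    """One column per (variable, level) pair seen in the data; None counts as a level."""
    columns = set()
    for record in records:
        for name, value in record.items():
            level = "missing" if value is None else str(value)
            columns.add(f"{name}={level}")
    return sorted(columns)


def encode(record, columns):
    """Turn one record into a 0/1 vector over the indicator columns."""
    active = {
        f"{name}={'missing' if value is None else value}"
        for name, value in record.items()
    }
    return [1 if column in active else 0 for column in columns]


# Hypothetical records with three categorical variables.
records = [
    {"store": "A", "channel": "web", "tier": None},
    {"store": "B", "channel": "web", "tier": "gold"},
    {"store": "C", "channel": None, "tier": "gold"},
]

columns = indicator_columns(records)
print(len(records[0]), "variables ->", len(columns), "dimensions")
print(encode(records[0], columns))
```

Three variables already become seven dimensions in this toy example; a variable with hundreds of levels contributes hundreds of columns on its own.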

I won’t go into the details of how we do the dimensionality reduction here. Suffice it to say the procedure transforms the original, complicated, inter-correlated dataset into a number of “components”, all of which are continuous variables with no missing values. When we do that with the current dataset, here’s how the component correlations break down:

I’ve kept all of the squares the same size because, if I shrank them for lower correlations, you wouldn’t be able to see any squares other than the diagonal. The highest correlation any component has with any other component is around 0.15. A full 60% of the variable correlations in the original dataset are higher than that. So right off the bat, a latent-space transformation converts all of our information to largely uncorrelated continuous variables with no missing values.

Incidentally, we can check to see how well the latent components represent the original data:

Each column of the heatmap is a component of the latent space. Each row is a cluster. There are 1756 clusters in all - that’s a 95% reduction in complexity, which far surpasses any of the approaches we went through previously. The blue squares indicate how many of the members of a cluster scored high on a component, and the red squares indicate how many scored low - the deeper the color, the more members scored that way. We create enough clusters that each cluster tends to separate where members land on each component (more on that in a second).

The branches on the left side of the plot show the relationships between clusters - the lower any two clusters connect, the more similar they are. This allows us to deal with the problem of clusters that have only one member. That’s a relatively rare problem when we can use a clustering algorithm rather than combinations of columns to create cohorts: only around 2% of all the clusters are made of a single member. Even in those cases, we can attach those members to the members of the next most similar cluster - all of the logic about which participant pairs best with which other participant is built into the system.
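
Here is a minimal sketch of the idea of folding single-member clusters into their most similar neighbor. It is a simplification under stated assumptions: it measures similarity as Euclidean distance to each multi-member cluster’s centroid rather than reading it off the tree shown in the plot, and the cluster ids and scores are invented.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length score vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def merge_singletons(clusters):
    """Fold each single-member cluster into the nearest multi-member cluster.

    clusters: dict of cluster id -> list of component-score vectors.
    Returns a new dict containing only the multi-member clusters.
    """
    merged = {cid: list(pts) for cid, pts in clusters.items() if len(pts) > 1}
    if not merged:
        return {cid: list(pts) for cid, pts in clusters.items()}
    # Centroids are fixed before merging so the result does not depend on order.
    centroids = {cid: centroid(pts) for cid, pts in merged.items()}
    for pts in clusters.values():
        if len(pts) == 1:
            point = pts[0]
            nearest = min(centroids, key=lambda cid: math.dist(point, centroids[cid]))
            merged[nearest].append(point)
    return merged


# Hypothetical clusters in a two-component latent space.
clusters = {
    "a": [(0.0, 0.0), (0.2, 0.0)],
    "b": [(5.0, 5.0), (5.2, 5.0)],
    "c": [(4.5, 4.8)],  # a singleton that sits close to cluster "b"
}
merged = merge_singletons(clusters)
print({cid: len(points) for cid, points in merged.items()})
```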

I said that each cluster tends to separate where members land on each component. Here’s an example of that:

Each dot represents a single record’s score on each latent-space component. Each row is a component, and the dot’s horizontal position indicates its score. The blue dots represent members of one cluster, and the red dots indicate members of another cluster. Notice that, unless their scores are very close to zero, the blue dots and the red dots rarely overlap. For example, on the first component (the bottom row), most of the blue dots are near 0.5 and most of the red dots are near -0.5. On the next component, most of the blue dots are near 0.2 and most of the red dots are near -1.0. Our clustering algorithm has automatically identified cohorts whose members are quite different from one another. That means that assigning all possible treatments within each cluster will ensure that we assign treatments across all the major sources of variation in the original dataset. We’ll have a well-conditioned experiment.
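
To show what "assigning all possible treatments within each cluster" can look like in practice, here is a minimal sketch. It assumes you already have a cluster label for every customer; the function name, the seed parameter, and the treatment labels are placeholders of mine, and the real system has more logic than this.

```python
import random
from collections import defaultdict


def assign_within_clusters(cluster_of, treatments, seed=0):
    """Randomly assign treatments so each cluster gets as even a mix as possible.

    cluster_of: dict of customer id -> cluster id.
    Returns a dict of customer id -> treatment.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    members = defaultdict(list)
    for customer, cluster in cluster_of.items():
        members[cluster].append(customer)

    assignment = {}
    for customers in members.values():
        rng.shuffle(customers)  # random order of customers within the cluster
        order = list(treatments)
        rng.shuffle(order)      # so no treatment always absorbs the remainder
        for i, customer in enumerate(customers):
            assignment[customer] = order[i % len(order)]
    return assignment


# Hypothetical example: eight customers in two clusters, two treatments.
cluster_of = {f"customer_{i}": ("x" if i < 4 else "y") for i in range(8)}
assignment = assign_within_clusters(cluster_of, ["A", "B"], seed=1)
```

Cycling through a shuffled treatment order guarantees that any cluster with at least as many members as treatments sees every treatment. Clusters smaller than that cannot, which is why very small clusters need to be attached to their nearest neighbor first.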