Schaun Wheeler

We explained our approach to designing “conditioned” experiments in a previous post. Briefly: humans are complicated, so if you want experiment results that don’t lie, you have to assign your treatments in a way that takes that complexity into account. The traditional approach of just randomly assigning treatments only works if you have a huge sample size - and even then, it is fragile. The traditional workaround of assigning treatments within a few segments doesn’t work here either: there are too many possible segments and too many ways those segments overlap.

In this post, I’m going to illustrate that complexity - and how we deal with it - using a real-but-anonymized dataset. If you want to see it at work on your own dataset, we have a tool that lets you do that.

A typical dataset is really messy

So let’s start by looking at the dataset (scroll vertically and horizontally to take a look). This is just the first 50 rows. The full dataset contains around 27,000 records, each one representing a different customer.

As you can see, this looks like some pretty standard metrics for a company with multiple stores and a robust e-commerce presence. For each customer, they have data points like number and type of transactions, revenue, and similar information. Aampe doesn’t need any of this information to run an experiment, but the more information you can provide, the better we can set up your experiments for success. The point is to bring whatever data you happen to have - we’ll take care of the rest.

So now let’s take a look at the distributions of all of the data (click on tabs to see additional columns):

The date columns show a general increase in volume over time. You can also see the continuous columns alternate between log-normal distributions (somewhat “bell-curved” after you log the values) for things like monetary values, and exponential distributions (a large number of very small values and a small number of very large values) for things like counts. Categorical variables tend to be dominated by a relatively small number of unique values (where you see a spike at the end of a categorical distribution, that’s from us rolling up a bunch of the really small categories into a generic “other” category). You can hover over any of the plots to see actual values.

But not all of the information about each column is contained in the distribution. Many of the columns have missing values:

A missing value can be just as important as a non-missing value - we don’t want to ignore those.

So, now that we’ve seen all the ways these customers can be similar or different, let’s remind ourselves of why those similarities and differences matter. To set up an experiment in a way that ensures trustworthy results, we need to assign different treatments to similar participants, which lets us compare apples to apples and oranges to oranges, and assign similar treatments to different participants, so that we cover both apples and oranges. We don’t want to assign one treatment to everyone in the same city, or assign another treatment only to people with high transaction counts. By spreading out treatments across all the ways participants can vary, we set ourselves up to be able to measure - and therefore discount - the impact of all that background information upon the experiment outcomes.

And that is where we see just how difficult it is to run an experiment in this kind of situation using traditional tools. Let’s assume we can just split every continuous variable, including dates, into two categories - high and low - based on whether each value is above or below the median of the distribution. So now there are 11 continuous-turned-binary variables, with two levels each; 4 categorical variables, with anywhere from two to several hundred levels each; and 12 more binary variables representing data missingness. All told, that makes for 2,701 dimensions for this particular dataset. If we look at every possible combination of those 2,701 dimensions within the dataset - each combination representing one way that a single participant could differ from other participants - we find 13,208 combinations, each representing a cohort of similar participants.

That’s a lot of cohorts - only a 50% reduction in complexity from just looking at every participant in the dataset individually. Theoretically, you’d want to assign each treatment in your experiment to at least one person in each of those cohorts. That’s not possible, however, because 10,150 of those 13,208 cohorts - around 77% - contain only one person. This means that we wouldn’t even be able to assign two different treatments to two similar users, losing our ability to compare apples to apples.
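To make the cohort-counting exercise concrete, here is a minimal sketch of the median-split approach described above. It is an illustration on made-up records, not the code we ran on the real dataset; the column names (`revenue`, `orders`, `city`) and the helper names are hypothetical.

```python
from collections import Counter
from statistics import median

def cohort_key(record, medians, categorical_cols):
    """Turn one customer record into a tuple that identifies its cohort."""
    key = []
    for col, med in medians.items():
        value = record.get(col)
        key.append(value is None)                            # missingness indicator
        key.append(None if value is None else value > med)   # high/low median split
    for col in categorical_cols:
        key.append(record.get(col))                          # categorical level as-is
    return tuple(key)

def count_cohorts(records, continuous_cols, categorical_cols):
    """Return (number of distinct cohorts, number of one-person cohorts)."""
    medians = {
        col: median(r[col] for r in records if r.get(col) is not None)
        for col in continuous_cols
    }
    sizes = Counter(cohort_key(r, medians, categorical_cols) for r in records)
    singletons = sum(1 for n in sizes.values() if n == 1)
    return len(sizes), singletons

# Toy, made-up records (hypothetical columns).
records = [
    {"revenue": 120.0, "orders": 3, "city": "A"},
    {"revenue": 80.0, "orders": 1, "city": "A"},
    {"revenue": None, "orders": 7, "city": "B"},
    {"revenue": 300.0, "orders": 9, "city": "B"},
    {"revenue": 310.0, "orders": 8, "city": "B"},
]
n_cohorts, n_singletons = count_cohorts(records, ["revenue", "orders"], ["city"])
print(n_cohorts, n_singletons)
```

Even in this tiny example, the one record with a missing value ends up alone in its cohort, which is the same pattern that shows up at scale in the real data.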

But wait: it gets even better. Those columns aren’t all independent. Here’s a view of the degree to which each column’s variation overlaps with the others:

The bigger and darker the square, the higher the correlation between variables. That line of big, dark squares along the diagonal is each column’s correlation with itself. You can see - hover over a square to see the specific value - that quite a few variables co-vary with other variables. (Incidentally, we used the absolute value of Pearson’s coefficient to measure correlation between two continuous variables, Theil’s U - also called the Uncertainty Coefficient, based on conditional entropy - for correlations between two categorical variables, and the Correlation Ratio for correlations between continuous and categorical variables. All of the correlations run from 0.0 to 1.0.)
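For readers who haven’t met the two less common measures, here is a small sketch of Theil’s U and the Correlation Ratio using their textbook definitions. The function names are my own, the inputs are toy lists, and I’m assuming the Correlation Ratio is reported as eta rather than eta squared; Pearson’s coefficient is left out because it is widely available.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (natural log) of a list of category labels."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(x | y): the uncertainty left in x once y is known."""
    n = len(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def theils_u(x, y):
    """Fraction of the uncertainty in x that knowing y removes (0.0 to 1.0)."""
    hx = entropy(x)
    if hx == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (hx - conditional_entropy(x, y)) / hx

def correlation_ratio(categories, values):
    """Eta: square root of between-category variance over total variance."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0

# Toy inputs: store fully determines channel, and fully separates spend.
store = ["p", "p", "q", "q"]
channel = ["a", "a", "b", "b"]
spend = [1.0, 1.0, 3.0, 3.0]
u_value = theils_u(channel, store)
eta_value = correlation_ratio(store, spend)
```

Note that Theil’s U is asymmetric: `theils_u(x, y)` and `theils_u(y, x)` can differ, so a full correlation grid built this way is not necessarily symmetric around the diagonal.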

So let’s summarize where we’re at:

  • A typical customer dataset has a lot of variables.
  • Non-categorical variables need to be turned into categoricals to be able to create cohorts. The simplest way is to split them down the middle to turn them into two-category binaries, but that’s not necessarily a good way to do things.
  • Categorical variables often have a lot of levels - sometimes hundreds.
  • Missing values add even more levels that need to be taken into consideration.
  • For an average dataset, if you create treatment cohorts out of all combinations of all variables, you’ll end up with more cohorts than you can reasonably fill, and most of those will have too few people to take at least one of each treatment.
  • Even if none of the above problems existed, most variables correlate with others - often substantially - which means that any treatment cohorts you create will over-represent certain parts of the customer variation more than others without you knowing it.

A latent-space transformation simplifies some things

The major challenge of a raw dataset is that it packs all of the information about customers in a way that makes sense to the business but doesn’t necessarily make sense for analysis. So we re-pack. We want to get rid of all of those complicated categorical variables and missing values, and create continuous measures that aren’t highly correlated with one another. This is what is known as a dimensionality-reduction problem. Remember, even though the original dataset had only a couple dozen variables, all of the missing values and category levels resulted in a couple thousand different dimensions. We can do better than that.

I won’t go into the details of how we do the dimensionality reduction here. Suffice it to say the procedure transforms the original, complicated, inter-correlated dataset into a number of “components”, all of which are continuous variables with no missing values. When we do that with the current dataset, here’s how the component correlations break down:

I’ve kept all of the squares the same size because, if I shrank them for lower correlations, you wouldn’t be able to see any squares other than the diagonal. The highest correlation any component has with any other component is around 0.15. A full 60% of the variable correlations in the original dataset are higher than that. So right off the bat, a latent-space transformation converts all of our information to largely uncorrelated continuous variables with no missing values.

Incidentally, we can check to see how well the latent components represent the original data:

I "inverse-transformed" all of the components, taking them back to the original data space. If we had kept 2,701 components - the same number of dimensions as we had in the original data - we could have reproduced that original data exactly. Because we only kept around 60 components, the difference between the original data and the inverse-transformed data can tell us how well our latent space represents our original information landscape. For each continuous variable, the bar in the plot above represents the percentage of records whose inverse-transformed scores were less than 5 percentage points off the original scores. For the categorical variables and missingness indicators, the bars represent the percentage of records whose inverse-transformed scores matched the original scores.
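The two accuracy measures described above can be sketched as follows. This covers only the scoring step, not the dimensionality reduction itself, and it assumes that “5 percentage points” means 5% of each column’s observed range; that reading, the function names, and the toy values are all my own assumptions.

```python
def within_tolerance_share(original, reconstructed, tolerance=0.05):
    """Share of records whose reconstructed value falls within `tolerance`
    of the original, measured as a fraction of the original column's range."""
    lo, hi = min(original), max(original)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant column
    hits = sum(
        1 for o, r in zip(original, reconstructed)
        if abs(o - r) / span < tolerance
    )
    return hits / len(original)

def exact_match_share(original, reconstructed):
    """Share of records whose reconstructed category equals the original."""
    hits = sum(1 for o, r in zip(original, reconstructed) if o == r)
    return hits / len(original)

# Toy, made-up columns and their hypothetical inverse-transformed versions.
revenue = [10.0, 20.0, 50.0, 110.0]
revenue_back = [12.0, 19.0, 58.0, 109.0]
city = ["A", "B", "B", "C"]
city_back = ["A", "B", "C", "C"]

continuous_score = within_tolerance_share(revenue, revenue_back)
categorical_score = exact_match_share(city, city_back)
print(continuous_score, categorical_score)
```

Each bar in a plot like the one above would then correspond to one of these scores for a single column.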

Clustering based on the latent space solves the rest of our problems

However, we’re not yet where we want to be. The latent space has around 60 components, but if we split all of these components into “high” and “low”, just like we did with the continuous variables in the original data (and, as with the original data, there’s no guarantee that “high” vs. “low” is an appropriate way to transform these components into categories), we end up with 19,398 unique combinations for treatment cohorts, which is only about a 25% reduction in complexity - actually worse than creating treatment cohorts from the original data. And 83% of those cohorts have only one member - again, worse than what we had using the original dataset. So why did we transform everything into a latent representation?

The reason is that the latent-space transformation allows us to do things we couldn’t do with the original data. By reducing the complexity to around 60 largely uncorrelated dimensions, we can efficiently cluster all of our participants based on how similar they are in that latent space.

Take a look at the following visualization of the clustering:

Each column of the heatmap is a component of the latent space. Each row is a cluster. There are 1,756 clusters in all - that’s a 95% reduction in complexity, which far surpasses any of the approaches we went through previously. The blue squares indicate how many of the members of a cluster scored high on a component, and the red squares indicate how many scored low - the deeper the color, the more members scored that way. We create enough clusters that each cluster tends to separate where members land on each component (more on that in a second).

The branches on the left side of the plot show the relationships between clusters - the lower any two clusters connect, the more similar they are. This allows us to deal with the problem of clusters that have only one member. That’s a relatively rare problem when we can use a clustering algorithm rather than combinations of columns to create cohorts: only around 2% of all the clusters are made of a single member. Even in those cases, we can attach those members to the members of the next most similar cluster - all of the logic about which participant pairs best with which other participant is built into the system.
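As a rough illustration of attaching one-member clusters to their nearest neighbor, here is a sketch that uses Euclidean distance to cluster centroids as a stand-in for the tree shown in the plot. That substitution is an assumption made to keep the example short; it is not the pairing logic built into our system, and the points and labels are made up.

```python
import math
from collections import defaultdict

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(dim) / len(points) for dim in zip(*points)]

def absorb_singletons(points, labels):
    """Reassign each one-member cluster to the nearest multi-member cluster."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)
    # Only clusters with more than one member can absorb a singleton.
    centroids = {
        label: centroid(pts) for label, pts in members.items() if len(pts) > 1
    }
    if not centroids:
        return list(labels)  # nothing to attach to
    new_labels = []
    for point, label in zip(points, labels):
        if len(members[label]) == 1:
            label = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(label)
    return new_labels

# Toy latent-space scores (two components) and made-up cluster labels.
points = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (4.0, 4.2)]
labels = [0, 0, 1, 1, 2]
merged = absorb_singletons(points, labels)
print(merged)
```

Here the lone member of cluster 2 sits much closer to cluster 1 than to cluster 0, so it joins cluster 1.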

I said that each cluster tends to separate where members land on each component. Here’s an example of that:

Each dot represents a single record’s score on each latent-space component. Each row is a component, and the dot’s horizontal position indicates its score. The blue dots represent members of one cluster, and the red dots indicate members of another cluster. Notice that, unless their scores are very close to zero, the blue dots and the red dots rarely overlap. For example, on the first component (the bottom row), most of the blue dots are near 0.5 and most of the red dots are near -0.5. On the next component, most of the blue dots are near 0.2 and most of the red dots are near -1.0. Our clustering algorithm has automatically identified cohorts whose members are quite different from one another. That means that assigning all possible treatments within each cluster will ensure that we assign treatments across all the major sources of variation in the original dataset. We’ll have a well-conditioned experiment.
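Once every participant has a cluster label, spreading treatments within each cluster can be sketched like this. It is a simplified illustration of the idea, assuming a plain shuffle-and-rotate scheme; the function name, the seed, and the treatment labels are hypothetical, and the actual assignment logic may differ.

```python
import random
from collections import defaultdict

def assign_treatments(cluster_labels, treatments, seed=0):
    """Give each participant a treatment so that every cluster with at least
    as many members as there are treatments sees all of them."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)
    assignment = [None] * len(cluster_labels)
    for members in by_cluster.values():
        rng.shuffle(members)                     # random order within the cluster
        offset = rng.randrange(len(treatments))  # random starting treatment
        for position, idx in enumerate(members):
            assignment[idx] = treatments[(position + offset) % len(treatments)]
    return assignment

# Toy cluster labels for seven participants and three made-up treatments.
labels = [0, 0, 0, 1, 1, 1, 1]
assignment = assign_treatments(labels, ["A", "B", "C"])
print(assignment)
```

Rotating through the treatments inside each cluster is what gives similar participants different treatments, while the same treatments still reach every cluster.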


How we design our experiments, step by step, with illustrations.

A walk through our conditioned experimentation process
