Every day, Aampe sends out thousands of different messages across your user base, learns each individual user’s preferences based on who responds to what, and then automatically adjusts your messaging timing, content, and copy to fit those learned preferences.

[Note: We also monitor your users’ activity after we send them a message, which allows us to calculate success rates that go far beyond click-through. We can provide app-visitation rates, add-to-cart rates, purchase rates, etc. — if a thing matters to your business, we can show the impact Aampe had on that thing.]

It’s a complex system (as it needs to be), but that complexity can also make attribution difficult. It’s not like a simple A/B test where you’re testing one version against another. We’re literally testing hundreds or even thousands of different things (between copy, content, tone, and timing) at the same time.
With so many moving pieces, how are you supposed to know if Aampe is showing an improvement over your old messaging strategy?

That’s where our control groups come in. 

How to build an accurate control group

Control groups are a common way to estimate attributable returns. You hold out a sample of users from whatever messaging approach you take, and then compare the subsequent behavior of those held-out users to that of the users who still receive messages.

There are several ways to implement a control:

  • A global holdout. This means you permanently keep some users out of messaging. This is challenging to do well: it’s hard to maintain a global holdout that stays comparable to your ever-evolving base of messaged users, because those messaged users are constantly changing in response to the things you send them.
  • Switchback holdouts. This means you define two back-to-back windows (say, week A and week B). You message some of your users in week A and send the other users the same message in week B; in their off week, users get nothing. The challenge here is that there may be fundamental differences between the time periods that get mistakenly attributed as messaging effects.
  • Synthetic controls. This means you find user attributes that (1) correlate with performance and (2) won’t themselves be impacted by messages. You train a model to predict performance based on those attributes, then send the messages and see how much reality differs from the predictions. This method can have a hard time handling constantly adapted messaging; it does best when you can point to a one-time intervention (a minimal sketch of the idea follows this list).
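
To make the synthetic-control idea concrete, here’s a minimal sketch (using scikit-learn, with made-up attributes and numbers purely for illustration, not anything run in production):

```python
# Minimal synthetic-control sketch. All attributes and coefficients here are
# invented for illustration. Train a model on pre-message user attributes that
# shouldn't themselves be moved by messaging, then compare actual post-message
# performance against the model's counterfactual prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Pre-intervention attributes: account age (days), sessions in prior month,
# lifetime purchases. None of these can be changed by a message sent today.
X_train = rng.uniform(size=(5000, 3)) * [365, 30, 20]
y_train = X_train @ [0.01, 0.2, 0.5] + rng.normal(0, 1, 5000)  # past performance

model = GradientBoostingRegressor().fit(X_train, y_train)

# After sending messages, compare what actually happened to what the model
# predicted would have happened without any intervention.
X_messaged = rng.uniform(size=(1000, 3)) * [365, 30, 20]
y_actual = X_messaged @ [0.01, 0.2, 0.5] + 0.8 + rng.normal(0, 1, 1000)

lift = y_actual - model.predict(X_messaged)
print(f"Estimated attributable lift per user: {lift.mean():.2f}")
```

As noted above, this works best for a one-time intervention: a model trained on pre-message behavior goes stale quickly once the messaging itself keeps adapting.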

How we built our control group

Aampe’s systems measure attribution by combining aspects of both switchback holdouts and synthetic controls with a statistical method called Coarsened Exact Matching (CEM). We won’t go into the technical details of CEM in this article, but if you want them, you can find them here, and if you like to geek out on the theory behind the method, you can find that here.

CEM works by matching each user who got an “intervention” (in our case, a message) with a user who didn’t get that same intervention but is similar to the target user in other ways.

So where a synthetic control tries to model the entire outcome of a message, our approach is to model each individual user who received the message in order to pick a comparable individual control user. 

It works like this:

We find a user who got a message, then randomly pick a second user who received a message in a slightly earlier time period. We make sure the matched users are similarly active on the app (so we don’t match an active user to one who hasn’t been on the app for a month, for example) and that the messages each user received had similar content. We then measure the behavior of both the original user and the “control” user as if they had both received the original user’s message.
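
Here’s a minimal sketch of that matching step. The bin edges, field names, and one-week lookback below are simplifying assumptions for illustration, not our production logic:

```python
# Sketch of coarsened matching. Bin edges, field names, and the one-week
# lookback are illustrative assumptions, not Aampe's actual implementation.
from dataclasses import dataclass
import random

@dataclass
class SentMessage:
    user_id: str
    week: int              # when the message was sent
    activity_level: int    # sessions in the prior week
    content_tag: str       # coarse content category, e.g. "discount"

def coarsen(msg: SentMessage) -> tuple:
    """Coarsened Exact Matching: bucket continuous attributes into coarse
    bins, then require exact matches on the binned values."""
    if msg.activity_level == 0:
        activity_bin = "inactive"
    elif msg.activity_level <= 3:
        activity_bin = "light"
    else:
        activity_bin = "heavy"
    return (activity_bin, msg.content_tag)

def pick_control(target: SentMessage, pool: list[SentMessage]) -> SentMessage | None:
    """Randomly pick a user messaged one week earlier whose coarsened
    attributes exactly match the target's. That user's later behavior
    serves as the counterfactual for the target's message."""
    candidates = [m for m in pool
                  if m.week == target.week - 1
                  and coarsen(m) == coarsen(target)]
    return random.choice(candidates) if candidates else None
```

Matching on coarse bins rather than exact attribute values keeps the candidate pool large, so nearly every messaged user finds a counterpart.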

This allows us to show not just your performance vs. a non-messaged user, but your actual performance alongside what it would have been if you hadn’t used Aampe:

Notice that we don’t just show the control and test group results; we also include groups labeled “Very High,” “High,” and “Low.” These groups are categorized based on the probability, as estimated by Aampe’s algorithm, that a particular user will respond positively to a particular message.
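
In code, the bucketing amounts to something like this (the cutoff values here are hypothetical, just to show the shape of the logic):

```python
# Hypothetical cutoffs, purely to illustrate the idea; the real model's
# thresholds and probability estimates are more involved.
def response_group(p_respond: float) -> str:
    """Map the algorithm's estimated probability that a user responds
    positively to a message into a dashboard group."""
    if p_respond >= 0.75:
        return "Very High"
    if p_respond >= 0.50:
        return "High"
    return "Low"
```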

(Wait a minute… Why does Aampe send messages to users in the “Low” group, where the algorithm is pretty sure the user won’t respond?

Well, there are a variety of reasons, but it’s mostly because our system continuously tests its assumptions and corrects where necessary. Just because a message or messaging time wasn’t ideal last week doesn’t mean it’s still unappealing this week.

Seasons change, people change, and it’s important that your messaging evolves with them.)

We also tell you which messages were sent while a user was in the “Still exploring” phase, meaning our systems are still trying to discover definite preferences for that user. If we ever ran out of people in this bucket, we’d all be out of a job.

These groups allow you to see not only the value of your messaging in general, but also the value of trusting the algorithm’s judgment in deciding when and what to message. (The “Very High” and “High” buckets should have higher overall CTRs, app activity rates, etc. than the “Control” and “Low” buckets, indicating that the algorithm is making accurate predictions.)

So, how do you know that Aampe is working?

You don’t have to guess. We’re more than happy to show you.

In fact, we even built this handy summary for your dashboard:

Each number on this dashboard is calculated by taking the total metric and subtracting the control to arrive at purely Aampe-attributable results.
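
As a toy example (with invented numbers):

```python
# Invented numbers, purely to show the arithmetic behind each dashboard figure.
total_purchases = 12_000    # purchases made by messaged users
control_baseline = 10_500   # purchases projected from matched control users
aampe_attributable = total_purchases - control_baseline  # 1,500 purchases
```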

So, next time someone asks how you know Aampe is working, you can just show them this dashboard and let the results speak for themselves.