Attribution is the practice of assigning credit to different marketing channels or touchpoints that contribute to a customer's conversion. The goal of attribution is to understand the effectiveness of various strategies in influencing consumer behavior. Last-touch attribution is the default attribution offered in Google Analytics, though GA also offers other formats such as first-touch, “position-based” (u-shaped), linear, and time-decay methods, and there are other variations such as w-shaped. 

This post is about why it’s a mistake to use any of those methods.

Last-touch attribution and other bad ideas

Aampe deals with messaging, so we’ll look at attribution in those terms, though the principles here apply to advertising and pretty much any touchpoint you could have with a customer. 

Say we’re an e-commerce app, and a user stops coming to the app for some extended period of time. After a little while, we start to send them messages. Let’s say that over the course of 16 days, we send messages in the following order:

  • Day 1:  Message about shirts
  • Day 5: Message about shoes
  • Day 9: Message about pants
  • Day 12 (morning): Message about shoes
  • Day 12 (late afternoon): Message about activewear
  • Day 15: Message about pants
  • Day 16: Message about handbags

Within 4 hours of the handbag message, they go on to the app and buy shoes. 

Let’s say, for the sake of the illustration, that they clicked on all of those messages. Here’s how the different standard attribution models interpret those events. 

  • Last-click: The user bought shoes because they got a message about handbags.
  • First-click: The user bought shoes because they got a message about shirts.
  • Positional (u-shaped): The user bought shoes mostly because they got messages about handbags and shirts, but a little credit goes to pants, activewear, and shoes.
  • Positional (w-shaped): The user bought shoes mostly because they got messages about handbags, shirts, and shoes, but a little credit goes to pants and activewear.
  • Linear: The user bought shoes because they got lots of messages: 2/7 credit goes to shoes, 2/7 to pants, and shirts, activewear, and handbags get 1/7 credit each.
  • Time-decay: The user bought shoes mostly because of the handbag message, with the pants, activewear, and shoes messages each getting less credit the further back they were sent, though pants and shoes pick up a little extra credit from their earlier touchpoints. Shirts get hardly any credit at all.
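To make those rules concrete, here's a minimal sketch of how a few of them would split credit across the seven messages above. The labels and the time-decay half-life are illustrative assumptions, not anyone's production settings:

```python
from collections import defaultdict

# The seven messages, as (day, topic), in the order they were sent.
touchpoints = [
    (1, "shirts"), (5, "shoes"), (9, "pants"),
    (12.0, "shoes"), (12.5, "activewear"),   # the two Day-12 messages
    (15, "pants"), (16, "handbags"),
]

def last_click(tps):
    return {tps[-1][1]: 1.0}                 # all credit to the final message

def first_click(tps):
    return {tps[0][1]: 1.0}                  # all credit to the first message

def linear(tps):
    credit = defaultdict(float)
    for _, topic in tps:
        credit[topic] += 1.0 / len(tps)      # equal credit per message
    return dict(credit)

def time_decay(tps, conversion_day=16, half_life=7.0):
    # Weight each message by recency (half the weight every `half_life` days),
    # then normalize so the credit sums to 1.
    weights = [(topic, 0.5 ** ((conversion_day - day) / half_life)) for day, topic in tps]
    total = sum(w for _, w in weights)
    credit = defaultdict(float)
    for topic, w in weights:
        credit[topic] += w / total
    return dict(credit)

for name, rule in [("last-click", last_click), ("first-click", first_click),
                   ("linear", linear), ("time-decay", time_decay)]:
    print(name, rule(touchpoints))
```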

That covers the major frameworks. Last-click attribution arguably gives you a simple action item — send more handbag messages — but how are we supposed to interpret that? Should we devote a lot of time delving into the mysteries of how handbags influence shoe purchases? Of course not.

Last-click attribution has two major problems:

  • Last. There’s no reason the last touchpoint should be the most important. There’s no reason the first touchpoint should be the most important, or that just the endpoints are the most important, or that they’re all equal. Position-based attribution assumes that order really matters a whole lot. There’s no basis for that assumption.
  • Click. The user bought shoes, and we sent two messages about shoes. Say the user didn’t click on either of those messages but clicked on all the others. Are we to conclude that the fact that we sent two messages about the exact product category they bought has no influence at all? That’s silly. Clicks are a weak signal: they have little, if anything, to do with people’s eventual buying choices.

The real attribution story is messy: We primed the pump with several different messages that got the user thinking about their buying choices, and eventually, we filled their bucket of confidence and purchase intent enough that it spilled over into a purchase. And there were certainly other factors at play, such as their activity history on the app as well as just random chance and other things we can’t measure. A good attribution method should make sense of and organize that complexity, not pretend it doesn’t exist.

Alright, time to get a little technical. We’ll keep it high-level.

We need an attribution method that can do two things:

  1. Model the relationships between each touch point and the outcome (the purchase), as well as the relationships between each touch point and other touch points. 
  2. Extract from that complicated model some simplified summary of the value of each individual ingredient. So, every touchpoint works together with all other touchpoints, but some touchpoints are typically more critical. 

For example, maybe messaging about shoes doesn’t cause people to buy shoes, but messaging about shoes in the context of a whole wardrobe does. It’s not shoes. It’s shoes plus the way shoes fit in with all that other stuff. We need a way to represent that.

Gradient boosting is a super-powerful ML approach that pretty much everyone except marketers figured out a long time ago

We can build a more realistic attribution model through an ensemble of decision trees. Let's break that down into pieces:

  • Decision trees. A decision tree is a flowchart-like structure where each "branch" in the chart represents a decision based on a feature. So if messaging about shoes at least once leads to sales more often than not messaging about shoes at all, that's a branch. If messaging about shirts at least two times leads to more sales than messaging fewer than two times, that's a branch. Branches flow into other branches, so messaging at least once about shoes and at least two times about shirts might lead to sales more often than any other branch. A decision tree can handle that kind of logic.
  • Ensemble learning. In machine learning, an ensemble is a situation where you take a bunch of "weak learners" — models that aren't very predictive all by themselves — but by averaging or in some other way combining their predictions, you get a "strong learner" — a model that gives very accurate predictions. Each weak learner looks at the problem from its own limited perspective. Combine all the perspectives, and you get to see the whole picture.

One of the strengths of ensembles of decision trees is that they implicitly model interactions. Each tree is trained on a subset of features, which means each tree represents the decision to combine different touchpoints in different ways. In combining the trees into a collective "forest," the ensemble captures information about which touchpoints show outsized impact when combined with other touchpoints.

When building an ensemble of trees, you can "grow" the trees in parallel or sequentially. When you grow them in parallel, it's called a random forest. When you grow them sequentially and let each previous tree impact the decision about which subsequent trees are grown, that's called gradient boosting. In gradient boosting, each tree is trained to correct the errors of the previous one. So the first tree predicts the outcome, then we measure how far those predictions are off, then the next tree is trained to predict the errors, then we measure how far those new predictions are off from reality, and so on.
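To see the "each tree corrects the previous tree's errors" loop in miniature, here's a toy sketch on made-up data. The hand-rolled loop below is for intuition only; real libraries do this far more carefully:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 5)).astype(float)   # e.g. message counts per topic
y = ((X[:, 0] > 0) & (X[:, 1] > 1)).astype(float)     # converts only when two topics combine

learning_rate = 0.1
pred = np.full(len(y), y.mean())   # step 0: just predict the average
trees = []

for _ in range(50):
    residuals = y - pred                       # how far off are we right now?
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)    # nudge predictions toward the truth
    trees.append(tree)

print(np.mean((pred > 0.5) == y))              # the ensemble recovers the interaction
```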

Gradient boosting gives us a model of all those complex interactions. We don’t need to pick last click or first click. We don’t even need to pick just clicks. We can throw all the touchpoints in there, and the model can make sense of them.

Shapley values quantify feature impact in complicated models

So we have this model that can look at way more information — and look at it in much, much more complex ways — than a human can. This model can “think” about attribution on a level that we just can’t. If all we want is an accurate prediction of which users will purchase and which won’t — and sometimes that is all we want — then the model all by itself is all we need.

Usually, however, we need more than an accurate prediction. We need insight. Understanding. We need to make sense of things. That’s where Shapley values come in.

Shapley values are named for economist Lloyd Shapley, who won the Nobel Memorial Prize in Economic Sciences in 2012. They're rooted in cooperative game theory: Shapley originally developed the values to figure out how to distribute a collective payoff among individuals based on their individual contributions when collaborating with others. So the outcome of the collaboration wasn't just the sum of individual contributions — each person did some stuff by themselves, but they did other stuff in combination with other people, and those interactions contributed to the outcome differently than the individual efforts alone did and...this is sounding a lot like our problem of attribution.

Shapley values are calculated by considering all possible permutations of inputs to a model and averaging the "marginal" contributions of each input across these permutations. The fundamental idea is to assess an input's contribution not just in isolation but in the context of all possible interactions. What that means in practice:

  1. Start with a baseline. This is usually the average prediction of the model across users for which we want predictions.
  2. Generate subset combinations. Take every touchpoint and combine it with every other touchpoint. Then, combine each pair with a third touchpoint. Keep going until you have all possible combinations of all touchpoints.
  3. Calculate predictions for all subsets. The model can make a prediction for each subset of features because that's what models do.
  4. Calculate marginal contributions. For each subset prediction, go in and zero out a particular touchpoint — for example, you can tell the model, "assume no one got any messages about shoes." You can compare these predictions to the original predictions to get an estimate of how much value you had from having shoe messages in the mix. This is a “marginal” estimate, meaning you’ve strained out the influence of all of the stuff except for the thing you’re interested in understanding. Repeat this for all possible touchpoints.
  5. Average marginal contributions across subsets. You have the amount of value the model thinks you get from a particular touchpoint. Average that value across all the subsets. Do this for all features.
  6. Adjust for Expected Value. Now that you have the marginal contributions, you baseline them to your global expected value. So if the baseline is a 50/50 chance of making a purchase, each Shapley value for each touchpoint will be the extent to which including that touchpoint increases or decreases the chance of purchase.
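In practice, you rarely grind through those steps by hand; for tree ensembles, libraries like shap do the heavy lifting. Here's a minimal sketch on placeholder data (the feature names are hypothetical, and this is not the model or data from the example later in this post):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "msg_shoes": rng.integers(0, 3, 1000),      # how many shoe messages each user got
    "msg_shirts": rng.integers(0, 3, 1000),
    "msg_handbags": rng.integers(0, 3, 1000),
})
y = (X["msg_shoes"] + rng.normal(0, 1, 1000) > 1.5).astype(int)   # fake conversions

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)     # efficient Shapley values for tree ensembles
shap_values = explainer.shap_values(X)    # one value per user per touchpoint

print(explainer.expected_value)           # the baseline from step 1
print(shap_values[:3])                    # how each touchpoint pushed each user's
                                          # prediction above or below that baseline
```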

So, Gradient Boosting lets you handle all sorts of complexity in your attribution modeling, and Shapley values digest all of that complexity into something you can easily interpret and act on. 

A real-world e-commerce example

For one of our Aampe customers, we took a month of purchases. For each purchase, we did the following:

  1. Found the last time they were active before starting the session in which they made the purchase.
  2. Counted up all of the different items we messaged them about in that space between the last activity and purchase sessions.
  3. Counted up how many of those messages they clicked on.
  4. Counted up how many funnel activities (viewing a product, adding to cart, adding to wishlist, making a purchase) they did in the time before the period of inactivity during which we messaged them. (Our monitoring window for this before-period was as long as the period between their last activity and their subsequent purchase).

We then collected a bunch of users who hadn’t purchased and matched them to our purchasers based on last-activity week and first-seen-on-app week. So, if someone was last seen on the app on August 28 and downloaded the app on February 6, we matched them randomly with a non-purchaser who had downloaded the app and been active during those same times.

That gave us what data scientists call a “balanced” dataset — that means we had roughly as many “positive classes” (purchases) as “negative classes” (no purchases). That’s useful because it sets our expected purchase rate at 50%, which makes it easier to reason about the model effects we see.
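As a rough sketch (not our actual pipeline, and with hypothetical column names), the matching step might look something like this:

```python
import pandas as pd

def build_balanced(users: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """For each purchaser, draw one non-purchaser with the same last-activity
    week and first-seen week, yielding a roughly 50/50 dataset."""
    purchasers = users[users["purchased"] == 1]
    candidates = users[users["purchased"] == 0]
    matched = []
    for _, buyer in purchasers.iterrows():
        pool = candidates[
            (candidates["last_active_week"] == buyer["last_active_week"])
            & (candidates["first_seen_week"] == buyer["first_seen_week"])
        ]
        if len(pool):
            pick = pool.sample(1, random_state=seed)
            matched.append(pick)
            candidates = candidates.drop(pick.index)   # don't reuse the same match
    return pd.concat([purchasers] + matched, ignore_index=True)
```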

This particular customer had a pretty clear-cut attribution task, in that they’d been using Aampe to message their users about ten specific product categories: Baby, Beauty, Electronics, Home Appliances, Home Kitchen, Men Fashion, Mobiles, Sports, Toys, and Women Fashion. Over any period of time, a user could be assigned messages from any or all of these categories. So, there was a legitimate question about how much each messaging topic contributed to conversions.

Don’t measure model performance in terms of accuracy

So, we trained a Gradient Boosting model on this data. For those wanting the gory details, we used a learning rate of 0.01, a maximum tree depth of 20, 80% record subsampling on the trees, and, though we allowed up to 100 trees, we had early-stopping rules in place that cut off new growth at iteration 36. In human speak, we took precautions to make sure the model was always surprised by something with each new tree it grew. This keeps the model from “overfitting,” which means we keep it from becoming too confident about too many attribution rules too quickly.
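For those who want to see what those settings look like in code, here's a sketch using scikit-learn's GradientBoostingClassifier. The learning rate, depth, subsampling rate, and tree cap match the description above; the library choice, the specific early-stopping arguments, and the placeholder data are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(2000, 24))   # placeholder: per-user message/click/activity counts
y = rng.integers(0, 2, size=2000)         # placeholder: purchaser vs. matched non-purchaser

model = GradientBoostingClassifier(
    learning_rate=0.01,        # small steps: each tree only nudges the prediction
    max_depth=20,              # deep trees, so touchpoint interactions can be captured
    subsample=0.8,             # each tree sees a random 80% of the records
    n_estimators=100,          # allow up to 100 trees...
    validation_fraction=0.1,   # ...but hold data out to check whether new trees still help
    n_iter_no_change=5,        # ...and stop early when they don't (illustrative setting)
    random_state=0,
).fit(X, y)

print(model.n_estimators_)     # with early stopping, this can land well below 100
```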

A good first question after training a model is: “How good a model is it?” You might feel tempted to ask how accurate the model’s predictions are. Don’t. Don’t look at accuracy. Accuracy will lie to you and break your heart. For example, if you’re dealing with a conversion event that only happens, say, 5% of the time, a model can just conclude that no one will ever convert, and it will be 95% accurate, which is stupid. We prefer to measure model performance in two ways:

  • Precision. Of the people the model said would convert, how many actually converted? In the case of this model, our precision was 64%. That means that around 2 out of 3 times the model said someone would convert, they did.
  • Recall. Of all the people who converted, how many were flagged by the model as being expected to convert? Our model’s recall was 80%, which means that 4 out of 5 times, if someone converted, the model had expected that. 

Those are decent numbers for those two metrics. Not mind-blowing. But decent. And that’s the point: if your attribution method is telling you what percentage of your success is attributable to different touchpoints, and all of those percentages add up to 100%, then your attribution method is garbage. There will always be unknowns, un-measurables, and random noise that account for some percentage of the success you see. A good attribution method should recognize that and take it into account.

So, how much of the picture does our model capture? It’s common to combine precision and recall into an F-score, which in the case of this model was 71%. So all of our touchpoints (those within our control, like the messages we send; those outside our control, like previous app history; and those partially in our control, like clicks) tell us roughly 70% of the story. Which is pretty good.
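For reference, all three metrics are one-liners in scikit-learn. The labels and predictions below are tiny placeholders, not the customer's data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # who actually converted
y_pred = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]   # who the model said would convert

print(precision_score(y_true, y_pred))  # of predicted converters, how many converted?
print(recall_score(y_true, y_pred))     # of actual converters, how many did we flag?
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```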

So, let’s figure out actual attribution

First, a note for ML geeks: don’t use “feature importance” measures from the ensemble model. They tell you how much a particular feature influences the model, but they don’t tell you the direction of effect — so if sending messages about Electronics actually makes people LESS likely to convert, the feature importance scores are going to register that as having a very high importance, which is a lousy takeaway for attribution. That’s another reason we use Shapley values. Here’s how those looked for this model:

Blue means a user did the thing. Red means they didn’t do the thing. The more dots are to the right of the vertical line on the plot, the more the thing corresponds to subsequent conversions. Plots like this aren’t precise enough to yield most of the takeaways that we can get from the modeling process, but they are good for building intuition about attribution. For example:

  • App activity: page_detail. This means the user viewed a specific product page. It’s browsing. Window shopping. Notice a bunch of blue dots far to the left of the line. That means a lot of users who did this were much, much less likely to purchase. On the other hand, there are lots of blue dots to the right of the line. So we’ll need to do more than eyeball these results to figure out exactly what’s going on.
  • App activity: conversion. We can eyeball this one. Almost all of the blue dots are to the right of the line, and some are very far to the right. Most of the red dots are to the left of the line. Buying something makes someone much more likely to buy again. However, notice how the red dots are all clustered just to the left of the line. Not buying stuff makes you only just a little bit less likely to buy later. Previous conversion has a really big impact. Lack of previous conversion, not so much.

We’re going to skip over all of the lines about specific product messages because those views are too messy to eyeball. However, notice two other insights:

  • App activity: move_to_wishlist. This doesn’t have a huge impact, but notice that when someone does move something to their wishlist, they are less likely to subsequently convert. Now, that could mean users on this app tend to use the wishlist for long-term planning - maybe they’re less likely to convert in the days and weeks after adding to their wishlist, but are more likely to convert several months later. That raises an interesting point about attribution in general — it’s well-suited for understanding immediate impact, not long-term impact.
  • Clicks. The click measures for every category of product message show pretty much no impact — maybe just a teensy bit positive — from not clicking and uniformly negative impact from clicking. Sometimes it’s extremely negative, too: clicking on a mobile device message or a women’s fashion message can have a huge suppressive effect on purchasing. It’s not just that clicks aren’t conversions. Clicks are anti-conversions. When users clicked, they were less likely to convert.

That’s about as far as we can get from eyeballing a graph. For everything else, we need to get some summary measures.

Use measures that summarize the cost and benefits of taking action

So we need some way of characterizing how much importance or credit to assign to each touchpoint, including the touchpoints that we have no control over — the app activity a user exhibited before they went into the inactive period that preceded their eventual purchase.

Let’s give ourselves two reality checks before we get started:

  • Something can be better without being good. Giving a user a particular touchpoint may reduce their chance of conversion by 10%, but not giving them that touchpoint may reduce their conversion probability by 20%. The touchpoint can be valuable, even if the raw impact is negative.
  • Practically everything is good for some users and bad for others. Attribution isn’t a matter of finding which touchpoints positively or negatively impact your users. Everything has a positive impact on some users, and those same things have negative impacts on other users. This isn’t a matter of user preferences. Messaging about electronics by itself might be great, but messaging about electronics in the same week that you message about toys could be terrible. That will apply to some users and not to others, and the dynamic will change next week anyway.

So quit looking for nice, tidy packages of insight. Attribution is hard work. Roll your sleeves up.

We find value in three different views of the attribution problem:

  • Confidence index. This is the probability that doing something is better than doing nothing. We randomly sampled the Shapley values associated with users who didn’t have a touchpoint and users who did, then calculated the percentage of those draws where having the touchpoint resulted in a higher probability of conversion (there’s a sketch of this calculation after this list).
  • Magnitude index. This is the probability that doing something is better than doing nothing, where “better” is defined relative to the baseline of a 50/50 chance of conversion. So, if the touchpoint doesn’t move a user above that 50% mark, it doesn’t score as high on this index. We split touchpoints by presence (“do something”) vs. absence (“do nothing”), calculate the probability of an above-baseline outcome for each, and the index incorporates both of those probabilities. So, if the Confidence Index tells you whether a touchpoint is good, the Magnitude Index tells you how good.
  • Safety index. This is the probability that doing something does more good than harm. Subsetting our Shapley values to only those associated with the presence of a touchpoint, we sum all positive and negative values separately and then calculate how much of that total positive and negative movement is associated with each touchpoint. The index incorporates both of those metrics, measuring the extent to which the benefit of enabling a touchpoint outweighs the risk.
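Here's a minimal sketch of the Confidence Index as described above. The function name and sampling details are illustrative, and the Magnitude and Safety indices follow the same pattern with different comparisons:

```python
import numpy as np

def confidence_index(shap_with, shap_without, n_draws=10_000, seed=0):
    """Probability that a random Shapley value from users WITH the touchpoint
    beats one from users WITHOUT it."""
    rng = np.random.default_rng(seed)
    a = rng.choice(np.asarray(shap_with), size=n_draws)      # users who got the touchpoint
    b = rng.choice(np.asarray(shap_without), size=n_draws)   # users who didn't
    return float(np.mean(a > b))                             # share of draws where "doing" won
```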

You can see how this played out for the touchpoints in our model below:

The rows of the table above are divided so all app activity is together, followed by all messages. Each section is ordered by the Confidence Index, so you can see that “electronics” has the largest number and the darkest blue of the messages, and the numbers get smaller and the blue lighter as you move down the list to “sports.” So electronics is the most valuable message touchpoint — not as influential as most app activity touchpoints, but still giving you a 30% chance of influence. A 30% solution isn’t something to ignore.

Then, look at the Magnitude index. Notice the places where the ordering contradicts the Confidence Index. “Electronics” is still the highest, but both “baby” and “beauty” are higher on the Magnitude index than they are on the Confidence index. That suggests that the “baby” and “beauty” categories may have an outsized impact, even though the chance of their having an impact may be smaller.

Now, look at the Safety Index. There are lots of contradictions here. “Baby” and “beauty” once again outperform relative to the Confidence Index, but so does “toys.” This indicates that these touchpoints carry less risk: even if they have a smaller probability and size of benefit, they don’t carry as many negative side effects as the other categories.

There are other insights to be gained from the probabilities that are used to build the indices. Notice, for example, in the “negative” column of the Safety section, that both “App activity: add_to_cart” and “Message: sports” have really high scores relative to the other touchpoints. This indicates that these two touchpoints are the riskiest: no matter what benefit they carry, they negatively impact a disproportionate number of users. The same, to a lesser extent, appears to be true of the “mobiles” and “home_kitchen” message categories.

How does this compare to other model-based attribution methods?

While we’re on the subject, let’s quickly talk about two other ways people often use models to do marketing attribution:

  • Markov-chain models. (For example, see here.) Markov models organize each touchpoint as a “state” — a sort of conceptual destination a user can visit — and model the probability that a user in one state will move to a different state. That allows you to list conversion as yet another state and then estimate the probability that a user who visits one state will eventually visit the state you really want them to go to (there’s a toy sketch of this after the list). Markov models are smart, but they’ve got no memory — when a user experiences a particular touchpoint, the Markov model knows only about that particular moment in time and what other users who end up at that touchpoint tend to do afterward. As far as the model is concerned, that user has no history, and that user’s next actions are determined solely by what actions happen, on average, among the entire user population. Those are very limiting assumptions.
  • Marketing mix models. (For example, see here.) MMM doesn’t refer to a single method. It’s the practice of tracking an outcome (in the case of our current discussion: conversion) over time, and then tracking various marketing inputs over the same time period, and using a timeseries model to pull out lagged relationships. So if, over time, we tended to see conversions go up around a day after we saw electronics messages go up, we could conclude the electronics messages were having an impact. MMM is a great method - and it’s particularly useful if you don’t have information on individual users and their touchpoints. However, if you do have that kind of individualized data, MMM is like choosing to use a butcher knife when you could use a scalpel instead.
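For the curious, here's a toy sketch of the Markov-chain idea with an invented transition matrix; in practice you would estimate the transition probabilities by counting observed state-to-state moves:

```python
import numpy as np

# States a user can "be" in. In practice there would be one state per
# touchpoint, plus absorbing states for conversion and drop-off.
states = ["start", "shoes_msg", "handbag_msg", "conversion", "drop"]

# Hypothetical transition probabilities (each row sums to 1).
P = np.array([
    [0.0, 0.6, 0.3, 0.0, 0.1],   # start
    [0.0, 0.1, 0.4, 0.2, 0.3],   # shoes_msg
    [0.0, 0.3, 0.1, 0.3, 0.3],   # handbag_msg
    [0.0, 0.0, 0.0, 1.0, 0.0],   # conversion (absorbing)
    [0.0, 0.0, 0.0, 0.0, 1.0],   # drop (absorbing)
])

# Probability of eventually converting from each transient state:
# solve (I - Q) b = r, where Q is the transient-to-transient block and
# r is the transient-to-conversion column.
transient = [0, 1, 2]
Q = P[np.ix_(transient, transient)]
r = P[np.ix_(transient, [3])]
b = np.linalg.solve(np.eye(len(transient)) - Q, r)

for state, prob in zip([states[i] for i in transient], b.ravel()):
    print(f"P(eventually convert | {state}) = {prob:.2f}")
```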

If you're going to use ML for attribution, you probably need an AI to act on it

Marketing attribution is hard — a harder problem than can be solved through simplistic rules-based attribution methods like last-touch. The problem is complicated enough that it requires a machine-learning solution. There are a variety of ML methods that we could bring to bear on this problem, but gradient boosting is particularly well-suited because of its ability to implicitly handle complex interactions among touchpoints, and Shapley values are a well-established way to turn that complexity into something humans can wrap their minds around.

All that being said, no ML solution will yield a picture as simple as a rule-based attribution method. That's the tradeoff in attribution: do you want something simple or something that's right? The three indices we use to summarize attribution information yield valuable and actionable insights about where to devote attention and effort, but to truly leverage all of the insights machine-learning attribution has to offer, you need a machine to ingest and act on those insights. The real value of attribution is to give you high-level guidance about what you should pay attention to, what kind of new content you might need to create, and which approaches it may be time to re-evaluate. Beyond that, though, the real way to act on attribution is to use an AI as your agent.