User activity is one of the key ways mobile applications assess their health as a system, a product, and a company. The number of users who actually use the app, and the degree to which they use it, tell an important story about what works, what doesn’t, and what you can do about it.

The problem, however, is that there are so many ways to define and track user activity. Here are just a few of the most common metrics (a rough computation sketch follows the list):

  • Frequency. Calculate the average number of days that pass between instances of app activity for each user.
  • Volume. Calculate the average amount of activity per user, measured in whatever unit fits the app (for an app that serves video or audio content, playtime; for an app that sells things, money spent; and so on).
  • Recency. Calculate the number of days since each user was last seen.
  • Longevity. Calculate the number of days between a user being first seen and last seen.
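
Here is a minimal sketch of how these four metrics might be computed from a raw event log with pandas. The `events` table and its columns (`user_id`, `timestamp`, `playtime_seconds`) are hypothetical stand-ins for illustration, not data from any real app:

```python
import pandas as pd

# Hypothetical event log: one row per user action.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01", "2023-01-04", "2023-01-10",
        "2023-01-02", "2023-01-03",
    ]),
    "playtime_seconds": [120, 300, 60, 45, 90],
})
today = pd.Timestamp("2023-01-15")

per_user = events.groupby("user_id").agg(
    first_seen=("timestamp", "min"),
    last_seen=("timestamp", "max"),
    volume=("playtime_seconds", "sum"),  # Volume: total playtime
)

# Frequency: average number of days between consecutive events per user.
gaps = (
    events.sort_values("timestamp")
    .groupby("user_id")["timestamp"]
    .diff()
    .dt.days
)
per_user["frequency"] = gaps.groupby(events["user_id"]).mean()

# Recency: days since each user was last seen.
per_user["recency"] = (today - per_user["last_seen"]).dt.days

# Longevity: days between first seen and last seen.
per_user["longevity"] = (per_user["last_seen"] - per_user["first_seen"]).dt.days

print(per_user[["frequency", "volume", "recency", "longevity"]])
```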

There are many possible variations on all of the above themes. Volume could be measured in terms of duration or dollars or both or something else entirely. Frequency, recency, and longevity could be measured in minutes, hours, days, weeks, or months, depending on what is normal for a particular app’s users. And any metric could be applied across and on top of various ways to segment a user population: age, day/time or cohort of app install, download source, or something else entirely, like the first content consumed in a first session.

But however these metrics are massaged and tailored to fit the preferences of those managing an app, the real difficulty remains: complexity compounds, and so does the difficulty of managing that complexity. That difficulty becomes apparent when you try to track an app’s health over time. Most apps have enough users that it is impossible to track any of these metrics on an individual basis (and if you have so few users that you can look at each one individually, metrics are not your biggest worry right now). So you have to bin the metrics into groups that can be tracked, like this:

In each of the above graphs, the darker the area, the more active (and desirable) the user is. Any trends in these graphs are overwhelmingly swallowed up in the noise from users who are still in the system but stopped being active long ago. In some cases, the trends are practically nonsensical. The least-desirable segments of each metric can’t help but grow over time: the more users there are in the system, the more will churn, and the more people there will be who were seen for a total of just a few days, or were seen long ago, or have low total playtime.
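
To make the binning concrete, here is a minimal sketch using recency as the example metric; the bucket boundaries and column names are illustrative assumptions, not anything prescribed by the original:

```python
import pandas as pd

# Hypothetical daily snapshot of each user's recency metric.
snapshots = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-15"] * 4 + ["2023-01-16"] * 4),
    "user_id": [1, 2, 3, 4] * 2,
    "days_since_last_seen": [0, 3, 12, 40, 1, 4, 13, 41],
})

# Bin each user into a trackable group, from most to least recently active.
buckets = pd.cut(
    snapshots["days_since_last_seen"],
    bins=[-1, 1, 7, 30, float("inf")],
    labels=["0-1 days", "2-7 days", "8-30 days", "30+ days"],
)

# Users per bucket per day: the kind of time series the graphs above plot.
trend = snapshots.groupby(["date", buckets], observed=False).size().unstack()
print(trend)
```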

So to use any of these intuitive metrics, you also need a way of focusing only on active users. But these metrics were supposed to identify active users. We can arbitrarily define inactivity for individual metrics - for example, if we saw a person once and never again, we can label them as “bounced” and remove them...until they show up again after a month. Then we have to unbounce them. Any definition of inactivity based on these kinds of intuitive metrics is necessarily fragile and only becomes clear with a whole lot of hindsight; and without a definition of inactivity, there is no useful way to track the users who are active.

A different way to define activity is to pick a specific outcome we would like our users to produce again and again and again, and then use machine learning to score each user on the probability that they will do that thing. For example: if my app serves video content, I want people to watch a video. So we can define that as our target outcome: activity means a user is going to come to my app and watch a video tomorrow. Once we have that target outcome, we can model it, using heuristic metrics like the four listed above as the features on which to base those predictions.
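
A minimal sketch of what such a model could look like, using scikit-learn’s logistic regression with the four heuristic metrics as features. Everything here (the feature construction, the simulated labels, the model choice) is an illustrative assumption, not a description of any particular production system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical per-user features built from the heuristic metrics above:
# frequency, volume, recency, longevity.
X = np.column_stack([
    rng.exponential(5.0, n),    # avg days between active days
    rng.exponential(600.0, n),  # total playtime in seconds
    rng.exponential(10.0, n),   # days since last seen
    rng.exponential(60.0, n),   # days between first and last seen
])

# Target outcome: did the user come back and watch a video the next day?
# Simulated here so the example runs; in practice it comes from event logs.
logits = 1.0 - 0.3 * X[:, 2] + 0.002 * X[:, 1]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

model = LogisticRegression(max_iter=1000).fit(X, y)

# The "activity" score: predicted probability of the target outcome.
activity_score = model.predict_proba(X)[:, 1]
```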

This prediction is the “activity” score. We feed information about each user’s recent history into the model, and predict their probability of doing the thing we want them to do tomorrow. The plot below shows how this model performs:

To evaluate the model, we’ve binned the scores - which run from 0.0 to 1.0 - into increments of 0.01. Then, within each bin, we calculate what percentage of users who had that score on a given day actually showed up on the app and watched a video the next day. Those are the blue lines - the performance for each day that we calculated the score. The light blue line in the middle of all the dark blue lines is the average of all those days. As you can see, on average, users with a score of 0.5 have a 50% chance of showing up on the app on the next day. Users with a score of 0.9 have a 90% chance of showing up on the app the next day. The grey lines show the percentage of users who scored at or below a certain level on any given day (the black line is the average of the grey lines).
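
A sketch of the calibration check described above. It assumes a `scored` table with one row per user per day, holding the model’s score and whether the user actually returned the next day; the data here is simulated so the snippet runs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical scored history: one row per user per day.
scored = pd.DataFrame({"score": rng.random(10_000)})
# Simulated outcome; in practice it comes from the next day's event logs.
scored["returned_next_day"] = rng.random(len(scored)) < scored["score"]

# Bin the scores (0.0 to 1.0) into increments of 0.01.
bins = np.arange(0.0, 1.01, 0.01)
scored["bin"] = pd.cut(scored["score"], bins=bins, include_lowest=True)

# Within each bin, what share of users actually returned the next day?
# A well-calibrated model: users scored ~0.5 return ~50% of the time.
calibration = scored.groupby("bin", observed=False)["returned_next_day"].mean()
print(calibration.tail())
```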

If we use the activity score as our definition of activity - for example, based on the plot above, a user with an activity score below 0.1 has practically no chance of showing up on the app the next day, so we can define all users below that threshold as inactive - then we can use our more intuitive metrics to monitor the health of our app by focusing only on those users who are still active enough to be useful indicators of health:

And if we want to see how a model-based activity score improves on the basic metrics many product analytics teams use, we can plot them together. The blue line in the plot below shows the % of users who came back on a given day for every level of the activity score used as a cut-off. If we use 0.4 as a cut-off for our active users, then we can be confident that over 95% of the users scoring 0.4 or higher will show up on the day we expect them to. The grey line, on the other hand, shows the % of users who come back when we use a simple metric like “days since last seen”. If we assume that all users seen in the last 15 days are active, fewer than 20% of them actually show up.
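
A sketch of how that comparison could be computed. The data is simulated and the column names are illustrative; the point is the shape of the computation, not the specific numbers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

users = pd.DataFrame({
    "activity_score": rng.random(n),
    "days_since_last_seen": rng.integers(0, 60, n),
})
# Simulated next-day returns; real data comes from event logs.
users["returned"] = rng.random(n) < users["activity_score"]

def hit_rate(flagged: pd.Series) -> float:
    """Of the users a rule flags as active, what share actually returned?"""
    return users.loc[flagged, "returned"].mean()

# Model-based rule: activity score at or above a cut-off.
for cutoff in (0.2, 0.4, 0.6):
    print(f"score >= {cutoff}: {hit_rate(users['activity_score'] >= cutoff):.0%}")

# Heuristic rule: seen within the last 15 days.
print(f"seen in last 15 days: {hit_rate(users['days_since_last_seen'] <= 15):.0%}")
```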

If you’ve grown your user base past a million users, you probably have a lot of features you’ve spent time and effort building, and your user population is probably very diverse, with a lot of potentially meaningful segments. Each of those segments corresponds to different ways and times of using your app, and those patterns occur at different rates for each segment. It’s a complex problem. Basic, naive metrics make it difficult to see how your app is really performing. We’re building Aampe so you can optimize your push notifications to maximize the activity that matters most to you.