There are generally limited options for clustering large numbers of records. A typical Aampe customer has up to 400 app events ("product viewed", "purchase completed", etc.) instrumented, for sometimes as many as 100 million end users. Clustering 300 features is simple, even with 100 million records - something like mini-batch k-means can do the trick.

Clustering records (grouping those 100 million users based on complete-profile similarity) gets more complicated.

Many clustering algorithms need to compute pairwise distances...even with 1 million records, that's no joke. That's why my current favorite clustering approach for event data involves:

  1. An event-event distance matrix based on temporal proximities.
  2. Hierarchical clustering to group events.
  3. Assignment of users to event clusters based on event activation.

These days, I calculate the distance matrix as a modified Jaccard distance by picking temporal bounds - I count the intersection of any two events that happen within 24 hours but not within 1 hour of each other. I use that lower bound to filter out event relationships that are just part of an automated workflow - for example, checkout started, payment selected, address selected, and purchase, and checkout completed tend to be very tightly connected because once you start a checkout, you're likely to do the rest of the flow; I omit those relationships because I'm more interested in event relationships across sessions.

I've always preferred hierarchical clustering because it's versatile and introspectable.

I can choose however many clusters I need - and switch to a different number of clusters quickly and even interactively if I want - but I can also easily look at higher-level clusters to aid in interpretation.

After that, I assign users to clusters based on the number of times they've done the events that make up each cluster. I put those event counts through a tf-idf transformation to prevent clusters with very common events from looking like they are more popular. Of course, this means that each user will be a member of several clusters. The thing is: that's true of most clustering approaches. It's just that most clustering approaches hide those multiple associations - they pick cluster centroids, measure each record's distance from each centroid, and assign the cluster of the closest centroid. But I can also get each user's second-best, third-best, etc. cluster if I need them.

All of this gives clusters with intuitive definitions and flexible membership criteria to aid introspection (you can see a typical basic output in the table above). Also, those intuitive definitions help when you calculate success metrics (click rates, purchase rates, etc.) for each cluster - you can easily rank clusters based on the success metrics, identify intervention possibilities based on cluster events, and target those interventions based on cluster membership.