GET AAMPE
Ellie Hanna

Has this ever happened to you?

You're working with time series data, and you've got a bunch of unpredictable gaps.  Maybe there were some unforeseen dropouts in monitoring; maybe your measurement instrument was shorting out.  In some cases, you can just ignore those missing timepoints, but sometimes you need to find another way to account for the time in the gap.  Maybe you even want to separate your time series into segments made up of consecutive sets of timepoints, bounded by those gaps, and keep track of whether the segments are behaving differently. If only you had some quick way to find the boundaries of each of those consecutive sets of timepoints!

Like this.  This big ugly line between mid-August and early September.  Those are missing data - that line is misleading.

Now for some situations, it's not too taxing to just eyeball this and manually label consecutive sets of time points.  But sometimes it'd be nice to do it automatically (or at scale, where eyeballing it would be impractical).  I ran into a situation recently where I needed to separate out a time series into consecutive chunks of days.  I was halfway into making the world's most revolting for loop when I happened to recall how we used to make fun of a mentor of mine for solving every data science problem with connected components, and then I realized I had become that which I made fun of.

For the uninitiated, connected components are disjoint sets of connected nodes in a graph - in other words, groups of things that are only connected to each other and nothing else.  Think of a connected component as a particularly cliquish group of people who are friends only with Facebook friends with some subset of the rest of that group (God, I'm dating myself).  Crucially, you don't have to be connected to every other item in the group to be part of that connected component - just one.  If one of the people in that Facebook clique is also friends with her aunt from Wisconsin on Facebook, and her aunt is only Facebook friends with her, her aunt from Wisconsin is part of that connected component as well.

(If you're not into Facebook, I've made a more abstract example below.)

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ["A","B"],
    ["B","C"],
    ["D","E"],
    ["D","F"],
    ["D","G"],
    ["F","G"]
])

It's actually the fact that you only need to be connected to one node in a component - Wisconsin Facebook Aunt's situation - that makes connected components apply really nicely to this use case.  If we're looking for consecutive groups of days that are partitioned by gaps, we're really just looking for chains of days that each link together, and trying to find the places where the chain is broken. You can formulate this chain as a network, where each day is a node and days only edge to the ones immediately before and after them, and each unbroken stretch of the chain will be its own connected component.

Here's what it looked like for me.  I had a string of event counts aggregated by date.  Pretty much every day had some value except for a few periods of time where we were lacking data, and I had context to think the counts shouldn't be zero during those timespans.  (Matplotlib’s .plot() method will automatically draw a line between the bounds of the gap.)

Here’s an example - no data between February 15th and February 20th.

So how do I go about automatically segmenting this time series into sets of consecutive days?

The first step is to make sure you have an (1) ordered list of dates that (2) spans the entire range, including the days for which you don't have data.

(It obviously doesn't have to be dates - this approach would generalize to anything with a consistent sampling rate, like every millisecond between 1:37 00:35:42 and 2:45 39:16:14.  You just want to make sure you have a row for each timepoint that "ought" to have been sampled between the minimum and the maximum, put the data in the appropriate row, and have a null value for timepoints you don't have data for.)

Below is one way to do it.  You can see that date is ordered, and that we've got rows with empty cells during the February 15-20 gap.

expanded_df = pd.merge(

    pd.DataFrame(data = pd.date_range(*toy_dataset["date"].agg(["min","max"])), columns = ["date"]),
    toy_dataset,
    on = "date",
    how = "left"
)

expanded_df.loc[range(43,52)]

Now we're going to build an edge list, which is the way we're going to create a graph-based "chain" in networkx.  The edge list here is just a list of nodes that connect to each other - in our case, that's a list of days WITH data that follow each other.  You can create it here by availing yourself of the .shift() method a few times in pandas.

# make a new column that shifts the date column to the next one in the sequence

expanded_df["next_date"] = np.where(

    expanded_df["n_events"].shift(-1).notnull(), # whenever the day after this one has a non-null number of events...
    expanded_df["date"].shift(-1),               # ...put the date of the next day
    np.datetime64("NaT")                         # otherwise, leave it null

)

expanded_df.loc[range(43,52)]

If you drop the nulls, you're left with just pairs of consecutive days where there were data.  Observe - the row with a date of February 15th is gone, because February 16th didn't have data; but there IS a "next_date" of February 15th, because February 15th itself did have data (and so did the date before it).

# you need your dropna() to include both the "next_date" and the "events" column at a minimum

expanded_df.loc[range(43,52)].dropna()[["date","next_date"]]

We've got an edge list, so now we use it to make a graph object.

g = nx.Graph()

# note that you want to .dropna() first before specifying the columns

g.add_edges_from(expanded_df.dropna(subset = ["n_events","next_date"])[["date","next_date"]].values)

(One quick detour before we proceed.  What happens if you've got a timepoint with data, but the two timepoints immediately surrounding it have dropout?  Using the approach we've used so far, you'd lose these little lonely components of one, because we end up dropping any rows in the expanded df that a singleton timepoint would be a part of.  You might or might not be interested in these components - if you are, you can add all the nodes with data from the original df to the graph object, without having them be part of edge list.  networkx is smart enough to not forget the edges you've already told it when you add the nodes by themselves.)

g.add_nodes_from(toy_dataset["date"].values) # this is from the original toy_dataset, not expanded_df, because you only want the original set of dates that have data

Now for the main event.  Use connected components to find the unbroken segments of the chain!

# this makes a list of sets, where each set is the unbroken segments of consecutive dates

cc = list(nx.connected_components(g))

cc.sort(key = lambda x: min(x)) # the sorting here is just for aesthetic purposes - makes it so the segments will be labeled in order

# it's prettier in a dataframe

connected_components_df = pd.DataFrame(data = enumerate(cc), columns = ["cc","date"]).explode("date")

connected_components_df.head()

So what do our chain segments look like?  In my case, I had six groups of consecutive days, one of which was just made up of a single day by itself, which I could now disentangle for figures or for separately analyzing or anything my little heart desired.

Voila!  Now if there's some reason I want to treat each segment differently (for instance, include it as a covariate of no interest in a statistical test, in case there's something different about isolated segments of the time series in between dropouts), I've now got a label for each segment.  If I want to interpolate between segments 4 and 5 (in some nicer way than Matplotlib just drawing that big ugly line between segment the last two segments - maybe add some seasonality), I've now got an easy way to find the end points of segments that bound a gap.

Have you ever found a surprising use for graph theory? Have another hack for time series data?

This browser does not support inline PDFs. Download the PDF to view it.

How to use connected components in a graph-based approach to find consecutive sets of timepoints separated by gaps.

Using graph theory for missing data

Has this ever happened to you?

You're working with time series data, and you've got a bunch of unpredictable gaps.  Maybe there were some unforeseen dropouts in monitoring; maybe your measurement instrument was shorting out.  In some cases, you can just ignore those missing timepoints, but sometimes you need to find another way to account for the time in the gap.  Maybe you even want to separate your time series into segments made up of consecutive sets of timepoints, bounded by those gaps, and keep track of whether the segments are behaving differently. If only you had some quick way to find the boundaries of each of those consecutive sets of timepoints!

Like this.  This big ugly line between mid-August and early September.  Those are missing data - that line is misleading.

Now for some situations, it's not too taxing to just eyeball this and manually label consecutive sets of time points.  But sometimes it'd be nice to do it automatically (or at scale, where eyeballing it would be impractical).  I ran into a situation recently where I needed to separate out a time series into consecutive chunks of days.  I was halfway into making the world's most revolting for loop when I happened to recall how we used to make fun of a mentor of mine for solving every data science problem with connected components, and then I realized I had become that which I made fun of.

For the uninitiated, connected components are disjoint sets of connected nodes in a graph - in other words, groups of things that are only connected to each other and nothing else.  Think of a connected component as a particularly cliquish group of people who are friends only with Facebook friends with some subset of the rest of that group (God, I'm dating myself).  Crucially, you don't have to be connected to every other item in the group to be part of that connected component - just one.  If one of the people in that Facebook clique is also friends with her aunt from Wisconsin on Facebook, and her aunt is only Facebook friends with her, her aunt from Wisconsin is part of that connected component as well.

(If you're not into Facebook, I've made a more abstract example below.)

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ["A","B"],
    ["B","C"],
    ["D","E"],
    ["D","F"],
    ["D","G"],
    ["F","G"]
])

It's actually the fact that you only need to be connected to one node in a component - Wisconsin Facebook Aunt's situation - that makes connected components apply really nicely to this use case.  If we're looking for consecutive groups of days that are partitioned by gaps, we're really just looking for chains of days that each link together, and trying to find the places where the chain is broken. You can formulate this chain as a network, where each day is a node and days only edge to the ones immediately before and after them, and each unbroken stretch of the chain will be its own connected component.

Here's what it looked like for me.  I had a string of event counts aggregated by date.  Pretty much every day had some value except for a few periods of time where we were lacking data, and I had context to think the counts shouldn't be zero during those timespans.  (Matplotlib’s .plot() method will automatically draw a line between the bounds of the gap.)

Here’s an example - no data between February 15th and February 20th.

So how do I go about automatically segmenting this time series into sets of consecutive days?

The first step is to make sure you have an (1) ordered list of dates that (2) spans the entire range, including the days for which you don't have data.

(It obviously doesn't have to be dates - this approach would generalize to anything with a consistent sampling rate, like every millisecond between 1:37 00:35:42 and 2:45 39:16:14.  You just want to make sure you have a row for each timepoint that "ought" to have been sampled between the minimum and the maximum, put the data in the appropriate row, and have a null value for timepoints you don't have data for.)

Below is one way to do it.  You can see that date is ordered, and that we've got rows with empty cells during the February 15-20 gap.

expanded_df = pd.merge(

    pd.DataFrame(data = pd.date_range(*toy_dataset["date"].agg(["min","max"])), columns = ["date"]),
    toy_dataset,
    on = "date",
    how = "left"
)

expanded_df.loc[range(43,52)]

Now we're going to build an edge list, which is the way we're going to create a graph-based "chain" in networkx.  The edge list here is just a list of nodes that connect to each other - in our case, that's a list of days WITH data that follow each other.  You can create it here by availing yourself of the .shift() method a few times in pandas.

# make a new column that shifts the date column to the next one in the sequence

expanded_df["next_date"] = np.where(

    expanded_df["n_events"].shift(-1).notnull(), # whenever the day after this one has a non-null number of events...
    expanded_df["date"].shift(-1),               # ...put the date of the next day
    np.datetime64("NaT")                         # otherwise, leave it null

)

expanded_df.loc[range(43,52)]

If you drop the nulls, you're left with just pairs of consecutive days where there were data.  Observe - the row with a date of February 15th is gone, because February 16th didn't have data; but there IS a "next_date" of February 15th, because February 15th itself did have data (and so did the date before it).

# you need your dropna() to include both the "next_date" and the "events" column at a minimum

expanded_df.loc[range(43,52)].dropna()[["date","next_date"]]

We've got an edge list, so now we use it to make a graph object.

g = nx.Graph()

# note that you want to .dropna() first before specifying the columns

g.add_edges_from(expanded_df.dropna(subset = ["n_events","next_date"])[["date","next_date"]].values)

(One quick detour before we proceed.  What happens if you've got a timepoint with data, but the two timepoints immediately surrounding it have dropout?  Using the approach we've used so far, you'd lose these little lonely components of one, because we end up dropping any rows in the expanded df that a singleton timepoint would be a part of.  You might or might not be interested in these components - if you are, you can add all the nodes with data from the original df to the graph object, without having them be part of edge list.  networkx is smart enough to not forget the edges you've already told it when you add the nodes by themselves.)

g.add_nodes_from(toy_dataset["date"].values) # this is from the original toy_dataset, not expanded_df, because you only want the original set of dates that have data

Now for the main event.  Use connected components to find the unbroken segments of the chain!

# this makes a list of sets, where each set is the unbroken segments of consecutive dates

cc = list(nx.connected_components(g))

cc.sort(key = lambda x: min(x)) # the sorting here is just for aesthetic purposes - makes it so the segments will be labeled in order

# it's prettier in a dataframe

connected_components_df = pd.DataFrame(data = enumerate(cc), columns = ["cc","date"]).explode("date")

connected_components_df.head()

So what do our chain segments look like?  In my case, I had six groups of consecutive days, one of which was just made up of a single day by itself, which I could now disentangle for figures or for separately analyzing or anything my little heart desired.

Voila!  Now if there's some reason I want to treat each segment differently (for instance, include it as a covariate of no interest in a statistical test, in case there's something different about isolated segments of the time series in between dropouts), I've now got a label for each segment.  If I want to interpolate between segments 4 and 5 (in some nicer way than Matplotlib just drawing that big ugly line between segment the last two segments - maybe add some seasonality), I've now got an easy way to find the end points of segments that bound a gap.

Have you ever found a surprising use for graph theory? Have another hack for time series data?

This browser does not support inline PDFs. Download the PDF to view it.