Overlapping time period problem

Eddie Lin on 2020-09-02

Overlapping time period problem in the event table

In the data science world, people tend to think that cleaning data is boring and would rather take on machine learning and modeling challenges, but every now and then a data-cleaning problem turns out to be a fun brain teaser or an algorithmic puzzle. Here's one I recently ran into while cleaning an event table. It doesn't involve any crazy ML algorithms or cool new Python tools, just a small, interesting problem worth a couple of minutes to figure out.

Event tables are common when building any kind of data or ML pipeline, especially for prediction problems. An event table is like a transaction table: it normally has a start time, an end time, a person ID, and other event attributes. Each row is an event, and some events may overlap, which can be a problem. A real-world event table might look like this:

   person_id      start        end
0          1 2017-01-01 2017-06-01
1          1 2017-02-01 2017-03-01
2          1 2017-05-01 2017-08-01
3          1 2017-09-01 2017-10-01
4          1 2017-11-01 2018-02-01
5          1 2018-01-01 2018-03-01
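
If you want to follow along, the example table can be rebuilt as a small pandas DataFrame; the snippets later in the post assume something like this df (the values are taken straight from the table above):

import pandas as pd

# the example event table: one person, six events, some of them overlapping
df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 1, 1],
    'start': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-05-01',
                             '2017-09-01', '2017-11-01', '2018-01-01']),
    'end': pd.to_datetime(['2017-06-01', '2017-03-01', '2017-08-01',
                           '2017-10-01', '2018-02-01', '2018-03-01']),
})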

Sometimes we can just live with it, but sometimes we want to merge the overlapping rows into one row with the earliest start time and the latest end time, like this:

   person_id      start        end
0          1 2017-01-01 2017-08-01
1          1 2017-09-01 2017-10-01
2          1 2017-11-01 2018-03-01

Note that overlap is treated transitively: since the second and third events both overlap with the first event, even though the third event doesn't overlap with the second, we want the merged event to take the first event's start time and the third event's end time. It looks pretty simple, but it's not so straightforward to figure out a good solution.

Here's how I solved it in Python.

Simple case:

In the first solution, we can treat this as a graph connectivity problem. Each event is a node, and two events are connected by an edge if they overlap. The first three events therefore form a connected sub-graph, the fourth event is an isolated node (a sub-graph of its own), and the fifth and sixth events form another connected sub-graph. We assign each sub-graph a group_id in the original DataFrame. Now we can group by the group_id, aggregate with min(start time) and max(end time), and we have the answer.

To implement that, we create an adjacency matrix to represent the graph. First we need to define "overlap": for any two nodes A and B, if A's start date is before B's end date and A's end date is after B's start date, then A and B overlap. We can implement this logic with numpy broadcasting to build the matrix. Several algorithms exist for finding connected components, and we can simply use the off-the-shelf scipy.sparse.csgraph.connected_components, which scipy implements with something similar to Tarjan's algorithm. Then a simple pandas groupby gives us the result we want.

graph_approach.py

from scipy.sparse.csgraph import connected_components

def graph_approach(df, start='start', end='end'):
    starts = df[start].values
    ends = df[end].values
    # adjacency matrix: events i and j overlap if i starts before j ends and i ends after j starts
    graph = (starts <= ends[:, None]) & (ends >= starts[:, None])
    # label each connected sub-graph with a component index
    n_components, indices = connected_components(graph, directed=False)
    # merge each component into a single row
    return df.groupby(indices).aggregate({start: 'min', end: 'max'})
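
As a quick sanity check, calling it on the example df built earlier collapses the six events into the three merged periods shown above (note that this single-key version drops the person_id column):

merged = graph_approach(df)
print(merged)
# expect three rows: 2017-01-01 to 2017-08-01, 2017-09-01 to 2017-10-01,
# and 2017-11-01 to 2018-03-01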

This is a pretty solution thanks to its mathematical approach. But that's just for one person. If we have thousands or millions of people, which is normally the real-world case, the graph grows very large and the space complexity grows as n². We can still handle multiple people by adding another "&" condition to the line that builds the adjacency matrix.

graph_approach.py

from scipy.sparse.csgraph import connected_components

def graph_approach(df, key='person_id', start='start', end='end'):
    starts = df[start].values
    ends = df[end].values
    person_ids = df[key].values
    # only connect events that overlap in time and belong to the same person
    graph = (starts <= ends[:, None]) & (ends >= starts[:, None]) & (person_ids == person_ids[:, None])
    n_components, indices = connected_components(graph, directed=False)
    return df.groupby(indices).aggregate({key: 'first', start: 'min', end: 'max'})

Theoretically, the time complexity of Tarjan's algorithm is linear, but the n² space complexity of the adjacency matrix means this approach doesn't scale. That made me think about how to solve the problem in a scalable way with pandas.

Honestly, I am not a fan of pandas due to its unintuitive way of querying things, but it does provide good performance if we can have a vectorized solution.

The vectorized solution looks like this:

vectorized_approach.py

import pandas as pd

def vectorized_approach(df, key='person_id', start='start', end='end'):
    df = df.copy()
    # start_end is an indicator: 1 for a start time, -1 for an end time
    startdf = pd.DataFrame({key: df[key], 'time': df[start], 'start_end': 1})
    enddf = pd.DataFrame({key: df[key], 'time': df[end], 'start_end': -1})
    # concat and sort the whole thing by key and time
    mergedf = pd.concat([startdf, enddf]).sort_values([key, 'time'])
    # running sum of the indicator: how many events are open at each point in time
    mergedf['cumsum'] = mergedf.groupby(key)['start_end'].cumsum()
    # a new merged period starts at a start time whose cumsum is 1
    mergedf['new_start'] = mergedf['cumsum'].eq(1) & mergedf['start_end'].eq(1)
    # another cumulative sum over new_start assigns a group id per merged period
    mergedf['group'] = mergedf.groupby(key)['new_start'].cumsum()
    # map the group id back to the original rows via the start-time rows
    df['group_id'] = mergedf['group'].loc[mergedf['start_end'].eq(1)]

    return df.groupby([key, 'group_id']).aggregate({start: 'min', end: 'max'}).reset_index()

Let's forget about graphs, nodes, and edges, and rethink the problem as plain time-series data. First, we split each event into its start time and end time and stack them into a single sorted time series. The start_end column indicates whether a timestamp is a start or an end: 1 for a start time, -1 for an end time. Then we take a cumulative sum over the series, stored in the cumsum column. Looking at it, we can see that a new (merged) start time is a start time whose cumsum is 1, and a new (merged) end time is an end time whose cumsum is 0. That gives us the new_start column. Next, we take another cumulative sum over the new_start column to get the group_id. Finally, we do the same aggregation as before to get the answer.

It's probably easier to understand what's going on by looking at mergedf:

   person_id       time  start_end  cumsum  new_start  group
0          1 2017-01-01          1       1       True    1.0
1          1 2017-02-01          1       2      False    1.0
1          1 2017-03-01         -1       1      False    1.0
2          1 2017-05-01          1       2      False    1.0
0          1 2017-06-01         -1       1      False    1.0
2          1 2017-08-01         -1       0      False    1.0
3          1 2017-09-01          1       1       True    2.0
3          1 2017-10-01         -1       0      False    2.0
4          1 2017-11-01          1       1       True    3.0
5          1 2018-01-01          1       2      False    3.0
4          1 2018-02-01         -1       1      False    3.0
5          1 2018-03-01         -1       0      False    3.0
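
Calling the function on the example df from earlier returns the same three merged periods as the graph approach, now keyed by person_id and group_id:

merged = vectorized_approach(df)
print(merged)
# expect three merged periods: 2017-01-01 to 2017-08-01,
# 2017-09-01 to 2017-10-01, and 2017-11-01 to 2018-03-01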

Performance

Since I don't want to dig too deep into the implementation of scipy's connected_components and analyze its complexity, I'll just compare scalability empirically. I duplicated the example n times to create a big event table and measured runtime and peak memory as n grows by orders of magnitude.
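
The exact benchmark harness isn't the point, but a minimal sketch of the timing loop might look like this. It assumes the example df from above and that each copy gets its own person_id; the helper name make_big_table and the chosen sizes are mine, and peak memory was tracked separately from the timing.

import time
import pandas as pd

def make_big_table(df, n):
    # duplicate the example n times, giving each copy its own person_id;
    # ignore_index keeps the row index unique, which the vectorized approach relies on
    copies = [df.assign(person_id=df['person_id'] + i) for i in range(n)]
    return pd.concat(copies, ignore_index=True)

# note: the graph approach needs a lot of memory at the larger sizes
for n in [10, 100, 1000, 10000]:
    big = make_big_table(df, n)
    for name, func in [('graph', graph_approach), ('vectorization', vectorized_approach)]:
        t0 = time.perf_counter()
        func(big)
        print(name, n, round(time.perf_counter() - t0, 3), 'seconds')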

Both approaches are similar in runtime and memory until n hits 10². After that, it's pretty clear that the graph approach quickly explodes in both runtime and memory. I actually ran out of memory running the graph approach with n=100000 on a machine with 200 GB of RAM. There are definitely ways to optimize the graph approach, which I'd love to see, but the point I'm trying to make here is that the operation behind scipy's connected_components is far more memory-intensive than the vectorized pandas solution. I guess sometimes ugly algorithms just beat beautiful ones.

       size    mem     runtime       approach
0        10     82    0.083125          graph
1        50     83    0.111390          graph
2       100     85    0.094990          graph
3       500    150    0.458318          graph
4      1000    305    1.544724          graph
5      5000   6964   27.119864          graph
6     10000  27576  113.710234          graph
7    100000    inf         inf          graph
8        10     82    0.092768  vectorization
9        50     83    0.125092  vectorization
10      100     83    0.121607  vectorization
11      500     86    0.091670  vectorization
12     1000     89    0.168066  vectorization
13     5000    101    0.154213  vectorization
14    10000    115    0.224327  vectorization
15    50000    216    1.523057  vectorization
16   100000    351    2.482687  vectorization
17   500000   1407    6.840897  vectorization
18  1000000   2707   12.607009  vectorization