Overlapping time period problem in the event table
In the data science world, people tend to think data cleaning is boring and crave the machine learning and modeling challenges instead, but sometimes a problem pops up during data cleaning that feels like a fun brain teaser or an algorithmic puzzle. Here’s one I recently ran into while cleaning an event table. It’s nothing like crazy ML algorithms or cool new Python tools, just a small, interesting problem worth a couple of minutes to figure out.
The event table is pretty common when we build any kind of data pipeline or ML pipeline, especially when we deal with prediction problems. An event table is like a transaction table that normally has a start time, an end time, a person ID, and other event attributes. Each row is an event, and some events might overlap, which can be a problem. A real-world event table might look like this:
person_id start end
0 1 2017-01-01 2017-06-01
1 1 2017-02-01 2017-03-01
2 1 2017-05-01 2017-08-01
3 1 2017-09-01 2017-10-01
4 1 2017-11-01 2018-02-01
5 1 2018-01-01 2018-03-01
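For anyone following along, the example table above can be reproduced with a small snippet like this:

import pandas as pd

df = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 1, 1],
    'start': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-05-01',
                             '2017-09-01', '2017-11-01', '2018-01-01']),
    'end': pd.to_datetime(['2017-06-01', '2017-03-01', '2017-08-01',
                           '2017-10-01', '2018-02-01', '2018-03-01']),
})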
Sometimes we can just live with it, but sometimes we want to merge the overlapping rows into one row with the earliest start time and the latest end time, like this:
person_id start end
0 1 2017-01-01 2017-08-01
1 1 2017-09-01 2017-10-01
2 1 2017-11-01 2018-03-01
That is, since the second and third events both overlap with the first event, even though the third event doesn’t overlap with the second, we would like the merged event to take the first event’s start time and the third event’s end time. It looks pretty simple, but it’s not so straightforward to figure out a good solution.
Here I am solving this in Python.
Simple case:
In the first solution, we can treat this as a graph connectivity problem. Each event is a node, and if two events overlap, they are connected by an edge. Therefore, the first three events form a connected sub-graph, the fourth event is an isolated-node sub-graph, and the fifth and sixth events form another connected sub-graph. We assign each sub-graph a group_id in the original DataFrame. Now we can group by the group_id, aggregate with min(start time) and max(end time), and we have the answer.
To implement that, we create an adjacency matrix to represent the graph. First, we need to define “overlap”: for any two nodes A and B, if node A’s start date is before node B’s end date and node A’s end date is after node B’s start date, then A and B overlap. We can implement this logic with numpy broadcasting to build the adjacency matrix. Several algorithms exist for the graph connectivity problem, and we can just use the off-the-shelf function scipy.sparse.csgraph.connected_components, which scipy implements with something similar to Tarjan’s algorithm. Then a simple pandas groupby gets us the result we want.
from scipy.sparse.csgraph import connected_components

def graph_approach(df, start='start', end='end'):
    # boolean adjacency matrix: events i and j overlap when i starts
    # before j ends and i ends after j starts
    start_values = df[start].values
    end_values = df[end].values
    graph = (start_values <= end_values[:, None]) & (end_values >= start_values[:, None])
    # label each connected component of the overlap graph
    n_components, indices = connected_components(graph, directed=False)
    # merge each component into a single row
    return df.groupby(indices).aggregate({start: 'min', end: 'max'})
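Running it on the single-person example table from above gives the merged result; a quick sanity check (output formatting approximated):

print(graph_approach(df))
#        start        end
# 0 2017-01-01 2017-08-01
# 1 2017-09-01 2017-10-01
# 2 2017-11-01 2018-03-01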
This is a pretty solution thanks to its mathematical approach, but that’s just for one person. If we have thousands or millions of people, which is normally the real-world case, the graph grows very large and the space complexity grows as n². We can still implement this by adding another “&” condition on the line that creates the adjacency matrix.
def graph_approach(df, key='person_id', start='start', end='end'):
    start_values = df[start].values
    end_values = df[end].values
    person_ids = df[key].values
    # events connect only when they overlap AND belong to the same person
    graph = ((start_values <= end_values[:, None])
             & (end_values >= start_values[:, None])
             & (person_ids == person_ids[:, None]))
    n_components, indices = connected_components(graph, directed=False)
    return df.groupby(indices).aggregate({key: 'first', start: 'min', end: 'max'})
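As a quick sanity check of the keyed version, here is a hypothetical two-person table: person 2’s event overlaps person 1’s in time but must not be merged with it.

df2 = pd.DataFrame({
    'person_id': [1, 1, 2],
    'start': pd.to_datetime(['2017-01-01', '2017-02-01', '2017-02-15']),
    'end': pd.to_datetime(['2017-06-01', '2017-03-01', '2017-04-01']),
})
print(graph_approach(df2))
#    person_id      start        end
# 0          1 2017-01-01 2017-06-01
# 1          2 2017-02-15 2017-04-01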
Theoretically, the time complexity of Tarjan’s algorithm is linear, but the n² space complexity of the adjacency matrix makes this approach unable to scale. That made me think about how to solve the problem in a scalable way with pandas.
Honestly, I am not a fan of pandas due to its unintuitive way of querying things, but it does provide good performance if we can find a vectorized solution. The vectorized solution looks like this:
import pandas as pd

def vectorized_approach(df, key='person_id', start='start', end='end'):
    df = df.copy()
    # start_end is an indicator for start (1) or end (-1)
    startdf = pd.DataFrame({key: df[key], 'time': df[start], 'start_end': 1})
    enddf = pd.DataFrame({key: df[key], 'time': df[end], 'start_end': -1})
    # concat and sort the whole thing by key and time
    mergedf = pd.concat([startdf, enddf]).sort_values([key, 'time'])
    # use cumsum to create gaps and islands
    mergedf['cumsum'] = mergedf.groupby(key)['start_end'].cumsum()
    # a new island begins where the running sum hits 1 on a start row
    mergedf['new_start'] = mergedf['cumsum'].eq(1) & mergedf['start_end'].eq(1)
    # use another cumsum to assign a group id per island
    mergedf['group'] = mergedf.groupby(key)['new_start'].cumsum()
    # map the group id back onto the original rows via the start rows
    df['group_id'] = mergedf['group'].loc[mergedf['start_end'].eq(1)]
    return df.groupby([key, 'group_id']).aggregate({start: 'min', end: 'max'}).reset_index()
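Both functions should agree on the example table; for instance:

print(vectorized_approach(df))
#    person_id  group_id      start        end
# 0          1       1.0 2017-01-01 2017-08-01
# 1          1       2.0 2017-09-01 2017-10-01
# 2          1       3.0 2017-11-01 2018-03-01
# (the group_id dtype may vary across pandas versions)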
Let’s forget about all the graphs, nodes, and edges, and rethink this problem as pure, simple time-series data. First, we break each event into its start time and end time and sort them into a one-dimensional time series. We have a column start_end to indicate whether a timestamp is a start time or an end time, where 1 is a start time and -1 is an end time. Then we perform a cumulative sum over the series as the column cumsum. Looking at it, we notice that a new start time is a start time whose cumsum is 1, and a new end time is an end time whose cumsum is 0. So now we have a new_start column. Next, we perform another cumulative sum on the new_start column to get the group_id. Last, we do the same aggregation to get the answer.
It’s probably easier to look at mergedf to understand what’s going on here.
person_id time start_end cumsum new_start group
0 1 2017-01-01 1 1 True 1.0
1 1 2017-02-01 1 2 False 1.0
1 1 2017-03-01 -1 1 False 1.0
2 1 2017-05-01 1 2 False 1.0
0 1 2017-06-01 -1 1 False 1.0
2 1 2017-08-01 -1 0 False 1.0
3 1 2017-09-01 1 1 True 2.0
3 1 2017-10-01 -1 0 False 2.0
4 1 2017-11-01 1 1 True 3.0
5 1 2018-01-01 1 2 False 3.0
4 1 2018-02-01 -1 1 False 3.0
5 1 2018-03-01 -1 0 False 3.0
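If the cumsum trick still feels magical, here is the same gaps-and-islands idea in isolation, on a made-up toy sequence of +1/-1 markers (not from the event table above):

import pandas as pd

s = pd.Series([1, 1, -1, 1, -1, -1, 1, -1])  # starts (+1) and ends (-1), sorted by time
running = s.cumsum()                         # 1, 2, 1, 2, 1, 0, 1, 0
new_start = running.eq(1) & s.eq(1)          # True only at positions 0 and 6
group = new_start.cumsum()                   # 1, 1, 1, 1, 1, 1, 2, 2 -> two islands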
Performance
Since I don’t want to go too deep into the implementation of scipy’s connected_components and analyze its complexity, I simply compare the scalability empirically. What I did was duplicate the example n times to create a big event table and measure the runtime and peak memory as n grows through the orders of magnitude.
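The exact harness isn’t shown here; a minimal sketch of the idea, assuming the standard library’s time and tracemalloc for measurement (the original peak-memory numbers may have been collected differently), could look like this:

import time
import tracemalloc
import pandas as pd

def benchmark(func, df, n):
    # duplicate the example n times, shifting person_id so copies don't merge
    big = pd.concat(
        [df.assign(person_id=df['person_id'] + i) for i in range(n)],
        ignore_index=True,
    )
    tracemalloc.start()
    t0 = time.perf_counter()
    func(big)
    runtime = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return runtime, peak  # seconds, peak bytes traced by tracemalloc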
Both approaches have similar runtime and memory usage before n hits 10². After that, it’s pretty clear that the graph approach explodes in both runtime and memory. I actually ran out of memory when running the graph approach with n=100000 on a machine with 200GB of RAM. There definitely are ways to optimize the graph approach, which I’d love to see, but the point I’m trying to make here is that the operation behind scipy’s connected_components is more memory-intensive than the vectorization in pandas. I guess sometimes ugly algorithms just beat beautiful ones.
size mem runtime approach
0 10 82 0.083125 graph
1 50 83 0.111390 graph
2 100 85 0.094990 graph
3 500 150 0.458318 graph
4 1000 305 1.544724 graph
5 5000 6964 27.119864 graph
6 10000 27576 113.710234 graph
7 100000 inf inf graph
8 10 82 0.092768 vectorization
9 50 83 0.125092 vectorization
10 100 83 0.121607 vectorization
11 500 86 0.091670 vectorization
12 1000 89 0.168066 vectorization
13 5000 101 0.154213 vectorization
14 10000 115 0.224327 vectorization
15 50000 216 1.523057 vectorization
16 100000 351 2.482687 vectorization
17 500000 1407 6.840897 vectorization
18 1000000 2707 12.607009 vectorization