The fastest way to get first N items in each group of a Pandas DataFrame

In my work, the speed of code writing and reading is usually more important than the speed of its execution. Right now, I’m facing a challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One of such time-consuming steps involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking first N elements in each group. The tables in this step are pretty small not more than one hundred elements. But since I have to perform this step many times, the running time accumulates to a substantial fraction.

Let’s first construct a toy example

N = 100
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))

I’ll use %%timeit cell magic which runs a Jupyter cell many times, and measures the time it takes to run the code.


%%timeit
d.groupby(
 'x'
 ).apply(
 lambda t: t.head(K)
 ).reset_index(drop=True)

This is the output:

3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

 

I suspected that head() was not the most efficient way to take the first lines. I tried .iloc


%%timeit
d.groupby(
 'x'
 ).apply(
 lambda t: t.iloc[0:K]
 ).reset_index(drop=True)

2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A 10% improvement. Not bad but not excellent either. Then I realized that Pandas groupby object have their own head function


%%timeit
d.groupby(
 'x'
 ).head(
 K
 ).reset_index(drop=True)

674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

647 microseconds instead of 3.2 milliseconds. The improvement is by almost a factor of five!

It’s not enough to have the right tool, it’s important to be aware of it, and to use it right. I wonder whether there is even faster way to obtain this job.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s