The fastest way to get first N items in each group of a Pandas DataFrame

In my work, the speed of writing and reading code is usually more important than the speed of its execution. Right now, though, I’m facing the challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One such step involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking the first N elements in each group. The tables in this step are pretty small: no more than one hundred rows. But since I have to perform this step many times, the running time adds up to a substantial fraction of the total.

Let’s first construct a toy example:

import numpy as np
import pandas as pd

N = 100
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))

I’ll use the %%timeit cell magic, which runs a Jupyter cell many times and measures how long the code takes to run.

 lambda t: t.head(K)

This is the output:

3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
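For readers who want to reproduce this kind of measurement, here is a minimal sketch of what a group-wise "sort, then head" step could look like with apply. It is an assumption about the shape of the step, not the exact cell that was timed: the value of K, the fixed random seed, the descending sort order, and the helper name top_k_apply are all made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only to make the sketch reproducible
N = 100            # number of rows, as in the toy example
K = 3              # assumed number of rows to keep per group

d = pd.DataFrame(dict(x=np.random.randint(1, 5, N), y=np.random.rand(N)))


def top_k_apply(df, k):
    """Hypothetical helper: sort every group by score and keep its first k rows."""
    return df.groupby("x").apply(
        lambda t: t.sort_values("y", ascending=False).head(k)
    )


# The group key ends up as the outer level of the result's index.
result = top_k_apply(d, K)
print(result)
```

The lambda is called once per group from Python, which is the kind of per-group overhead that tends to dominate on small tables like this one.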


I suspected that head() was not the most efficient way to take the first rows, so I tried .iloc:

 lambda t: t.iloc[0:K]

2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly a 10% improvement. Not bad, but not excellent either. Then I realized that Pandas groupby objects have their own head method.


674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

674 microseconds instead of 3.2 milliseconds. That is an improvement of almost a factor of five!
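A minimal sketch of the groupby-level head idea, under the same assumptions as before (K, the seed, and the descending sort order are mine, not taken from the timed cell): sort the whole table once, then let the groupby object's own head pick the first K rows of every group.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only to make the sketch reproducible
N, K = 100, 3      # K is an assumed value

d = pd.DataFrame(dict(x=np.random.randint(1, 5, N), y=np.random.rand(N)))

# One global sort by score, then GroupBy.head keeps the first K rows per group.
# Rows keep their original index labels and the "x" column stays in place.
top_k = d.sort_values("y", ascending=False).groupby("x").head(K)
print(top_k)
```

Because the table is sorted before grouping, "first K rows of each group" means "K highest scores in each group", with no per-group Python call involved.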

It’s not enough to have the right tool; it’s important to be aware of it and to use it right. I wonder whether there is an even faster way to do this job.

Numpy vs. Pandas: functions that look the same, share the same code but behave differently

I can’t imagine what my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason is that, under the hood, pandas uses numpy. This similarity is very convenient, as it allows passing numpy arrays to many pandas functions and vice versa. However, sometimes it stabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.

Let’s create a numpy vector with a single element in it:

>>> import numpy as np

>>> v = np.array([3.14]) 

Now, let's compute the standard deviation of this vector. According to the definition, we expect it to be equal to zero.

>>> np.std(v)
0.0

So far so good. No surprises.

Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation should be now?

>>> import pandas as pd
>>> s = pd.Series(v)
>>> s.std()
nan

What? Not a number? What the hell? It’s not an empty vector! I didn’t ask for the corrected sample standard deviation. Wait a second…

>>> s.std(ddof=0)
0.0

Now I start getting it. Compare this

>>> print(np.std.__doc__)
Compute the standard deviation along the specified axis.
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.

… to this

>>> print(pd.Series.std.__doc__)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument
ddof : int, default 1
degrees of freedom

Formally, the pandas developers did nothing wrong. They decided that it makes sense to default to the sample standard deviation (normalized by N-1) when working with data tables, unlike numpy, which is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know that, even after working with both libraries for so long.
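To make the difference concrete, here is a small sketch on a vector with more than one element. The numbers and the helper manual_std are made up for illustration; the helper simply spells out the "divide by N - ddof" rule from the docstrings quoted above.

```python
import numpy as np
import pandas as pd

v = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data
s = pd.Series(v)


def manual_std(values, ddof):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    values = np.asarray(values, dtype=float)
    squared_dev = np.sum((values - values.mean()) ** 2)
    return np.sqrt(squared_dev / (values.size - ddof))


population = np.std(v)  # numpy default: ddof=0, divisor N
sample = s.std()        # pandas default: ddof=1, divisor N - 1

print(population, manual_std(v, ddof=0))  # the two values agree
print(sample, manual_std(v, ddof=1))      # the two values agree

# Passing ddof explicitly makes the two libraries agree with each other.
print(np.isclose(np.std(v, ddof=1), s.std()))
print(np.isclose(s.std(ddof=0), np.std(v)))
```

With a single element and ddof=1, the divisor N - 1 is zero and the sum of squared deviations is zero too, so the result is 0/0, which is why pandas returns NaN for a one-element Series.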

To sum up:

>>> s.std()
nan
>>> v.std()
0.0
>>> s == v
0    True
dtype: bool