Numpy vs. Pandas: functions that look the same, share the same code but behave differently

Screenshot that shows that two variables have different standard deviation despite being equal

I can’t imagine how my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason for that is that, under the hood, pandas uses numpy. This similarity is very convenient as it allows passing numpy¬†arrays to many pandas functions and vice versa. However, sometimes it sabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.

Let’s create a numpy vector with a single element in it:

>>> import numpy as np

>>> v = np.array([3.14]) 

Now, let's compute the standard deviaiton of this vector. According to the definition, we expect it to be equal zero.
>>> np.std(v)

So far so good. No surprises.

Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation should be now?

>>> import pandas as pd
>>> s = pd.Series(v)
>>> s.std()

What? Not a number? What the hell? It’s not an empty vector! I didn’t ask to perform the corrected sample standard deviation. Wait a second…

>> s.std(ddof=0)

Now I start getting it. Compare this

>>> print(np.std.__doc__)
Compute the standard deviation along the specified axis.
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.

… to this

>>> print(pd.Series.std.__doc__)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument
ddof : int, default 1
degrees of freedom

Formally, the pandas developers did nothing wrong. They decided that it makes sense to default for normalized standard deviation when working with data tables, unlike numpy that is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know that even after working with both the libraries for so long.

To sum up:

> s.std()
>> v.std()
>> s == v
0 True
dtype: bool






Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s