One of the reasons I don’t like R

I never liked R. I didn’t like it the first time I tried to learn it, I didn’t like it when I had to switch to R as my primary work tool at my previous job, and I didn’t like it a year and a half later, when I was comfortable enough to add R to my CV, right before leaving that job.

Today, I was reminded of one feature (out of so many) that made me dislike R: its import (or library, as they call it in R) mechanism. In Python, you can import a_module and then use its components by calling a_module.a_function. Simple and predictable. In R, you have to read the docs to understand what will happen to your namespace after you put library(a.module) (I know, those dots, grrrr) in your code. This feature is so annoying that people write modules to help them use other modules. Like in this blog post, which looks like an interesting thing to do, but … wouldn’t it be easier to use Python?
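For contrast, here is what the Python side of that comparison looks like in practice (using the standard math module as a stand-in for a_module): the import adds exactly one name to your namespace, and everything else stays safely behind it.

import math          # adds exactly one name, `math`, to the namespace

print(math.sqrt(2))  # every component is reached through the module name

# nothing leaked in: `sqrt` by itself is undefined here,
# so `print(sqrt(2))` would raise a NameError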

 

What is the best way to handle command line arguments in Python?

The best way to handle command line arguments in Python is defopt. It works like magic. You write a function, add a proper docstring in any standard format (I use [numpy doc]), and watch the magic happen:


import defopt

def main(greeting, *, count=1):
    """Display a friendly greeting.

    :param str greeting: Greeting to display
    :param int count: Number of times to display the greeting
    """
    for _ in range(count):
        print(greeting)

if __name__ == '__main__':
    defopt.run(main)

 

You have:

  • help string generation
  • data type conversion
  • default arguments
  • zero boilerplate code

Magic!
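For completeness, here is roughly what an interaction with this script looks like, assuming it is saved as greet.py (the filename is mine): the positional argument maps to greeting, and the keyword-only count turns into a --count flag.

$ python greet.py Hello
Hello
$ python greet.py Hello --count 3
Hello
Hello
Hello

Running the script with --help prints a usage message generated from the docstring.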

Illustration: the famous XKCD

Measuring the wall time in Python programs

Illustration: a watch

Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that measures execution time was many years ago, when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
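The gist itself is linked above; as a rough illustration of the idea (a minimal sketch, not necessarily what the gist actually contains), a tic-toc mechanism in Python can be as small as this:

import time

class Tictoc:
    """A minimal tic-toc wall-time timer, in the spirit of Matlab's tic-toc."""

    def tic(self):
        # remember the starting moment
        self._start = time.perf_counter()

    def toc(self, label='Elapsed'):
        # report the wall time passed since the last tic()
        elapsed = time.perf_counter() - self._start
        print(f'{label}: {elapsed:.3f} sec')
        return elapsed

timer = Tictoc()
timer.tic()
sum(range(10_000_000))  # some computation to babysit
timer.toc()             # prints something like: Elapsed: 0.250 sec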

Gender salary gap in the Israeli high-tech — now the code

Screenshot of a Jupyter notebook with some code and a graph.

Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech. Like 99% of the graphs I create, these were made with matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings; the original one uses a slightly customized style.
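In case it saves someone a search: applying seaborn’s style settings to a matplotlib-based notebook, as in the published version, is a single call (the customizations of the original version are in the notebook itself).

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # switch all subsequent matplotlib plots to seaborn's default style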

 

The Y-axis doesn’t have to be on the left

Line charts are great for conveying the evolution of a variable over time. This is a typical chart. It has three key components: the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

A typical line chart. The Y-axis is on the left

Usually, you will see the Y-axis on the left side of the graph. Unless you design for a right-to-left language environment, placing the Y-axis on the left makes perfect sense. However, a left-side Y-axis isn’t a hard rule.

In many cases, more importance is given to the most recent data point. For example, the dynamics of a stock price might be interesting, but today’s price is what determines how much money I can get by selling my stock portfolio.

What happens if we move the axis to the right?

A slightly improved version. The Y-axis is on the right, adjacent to the most recent data point

Now, today’s price of the XYZ stock is more clearly visible. Let’s make the most important values explicit:

The final version. The Y-axis is on the right, adjacent to the most recent data point. The axis ticks correspond to actual data points

There are two ways to obtain a right-sided Y-axis in matplotlib. The first one uses a combination of

ax.yaxis.tick_right()                 # move the ticks to the right side
ax.yaxis.set_label_position("right")  # move the axis label there as well

The second one creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen. Thus, I use the second option. Following is the code that I used to create the last version of the graph.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

np.random.seed(123)
days = np.arange(1, 31)
price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_yticks([])  # make the 1st axis ticks disappear
ax2 = ax.twinx()   # create a secondary axis
ax2.plot(days, price, '-', lw=3)
ax2.set_xlim(1, max(days))
sns.despine(ax=ax, left=True)  # remove the 1st axis spines
sns.despine(ax=ax2, left=True, right=False)
tks = [min(price), max(price), price[-1]]
ax2.set_yticks(tks)
ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')
ixmin = np.argmin(price)
ixmax = np.argmax(price)
ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
ax2.set_xticklabels(['Oct, 1', f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}'])
ylm = ax2.get_ylim()
bottom = ylm[0]
for ix in [ixmin, ixmax]:
    y = price[ix]
    x = days[ix]
    # draw thin guide lines from the extreme points to the axes
    ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)
    ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)
ax2.set_ylim(ylm)

Next time you create a “something” vs. time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y-axis to the right side of your graph and see whether it becomes more readable.

This post was triggered by a nice write-up, Plotting a Course: Line Charts, by a new blogger named David (he didn’t mention his last name) from https://thenumberist.wordpress.com/

The fastest way to get first N items in each group of a Pandas DataFrame

In my work, the speed of writing and reading code is usually more important than the speed of its execution. Right now, though, I’m facing the challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One such time-consuming step involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking the first N elements of each group. The tables in this step are pretty small, not more than one hundred elements each, but since I have to perform this step many times, its running time accumulates to a substantial fraction of the total.

Let’s first construct a toy example:

import numpy as np
import pandas as pd

N = 100
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))
K = 3  # rows to keep per group (the value I originally used is not shown;
       # any small number works for this demo)

I’ll use the %%timeit cell magic, which runs a Jupyter cell many times and measures the time it takes to run the code.


%%timeit
d.groupby('x').apply(lambda t: t.head(K)).reset_index(drop=True)

This is the output:

3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

 

I suspected that head() was not the most efficient way to take the first rows. I tried .iloc:


%%timeit
d.groupby('x').apply(lambda t: t.iloc[0:K]).reset_index(drop=True)

2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A 10% improvement. Not bad, but not excellent either. Then I realized that Pandas groupby objects have their own head function:


%%timeit
d.groupby('x').head(K).reset_index(drop=True)

674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

674 microseconds instead of 3.2 milliseconds. That is an improvement of almost a factor of five!

It’s not enough to have the right tool; it’s important to be aware of it and to use it right. I wonder whether there is an even faster way to get this job done.
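One candidate I haven’t timed here (so take it as a sketch rather than a measured result) is to skip the groupby-apply machinery altogether and keep only the rows whose position within their group is smaller than K, using cumcount:

# cumcount() numbers the rows within each group, starting from 0;
# a boolean mask then keeps the first K rows of every group
first_k = d[d.groupby('x').cumcount() < K].reset_index(drop=True)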