Working with the local filesystem and with S3 in the same code

Photo by Ekrulila on Pexels.com

As data people, we need to work with files: we use files to save and load data, models, configurations, images, and other things. When possible, I prefer working with local files because it’s fast and straightforward. Sometimes, however, the production code needs to work with data stored on S3. What do we do? Until recently, you would have to rewrite multiple parts of the code. But not anymore. I created the sshalosh package, which solves this problem and spares a lot of code rewriting. Here’s how you work with it:

import sshalosh

if work_with_s3:
    s3_config = {
        "s3": {
            "defaultBucket": "bucket",
            "accessKey": "ABCDEFGHIJKLMNOP",
            "accessSecret": "/accessSecretThatOnlyYouKnow"
        }
    }
else:
    s3_config = None

serializer = sshalosh.Serializer(s3_config)

# Done! From now on, you only need to deal with the business logic, not the housekeeping

# Load data & model
data = serializer.load_json('data.json')
model = serializer.load_pickle('model.pkl')

# Update
data = update_with_new_examples()
model.fit(data)

# Save updated objects
serializer.dump_json(data, 'data.json')
serializer.dump_pickle(model, 'model.pkl')

As simple as that.
The package provides the following functions.

  • path_exists
  • rm
  • rmtree
  • ls
  • load_pickle, dump_pickle
  • load_json, dump_json

There is also a multipurpose open function that can open a file in read, write, or append mode and returns a handle to it.
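For example, appending a line to a log file might look like this (a sketch only; I’m assuming the function mirrors the built-in open signature and that the returned handle works as a context manager, so check the repo for the exact API):

# Hypothetical usage; the exact signature of serializer.open may differ
with serializer.open('log.txt', 'a') as f:
    f.write('another line\n')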

How to install? How to contribute?

The installation is very simple: pip install sshalosh-borisgorelik, and you’re done. The code lives on GitHub at http://github.com/bgbg/shalosh. You are welcome to contribute code, documentation, and bug reports.

The name is strange, isn’t it?

Well, naming is hard. In Hebrew, “shalosh” means “three”, so “sshalosh” means s3. Don’t overanalyze this. The GitHub repo doesn’t have the extra s. My bad.

Sharing the results of your Python code


If you work, but nobody knows about your results or cares about them, have you done any work at all? 

A proverbial tree in the proverbial forest. Photo by veeterzy on Pexels.com

As a data scientist, the product of my work is usually an algorithm, an analysis, or a model. What is a good way to share these results with my clients? 

Since I write in Python 99% of my time, I fell in love with a framework called Panel (http://panel.holoviz.org/). Panel allows you to create and serve a basic interactive UI around data, an analysis, or a method. It plays well with API frameworks such as FastAPI or Flask. The only problem is sharing this work. Sometimes it is enough to run a local demo server, but if you want to share the work with someone who doesn’t sit next to you, you have to host it somewhere and take care of access rights. For this purpose, I have a cheap cloud server ($5/month), which is more than enough for my personal needs.
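To give a taste of what a Panel app looks like, here is a minimal sketch (the widget, the toy computation, and the file name are my own illustration, not code from any specific project):

import numpy as np
import panel as pn

pn.extension()

power = pn.widgets.IntSlider(name='power', start=1, end=5, value=2)

def show_powers(p):
    # Return something Panel knows how to render
    return pn.pane.Str(f'{np.arange(1, 6) ** p}')

# Bind the widget to the function and lay out the app
pn.Column(power, pn.bind(show_powers, power)).servable()

Save this as app.py and run panel serve app.py to get an interactive page served locally.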

If you can share the entire work publicly, some services can pick up your Jupyter notebooks from GitHub and serve them interactively. I know of Voilà and Binder.

Recently, Streamlit.io has entered this niche. It currently only allows sharing public repos but promises to add a paid service for private code. I’m eager to see that.

How to become a Python professional in 42 hours?

Here’s an appealing ad that I saw: it promised to make me a Python professional in 42 hours.


How to become a Python professional in 42 hours? I’ll tell you how. There is no way. I don’t know of any field of knowledge in which one can become a professional after 42 hours. Certainly not Python. Not even after 42 days. Maybe after 42 weeks, if that’s mostly what you do and you are already a programmer.

TicToc — a flexible and straightforward stopwatch library for Python.

Photo by Brett Sayles on Pexels.com

Many years ago, I needed a way to measure execution times. I didn’t like the existing solutions, so I wrote my own class. As time passed, I added small changes and improvements, and recently, I decided to publish the code on GitHub, first as a gist, and now as a full-featured GitHub repository and a pip package.

TicToc – a simple way to measure execution time

TicToc provides a simple mechanism to measure the wall time (a stopwatch) with reasonable accuracy.

Create an object. Run tic() to start the timer and toc() to stop it. Repeated tic-tocs will accumulate time. The tic-toc pair is useful in interactive environments such as the shell or a notebook. Whenever toc is called, a useful message is automatically printed to stdout. For non-interactive purposes, use start and stop, as they are less verbose.

Usage examples

Following is an example of how to use TicToc:

import time

from tictoc import TicToc  # module name assumed; adjust to match the tictoc-borisgorelik package


def leibniz_pi(n):
    ret = 0
    for i in range(n * 1000000):
        ret += ((4.0 * (-1) ** i) / (2 * i + 1))
    return ret

tt_overall = TicToc('overall')  # started by default
tt_cumulative = TicToc('cumulative', start=False)
for iteration in range(1, 4):
    tt_cumulative.start()
    tt_current = TicToc('current')
    pi = leibniz_pi(iteration)
    tt_current.stop()
    tt_cumulative.stop()
    time.sleep(0.01)  # this interval will not be accounted for by `tt_cumulative`
    print(
        f'Iteration {iteration}: pi={pi:.9}. '
        f'The computation took {tt_current.running_time():.2f} seconds. '
        f'Running time is {tt_overall.running_time():.2} seconds'
    )
tt_overall.stop()
print(tt_overall)
print(tt_cumulative)

TicToc objects are created in a “running” state, i.e., you don’t have to start them using tic. To change this default behaviour, use

tt = TicToc(start=False)
# do some stuff
# when ready
tt.tic()

Installation

Install the package using pip:

pip install tictoc-borisgorelik

The difference between Python decorators and inheritance that cost me three hours of hair-pulling

Photo by Genaro Servín on Pexels.com

I don’t have much hair on my head, but recently, I encountered a funny peculiarity in Python that had me pulling my hair for a couple of hours. In retrospect, this feature makes a lot of sense. In retrospect.

First, let’s start with the mental model that I had in my head: inheritance.

Let’s say you have a base class that defines a function `f`. Now, you inherit from that class and rewrite `f` (see the sketch below). What happens? The fact that you defined `f` in ClassB means that, to a rough approximation, the old definition of `f` from ClassA does not exist in any of the ClassB objects.
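Here is a minimal sketch of that setup (the original snippets are not in this text, so the bodies of f are my own):

class ClassA:
    def f(self):
        print('ClassA.f')

class ClassB(ClassA):
    def f(self):  # overrides ClassA.f
        print('ClassB.f')

ClassB().f()  # prints 'ClassB.f': the subclass definition wins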

Now, let’s go to decorators.

from dataclasses import dataclass

from dataclasses_json import dataclass_json


@dataclass_json
@dataclass
class Message2:
    message: str
    weight: int

    def to_dict(self, encode_json=False):
        print('Custom to_dict')
        ret = {'MESSAGE': self.message, 'WEIGHT': self.weight}
        return ret


m2 = Message2('m2', 2)
m2.to_dict()  # does NOT print 'Custom to_dict'

What happened here? I used a decorator, `dataclass_json`, that, among other things, provides a `to_dict` function to Python’s data classes. I created a class, `Message2`, but I needed a custom `to_dict` definition. So, naturally, I defined a new version of `to_dict`, only to discover several hours later that the new `to_dict` doesn’t exist.

Do you get the point already? In inheritance, the custom implementations are added ON TOP of the base class. However, when you apply a decorator to a class, your class’s custom code is BELOW the one provided by the decorator. Therefore, you don’t override the decorating code but rather “underride” it (i.e., give it something it can replace).

As I said, it makes perfect sense, but still, I missed it. I don’t know whether I would have managed to find the solution without Stackoverflow.

One of the reasons I don’t like R

I never liked R. I didn’t like it the first time I tried to learn it, and I didn’t like it when I had to switch to R as my primary work tool at my previous job. I didn’t like it one and a half years later, when I was comfortable enough to add R to my CV, right before leaving that job.

Today, I was reminded of one feature (out of so many) that made me dislike R: its import (or library, as they call it in R) mechanism. In Python, you can import a_module and then use its components by calling a_module.a_function. Simple and predictable. In R, you have to read the docs in order to understand what will happen to your namespace after you have library(a.module) (I know, those dots, grrrr) in your code. This feature is so annoying that people write modules that help them use other modules. Like in this blog post, which looks like an interesting thing to do, but… wouldn’t it be easier to use Python?
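To make the Python side of that comparison concrete, here is the behavior I mean, using the standard json module (any module works the same way):

import json

# Nothing leaks into the global namespace; every imported name stays behind the module prefix
print(json.dumps({'three': 3}))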


What is the best way to handle command line arguments in Python?

The best way to handle command line arguments in Python is defopt. It works like magic. You write a function, add a proper docstring using any standard format (I use numpy doc), and watch the magic:


import defopt

def main(greeting, *, count=1):
    """Display a friendly greeting.

    :param str greeting: Greeting to display
    :param int count: Number of times to display the greeting
    """
    for _ in range(count):
        print(greeting)

if __name__ == '__main__':
    defopt.run(main)
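
Assuming the script above is saved as greet.py (the filename is my assumption), defopt turns it into a command-line tool that behaves like this:

$ python greet.py Hello --count 3
Hello
Hello
Hello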

 

You have:

  • help string generation
  • data type conversion
  • default arguments
  • zero boilerplate code

Magic!

Illustration: the famous XKCD

Measuring the wall time in Python programs

[UPDATE Feb 2020]: TicToc is now a package. See this post.

Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that performs time measurement was many years ago, when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
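The underlying mechanism fits in a few lines. Here is a minimal sketch of the idea (not the gist code itself), built on time.perf_counter:

import time

class SimpleTicToc:
    def __init__(self):
        self.elapsed = 0.0
        self._start = None

    def tic(self):
        # Remember when the stopwatch was (re)started
        self._start = time.perf_counter()

    def toc(self):
        # Accumulate the time passed since the last tic and report it
        self.elapsed += time.perf_counter() - self._start
        print(f'Elapsed: {self.elapsed:.4f} sec')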

Gender salary gap in the Israeli high-tech — now the code

Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech. As with 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to GitHub. Here’s the link. The published version uses seaborn style settings. The original one uses a slightly customized style.


The Y-axis doesn’t have to be on the left

Line charts are great for conveying the evolution of a variable over time. This is a typical chart. It has three key components: the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

A typical line chart. The Y-axis is on the left

Usually, you will see the Y-axis on the left side of the graph. Unless you design for a right-to-left language environment, placing the Y-axis on the left makes perfect sense. However, a left-side Y-axis isn’t a hard rule.

In many cases, more importance is given to the most recent data point. For example, it might be interesting to know a stock’s price dynamics, but today’s price is what determines how much money I can get by selling my stock portfolio.

What happens if we move the axis to the right?

A slightly improved version. The Y-axis is on the right, adjacent to the most recent data point

Now, today’s price of the XYZ stock is more clearly visible. Let’s make the most important values explicitly clear:

The final version. The Y-axis is on the right, adjacent to the most recent data point. The axis ticks correspond to actual data points

There are two ways to obtain a right-sided Y-axis in matplotlib. The first way uses a combination of

ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

The second one creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen. Thus, I use the second option. Following is the code that I used to create the last version of the graph.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

np.random.seed(123)
days = np.arange(1, 31)
price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_yticks([])  # Make 1st axis ticks disappear
ax2 = ax.twinx()  # Create a secondary axis
ax2.plot(days, price, '-', lw=3)
ax2.set_xlim(1, max(days))
sns.despine(ax=ax, left=True)  # Remove 1st axis spines
sns.despine(ax=ax2, left=True, right=False)
tks = [min(price), max(price), price[-1]]
ax2.set_yticks(tks)
ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')
ixmin = np.argmin(price)
ixmax = np.argmax(price)
ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
ax2.set_xticklabels(['Oct, 1', f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}'])
ylm = ax2.get_ylim()
bottom = ylm[0]
for ix in [ixmin, ixmax]:
    y = price[ix]
    x = days[ix]
    # Draw thin guide lines from each extreme point to both axes
    ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)
    ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)
ax2.set_ylim(ylm)

Next time you create a “something” vs time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y-axis to the right side of your graph and see whether it becomes more readable.

This post was triggered by a nice write-up, Plotting a Course: Line Charts, by a new blogger, David (he didn’t mention his last name), from https://thenumberist.wordpress.com/

The fastest way to get first N items in each group of a Pandas DataFrame

In my work, the speed of code writing and reading is usually more important than the speed of its execution. Right now, I’m facing the challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One such time-consuming step involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking the first N elements of each group. The tables in this step are pretty small, not more than one hundred elements each. But since I have to perform this step many times, the running time accumulates to a substantial fraction of the total.

Let’s first construct a toy example

import numpy as np
import pandas as pd

N = 100
K = 3  # rows to keep per group; the original post doesn't show the value used
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))

I’ll use the %%timeit cell magic, which runs a Jupyter cell many times and measures the time it takes to run the code.


%%timeit
d.groupby('x').apply(lambda t: t.head(K)).reset_index(drop=True)

This is the output:

3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


I suspected that head() was not the most efficient way to take the first lines, so I tried .iloc:


%%timeit
d.groupby('x').apply(lambda t: t.iloc[0:K]).reset_index(drop=True)

2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A 10% improvement. Not bad, but not excellent either. Then I realized that Pandas groupby objects have their own head function:


%%timeit
d.groupby('x').head(K).reset_index(drop=True)

674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

674 microseconds instead of 3.2 milliseconds. That’s an improvement of almost a factor of five!

It’s not enough to have the right tool; it’s important to be aware of it and to use it right. I wonder whether there is an even faster way to perform this job.