Whoever owns the metric owns the results — don’t trust benchmarks

Illustration: a mechanical stopwatch in a person's palm

Other factors being equal, what language would you choose for heavy numeric computations: Python or PHP? This is not a language war but a serious question. For me, the choice seems to be obvious: I would choose Python, and I’m not the only one. In this survey, for example, 45% of data scientist use Python, compared to 24% who use PHP. The two sets of data scientists aren’t mutually exclusive, but we do see the picture.

This is why I was very surprised when a colleague of mine suggested switching to PHP due to a three times faster performance in a benchmark. I was very surprised and intrigued. Especially, when I noticed that they used a heavy number crunching for the benchmark.

In that benchmark, the authors compute prime numbers using the following Python code

def get_primes7(n):
	"""
	standard optimized sieve algorithm to get a list of prime numbers
	--- this is the function to compare your functions against! ---
	"""
	if n < 2:
		return []
	if n == 2:
		return [2]
	# do only odd numbers starting at 3
	if sys.version_info.major <= 2:
		s = range(3, n + 1, 2)
	else:  # Python 3
		s = list(range(3, n + 1, 2))
	# n**0.5 simpler than math.sqr(n)
	mroot = n ** 0.5
	half = len(s)
	i = 0
	m = 3
	while m <= mroot:
		if s[i]:
			j = (m * m - 3) // 2  # int div
			s[j] = 0
			while j =6, Returns a array of primes, 2 &lt;= p <span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>&lt; n &quot;&quot;&quot;
    sieve = np.ones(n//3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in range(int(n**0.5)//3+1):
        if sieve[i]:
            k=3*i+1|1
            sieve[      ((k*k)//3)      ::2*k] = False
            sieve[(k*k+4*k-2*k*(i&amp;1))//3::2*k] = False
    return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]

Did you notice the problem? The code above is a pure Python code. I can't think of a good reason to use pure python code for computationally-intensive, time-sensitive tasks. When you need to crunch numbers with Python, and when the computational time is even remotely important, you will most certainly use tools that were specifically optimized for such tasks. One of the most important such tools is numpy, in which the most important loops are implemented in C++ or in Fortran. Many other packages, such as Pandas, scipy, sklearn, and others rely on numpy or other form of speed optimization.

The following snippet uses numpy to perform the same computation as the first one.

def numpy_primes(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n&gt;=6, Returns a array of primes, 2 &lt;= p <span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>&lt; n &quot;&quot;&quot;
    sieve = np.ones(n//3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in range(int(n**0.5)//3+1):
        if sieve[i]:
            k=3*i+1|1
            sieve[      ((k*k)//3)      ::2*k] = False
            sieve[(k*k+4*k-2*k*(i&amp;1))//3::2*k] = False
    return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]

On my computer, the timings to generate primes smaller than 10,000,000 is 1.97 seconds for the pure Python implementation, and 21.4 milliseconds for the Numpy version. The numpy version is 92 times faster!

What does that mean? 
Whoever owns the metric owns the results. Never trust a benchmark result before you understand how the benchmark was performed, and before making sure the benchmark was performed under the conditions that are relevant to you and your problem.

 

 

When “a pile of shit” is a compliment — On context importance in remote communication

Pile of poo emoji

What would you do, if someone left a “Pile of Poo” emoji as a reaction to your photo in your team Slack channel?

This is exactly what happened to me a couple of days ago, when Sirin, my team lead, posted a picture of me talking to the Barcelona Machine Learning Meetup Group about data visualization.

Slack screenshot: Photo of me delivering a presentation. One "smiling poop emoji" attached to the photo as a reaction

Did I feel offended? Not at all. It was pretty funny, actually. To understand why, let’s talk about the importance of context in communication, especially in a distributed team.

The context

My Barcelona talk is titled “Three most common mistakes in data visualization and how to avoid them“. During the preparation, I noticed that the first mistake is about keeping the wrong attitude, and the third one is about not writing conclusions. I noticed the “A”, and “C”, and decided to abbreviate the talk as “ABC”. Now, I had to find the right word for the “B” chapter. The second point in my talk deals with low signal-to-noise ratio. How do you summarize signal-to-noise ratio using a word that starts with “B”? My best option was “bullshit”, as a reference to “noise” — the useless pieces of information that appear in so many graphs. I was so happy about “bullshit,” but I wasn’t sure it was culturally acceptable to use this word in a presentation. After fruitless searches for a more appropriate alternative, I decided to ask my colleagues.

Slack screenshot: My poll that asks whether it was OK to use "bullshit" in a presentation. Four out of four responders thought it was

All the responders agreed that using bullshit in a presentation was OK. Martin, the head of Data division at Automattic, provided the most elaborate answer.

Screenshot: Martin's response "for a non-native English speaking audience, i think that american coinages like bullshit come across funnier and less aggressive than they would for (some) American audiences"

I was excited that my neat idea was appropriate, so I went with my plan:

Screenshot. My presentation slides. One of them says "Cut the bullshit"

Understandably, the majority of the data community at Automattic became involved in this presentation. That is why, when Sirin posted a photo of me giving that presentation, it was only natural that one of them responded with a pile of poo emoji. How nice and cute!  💩

The lesson

This bullshit story is not the only example of something said about me (if you can call an emoji “saying”) that sounded very bad to the unknowing person, but was in fact very correct and positive. I have a couple of more examples that may be even funnier than this one but require more elaborate and boring explanations.
However, the lesson is clear. Next time you hear someone saying something unflattering about someone else, don’t jump to conclusions. Think about bullshit, and assume the best intentions.

Three most common mistakes in data visualization 
and how to avoid them. Now, the slides

collage of data visualization paper headlines

Yesterday, I talked in front of the Barcelona Data Science and Machine Learning Meetup about the most common mistakes in data visualization. I enjoyed talking with the local community very much. Judging by the feedback I received during and after the talk, they too, enjoyed my presentation. I uploaded my slides to Slideshare.

Me in front of a screen that shows a bar chart

Enjoy!

Engineering Data Science at Automattic

Data Scientist? Data Engineer? Analyst? My teammate, Yanir Seroussi writes about bridging the gaps between the different professions.

Data for Breakfast

Most data scientists have to write code to analyze data or build products. While coding, data scientists act as software engineers. Adopting best practices from software engineering is key to ensuring the correctness, reproducibility, and maintainability of data science projects. This post describes some of our efforts in the area.

Data scientist Venn diagram example One of many data science Venn diagrams. Source: Data Science Stack Exchange

Different data scientists, different backgrounds

Data science is often defined as the intersection of many fields, including software engineering and statistics. However, as demonstrated by the above Venn diagram, viewing it as an intersection tends to be too exclusive – in reality, it’s a union of many fields. Hence, data scientists tend to come from various backgrounds, and it is common to encounter data scientists with no formal training in computer science or software engineering. According to Michael Hochster, data scientists can be classified into two types

View original post 1,069 more words

Live in Barcelona. Three most common mistakes in data visualization.

Me in front of a whiteboard, pointing at a graph

On Thursday, March 20, I will give a talk titled “Three most common mistakes in data visualization and how to avoid them.” I will be a guest of the Barcelona Data Science and Machine Learning Meetup Group. Right now, less than twenty-four hours after the lecture announcement, there are already seventeen people on the waiting list. I feel a lot of responsibility and am very excited.

 

Visiting the outer space isn’t such a big deal

I know a lot of people who dreamt of being a cosmonaut or an astronaut. I was one of them. Did you know that visiting the outer space isn’t such a big deal? Since the Yuri Gagarin’s first flight to space in 1961, 557 more people flew to space. Unfortunately, not all of them survived the trip [ref].

On the other hand

There are 193 UN member countries. Do you know that, according to Wikipedia, there are only 13 (thirteen!) people who are confirmed to visit all of these countries? [ref] It’s 43 times less than the number of astronauts!

558 people visited space; 13 people visited all the countries in the world

On algorithmic fairness & transparency

Illustration: large lamp sigh that says "The same for everyone" with a sunset as a background

My teammate, Charles Earl has recently attended the Conference on Fairness, Accountability, and Transparency (FAT*). The conference site is full of very interesting material, including proceedings and video recording of lectures and tutorials.

Reading through the conference proceedings, I found a very interesting paper titled “The Cost of Fairness in Binary Classification.” This paper talks about the measures one needs to take in order not use sensitive features (such as race) as the means to discrimination, with a reasonable accuracy tradeoff.

Skimming through this paper, I recalled a conversation I had about a year ago with a chief data scientist in a startup that provides short-term loans to people who need some money now. The major job of the data science team in that company was to assess the risk of a customer. From the explanation the chief data scientist gave, and from the data sources she described, it was clear that they train their model on the information whether a person is likely to receive a loan from a financial institution. When I pointed out that they exclude categories of people that are rejected but are likely to return the money. “Yes?” she said in a tone as if she couldn’t see what the problem that I tried to raise was. “Well,” I said, it’s unfair for many customers, plus you’re missing the chance to recruit customers who were rejected by others”. “We have enough potential customers,” she said. She didn’t think fairness was an issue worth talking about.

 

The featured image is by Søren Astrup Jørgensen from Unsplash

 

Five misconceptions about data science

One item on my todo list is to write a post about “three common misconceptions about data science. Today, I found this interesting post that lists misconceptions much better than I would have been able to do. Plus, they list five of them. That 67% more than I intended to do 😉

I especially liked the section called “What is a Data Scientist” that presents six Venn diagrams of a dream data scientist.

The analogy between the data scientist and a purple unicorn is still apt – finding an individual that satisfies any one of the top four diagrams above is rare.

 

Enjoy reading  Five Misconceptions About Data Science – Knowing What You Don’t Know — Track 2 Analytics

Blogging isn’t what it used to be

From time to time, I assume something, evaluate that assumption, and discover that the reality is opposite to what I thought it was. That’s exactly what happened when I thought about the dynamics of Google searches for “create a site,” compared to the searches for “create a blog.” I was sure that there would be much more searches for “create a site.” I was wrong

blogging_is_not_what_it_used_to_be.png

There are several interesting insights that one can drive from that small analysis.

  1. The number of people who search for “create a site” is continuously dropping.
  2. Ever since 2009, the number of searches for “create a site” is smaller than the number of searches for “create a blog.” Why? I have no idea
  3. Blog creation search dynamics is also interesting. Both “start a blog” and “create a blog” have been decreasing since January 2011. However, despite the fact that both the curves started at the same height, and reached the same peak, they did so in different trajectories. “Create a blog” reached a peak gradually, following a concave path. “Start a blog,” on the other hand, reached the peak following a convex path that resembles exponential growth. For some reason, in January 2009 growth of both the searches stopped.

Usually, in posts like this, you would expect an analysis that explains the difference. I don’t have any answers. However, if you have any hypothesis, I will be glad to hear.

 

ASCII histograms are quick, easy to use and implement

screenshot. Histogram created using ASCII characters

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python was required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.