My latest post on https://data.blog. I enjoyed preparing it, and like its results very much. Happy New Year, everyone.
-
Do New Year's Resolutions Work? Data Suggests They Do!
December 26, 2017 -
The Keys to Effective Data Science Projects — Operationalize
December 20, 2017
Recently, I stumbled upon an interesting series of posts about effective management of data science projects. One of the posts in the series says:
“Operationalization” – a term only a marketer could love. It really just means “people using your solution”.
The main claim of that post is that, at some point, bringing actual users to your data science project may be more important than improving the model. This is exactly what I meant in my “when good enough is good enough” post (also available on YouTube).
-
We're Reading About Artificially Intelligent Harry Potter Fan Fiction, Verifying Online Identities, and More
December 19, 2017 -
Buzzword shift
December 18, 2017
Many years ago, I tried to build something that today would be called “Google Trends for Pubmed”. One thing I found during that process was how the emergence of HIV-related research reduced the number of cancer studies, and how, several years later, the HIV research boom settled down and let cancer research come back.
I was reminded of that project when I looked at the Google Trends data for the once-popular buzz-phrases “data mining” and “pattern recognition”. Sic transit gloria mundi.

It’s not surprising that “Data Science” was the less popular term in 2004. As I already mentioned, “Data Science” is a relatively new term. What does surprise me is the fact that, in the past, “Machine Learning” was so much less popular than “Data Mining”. Even more surprising is the fact that Google Trends now ranks “Machine Learning” almost twice as high as “Data Science”. I was expecting to see the opposite.
“Pattern Recognition,” which in 2004 was as (un)popular as “Machine Learning,” has become even less popular today. Does that mean that nobody is searching for patterns anymore? Not at all. The 2004 pattern recognition experts are now senior data scientists or, if they work in academia, machine learning professors.
PS: does anybody know the reason behind the apparent seasonality in “Data Mining” trends?
-
On alert fatigue
December 17, 2017
I developed an anomaly detection system for Automattic’s internal dashboards. When presenting this system (“When good enough is just good enough”), I used to say that, in our particular case, the cost of false alerts was almost zero. I explained this claim by the fact that no automatic decisions were made based on the alerts, and that the only subscribers to the alert messages were a limited group of my colleagues. Automattic’s CFO, Stu West, who was the biggest stakeholder in this project, asked me to stop making the “zero cost” claim. When the CFO of the company you work for asks you to do something, you comply. So, I stopped saying “zero cost”, but I still listed the error costs as a problem I could safely ignore for the time being. I didn’t fully believe Stu, which is evident from the speaker notes of my presentation deck:
(Image: my speaker notes. Note how “error costs” was the first problem I dismissed.)
I was reminded of Stu’s request today, when I noticed more than 10 unread messages in the Slack channel that receives my anomaly alerts. The oldest unread message was two weeks old. The only reason this could happen is that I stopped caring about the alerts because there were too many of them. I was witnessing a classic case of “alert fatigue”, described many centuries ago in “The Boy Who Cried Wolf”.
The lesson of this story is that there is no such thing as a zero-cost false alarm. Lack of specificity is a serious problem.

Feature image by Ray Hennessy
-
What's the most important thing about communicating uncertainty?
December 14, 2017
Sigrid Keydana, in her post Plus/minus what? Let’s talk about uncertainty (talk) — recurrent null, said:
What’s the most important thing about communicating uncertainty? You’re doing it
Really?
Here, for example, is a graph from a blog post:

The graph clearly “communicates” the uncertainty, but does it really convey it? Would you consider the lines and their corresponding confidence intervals very uncertain had you not seen the points?
What if I tell you that there’s a “30% Chance of Rain Tomorrow”? Will you know what it means? Will a person who doesn’t operate on numbers know what it means? The answer to both these questions is “no”, as shown by Gigerenzer and his collaborators in a 2005 paper.

Communicating uncertainty is not a new problem. Until recently, the biggest “clients” of uncertainty communication research were the weather forecasters. However, the recent “data era” introduced uncertainty into every aspect of our personal and professional lives. From credit risk to insurance premiums, from user classification to content recommendation, uncertainty is everywhere. Simply “doing” uncertainty communication, as Sigrid Keydana of the Recurrent Null blog suggested, isn’t enough. The huge public surprise caused by the 2016 US presidential election is the best evidence for that. Proper uncertainty communication is a complex topic. A good starting point to this complex topic is the paper Visualizing Uncertainty About the Future by David Spiegelhalter.
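To make the point above concrete, here is a minimal matplotlib sketch (all data synthetic and invented for illustration) of the difference between a fitted line with its confidence band alone, and the same band shown together with the raw points. The points make the uncertainty much harder to ignore.
[code language=”python”]
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)  # noisy synthetic data

coef = np.polyfit(x, y, 1)      # naive linear fit
y_hat = np.polyval(coef, x)
band = 2 * (y - y_hat).std()    # a rough, illustrative confidence band

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax in (ax1, ax2):
    ax.plot(x, y_hat, lw=2)
    ax.fill_between(x, y_hat - band, y_hat + band, alpha=0.3)
ax2.scatter(x, y, s=15, color='k')  # the raw points make the spread tangible
ax1.set_title('line + band only')
ax2.set_title('line + band + raw points')
plt.show()
[/code]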
-
Doing the Math on Key Words and Top Level Domains
December 12, 2017
My post on data.blog
-
The Y-axis doesn't have to be on the left
December 10, 2017
Line charts are great for conveying the evolution of a variable over time. Here is a typical chart. It has three key components: the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

Usually, you will see the Y-axis on the left side of the graph. Unless you design for a right-to-left language environment, placing the Y-axis on the left makes perfect sense. However, the left-side Y-axis isn’t a hard rule.
In many cases, more importance is given to the most recent data point. For example, it might be interesting to know a stock’s price dynamics, but today’s price is what determines how much money I can get by selling my stock portfolio.
What happens if we move the axis to the right?

Now, today’s price of the XYZ stock is more clearly visible. Let’s make the most important values explicit:

There are two ways to obtain a right-sided Y-axis in matplotlib. The first uses a combination of
[code language=”python”]
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")
[/code]
The second creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen, so I use the second option. Following is the code that I used to create the last version of the graph.
[code language=”python”]
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(123)
days = np.arange(1, 31)
price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_yticks([])  # Make 1st axis ticks disappear.
ax2 = ax.twinx()  # Create a secondary axis
ax2.plot(days, price, '-', lw=3)
ax2.set_xlim(1, max(days))
sns.despine(ax=ax, left=True)  # Remove 1st axis spines
sns.despine(ax=ax2, left=True, right=False)
tks = [min(price), max(price), price[-1]]
ax2.set_yticks(tks)
ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')
ixmin = np.argmin(price)
ixmax = np.argmax(price)
ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
ax2.set_xticklabels(['Oct, 1', f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}'])
ylm = ax2.get_ylim()
bottom = ylm[0]
for ix in [ixmin, ixmax]:
    y = price[ix]
    x = days[ix]
    ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)  # vertical guide line
    ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)  # horizontal guide line
ax2.set_ylim(ylm)
[/code]
Next time you create a “something” vs. time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y-axis to the right side of your graph and see whether it becomes more readable.
This post was triggered by a nice write-up, Plotting a Course: Line Charts, by a new blogger named David (he didn’t mention his last name) from https://thenumberist.wordpress.com/
-
Epitaphs in the Graveyard of Mathematics
December 2, 2017
The excellent Ben Orlin wrote a hilarious post with fictitious tombstones of famous mathematicians. Here is just one example:

I decided to jump on that bandwagon.

-
The fastest way to get first N items in each group of a Pandas DataFrame
November 27, 2017
In my work, the speed of writing and reading code is usually more important than the speed of its execution. Right now, however, I’m facing the challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One such time-consuming step involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking the first N elements of each group. The tables in this step are pretty small, not more than one hundred elements, but since I have to perform this step many times, the running time accumulates to a substantial fraction of the total.
Let’s first construct a toy example:
[code lang=”python”]
import numpy as np
import pandas as pd

N = 100
K = 3  # rows to keep per group; the exact value isn't specified in the original post
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))
[/code]
I’ll use the %%timeit cell magic, which runs a Jupyter cell many times and measures the time it takes to run the code.
[code lang=”python”]
%%timeit
d.groupby('x').apply(lambda t: t.head(K)).reset_index(drop=True)
[/code]
This is the output:
3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I suspected that head() was not the most efficient way to take the first lines, so I tried .iloc:
[code lang=”python”]
%%timeit
d.groupby('x').apply(lambda t: t.iloc[0:K]).reset_index(drop=True)
[/code]
2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A 10% improvement. Not bad, but not excellent either. Then I realized that Pandas groupby objects have their own head function:
[code lang=”python”]
%%timeit
d.groupby('x').head(K).reset_index(drop=True)
[/code]
674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

674 microseconds instead of 3.2 milliseconds. The improvement is almost a factor of five!
It’s not enough to have the right tool; it’s important to be aware of it and to use it right. I wonder whether there is an even faster way to do this job.
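One candidate worth timing is to let a single global sort order all the groups at once, and then apply the same fast groupby head. This is only a sketch under the assumption that, as in my real task, each group should be sorted by the score column y before taking the top K; I haven’t benchmarked it here.
[code lang=”python”]
top_k = (
    d.sort_values('y', ascending=False)  # one global sort orders every group at once
     .groupby('x')
     .head(K)  # the fast groupby head, as above
     .reset_index(drop=True)
)
[/code]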
-
How to make a graph less readable? Rotate the text labels
November 23, 2017
This is my “because you can” rant.
Here, you can see a typical situation. You have some sales data that you want to represent using a bar plot.

Immediately, you notice a problem: the names on the X axis are not readable. One way to make the labels readable is to enlarge the graph.

Making larger graphs isn’t always possible. So, the next default solution is to rotate the text labels.

However, there is a problem: rotated text is read more slowly than standard horizontal text. Don’t believe me? This is not an opinion but rather the result of empirical studies [ref], [ref]. Sometimes rotated text is unavoidable. Most of the time, it is not.
So, how do we make sure all the labels are readable without rotating them? One option is to move them up and down so that they don’t hinder each other. This is easily done with Python’s matplotlib:
[code language=”python”]
import matplotlib.pyplot as plt

# `people` (names) and `sales` (values) are assumed to be defined earlier.
plt.bar(range(len(people)), sales)
plt.title('October sales')
plt.ylabel('$US', rotation=0, ha='right')
ticks_and_labels = plt.xticks(range(len(people)), people, rotation=0)
for i, label in enumerate(ticks_and_labels[1]):
    # Shift every second label down so neighbouring labels don't collide.
    label.set_y(label.get_position()[1] - (i % 2) * 0.05)
[/code]
(Note that I also rotated the Y-axis label, for even more readability.)

Another approach, which works with even longer labels and requires fewer lines of code, is to rotate the bars, not the labels.

… and if you don’t have a compelling reason for the data order, you might also consider sorting the bars. Doing so will not only make the graph prettier, it will also make it easier to compare similar values. Try using the graph above to tell whether Teresa Jackson’s sales were higher or lower than Marie Richardson’s, then do the same comparison using the graph below (a code sketch follows it).

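Here is a minimal sketch of that horizontal, sorted version, assuming the same `people` and `sales` variables as above:
[code language=”python”]
import numpy as np

order = np.argsort(sales)  # plot the bars in ascending order of value
plt.barh(range(len(people)), np.asarray(sales)[order])
plt.yticks(range(len(people)), np.asarray(people)[order])  # horizontal labels, nothing to rotate
plt.title('October sales')
plt.xlabel('$US')
[/code]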
To sum up: the fact that you can does not mean that you should. Sometimes rotating text labels is the easiest solution, but the additional effort needed to decipher the graph is the price your audience pays for your laziness. They might as well skip your graphs, and then your message won’t stick.
This was my because you can rant.
Featured image by Flickr user gullevek
-
On machine learning, job security, professional pride, and network trolling
November 21, 2017
If you are a data scientist, I am sure you have wondered whether deep neural networks will replace you at your job one day. Every time I read a report about researchers who managed to trick neural networks, I wonder whether the researchers were thinking about their job security or their professional pride while performing the experiments. I think that the first example of such a report is a 2014 paper by Christian Szegedy and his colleagues called “Intriguing properties of neural networks”. The main goal of this paper, it seems, was to peek into the black box of neural networks. In one of the experiments, the authors designed minor, invisible perturbations of the original images. These perturbations diminished the classification accuracy of a trained model.

In the recent post “5 Ways to Troll Your Neural Network”, Ben Orlin describes five different ways to “troll a network”.
Image credit: Figure 5 from “Intriguing properties of neural networks”.
-
Interactive Network Visualization in Python with NetworkX and PyQt5 Tutorial
November 20, 2017
Unfortunately, there is no widely accepted, ready-to-use, standard way to interactively visualize networks in Python. The following post shows yet another attempt to build an ad-hoc app.
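For context, the static (non-interactive) baseline takes only a couple of lines of NetworkX and matplotlib; it’s the interactivity that requires the real work. A minimal sketch using a built-in toy graph:
[code language=”python”]
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()  # a classic built-in example graph
nx.draw(G, with_labels=True, node_color='lightgray')
plt.show()  # a static picture: no panning, zooming, or node editing
[/code]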
-
Which of these two pictures should I choose for my gravatar?
November 16, 2017 -
We're Reading About Simplifying Without Distortion and Adversarial Image Classification
November 15, 2017
Weekly reading list from the data.blog team
-
Another set of ruthless critique pieces
November 15, 2017
You know that I like reading ruthless critiques of others’ work. I like telling myself that by doing so I learn good practices (in reality, I suspect I’m just a case of what we call in Hebrew שמחה לאיד – joy at someone else’s failure).
Anyhow, I’d like to share a set of posts by Lior Pachter in which he calls bullshit on several reputable people and concepts. Calling bullshit is easy; doing so with solid arguments is not. Lior Pachter worked hard to justify his opinions.
* The network nonsense of Albert-László Barabási. Albert-László Barabási is a renowned network scientist. There’s a network model named after him, and some people claim that prof. Barabási will receive the Nobel prize one day.
* The network nonsense of Manolis Kellis. Published one day after “The network nonsense of Albert-László Barabási”, this post critiques another renowned scientist, again with a lot of solid-sounding arguments.
* When average is not enough: part II. (“Where is part I?”, you may ask. Read the post to find out.)

Unfortunately, I don’t publish academic papers. But if I ever do, I will definitely want prof. Pachter to read them and let the world know what he thinks, for good and for bad.
Speaking of calling bullshit: believe it or not, the University of Washington has a course with this exact title. The course is available online at http://callingbullshit.org/ and is worth watching. I watched all the course’s videos during my last flight from Canada to Israel. The featured image of this post is a screenshot of the course’s homepage.
-
Good information + bad visualization = BAD
November 14, 2017
I was going through my Machine Learning tag feed when I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey actual information. The problem is that it is bad in so many ways that it is very hard to appreciate that information. This is especially true when the post talks about data science, a field that relies so much on data visualization.
via Math required for machine learning — Youth Innovation
I would write a post about good practices in pie charts, but Robert Kosara of https://eagereyes.org does this so well that I don’t need to reinvent the wheel. Pie charts are a very powerful tool for conveying information; make sure you use this tool well. I strongly suggest reading everything Robert Kosara has to say on this topic.
-
What are the best practices in planning & interpreting A/B tests?
November 13, 2017
Compiled by my teammate Yanir Seroussi, the following is a reading list on A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s; the reviews are mine. Collective intelligence in action :-)
* [If you don't pay attention, data can drive you off a cliff](https://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/). In this post, Yanir lists seven mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir's points are trivial doesn't make them less correct. Awareness doesn't exist without knowledge, but unfortunately, knowledge doesn't assure awareness. That is why reading trivial truths is a good thing to do from time to time.
* [How to identify your marketing lies and start telling the truth](https://www.linkedin.com/pulse/how-identify-your-marketing-lies-start-telling-truth-tiberio-caetano). This post was written by Tiberio Caetano, a data science professor at the University of Sydney. If I had to summarize it with a single phrase, that would be "confounding factors". A confounding variable is a hidden variable that influences a measured effect. For example: you start an ad campaign for ice cream, your sales go up, and you conclude that the campaign was effective, forgetting that the campaign started at the beginning of the summer, when people buy more ice cream anyhow. See [this link](https://onlinecourses.science.psu.edu/stat507/node/34) for a detailed, textbook-quality review of confounding variables.
* [Seven rules of thumb for web site experimenters](http://www.exp-platform.com/Documents/2014%20experimentersRulesOfThumb.pdf). I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
* [A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments](http://exp-platform.com/Documents/2017-08%20KDDMetricInterpretationPitfalls.pdf). Another academic paper by Microsoft researchers. This one lists a lot of "don'ts". Like in the previous link, every piece of advice the authors give is based on established theory and backed up by real data.
-
How to make a racist AI without really trying (a reblog)
November 10, 2017
Perhaps you heard about Tay, Microsoft’s experimental Twitter chat-bot, and how within a day it became so offensive that Microsoft had to shut it down and never speak of it again. And you assumed that you would never make such a thing, because you’re not doing anything weird like letting random jerks on Twitter re-train […]
via How to make a racist AI without really trying — ConceptNet blog
-
Please leave a comment on this post
November 9, 2017
Please leave a comment on this post. It doesn’t matter what you write; it can be short or long. Any comment. I need to know that humans read this blog. If you feel really generous, tell me how you found this blog and what you think of it.
-
Data Science or Data Hype?
November 8, 2017
In his blog post Big Data Or Big Hype?, Rick Ciesla asks whether the “Big Data” phenomenon is a real thing or just hype. I must admit that, until recently, I was sure the term “Data Science” was hype too – an overbroad term describing various engineering and scientific activities. As time passes, I become more and more confident that Data Science is maturing into a separate profession. I haven’t yet decided whether the word “science” is fully appropriate in this case.

We have certainly heard a lot about Big Data in recent years, especially with regards to data science and machine learning. Just how large of a data set constitutes Big Data? What amount of data science and machine learning work involves truly stratospheric volumes of bits and bytes? There’s a survey for that, courtesy of […]
-
Do you REALLY need the colors?
November 7, 2017
Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Look at this example from the seaborn documentation site:
>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> tips = sns.load_dataset("tips")
>>> ax = sns.barplot(x="day", y="total_bill", data=tips)
The example above shows the default barplot. Can you see how easy it is to add colors to the different columns? But WHY? What do those colors represent? It looks like the only information encoded by the color is the bar category, and we already have this information in the form of bar location. The colorful image adds nothing but a distraction. It is sad that this is the default behavior the seaborn developers decided to adopt.
Look at the same example, without the colors:
>>> ax = sns.barplot(x="day", y="total_bill", color='gray', data=tips)
Isn’t it much better? The sad thing is that a better version requires memorizing additional arguments and more typing.
This was my because you can rant.
-
Numpy vs. Pandas: functions that look the same, share the same code but behave differently
November 6, 2017
I can’t imagine what my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason is that, under the hood, pandas uses numpy. This similarity is very convenient, as it allows passing numpy arrays to many pandas functions and vice versa. However, sometimes it stabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.
Let’s create a numpy vector with a single element in it:
>>> import numpy as np
>>> v = np.array([3.14])

Now, let's compute the standard deviation of this vector. According to the [definition](https://en.wikipedia.org/wiki/Standard_deviation), we expect it to equal zero.

>>> np.std(v)
0.0

So far so good. No surprises.
Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation to be now?
>>> import pandas as pd
>>> s = pd.Series(v)
>>> s.std()
nan

What? Not a number? What the hell? It's not an empty vector! I didn't ask for the corrected sample standard deviation. Wait a second…
>>> s.std(ddof=0)
0.0

Now I start getting it. Compare this:
>>> print(np.std.__doc__)
Compute the standard deviation along the specified axis.
...
ddof : int, optional
    Means Delta Degrees of Freedom. The divisor used in calculations
    is ``N - ddof``, where ``N`` represents the number of elements.
    **By default `ddof` is zero**.

… to this:
>>> print(pd.Series.std.__doc__)
Return **sample** standard deviation over requested axis.
**Normalized by N-1 by default**. This can be changed using the ddof argument
...
**ddof : int, default 1**
    degrees of freedom

Formally, the pandas developers did nothing wrong. They decided that it makes sense to default to the sample standard deviation (normalized by N-1) when working with data tables, unlike numpy, which is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know about it even after working with both libraries for so long.
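One defensive habit this suggests (a sketch of mine, not part of the original exchange): always pass ddof explicitly to both libraries, so the normalization choice is visible in the code instead of buried in two different defaults.

>>> np.std(v, ddof=0), s.std(ddof=0)  # population std in both cases
(0.0, 0.0)
>>> np.std(v, ddof=1), s.std(ddof=1)  # sample std in both; N - 1 == 0 here, so both are nan (numpy also warns)
(nan, nan)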
To sum up:
>>> s.std()
nan
>>> v.std()
0.0
>>> s == v
0    True
dtype: bool

Beware.
-
When scatterplots are better than bar charts, and why?
November 5, 2017
From time to time, you might hear that graphical method A is better at representing problem X than method B, while for problem Z, method B is much better than A, though C is also a possibility. Did you ever ask yourselves (or the people who tell you that) “Says WHO?”
Guidelines like these come from theoretical and empirical studies. One such example is the 1985 paper “Graphical perception and graphical methods for analyzing scientific data” by Cleveland and McGill. I got the link to this paper from Varun Raj of https://varunrajweb.wordpress.com/.
It looks like a very interesting and relevant paper, despite the fact that it was published 32 years ago. I will certainly read it. Following is the reading list that I compiled for my data visualization students more than two years ago. Unfortunately, they didn’t want to read any of these papers. Maybe some of the readers of this blog will …
- Attention and Mental Primer
- Automating the Design of Graphical Presentations of Relational Information.
- Beyond Weber’s Law: A Second Look at Ranking Visualizations of Correlation
- Exogenous attention and color perception: Performance and appearance of saturation and hue
- High-Speed Visual Estimation Using Preattentive Processing
- How Deceptive are Deceptive Visualizations?: An Empirical Analysis of Common Distortion Techniques
- How NOT to Lie with Visualization
- How to evaluate models: Observed vs. predicted or predicted vs. observed?
- Narrative Visualization: Telling Stories with Data
- Patterns for Visualization Evaluation
- The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.
- The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations
-
Because you can — a new series of data visualization rants
November 1, 2017
Here’s an old joke:
Q: Why do dogs lick their balls?
A: Because they can.

Canine behavior aside, the fact that you can do something doesn’t mean that you should do it. I already wrote about one such example, when I compared chart legends to muttonchops.
Citing myself:
Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.
When reviewing my notes, I realized that I have more bad data visualization examples that share a common problem: the effortless addition of elements or features.
Stay tuned and check the because-you-can tag.
Featured image by Unsplash user Nicolas Tessari
-