Staying employable and relevant as a data scientist

A piece of common wisdom holds that creative jobs are immune to becoming irrelevant. This is what Brian Solis, the author of “Lifescale,” says on this matter:

On the positive side, historically, with every technological advancement, new jobs are created. Incredible opportunity opens up for individuals to learn new skills and create in new ways. It is your mindset, the new in-demand skills you learn, and your creativity that will assure you a bright future in the age of automation. This is not just my opinion. A thoughtful article in Harvard Business Review by Joseph Pistrui was titled, “The Future of Human Work Is Imagination, Creativity, and Strategy.” He cites research by McKinsey […]. In their research, they discovered that the more technical the work, the more replaceable it is by technology. However, work that requires imagination, creative thinking, analysis, and strategic thinking is not only more difficult to automate; it is those capabilities that are needed to guide and govern the machines.

Many people think that data science falls into the category of “creative thinking and analysis.” However, as time passes, this becomes less true. Here’s why.

As time passes, tools become stronger, smarter, and faster. This means that a problem that once required cutting-edge algorithms run by cutting-edge scientists on cutting-edge computers will become solvable using a commodity product. “All you have to do” is to apply domain knowledge, select a “good enough” tool, get the results, and act upon them. You’ll notice that I put two phrases in quotation marks. First, “all you have to do.” I know that it’s not as simple as “just add water,” but it gets simpler.

“Good enough” is also a tricky part. Selecting the right algorithm for a problem has a dramatic effect on tough cases but is less important for easy ones. Think of a sorting algorithm. I remember how my algorithms professor used to talk about how important it was to select the right sorting algorithm for the right problem. That was almost twenty years ago. Today, I simply write list.sort() and I’m done. Maybe one day I will have to sort billions of data points in less than a second on a tiny CPU without RAM, which will force me into developing a specialized solution. But in 99.999% of cases, list.sort() is enough.

Back to data science. I think that in the near future, we will see more and more analogs of list.sort(). What does that mean for us, data scientists? I am not sure. What I am sure of is that in order to stay relevant, we have to learn and evolve.

Featured image by Héctor López on Unsplash

Is security through obscurity back?

HBR published an opinion post by Andrew Burt, called “The AI Transparency Paradox.” This post talks about the problems that were created by tools that open up the “black box” of a machine learning model.

“Black box” refers to the situation where one can’t explain why a machine learning model predicted whatever it predicted. Explainability is not only important when one wants to improve the model or to pinpoint mistakes; it is also an essential feature in many fields. For example, when I was developing a cancer detection model, every physician asked to know why we thought a particular patient had cancer. That is why I’m happy that so many people develop tools that allow peeking into the black box.

I was very surprised to read the “transparency paradox” post. Not because I couldn’t imagine that people would use the insights to hack the models. I was surprised because the post reads like a case for security through obscurity, an ancient practice that was mostly eradicated from the mainstream.

Yes, ML transparency opens opportunities for hacking and abuse. However, this is EXACTLY the reason why such openness is needed. Hacking attempts will not disappear when transparency is removed; they will only be harder to defend against.

Book review. A Short History of Nearly Everything by Bill Bryson

TL;DR: a nice popular science book that covers many aspects of modern science

A Short History of Nearly Everything by Bill Bryson is a popular science book. I didn’t learn anything fundamental out of this book, but it was worth reading. I was particularly impressed by the intrigues, lies, and manipulations behind so many scientific discoveries and discoverers. 

The main “selling point” of this book is that it answers the question, “How do scientists know what they know?” How, for example, do we know the age of Earth or the skin color of the dinosaurs? The author indeed provides some insight. However, because the book tries to talk about “nearly everything,” the answer isn’t focused enough. Simon Singh’s book “Big Bang” concentrates on cosmology and provides better insight into the question of “how do we know what we know.”

Interesting takeaways and highlights

  • On the problem that our Universe is unlikely to have been created by chance: “Although the creation of Universe is very unlikely, nobody knows about failed attempts.”
  • The Universe is unlimited but finite (think of a circle)
  • Developments in chemistry were the driving force of the industrial revolution. Nevertheless, chemistry wasn’t recognized as a scientific field in its own right for several decades

The bottom line: Read if you have time. 3.5/5.

Cow shit, virtual patient, big data, and the future of the human species

Yesterday, a new episode of the Popcorn podcast was published, in which the host, Lior Frenkel, interviewed me. Everyone who knows me knows how much I love talking about myself and what I do. I definitely used this opportunity to talk about the world of data. Some people who listened to this episode told me that they enjoyed it a lot. If you know Hebrew, I recommend that you listen to this episode.

Data visualization as an engineering task – a methodological approach towards creating effective data visualization

In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

That was a very interesting conference, packed with interesting talks.

Next year, I plan to attend the Bucharest edition of NDR, where I will also give a talk with the working title “The biggest missed opportunity in data visualization.”

A tangible productivity tool (and a book review)

One month ago, I stumbled upon a book called “Personal Kanban: Mapping Work | Navigating Life” by Jim Benson (all the book links use my affiliate code). Never before have I seen a bigger discrepancy between the value that a book gave me and its actual content.

Even before finishing the first chapter of this book, I realized that I wanted to incorporate “personal kanban” into my productivity system. The problem was that the entire book could be summarized by a blog post or by a YouTube video (such as this one). The rest of the book contains endless repetitions and praises. I recommend not reading this book, even though it strongly affected the way I work.

So, what is Personal Kanban anyhow? Kanban is a productivity approach that puts all the tasks in front of a person on a whiteboard. Usually, Kanban boards are physical boards with post-it notes, but software Kanban boards are also widely known (Trello is one of them). The following are the claims that Jim Benson makes in his book that resonated with me:

  • Many productivity approaches view personal and professional life separately. The reality is that these two aspects of our lives are not separate at all. Therefore, a productivity method needs to combine them.
  • Having all the critical tasks in front of your eyes helps to get the global picture. It also helps to group the tasks according to their contexts. 
  • The act of moving notes from one place to another gives valuable tangible feedback. This feedback has many psychological benefits.
  • One should limit the number of work-in-progress tasks.
  • There are three different types of “productivity.” You are Productive when you work hard. You are Efficient when your work is actually getting done. Finally, you are Effective when you do the right job at the right time, and can repeat this process if needed. 

I’m a longtime user of a productivity method that I adopted from Mark Forster. You may read about my process here. Having read Personal Kanban, I decided to combine it with my approach. According to the plan, I keep the more significant tasks on my Kanban board, which I use to make daily, weekly, and long-term plans. For the day-to-day (and hour-to-hour) tasks, I still use my notebooks.

Initially, I used my whiteboard for this purpose, but something wasn’t right about it.

Having my Kanban on my home office whiteboard had two significant drawbacks. First, the whiteboard isn’t with me all the time. And what is the point of putting your tasks on a board if you can’t see it? Second, listing everything on a whiteboard has some privacy issues. After some thought, I decided to migrate the Kanban to my notebook.

In this notebook, I have two spreads. The first spread is used for the backlog and “this week” tasks. The second spread has the “today,” “doing,” “wait,” and “done” columns. The fact that the notebook is smaller than the whiteboard turned out to be a useful feature. This physical limitation limits the number of tasks I put on my “today” and “doing” lists.

I organize the tasks at the beginning of my working day. The rest of the system remains unchanged. After more than a month, I’m happy with this new tangible productivity method.

Data science tools with a graphical user interface

A Quora user asked about data science tools with a graphical user interface. Here’s my answer. I should mention, though, that I don’t usually use GUIs for data science. Not that I think GUIs are bad; I simply couldn’t find a tool that works well for me.

Of the many tools that exist, I like Orange (https://orange.biolab.si/) the most. Orange allows the user to create data pipelines for exploration, visualization, and production but also allows editing the “raw” Python code. The combination of these features makes it a powerful and flexible tool.

The major drawback of Orange (in my opinion) is that it uses its own data format and its own set of models that are not 100% compatible with the Numpy/Pandas/Sklearn ecosystem.

I have made a modest contribution to Orange by adding a six-line function that computes the Matthews correlation coefficient.
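For readers who haven’t met the Matthews correlation coefficient: it summarizes a binary confusion matrix as a single number between -1 and 1. The sketch below is a plain-Python illustration of the formula, not the actual Orange code. The function name, the choice to return 0.0 when the denominator is zero, and the counts are assumptions made for this example.

```python
from math import sqrt


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0.0 when a row or a column of the matrix is empty
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only
print(mcc_from_counts(tp=90, tn=80, fp=20, fn=10))
```

A perfect classifier scores 1, a perfectly wrong one scores -1, and a classifier that always predicts the same class scores 0 under the convention assumed above.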

Other tools are KNIME and Weka (neither of them is natively Python).

There is also RapidMiner, but I have never used it.

Working in a distributed company. Communication styles

I work at Automattic, one of the largest distributed companies in the world. Working in a distributed company means that everybody in this company works remotely. There are currently about one thousand people working in this company from about seventy countries. As you might expect, the international nature of the company poses a communication challenge. Recently, I had a fun experience that demonstrates how different people are.

Remote work means that we use text as our primary communication tool. Moreover, since the company spans all the time zones in the world, we mostly use asynchronous communication, which takes the form of posts in internal blogs. A couple of weeks ago, I completed a lengthy analysis and summarized it in a post that was meant to be read by the majority of the company. Being a responsible professional, I asked several people to review the draft of my report.

To my embarrassment, I discovered that I made a typo in the report title, and not just a typo: I misspelled the company name :-(. A couple of minutes after asking for a review, two of my coworkers pinged me on Slack and told me about the typo. One message was, “There is a typo in the title.” Short, simple, and concise.

The second message was much longer.

Do you want to guess what the difference between the two coworkers is?
.
.
.
.
.
Here’s the answer
.
.
.
.
The author of the first (short) message grew up and lives in Germany. The author of the second message is American. Germany, United States, and Israel (where I am from) have very different cultural codes. Being an Israeli, I tend to communicate in a more direct and less “sweetened” way. For me, the American communication style sounds a little bit “artificial,” even though I don’t doubt the sincerity of this particular American coworker. I think that the opposite situation is even more problematic. It happened several times: I made a remark that, in my opinion, was neutral and well-intended, and later I heard comments about how I sounded too aggressive. Interestingly, all the commenters were Americans.

To sum up. People from different cultural backgrounds have different communication styles. In theory, we all know that these differences exist. In practice, we usually are unaware of them.

Featured photo by Stock Photography on Unsplash

Sometimes, you don’t really need a legend

This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python:

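import matplotlib.pyplot as plt

# gdp is a pandas DataFrame with a Year column and one column per country
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.legend()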

The last line in the code above works a little magic and adds a nice legend.

In Excel, we don’t even need to do anything; the legend is added for us automatically.

So, what is the problem?

What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following. We either jump from the graph to the legend dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Thinking, Fast and Slow” to learn more about how our brain works).

What would be the shortcut here? Well, note how the line for Israel lies mostly below the line for Italy, which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the order of the lines in the graph isn’t preserved in the legend. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

And if we have more lines in the graph, the situation is even worse.

Can we improve the graph?

Yes, we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands:

plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.legend()

We still have to work hard but at least we can trust our brain’s shortcut.
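Reordering the plotting commands by hand works for a handful of lines. Here is a minimal sketch of doing it automatically: sort the series by their last value, highest first, and plot them in that order. The function name legend_order and the numbers are hypothetical, for illustration only.

```python
def legend_order(last_values):
    """Return series names sorted from the highest to the lowest final value."""
    return sorted(last_values, key=last_values.get, reverse=True)


# Hypothetical final-year values, for illustration only (not real GDP data)
last_values = {"Israel": 3.0, "France": 5.0, "Italy": 4.0}
order = legend_order(last_values)
print(order)  # ['France', 'Italy', 'Israel']
```

With the gdp table used in this post, and assuming its rows are sorted by year, last_values could be taken from the final row, for example with gdp.drop(columns='Year').iloc[-1].to_dict(). Looping over legend_order(last_values) and calling plt.plot(gdp.Year, gdp[c], '-', label=c) for each name c then produces a legend whose entries follow the order of the line ends.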

If we have more time

If we have some more time, we may get rid of the (classical) legend altogether.

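import matplotlib.pyplot as plt
import seaborn

# every column except Year holds the series of one country
countries = [c for c in gdp.columns if c != 'Year']
fig, ax = plt.subplots()
for i, c in enumerate(countries):
    # 'C0', 'C1', ... are the colors of matplotlib's default color cycle
    ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
    # put the country name at the right end of its line, in the same color
    x = gdp.Year.max()
    y = gdp[c].iloc[-1]
    ax.text(x, y, c, color=f'C{i}', va='center')
seaborn.despine(ax=ax)  # remove the top and right spines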

(if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

Isn’t it better? Now, the viewer doesn’t need to zap from the lines to the legend; we show them all the information in the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome?

This graph is much easier to digest than the first one, and it also provides more useful information.

[image: image-6.png]

I agree that this is a mess. Life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look what he did in a similar situation.

percent-bachelors-degrees-women-usa

I also recommend reading my older post where I compared graph legends to muttonchops.

In conclusion

Sometimes, no legend is better than a legend.

This post, in Hebrew: [link]

What do we see when we look at slices of a pie chart?

What do we see when we look at slices of a pie chart? Angles? Areas? Arc length? The answer to this question isn’t clear, and thus “experts” recommend avoiding pie charts altogether.

Robert Kosara is a Senior Research Scientist at Tableau Software (you should follow his blog https://eagereyes.org), who is very active in studying pie charts. In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies. 

Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion:

While this study suggests that the charts are read by area, it is not conclusive. In particular, the possibility of pie chart users re-projecting the chart to read them cannot be ruled out. Further experiments are therefore needed to zero in on the exact mechanism by which this common chart type is read.

Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

Kosara’s previous studies had strong practical implications, the most important being that pie charts are not evil provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answers to the questions at the beginning of this post are still unclear. Maybe the “real answer” to these questions is “a combination thereof.”

The problem with citation count as an impact metric

Inspired by “A citation is not a citation is not a citation” by Lior Pachter, this rant is about metrics.

Lior Pachter is a researcher at Caltech. Like many other researchers in academia, Dr. Pachter is measured by, among other things, his publications and their impact as measured by citations. In his post, Lior Pachter criticized both the current impact metrics and their effect on citation patterns in the academic community.

The problem he points out: citation counts don’t really measure “actual” citations. Most of the citations that appear are “hit and run citations,” i.e., people mention other people’s research without taking anything from that research.

In fact this author has cited [a certain] work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.

via A citation is not a citation is not a citation — Bits of DNA

I think that the biggest problem with citation counts is that it costs nothing to cite a paper. When you add a paper (or a post, for that matter) to your reference list, you know that most probably nobody will check whether you actually read it, that nobody will check whether you understood that publication correctly, and that the chances are super (SUUPER) low that anybody will check whether your conclusions are right. All it takes is to click a button.

Book review. The War of Art by S. Pressfield

TL;DR: This is a long motivational book that is “too spiritual” for the cynic materialist that I am.

The War of Art is a strange book. I read it because “everybody” recommended it. This is what Derek Sivers’ book recommendation page says about this book:

Have you experienced a vision of the person you might become, the work you could accomplish, the realized being you were meant to be? Are you a writer who doesn’t write, a painter who doesn’t paint, an entrepreneur who never starts a venture? Then you know what “Resistance” is.

As a known procrastinator, I was intrigued and started reading. In the beginning, the book was pretty promising. The first (and, I think, the biggest) part of the book is about “Resistance,” the force behind procrastination. I immediately noticed that almost every sentence in this part could serve as a motivational poster. For example:

  • It’s not the writing part that’s hard. What’s hard is sitting down to write.
  • The danger is greatest when the finish line is in sight.
  • The most pernicious aspect of procrastination is that it can become a habit.
  • The more scared we are of a work or calling, the more sure we can be that we have to do it.

Individually, each sentence makes sense, but their concentration was a bit too much for me. The way Pressfield talks about Resistance resembles the way Jewish preachers talk about Yetzer Hara: it sits everywhere, waiting for you to fail. I don’t like this approach.

The next chapters were even harder for me to digest. Pressfield started talking about Muses, gods, prayers, and other “spiritual” stuff; I almost gave up. But I fought the Resistance and finished the book.

My main takeaways:

  • Resistance is real
  • It’s a problem
  • The more critical the task is, the stronger the Resistance is. OK, I kind of agree with this. Pressfield continues to something I do not agree with: thus (according to the author), we can measure the importance of a task by the Resistance it creates.
  • Justifying not pursuing a task by commitments to the family, job, etc. is a form of Resistance.
  • The Pro does stuff.
  • The Artist is a Pro (see above) who does stuff even if nobody cares.

Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On Sunday, I wrote about bootstrapping. On Monday, I wrote about visualizing uncertainty. Let’s now talk about bootstrapping and uncertainty visualization.

Robert Grant is a data visualization expert who wrote a book about interactive data visualization (which I should read, BTW).

Robert runs an interesting blog from which I learned another approach to uncertainty visualization, bootstrapping.

Source: Robert Grant.

Read the entire post: Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On MOOCs

When Massive Open Online Courses (a.k.a. MOOCs) emerged some X years ago, I was ecstatic. I was sure that MOOCs were the Big Boom of higher education. Unfortunately, the MOOC impact turned out to be very modest. This modest impact, combined with the high production cost, was one of the reasons I quit making my online course after producing two or three lectures. Nevertheless, I don’t think MOOCs are dead yet. Following are some links I recently read that provide interesting insights into MOOC production and consumption.

  • A systematic study of academic engagement in MOOCs that is scheduled for publication in the November issue of Erudit.org. This 20+ page-long survey summarizes everything we know about MOOCs today (I have to admit, I only skimmed through this paper, I didn’t read all of it)
  • A Science Magazine article from January 2019. The article, “The MOOC pivot,” sheds light on the very low retention numbers in MOOCs.
  • On MOOCs and video lectures. Prof. Lorena Barba from George Washington University explains why her MOOCs are not built for video. If you are considering creating an online class, you should read this.
  • The economic consequences of MOOCs. A concise summary of a 2018 study that suggests that the economic impact of MOOCs is high despite the high churn rates.
  • Thinkful.com, an online platform that provides personalized training to aspiring data professionals, got in the news three weeks ago after being purchased for $80 million. Thinkful isn’t a MOOC per se, but I have a special relationship with it: a couple of years ago, I was accepted as a mentor at Thinkful but couldn’t find time to actually mentor anyone.

The bottom line

We still don’t know what this future will look like or how MOOCs will interact with the legacy education system, but I’m sure that MOOCs are the future.

Error bars in bar charts. You probably shouldn’t

This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

It started with a paper by Prof. Gerd Gigerenzer, whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks,” contained a simple graph that was meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph are meant to convey uncertainty. However, the visualization that Gigerenzer and his team chose is simply wrong.

First of all, look at the leftmost bar. It demonstrates so many problems with error bars in general, and with error bars in bar plots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in a negative percentage of correct inferences?

The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated Risks” from cover to cover. Twice.

Why is this important?

Communicating uncertainty is super important. Take a look at this 2018 study with the self-explanatory title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

Source HuffPost

Also recall the surprise when Donald Trump won the presidential election despite the fact that most of the polls predicted that Hillary Clinton had a higher chance of winning. Nobody cared about uncertainty; everyone just saw the graphs!

Why not error bars?

Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

First of all, error bars tend to be symmetric (although they don’t have to be), which might lead to the situation that we saw in the first example above: implying illegal values.
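To see how a symmetric error bar can imply illegal values, here is a small sketch with made-up numbers (1 correct inference out of 20; these are not the numbers from Gigerenzer’s paper). It compares the symmetric normal-approximation interval with the Wilson score interval, one well-known asymmetric alternative for proportions. The function names are assumptions made for this example.

```python
from math import sqrt

Z95 = 1.96  # approximate two-sided 95% normal quantile


def symmetric_interval(successes, n, z=Z95):
    """Normal-approximation (Wald) interval: p plus/minus z standard errors."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval; stays inside [0, 1] up to rounding."""
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


# Hypothetical data: 1 correct inference out of 20
print(symmetric_interval(1, 20))  # the lower bound is negative
print(wilson_interval(1, 20))     # both bounds lie between 0 and 1
```

The symmetric interval dips below zero, which is exactly the “negative percentage” problem, while the asymmetric one respects the legal range.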

Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.

More specifically to bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is seen inside the physical bar. See?

But that’s not all. The width of the bars separates the error lines and makes the comparison even harder. Compare the readability of error lines in the two examples below:

The proximity of the error lines in the second example (taken from this site) makes the comparison easier.

Are there better alternatives?

Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why but also surveys several alternatives.

Nathan Yau from flowingdata.com had an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts and more.

Claus Wilke’s book “Fundamentals of Data Visualization” has a chapter dedicated to uncertainty, with an even more detailed review [link].

“Visualizing uncertainty about the future” is a Science article that deals specifically with forecasts.

Robert Kosara from Tableau experimented with visualizing uncertainty in parallel coordinates.

There are many more examples and experiments, but I think that I will stop right now.

The bottom line

Communicating uncertainty is important.

Know your tools.

Try avoiding error bars.

Bars and bars don’t combine well; therefore, try even harder to avoid error bars in bar charts.

You don’t need a fast way to increase your reading speed by 25%. Or, don’t suppress subvocalization

Not long ago, I wrote a post about a fast hack that increased my reading speed by tracking the reading with a finger. I think that the logic behind using a tracking finger is to suppress subvocalization. I noticed that, at least in my case, suppressing subvocalization reduces the fun of reading. I actually enjoy hearing the inner voice that reads the book “with me”.

Bootstrapping the right way?

Many years ago, I terribly overfit a model, which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping the wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeking into the future is a bad thing.

Anyhow, Yanir Seroussi, a fellow data scientist and my coworker, gave a very good talk on bootstrapping.
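For readers who have never used the technique, here is a minimal sketch of the simplest variant, a percentile bootstrap confidence interval for a mean. It is not taken from Yanir’s talk and it is not the simulation I ran back then; the function name, the parameters, and the data are assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    means = sorted(
        mean(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical observations, for illustration only
data = [2, 4, 4, 5, 7, 9, 10, 12]
print(bootstrap_mean_ci(data))
```

Note that this naive version treats the observations as independent and interchangeable. It does nothing about the two pitfalls mentioned above: with confounders or with time-ordered data, resampling rows blindly can look reassuring and still be wrong.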

What do I look like?

From time to time, people (mostly conference organizers) ask for a picture of me. Feel free to use any of these images.

Visualizations with perceptual free-rides

Dr. Richard Brath is a data visualization expert who also blogs from time to time. Each post in Richard’s blog provides a deep, and often unexpected to me, insight into one dataviz aspect or another.

We create visualizations to aid viewers in making visual inferences. Different visualizations are suited to different inferences. Some visualizations offer more additional perceptual inferences over comparable visualizations. That is, the specific configuration enables additional inferences to be observed directly, without additional cognitive load. (e.g. see Gem Stapleton et al, Effective Representation of Information: Generalizing Free Rides, 2016).

Here’s an example from 1940, a bar chart where both bar length and width indicate data:

The length of the bar (horizontally) is the percent increase in income in each industry.  Manufacturing has the biggest increase in income (18%), Contract Construction is second at 13%.

The width of the bar (vertically) is the relative size of that industry: Manufacturing is wide – it’s the biggest industry – it accounts for about 23% of all industry. Contract Construction is narrow, perhaps the third smallest industry, perhaps around 3-4%.

What’s really interesting is that

View original post 446 more words

Book review. Indistractable by Nir Eyal

Nir Eyal is known for his book “Hooked,” in which he teaches how to create addictive products. In his new book, “Indistractable,” Nir teaches how to live in a world full of addictive products. The book itself isn’t bad. It provides interesting information and, more importantly, practical tips and action items. Nir covers topics such as digital distraction, productivity, and procrastination.

Indistractable Control Your Attention Choose Your Life Nir Eyal 3D cover

I liked the fact that the author “gives permission” to spend time on Facebook, Instagram, Youtube etc, as long as it is what you planned to do. Paraphrasing Nir, distraction isn’t distraction unless you know what it distracts you from. In other words, anything you do is a potential distraction unless you know what, why and when you are doing it.

My biggest problem with this book is that I already knew almost everything that Nir wrote. Maybe I have already read too many similar books and articles, maybe I’m just that smart (not really), but for me, most of Indistractable wasn’t valuable.

Until I got to the chapter that deals with raising children (“Part 6, how to raise indistractable children”). I have to admit, when it comes to speaking about raising kids in the digital era, Nir is a refreshing voice. He doesn’t join the global hysteria of “the screens make zombies of our kids”. Moreover, Nir brings a nice collection of hysterical prophecies from the 15th, 18th and 20th centuries in which “experts” warned about the bad influence new inventions (such as printed books, affordable education, radio) had on the kids.

Another nice touch is the fact that each chapter has a short summary that consists of three or four bullet points. Even nicer is the fact that Nir collected all the “Remember this” lists at the end of the book, which is very kind of him.

The Bottom line. 4/5. Read.

14-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar, and it starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually falls in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles)**. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days on top of the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last days of Sukkot are mostly considered half working days. In addition, the children are at home, since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 2008 and 2023 CE, and this is what we get:

Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is what the working/non-working time during this month looks like:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted workday, but on a different scale.
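The counting described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the numbers in this post. The function name `working_day_equivalents` is made up for this sketch, and the 2019 dates are assumed inputs (the civil dates on which, to the best of my knowledge, the holidays fell that year), so check them against a calendar before relying on them.

```python
from datetime import date, timedelta


def working_day_equivalents(start, n_days, rest_days, half_days):
    """Count working days in a period; half days count as 0.5."""
    total = 0.0
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        # In Israel the weekend is Friday (weekday 4) and Saturday (weekday 5).
        if day.weekday() in (4, 5) or day in rest_days:
            continue
        total += 0.5 if day in half_days else 1.0
    return total


# Assumed civil dates for autumn 2019; illustrative input, not data from the post.
holidays_2019 = {
    date(2019, 9, 30), date(2019, 10, 1),  # Rosh-HaShana
    date(2019, 10, 9),                     # Yom Kippur
    date(2019, 10, 14),                    # first day of Sukkot
    date(2019, 10, 21),                    # last day of Sukkot
}
holiday_eves_2019 = {date(2019, 9, 29), date(2019, 10, 8), date(2019, 10, 13)}
# Days between the first and the last Sukkot days, treated as half days.
sukkot_half_days_2019 = {date(2019, 10, d) for d in range(15, 21)}

work_days_2019 = working_day_equivalents(
    start=date(2019, 9, 29),  # the eve of Rosh-HaShana
    n_days=31,
    rest_days=holidays_2019 | holiday_eves_2019,
    half_days=sukkot_half_days_2019,
)
print(work_days_2019)
```

Each day is examined exactly once, so a holiday that falls on a Friday or a Saturday is not counted twice.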

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.
(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

A fast way to increase your reading speed by 25%

I was skeptical, but I tried it, measured, and arrived at a conclusion. First, I set a timer to 60 seconds and read some text. I managed to read seventeen lines. Then, I used my finger to guide my eyes the same way kids do when they learn to read. It turned out that I was able to read about 25% more lines of text, simply by using my finger. Impressive.

Book review: The Formula by A. L Barabasi

The bottom line: read it but use your best judgement 4/5

I recently finished reading “The Formula. The Universal Laws of Success” by Albert-László Barabási. Barabási is a network science professor who co-authored the “preferential attachment” paper (a.k.a. the Barabási-Albert model). People who follow him closely are either avid fans or haters who accuse him of nonsense science.

For several years, A-L Barabási has been talking and writing about the “science of success” (yeah, I can hear some of my colleagues laughing right now). Recently, he summarized the research in this area in an easy-to-read book with the promising title “The Formula. The Universal Laws of Success.” The main takeaways that I took from this book are:

  • Success is about us, not about you. In other words, it doesn’t matter how hard you work and how good your work is, if “we” (i.e., the public) don’t know about it, or don’t see it, or attribute it to someone else.
  • Be known for your expertise. Talk passionately about your job. The people who talk about an idea will get the credit for it. Consider the following example from the book. Let’s say Prof. Barabási and the Pope write a joint paper. If the paper is about network science, it will be perceived as if the Pope helped Barabási with writing it. If, on the other hand, it is a theological work, we will immediately assume that the Pope was the leading force behind it.
  • It doesn’t matter how old you are; success can come to you at any age. It is a well-known fact that most successful people broke into success at a young age. What Barabási claims is that the reason for that is not a form of ageism but the fact that older people try less. According to this claim, as long as you are creative and work hard, your most significant success is ahead of you.
  • Persistence pays. This is another claim that Barabasi makes in his book. It is related to the previous one but is based on a different set of observations (did you know that Harry Potter was rejected twelve times before it was published?). I must say that I’m very skeptical about this one. Right now, I don’t have the time to explain my reasons, and I promise to write a dedicated post.

Keep in mind that the author uses academic success (the Nobel prize, citation index, etc.) as the metric for most of his conclusions. This limitation doesn’t bother him. After all, Barabási is a full-time university professor. Most of us, however, should add another grain of salt to the conclusions.

Overall, if you find yourself thinking about your professional future, or if you are looking for good career advice, I recommend reading this book.

Why should you speak at conferences?

In this post, I will try to convince you that speaking at a conference is an essential tool for professional development.

Many people are afraid of public speaking. They avoid speaking in front of an audience and only do it when someone forces them to. This fear has deep evolutionary origins (thousands of years ago, if dozens of people were staring at you, that would probably mean that you were about to become their meal). However, if you work in a knowledge-based industry, your professional career can gain a lot if you force yourself to speak.

Two days ago, I spoke at NDR, a machine learning/AI conference in Iași, Romania. That was a very interesting conference, with a diverse panel of speakers from different branches of the data-related industry. However, the talk that I enjoyed the most was mine. Not because I’m a narcissistic, self-loving egoist. What I enjoyed the most were the questions that the attendees asked me during the talk and in the coffee breaks after it. First of all, these questions were a clear signal that my message resonated with the audience and that they cared about what I had to say. This is a nice boost to one’s ego. But more importantly, these questions pointed out that there are several topics that I need to learn to become more professional in what I’m doing. Since most of the time, we don’t know what we don’t know, such an insight is almost priceless.

That is why even (and especially) if you are afraid of public speaking, you should jump into the cold water and do it. Find a call for presentations and submit a proposal TODAY.

And if you are afraid of that awkward silence when you ask “are there any questions” and nobody reacts, you should read my post “Any Questions? How to fight the awkward silence at the end of the presentation“.

Curated list of established remote tech companies

Someone asked me about distributed companies or companies that offer remote positions. Of course, my first response was Automattic but that person didn’t think that Automattic was a good fit for them. So I googled and was surprised to discover that my colleague, Yanir Seroussi, maintains a list of companies that offer remote jobs.

I work at Automattic, one of the biggest distributed-only companies in the world (if not the biggest one). Recently, Automattic’s founder and CEO, Matt Mullenweg, started a new podcast called (surprise) Distributed.

The direction of the horizontal axis in documents written from right to left

I am looking for more examples

Do you have an example of a “reversed” Hebrew graph? Graphs in Arabic or Farsi? Send them to me.

X-axis direction in Right-To-Left languages (part two)

I need more examples

Do you have more examples of graphs written in Arabic, Farsi, Urdu or another RTL language? Please send them to me.

Textbook examples

I already wrote about my interest in data visualization in Right-To-Left (RTL) languages. Recently, I got copies of high school calculus books from Jordan and the Palestinian Authority.

Both Jordan and the PA use the same (Jordanian) school program. In both cases, I was surprised to discover that they almost never use Latin or Greek letters in their math notation. Not only that, the entire direction of the mathematical notation is from right to left. Here’s an illustrative example from the Palestinian book.

Screenshot: Arabic text, Arabic math notation and a graph

And here is an example from Jordan

What do we see here?

  • The use of Arabic numerals (which are sometimes called Eastern Arabic numerals)
  • The Arabic letters س (sin) and ص (saad) are used “instead of” x and y (the Arabic alphabet doesn’t have the notion of capital letters). The letter qaf (ق) is used as the archetypical function name (f). For some reason, the capital Greek Delta is here.
  • More interestingly, the entire math is “mirrored” compared to the Left-To-Right world, including the operand order. Not only is the operand order “mirrored”; many other pieces of math notation are mirrored as well, such as the square root sign, limits, and others.

Having said all that, one would expect to see the numbers on the X-axis (sorry, the س-axis) run from right to left. But no. The numbers on the graph run from left to right, similarly to the LTR world.

What about mathematics textbooks in Hebrew?

Unfortunately, I don’t have a copy of a Hebrew-language calculus book, so I will use a fifth-grade math book.

Despite the fact that Hebrew text flows from right to left, we (the Israelis) write our math notation from left to right. I have never seen any exceptions to this rule.

In this particular textbook, the X axis is set up from left to right. This direction is obvious in the upper example. The lower example lists months, from January to December. Despite the fact that the month names are written in Hebrew, their direction is LTR. Note that this is not an obvious choice. In many versions of Excel, for example, the default direction of the X axis in a Hebrew document is from right to left.

Talking about productivity methods

The best way to procrastinate is to research productivity.

Boris Gorelik

This week, the majority of Automattic Data Division meets in person in Vienna. During one of the sessions I presented my productivity method to my friends and coworkers.

Presenting this method was a fun and enjoyable experience for me. I decided to try doing this again, in a more formal and structured way. If you know of any productivity-oriented meetups that might be interested in hearing me, let me know.

Some post-talk notes

It turns out that the method I’m using is much closer to Mark Forster’s “Final Version” than to his AutoFocus.

Over the years, Mark Forster has created and tested many time management approaches. Scan through this page http://markforster.squarespace.com/tm-systems to find something that might work for you.

An interesting way to beat procrastination when working from home

Working from home (or a coffee shop, or a library) is great. However, there is one tiny problem: the temptation not to work is sometimes much bigger than in a traditional office. In a traditional office, you are expected to look busy, which is the first step toward doing actual work. When you work from home, nobody cares if you get up to have a cup of coffee or water the plants. This is GREAT, but sometimes this freedom is too much. Sometimes, you wish someone would give you that look to encourage you to keep working.

This is the exact problem that Taylor Jacobson, the founder of https://focusmate.com, is trying to solve. Here’s how Focusmate works. You schedule a fifty-minute appointment with a random partner. During the session, you and your partner have exactly sixty seconds to tell each other what you want to achieve during the next fifty minutes and then start working, keeping the camera on. At the end of the session, you and your partner tell each other how your session went. That’s it.

I signed up for this service and participated in two such sessions. I really liked the result. During that hour, I had the urge to get up for a coffee, to make phone calls, etc. But the fact that I saw someone on my screen, and the fact that they saw me, stopped me. The result: 50 minutes of uninterrupted work. I didn’t even check Twitter, despite the fact that my buddy couldn’t see my screen.

I heard about this service in a podcast episode that was recommended to me by my coworker Ian Dunn. Focusmate is absolutely free for now. In that podcast episode, Taylor (the founder) talks about the possible business models. Interestingly, when Taylor tried to crowd-fund this project, he managed to get almost five times more money than he initially planned ([ref]).

One more thing. This podcast show, https://productivitycast.net, looks like an interesting podcast to follow if you are interested in productivity and procrastination.

The third wave data scientist – a useful point of view

In 2019, it’s hard to find a data-related blogger who doesn’t write about the essence and the future of data science as a profession. Most of these posts (like this one for example) are mostly useless both for existing data scientists who think about their professional plans and for people who consider data science as their career.

Today I saw yet another post which I find very useful. In this post, Dominik Haitz identifies a “third wave data scientist.” In Dominik’s opinion, a successful data scientist has to combine four features: (1) Business mindset (2) Software engineering craftsmanship (3) Statistics and algorithmic toolbox, and (4) Soft skills. In Dominik’s classification, the business mindset is not “another skill” but the central pillar.

The professional challenges that I have been facing during the past eighteen months or so, made me realize the importance of points 1, 2, and 3 from Dominik’s list (number 4 was already very important on my personal list). However, it took reading his post to put the puzzle parts in place.

Dominik’s additional contribution to the discussion is ditching the famous data science Venn Diagram in favor of another, “business-oriented” visual which I used as the “featured image” to this post.

Painting: sailors in a wavy sea
A fragment from an 1850 painting by the Russian Armenian marine painter Ivan Aivazovsky named “The Ninth Wave.” I wonder what the “ninth wave data scientist” will be.

To specialize, or not to specialize, that is the data scientists’ question

In my last post on the data science career, I heavily promoted the idea that a data scientist needs to find his or her specialization. I backed my opinion with my experience and by citing other people’s opinions. However, keep in mind that I am not a career advisor, I never surveyed the job market, and I might not know what I’m talking about. Moreover, despite the fact that I advocate for specialization, I think that I am more of a generalist.

Since I published the last post, I was pointed to some other posts and articles that either support or contradict my point of view. The most interesting ones are “Why you shouldn’t be a data science generalist” and “Why Data Science Teams Need Generalists, Not Specialists”. Both are very recent and very articulate, but they promote different points of view. Go figure.

The featured image is based on a photo by Tom Parsons on Unsplash

The data science umbrella or should you study data science as a career move (the 2019 edition)?

TL;DR: Studying data science is OK as long as you know that it’s only a starting point.

Almost two years ago, I wrote a post titled “Don’t study data science as a career move.” Even today, this post is the most visited post on my blog. I was reminded of this post a couple of days ago during a team meeting in which we discussed what “data scientist” means today. I re-read my original post, and I think that I was generally right, but there is a but…

The term “data science” was born as an umbrella term that was meant to describe people who know programming, statistics, and business logic. We all saw those numerous Venn diagrams that tried to describe the perfect data scientist. Since then, the field of “data science” has finally matured. More and more people question the very definition of data science.

Here’s what the entrepreneur Chuck Russel has to say:

Now don’t get me wrong — some of these folks are legit Data Scientists but the majority is not. I guess I’m a purist –calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test the hypothesis with experimental results and after proving or disproving the conjecture move on or iterate.

Screenshot of a Google image search showing many Venn diagrams
There can’t be enough Venn diagrams

Now, “create and test hypotheses” is a very vague requirement. After all, any A/B test is a process of “creating and testing hypotheses” using data. Is anyone who performs A/B tests a data scientist? I think not.
Moreover, a couple of years ago, if you wanted to run an A/B test, perform a regression analysis, or build a classifier, you would have to write numerous lines of code, then debug and tune them. This tedious and intriguing process certainly felt very “sciency,” and if it worked, you would have been very proud of your job. Today, on the other hand, we are lucky to have general-purpose tools that require less and less coding. I don’t remember the last time I had to implement an analysis or an algorithm from first principles. With the vast amount of verified tools and libraries, writing an algorithm from scratch feels like a huge waste of time.
On the other hand, I spend more and more time trying to understand the “business logic” that I try to improve: why has this test failed? Who will use this algorithm, and what will make them like the results? Does the potential improvement justify the effort?

I (a data scientist) have all this extra time to think about business logic thanks to the huge arsenal of generalized tools to choose from. These tools were created mostly by those data scientists whose primary job is to implement, verify, and tune algorithms. My job and the job of these data scientists are different and require different sets of skills.

There is another ever-growing group of professionals who work hard to make sure someone can apply all those algorithms to any amount of data they feel suitable. These people know that any model is at most as good as the data it is based on. Therefore, they build systems that deliver the right information on time, distribute the data among computation nodes, and make sure no crazy “scientist” sends a production server to a non-responsive state due to a bad choice of parameters. We already have a term for professionals whose job is to build fail-proof systems. We call them engineers, or “data engineers” in this case.

The bottom line

Up till now, I mentioned three major activities that used to be covered by the data science umbrella: building new algorithms, applying algorithms to business logic, and engineering reliable data systems. I’m sure there are other areas under that umbrella that I forgot. In 2019, we reached the point where one has to decide which field of data science one wants to practice. If you consider studying data science, think of it as studying medicine. The vast majority of physicians don’t end up as general practitioners but rather invest at least five more years of their lives to specialize. Treat your data science studies as an entry ticket into a life-long learning process, and you’ll be OK. Otherwise (I’m citing myself here): You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.

PS. Here’s a one-week-old article on Forbes.com with very similar theses: link.

Chișinău Jewish cemetery

Two years ago, I visited Chișinău (Kishinev), the city in Moldova where I was born and where I grew up until the age of fifteen. Today I saw a post with photos from the ancient Chișinău Jewish cemetery and recalled that I, too, took many pictures of that sad place. Less than half of the original cemetery survives to this day. The bigger part of it was demolished in the 1960s in favor of a park and a residential area. If you scroll through the pictures below, you will be able to see how they used tombstones to build the park walls.

Another notable feature of many Jewish cemeteries is memorial plaques for relatives who don’t have their own graves, the relatives who were murdered over the course of Jewish history.

On procrastination, or why too good can be bad

I’m a terrible procrastinator. A couple of years ago, I installed RescueTime to fight this procrastination. The idea behind RescueTime is simple: it tracks the sites you visit and the applications you use and classifies them according to how productive they are. Using this information, RescueTime provides a regular report of your productivity. You can also trigger the productivity mode, in which RescueTime will block all the distracting sites such as Facebook, Twitter, news sites, etc. You can also configure RescueTime to trigger this mode according to different settings. This sounded like a killer feature to me and was the main reason behind my decision to purchase a RescueTime subscription. Yesterday, I realized how wrong I was.

RescueTime logo

When I installed RescueTime, I was full of good intentions. That is why I configured it to block all the distracting sites for one hour every time I accumulated more than 10 minutes of surfing such sites. However, from time to time, I managed to find a good excuse to procrastinate. Although RescueTime allows you to open a “bad” site after a certain delay, I found this delay annoying and ended up killing the RescueTime process (killing a process is faster than temporarily disabling a filter). As a result, most of my workday stayed untracked, unmonitored, and unfiltered.

So, I decided to end this absurd situation. As of today, RescueTime will never block any sites. Instead of blocking, I configured it to show a reminder and to open my RescueTime dashboard, as a reminder to behave myself. I don’t know whether this non-intrusive reminder will be effective or not but at least I will have correct information about my day.

I have 101 followers!

Yesterday, the follower list of my blog exceeded one hundred followers! Even though I know that some of these followers are bots, this number makes me happy! Thank you all (humans and bots) for clicking the “follow” button.

Against A/B tests

Traditional A/B testing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherently inferior to choosing a targeted mix of A and B.

Michael Kaminsky locallyoptimistic.com

The quote above is from a post by Michael Kaminsky “Against A/B tests“. I’m still not fully convinced by Michael’s thesis but it is very interesting and thought-provoking. 
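A toy calculation illustrates the argument in the quote. The segment names, traffic shares, and conversion rates below are invented for illustration and do not come from Michael’s post. The sketch only shows that when A wins in one subgroup and B wins in another, a targeted mix does at least as well as either single version.

```python
# Hypothetical segments: share of traffic and conversion rate per version.
segments = {
    "new visitors":       {"share": 0.5, "A": 0.10, "B": 0.06},
    "returning visitors": {"share": 0.5, "A": 0.04, "B": 0.08},
}


def overall_rate(segments, choose):
    """Overall conversion rate when `choose(segment)` picks a version per segment."""
    return sum(s["share"] * s[choose(s)] for s in segments.values())


rate_all_a = overall_rate(segments, lambda s: "A")
rate_all_b = overall_rate(segments, lambda s: "B")
# Targeted mix: show each segment the version that converts better for it.
rate_targeted = overall_rate(segments, lambda s: "A" if s["A"] >= s["B"] else "B")

print(f"all A: {rate_all_a:.3f}, all B: {rate_all_b:.3f}, targeted: {rate_targeted:.3f}")
```

Note that the per-segment rates are treated here as known. In a real experiment they are noisy estimates, and slicing the data into subgroups makes each estimate less reliable, which this sketch ignores.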

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in Israeli society. This is a fascinating report. I recommend reading it even if you have no interest in data visualization.

But this post does not deal with Israeli society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course they can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most cases, such a transformation would make perfect sense. In most cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report, and it provides a broad overview of Israeli society. The remainder of the report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigate through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables would be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided an additional information layer that makes the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Doumont. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why add text labels to graph lines when you can build graph lines using text labels? On microtext lines

richardbrath

Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?

unemployment_plain

Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifting visual attention…

View original post 1,323 more words

On the importance of perspective

Stalin was a relatively short man: his height was 1.65 m. Khrushchev was even shorter: his height was 1.60 m. It seems that the difference wasn’t enough for the official Soviet propaganda of that time. Take a look at this photo. We can clearly see that Stalin is taller than Khrushchev.

stalin.png

Do you notice something strange? Take a look at the windows in the background. I added horizontal and vertical guides for your convenience.

Screen Shot 2018-11-05 at 8.38.08

Now, look what happens when we fix the horizontal and vertical lines

Screen Shot 2018-11-05 at 8.39.03

Now, Khrushchev is still shorter than Stalin but not by that much.

Data visualization in right-to-left languages

If you speak Arabic or Farsi, I need your help. If you don’t, share this post with someone who does.

Right-to-left (RTL) languages such as Hebrew, Arabic, and Farsi are used by roughly 1.8 billion people around the world. Many of them consume data in their native languages. Nevertheless, I have never seen any research or study that explores data visualization in RTL languages. Until a couple of days ago, when I saw this interesting observation by Nick Doiron “Charts when you read right-to-left“.

I teach data visualization in Israeli colleges. Whenever a student asks me RTL-related questions, I always answer something like “it’s complicated, let’s not deal with that”. Moreover, in the assignments, I even allow my students to submit graphs in English, even if they write the report in Hebrew.

Nick’s post made me wonder about data visualization do’s and don’ts in RTL environments. Should Hebrew charts differ from Arabic or Farsi? What are the accepted practices?

If you speak Arabic or Farsi, I need your help. If you don’t, share this post with someone who does. I want to collect as many examples of data visualization in RTL languages as possible. Links to research articles are more than welcome. You can leave your comments here or send them to boris@gorelik.net.

Thank you.

The image at the top of this post is a modified version of a graph that appears in the post that I cite. Unfortunately, I wasn’t able to find the original publication.

Can error correction cause more error? (The answer is yes)

This is an interesting thought experiment. Suppose that you have some appliance that acts in a normally distributed way. For example, a nerf gun. Let’s say now that you aim and fire the gun. What happens if you miss by some amount of X? Should you correct your aim in the opposite direction? My intuition says “yes.” So does the intuition of many other people with whom I talked about this problem. However, when we start thinking about this problem, we realize that the intuition is wrong. Since we aim the gun, our assumption should be that the deviation is zero. A single observation is not sufficient to reject this assumption. By continually adjusting the data generating process based on a single observation, we reduce the precision (increase the dispersion).
Below is a simulation of adjusted and non-adjusted processes (the code is here). The broader spread of the adjusted data (blue line) is evident.

Two curves. Blue: high dispersion of values when adjustments are performed after every observation. Orange: smaller dispersion when no adjustments are done.

Due to the nature of the normal random variable, a single large accidental deviation can cause an extreme “correction,” which in turn will create a prolonged period of highly inaccurate points. This is precisely what you see in my simulation.
The moral of this simple experiment is that you shouldn’t let a single observation affect your actions.
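The experiment is easy to reproduce. Below is a minimal sketch, not the code linked above: it assumes a target at zero, normally distributed noise with a standard deviation of one, and a "correction" rule that shifts the aim by the full size of the last miss. The function name and parameters are made up for illustration.

```python
import random
import statistics


def simulate(n_shots, adjust, sigma=1.0, seed=0):
    """Return the landing positions of n_shots aimed at a target at 0."""
    rng = random.Random(seed)
    aim = 0.0  # where the gun points; the target itself is at 0
    hits = []
    for _ in range(n_shots):
        hit = aim + rng.gauss(0.0, sigma)  # aim plus normally distributed noise
        hits.append(hit)
        if adjust:
            aim -= hit  # "correct" the aim by the full size of the last miss
    return hits


plain = simulate(10_000, adjust=False)
adjusted = simulate(10_000, adjust=True)
print(f"no adjustment:         std = {statistics.pstdev(plain):.2f}")
print(f"adjust after each shot: std = {statistics.pstdev(adjusted):.2f}")
```

Under this particular rule, every shot carries the noise of the current draw plus the noise of the previous one, so the variance roughly doubles. Other correction rules will give different numbers.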

 

“Any questions?” How to fight the awkward silence at the end of a presentation?

If you ever gave or attended a presentation, you are familiar with this situation: the presenter asks whether there are any questions and … nobody asks anything. This is an awkward situation. Why aren’t there any questions? Is it because everything is clear? Not likely. Everything is never clear. Is it because nobody cares? Well, maybe. There are certainly many people that don’t care. It’s a fact of life. Study your audience, work hard to make the presentation relevant and exciting but still, some people won’t care. Deal with it.

However, the bigger reasons for the lack of questions are human laziness and the fear of looking stupid. Nobody likes asking a question that someone will perceive as a stupid one. Some people don’t mind asking a question but are embarrassed and prefer not to be the first to break the silence.

What can you do? Usually, I prepare one or two questions myself. Then, if nobody asks anything, I say something like “Some people, when they see these results, ask me whether it is possible to scale this method to larger sets.” Then, depending on how confident you are, you may provide the answer or ask “What do you think?”

You can even prepare a slide that answers your question. In the screenshot below, you may see the slide deck of the presentation I gave in Trento. The blue slide at the end of the deck is the final slide, where I thank the audience for the attention and ask whether there are any questions.

My plan was that if nobody asked me anything, I would say “Thank you again. If you want to learn more practical advice about data visualization, watch the recording of my tutorial, where I present this method <SLIDE TRANSFER, show the mockup of the “book”>. Also, many people ask me about reading suggestions; this is what I suggest you read: <SLIDE TRANSFER, show the reading pointers>”.

Screen Shot 2018-09-17 at 10.10.21

Luckily for me, there were questions after my talk. Luckily, one of these questions was about practical advice so I had a perfect excuse to show the next, pre-prepared, slide. Watch this moment on YouTube here.

Graphing Highly Skewed Data – Tom Hopper

My colleague, Charles Earl, pointed me to this interesting 2010 post that explores different ways to visualize categories of drastically different sizes.

The post author, Tom Hopper, experiments with different ways to deal with “Data Giraffes”. Some of his experiments are really interesting (such as splitting the graph area). In one experiment, Tom Hopper draws a bar chart on a log scale. Doing so is considered bad practice. The value (Y) axis of a bar chart must include a meaningful zero, which a log scale can’t have by definition.
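A small arithmetic sketch shows why the missing zero matters. A reader compares bar lengths, but on a log axis the length depends on where the axis happens to start. The two categories and the baselines below are hypothetical.

```python
import math

values = {"small": 10, "large": 1000}  # hypothetical categories


def log_bar_length(value, baseline):
    """Length of a bar drawn on a log10 axis that starts at `baseline`."""
    return math.log10(value) - math.log10(baseline)


for baseline in (1, 0.1, 0.001):
    ratio = (log_bar_length(values["large"], baseline)
             / log_bar_length(values["small"], baseline))
    print(f"axis starts at {baseline}: bar-length ratio = {ratio:.2f}")

print(f"ratio of the underlying values = {values['large'] / values['small']:.0f}")
```

The bar-length ratio changes with every choice of baseline and never comes close to the hundredfold difference in the data.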

Other than that, a good read: Graphing Highly Skewed Data – Tom Hopper

Back to Mississippi: Black migration in the 21st century. By Charles Earl

I wonder how this analysis remained unnoticed by social media.

The recent election of Doug Jones […] got me thinking: What if the Black populations of Southern cities were to experience a dramatic increase? How many other elections would be impacted?

via Back to Mississippi: Black migration in the 21st century — Charlescearl’s Weblog

16-days-work-month — The joys of the Hebrew calendar

Tishrei, the seventh month of the Hebrew calendar, starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles)**. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 resting days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last Sukkot days are mostly considered half working days. Also, the children are at home since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 2008 and 2023 CE, and this is what we get:

Dynamics of the number of working days in Tishrei over the years. The average fluctuation is around 16 days

Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is what the working/non-working time during this month looks like:

tishrei_2018_calendar

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to the constantly interrupted workday, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.
(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

Sometimes, less is better than more

Today, during the EuroSciPy conference, I gave a presentation titled “Three most common mistakes in data visualization and how to avoid them”. The title of this presentation is identical to the title of the presentation that I gave in Barcelona earlier this year. The original presentation was approximately one and a half hours long. I knew that EuroSciPy presentations were expected to be shorter, so I was prepared to shorten my talk to half an hour. At some point, a couple of days before departing to Trento, I realized that I was only allocated 15 minutes. Fifteen minutes! Instead of ninety.

Frankly speaking, I was in a panic. I even considered contacting the EuroSciPy organizers and asking them to remove my talk from the program. But I was too embarrassed, so I decided to take the risk and started throwing slides away. Overall, I think that I spent eight to ten working hours shortening my presentation. Today, I finally presented it. Based on the result, and on the feedback that I got from the conference audience, I now know that the 15-minute version is better than the original, longer one. A video recording of my talk is available on YouTube and is embedded below, followed by my slide deck.

 

 

Illustration image credit: Photo by Jo Szczepanska on Unsplash

An even better data visualization workshop

Boris Gorelik teaching in front of an audience.

Yesterday, I gave a data visualization workshop at EuroSciPy 2018 in Trento. I spent hours building and improving it. I even developed a “simple to use, easy to follow, never failing formula” for the data visualization process (I’ll write about it later).

I enjoyed this workshop so much, both preparing it and (even more so) delivering it. There were so many useful questions and remarks. The most important remark was made by Gael Varoquaux, who pointed out that one of my examples was suboptimal for vision-impaired people. The embarrassing part is that one of the last lectures that I gave in my college data visualization course was about visual communication for the visually impaired. That is why the first thing I did when I came to my hotel after the workshop was to fix the error. You may find all the (corrected) material I used in this workshop on GitHub. Below is the video of the workshop, in case you want to follow it.

 

 

 

Photo credit: picture of me delivering the workshop is by Margriet Groenendijk

Meet me at EuroSciPy 2018

I am excited to run a data visualization tutorial, and to give a data visualization talk during the 2018 EuroSciPy meeting in Trento, Italy.

My tutorial “Data visualization — from default and suboptimal to efficient and awesome” will take place on Sep 29 at 14:00. This is a two-hour tutorial during which I will cover two or three examples. I will start with the default Matplotlib graph and modify it step by step to make a beautiful aid in technical communication. I will publish the tutorial notebooks immediately after the conference.

My talk “Three most common mistakes in data visualization” will be similar in nature to the one I gave in Barcelona this March, but more condensed and enriched with information I learned since then.

If you plan to attend EuroSciPy and want to chat with me about data science, data visualization, or remote working, write a message to boris@gorelik.net.

Full conference program is available here.

Value-Suppressing Uncertainty Palettes – UW Interactive Data Lab – Medium

Uncertainty is one of the most neglected aspects of number-based communication and one of the most important concepts in general numeracy. Comprehending uncertainty is hard. Visualizing it is, apparently, even harder.

Last week I read a paper called Value-Suppressing Uncertainty Palettes, by M. Correll, D. Moritz, and J. Heer from the data visualization and interactive analysis research group at the University of Washington. This paper describes an interesting approach to color-encoding uncertainty.

Value-Suppressing Uncertainty Palette

Uncertainty visualization is commonly done by reducing color saturation and opacity. Correll et al. suggest combining saturation reduction with limiting the number of possible colors in a color palette. Unfortunately, the authors used JavaScript and not Python for this paper, which means that in the future, I might try implementing it in Python.
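To give a feeling for what “limiting the number of possible colors” can mean, here is a rough Python sketch of the binning step only. It is not the authors’ implementation. It assumes that both the value and the uncertainty are scaled to the range from 0 to 1 and that the number of value bins halves at every uncertainty level; the function name and the number of levels are made up.

```python
def vsup_cell(value, uncertainty, levels=4):
    """Map (value, uncertainty), both in [0, 1], to a palette cell.

    Returns (uncertainty_level, value_bin, bin_center). The more uncertain
    the value, the fewer value bins are available, so uncertain values
    collapse into the same (typically less saturated) color.
    """
    u_level = min(int(uncertainty * levels), levels - 1)  # 0 = most certain
    n_bins = 2 ** (levels - 1 - u_level)                  # 8, 4, 2, 1 for levels=4
    v_bin = min(int(value * n_bins), n_bins - 1)
    center = (v_bin + 0.5) / n_bins                       # representative value
    return u_level, v_bin, center


print(vsup_cell(0.9, 0.0))   # certain: one of eight value bins
print(vsup_cell(0.9, 0.99))  # very uncertain: a single shared bin
```

A real palette would then assign a color to every cell, for example by looking up the bin center in a color map and lowering the saturation as the uncertainty level grows.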

Two figures visualizing poll data over the USA map, using different approaches to visualize uncertainty

 

Visualizing uncertainty is one of the most challenging tasks in data visualization. Uncertain

 

via Value-Suppressing Uncertainty Palettes – UW Interactive Data Lab – Medium

Evolution of a complex graph. Part 1. What do you want to say?

From time to time, people ask me for help with non-trivial data visualization tasks. A couple of weeks ago, a friend-of-a-friend-of-a-friend showed me a set of graphs with the following note:

Each row is a different use case. Each use case was tested on three separate occasions – columns 1,2,3. We hope to show that the lines in each row behave similarly, but that there are differences between the different rows.

Before looking at the graphs, note the last sentence in the above comment. Knowing what you want to show is an essential and nontrivial part of a data visualization task. Specifying what it is precisely that you want to say is the first required task in any communication attempt, technical or not.

For obvious reasons, I cannot share the original graphs that person gave me. I managed to re-create the spirit of those graphs using a combination of randomly generated arrays.
The original graph: A 3-by-4 panel of line charts
Notice how the X- and Y- axes are aligned between all the subplots. Such alignment is a smart move that provides a shared scale and allows faster and more natural comparison between the curves. You should always try aligning your axes. If aligning isn’t possible, make sure that it is absolutely, 100%, clear that the scales are different. Slight differences are very confusing.
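In Matplotlib, aligned axes come almost for free. The sketch below is not the code behind the figures in this post: the data are random stand-ins, and the row labels and column titles are hypothetical. It relies on the `sharex` and `sharey` arguments of `plt.subplots`.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(365)
row_labels = ["Use case A", "Use case B", "Use case C", "Use case D"]  # hypothetical

# sharex/sharey force one common scale on every panel of the grid
fig, axes = plt.subplots(nrows=4, ncols=3, sharex=True, sharey=True, figsize=(9, 8))
for r, label in enumerate(row_labels):
    for c in range(3):
        y = np.cumsum(rng.normal(loc=r, scale=1.0, size=x.size))  # stand-in data
        axes[r, c].plot(x, y)
    axes[r, 0].set_ylabel(label)  # one human-readable label per row
for c in range(3):
    axes[0, c].set_title(f"Occasion {c + 1}")
```

Because the scale is shared, a label on the first column and a title on the first row are enough; there is no need to repeat them in every panel.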

There are several small things that we can do to improve this graph. First, the identical legends in every subplot are a useless waste of ink and thus, of your viewers’ processing power. Since they are identical, these legends do nothing but distract the viewer. Moreover, while I understand how a variable name such as event_prob appeared on a graph, showing such names outside technical teams is a bad practice. People who don’t share intimate knowledge of the underlying data will find human-readable labels easier to comprehend, making your message “stickier.”
Let’s improve the signal-to-noise ratio of this plot.
An improved version of the 3-by-4 grid of line charts

According to our task, each row is a different use case. Notice that I accompanied each row with a human-readable label. I didn’t use cryptic codes such as group_001, age_0_10, or the like.
Now, let’s go back to the task specification. “We hope to show that the lines in each row behave similarly, but that there are differences between the separate rows.” Remember my advice to always use conclusions as graph titles? Let’s test what such a title will look like.

A hypothetical screenshot. The title says: "low intra- & high inter- group variability"

Really? Is there a better way to justify the title? I claim that there is.

Let’s experiment a little bit. What will happen if we plot all the lines on the same graph? By doing so, we might put a stronger emphasis on the similarities and the differences.

Overlapping lines that show several repetitions in four different groups
Not bad. The separate lines create some excessive noise, and the legend isn’t the best way to label multiple lines, so let’s improve the graph even further.

Curves representing four different data groups. Shaded areas represent inter-group variability

Note the meaningful ticks on the X-axis. The 30, 180, and 365-day marks provide useful anchors.

Now, let us go back to our title. “Low intra- and high inter- group variability” is, in fact, two conclusions. If you have ever read any text about technical presentations, you should remember the “one point per slide” rule. How do we solve this problem? In cases like these, I like to use the same graph in two different slides, one for each conclusion.

Screenshot showing two slides. The first one is titled "low within-group variability". The second one is titled "High between-group variability". The graph in both slides is the same

During a presentation, I would show this graph with the first conclusion as a title. I would talk about the implications of that conclusion. Next, I would say “Wait! There is more,” advance the slide, and start talking about the second conclusion.

To sum up,

First, decide what it is that you want to say. Then ask whether your graph says what you want to say. Next, emphasize what you want to say, and finally, say what you want to say.

To be continued

The case that you see in this post is a relatively easy one because it only compares four groups. What will happen if you need to compare six, sixteen, or sixty groups? I will try answering this question in one of my next posts.

C for Conclusion

From time to time, I give a lecture about the most common mistakes in data visualization. In this lecture, I say that not adding a graph’s conclusion as a title is an opportunity wasted.

Screenshot. Slide deck. The slide says

In one of these lectures, a fresh university graduate commented that in her university, she was told to never write a conclusion in a graph. According to the logic she was taught, a scientist is only supposed to show the information and let his or her peer scientists draw the conclusions by themselves. This sounds like a valid demand except that it is, in my non-humble opinion, wrong. To understand why that is, let’s review the arguments in favor of spelling out the conclusions.

The cynical reason

We cannot “unlearn” how to read. If you show a piece of graphic for its aesthetic value, it is perfectly OK not to suggest any conclusions. However, most of the time, you will show a graph to persuade someone, to convince them that you did a good job, that your product is worth investing in, or that your opponent is ruining the world. You hope that your audience will reach the conclusion that you want them to reach, but you are not sure. Spelling out your conclusion ensures that the viewers use it as a starting point. In many cases, they will be too lazy to think of objections and will adopt your point of view. You don’t have to believe me on this one. The Nobel Prize winner Daniel Kahneman wrote a book about this phenomenon.

What if you want to hear genuine criticism? Use the same trick to ask for it. Write an open question instead of the conclusion to ensure everybody wakes up and starts thinking critically.

The self-discipline reason

Some people are not comfortable with the cynical way in which I suggest exploiting the limitations of the human mind. Those people might be right. For them, I have another reason: self-discipline. Coming up with a short, concise, and descriptive title requires effort. This effort slows you down and ensures that you start thinking critically and asking questions. “What does this graph really tell?” “Is this the best way to demonstrate this conclusion?” “Is this conclusion relevant to the topic of my talk, is it worth the time?” These are very important questions that someone has to ask you. Sure, having a professional and devoted reviewer on your prep team is great, but unless you are a Fortune-500 CEO, you are preparing your presentations by yourself.

The philosophical reason

You will notice that my two arguments sound like a hack. They do not talk about the “pure science attitude”, and seem to be detached from the theoretical picture of the idealized scientific process. That is why, when that student objected to my suggestion, I admitted defeat. Being a data scientist, I want to feel good about my scientific practice. It took me a while but at some point, I realized that writing a conclusion as the sole title of a graph or a slide is a good scientific practice and not a compromise.

According to the great philosopher Karl Popper, a mandatory characteristic of any scientific theory is that it makes claims that future observations might show to be false. Popper claims that without taking the risk of being proved wrong, a scientist misses the point [ref]. And what is the best way to make a clear, risky statement, if not spelling it out as a clear, non-ambiguous title of your graph?

Don’t feel bad, your bases are covered

To sum up, whenever you create a graph or a slide, think hard about what conclusion you want your audience to make out of it. Use this conclusion as your title. This will help you check yourself, and will help your fellow scientists assess your theory. And if a purist professor says you shouldn’t write your conclusions, tell him or her that the great Karl Popper thought otherwise.

 

In defense of three-dimensional graphs

“There is only one thing worse than a pie chart. It’s a 3-D pie chart”. This is what I used to think for quite a long time. Recently, I have revised my attitude towards pie charts, mainly due to the work of Robert Kosara from Tableau. I am now so convinced that pie charts can be a good visualization choice that I even included a session “Pie charts as an alternative to bar charts” in my recent workshop.

What about three-dimensional graphs? I’m not talking about the situations where the data is intrinsically three-dimensional. Such situations lie within the consensus. I’m talking about adding a third dimension to graphs that can work in two dimensions. Like the example below that is taken from a 2017 post by Deven Wisner.

Screenshot: a 3D pie chart with text "The only good thing about this pie chart is that it's starting to look more like [a] real pie"

Of course, this is not a hypothetical example. We all remember how the late Steve Jobs tried to create a false impression of Apple market share

Steve Jobs during a presentation, in front of a

Having said all that, can you think of a legitimate case where adding the third dimension adds more value than distraction? I worked hard, and I finally found it.

 

Take a look at the overlapping density plot (a.k.a “joy plot”).

Three joyplot examples

If you think about it, joyplots are nothing more than 3-D line graphs done well. Most of the time, they provide an information-rich data overview that also enables digging into fine details. I really like joyplots. I included one in my recent workshop. Many libraries now provide ready-to-use implementations of joyplots. This is a good thing to have. The only reservation that I have about those implementations is the fact that many of them, including my favorite seaborn, add meaningless colors to the curves. But this is a topic for another rant.

Today’s workshop material

Today, I hosted a data visualization workshop as part of the workshop day adjacent to the fourth Israeli Data Science Summit. I really enjoyed this workshop, especially the follow-up questions. These questions are the reason I volunteer to talk about data visualization every time I can. It may sound strange, but I learn a lot from the questions people ask me.

If you want to see the code, you may find it on GitHub. The slide deck is available on Slideshare

Me in front of an audience

 

 

If you know matplotlib and are in Israel on May 27th, I need your help

So, the data visualization workshop is fully booked. The organizers told me to expect 40-50 attendees and I need some assistance. I am looking for a person who will be able to answer technical questions such as “I got a syntax error”, “why can’t I see this graph?”, “my graph has different colors”.

It’s a good opportunity to attend the workshop for free, to learn a lot of useful information, and to meet a lot of smart people.

It’s a win-win situation. Contact me now at boris@gorelik.net

Prerequisites for the upcoming data visualization workshop

I have been told that the data visualization workshop (“Data Visualization from default to outstanding. Test cases of tough data visualization”) is completely sold out. If you plan to attend this workshop, please check out the repository that I created for it [link]. In that repository, you will find a list of prerequisites that you absolutely need to meet before the workshop. Also, it will be very helpful if you could fill out this poll, which will help me prepare for the workshop.

See you soon

 

 

I will host a data visualization workshop at Israel’s biggest data science event

TL;DR

 

What: Data Visualization from default to outstanding. Test cases of tough data visualization

Why:  You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Sadly, most of you do.

When: May 27, 2018 (a day before the DataScience summit)/ 13:00 – 16:00

Where:  Interdisciplinary Center (IDC) at Herzliya.

More info: here.

Timeline:
1. Theoretical introduction: three most common mistakes in data visualization (45 minutes)
2. Test case (LAB): Plotting several radically different time series on a single graph (45 minutes)
3. Test case (LAB): Bar chart as an effective alternative to a pie chart (45 minutes)
4. Test case (LAB): Pie chart as an effective alternative to a bar chart (45 minutes)

More words

According to the conference organizers, the yearly Data Science Summit is the biggest data science event in Israel. This year, the conference will take place in Tel Aviv on Monday, May 28. One day before the main conference, there will be a workshop day, hosted at the Herzliya Interdisciplinary Center. I’m super excited to host one of the workshops during the afternoon session. During this workshop, we will talk about the mistakes data scientists make while visualizing their data and the ways to avoid them. We will also have some fun creating various charts, comparing the results, and trying to learn from each other’s mistakes.

Register here.

Whoever owns the metric owns the results — don’t trust benchmarks

Other factors being equal, what language would you choose for heavy numeric computations: Python or PHP? This is not a language war but a serious question. For me, the choice seems to be obvious: I would choose Python, and I’m not the only one. In this survey, for example, 45% of data scientists use Python, compared to 24% who use PHP. The two sets of data scientists aren’t mutually exclusive, but we do see the picture.

This is why I was very surprised when a colleague of mine suggested switching to PHP because of a benchmark in which it ran three times faster. I was intrigued, especially when I noticed that the benchmark used heavy number crunching.

In that benchmark, the authors compute prime numbers using the following Python code

import sys

def get_primes7(n):
    """
    standard optimized sieve algorithm to get a list of prime numbers
    --- this is the function to compare your functions against! ---
    """
    if n < 2:
        return []
    if n == 2:
        return [2]
    # do only odd numbers starting at 3
    if sys.version_info.major <= 2:
        s = range(3, n + 1, 2)
    else:  # Python 3
        s = list(range(3, n + 1, 2))
    # n**0.5 simpler than math.sqrt(n)
    mroot = n ** 0.5
    half = len(s)
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m * m - 3) // 2  # int div
            s[j] = 0
            # cross out every multiple of m, starting at m * m
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2 * i + 3
    # keep 2 and every odd number that was not crossed out
    return [2] + [x for x in s if x]

Did you notice the problem? The code above is pure Python. I can't think of a good reason to use pure Python code for computationally intensive, time-sensitive tasks. When you need to crunch numbers with Python, and when the computational time is even remotely important, you will most certainly use tools that were specifically optimized for such tasks. One of the most important such tools is numpy, in which the most important loops are implemented in C or Fortran. Many other packages, such as pandas, scipy, and sklearn, rely on numpy or on other forms of speed optimization.

The following snippet uses numpy to perform the same computation as the first one.

import numpy as np

def numpy_primes(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n>=6, Returns a array of primes, 2 <= p < n """
    # np.bool was removed from recent numpy versions; the built-in bool works everywhere
    sieve = np.ones(n//3 + (n%6==2), dtype=bool)
    sieve[0] = False
    for i in range(int(n**0.5)//3+1):
        if sieve[i]:
            k=3*i+1|1
            sieve[      ((k*k)//3)      ::2*k] = False
            sieve[(k*k+4*k-2*k*(i&1))//3::2*k] = False
    return np.r_[2,3,((3*np.nonzero(sieve)[0]+1)|1)]

On my computer, the timings to generate primes smaller than 10,000,000 are 1.97 seconds for the pure Python implementation and 21.4 milliseconds for the numpy version. The numpy version is 92 times faster!
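If you want to run this kind of comparison yourself, the sketch below shows one way to do it with the standard `timeit` module. It is not the benchmark from the post: the two sieve functions are simplified stand-ins written for this example, and the problem size is kept small so that it finishes quickly. The absolute numbers will depend on your machine.

```python
import timeit

import numpy as np


def sieve_pure(n):
    """Primes below n using only built-in Python lists."""
    is_prime = [True] * n
    is_prime[:2] = [False] * min(n, 2)  # 0 and 1 are not prime
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def sieve_numpy(n):
    """The same sieve, with the inner loop replaced by a slice assignment."""
    is_prime = np.ones(n, dtype=bool)
    is_prime[:2] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            is_prime[i * i::i] = False
    return np.nonzero(is_prime)[0].tolist()


def best_time(func, n, repeat=3):
    """Best wall-clock time out of `repeat` single runs, in seconds."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


n = 200_000
assert sieve_pure(n) == sieve_numpy(n)  # make sure both compute the same thing
for func in (sieve_pure, sieve_numpy):
    print(f"{func.__name__}: {best_time(func, n):.4f} s")
```

Checking that both functions return the same result before timing them is part of the point: a benchmark only means something when the competitors do the same job under the conditions you care about.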

What does that mean? 
Whoever owns the metric owns the results. Never trust a benchmark result before you understand how the benchmark was performed, and before making sure the benchmark was performed under the conditions that are relevant to you and your problem.

 

 

Three most common mistakes in data visualization and how to avoid them. Now, the slides

Yesterday, I talked in front of the Barcelona Data Science and Machine Learning Meetup about the most common mistakes in data visualization. I enjoyed talking with the local community very much. Judging by the feedback I received during and after the talk, they, too, enjoyed my presentation. I uploaded my slides to Slideshare.

Me in front of a screen that shows a bar chart

Enjoy!

Live in Barcelona. Three most common mistakes in data visualization.

On Thursday, March 20, I will give a talk titled “Three most common mistakes in data visualization and how to avoid them.” I will be a guest of the Barcelona Data Science and Machine Learning Meetup Group. Right now, less than twenty-four hours after the lecture announcement, there are already seventeen people on the waiting list. I feel a lot of responsibility and am very excited.

 

Visiting the outer space isn’t such a big deal

I know a lot of people who dreamt of being a cosmonaut or an astronaut. I was one of them. Did you know that visiting outer space isn’t such a big deal? Since Yuri Gagarin’s first flight to space in 1961, 557 more people have flown to space. Unfortunately, not all of them survived the trip [ref].

On the other hand

There are 193 UN member countries. Did you know that, according to Wikipedia, only 13 (thirteen!) people are confirmed to have visited all of these countries? [ref] That’s 43 times fewer than the number of astronauts!

558 people visited space; 13 people visited all the countries in the world

On algorithmic fairness & transparency

My teammate, Charles Earl, recently attended the Conference on Fairness, Accountability, and Transparency (FAT*). The conference site is full of very interesting material, including proceedings and video recordings of lectures and tutorials.

Reading through the conference proceedings, I found a very interesting paper titled “The Cost of Fairness in Binary Classification.” This paper talks about the measures one needs to take in order not to use sensitive features (such as race) as a means of discrimination, with a reasonable accuracy tradeoff.

Skimming through this paper, I recalled a conversation I had about a year ago with a chief data scientist in a startup that provides short-term loans to people who need some money now. The major job of the data science team in that company was to assess the risk of a customer. From the explanation the chief data scientist gave, and from the data sources she described, it was clear that they train their model on the information whether a person is likely to receive a loan from a financial institution. I pointed out that they exclude categories of people that are rejected but are likely to return the money. “Yes?” she said in a tone as if she couldn’t see what problem I was trying to raise. “Well,” I said, “it’s unfair to many customers, plus you’re missing the chance to recruit customers who were rejected by others.” “We have enough potential customers,” she said. She didn’t think fairness was an issue worth talking about.

 

The featured image is by Søren Astrup Jørgensen from Unsplash

 

Five misconceptions about data science

One item on my todo list is to write a post about “three common misconceptions about data science.” Today, I found this interesting post that lists misconceptions much better than I would have been able to do. Plus, they list five of them. That’s 67% more than I intended to do 😉

I especially liked the section called “What is a Data Scientist” that presents six Venn diagrams of a dream data scientist.

The analogy between the data scientist and a purple unicorn is still apt – finding an individual that satisfies any one of the top four diagrams above is rare.

 

Enjoy reading  Five Misconceptions About Data Science – Knowing What You Don’t Know — Track 2 Analytics

Blogging isn’t what it used to be

From time to time, I assume something, evaluate that assumption, and discover that the reality is opposite to what I thought it was. That’s exactly what happened when I thought about the dynamics of Google searches for “create a site,” compared to the searches for “create a blog.” I was sure that there would be many more searches for “create a site.” I was wrong.

blogging_is_not_what_it_used_to_be.png

There are several interesting insights that one can derive from that small analysis.

  1. The number of people who search for “create a site” is continuously dropping.
  2. Ever since 2009, the number of searches for “create a site” is smaller than the number of searches for “create a blog.” Why? I have no idea.
  3. Blog creation search dynamics are also interesting. Both “start a blog” and “create a blog” have been decreasing since January 2011. However, despite the fact that both curves started at the same height and reached the same peak, they did so along different trajectories. “Create a blog” reached a peak gradually, following a concave path. “Start a blog,” on the other hand, reached the peak following a convex path that resembles exponential growth. For some reason, in January 2009 the growth of both searches stopped.

Usually, in posts like this, you would expect an analysis that explains the difference. I don’t have any answers. However, if you have any hypothesis, I will be glad to hear.

 

ASCII histograms are quick, easy to use and implement

From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python required too many lines of boilerplate code, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I have kept my version of asciihist updated since 2005. You may find it on Github here.
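To give a feeling of how little code such a function needs, here is a minimal sketch. It is not the asciihist code mentioned above; the function name `ascii_hist` and its parameters (`bins`, `width`) are made up for this illustration.

```python
def ascii_hist(values, bins=5, width=40):
    """Return one text line per bin, each with a bar made of '#' characters."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    counts = [0] * bins
    for v in values:
        # Map the value to a bin index; the maximum goes into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        left_edge = lo + span * i / bins
        bar = "#" * round(width * count / peak)
        lines.append(f"{left_edge:8.2f} | {bar} {count}")
    return lines


# Hypothetical scores; in a log file these could be the scores of one iteration.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]
hist_lines = ascii_hist(scores, bins=5, width=10)
print("\n".join(hist_lines))
```

Because the result is plain text, it can go straight into a log file, one histogram per iteration.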

Mammogram, breast cancer, and manipulative statistics

Here’s a quiz

A healthy woman with no risk factors gets a positive mammogram result during a routine annual check. What is the probability that she actually has a breast cancer?
Baseline data: The probability that a woman has breast cancer is 0.8%. If she has breast cancer, the probability that a mammogram will show a positive result is 90%. If a woman does not have breast cancer, the probability of a positive result is 7%.

Prof. Gerd Gigerenzer gave this quiz to numerous students, physicians, and professors. Most of them failed it. The correct answer is 9%. The probability that a healthy woman has breast cancer if she has a positive mammogram test is only nine percent! This means that about ninety percent of women who get a positive result will undergo a stressful and painful series of tests only to discover that it was a false alarm. In his book “Calculated Risks“, Prof. Gigerenzer uses this low probability as a starting (but not the only) argument against the common practice of routine population-wide mammogram tests. However, I would like to propose another way to look at this problem.
To understand my concern, let me first explain how we get the 9% figure.
There are several ways to get to this result. One of them is as follows. Eighty out of 10,000 women have breast cancer. Of those women, 72 (90% of 80) will test positive during a mammogram. Of the remaining 9,920 healthy women, about 694 (7%) will also have a positive mammogram test. The total number of women with a positive test is 766. Of those 766 women, only 72 have breast cancer, which is about 9%. The following diagram will help you track the numbers.

Diagram that presents natural occurrence of breast cancer, and the statistics of mammogram tests
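The same counting argument can be written as a few lines of Python. This is only a sketch of the arithmetic above; the variable names are mine, and the numbers are the baseline data from the quiz.

```python
prevalence = 0.008           # P(cancer)
sensitivity = 0.90           # P(positive | cancer)
false_positive_rate = 0.07   # P(positive | no cancer)

population = 10_000
sick = population * prevalence                    # 80 women
true_positives = sick * sensitivity               # 72 women
healthy = population - sick                       # 9,920 women
false_positives = healthy * false_positive_rate   # about 694 women

# Share of women with cancer among all women with a positive test
ppv = true_positives / (true_positives + false_positives)
print(f"P(cancer | positive test) = {ppv:.1%}")
```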

Nine percent is indeed a low number. If a healthy woman gets ten mammogram tests in her lifetime, there is a roughly 50% chance that she will have at least one false positive test (1 − 0.93^10 ≈ 52%, assuming the tests are independent). This is not something that can be easily ignored.

However

Let’s think about another way to look at this problem. Yes, the probability that a woman has breast cancer given a positive mammogram result is nine percent (72 out of 694+72=766). However, the probability that a woman has breast cancer given a negative mammogram result is 8 out of (9,226+8)=9,234, which is approximately 0.09%. That means that a woman with a positive mammogram test is about 100 times more likely to have breast cancer than a woman with a negative result. An increase by a factor of 100 sounds like a serious threat. Much more serious than the nine percent! Moreover, a woman with a negative mammogram result knows that she is approximately ten times less likely to have breast cancer than an average woman who didn’t undergo the test (0.09% vs 0.8%).
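Here is a short sketch of this second calculation, using Bayes’ rule directly instead of counting women. The helper `posterior` is a name I made up for this illustration; the inputs are the quiz’s baseline data.

```python
def posterior(prior, p_result_if_sick, p_result_if_healthy):
    """Bayes' rule: P(sick | observed test result)."""
    sick = prior * p_result_if_sick
    healthy = (1 - prior) * p_result_if_healthy
    return sick / (sick + healthy)


prior = 0.008  # P(cancer) before any test

# Positive result: P(positive | cancer) = 0.90, P(positive | healthy) = 0.07
p_cancer_given_positive = posterior(prior, 0.90, 0.07)
# Negative result: the complementary probabilities
p_cancer_given_negative = posterior(prior, 1 - 0.90, 1 - 0.07)

relative_risk = p_cancer_given_positive / p_cancer_given_negative
print(f"after a positive test: {p_cancer_given_positive:.2%}")
print(f"after a negative test: {p_cancer_given_negative:.3%}")
print(f"ratio of the two:      {relative_risk:.0f}")
```

The exact ratio comes out slightly above 100, which is why “100 times” is a fair round number.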

Conclusion?

Frankly, I don’t know. One thing is for sure: one can use statistics to steer an “average person” towards the desired decision. If my goal is to reduce the number of women who undergo routine mammogram tests, I will talk in terms of absolute risk (9%). If, on the other hand, I’m selling mammogram equipment, I will definitely talk in terms of relative risk, i.e., the 100-times risk increase. Think about this every time someone is talking to you about hazards.

One of the reasons I don’t like R

I never liked R. I didn’t like it the first time I tried to learn it. I didn’t like it when I had to switch to R as my primary work tool at my previous job. And I didn’t like it a year and a half later, when I was comfortable enough to add R to my CV, right before leaving that job.

Today, I was reminded of one feature (out of so many) that made me dislike R. It’s its import (or library, as they call it in R) feature. In Python, you can import a_module and then use its components by calling a_module.a_function. Simple and predictable. In R, you have to read the docs in order to understand what will happen to your namespace after you have library(a.module) (I know, those dots grrrr) in your code. This feature is so annoying that people write modules that help them use other modules. Like in this blog post, which looks like an interesting thing to do, but … wouldn’t it be easier to use Python?
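To make the Python side of the comparison concrete, here is a tiny example that uses two standard library modules (`math` and `statistics`) as stand-ins for a_module:

```python
import math                     # adds exactly one name, `math`, to the namespace
from statistics import median   # adds exactly one name, `median`

# The origin of every function is visible either at the call site
# or in the import lines at the top of the file.
root = math.sqrt(16)
middle = median([3, 1, 2])
print(root, middle)
```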

 

Overfitting reading list

Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, when each data point was expensive, and researchers operated on datasets that could fit a single A4 sheet of paper. Today, with mega-, giga-, and terabyte datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.

I would like to start with Mehmet Suzen (mllib.wordpress.com), who treats overfitting as an “inaccurate meme in supervised learning”:

cross-validation does not prevent your model to overfit and good out-of-sample performance does not guarantee not-overfitted model.

Another blogger, whose name I couldn’t find, has two very detailed posts on overfitting:

Understanding overfitting from bias-variance trade-off and Understanding overfitting from Haussler 1988 theorem

Finally, Adrian from the “morning paper” (please don’t tell me you don’t follow that blog) has a summary of another paper, titled “Understanding deep learning requires re-thinking generalization” (I only read Adrian’s summary).

Conclusion

No conclusions here. It’s a reading list.

Featured image credit: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg

Tips on making remote presentations

Today, I made a presentation to the faculty of the Chisinau Institute of Mathematics and Computer Science. The audience gathered in a conference room in Chisinau, and I was in my home office in Israel.

Me presenting in front of the computer

Following is a list of useful tips for this kind of presentation.

  • When presenting, it is very important to see your audience. Thus, use two monitors. Use one monitor for screen sharing, and the other one to see the audience.
  • Put the (Skype) window that shows your audience under the camera. This way you’ll look most natural on the other side of the teleconference.
  • Starting a presentation in Powerpoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down- buttons in my presentation remote control work with the Reader. The “make screen black” button doesn’t.
  • I open a “light table view” of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes using a “real” presentation program, but it is good enough.
  • Stand up! Usually, we stand up when we present in front of a live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk which allows me to stand up and to raise the camera to my face level. If you can’t raise the camera, stay sitting. You don’t want your audience staring at your groin.

Auditorium in Chisinau showing me on their screen

 

The best productivity system I know

I am an awful procrastinator. I realized that many years ago. Once I did, I started searching for productivity tips and systems. Of course, most of these searches are another form of procrastination. After all, it’s much more fun to read about productivity than to write that boring report. In 2012, I discovered a TiddlyWiki that implements AutoFocus, a system developed by Mark Forster (AutoFocus instructions: link, TiddlyWiki page link).

I loved the simplicity of that system and used it for a while. I also started following Mark Forster’s blog. Pretty soon after that, Mark published another, even simpler version of that system, which he called “The Final Version.” I loved it even better and readily adopted it. For many reasons, I moved from TiddlyWiki to Trello and made several personal adjustments to the system.

At some point, I read “59 seconds,” in which the psychologist Richard Wiseman summarizes many psychological studies in the fields of happiness, productivity, decision making, etc. From that book, I learned about the power of writing things down. It turns out that when you write things down, your brain gets a better chance to analyze your thoughts and to make better decisions. I also learned from other sources about the importance of disconnecting from the Internet several times a day. So, in November 2016, I made a transition from an electronic productivity system to an old-school notebook. In the beginning, I decided to keep that notebook as a month-long experiment, but I loved it very much. Since then, I have always had my analog productivity system and introspection device with me. Today, I started my sixth notebook. I love my system so much that I am actually considering writing a book about it.

Blank notebook page with #1 in the page corner
The first page of my new notebook. The notebook is right-to-left since I write in Hebrew

Once again on becoming a data scientist

My stand on learning data science is known: I think that learning “data science” as a career move is a mistake. You may read this long rant of mine to learn why I think so. This doesn’t mean that I think that studying data science, in general, is a waste of time.

Let me explain this confusion. Take this blogger for example https://thegirlyscientist.com/. As of this writing, “thegirlyscientist” has only two posts: “Is my finance degree useless?” and “How in the world do I learn data science?“. This person (whom I don’t know) seems to be a perfect example of someone who may learn data science tools to solve problems in their professional domain. This is exactly how my professional career evolved, and I consider myself very lucky about that. I’m a strong believer that successful data scientists outside academia should evolve either from domain knowledge to data skills or from statistical/CS knowledge to domain-specific skills. Learning “data science” as a collection of short courses, without deep knowledge in some domain, is, in my opinion, a waste of time. I constantly doubt myself in this respect, but I haven’t seen enough evidence to change my mind. If you think I am missing some point, please correct me.

 

 

The case of meaningless comparison

Exposé, an Australia-based data analytics company, published a use case in which they analyze the benefits of a custom-made machine learning solution. The only piece of data in their report [PDF] was a graph that shows the observed and the predicted values.

Screenshot that shows two time series curves: one for the observed and one for the predicted values

Graphs like this one provide an easy-to-digest overview of the data but are meaningless with respect to our ability to judge model accuracy. When predicting values of time series, it is customary to use all the available data to predict the next step. In cases like that, “predicting” the next value to be equal to the last available one will result in an impressive correlation. Below, for example, is my “prediction” of Apple stock price. In my model, I “predict” tomorrow’s prices to be equal to today’s closing price plus random noise.

Two curves representing two time series - Apple stock price and the same data shifted by one day

Look how impressive my prediction is!

I’m not saying that Exposé constructed a nonsense model. I have no idea what their model is. I do say, however, that their communication is meaningless. In many time series, such as consumption dynamics, stock prices, etc., each value is a function of the previous ones. Thus, the “null hypothesis” of each modeling attempt should be that of a random walk, which means that we should not compare the actual values but rather the changes. And if we do that, we will see the real nature of the model. Below is such a graph for my pseudo-model (zoomed to the last 20 points).

diff_series

 

Suddenly, my bluff is evident.

To sum up, a direct comparison of observed and predicted time series can only be used as a starting point for a more detailed analysis. Without such an analysis, this comparison is nothing but a meaningless illustration.
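For those who want to play with this idea, here is a small self-contained sketch. It does not use the Apple data from the graphs above. Instead, it assumes a synthetic random walk, and all the names in it (`prices`, `predicted`, `pearson`, `diffs`) are mine.

```python
import math
import random

random.seed(0)

# Synthetic "price" series: a random walk that starts at 100.
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] + random.gauss(0, 1))

# Naive "model": tomorrow's price equals today's price plus random noise.
predicted = [p + random.gauss(0, 0.5) for p in prices[:-1]]
observed = prices[1:]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def diffs(series):
    """Day-to-day changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]


corr_levels = pearson(observed, predicted)
corr_changes = pearson(diffs(observed), diffs(predicted))
print(f"correlation of the values:  {corr_levels:.2f}")
print(f"correlation of the changes: {corr_changes:.2f}")
```

The values correlate almost perfectly, while the changes are practically uncorrelated, which is exactly what one expects from a “model” that knows nothing.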

I should read more about procrastination. Maybe tomorrow.

You’ve been there: you need to complete a project, submit a report, or document your code. You know how important all these tasks are, but you can’t find the power to do so. Instead, you’re researching those nice pictures the Opportunity rover sent to the Earth, typing random letters in Google to see where they will lead you, tidying up your desk, or making another cup of coffee. You are procrastinating.

Because I procrastinate a lot, and because I have several important tasks to complete, I decided to read more about the psychological background of procrastination. I went to Google Scholar and typed “procrastination.” One of the first results was a paper with a promising title. “The Nature of Procrastination: A Meta-Analytic and Theoretical Review of Quintessential Self-Regulatory Failure” by Piers Steel. Why was I intrigued by this paper? First of all, it’s a meta-analysis, meaning that it reviews many previous quantitative studies. Secondly, it promises a theoretical review, which is also a good thing. So, I decided to read it. I started from the abstract, and here’s what I see:

Strong and consistent predictors of procrastination were task aversiveness, task delay, self-efficacy, and impulsiveness, as well as conscientiousness and its facets of self-control, distractibility, organization, and achievement motivation.

Hmmm, isn’t this the very definition of procrastination? Isn’t this sentence similar to “A strong predictor of obesity is a high ratio of a person’s weight to their height”? Now, I’m really intrigued. I am sure that reading this paper will shed some light, not only on the procrastination itself but also on the self-assuring sentence. I definitely need to read this paper. Maybe tomorrow.

 

PS. After writing this post, I discovered that the paper’s author, Piers Steel, has a blog dedicated to “procrastination and science” https://procrastinus.com/. I will read that blog too. But not today.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice the actual number (1:03). The green bar corresponds to 4:20 minutes rather than 1:35, and the light-blue one to approximately seven minutes rather than 2:39, as its label says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
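Here is one way to quantify it, as a sketch. I use Edward Tufte’s definition of the lie factor: the size of the effect shown in the graphic divided by the size of the effect in the data. I take the 61 seconds of the orange bar as the baseline. The bar values are the digitized estimates from above, and the matching of bar colors to the stated waiting times is my assumption.

```python
def lie_factor(shown, actual, baseline):
    """Tufte's lie factor: (effect shown in the graphic) / (effect in the data)."""
    shown_effect = (shown - baseline) / baseline
    actual_effect = (actual - baseline) / baseline
    return shown_effect / actual_effect


baseline = 61  # seconds, the orange bar

# bar color: (seconds implied by the bar height, seconds stated in the label)
bars = {
    "dark blue": (123, 63),     # label says 1:03
    "green": (260, 95),         # label says 1:35 (assumed color mapping)
    "light blue": (420, 159),   # label says 2:39, bar is roughly seven minutes
}

factors = {
    color: lie_factor(shown, actual, baseline)
    for color, (shown, actual) in bars.items()
}
for color, factor in factors.items():
    print(f"{color}: lie factor {factor:.1f}")
```

A lie factor of one means an honest graph. Everything far above one is exaggeration.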

 

 

 

Never read reviews before reading a book (except for this one). On “Surely You’re Joking, Mr. Feynman!”

Several people suggested that I read “Surely You’re Joking, Mr. Feynman!“. That is why, when I got my new Kindle, “Surely You’re Joking, Mr. Feynman!” was the first book I bought.
Richard Feynman was a theoretical physicist who shared the Nobel Prize. From reading the book, I discovered that Feynman was also a drummer, a painter, an expert on Native American mathematics, a safecracker, a samba player, and an educator. The more I read this book, the more astonished I was by Feynman’s personality and his story.

When I was halfway through the book, I decided to read the Amazon reviews. When reading reviews, I tend to look for the one- and two-star ones, to seed my critical thinking. I wish I hadn’t done that. The reviewers were talking about what an arrogant and self-bragging man Feynman was, and how terrible it must have been to work with him. I almost stopped reading the book after being exposed to those reviews.

Admittedly, Richard Feynman never missed an opportunity to brag about himself and to emphasize how many achievements he made without meaning to do so, almost by accident. Every once in a while, he mentioned the many people who were much better than him in the particular field that he managed to conquer. I call this pattern self-bragging modesty, and it is typical of many successful people. Nevertheless, given all his achievements, I think that Feynman earned the right to brag. Being proud of your accomplishments isn’t arrogance; it is a natural thing to do. “Surely You’re Joking, Mr. Feynman!” is fun to read, very informative, and inspirational. I think that everyone who calls themselves a scientist or considers becoming one should read this book.

P.S. After completing the book, I took some time to watch several of Feynman’s lectures on YouTube. It turned out that besides being a good physicist, Feynman was also a great teacher.

Is Data Science a Science?

Is Data Science a Science? I think that there is no data scientist who doesn’t ask himself or herself this question once in a while. I recalled this question today when I watched a fascinating lecture, “Theory, Prediction, Observation,” given by Richard Feynman in 1964. For those who don’t know, Richard Feynman was a physicist who won the Nobel Prize, and who is considered one of the greatest explainers. In that particular lecture, Prof. Feynman talked about science as a sequence of Guess ⟶ Compute Consequences ⟶ Compare to Experiment.

Richard Feynman in front of a blackboard that says: Guess ⟶ Compute Consequences ⟶ Compare to Experiment

This is exactly what we do when we build models: we first guess what the model should be, then compute the consequences (i.e., fit the parameters), and finally evaluate our models against observations.

My favorite quote from that lecture is

… and therefore, experiment produces troubles, every once in a while …

I strongly recommend watching this lecture. It’s one hour long, so if you don’t have time, you may listen to it while commuting. Feynman is so clear, you can get most of the information by ear only.

 

 

Does chart junk really damage the readability of your graph?

Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink out of all the “ink” in a graph should be as high as possible, preferably, 100%.
Everyone who considers themselves serious about data visualization knows (either formally or intuitively) about the importance of keeping the data-ink ratio high, the merits of a high signal-to-noise ratio, and the need to keep the “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One such study was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

In the study presented by J.D. Kelly, the researchers showed a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was exposed to similar graphs that underwent rigorous removal of all the possible “chart junk.” One such example is shown below.

Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

Unexpectedly enough, there was no difference between the error rates of the two groups. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

I don’t remember how this report got into my “to read” folder. I am amazed I had never heard about it. So, what is my takeaway from this study? It doesn’t mean we need to abandon the data-ink ratio altogether. It does not provide an excuse to add “chart junk” to your charts “just because”. It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has some limitations. Edward Tufte said, “Above all, show data.” My advice is “Show data, enough data, and mostly data.” Your goal is to convey a message. If some decoration (a.k.a. chart junk) makes your message more easily digestible, so be it.

On statistics and democracy, or why exposing a fraud may mean nothing

The “stat” in the word “statistics” means “state”, as in “government/sovereignty”. Statistics was born as a state effort to use data to rule a country. Even today, every country I know has its own statistics authority. For many years, many governments have been hiding the true statistics from the public, under the assumption that knowledge means power. I was reminded of this after reading Charles Earl’s (my teammate) post “Mathematicians, rock the vote!“, in which he encourages mathematicians to fight gerrymandering. Gerrymandering is a dubious practice in the American voting system, where a regulatory body forms voting districts in such a way that the party that appointed that body has the highest chance to win. Citing Charles:

It is really heartening that discrete geometry and other branches of advanced mathematics can be used to preserve democracy

I can’t share Charles’s optimism. In the past, statistics have been successfully used several times to expose election frauds in Russia (see, for example, these two links, but there are many more [one] [two]). People went to the streets, waving posters such as “We don’t believe Churov [a Russian politician], we believe Gauss.”

Demonstration in Russia. Poster: "We don't believe Churov. We believe Gauss"
“We don’t believe Churov. We believe Gauss”. Taken from Anatoly Karlin’s site http://akarlin.com/2011/12/measuring-churovs-beard/

Why, then, am I not optimistic? After all, even the great Terminator, one of my favorite Americans, Arnold Schwarzenegger fights gerrymandering.

schwarznegger-on-the-gerrymandering-problem-00025416-super-169.jpg

The problem is not that Americans don’t know how to eliminate gerrymandering. The information is there, the solution is known [ref, as an example]. In theory, it is a very easy problem. In practice, however, power, even more than drugs and sex, is addictive. People don’t tend to give up their power easily. What happened in Russia, after an election fraud was exposed using statistics? Another election fraud. And then yet another. What will happen in the US? I’m afraid that nothing will change there either.

 

What is the best way to handle command line arguments in Python?

The best way to handle command line arguments with Python is defopt. It works like magic. You write a function, add a proper docstring using any standard format (I use [numpy doc]), and see the magic


import defopt

def main(greeting, *, count=1):
    """Display a friendly greeting.

    :param str greeting: Greeting to display
    :param int count: Number of times to display the greeting
    """
    for _ in range(count):
        print(greeting)

if __name__ == '__main__':
    defopt.run(main)

 

You have:

  • help string generation
  • data type conversion
  • default arguments
  • zero boilerplate code

Magic!
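To appreciate how much boilerplate this saves, here is roughly what the same interface could look like with argparse from the standard library. This is my own sketch for comparison, not code generated by defopt. The helper name `build_parser` is made up, and I pass an explicit argument list to parse_args so that the example runs without a real command line.

```python
import argparse


def main(greeting, *, count=1):
    """Display a friendly greeting."""
    for _ in range(count):
        print(greeting)


def build_parser():
    # Everything below repeats information that defopt reads from
    # the function signature and the docstring.
    parser = argparse.ArgumentParser(description="Display a friendly greeting.")
    parser.add_argument("greeting", type=str, help="Greeting to display")
    parser.add_argument("--count", type=int, default=1,
                        help="Number of times to display the greeting")
    return parser


# Simulates a command line with the arguments: Hello --count 2
args = build_parser().parse_args(["Hello", "--count", "2"])
main(args.greeting, count=args.count)
```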

Illustration: the famous XKCD

Measuring the wall time in Python programs

[UPDATE Feb 2020]: TicToc is now a package. See this post.

Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that performs time measurement was many years ago when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
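If you don’t want to follow the link, below is a minimal sketch of what such a mechanism may look like. It is not the code from the gist. The class layout and the context manager support are my assumptions for this illustration. It relies on time.perf_counter from the standard library.

```python
import time


class TicToc:
    """Minimal tic-toc timer that also works as a context manager."""

    def __init__(self):
        self._start = None
        self.elapsed = None

    def tic(self):
        """Start (or restart) the clock."""
        self._start = time.perf_counter()

    def toc(self):
        """Return the number of seconds that passed since the last tic()."""
        if self._start is None:
            raise RuntimeError("toc() called before tic()")
        self.elapsed = time.perf_counter() - self._start
        return self.elapsed

    def __enter__(self):
        self.tic()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.toc()
        return False  # do not swallow exceptions


with TicToc() as timer:
    total = sum(i * i for i in range(100_000))
print(f"elapsed: {timer.elapsed:.4f} s")
```

The explicit tic() and toc() calls mimic the Matlab style, while the with-block is handy when the timed code is a single well-defined step.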

Why should bar charts always start at zero?

In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How can anyone tell me how to start my bar chart? The Paper/Screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

Data visualization is a language. Like any language, data visualization has its set of rules,  grammar if you wish. Like in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try respecting grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

Nathan Yau from flowingdata.com has a very informative post

Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

that explores this exact point. Read it.

Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

Also, remember that the zero point has to be a meaningful one. That is why you cannot use a bar chart to depict temperatures: unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice of the temperature scale.

Yet another thing to remember is that

It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

(citing Nathan Yau)

Gender salary gap in the Israeli high-tech — now the code

Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech. Like 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings. The original one uses a slightly customized style.

 

The Monty Hall Problem simulator

A couple of days ago, I told my oldest daughter about the Monty Hall problem, the famous probability puzzle with a counter-intuitive solution. My daughter didn’t believe me. Even when I told her all about the probabilities, the added information, and the other stuff, she still couldn’t “feel” it. I looked for an online simulator and couldn’t find anything that I liked. So, I decided to create a simulation Jupyter notebook.

Illustration: Screenshot of a Jupyter notebook that shows the output of one round of Monty Hall simulation

I’ve uploaded the notebook to GitHub, in case someone else wants to play with it [link].
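For the impatient, here is a bare-bones version of such a simulation. This is a sketch written for this post, not the code from the notebook, and the function names `play` and `win_rate` are mine.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that hides neither the car nor the player's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, rounds=10_000, seed=0):
    """Fraction of rounds won with the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(rounds)) / rounds


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay:   {stay_rate:.3f}")
print(f"switch: {switch_rate:.3f}")
```

With enough rounds, staying wins about one third of the time and switching about two thirds, which is the counter-intuitive answer.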

In defense of double-scale and double Y axes

If you had a chance to talk to me about data visualization, you know that I dislike the use of double Y-axis for anything except for presenting different units of the same measurement (for example inches and meters). Of course, I’m far from being a special case. A ban on double axes is the standard stance among people in the field of data visualization education. Nevertheless, double-scale axes (mostly Y-axis) are commonly used both in popular and technical publications. One of my data visualization students at the Azrieli College of Engineering in Jerusalem told me that he continually uses double Y scales when he designs dashboards that are displayed on a tiny screen in a piece of sophisticated hardware. He claimed that it was impossible to split the data into separate graphs, due to space constraints, and that the engineers who consume those charts are professional enough to overcome the shortcomings of the double scales. I couldn’t find any counter-argument.

When I tried to clarify my position on that student’s problem, I found an interesting article by Financial Times commentator John Authers, called “Lies, Damned Lies and Statistics.” In this article, John Authers reviews the many problems a double scale can create. He also shows different alternatives (such as normalization). However, at the end of that article, he also provides strong and valid arguments in favor of the moderate use of double scales. John Authers notices the strange dynamics of two metrics:

A chart with two Y axes - one for EURJPY exchange rate and the other for SPX Index
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

It is extraordinary that two measures with almost nothing in common with each other should move this closely together. In early 2007 I noticed how they were moving together, and ended up writing an entire book attempting to explain how this happened.

It is relatively easy to modify chart scales so that “two measures with almost nothing in common […] move […] closely together”. However, it is hard to argue with the fact that it was the double scale chart that triggered that spark in the commentator’s head.  He acknowledges that normalizing (or rebasing, as he puts it) would have resulted in a similar picture

Graph that depicts the dynamics of two metrics, brought to the same scale
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

But

However, there is one problem with rebasing, which is that it does not take account of the fact that a stock market is naturally more variable than a foreign exchange market. Eye-balling this chart quickly, the main message might be that the S&P was falling faster than the euro against the yen. The more important point was that the two were as correlated as ever. Both stock markets and foreign exchange carry trades were suffering grievous losses, and they were moving together — even if the S&P was moving a little faster.
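To make the term concrete: rebasing means dividing each series by its first value, so that all the series start at the same point (usually 100). Below is a small sketch with made-up numbers. They are not taken from the article, and the names `rebase`, `index_a`, and `rate_b` are mine.

```python
def rebase(series, base=100.0):
    """Scale a series so that its first value equals `base`."""
    first = series[0]
    return [base * value / first for value in series]


# Made-up numbers for illustration only.
index_a = [1500.0, 1450.0, 1300.0, 1200.0]   # think of a stock index
rate_b = [160.0, 158.0, 152.0, 148.0]        # think of an exchange rate

rebased_a = rebase(index_a)
rebased_b = rebase(rate_b)
print([round(v, 1) for v in rebased_a])
print([round(v, 1) for v in rebased_b])
```

Both series now share one axis, but the more volatile one seems to fall much faster, even though the two move in the same direction at every step. This is the side effect the quote above describes.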

I am not a financial expert, so I can’t find an easy alternative that will provide the insight John Authers is looking for while satisfying my purist desire to refrain from using double Y axes. The real question, however, is whether such an alternative is something one should be looking for. In many fields, double scales are the accepted language. Thanks to the standard language, many domain experts are able to exchange ideas and discover novel insights. Reinventing the language might bring more harm than good. Thus, my current recommendations regarding double scales are:

Avoid double scales when possible, unless it’s a commonly accepted practice, in which case be persistent and don’t lie.
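The rebasing issue quoted above can be illustrated with a small sketch. The numbers below are made up for illustration (they are not real EURJPY or SPX values), and the helper names `rebase` and `pearson` are mine. The sketch rebases both series to 100 and shows that the more volatile series appears to fall faster, while the correlation between the two stays the same.

```python
from math import sqrt


def rebase(series, base=100.0):
    """Scale a series so that its first value equals `base`."""
    first = series[0]
    return [base * value / first for value in series]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, for illustration only
fx = [160.0, 158.0, 155.0, 157.0, 150.0]
stocks = [1500.0, 1450.0, 1380.0, 1420.0, 1270.0]

fx_rebased = rebase(fx)
stocks_rebased = rebase(stocks)

# How far each rebased series fell from its starting point of 100
fx_drop = 100.0 - fx_rebased[-1]
stocks_drop = 100.0 - stocks_rebased[-1]

# Rebasing is a linear rescaling, so it does not change the correlation
corr_raw = pearson(fx, stocks)
corr_rebased = pearson(fx_rebased, stocks_rebased)
```

In this toy example, the rebased chart would emphasize that the second series drops more, even though the two series move together almost perfectly. This is only a sketch of the effect described in the quote, not a recommendation for a specific chart type.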

 

What is the best way to collect feedback after a lecture or a presentation?

I consider teaching and presenting an integral part of my job as a data scientist. One way to become better at teaching is to collect feedback from the learners. I tried different ways of collecting feedback: passing a questionnaire, Polldaddy surveys or Google forms, or simply asking (no, begging) the learners to send me an e-mail with the feedback. Nothing really worked.  The response rate was pretty low. Moreover, most of the feedback was a useless set of responses such as “it was OK”, “thank you for your time”, “really enjoyed”. You can’t translate this kind of feedback to any action.

Recently, I figured out how to collect the feedback correctly. My recipe consists of three simple ingredients.

Collecting feedback. The recipe.

working time: 5 minutes

Ingredients

  • Open-ended mandatory questions: 1 or 2
  • Post-it notes: 1 – 2 per learner
  • Preventive amnesty: to taste

Procedure

Our goal is to collect constructive feedback. We want to improve and thus are mainly interested in aspects that didn’t work well. In other words, we want the learners to provide constructive criticism. Sometimes, we may learn from things that worked well. You should decide whether you have enough time to ask for positive feedback. If your time is limited, skip it. Criticism is more valuable than praise.

Pass post-it notes to your learners.

Next, start with preventive amnesty, followed by mandatory questions, followed by another portion of preventive amnesty. This is what I say to my learners.

[Preventive amnesty] Criticising isn’t easy. We all tend to see criticism as an attack and to react accordingly. Nobody likes to be attacked, and nobody likes to attack. I know that you mean well. I know that you don’t want to attack me. However, I need to improve.

[Mandatory question] Please, write at least two things you would improve about this lecture/class. You cannot pass on this task. You are not allowed to say “everything is OK”. You will not leave this room unless you hand me a post-it with two things you liked the least about this class/lecture.

[Preventive amnesty] I promise that I know that you mean well. You are not attacking me, you are giving me a chance to improve.

That’s it.

When I teach using the Data Carpentry methods, each of my learners already has two post-it notes that they use to signal whether they are done with an assignment (green) or are stuck with it (red). In these cases, I ask them to use these notes to fill in their responses — one post-it note for the positive feedback, and another one for the criticism. It always works like a charm.

A pile of green and red post-it notes with feedback on them

 

Data is the new

I stumbled upon a rant titled  Data is not the new oil — Tech Insights

You’ve heard it many times and so have I: “Data is the new oil” Well it isn’t. At least not yet. I don’t care how I get oil for my car or heating. I simply decide what to cook and where to drive when I want. I’m unconcerned which mechanism is used to refine oil […]

Funny, in my own rant “data is not the new gold”, I claimed that “oil” was a better analogy for data than gold. Obviously, any “X is the new Y” sentence is problematic, but it’s still funny how we like them.

Yes, your friends are more successful than you are. On “The Majority Illusion in Social Networks”

Recently, I re-read “The Majority Illusion in Social Networks” (by Lerman, Yan and Wu).

The starting point of this paper is the friendship paradox: a situation in which a node in a network has fewer friends than its friends have, on average. The authors expand this paradox to what they call “the majority illusion”: a situation in which a node may observe that the majority of its friends have a particular property, despite the fact that such a property is rare in the entire network.

An illustration of the “majority illusion” paradox. The two networks are identical, except for which three nodes are colored. These are the “active” nodes and the rest are “inactive.” In the network on the left, all “inactive” nodes observe that at least half of their neighbors are “active,” while in the network on the right, no “inactive” node makes this observation.

Besides pointing out the existence of the majority illusion phenomenon, the authors used synthetic networks to characterize the situations in which this phenomenon is most prevalent.

 

Quoting the authors:

the paradox is stronger in networks in which the better-connected nodes are active, and also in networks with a heterogeneous degree distribution. […] The paradox is strongest in networks where low degree nodes have the tendency to connect to high degree nodes. […] Activating the high degree nodes in such networks biases the local observations of many nodes, which in turn impacts collective phenomena

The conditions listed in the quote above describe a lot of known social networks. The last sentence in that quote is of special interest. It explains the contagious nature of many actions, from sharing a meme to buying a new car.
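Both the friendship paradox and the majority illusion described above are easy to reproduce on a toy network. The sketch below is my own illustration, not the authors’ code: it uses a hypothetical star network (one hub connected to nine leaves), and the function names are made up. Following the figure caption, a node is counted as “fooled” when at least half of its neighbors are active.

```python
def majority_illusion_fraction(neighbors, active):
    """Fraction of inactive nodes that see at least half of their neighbors active."""
    inactive = [node for node in neighbors if node not in active]
    fooled = 0
    for node in inactive:
        friends = neighbors[node]
        active_friends = sum(1 for friend in friends if friend in active)
        if friends and 2 * active_friends >= len(friends):
            fooled += 1
    return fooled / len(inactive)


def friendship_paradox_fraction(neighbors):
    """Fraction of nodes that have fewer friends than their friends have on average."""
    count = 0
    for node, friends in neighbors.items():
        if not friends:
            continue
        mean_friend_degree = sum(len(neighbors[f]) for f in friends) / len(friends)
        if len(friends) < mean_friend_degree:
            count += 1
    return count / len(neighbors)


# Hypothetical star network: node 0 is the hub, nodes 1..9 are the leaves
star = {0: list(range(1, 10))}
for leaf in range(1, 10):
    star[leaf] = [0]

# Only one node out of ten is active in both scenarios
hub_active = majority_illusion_fraction(star, active={0})   # the high-degree node is active
leaf_active = majority_illusion_fraction(star, active={1})  # a low-degree node is active
paradox = friendship_paradox_fraction(star)
```

When the hub is the single active node, every inactive node sees only active neighbors, even though just 10% of the network is active. When a leaf is the active node, no inactive node is fooled. This mirrors the quote: activating the high degree nodes biases the local observations of many nodes.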

 

Gender salary gap in the Israeli high-tech

A large and popular Israeli Facebook group, “The High-Tech Troubles,” has recently surveyed its participants. The responders provided personal, demographic, and professional information. The group owners have published the aggregated results of that survey. In this post, I analyze a particular aspect of these findings, namely, how the responders’ gender and experience affect their salary. It is worth noting that this survey is by no means a representative one. Its most noticeable, but not its only, problem is participation bias. Another problem is that the result tables do not contain any information about the number of responders in any group. Without this information, it is impossible to compute confidence intervals for any of the findings. Despite these problems, the results are interesting and worth noting.

The data that I used in my analysis is available in this spreadsheet. The survey organizers promise that they excluded groups and categories with too few answers, and we have to trust them on that. The results are divided into twenty professional categories such as ‘Account Management,’ ‘Data Science’, ‘Support’ and ‘CXO’ (which stands for an executive role). The salary groups are organized in roughly exponential bins according to the years of experience: 0-1, 1-2, 2-4, 4-7, and more than seven years. Some of the cell values are missing; I assume that these are the categories with too few responders. I took a look at the gap between the salary reported by women and the compensation reported by men.

Let’s take a look at the most complete set of data — the groups of people with 1-2 years of experience. As we may see from the figure below, in thirteen out of twenty groups (65%), women get less money than men.
Gender compensation gap, 1-2 years of experience. Women earn less in 13 of 20 categories

Among the workers with 1 – 2 years of experience, the most discriminating fields are executives and security researchers. It is interesting to note the difference between two closely related fields: Data Science and BI/Data Analysts. The former is considered a more lucrative position. On average, the male data scientists get 11% more than their female colleagues, while male data analysts get 13% less than their female counterparts. I wonder how this difference relates to my (very limited) observation that most of the people who call themselves BI experts are women, while most of the data scientists whom I know are men.

As we have seen, there is not much gender equality for the young professionals. What happens when people gain experience? How does the gender compensation gap change after more than seven years of professional life? The situation is even worse. In fourteen out of sixteen available fields, women get less money than men. The only field in which it pays to be a woman is the executive roles, where the women get 19% more than the men.

Gender compensation gap, more than 7 years of experience. Women earn less in 14 of 16 categories

To complete the picture, let’s look at the gap dynamics over the years in all the occupation fields in that report.

Gender gap dynamics. 20 professional fields over different experience bins

What do we learn from these findings?

These findings are real. We cannot use the non-representativity of these data and the lack of confidence intervals to dismiss these findings. I don’t expect the participants to lie, nor do I expect any bias from the participation patterns. It is true that I can’t obtain confidence intervals for these results. However, the fact that the vast majority of the groups lie on one side of the equality line suggests the overall validity of the gender gap notion. How can we fix this gap? I frankly don’t know. As a father of three daughters (9, 12, and 14 years old), I talk to them about this gap. I make sure they are aware of this problem so that, when it’s their turn to negotiate compensation, they are aware of the systematic bias. I hope that this knowledge will give them the right tools to fight for justice.
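The “most groups lie on one side of the equality line” argument above can be made slightly more concrete with a sign test. The sketch below is my own addition and rests on strong assumptions: it treats every professional category as an independent, equally weighted observation, it ignores the size of the gap, and it does nothing about participation bias. The function name is made up. It only asks how likely it would be to see at least this many categories on the “women earn less” side if each category were a fair coin flip.

```python
from math import comb


def sign_test_p_value(k, n):
    """One-sided probability of k or more 'heads' in n fair coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Counts taken from the two charts above
p_junior = sign_test_p_value(13, 20)  # 1-2 years of experience
p_senior = sign_test_p_value(14, 16)  # more than 7 years of experience
```

Under these assumptions, the probability is about 0.13 for the 1-2 years group and about 0.002 for the most experienced group. This is a rough sanity check, not a replacement for the confidence intervals that the missing group sizes prevent us from computing.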

Don’t take career advice from people who mistreat graphs this badly

Recently, I stumbled upon a report called “Understanding Today’s Chief Data Scientist” published by an HR company called Heidrick & Struggles. This document tries to draw a profile of the modern chief data scientist in today’s Big Data Era. This document contains the ugliest pieces of data visualization I have seen in my life. I can’t think of a more insulting graphical treatment of data. Publishing graphs like these in a document that tries to discuss careers in data science is like writing a profile of a Pope candidate while accompanying it with pornographic pictures.

Before explaining my harsh attitude, let’s first ask an important question.

What is the purpose of graphs in a report?

There are only two valid reasons to include graphs in a report. The first reason is to provide a meaningful glimpse into the document. Before a person decides whether they want to read a long document, they want to know what it is about, what methods were used, and what the results are. The best way to engage the potential reader is to provide them with a set of relevant graphs (a good abstract or introduction paragraph helps too). The second reason to include graphs in a document is to provide details that cannot be effectively communicated by text-only means.

That’s it! Only two reasons. Sometimes, we might add an illustration or two, to decorate a long piece of text. Adding illustrations might be a valid decision provided that they do not compete with the data and it is obvious to any reader that an illustration is an illustration.

Let the horror begin!

The first graph in the H&S report struck me with its absurdity.

Example of a bad chart. I have no idea what it means

At first glance, it looks like an overly-artistic doughnut chart. Then, you want to understand what you are looking at. “OK”, you say to yourself, “there were 100 employees who belonged to five categories. But what are those categories? Can someone tell me? Please?” Maybe the report references this figure with more explanations? Nope. Nothing. This is just a doughnut chart without a caption or a title. Without a meaning.

I continued reading.

Two more bad charts. The graphs are meaningless!

OK, so the H&S geniuses decided to hide the origin of their bar charts. Had they been students in a dataviz course I teach, I would have given them a zero. Ooookeeyy, it’s not a college assignment. As long as we can reconstruct the meaning from the numbers and the labels, we are good, right? I tried to do just that and failed. I tried to use the numbers in the text to help me fill in the missing information and failed. All in all, these two graphs are meaningless graphical junk, exactly like the first one.

The fourth graph gave me some hope.

Not an ideal pie chart but at least we can understand it

Sure, this graph will not get the “best dataviz” award, but at least I understand what I’m looking at. My hope was premature. The next graph was as nonsensical as the first three.

Screenshot with an example of another nonsense graph

Finally, the report authors decided that it wasn’t enough to draw smart-looking color segments enclosed in a circle. They decided to add some cool-looking lines. The authors remained faithful to their decision to not let any meaning into their graphical aids.

Screenshot with an example of a nonsense chart

Can’t we treat these graphs as illustrations?

Before co-founding the life-changing StackOverflow, Joel Spolsky was, among other things, an avid blogger. His blog, JoelOnSoftware, was the first blog I started following. Joel writes mostly about the programming business. In order not to intimidate the readers with endless text blocks, Joel tends to break the text with illustrations. In many posts, Joel uses pictures of a cute Husky as an illustration. Since JoelOnSoftware isn’t a cynology blog, nobody gets confused by the sudden appearance of a Husky. Which is exactly what an illustration is: a graphical relief that doesn’t disturb. But what would happen if Joel decided to include a meaningless class diagram? Sure, a class diagram may impress the readers. The readers will also want to understand it and its connection to the text. Once they fail, they will feel angry, and rightfully so.

Two screenshots of Joel's blog. One with a Husky, another one with a meaningless diagram

The bottom line

The bottom line is that people have to respect the rules of the domain they are writing about. If they don’t, their opinion cannot be trusted. That is why you should not take any advice related to data (or science) from H&S. Don’t get me wrong. It’s OK not to know the “grammar” of all the possible business domains. I, for example, know nothing about photography or dancing; my English is far from perfect. That is why I don’t write about photography, dancing or creative writing. I write about data science and visualization. It doesn’t mean I know everything about these fields. However, I did study a lot before I decided I could write something without making a fool of myself. So should everyone.

 

AI and the War on Poverty, by Charles Earl

It’s such a joy to work with smart and interesting people. My teammate,  Charles Earl, wrote a post about machine learning and poverty. It’s not short, but it’s worth reading.

A.I. and Big Data Could Power a New War on Poverty is the title of an op-ed in today’s New York Times by Elisabeth Mason. I fear that AI and Big Data is more likely to fuel a new War on the Poor unless a radical rethinking occurs. In fact this algorithmic War on the Poor […]

via AI and the War on Poverty — Charlescearl’s Weblog

Two heads are better than one; or how not to abandon your blog

The recording of my presentation at WordCamp Moscow (Aug 2017) is finally available online: Two Heads are Better Than One – on blogging persistence (Russian)

The Keys to Effective Data Science Projects — Operationalize

Recently, I’ve stumbled upon an interesting series of posts about effective management of data science projects.  One of the posts in the series says:

 “Operationalization” – a term only a marketer could love. It really just means “people using your solution”.

The main claim of that post is that, at some point, bringing actual users to your data science project may be more important than improving the model. This is exactly what I meant in my “when good enough is good enough” post (also available on YouTube)

Buzzword shift

Many years ago, I tried to build something that today would have been called “Google Trends for Pubmed”. One thing that I found during that process was how the emergence of HIV-related research reduced the number of cancer studies and how, several years later, the HIV research boom settled down, letting cancer research bounce back.

I recalled that project of mine when I took a look at the Google Trends data for the once-popular buzz-phrases “data mining” and “pattern recognition”. Sic transit gloria mundi.

Screenshot of Google Trends data for (in decreasing order): "Machine Learning" , "Data Science", "Data Mining", "Pattern Recognition"

It’s not surprising that “Data Science” was the least popular term in 2004. As I already mentioned, “Data Science” is a relatively new term. What does surprise me is the fact that, in the past, “Machine Learning” was so much less popular than “Data Mining”. Even more surprising is the fact that Google Trends ranks “Machine Learning” almost twice as high as “Data Science”. I was expecting to see the opposite.

“Pattern Recognition,” which in 2004 was as (not) popular as “Machine Learning”, has become even less popular today. Does that mean that nobody is searching for patterns anymore? Not at all. The 2004 pattern recognition experts are now senior data scientists or, if they work in academia, machine learning professors.

PS: does anybody know the reason behind the apparent seasonality in “Data Mining” trends?

On alert fatigue 

I developed an anomaly detection system for Automattic’s internal dashboard. When presenting this system (“When good enough is just good enough”), I used to say that, in our particular case, the cost of false alerts was almost zero. I used to justify this claim by the fact that no automatic decisions were made based on the alerts, and that the only subscribers to the alert messages were a limited group of my colleagues. Automattic’s CFO, Stu West, who was the biggest stakeholder in this project, asked me to stop making the “zero cost” claim. When the CFO of the company you work for asks you to do something, you comply. So, I stopped saying “zero cost”, but I still listed the error costs as a problem I could safely ignore for the time being. I didn’t fully believe Stu, which is evident from the speaker notes of my presentation deck:

 

Screenshot of the presentation speaker notes.
My speaker notes. Note how “error costs” was the first problem I dismissed.

 

Today, I recalled Stu’s request to stop talking about the “zero cost” of false alerts. I noticed more than 10 unread messages in the Slack channel that receives my anomaly alerts. The oldest unread message was two weeks old. The only reason this could happen is that I stopped caring about the alerts because there were too many of them. I witnessed a classic case of “alert fatigue”, described many centuries ago in “The Boy Who Cried Wolf”.

The lesson of this story is that there is no such thing as a zero-cost false alarm. Lack of specificity is a serious problem.
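A back-of-the-envelope calculation shows why specificity matters so much when real anomalies are rare. The numbers below are hypothetical and have nothing to do with the actual Automattic system; the function name and its parameters are mine. The sketch computes how many alerts are expected to be true and how many false for a given number of checks.

```python
def expected_alerts(checks, anomaly_rate, sensitivity, specificity):
    """Return expected (true alerts, false alerts, precision) for a detector."""
    anomalies = checks * anomaly_rate
    normal = checks - anomalies
    true_alerts = anomalies * sensitivity       # real anomalies that trigger an alert
    false_alerts = normal * (1 - specificity)   # normal points that trigger an alert
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision


# Hypothetical setting: 1000 checks, 1% of them are real anomalies
true_alerts, false_alerts, precision = expected_alerts(
    checks=1000, anomaly_rate=0.01, sensitivity=0.9, specificity=0.95
)
```

In this made-up setting, a detector that is right about 95% of the normal points still produces several false alerts for every true one. A reader of such a channel learns quickly that most alerts can be ignored, which is exactly the fatigue described above.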

Screenshot: me texting Stu that he was right

Feature image by Ray Hennessy

What’s the most important thing about communicating uncertainty?

Sigrid Keydana, in her post Plus/minus what? Let’s talk about uncertainty (talk) — recurrent null, said

What’s the most important thing about communicating uncertainty? You’re doing it

Really?

Here, for example, is a graph from a blog post

Thousands of randomly looking points. From https://myscholarlygoop.wordpress.com/2017/11/20/the-all-encompassing-figure/

The graph clearly “communicates” the uncertainty but does it really convey it? Would you consider the lines and their corresponding confidence intervals very uncertain had you not seen the points?

What if I tell you that there’s a 30% Chance of Rain Tomorrow? Will you know what it means? Will a person who doesn’t operate on numbers know what it means? The answer, to both these questions, is “no”, as is shown by Gigerenzer and his collaborators in a 2005 paper.

Screenshot: many images for the 2016 US elections

Communicating uncertainty is not a new problem. Until recently, the biggest “clients” of uncertainty communication research were the weather forecasters. However, the recent “data era” introduced uncertainty to every aspect of our personal and professional lives. From credit risk to insurance premiums, from user classification to content recommendation, the uncertainty is everywhere. Simply “doing” uncertainty communication, as Sigrid Keydana from the Recurrent Null blog suggested, isn’t enough. The huge public surprise caused by the 2016 US presidential election is the best evidence for that. Proper uncertainty communication is a complex topic. A good starting point for this topic is the paper Visualizing Uncertainty About the Future by David Spiegelhalter.

The Y-axis doesn’t have to be on the left

Line charts are great for conveying the evolution of a variable over time. This is a typical chart. It has three key components: the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

A typical line chart. The Y-axis is on the left

Usually, you will see the Y-axis at the left part of the graph. Unless you design for a Right-To-Left language environment, placing the Y-axis on the left makes perfect sense. However, left-side Y-axis isn’t a hard rule.

In many cases, more importance is given to the most recent data point. For example, it might be interesting to know a stock’s price dynamics, but today’s price is what determines how much money I can get by selling my stock portfolio.

What happens if we move the axis to the right?

A slightly improved version. The Y-axis is on the right, adjacent to the most recent data point

Now, today’s price of XYZ stock is visible more clearly. Let’s make the most important values explicitly clear:

The final version. The Y-axis is on the right, adjacent to the most recent data point. The axis ticks correspond to actual data points

There are two ways to obtain a right-sided Y axis in matplotlib. The first way uses a combination of

ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

The second one creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen. Thus, I use the second option. Following is the code that I used to create the last version of the graph.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate 30 days of prices as a random walk that starts near $10
np.random.seed(123)
days = np.arange(1, 31)
price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_yticks([])  # Make 1st axis ticks disappear
ax2 = ax.twinx()  # Create a secondary axis
ax2.plot(days, price, '-', lw=3)
ax2.set_xlim(1, max(days))
sns.despine(ax=ax, left=True)  # Remove 1st axis spines
sns.despine(ax=ax2, left=True, right=False)  # Keep the right spine of the 2nd axis

# Y ticks only at the minimum, the maximum, and the most recent price
tks = [min(price), max(price), price[-1]]
ax2.set_yticks(tks)
ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')

# X ticks only at the first day, the days of the extremes, and the last day
ixmin = np.argmin(price)
ixmax = np.argmax(price)
ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
ax2.set_xticklabels(['Oct, 1', f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}'])

# Thin guide lines that connect the extreme points to both axes
ylm = ax2.get_ylim()
bottom = ylm[0]
for ix in [ixmin, ixmax]:
    y = price[ix]
    x = days[ix]
    ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)
    ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)
ax2.set_ylim(ylm)  # Restore the limits that the guide lines may have changed
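For comparison, the first option mentioned above (tick_right combined with set_label_position) might look like the minimal sketch below. The prices are made up, and the Agg backend is selected only so that the snippet runs without a display. Instead of calling despine, this sketch hides the left and the top spines directly through matplotlib; I haven’t checked whether that avoids the strange behavior mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up prices, for illustration only
days = list(range(1, 8))
price = [10.0, 10.1, 9.9, 10.2, 10.4, 10.3, 10.6]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(days, price, '-', lw=3)
ax.yaxis.tick_right()                 # Move the ticks and the tick labels to the right
ax.yaxis.set_label_position("right")  # Move the axis label to the right as well
ax.set_ylabel('price [$]')

# Hide the spines that are no longer needed
for side in ('left', 'top'):
    ax.spines[side].set_visible(False)
```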

Next time you create a “something” vs time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y axis to the right part of your graph and see whether it becomes more readable.

This post was triggered by a nice write-up, Plotting a Course: Line Charts, by a new blogger, David (he didn’t mention his last name), from https://thenumberist.wordpress.com/