On proper selection of colors in graphs

Photo by Sharon McCutcheon on Pexels.com

How do you properly select a colormap for a graph? What makes the rainbow color map a wrong choice, and what are the proper alternatives?

Today, I stumbled upon a lengthy post that provides an in-depth review of the theory behind our color perception. The article concentrates on quantitative colormaps but also includes information relevant to selecting proper colors for categories. 

If you never learned the theory behind the color and are interested in data visualization, I strongly suggest investing 45-60 minutes of your life in reading this post.

Another example of the power of data visualization

I stumbled upon a great graph that tells a complex story compellingly.

Comparison of two COVID-19 waves in the UK, taken from here.

This graph compares the last two waves of COVID-19 in the United Kingdom and is shows so clearly that the new wave (that is supposedly composed of the Delta variant) is much more infections on the one hand, but on the other hand, causes much less damage. Is the more moderate damage the result of the Delta variant nature of the protective effect of the vaccination is still an open question, but the difference is still striking.

Super useful videos for advanced data visualizers

The great Robert Kosara, also known as the “eager eyes” has started publishing a series of videos he calls Chart Appreciation. In these videos, Robert takes a piece of data visualization from a reputable and known source, and discusses why this particular piece is so good, what decisions were made that made it possible, what alternatives are, and more. If you consider yourself an intermediate or advanced practitioner of data visualization, you should subscribe. Here’s one example.

One of the first dataviz blogs that I used to follow is now a book. Better Posters

I started following data visualization news and opinions quite a few years ago. One of the first bloggers who were active in this area NeurDojo, by the (now) professor Zen Faulkes. On of Zen’s spin-off blogs was devoted to better posters. This poster blog is called, surprisingly enough, Better Posters. Since I’m not in academia anymore, stopped caring about posters many years ago. Today, I stumbled upon this blog and was pleasantly surprised to discover that Better Posters is still active and that it is also now a book.

Graphical comparison of changes in large populations with “volcano plots”

I recently rediscovered a volcano plot — a scatter plot that aims to visualize changes in large populations.

Volcano plots are very technical and specialized and, most probably, are not a good fit for explanatory data visualization. However, they can be useful during the exploration phase, and they come with a set of well-established metrics.

Moreover, if you are lucky enough to have well-behaved data, the plots look very cool

Visualization of RNA-Seq results with Volcano Plot
From here

Of course, in real life, the data is messy. Add bad visualization practices to the mess and you get a marvel like this one

From here

The bottom line: if you have two populations to compare, consider volcano plots. But do remember dataviz good practices.

Before and after — stacked bar charts

A fellow data analyst asked a question? What do we do when we need to draw a stacked bar chart that has too many colors? How do we select the colors so that they are nice but also are easily distinguishable? To answer this question, let’s look at the data similar to what appeared in the original question. I also tried to recreate the actual chart’s style

So, how do we select colors?
The answer to this question is pretty complicated. To have a set of easily distinguishable colors, one needs to model the color perception in a typical human being properly. Luckily, a tool called I Want Hue that’s based on a solid theory explained here. The problem, however, isn’t in colors.

This is not the right question

Distinguishing between eight colors in a graph is a challenging task. Selecting the right color scheme might help, but it won’t solve this fundamental problem. Moreover, stacked bar plots are tricky due to another complication.

We, the humans, are somewhat good are comparing positions but not as good at comparing sizes. This is why comparing the heights of the bars is relatively easy. It is easy because the bars start at the same line, and our task is to compare the bar end position, not the bar size. Reading the heights of the lowest segment in the bars is also an easy task for the same reason: we don’t compare the sizes but the heights.

However, comparing the sizes of the middle components is more challenging. As a result, the intermediate parts of a graph don’t add useful information but rather add noise. Thus, let us explain two options. First, we will reduce the number of groups. Next, we will explore what happens when reducing the number of groups is not an option.

Option 1. Reduce the number of categories

It is hard to advise about data visualization when I don’t know what conclusion the author wants to convey. However, I am sure that in many cases, the number of categories that are relevant to the viewer is much smaller than the number of types that are relevant to the analyst. The viewer might not care about all the hard job you did while collecting the data; what they are about is an insight. For example, if we reduce the discussion to two groups: the USA and non-USA data centers, the graph becomes much more readable.

Note how two groups in a stacked graph pose no problem in deciphering the sizes. If we take care of readability and improve the data-ink ratio, we get a nice data visualization piece.

Option 2. When reducing the number of categories is not an option

But what if reducing the number of categories is not an option? If you are absolutely sure that the audience absolutely needs to see all the information, you can split the different groups into separate subgraphs.

Have you noticed that the X-axis in our case represents time? In this case, we can replace the bars with an evolution plot and create a separate chart for each category in the data set. I took special care to keep the Y-axis scale equal between all the graphs so that the viewer can easily distinguish between data centers with a lot of errors and data centers with only a few of them. Here’s the result:

But what if the overall error rate is of greater importance than the individual groups. In that case, we can plot them in a larger graph and add the separate groups below, in smaller, un-emphasized subplots.

Summary — the Why and the What define the How

When you have a technical question about improving a graph, make sure you ask yourself “why.” Why is, does technical problems matter? Why will it improve the chart? To answer this question, you will have to ask another question: “what?”. “What is it that I want to say.” The easiest way to force yourself to ask these questions is to force yourself to add titles to every graph you create (see my how to suck less in data visualization post for more details).

Once you have your conclusion ready, you will notice that you don’t need a technical solution but rather a conceptual one. In this case, we solved the technical problem of looking for eight distinct colors by reducing the number of categories to two or splitting one elaborate graph into several straightforward ones.

So, remember, the Why and the What define the How

Python code that was used to generate all this graphs is available on (https://gist.github.com/bgbg/6c645a5fc48e61b1a917c9d1d66fa72f)

Before and after: Alternatives to a radar chart (spider chart)

A radar chart (sometimes called “spider charts”) look cool but are, in fact,
pretty lame. So much so that when the data visualization author Stephen Few mentioned them in his book Show me the numbers, he did so in a chapter called “Silly graphs that are best forsaken.”

Here, I will demonstrate some of its problems, and will suggest an alternative

Before: The problems of a radar (spyder) plot

Above is my reconstruction of the original plot that I saw in a Facebook discussion. The graph looks pretty cool, I have to admit, but it is full of problems.
What are the problems of a spyder plot or a radar plot?
Let’s start with readability. Can you quickly tell the value of “Substance abuse” for the red series? Not that easy.

But a more significant problem emerges when one realizes that in most cases, the order of the categories is arbitrary and that different sorting options may result in entirely different visual pictures.

After: conclusion-based graph design

I have been continually preaching to add meaningful titles to all the graphs you are creating. (See How to suck less in data visualization and professional communication).

One of the byproducts of adding a title is the fact that when you write down your main takeaway of a graph, you force yourself to think, “does this graph show what it says it shows?” Thus, you guide yourself to better graph choices.

Let’s say that we conclude that there is no correlation between the two series of data. Is this conclusion evident from the graphs? I would say, not so much.

Instead of a radar chart, I suggest creating two aligned, horizontal graph plots. This way, we may sort one subplot according to the values, and then, correlation (or lack of thereof) will be evident.

But what if we noticed something interesting about the differences between A and B groups? If this is true, let’s show precisely this: the differences.

Notice how the bars in this version are sorted according to the difference. Sorting a bar chart is the easiest way to make it readable.

Python code that I used to create these graphs is available here https://gist.github.com/bgbg/db833db723998cd244b5049bfe01f5ac

Basic data visualization video course (in Hebrew)

I had the honor to record an introductory data visualization course for high school students as a part of the Israeli national distance learning project. The course is in Hebrew, and since it targets high schoolers, it does not require any prior knowledge.

I got paid for this job. However, when I divide the money that I received for this job by the time I spent on it, I get a ridiculously low rate. On the other hand, I enjoyed the process, and I view this as my humble donation to the public education system.
Since a government agency makes the course site, it’s UI is complete shit. For example, the site doesn’t support playlists, and the user is expected to search through the video clips by their titles. To fix that, I created a page that lists all the videos in the right order.

Text Visualization Browser

I’ve stumbled upon an exciting project — text visualization browser. It’s a web page that allows one to search for different text visualization techniques using keywords and publication time. 

Text visualization browser https://textvis.lnu.se

The ability to limit the search to various years gives a nice historical perspective on this interesting topic

This site’s information is based on a 2015 paper Text visualization techniques: Taxonomy, visual survey, and community insights. I wish the authors updated it with more recent data, though. 

The information is beautiful. The graphs are shit!

I apologize for my harsh language, but recently I was exposed to a bunch of graphs on the “information is beautiful” site, and I was offended (well, ot really, but let’s pretend I was). I mean, I’m a liberal person, and I don’t care what graphs people do in their own time. Many people visit that site because they try to learn good visualization practices, but some charts on that site are wrong. Very wrong.

Here’s the gem:

I deliberately don’t share the link to this site. I don’t want let Google think it’s valuable in any way.

Now, the geniuses from “Information is beautiful” (let’s call them IB for brevety) wanted to share with us some positive stats. How nice of them. So what they did? They gathered together nine pairs of metrics collected at two different time points: one in the past and one furthermore in history. They used nice colors to create some sleeky shapes. So, what’s the problem? What’s wrong with that?

Everything is wrong!

Let’s start from my guess that they cherry-picked the stats with “positive” changes. Secondly, the comparison of this sort is mostly meaningless if we compare points at different years. What stopped the authors of that tasteless “infographic” from collecting data from the same years? I guess, their laziness. That’s how we ended up comparing the number of death penalties in 1990 and 2016, but the malaria deaths numbers are for 2000 and 2016, and dying mothers are compared for years 2000 and 2017?

Now, let’s talk about data viz.

Take a look at this graph.

The only time we use shapes like that is when we want to convey information about uncertainty. To do that, the X-axis represents the thing we are measuring, and the Y-axis represents our certainty about the current value. When we compare to uncertain measurements, we may judge the difference between these measurements by the distance between the curve peaks, and the width of the curve represents the uncertainty.

Here’s a good example from [this link]:

Can you see how the metric of interest is on the X-axis? The width of each bell curve represents the uncertainty and the difference between any pair of cases is the difference on the horizontal (X) axis, not the vertical one.

Instead, what do the IB authors did? They obviously like sleek looking shapes but know nothing about how to use them. They could have used two bars and let the viewer compare their heights. But nooooo! Bars are not c3wl! Bars are boring! Instead, they took probability density curves (that’s how they are technically called) and made them pretend to be bars.

Bars. Is this THAT hard?

I can hear some of you saying, “Stop being so purist! What’s wrong with comparing the heights of bell curves?” I’ll tell you what’s wrong! Data visualization is a language. As with any language, it has some rules and traditions. If you hear me saying, “me go home,” you will understand me without any problem. However, you will silently judge me for my poor use of the English language. I know that, and since English is my third language, I use all the help to make as few mistakes as possible. The same is correct with data visualization. Please respect its rules and traditions, even if (and especially if) are not fluent in it.

I never write more than two sentences in English without Grammarly

Visit the worst practice tag in this blog to see more bad examples

Exploring alternatives to population pyramids

A population pyramid also called an “age-gender-pyramid”, is a graphical illustration that shows the distribution of various age groups in a population (typically that of a country or region of the world), which forms the shape of a pyramid when the population is growing [citation from Wikipedia].

In some cases, the pyramid provides interesting insights into the entire population. In this post, I will explore ways to make some of these insights more visible. 

The basic case

Let’s start with the basic case. If you have two-three hours of spare time, you can go to the site devoted to population pyramids — https://www.populationpyramid.net. There, you will find population pyramids for every country in the world. The site provides present and past data, as well as future forecasts. To understand how insightful age pyramids can be, look at the graph that represents the entire world.

(this and most other images in this post are from the site http://populationpyramid.net/)

You can clearly see that the world is mostly young, that the amount of people declines as the age progresses, and that there is a rough balance between men and women in the world, at least before the ages of 70+.

Now, examine the stark difference between the populations of Western Africa and Western Europe. Citing the late professor Hans Rosling, we can still see two worlds, one with large families and short lives, and one with small families and long lives. 

Another starking example of an age pyramid is the following

Do you want to guess what country is that? This particular graph shows the age distribution of the United Arab Emirates. Such a vast distortion in symmetry and age distribution stems from the fact that more than 80% of the UAE’s population is composed of expats who come to this rich country to work. The pyramid below (taken from [this article]) sheds some light to the population composition of UAE. (Note that the genders in this graph are reversed).

Whose bar is longer?

The male-female disbalance in the UAE and some other Gulf countries is very striking and cannot be missed. But what about other, more subtle cases? Take a look at the world graph above. If you follow the numbers on the bars, you will notice that more boys are born than girls, but there are more old ladies than old gents in the world. Can we make such differences less subtle?

To answer this question, we need to understand why we find it hard to compare almost equal bars. The reason for that is that our eyes (or brains) are not so good at comparing sizes. They do, however, do a much better job comparing positions. Thus, if we overlap these bars, we will see the small differences in a much more precise manner. 

(I thank the data visualization expert Bella Graf from InfoServiz.co.il for the idea of this graph).

Now, the subtle differences in gender composition are more visible. 

What am I looking at?

When I teach data visualization, I always tell my students to add a meaningful title to the graph. By “meaningful,” I mean a title that does not answer the question “what” but rather “so what”? (See my posts “How to suck less in data visualization,” and “C for conclusion“). What would a good title for this graph be? Let’s try the following

OK, so now, when we have a title, we can ask ourselves, “does the graph show what it says it shows”? And the answer is no. Right now, the title talks about differences, but we don’t see the differences. We see the differences and other stuff. Let’s look only at the differences.

I don’t like this.

What about this?

Now, this is not an age pyramid. That’s for sure. This graph doesn’t show the wealth of data that the classical pyramid shows. On the other hand, it does offer one thing, and it does it very well. Look, for example, at the male/female distortion in China in 1990.

You may find the code I used to create the graphs in this post [on GitHub].

ASCII histograms are quick, easy to use and to implement

From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python was required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Recently, I updated the python function that I use to create ASCII histograms. The updated function [link] uses more modern formatting and includes several signal-to-noise improvements. One can also use it with custom output functions, such as logging.info.

What is the biggest problem of the Jet and Rainbow color maps, and why is it not as evil as I thought?

There was a consensus among the data visualization purists that the rainbow color map, and it’s close cousin Jet are bad. Really bad. These colormaps used to be popular at the beginning of the computational data visualization era. However, their popularity decreased in the last five years or so. The sentiment isn’t as bad as it used to be a couple of years ago, but still.

A screenshot from circa 2016. Today we are less fanatic than that

What is the biggest problem of the rainbow colormap? The most apparent problem with this particular colormap is that it not perceptually uniform. By “perceptually uniform,” I mean that equal changes in the value that we encode using a colormap should correspond to same changes in the color perception. This is not the case with the rainbow or the Jet colormaps. They have distinct bright and dark stripes within the number range, making them the wrong choice to encode numerical data. The situation is even worse for people with impaired color vision.

Can you be less perceptually uniform?

The solution to this problem was proposed in the form of better colormaps. The first one that I know of is Parula by Matlab, and it’s opensource alternative Viridis that is available in matplotlib and many other plotting libraries. (Watch this video about viridis to get a good introduction to color perception and color maps).

Viridis, the new rainbow

Everything was nice and good, and I was trashing the rainbow colormap whenever I could. Until yesterday, when I read about Turbo, the improved rainbow colormap developed by Google.

In the long and interesting blog post that describes Turbo, Anton Mikhailov, a software engineer in Google, describes several relevant applications of a “good rainbow” scheme. 

According to Anton, “Because of rapid color and lightness changes, Jet accentuates detail in the background that is less apparent with Viridis and even Inferno. Depending on the data, some detail may be lost entirely to the naked eye. The background in the following images is barely distinguishable with Inferno (which is already punchier than Viridis), but clear with Turbo.”

I must admit that I’m convinced. 

The biggest problem with that is mentioned concerning the original rainbow scheme that its brightness varies too much. However, it turns out that the color saturation and hue attract our attention more than the lightness (here’s the reference which I haven’t read yet). As such, it makes sense to construct a colormap that relies more on color and hue changes. 

Moreover, in many cases, the interesting details appear in the extreme values of the data range, not in the middle. In thes cases, a properly applied rainbow-like color scheme becomes a valid choice.

The bottom line is that one should not refrain from using rainbow(-like) color maps in their visualizations anymore, provided that they use a modern implementation. Luckily, it’s even available in matplotlib

How to suck less in data visualization and professional communication

In technical communication, the main thing is to keep the main thing the main thing. There are multiple ways to ensure this principle. Some of these ways require careful chart fine-tuning. However, there is one tool that is easy to master, fast to apply, and that provides a high return on the investment rate. I refer to chart titles. In this talk, I had two main theses. My first thesis is that most of you suck in communication (and not only data visualization).

My second thesis is that you can quickly improve your graphs by merely adding a good title. The importance of good titles is not new to my preaching, but I thought it was an excellent thing to formalize this thesis a bit, and I’m thankful to the NDR organizers for giving me this opportunity.

Following is the slide stack from my NDR presentation.

Before and after. Even excellent graphs can be improved

Being a data visualization consultant, I can’t help looking for dataviz problems in graphs that I see. Even if the graph is good. Even if I know that I would not be able to create a graph that good. Even if the overall graph is excellent, and the problems are minor, or maybe especially when the graph is excellent, and the problems are minor.

This is a nice graph published by Nevo Benita on Linkedin.

The graph presents the gap between the men and the women in the Israeli job market. As I said, the graph is excellent. However, there are several small problems that, like grains of sand in a chocolate mousse, stand in the way. Let’s take a look at them.

The time-series line in the upper right part of the graph shows good use of the real estate. The problem is that the X-axis ticks (the years) look as if they belong to the chart below. It takes some time to realize that the numbers are years of the upper graph, and not the X-axis of the graph below. Moving the numbers upwards by several pixels would have fixed that.

Now, it is more clear that “1990” and “2018” relate to the time-series graph above.
Before (left) and after (right).

Let’s talk about the left-side bar chart. It took me a while to understand what it is. As a matter of fact, I managed to write a critique paragraph about that bar chart, how it is unclear what the percentages are, and how they were computed. Only then had I noticed the explanation below. Such confusion isn’t the viewer’s fault. Since we usually scan images from top to bottom, moving the title to the top of the chart will reduce this confusion. The word “percent” is also redundant in that title since it comes after the percent sign.

Moving the explanation to the top makes it easier to notice. Before (left) and after (right)

The last point that is worth optimizing is the color order. Consistent element order in an image makes navigation and comprehension much easier. When the order is preserved, our brain can use mental shortcuts without losing much information. When these shortcuts are broken, the brain has to work harder. What am I talking about? The graph author made the correct decision to use different font colors in the graph title to specify which color stands for which gender. This way, we don’t need a separate legend, and this is good. The title is an ordered sequence of words. The visualizer could use this order to create the order heuristic that is so helpful. Such a heuristic isn’t always possible. Fortunately for the visualizer (and sadly for the society), the salary gap in all the occupations in this graph have the same direction: men earn more than women. As a result, the rightmost part has all the green dots on the right, and the purple dots are on the left. This direction is opposite to the gender direction in the title and the color direction in the bar chart. To fix this situation, I made sure that the color that stands for the women (purple) is always to the left of the color that designates the men (green).

Keeping the color order. Before (left) and after (right)

So, this is the final result. I hope you can see why I like it better.

That’s how I took and excellent graph and made it even more awesome.

Data visualization is not only dots, bars, and pies

Look at this wonderful piece of data visualization (taken from here). If you know the terms “tertiary structure” and “glycan”, there is NO way you miss the message that the author of this figure wanted to convey.

Also, note how using appropriate colors in the title, the authors got rid of graph legend.

Lie factor in ad graphs

It’s fun to look at the visit statistics and to discover old stories. I wrote this post in 2016. For a reason I don’t know, this post has been one of the most viewed posts in my blogs during the last week. 

So, I decided to publish it again. I won’t add any new examples, but if you want to see more stuff, type [lying with data visualization] in your favorite search engine

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads to a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to the Meuheded (the health care provider who run the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much, compared to the actual numbers, the green bar is 4:20 minutes, and the light-blue one is approximately seven minutes, and not 2:39, as the number says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know that they lied. And this lie can be quantified.

 

 

 

Nice but useless data visualization

Network visualization can mesmerize and hypnotize. Chord diagrams are especially cool because they are so colorful and smooth. The problem is that sometimes, the result doesn’t provide any actual value, and serves as a cute illustration. Cute illustrations are cute; they help put some “easiness” to the text without the risk of looking too unprofessional. 

Take the two examples below.

One example (taken from here) shows worldwide migration patterns in a clear and useful way. You can take a look at the graph and make real conclusions.

The other example (taken from here) is mostly a useless illustration.

The only “conclusion” that a viewer can make out of this graph is “everything is connected with everything.”

This type of conclusion is OK for an ad or a general overview of a problem, but it is NOT a valid way to end a discussion. 

Logarithmic scale misinforms. Period

Being a data scientist and a self-proclaimed data visualization expert, I like using log scale graphs when I find them appropriate. However, as a speaker and a communicator, I refrain from using them in presentations as much as possible. From my experience as a data visualization lecturer, I noticed that even “technical” struggle grasping the concept of log scale graphs. 

One of the Coronavirus side effects was the introduction of the term “exponential growth” to every living room. Naturally (to some of us), exponential growth is best presented using a semi-log graph, where the X-axis represents the time (linear), and the Y-axis represents the degree of magnitude of a value (log scale). 

A recent study (link) tested and demonstrated how bad log-scale is. The research title is “The Logarithmic Scale Misinforms the Public and Affects Policy Preferences.” From my experience, log scale graphs misinform everybody. Except for experienced data scientists. Nothing can confuse or misinform us, obviously 😉

It is a bummer though that data visualization in that paper sucks so much.

Don’t publish graphs like this. Especially not in data visualization papers.

Thanks to Bella Graph who pointed me to the original study.

Visualising Odds Ratio — Henry Lau

Besides being a freelancer data scientist and visualization expert, I teach. One of the toughest concepts to teach and to visualize is odds ratio. Today, I stumbled upon a very interesting post that deals exactly with that

On Thursday 7 May, the ONS published analysis comparing deaths involving COVID-19 by ethnicity. There’s an excellent summary on twitter but the headline is that when taking into account age and other socio-demographic factors, such as deprivation, household composition, education, health and disability, there is higher risk for some ethnic groups of a COVID related…

Visualising Odds Ratio — Henry Lau

Bad advice from a reputable source is bad advice.

Would you buy a grammar book with a clear spelling mistake on its cover? I hope not. That’s what happened to IBM when it published it’s new data visualization guide. I didn’t bother reading the manual because of what IBM decided to use as the first image of their guide.

We use graphs to transfer information into images that are supposed to be later transformed in our brains to information. What visual attributes do we use to interpret the information behind a pie chart? It is the segment angle, its area, or maybe the arc length? Most probably, the answer is “all of the above” (see Robert Kosara’s works for more info). When done right, the three attributes of pie segments are linearly connected one to another, which allows synergism between the visual clues.
But what did our friends at IBM do? The deliberately distorted the data! I took the screenshot from the guide homepage and made some measurements.
The purple segment has the angle of 182 degrees, and the angle of the black segment is 75 degrees, which gives us the ratio of 2.42. However, while the radius of the purple segment is 135 pixels, the radius of the black one is only 110 pixels. Why is this a problem? Well, due to the radius differences, the ratio between the arc lengths is 2.91, and the ratio between the areas is 3.66. So now, let me ask you: what is the ratio between the numbers represented by the purple and the black segments?
It is correct that the colors that IBM people used in their guide are neat, but data visualization that distorts information is not visualization but a piece of garbage. I assume that IBM produces decent computers, but don’t learn data visualization from them

The quintessence of data visualization usefulness

I have to admit, I was skeptical at the beginning of the COVID-19 crisis. I started becoming skeptical now when it seems that the crisis didn’t hit my country too hard. But then I saw the graphs in this Financial Times article, and the skepticism disapeared. The graphs are accompanied by hundreds of words, but there is no need for reading the text to understand almost everything.

These graphs are so good, so convincing, so well performed, they don’t leave any place for doubt or misunderstanding of the message the author wants to convey.

If you study data visualization, look at these graphs. Look at the color choice, legend location, and design. Look at the ticks on the X- and Y-axes, how they are spaced and typeset. Note the amount of details on the axes, specifically how sparse these details are.

If there is only one document you can read about data visualization, this is the one

I’m sorting my teaching material, and I found this gem. The UK Government Statistical Service published a guideline for effective data visualization and tables. If you know a busy person who doesn’t have time to study data visualization and can only read one document, this document is for them (it has less than 40 pages full of examples). Click o the image above to go to the guideline

Data giraffe is sometimes a feature, not a problem

I wrote about data giraffes two weeks ago. Usually, “data giraffes” are a problem and we need to work hard in order to solve it. Sometimes, they are a useful feature. Take a look at this NYT front page that shows the number of new unemployment applications in the United States over the time

And this is the pseudochartchart version of the same data

Credits: I’ve found these examples on Stott Berkun’s page.

ASCII histograms are quick, easy to use and to implement

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at the distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when we did most of our work in the console, and when creating a plot from Python required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.

Data visualization as an engineering task – a methodological approach towards creating effective data visualization

In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

That was a very interesting conference, tight with interesting talks.

Next year, I plan to attend the Bucharest edition of NDR, where I will also give a talk with the working title “The biggest missed opportunity in data visualization”

Sometimes, you don’t really need a legend

This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python

plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.legend()

The last line in the code above does a small magic and adds a nice legend

This image has an empty alt attribute; its file name is image.png

In Excel, we don’t even need to do anything, the legend is added for us automatically.

This image has an empty alt attribute; its file name is image-1.png

So, what is the problem?

What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following. We either jump from the graph to the legends dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Think Fast and Slow” to learn more about how our brain works).

What would be the shortcut here? Well, note how the line for Israel lies mostly below the line for Italy which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the line order in these two pieces of information isn’t conserved. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

And if we have more lines in the graph, the situation is even worse.

This image has an empty alt attribute; its file name is image-2.png

Can we improve the graph?

Yes we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands.

plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.legend()
This image has an empty alt attribute; its file name is image-3.png

We still have to work hard but at least we can trust our brain’s shortcut.

If we have more time

If we have some more time, we may get rid of the (classical) legend altogether.

countries = [c for c in gdp.columns if c != 'Year']
fig, ax = plt.subplots()
for i, c in enumerate(countries):
    ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
    x = gdp.Year.max()
    y = gdp[c].iloc[-1]
    ax.text(x, y, c, color=f'C{i}', va='center')
seaborn.despine(ax=ax)

(if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

This image has an empty alt attribute; its file name is image-4.png

Isn’t it better? Now, the viewer doesn’t need to zap from the lines to the legend; we show them all the information at the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome.

This image has an empty alt attribute; its file name is image-5.png

This graph is much easier to digest, compared to the first one and it also provides more useful information.

.

This image has an empty alt attribute; its file name is image-6.png

I agree that this is a mess. The life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look what he did in a similar situation.

percent-bachelors-degrees-women-usa

I also recommend reading my older post where I compared graph legends to muttonchops.

In conclusion

Sometimes, no legend is better than legend.

This post, in Hebrew: [link]

What do we see when we look at slices of a pie chart?

What do we see when we look at slices of a pie chart? Angles? Areas? Arc length? The answer to this question isn’t clear and thus “experts” recommend avoiding pie charts at all.

Robert Kosara is a Senior Research Scientist at Tableau Software (you should follow his blog https://eagereyes.org), who is very active in studying pie charts. In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies. 

Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion

While this study suggests that the charts are read by area, itis not conclusive. In particular, the possibility of pie chart usersre-projecting the chart to read them cannot be ruled out. Furtherexperiments are therefore needed to zero in on the exact mechanismby which this common chart type is read.

Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

The previous Kosara’s studies had strong practical implications, the most important being that pie charts are not evil provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answer to the questions in the beginning of this post are still unclear. Maybe, the “real answer” to these questions is “a combination of thereof”.

Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On Sunday, I wrote about bootstrapping. On Monday, I wrote about visualization uncertainty. Let’s now talk about bootstrapping and uncertainty visualization.

Robert Grant is a data visualization expert who wrote a book about interactive data visualization (which I should read, BTW).

Robert runs an interesting blog from which I learned another approach to uncertainty visualization, bootstrapping.

Source: Robert Grant.

Read the entire post: Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

Error bars in bar charts. You probably shouldn’t

This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

It started with a paper by prof. Gerd Gigerenzer whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks” contained a simple graph that meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph mean to convey uncertainty. However, the data visualization selection that Gigerenzer and his team selected is simply wrong.

First of all, look at the leftmost bar, it demonstrates so many problems with error bars in general, and in error bars in barplots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in negative percentage of correct inferences?

The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated risk” from cover to cover. Twice.

Why is this important?

Communicating uncertainty is super important. Take a look at this 2018 study with the self-explaining title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

Image result for clinton trump polls
Source HuffPost

Also recall the surprise when Donald Trump won the presidential elections despite the fact that most of the polls predicted that Hillary Clinton had higher chances to win. Nobody cared about uncertainty, everyone saw the graphs!

Why not error bars?

Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

First of all, error bars tend to be symmetric (although they don’t have to) which might lead to the situation that we saw in the first example above: implying illegal values.

Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.

stacked round gold-colored coins on white surface

More specifically to bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is seen inside the physical bar. See?undefined

But that’s not all. The width of the error bars separates the error lines and makes the comparison even harder. Compare the readability of error lines in the two examples below

The proximity of the error lines in the second example (take from this site) makes the comparison easier.

Are there better alternatives?

Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why, but also surveys several alternatives

Nathan Yau from flowingdata.com had an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts and more.

Claus Wilke’s book “Fundamentals of Data Visualization” has a dedicated chapter to uncertainty with and even more detailed review [link].

Visualize uncertainty about the future” is a Science article that deals specifically with forecasts

Robert Kosara from Tableu experimented with visualizing uncertainty in parallel coordinates.

There are many more examples and experiments, but I think that I will stop right now.

The bottom line

Communicating uncertainty is important.

Know your tools.

Try avoiding error bars.

Bars and bars don’t combine well, therefore, try harder avoiding error bars in bar charts.

Visualizations with perceptual free-rides

Dr. Richard Brath is a data visualization expert who also blogs from time to time. Each post in Richard’s blog provides a deep, and often unexpected to me, insight into one dataviz aspect or another.

richardbrath

We create visualizations to aid viewers in making visual inferences. Different visualizations are suited to different inferences. Some visualizations offer more additional perceptual inferences over comparable visualizations. That is, the specific configuration enables additional inferences to be observed directly, without additional cognitive load. (e.g. see Gem Stapleton et al, Effective Representation of Information: Generalizing Free Rides2016).

Here’s an example from 1940, a bar chart where both bar length and width indicate data:

Walter_Weld__How_to_chart_data_1960_hathitrust2

The length of the bar (horizontally) is the percent increase in income in each industry.  Manufacturing has the biggest increase in income (18%), Contract Construction is second at 13%.

The width of the bar (vertically) is the relative size of that industry: Manufacturing is wide – it’s the biggest industry – it accounts for about 23% of all industry. Contract Construction is narrow, perhaps the third smallest industry, perhaps around 3-4%.

What’s really interesting is that

View original post 446 more words

Pseudochart. It’s like a pseudocode but for charts

Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. People write pseudocode to isolate the “bigger picture” of an algorithm. Pseudocode doesn’t care about the particular implementation details that are secondary to the problem, such as memory management, dealing with different encoding, etc. Writing out the pseudocode version of a function is frequently the first step in planning the implementation of complex logic.

Similarly, I use sketches when I plan non-trivial charts, and when I discuss data visualization alternatives with colleagues or students.

One can use a sheet of paper, a whiteboard, or a drawing application. You may recognize this approach as a form of “paper prototyping,” but it deserves its own term. I suggest calling such a sketch a “pseudochart”*. Like a piece of pseudocode, the purpose of a pseudochart is to show the visualization approach to the data, not the final graph itself.

* Initially, I wanted to use the term “pseudograph” but the network scientists already took it for themselves.

** The first sentence of this post is a taken from the Wikipedia.

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “is a nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in the Israeli society. This is a fascinating report. I recommend reading without any connection to data visualization.

But this post does not deal with the Isreali society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course the can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most of the cases, such a transformation would make a perfect sense. In most of the cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report that provides a broad overview of the Israeli society. The remaining of this report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigating through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables will be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided additional information layers that make the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Dumout. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why adding text labels to graph lines, when you can build graph lines using text labels? On microtext lines

richardbrath

Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?

unemployment_plain

Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifing visual attention…

View original post 1,323 more words

איך אומרים דאטה ויזואליזיישן בעברית?

This post is written in Hebrew about a Hebrew issue. I won’t translate it to English.

אני מלמד data visualization בשתי מכללות בישראלבמכללת עזריאלי להנדסה בירושלים ובמכון הטכנולוגי בחולון. כשכתבתי את הסילבוס הראשון שלי הייתי צריך למצוא מונח ל־data visualization וכתבתיהדמיית נתונים״ אומנם זה הזכיר לי קצת תהליך של סימולציה, אבל האופציה האחרת ששקלתי היתה ״דימות״ וידעתי שהיא שמורה ל־imaging, דהיינו תהליך של יצירת דמות או צורה של עצם, בעיקר בעולם הרפואה.

הבנתי שהמונח בעייתי בשיעור הראשון שהעברתי. מסתברששניים מארבעת הסטודנטים שהגיעו לשיעור חשבו שקורס ״הדמיית נתונים בתהליך מחקר ופיתוח״ מדבר על סימולציות.

מתישהו שמעתי מחבר של חבר שהמונח הנכון ל־visualization זה הדמאה, אבל זה נשמע לי פלצני מדי, אז השארתי את ה־״הדמיה״ בשם הקורס והוספתי “data visualization” בסוגריים.

היום, שלוש שנים אחרי ההרצאה הראשונה שהעברתי, ויומיים לפני פתיחת הסמסטר הבא, החלטתי לגגל (יש מילה כזאת? יש!) את התשובה. ומה מסתבר? עלון ״למד לשונך״ מס׳ 109 של האקדמיה ללשון עברית שיצא לאור בשנת 2015 קובע שהמונח ל־visualization הוא הַחְזָיָה. לא יודע מה אתכם, אבל אני לא משתגע על החזיה. עוד משהו שאני לא משתגע עליו הוא שבתור הדוגמא להחזיה, האקדמיה החלטיה לשים תרשים עוגה עם כל כך הרבה שגיאות!

Screen Shot 2018-10-23 at 20.35.52

נראה לי שאני אשאר עם הדמיה. ויקימילון מרשה לי.

נ.ב. שמתם לב שפוסט זה השתמשתי במקף עברי? אני מאוד אוהב את המקף העברי.

Data visualization in right-to-left languages

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does.

Right-to-left (RTL) languages such as Hebrew, Arabic, and Farsi are used by roughly 1.8 billion people around the world. Many of them consume data in their native languages. Nevertheless, I have never seen any research or study that explores data visualization in RTL languages. Until a couple of days ago, when I saw this interesting observation by Nick Doiron “Charts when you read right-to-left“.

I teach data visualization in Israeli colleges. Whenever a student asks me RTL-related questions, I always answer something like “it’s complicated, let’s not deal with that”. Moreover, in the assignments, I even allow my students to submit graphs in English, even if they write the report in Hebrew.

Nick’s post made me wonder about data visualization do’s and don’ts in RTL environments. Should Hebrew charts differ from Arabic or Farsi? What are the accepted practices?

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does. I want to collect as many examples of data visualization in RTL languages. Links to research articles are more than welcome. You can leave your comments here or send them to boris@gorelik.net.

Thank you.

The image at the top of this post is a modified version of a graph that appears in the post that I cite. Unfortunately, I wasn’t able to find the original publication.

“Any questions?” How to fight the awkward silence at the end of a presentation?

If you ever gave or attended a presentation, you are familiar with this situation: the presenter asks whether there are any questions and … nobody asks anything. This is an awkward situation. Why aren’t there any questions? Is it because everything is clear? Not likely. Everything is never clear. Is it because nobody cares? Well, maybe. There are certainly many people that don’t care. It’s a fact of life. Study your audience, work hard to make the presentation relevant and exciting but still, some people won’t care. Deal with it.

However, the bigger reasons for lack of the questions are human laziness and the fear of being stupid. Nobody likes asking a question that someone will perceive as a stupid one. Sometimes, some people don’t mind asking a question but are embarrassed and prefer not being the first one to break the silence.

What can you do? Usually, I prepare one or two questions by myself. In this case, if nobody asks anything, I say something like “Some people, when they see these results ask me whether it is possible to scale this method to larger sets.”. Then, depending on how confident you are, you may provide the answer or ask “What do you think?”.

You can even prepare a slide that answers your question. In the screenshot below, you may see the slide deck of the presentation I gave in Trento. The blue slide at the end of the deck is the final slide, where I thank the audience for the attention and ask whether there are any questions.

My plan was that if nobody asks me anything, I would say “Thank you again. If you want to learn more practical advises about data visualization, watch the recording of my tutorial, where I present this method  <SLIDE TRANSFER, show the mockup of the “book”>. Also, many people ask me about reading suggestions, this is what I suggest you read: <SLIDE TRANSFER, show the reading pointers>

Screen Shot 2018-09-17 at 10.10.21

Luckily for me, there were questions after my talk. Luckily, one of these questions was about practical advice so I had a perfect excuse to show the next, pre-prepared, slide. Watch this moment on YouTube here.

Graphing Highly Skewed Data – Tom Hopper

My colleague, Chares Earl, pointed me to this interesting 2010 post that explores different ways to visualize categories of drastically different sizes.

The post author, Tom Hopper, experiments with different ways to deal with “Data Giraffes”. Some of his experiments are really interesting (such as splitting the graph area). In one experiment, Tom Hopper draws bar chart on a log scale. Doing so is considered as a bad practice. Bar charts value (Y) axis must include meaningful zero, which log scale can’t have by its definition.

Other than that, a good read Graphing Highly Skewed Data – Tom Hopper

Sometimes, less is better than more

Today, during the EuroSciPy conference, I gave a presentation titled “Three most common mistakes in data visualization and how to avoid them”. The title of this presentation is identical to the title of the presentation that I gave in Barcelona earlier this year. The original presentation was approximately one and a half hours long. I knew that EuroSciPy presentations were expected to be shorter, so I was prepared to shorten my talk to half an hour. At some point, a couple of days before departing to Trento, I realized that I was only allocated 15 minutes. Fifteen minutes! Instead of ninety.

Frankly speaking, I was in a panic. I even considered contacting EuroSciPy organizers and asking them to remove my talk from the program. But I was too embarrassed, so I decided to take the risk and started throwing slides away. Overall, I think that I spent eight to ten working hours shortening my presentation. Today, I finally presented it. Based on the result, and on the feedback that I got from the conference audience, I now know that the 15-minutes version is better than the original, longer one. Video recording of my talk is available on Youtube and is embedded below. Below is my slide deck

 

 

Illustration image credit: Photo by Jo Szczepanska on Unsplash

An even better data visualization workshop

Boris Gorelik teaching in front of an audience.

Yesterday, I gave a data visualization workshop at EuroSciPy 2018 in Trento. I spent HOURs building and improving it. I even developed a “simple to use, easy to follow, never failing formula” for data visualization process (I’ll write about it later).

I enjoyed this workshop so much. Both preparing it, and (even more so) delivering it. There were so many useful questions and remarks. The most important remark was made by Gael Varoquaux who pointed out that one of my examples was suboptimal for vision impaired people. The embarrassing part is that one of the last lectures that I gave in my college data visualization course was about visual communication for the visually impaired. That is why the first thing I did when I came to my hotel after the workshop was to fix the error. You may find all the (corrected) material I used in this workshop on GitHub. Below, is the video of the workshop, in case you want to follow it.

 

 

 

Photo credit: picture of me delivering the workshop is by Margriet Groenendijk

Meet me at EuroSciPy 2018

I am excited to run a data visualization tutorial, and to give a data visualization talk during the 2018 EuroSciPy meeting in Trento, Italy.

My tutorial “Data visualization — from default and suboptimal to efficient and awesome”will take place on Sep 29 at 14:00. This is a two-hours tutorial during which I will cover between two to three examples. I will start with the default Matplotlib graph, and modify it step by step, to make a beautiful aid in technical communication. I will publish the tutorial notebooks immediately after the conference.

My talk “Three most common mistakes in data visualization” will be similar in nature to the one I gave in Barcelona this March, but more condensed and enriched with information I learned since then.

If you plan attending EuroSciPy and want to chat with me about data science, data visualization, or remote working, write a message to boris@gorelik.net.

Full conference program is available here.

Value-Suppressing Uncertainty Palettes – UW Interactive Data Lab – Medium

Uncertainty is one of the most neglected aspects of number-based communication and one of the most important concepts in general numeracy. Comprehending uncertainty is hard. Visualizing it is, apparently, even harder.

Last week I read a paper called Value-Suppressing Uncertainty Palettes, by M.Correll, D. Moritz, and J. Heer from the Data visualization and interactive analysis research at the University of Washington. This paper describes an interesting approach to color-encoding uncertainty.

Value-Suppressing Uncertainty Palette

Uncertainty visualization is commonly done by reducing color saturation and opacity.  Cornell et al suggest combining saturation reduction with limiting the number of possible colors in a color palette. Unfortunately, there the authors used Javascript and not python for this paper, which means that in the future, I might try implementing it in python.

Two figures visualizing poll data over the USA map, using different approaches to visualize uncertainty

 

Visualizing uncertainty is one of the most challenging tasks in data visualization. Uncertain

 

via Value-Suppressing Uncertainty Palettes – UW Interactive Data Lab – Medium

Evolution of a complex graph. Part 1. What do you want to say?

From time to time, people ask me for help with non-trivial data visualization tasks. A couple of weeks ago, a friend-of-a-friend-of-a-friend showed me a set of graphs with the following note:

Each row is a different use case. Each use case was tested on three separate occasions – columns 1,2,3. We hope to show that the lines in each row behave similarly, but that there are differences between the different rows.

Before looking at the graphs, note the last sentence in the above comment. Knowing what you want to show is an essential and not trivial part of a data visualization task. Specifying what is it precisely that you want to say is the first required task in any communication attempt, technical or not.

For the obvious reasons, I cannot share the original graphs that that person gave me. I managed to re-create the spirit of those graphs using a combination of randomly generated arrays.
The original graph: A 3-by-4 panel of line charts
Notice how the X- and Y- axes are aligned between all the subplots. Such alignment is a smart move that provides a shared scale and allows faster and more natural comparison between the curves. You should always try aligning your axes. If aligning isn’t possible, make sure that it is absolutely, 100%, clear that the scales are different. Slight differences are very confusing.

There are several small things that we can do to improve this graph. First, the identical legends in every subplot are a useless waste of ink and thus, of your viewers’ processing power. Since they are identical, these legends do nothing but distracting the viewer. Moreover, while I understand how a variable name such as event_prob appeared on a graph, showing such names outside technical teams is a bad practice. People who don’t share intimate knowledge with the underlying data will find human-readable labels easier to comprehend, making your message “stickier.”
Let’s improve the signal-to-noise ratio of this plot.
An improved version of the 3-by-4 grid of line charts

According to our task, each row is a different use case. Notice that I accompanied each row with a human-readable label. I didn’t use cryptic code such as group_001, age_0_10 or the such.
Now, let’s go back to the task specification. “We hope to show that the lines in each row behave similarly, but that there are differences between the separate rows.” Remember my advice to always use conclusions as graph titles? Let’s test how such a title will look like

A hypothetical screenshot. The title says: "low intra- & high inter- group variability"

Really? Is there a better way to justify the title? I claim that there is.

Let’s experiment a little bit. What will happen if we will plot all the lines on the same graph? By doing so, we might create a stronger emphasize of the similarities and the differences.

Overlapping lines that show several repetitions in four different groups
Not bad. The separate lines create some excessive noise, and the legend isn’t the best way to label multiple lines, so let’s improve the graph even further.

Curves representing four different data groups. Shaded areas represent inter-group variability

Note that meaningful ticks on the X-axis. The 30, 180, and 365-day marks provide useful anchors.

Now, let us go back to our title. “Low intra- and high inter- group variability” is, in fact, two conclusions. If you have ever read any text about technical presentations, you should remember the “one point per slide” rule. How do we solve this problem? In cases like these, I like to use the same graph in two different slides, one for each conclusion.

Screenshot showing two slides. The first one is titled "low within-group variability". The second one is titled "High between-group variability". The graphs in the slides is the same

During a presentation, I would show this graph with the first conclusion as a title. I would talk about the implications of that conclusion. Next, I will say “wait! There is more”, will promote the slide and start talking about the second conclusion.

To sum up,

First, decide what is it that you want to say. Then ask whether your graph says what you want to say. Next, emphasize what you want to say, and finally, say what you want to say.

To be continued

The case that you see in this post is a relatively easy one because it only compares four groups. What will happen if you will need to compare six, sixteen or sixty groups? I will try answering this question in one of my next posts

C for Conclusion

From time to time, I give a lecture about most common mistakes in data visualization. In this lection, I say that not adding a graph’s conclusion as a title is an opportunity wasted

Screenshot. Slide deck. The slide says

In one of these lectures, a fresh university graduate commented that in her University, she was told to never write a conclusion in a graph. According to to the logic she was tought, a scientist is only supposed to show the information, and let his or her peer scientists draw the conclusions by themselves. This sounds like a valid demand except that it is, in my non-humble opinion, wrong. To understand why is that, let’s review the arguments in favor of spelling out the conclusions.

The cynical reason

We cannot “unlearn” how to read. If you show a piece of graphic for its aesthetic value, it is perfectly OK not to suggest any conclusions. However, most of the time, you will show a graph to persuade someone, to convince them that you did a good job, that your product is worth investing in, or that your opponent is ruining the world. You hope that your audience will reach the conclusion that you want them to reach, but you are not sure. Spelling out your conclusion ensures that the viewers use it as a starting point. In many cases, they will be too lazy to think of objections and will adopt your point of view. You don’t have to believe me on this one. The Nobel Prize winner Daniel Kahneman wrote a book about this phenomenon.

What if you want to hear genuine criticism? Use the same trick to ask for it. Write an open question instead of the conclusion to ensure everybody wakes up and start thinking critically.

The self-discipline reason

Some people are not comfortable with the cynical way I suggest to exploit the limitations of the human mind. Those people might be right. For them, I have another reason, self-discipline. Coming up with a short, concise and descriptive title requires effort. This effort slows you down and ensures that you start thinking critically and asking questions. “What does this graph really tells?” “Is this the best way to demonstrate this conclusion?” “Is this conclusion relevant to the topic of my talk, is it worth the time?”. These are very important questions that someone has to ask you. Sure, having a professional and devoted reviewer on your prep team is great but unless you are a Fortune-500 CEO, you are preparing your presentations by yourself.

The philosophical reason

You will notice that my two arguments sound like a hack. They do not talk about the “pure science attitude”, and seem to be detached from the theoretical picture of the idealized scientific process. That is why, when that student objected to my suggestion, I admitted defeat. Being a data scientist, I want to feel good about my scientific practice. It took me a while but at some point, I realized that writing a conclusion as the sole title of a graph or a slide is a good scientific practice and not a compromise.

According to the great philosopher Karl Popper, a mandatory characteristic of any scientific theory is that they make claims that future observations might show to be false. Popper claims that without taking a risk of being proved wrong,  a scientist misses the point  [ref]. And what is the best way to make a clear, risky statement, if not spelling it out as a clear, non-ambiguous title of your graph?

Don’t feel bad, your bases are covered

To sum up, whenever you create a graph or a slide, think hard about what conclusion you want your audience to make out of it. Use this conclusion as your title. This will help you check yourself, and will help your fellow scientists assess your theory. And if a purist professor says you shouldn’t write your conclusions, tell him or her that the great Karl Popper thought otherwise.

 

Meaningless slopes

That fact that you can doesn’t mean that you should! I will say it once again.That fact that you can doesn’t mean that you should! Look at this slopegraph that was featured by “Information is Beautiful”

What does it say? What do the slopes mean? It’s a slopegraph, its slopes should have a meaning. Sure, you can draw a line between one point to another but can you see the problem here? In this nonsense graph, the viewer is invited to look at slopes of lines that connect dollars with years. The proverbial “apples and oranges” are not even close to the nonsense degree of this graph. Not even close.

This page attributes this graph to National Geographic, which makes me even sadder.

 

In defense of three-dimensional graphs

“There is only one thing worse than a pie chart. It’s a 3-D pie chart”. This is what I used to think for quite a long time. Recently, I have revised my attitude towards pie charts, mainly due to the works of Rober Kosara from Tableau. I am no so convinced that pie charts can be a good visualization choice, I even included a session “Pie charts as an alternative to bar charts” in my recent workshop.

What about three-dimensional graphs? I’m not talking about the situations where the data is intrinsically three-dimensional. Such situations lie within the consensus. I’m talking about adding a third dimension to graphs that can work in two dimensions. Like the example below that is taken from a 2017 post by Deven Wisner.

Screenshot: a 3D pie chart with text "The only good thing about this pie chart is that it's starting to look more like [a] real pie"

Of course, this is not a hypothetical example. We all remember how the late Steve Jobs tried to create a false impression of Apple market share

Steve Jobs during a presentation, in front of a

Having said all that, can you think of a legitimate case where adding the third dimension adds more value than distraction? I worked hard, and I finally found it.

 

Take a look at the overlapping density plot (a.k.a “joy plot”).

Three joyplot examples

If you think of this, joyplots are nothing more than 3-d line graphs done well. Most of the time, they provide information-rich data overview that also enables digging into fine details. I really like joyplots. I included one in my recent workshop. Many libraries now provide ready-to-use implementations of joyplots. This is a good thing to have. The only reservation that I have about those implementations is the fact that many of them, including my favorite seaborn, add meaningless colors to the curves. But this is a topic for another rant.

I will host a data visualization workshop at Israel’s biggest data science event

TL/DR

 

What: Data Visualization from default to outstanding. Test cases of tough data visualization

Why:  You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Sadly, most of you do.

When: May 27, 2018 (a day before the DataScience summit)/ 13:00 – 16:00

Where:  Interdisciplinary Center (IDC) at Herzliya.

More info: here.

Timeline:
1. Theoretical introduction: three most common mistakes in data visualization (45 minutes)
2. Test case (LAB): Plotting several radically different time series on a single graph (45 minutes)
3. Test case (LAB): Bar chart as an effective alternative to a pie chart (45 minutes)
4. Test case (LAB): Pie chart as an effective alternative to a bar chart (45 minutes)

More words

According to the conference organizers, the yearly Data Science Summit is the biggest data science event in Israel. This year, the conference will take place in Tel Aviv on Monday, May 28. One day before the main conference, there will be a workshop day, hosted at the Herzliya Interdisciplinary Center. I’m super excited to host one of the workshops, during the afternoon session. During this workshop, we will talk about the mistakes data scientist make while visualizing their data and the way to avoid them. We will also have some fun creating various charts, comparing the results, and trying to learn from each others’ mistakes.

Register here.

Three most common mistakes in data visualization 
and how to avoid them. Now, the slides

Yesterday, I talked in front of the Barcelona Data Science and Machine Learning Meetup about the most common mistakes in data visualization. I enjoyed talking with the local community very much. Judging by the feedback I received during and after the talk, they too, enjoyed my presentation. I uploaded my slides to Slideshare.

Me in front of a screen that shows a bar chart

Enjoy!

Live in Barcelona. Three most common mistakes in data visualization.

On Thursday, March 20, I will give a talk titled “Three most common mistakes in data visualization and how to avoid them.” I will be a guest of the Barcelona Data Science and Machine Learning Meetup Group. Right now, less than twenty-four hours after the lecture announcement, there are already seventeen people on the waiting list. I feel a lot of responsibility and am very excited.

 

ASCII histograms are quick, easy to use and implement

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python was required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.

The case of meaningless comparison

Exposé, an Australian-based data analytics company, published a use case in which they analyze the benefits of a custom-made machine learning solution. The only piece of data in their report [PDF] was a graph which shows the observed and the predicted

Screenshot that shows two time series curves: one for the observed and one for the predicted values

Graphs like this one provide an easy-to-digest overview of the data but are meaningless with respect to our ability to judge model accuracy. When predicting values of time series, it is customary to use all the available data to predict the next step. In cases like that, “predicting” the next value to be equal to the last available one will result in an impressive correlation. Below, for example, is my “prediction” of Apple stock price. In my model, I “predict” tomorrow’s prices to be equal to today’s closing price plus random noise.

Two curves representing two time series - Apple stock price and the same data shifted by one day

Look how impressive my prediction is!

I’m not saying that Exposé constructed a nonsense model. I have no idea what their model is. I do say, however, that their communication is meaningless. In many time series, such as consumption dynamics, stock price, etc, each value is a function of the previous ones. Thus, the “null hypothesis” of each modeling attempt should be that of a random walk, which means that we should not compare the actual values but rather the changes. And if we do that, we will see the real nature of the model. Below is such a graph for my pseudo-model (zoomed to the last 20 points)

diff_series

 

Suddenly, my bluff is evident.

To sum up, a direct comparison of observed and predicted time series can only be used as a starting point for a more detailed analysis. Without such an analysis, this comparison is nothing but a meaningless illustration.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads to a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to the Meuheded (the health care provider who run the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much, compared to the actual numbers, the green bar is 4:20 minutes, and the light-blue one is approximately seven minutes, and not 2:39, as the number says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know that they lied. And this lie can be quantified.

 

 

 

Does chart junk really damage the readability of your graph?

Screen Shot 2018-02-12 at 16.32.56Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink out of all the “ink” in a graph should be as high as possible, preferably, 100%.
Everyone who considers themselves serious about data visualization knows (either formally, or intuitively) about the importance to keep the data-ink ratio high, the merits of high signal-to-noise ratio, the need to keep the “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One of such studies was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

In the study presented by J.D. Kelly, the researchers presented a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was exposed to similar graphs that underwent rigorous removal of all the possible “chart junk.” One such an example is shown below

Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

Unexpectedly enough, there was no difference between the error rate the two groups made. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

I don’t remember how this report got into my “to read” folder. I am amazed I have never heard about it. So, what is my take out of this study? It doesn’t mean we need to abandon the data-ink ratio at all. It does not provide an excuse to add “chart junk” to your charts “just because”. It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has some limitations. Edward Tufte said, “Above all, show data.” My advice is “Show data, enough data, and mostly data.” Your goal is to convey a message, if some decoration (a.k.a chart junk) makes your message more easily digestible, so be it.

Why bar charts should always start at zero?

In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How come can anyone tell me how to start my bar chart? The Paper/Screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

Data visualization is a language. Like any language, data visualization has its set of rules,  grammar if you wish. Like in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try respecting grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

Natan Yau from flowingdata.com has a very informative post

Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

that explores this exact point. Read it.

Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

Also, do remember is that the zero point has to be a meaningful one. That is why, you cannot use a bar chart to depict the weather because, unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice the temperature scale.

Yet another thing to remember is that

It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

(citing Natan Yau)

Gender salary gap in the Israeli high-tech — now the code

Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech. Like 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings. The original one uses a slightly customized style.

 

In defense of double-scale and double Y axes

If you had a chance to talk to me about data visualization, you know that I dislike the use of double Y-axis for anything except for presenting different units of the same measurement (for example inches and meters). Of course, I’m far from being a special case.  Double axis ban is a standard stand among all the people in the field of data visualization education. Nevertheless, double-scale axes (mostly Y-axis) are commonly used both in popular and technical publications. One of my data visualization students in the Azrieli College of Engineering of Jerusalem told me that he continually uses double Y scales when he designs dashboards that are displayed on a tiny screen in a piece of sophisticated hardware. He claimed that it was impossible to split the data into separate graphs, due to space constraints, and that the engineers that consume those charts are professional enough to overcome the shortcomings of the double scales. I couldn’t find any counter-argument.

When I tried to clarify my position on that student’s problem, I found an interesting article by Financial Times commentator John Auther, called “Lies, Damned Lies and Statistics.” In this article, John Auther reviews the many problems a double scale can create. He also shows different alternatives (such as normalization). However, at the end of that article, John Auther also provides strong and valid arguments in favor of the moderate use of double scales. John Auther notices strange dynamics of two metrics

A chart with two Y axes - one for EURJPY exchange rate and the other for SPX Index
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

It is extraordinary that two measures with almost nothing in common with each other should move this closely together. In early 2007 I noticed how they were moving together, and ended up writing an entire book attempting to explain how this happened.

It is relatively easy to modify chart scales so that “two measures with almost nothing in common […] move […] closely together”. However, it is hard to argue with the fact that it was the double scale chart that triggered that spark in the commentator’s head.  He acknowledges that normalizing (or rebasing, as he puts it) would have resulted in a similar picture

Graph that depicts the dynamics of two metrics, brought to the same scale
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

But

However, there is one problem with rebasing, which is that it does not take account of the fact that a stock market is naturally more variable than a foreign exchange market. Eye-balling this chart quickly, the main message might be that the S&P was falling faster than the euro against the yen. The more important point was that the two were as correlated as ever. Both stock markets and foreign exchange carry trades were suffering grievous losses, and they were moving together — even if the S&P was moving a little faster.

I am not a financial expert, so I can’t find an easy alternative that will provide the insight John Auther is looking for while satisfying my purist desire to refrain from using double Y axes. The real question, however, is whether such an alternative is something one should be looking for. In many fields, double scales are the accepted language. Thanks to the standard language, many domain experts are able to exchange ideas and discover novel insights.  Reinventing the language might bring more harm than good. Thus, my current recommendations regarding double scales are:

Avoid double scales when possible, unless its a commonly accepted practice. In which case, be persistent and don’t lie.