On proper selection of colors in graphs

Photo by Sharon McCutcheon on Pexels.com

How do you properly select a colormap for a graph? What makes the rainbow color map a wrong choice, and what are the proper alternatives?

Today, I stumbled upon a lengthy post that provides an in-depth review of the theory behind our color perception. The article concentrates on quantitative colormaps but also includes information relevant to selecting proper colors for categories. 

If you never learned the theory behind the color and are interested in data visualization, I strongly suggest investing 45-60 minutes of your life in reading this post.

Another example of the power of data visualization

I stumbled upon a great graph that tells a complex story compellingly.

Comparison of two COVID-19 waves in the UK, taken from here.

This graph compares the last two waves of COVID-19 in the United Kingdom and is shows so clearly that the new wave (that is supposedly composed of the Delta variant) is much more infections on the one hand, but on the other hand, causes much less damage. Is the more moderate damage the result of the Delta variant nature of the protective effect of the vaccination is still an open question, but the difference is still striking.

Super useful videos for advanced data visualizers

The great Robert Kosara, also known as the “eager eyes” has started publishing a series of videos he calls Chart Appreciation. In these videos, Robert takes a piece of data visualization from a reputable and known source, and discusses why this particular piece is so good, what decisions were made that made it possible, what alternatives are, and more. If you consider yourself an intermediate or advanced practitioner of data visualization, you should subscribe. Here’s one example.

Graphical comparison of changes in large populations with “volcano plots”

I recently rediscovered a volcano plot — a scatter plot that aims to visualize changes in large populations.

Volcano plots are very technical and specialized and, most probably, are not a good fit for explanatory data visualization. However, they can be useful during the exploration phase, and they come with a set of well-established metrics.

Moreover, if you are lucky enough to have well-behaved data, the plots look very cool

Visualization of RNA-Seq results with Volcano Plot
From here

Of course, in real life, the data is messy. Add bad visualization practices to the mess and you get a marvel like this one

From here

The bottom line: if you have two populations to compare, consider volcano plots. But do remember dataviz good practices.

Before and after — stacked bar charts

A fellow data analyst asked a question? What do we do when we need to draw a stacked bar chart that has too many colors? How do we select the colors so that they are nice but also are easily distinguishable? To answer this question, let’s look at the data similar to what appeared in the original question. I also tried to recreate the actual chart’s style

So, how do we select colors?
The answer to this question is pretty complicated. To have a set of easily distinguishable colors, one needs to model the color perception in a typical human being properly. Luckily, a tool called I Want Hue that’s based on a solid theory explained here. The problem, however, isn’t in colors.

This is not the right question

Distinguishing between eight colors in a graph is a challenging task. Selecting the right color scheme might help, but it won’t solve this fundamental problem. Moreover, stacked bar plots are tricky due to another complication.

We, the humans, are somewhat good are comparing positions but not as good at comparing sizes. This is why comparing the heights of the bars is relatively easy. It is easy because the bars start at the same line, and our task is to compare the bar end position, not the bar size. Reading the heights of the lowest segment in the bars is also an easy task for the same reason: we don’t compare the sizes but the heights.

However, comparing the sizes of the middle components is more challenging. As a result, the intermediate parts of a graph don’t add useful information but rather add noise. Thus, let us explain two options. First, we will reduce the number of groups. Next, we will explore what happens when reducing the number of groups is not an option.

Option 1. Reduce the number of categories

It is hard to advise about data visualization when I don’t know what conclusion the author wants to convey. However, I am sure that in many cases, the number of categories that are relevant to the viewer is much smaller than the number of types that are relevant to the analyst. The viewer might not care about all the hard job you did while collecting the data; what they are about is an insight. For example, if we reduce the discussion to two groups: the USA and non-USA data centers, the graph becomes much more readable.

Note how two groups in a stacked graph pose no problem in deciphering the sizes. If we take care of readability and improve the data-ink ratio, we get a nice data visualization piece.

Option 2. When reducing the number of categories is not an option

But what if reducing the number of categories is not an option? If you are absolutely sure that the audience absolutely needs to see all the information, you can split the different groups into separate subgraphs.

Have you noticed that the X-axis in our case represents time? In this case, we can replace the bars with an evolution plot and create a separate chart for each category in the data set. I took special care to keep the Y-axis scale equal between all the graphs so that the viewer can easily distinguish between data centers with a lot of errors and data centers with only a few of them. Here’s the result:

But what if the overall error rate is of greater importance than the individual groups. In that case, we can plot them in a larger graph and add the separate groups below, in smaller, un-emphasized subplots.

Summary — the Why and the What define the How

When you have a technical question about improving a graph, make sure you ask yourself “why.” Why is, does technical problems matter? Why will it improve the chart? To answer this question, you will have to ask another question: “what?”. “What is it that I want to say.” The easiest way to force yourself to ask these questions is to force yourself to add titles to every graph you create (see my how to suck less in data visualization post for more details).

Once you have your conclusion ready, you will notice that you don’t need a technical solution but rather a conceptual one. In this case, we solved the technical problem of looking for eight distinct colors by reducing the number of categories to two or splitting one elaborate graph into several straightforward ones.

So, remember, the Why and the What define the How

Python code that was used to generate all this graphs is available on (https://gist.github.com/bgbg/6c645a5fc48e61b1a917c9d1d66fa72f)

Before and after: Alternatives to a radar chart (spider chart)

A radar chart (sometimes called “spider charts”) look cool but are, in fact,
pretty lame. So much so that when the data visualization author Stephen Few mentioned them in his book Show me the numbers, he did so in a chapter called “Silly graphs that are best forsaken.”

Here, I will demonstrate some of its problems, and will suggest an alternative

Before: The problems of a radar (spyder) plot

Above is my reconstruction of the original plot that I saw in a Facebook discussion. The graph looks pretty cool, I have to admit, but it is full of problems.
What are the problems of a spyder plot or a radar plot?
Let’s start with readability. Can you quickly tell the value of “Substance abuse” for the red series? Not that easy.

But a more significant problem emerges when one realizes that in most cases, the order of the categories is arbitrary and that different sorting options may result in entirely different visual pictures.

After: conclusion-based graph design

I have been continually preaching to add meaningful titles to all the graphs you are creating. (See How to suck less in data visualization and professional communication).

One of the byproducts of adding a title is the fact that when you write down your main takeaway of a graph, you force yourself to think, “does this graph show what it says it shows?” Thus, you guide yourself to better graph choices.

Let’s say that we conclude that there is no correlation between the two series of data. Is this conclusion evident from the graphs? I would say, not so much.

Instead of a radar chart, I suggest creating two aligned, horizontal graph plots. This way, we may sort one subplot according to the values, and then, correlation (or lack of thereof) will be evident.

But what if we noticed something interesting about the differences between A and B groups? If this is true, let’s show precisely this: the differences.

Notice how the bars in this version are sorted according to the difference. Sorting a bar chart is the easiest way to make it readable.

Python code that I used to create these graphs is available here https://gist.github.com/bgbg/db833db723998cd244b5049bfe01f5ac

Basic data visualization video course (in Hebrew)

I had the honor to record an introductory data visualization course for high school students as a part of the Israeli national distance learning project. The course is in Hebrew, and since it targets high schoolers, it does not require any prior knowledge.

I got paid for this job. However, when I divide the money that I received for this job by the time I spent on it, I get a ridiculously low rate. On the other hand, I enjoyed the process, and I view this as my humble donation to the public education system.
Since a government agency makes the course site, it’s UI is complete shit. For example, the site doesn’t support playlists, and the user is expected to search through the video clips by their titles. To fix that, I created a page that lists all the videos in the right order.

Text Visualization Browser

I’ve stumbled upon an exciting project — text visualization browser. It’s a web page that allows one to search for different text visualization techniques using keywords and publication time. 

Text visualization browser https://textvis.lnu.se

The ability to limit the search to various years gives a nice historical perspective on this interesting topic

This site’s information is based on a 2015 paper Text visualization techniques: Taxonomy, visual survey, and community insights. I wish the authors updated it with more recent data, though. 

The information is beautiful. The graphs are shit!

I apologize for my harsh language, but recently I was exposed to a bunch of graphs on the “information is beautiful” site, and I was offended (well, ot really, but let’s pretend I was). I mean, I’m a liberal person, and I don’t care what graphs people do in their own time. Many people visit that site because they try to learn good visualization practices, but some charts on that site are wrong. Very wrong.

Here’s the gem:

I deliberately don’t share the link to this site. I don’t want let Google think it’s valuable in any way.

Now, the geniuses from “Information is beautiful” (let’s call them IB for brevety) wanted to share with us some positive stats. How nice of them. So what they did? They gathered together nine pairs of metrics collected at two different time points: one in the past and one furthermore in history. They used nice colors to create some sleeky shapes. So, what’s the problem? What’s wrong with that?

Everything is wrong!

Let’s start from my guess that they cherry-picked the stats with “positive” changes. Secondly, the comparison of this sort is mostly meaningless if we compare points at different years. What stopped the authors of that tasteless “infographic” from collecting data from the same years? I guess, their laziness. That’s how we ended up comparing the number of death penalties in 1990 and 2016, but the malaria deaths numbers are for 2000 and 2016, and dying mothers are compared for years 2000 and 2017?

Now, let’s talk about data viz.

Take a look at this graph.

The only time we use shapes like that is when we want to convey information about uncertainty. To do that, the X-axis represents the thing we are measuring, and the Y-axis represents our certainty about the current value. When we compare to uncertain measurements, we may judge the difference between these measurements by the distance between the curve peaks, and the width of the curve represents the uncertainty.

Here’s a good example from [this link]:

Can you see how the metric of interest is on the X-axis? The width of each bell curve represents the uncertainty and the difference between any pair of cases is the difference on the horizontal (X) axis, not the vertical one.

Instead, what do the IB authors did? They obviously like sleek looking shapes but know nothing about how to use them. They could have used two bars and let the viewer compare their heights. But nooooo! Bars are not c3wl! Bars are boring! Instead, they took probability density curves (that’s how they are technically called) and made them pretend to be bars.

Bars. Is this THAT hard?

I can hear some of you saying, “Stop being so purist! What’s wrong with comparing the heights of bell curves?” I’ll tell you what’s wrong! Data visualization is a language. As with any language, it has some rules and traditions. If you hear me saying, “me go home,” you will understand me without any problem. However, you will silently judge me for my poor use of the English language. I know that, and since English is my third language, I use all the help to make as few mistakes as possible. The same is correct with data visualization. Please respect its rules and traditions, even if (and especially if) are not fluent in it.

I never write more than two sentences in English without Grammarly

Visit the worst practice tag in this blog to see more bad examples

Exploring alternatives to population pyramids

A population pyramid also called an “age-gender-pyramid”, is a graphical illustration that shows the distribution of various age groups in a population (typically that of a country or region of the world), which forms the shape of a pyramid when the population is growing [citation from Wikipedia].

In some cases, the pyramid provides interesting insights into the entire population. In this post, I will explore ways to make some of these insights more visible. 

The basic case

Let’s start with the basic case. If you have two-three hours of spare time, you can go to the site devoted to population pyramids — https://www.populationpyramid.net. There, you will find population pyramids for every country in the world. The site provides present and past data, as well as future forecasts. To understand how insightful age pyramids can be, look at the graph that represents the entire world.

(this and most other images in this post are from the site http://populationpyramid.net/)

You can clearly see that the world is mostly young, that the amount of people declines as the age progresses, and that there is a rough balance between men and women in the world, at least before the ages of 70+.

Now, examine the stark difference between the populations of Western Africa and Western Europe. Citing the late professor Hans Rosling, we can still see two worlds, one with large families and short lives, and one with small families and long lives. 

Another starking example of an age pyramid is the following

Do you want to guess what country is that? This particular graph shows the age distribution of the United Arab Emirates. Such a vast distortion in symmetry and age distribution stems from the fact that more than 80% of the UAE’s population is composed of expats who come to this rich country to work. The pyramid below (taken from [this article]) sheds some light to the population composition of UAE. (Note that the genders in this graph are reversed).

Whose bar is longer?

The male-female disbalance in the UAE and some other Gulf countries is very striking and cannot be missed. But what about other, more subtle cases? Take a look at the world graph above. If you follow the numbers on the bars, you will notice that more boys are born than girls, but there are more old ladies than old gents in the world. Can we make such differences less subtle?

To answer this question, we need to understand why we find it hard to compare almost equal bars. The reason for that is that our eyes (or brains) are not so good at comparing sizes. They do, however, do a much better job comparing positions. Thus, if we overlap these bars, we will see the small differences in a much more precise manner. 

(I thank the data visualization expert Bella Graf from InfoServiz.co.il for the idea of this graph).

Now, the subtle differences in gender composition are more visible. 

What am I looking at?

When I teach data visualization, I always tell my students to add a meaningful title to the graph. By “meaningful,” I mean a title that does not answer the question “what” but rather “so what”? (See my posts “How to suck less in data visualization,” and “C for conclusion“). What would a good title for this graph be? Let’s try the following

OK, so now, when we have a title, we can ask ourselves, “does the graph show what it says it shows”? And the answer is no. Right now, the title talks about differences, but we don’t see the differences. We see the differences and other stuff. Let’s look only at the differences.

I don’t like this.

What about this?

Now, this is not an age pyramid. That’s for sure. This graph doesn’t show the wealth of data that the classical pyramid shows. On the other hand, it does offer one thing, and it does it very well. Look, for example, at the male/female distortion in China in 1990.

You may find the code I used to create the graphs in this post [on GitHub].

What is the biggest problem of the Jet and Rainbow color maps, and why is it not as evil as I thought?

There was a consensus among the data visualization purists that the rainbow color map, and it’s close cousin Jet are bad. Really bad. These colormaps used to be popular at the beginning of the computational data visualization era. However, their popularity decreased in the last five years or so. The sentiment isn’t as bad as it used to be a couple of years ago, but still.

A screenshot from circa 2016. Today we are less fanatic than that

What is the biggest problem of the rainbow colormap? The most apparent problem with this particular colormap is that it not perceptually uniform. By “perceptually uniform,” I mean that equal changes in the value that we encode using a colormap should correspond to same changes in the color perception. This is not the case with the rainbow or the Jet colormaps. They have distinct bright and dark stripes within the number range, making them the wrong choice to encode numerical data. The situation is even worse for people with impaired color vision.

Can you be less perceptually uniform?

The solution to this problem was proposed in the form of better colormaps. The first one that I know of is Parula by Matlab, and it’s opensource alternative Viridis that is available in matplotlib and many other plotting libraries. (Watch this video about viridis to get a good introduction to color perception and color maps).

Viridis, the new rainbow

Everything was nice and good, and I was trashing the rainbow colormap whenever I could. Until yesterday, when I read about Turbo, the improved rainbow colormap developed by Google.

In the long and interesting blog post that describes Turbo, Anton Mikhailov, a software engineer in Google, describes several relevant applications of a “good rainbow” scheme. 

According to Anton, “Because of rapid color and lightness changes, Jet accentuates detail in the background that is less apparent with Viridis and even Inferno. Depending on the data, some detail may be lost entirely to the naked eye. The background in the following images is barely distinguishable with Inferno (which is already punchier than Viridis), but clear with Turbo.”

I must admit that I’m convinced. 

The biggest problem with that is mentioned concerning the original rainbow scheme that its brightness varies too much. However, it turns out that the color saturation and hue attract our attention more than the lightness (here’s the reference which I haven’t read yet). As such, it makes sense to construct a colormap that relies more on color and hue changes. 

Moreover, in many cases, the interesting details appear in the extreme values of the data range, not in the middle. In thes cases, a properly applied rainbow-like color scheme becomes a valid choice.

The bottom line is that one should not refrain from using rainbow(-like) color maps in their visualizations anymore, provided that they use a modern implementation. Luckily, it’s even available in matplotlib

How to suck less in data visualization and professional communication

In technical communication, the main thing is to keep the main thing the main thing. There are multiple ways to ensure this principle. Some of these ways require careful chart fine-tuning. However, there is one tool that is easy to master, fast to apply, and that provides a high return on the investment rate. I refer to chart titles. In this talk, I had two main theses. My first thesis is that most of you suck in communication (and not only data visualization).

My second thesis is that you can quickly improve your graphs by merely adding a good title. The importance of good titles is not new to my preaching, but I thought it was an excellent thing to formalize this thesis a bit, and I’m thankful to the NDR organizers for giving me this opportunity.

Following is the slide stack from my NDR presentation.

Before and after. Even excellent graphs can be improved

Being a data visualization consultant, I can’t help looking for dataviz problems in graphs that I see. Even if the graph is good. Even if I know that I would not be able to create a graph that good. Even if the overall graph is excellent, and the problems are minor, or maybe especially when the graph is excellent, and the problems are minor.

This is a nice graph published by Nevo Benita on Linkedin.

The graph presents the gap between the men and the women in the Israeli job market. As I said, the graph is excellent. However, there are several small problems that, like grains of sand in a chocolate mousse, stand in the way. Let’s take a look at them.

The time-series line in the upper right part of the graph shows good use of the real estate. The problem is that the X-axis ticks (the years) look as if they belong to the chart below. It takes some time to realize that the numbers are years of the upper graph, and not the X-axis of the graph below. Moving the numbers upwards by several pixels would have fixed that.

Now, it is more clear that “1990” and “2018” relate to the time-series graph above.
Before (left) and after (right).

Let’s talk about the left-side bar chart. It took me a while to understand what it is. As a matter of fact, I managed to write a critique paragraph about that bar chart, how it is unclear what the percentages are, and how they were computed. Only then had I noticed the explanation below. Such confusion isn’t the viewer’s fault. Since we usually scan images from top to bottom, moving the title to the top of the chart will reduce this confusion. The word “percent” is also redundant in that title since it comes after the percent sign.

Moving the explanation to the top makes it easier to notice. Before (left) and after (right)

The last point that is worth optimizing is the color order. Consistent element order in an image makes navigation and comprehension much easier. When the order is preserved, our brain can use mental shortcuts without losing much information. When these shortcuts are broken, the brain has to work harder. What am I talking about? The graph author made the correct decision to use different font colors in the graph title to specify which color stands for which gender. This way, we don’t need a separate legend, and this is good. The title is an ordered sequence of words. The visualizer could use this order to create the order heuristic that is so helpful. Such a heuristic isn’t always possible. Fortunately for the visualizer (and sadly for the society), the salary gap in all the occupations in this graph have the same direction: men earn more than women. As a result, the rightmost part has all the green dots on the right, and the purple dots are on the left. This direction is opposite to the gender direction in the title and the color direction in the bar chart. To fix this situation, I made sure that the color that stands for the women (purple) is always to the left of the color that designates the men (green).

Keeping the color order. Before (left) and after (right)

So, this is the final result. I hope you can see why I like it better.

That’s how I took and excellent graph and made it even more awesome.

Data visualization is not only dots, bars, and pies

Look at this wonderful piece of data visualization (taken from here). If you know the terms “tertiary structure” and “glycan”, there is NO way you miss the message that the author of this figure wanted to convey.

Also, note how using appropriate colors in the title, the authors got rid of graph legend.

Lie factor in ad graphs

It’s fun to look at the visit statistics and to discover old stories. I wrote this post in 2016. For a reason I don’t know, this post has been one of the most viewed posts in my blogs during the last week. 

So, I decided to publish it again. I won’t add any new examples, but if you want to see more stuff, type [lying with data visualization] in your favorite search engine

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads to a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to the Meuheded (the health care provider who run the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much, compared to the actual numbers, the green bar is 4:20 minutes, and the light-blue one is approximately seven minutes, and not 2:39, as the number says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know that they lied. And this lie can be quantified.

 

 

 

Nice but useless data visualization

Network visualization can mesmerize and hypnotize. Chord diagrams are especially cool because they are so colorful and smooth. The problem is that sometimes, the result doesn’t provide any actual value, and serves as a cute illustration. Cute illustrations are cute; they help put some “easiness” to the text without the risk of looking too unprofessional. 

Take the two examples below.

One example (taken from here) shows worldwide migration patterns in a clear and useful way. You can take a look at the graph and make real conclusions.

The other example (taken from here) is mostly a useless illustration.

The only “conclusion” that a viewer can make out of this graph is “everything is connected with everything.”

This type of conclusion is OK for an ad or a general overview of a problem, but it is NOT a valid way to end a discussion. 

Logarithmic scale misinforms. Period

Being a data scientist and a self-proclaimed data visualization expert, I like using log scale graphs when I find them appropriate. However, as a speaker and a communicator, I refrain from using them in presentations as much as possible. From my experience as a data visualization lecturer, I noticed that even “technical” struggle grasping the concept of log scale graphs. 

One of the Coronavirus side effects was the introduction of the term “exponential growth” to every living room. Naturally (to some of us), exponential growth is best presented using a semi-log graph, where the X-axis represents the time (linear), and the Y-axis represents the degree of magnitude of a value (log scale). 

A recent study (link) tested and demonstrated how bad log-scale is. The research title is “The Logarithmic Scale Misinforms the Public and Affects Policy Preferences.” From my experience, log scale graphs misinform everybody. Except for experienced data scientists. Nothing can confuse or misinform us, obviously 😉

It is a bummer though that data visualization in that paper sucks so much.

Don’t publish graphs like this. Especially not in data visualization papers.

Thanks to Bella Graph who pointed me to the original study.

Visualising Odds Ratio — Henry Lau

Besides being a freelancer data scientist and visualization expert, I teach. One of the toughest concepts to teach and to visualize is odds ratio. Today, I stumbled upon a very interesting post that deals exactly with that

On Thursday 7 May, the ONS published analysis comparing deaths involving COVID-19 by ethnicity. There’s an excellent summary on twitter but the headline is that when taking into account age and other socio-demographic factors, such as deprivation, household composition, education, health and disability, there is higher risk for some ethnic groups of a COVID related…

Visualising Odds Ratio — Henry Lau

Bad advice from a reputable source is bad advice.

Would you buy a grammar book with a clear spelling mistake on its cover? I hope not. That’s what happened to IBM when it published it’s new data visualization guide. I didn’t bother reading the manual because of what IBM decided to use as the first image of their guide.

We use graphs to transfer information into images that are supposed to be later transformed in our brains to information. What visual attributes do we use to interpret the information behind a pie chart? It is the segment angle, its area, or maybe the arc length? Most probably, the answer is “all of the above” (see Robert Kosara’s works for more info). When done right, the three attributes of pie segments are linearly connected one to another, which allows synergism between the visual clues.
But what did our friends at IBM do? The deliberately distorted the data! I took the screenshot from the guide homepage and made some measurements.
The purple segment has the angle of 182 degrees, and the angle of the black segment is 75 degrees, which gives us the ratio of 2.42. However, while the radius of the purple segment is 135 pixels, the radius of the black one is only 110 pixels. Why is this a problem? Well, due to the radius differences, the ratio between the arc lengths is 2.91, and the ratio between the areas is 3.66. So now, let me ask you: what is the ratio between the numbers represented by the purple and the black segments?
It is correct that the colors that IBM people used in their guide are neat, but data visualization that distorts information is not visualization but a piece of garbage. I assume that IBM produces decent computers, but don’t learn data visualization from them

The quintessence of data visualization usefulness

I have to admit, I was skeptical at the beginning of the COVID-19 crisis. I started becoming skeptical now when it seems that the crisis didn’t hit my country too hard. But then I saw the graphs in this Financial Times article, and the skepticism disapeared. The graphs are accompanied by hundreds of words, but there is no need for reading the text to understand almost everything.

These graphs are so good, so convincing, so well performed, they don’t leave any place for doubt or misunderstanding of the message the author wants to convey.

If you study data visualization, look at these graphs. Look at the color choice, legend location, and design. Look at the ticks on the X- and Y-axes, how they are spaced and typeset. Note the amount of details on the axes, specifically how sparse these details are.

If there is only one document you can read about data visualization, this is the one

I’m sorting my teaching material, and I found this gem. The UK Government Statistical Service published a guideline for effective data visualization and tables. If you know a busy person who doesn’t have time to study data visualization and can only read one document, this document is for them (it has less than 40 pages full of examples). Click o the image above to go to the guideline

Data giraffe is sometimes a feature, not a problem

I wrote about data giraffes two weeks ago. Usually, “data giraffes” are a problem and we need to work hard in order to solve it. Sometimes, they are a useful feature. Take a look at this NYT front page that shows the number of new unemployment applications in the United States over the time

And this is the pseudochartchart version of the same data

Credits: I’ve found these examples on Stott Berkun’s page.

An interesting solution of the data giraffe problem

Photo by Pixabay on Pexels.com

A data giraffe is a situation where a very prominent data point shades everything else. I learned this term from a post by Pini Yakuel and immediately liked it a lot.

Taken from https://www.optimove.com/blog/beware-the-giraffes-in-your-data
Taken from https://www.optimove.com/blog/beware-the-giraffes-in-your-data

Dealing with data giraffes is hard, especially when dealing with bar charts. Today I saw one interesting approach to this problem

Katherine S. Rowell is a co-funder of a Boston firm that specializes in data visualization. In December, she published a post dedicated to one of the most popular but also most abused graph types, the bar charts. One of the examples in her post demonstrates a nice treatment of data giraffes

http://ksrowell.com/blog-visualizing-data/2019/12/18/bar-humbug/

In this example, Katherine draws the graph twice. The zoomed-out version shows the giraffes in all their glory, while the zoomed-in one gives the spotlight to the foxes, hyenas, and mice.
Also, note how these graphs respect the rules that every bar chart has to include the zero.

Tips for making remote presentations

Before becoming a freelancer data scientist, I used to work in a distributed company. Remote communication, including remote presentations were the norm for me, long before the remote work experiment no one asked for. In this post, I share some tips for delivering better presentations remotely.

Me presenting in front of the computer

  • Stand up! Usually, we stand up when we present in front of live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk which allows me to stand up and to raise the camera to my face level.
  • If you can’t raise the camera, stay sitting. You don’t want your audience staring at your groin.
  • I always use a presentation remote control. It frees me up and lets me move more naturally. My remote is almost ten years old and I have a strong emotional attachment to it
  • When presenting, it is very important to see your audience. Use two monitors. Use one monitor for screen sharing, and the other one to see the audience.
  • Put the Skype/Zoom/whatever window that shows your audience under the camera. This way you’ll look most natural on the other side of the teleconference.
  • Starting a presentation in Powerpoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down- buttons in my presentation remote control work with the Reader. The “make screen black” button doesn’t.
  • I open a “lightable view” of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes using a “real” presentation program, but it is good enough.

Auditorium in Chisinau showing me on their screen

  • Make a dry run. Ideally, the try run should be a day or two before the event, to make sure all the technical problems are fixed.
  • Go online at least five minutes before the schedule. Be in front of the camera, don’t let the audience stare at your empty room
  • Make sure nothing in your background will embarrass you. This risk is especially high if you present from home or a hotel. Nobody needs to see your bed during a business meeting.

ASCII histograms are quick, easy to use and to implement

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at the distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when we did most of our work in the console, and when creating a plot from Python required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.

Data visualization as an engineering task – a methodological approach towards creating effective data visualization

In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

That was a very interesting conference, tight with interesting talks.

Next year, I plan to attend the Bucharest edition of NDR, where I will also give a talk with the working title “The biggest missed opportunity in data visualization”

Sometimes, you don’t really need a legend

This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python

plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.legend()

The last line in the code above does a small magic and adds a nice legend

This image has an empty alt attribute; its file name is image.png

In Excel, we don’t even need to do anything, the legend is added for us automatically.

This image has an empty alt attribute; its file name is image-1.png

So, what is the problem?

What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following. We either jump from the graph to the legends dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Think Fast and Slow” to learn more about how our brain works).

What would be the shortcut here? Well, note how the line for Israel lies mostly below the line for Italy which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the line order in these two pieces of information isn’t conserved. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

And if we have more lines in the graph, the situation is even worse.

This image has an empty alt attribute; its file name is image-2.png

Can we improve the graph?

Yes we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands.

plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.legend()
This image has an empty alt attribute; its file name is image-3.png

We still have to work hard but at least we can trust our brain’s shortcut.

If we have more time

If we have some more time, we may get rid of the (classical) legend altogether.

countries = [c for c in gdp.columns if c != 'Year']
fig, ax = plt.subplots()
for i, c in enumerate(countries):
    ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
    x = gdp.Year.max()
    y = gdp[c].iloc[-1]
    ax.text(x, y, c, color=f'C{i}', va='center')
seaborn.despine(ax=ax)

(if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

This image has an empty alt attribute; its file name is image-4.png

Isn’t it better? Now, the viewer doesn’t need to zap from the lines to the legend; we show them all the information at the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome.

This image has an empty alt attribute; its file name is image-5.png

This graph is much easier to digest, compared to the first one and it also provides more useful information.

.

This image has an empty alt attribute; its file name is image-6.png

I agree that this is a mess. The life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look what he did in a similar situation.

percent-bachelors-degrees-women-usa

I also recommend reading my older post where I compared graph legends to muttonchops.

In conclusion

Sometimes, no legend is better than legend.

This post, in Hebrew: [link]

What do we see when we look at slices of a pie chart?

What do we see when we look at slices of a pie chart? Angles? Areas? Arc length? The answer to this question isn’t clear and thus “experts” recommend avoiding pie charts at all.

Robert Kosara is a Senior Research Scientist at Tableau Software (you should follow his blog https://eagereyes.org), who is very active in studying pie charts. In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies. 

Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion

While this study suggests that the charts are read by area, itis not conclusive. In particular, the possibility of pie chart usersre-projecting the chart to read them cannot be ruled out. Furtherexperiments are therefore needed to zero in on the exact mechanismby which this common chart type is read.

Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

The previous Kosara’s studies had strong practical implications, the most important being that pie charts are not evil provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answer to the questions in the beginning of this post are still unclear. Maybe, the “real answer” to these questions is “a combination of thereof”.

Error bars in bar charts. You probably shouldn’t

This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

It started with a paper by prof. Gerd Gigerenzer whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks” contained a simple graph that meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph mean to convey uncertainty. However, the data visualization selection that Gigerenzer and his team selected is simply wrong.

First of all, look at the leftmost bar, it demonstrates so many problems with error bars in general, and in error bars in barplots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in negative percentage of correct inferences?

The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated risk” from cover to cover. Twice.

Why is this important?

Communicating uncertainty is super important. Take a look at this 2018 study with the self-explaining title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

Image result for clinton trump polls
Source HuffPost

Also recall the surprise when Donald Trump won the presidential elections despite the fact that most of the polls predicted that Hillary Clinton had higher chances to win. Nobody cared about uncertainty, everyone saw the graphs!

Why not error bars?

Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

First of all, error bars tend to be symmetric (although they don’t have to) which might lead to the situation that we saw in the first example above: implying illegal values.

Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.

stacked round gold-colored coins on white surface

More specifically to bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is seen inside the physical bar. See?undefined

But that’s not all. The width of the error bars separates the error lines and makes the comparison even harder. Compare the readability of error lines in the two examples below

The proximity of the error lines in the second example (take from this site) makes the comparison easier.

Are there better alternatives?

Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why, but also surveys several alternatives

Nathan Yau from flowingdata.com had an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts and more.

Claus Wilke’s book “Fundamentals of Data Visualization” has a dedicated chapter to uncertainty with and even more detailed review [link].

Visualize uncertainty about the future” is a Science article that deals specifically with forecasts

Robert Kosara from Tableu experimented with visualizing uncertainty in parallel coordinates.

There are many more examples and experiments, but I think that I will stop right now.

The bottom line

Communicating uncertainty is important.

Know your tools.

Try avoiding error bars.

Bars and bars don’t combine well, therefore, try harder avoiding error bars in bar charts.

Pseudochart. It’s like a pseudocode but for charts

Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. People write pseudocode to isolate the “bigger picture” of an algorithm. Pseudocode doesn’t care about the particular implementation details that are secondary to the problem, such as memory management, dealing with different encoding, etc. Writing out the pseudocode version of a function is frequently the first step in planning the implementation of complex logic.

Similarly, I use sketches when I plan non-trivial charts, and when I discuss data visualization alternatives with colleagues or students.

One can use a sheet of paper, a whiteboard, or a drawing application. You may recognize this approach as a form of “paper prototyping,” but it deserves its own term. I suggest calling such a sketch a “pseudochart”*. Like a piece of pseudocode, the purpose of a pseudochart is to show the visualization approach to the data, not the final graph itself.

* Initially, I wanted to use the term “pseudograph” but the network scientists already took it for themselves.

** The first sentence of this post is a taken from the Wikipedia.

כוון הציר האפקי במסמכים הנכתבים מימין לשמאל

אני מחפש דוגמאות נוספות

יש לכם דוגמה של גרף עברי ״הפוך״? גרפים בערבית או פארסי? שלחו לי.

X-axis direction in Right-To-Left languages (part two)

I need more examples

Do you have more examples of graphs written in Arabic, Farsi, Urdu or another RTL language? Please send them to me.

Textbook examples

I already wrote about my interest in data visualization in Right-To-Left (RTL) languages. Recently, I got copies of high school calculus books from Jordan and the Palestinian Authority.

Both Jordan and PA use the same (Jordanian) school program. In both cases, I was surprised to discover that they almost never use Latin or Greek letters in their math notation. Not only that, the entire direction of the the mathematical notation is from right to left. Here’s an illustrative example from the Palestinian book.

Screenshot: Arabic text, Arabic math notation and a graph

And here is an example from Jordan

What do we see here?

  • the use of Arabic numerals (which are sometimes called Eastern Arabic numerals)
  • The Arabic letters س (sin) and ص (saad) are used “instead of” x and y (the Arabic alphabet doesn’t have the notion of capital letters). The letter qaf (ق) is used as the archetypical function name (f). For some reason, the capital Greek Delta is here.
  • More interestingly, the entire math is “mirrored”, compared to the Left-To-Write world — including the operand order. Not only the operand order is “mirrored”, many other pieces of math notation are mirrored, such as the square root sign, limits and others.

Having said all that, one would expect to see the numbers on the X-axis (sorry, the س-axis) run from right to left. But no. The numbers on the graph run from left to right, similarly to the LTR world.

What about mathematics textbooks in Hebrew?

Unfortunately, I don’t have a copy of a Hebrew language book in calculus, so I will use fifth grade math book

Despite the fact that the Hebrew text flows from right to left, we (the Israelis) write our math notations from left to right. I have never saw any exceptions of this rule.

In this particular textbook, the X axis is set up from left to right. This direction is obvious in the upper example. The lower example lists months — from January to December. Despite the fact the the month names are written in Hebrew, their direction is LTR. Note that this is not an obvious choice. In many version of Excel, for example, the default direction of the X axis in Hebrew document is from right to left.

I need more examples

Do you have more examples of graphs written in Arabic, Farsi, Urdu or another RTL language? Please send them to me.

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “is a nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in the Israeli society. This is a fascinating report. I recommend reading without any connection to data visualization.

But this post does not deal with the Isreali society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course the can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most of the cases, such a transformation would make a perfect sense. In most of the cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report that provides a broad overview of the Israeli society. The remaining of this report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigating through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables will be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided additional information layers that make the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Dumout. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why adding text labels to graph lines, when you can build graph lines using text labels? On microtext lines

richardbrath

Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?

unemployment_plain

Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifing visual attention…

View original post 1,323 more words

איך אומרים דאטה ויזואליזיישן בעברית?

This post is written in Hebrew about a Hebrew issue. I won’t translate it to English.

אני מלמד data visualization בשתי מכללות בישראלבמכללת עזריאלי להנדסה בירושלים ובמכון הטכנולוגי בחולון. כשכתבתי את הסילבוס הראשון שלי הייתי צריך למצוא מונח ל־data visualization וכתבתיהדמיית נתונים״ אומנם זה הזכיר לי קצת תהליך של סימולציה, אבל האופציה האחרת ששקלתי היתה ״דימות״ וידעתי שהיא שמורה ל־imaging, דהיינו תהליך של יצירת דמות או צורה של עצם, בעיקר בעולם הרפואה.

הבנתי שהמונח בעייתי בשיעור הראשון שהעברתי. מסתברששניים מארבעת הסטודנטים שהגיעו לשיעור חשבו שקורס ״הדמיית נתונים בתהליך מחקר ופיתוח״ מדבר על סימולציות.

מתישהו שמעתי מחבר של חבר שהמונח הנכון ל־visualization זה הדמאה, אבל זה נשמע לי פלצני מדי, אז השארתי את ה־״הדמיה״ בשם הקורס והוספתי “data visualization” בסוגריים.

היום, שלוש שנים אחרי ההרצאה הראשונה שהעברתי, ויומיים לפני פתיחת הסמסטר הבא, החלטתי לגגל (יש מילה כזאת? יש!) את התשובה. ומה מסתבר? עלון ״למד לשונך״ מס׳ 109 של האקדמיה ללשון עברית שיצא לאור בשנת 2015 קובע שהמונח ל־visualization הוא הַחְזָיָה. לא יודע מה אתכם, אבל אני לא משתגע על החזיה. עוד משהו שאני לא משתגע עליו הוא שבתור הדוגמא להחזיה, האקדמיה החלטיה לשים תרשים עוגה עם כל כך הרבה שגיאות!

Screen Shot 2018-10-23 at 20.35.52

נראה לי שאני אשאר עם הדמיה. ויקימילון מרשה לי.

נ.ב. שמתם לב שפוסט זה השתמשתי במקף עברי? אני מאוד אוהב את המקף העברי.

Data visualization in right-to-left languages

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does.

Right-to-left (RTL) languages such as Hebrew, Arabic, and Farsi are used by roughly 1.8 billion people around the world. Many of them consume data in their native languages. Nevertheless, I have never seen any research or study that explores data visualization in RTL languages. Until a couple of days ago, when I saw this interesting observation by Nick Doiron “Charts when you read right-to-left“.

I teach data visualization in Israeli colleges. Whenever a student asks me RTL-related questions, I always answer something like “it’s complicated, let’s not deal with that”. Moreover, in the assignments, I even allow my students to submit graphs in English, even if they write the report in Hebrew.

Nick’s post made me wonder about data visualization do’s and don’ts in RTL environments. Should Hebrew charts differ from Arabic or Farsi? What are the accepted practices?

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does. I want to collect as many examples of data visualization in RTL languages. Links to research articles are more than welcome. You can leave your comments here or send them to boris@gorelik.net.

Thank you.

The image at the top of this post is a modified version of a graph that appears in the post that I cite. Unfortunately, I wasn’t able to find the original publication.

“Any questions?” How to fight the awkward silence at the end of a presentation?

If you ever gave or attended a presentation, you are familiar with this situation: the presenter asks whether there are any questions and … nobody asks anything. This is an awkward situation. Why aren’t there any questions? Is it because everything is clear? Not likely. Everything is never clear. Is it because nobody cares? Well, maybe. There are certainly many people that don’t care. It’s a fact of life. Study your audience, work hard to make the presentation relevant and exciting but still, some people won’t care. Deal with it.

However, the bigger reasons for lack of the questions are human laziness and the fear of being stupid. Nobody likes asking a question that someone will perceive as a stupid one. Sometimes, some people don’t mind asking a question but are embarrassed and prefer not being the first one to break the silence.

What can you do? Usually, I prepare one or two questions by myself. In this case, if nobody asks anything, I say something like “Some people, when they see these results ask me whether it is possible to scale this method to larger sets.”. Then, depending on how confident you are, you may provide the answer or ask “What do you think?”.

You can even prepare a slide that answers your question. In the screenshot below, you may see the slide deck of the presentation I gave in Trento. The blue slide at the end of the deck is the final slide, where I thank the audience for the attention and ask whether there are any questions.

My plan was that if nobody asks me anything, I would say “Thank you again. If you want to learn more practical advises about data visualization, watch the recording of my tutorial, where I present this method  <SLIDE TRANSFER, show the mockup of the “book”>. Also, many people ask me about reading suggestions, this is what I suggest you read: <SLIDE TRANSFER, show the reading pointers>

Screen Shot 2018-09-17 at 10.10.21

Luckily for me, there were questions after my talk. Luckily, one of these questions was about practical advice so I had a perfect excuse to show the next, pre-prepared, slide. Watch this moment on YouTube here.

Graphing Highly Skewed Data – Tom Hopper

My colleague, Chares Earl, pointed me to this interesting 2010 post that explores different ways to visualize categories of drastically different sizes.

The post author, Tom Hopper, experiments with different ways to deal with “Data Giraffes”. Some of his experiments are really interesting (such as splitting the graph area). In one experiment, Tom Hopper draws bar chart on a log scale. Doing so is considered as a bad practice. Bar charts value (Y) axis must include meaningful zero, which log scale can’t have by its definition.

Other than that, a good read Graphing Highly Skewed Data – Tom Hopper

Sometimes, less is better than more

Today, during the EuroSciPy conference, I gave a presentation titled “Three most common mistakes in data visualization and how to avoid them”. The title of this presentation is identical to the title of the presentation that I gave in Barcelona earlier this year. The original presentation was approximately one and a half hours long. I knew that EuroSciPy presentations were expected to be shorter, so I was prepared to shorten my talk to half an hour. At some point, a couple of days before departing to Trento, I realized that I was only allocated 15 minutes. Fifteen minutes! Instead of ninety.

Frankly speaking, I was in a panic. I even considered contacting EuroSciPy organizers and asking them to remove my talk from the program. But I was too embarrassed, so I decided to take the risk and started throwing slides away. Overall, I think that I spent eight to ten working hours shortening my presentation. Today, I finally presented it. Based on the result, and on the feedback that I got from the conference audience, I now know that the 15-minutes version is better than the original, longer one. Video recording of my talk is available on Youtube and is embedded below. Below is my slide deck

 

 

Illustration image credit: Photo by Jo Szczepanska on Unsplash

An even better data visualization workshop

Boris Gorelik teaching in front of an audience.

Yesterday, I gave a data visualization workshop at EuroSciPy 2018 in Trento. I spent HOURs building and improving it. I even developed a “simple to use, easy to follow, never failing formula” for data visualization process (I’ll write about it later).

I enjoyed this workshop so much. Both preparing it, and (even more so) delivering it. There were so many useful questions and remarks. The most important remark was made by Gael Varoquaux who pointed out that one of my examples was suboptimal for vision impaired people. The embarrassing part is that one of the last lectures that I gave in my college data visualization course was about visual communication for the visually impaired. That is why the first thing I did when I came to my hotel after the workshop was to fix the error. You may find all the (corrected) material I used in this workshop on GitHub. Below, is the video of the workshop, in case you want to follow it.

 

 

 

Photo credit: picture of me delivering the workshop is by Margriet Groenendijk

Meet me at EuroSciPy 2018

I am excited to run a data visualization tutorial, and to give a data visualization talk during the 2018 EuroSciPy meeting in Trento, Italy.

My tutorial “Data visualization — from default and suboptimal to efficient and awesome”will take place on Sep 29 at 14:00. This is a two-hours tutorial during which I will cover between two to three examples. I will start with the default Matplotlib graph, and modify it step by step, to make a beautiful aid in technical communication. I will publish the tutorial notebooks immediately after the conference.

My talk “Three most common mistakes in data visualization” will be similar in nature to the one I gave in Barcelona this March, but more condensed and enriched with information I learned since then.

If you plan attending EuroSciPy and want to chat with me about data science, data visualization, or remote working, write a message to boris@gorelik.net.

Full conference program is available here.

Evolution of a complex graph. Part 1. What do you want to say?

From time to time, people ask me for help with non-trivial data visualization tasks. A couple of weeks ago, a friend-of-a-friend-of-a-friend showed me a set of graphs with the following note:

Each row is a different use case. Each use case was tested on three separate occasions – columns 1,2,3. We hope to show that the lines in each row behave similarly, but that there are differences between the different rows.

Before looking at the graphs, note the last sentence in the above comment. Knowing what you want to show is an essential and not trivial part of a data visualization task. Specifying what is it precisely that you want to say is the first required task in any communication attempt, technical or not.

For the obvious reasons, I cannot share the original graphs that that person gave me. I managed to re-create the spirit of those graphs using a combination of randomly generated arrays.
The original graph: A 3-by-4 panel of line charts
Notice how the X- and Y- axes are aligned between all the subplots. Such alignment is a smart move that provides a shared scale and allows faster and more natural comparison between the curves. You should always try aligning your axes. If aligning isn’t possible, make sure that it is absolutely, 100%, clear that the scales are different. Slight differences are very confusing.

There are several small things that we can do to improve this graph. First, the identical legends in every subplot are a useless waste of ink and thus, of your viewers’ processing power. Since they are identical, these legends do nothing but distracting the viewer. Moreover, while I understand how a variable name such as event_prob appeared on a graph, showing such names outside technical teams is a bad practice. People who don’t share intimate knowledge with the underlying data will find human-readable labels easier to comprehend, making your message “stickier.”
Let’s improve the signal-to-noise ratio of this plot.
An improved version of the 3-by-4 grid of line charts

According to our task, each row is a different use case. Notice that I accompanied each row with a human-readable label. I didn’t use cryptic code such as group_001, age_0_10 or the such.
Now, let’s go back to the task specification. “We hope to show that the lines in each row behave similarly, but that there are differences between the separate rows.” Remember my advice to always use conclusions as graph titles? Let’s test how such a title will look like

A hypothetical screenshot. The title says: "low intra- & high inter- group variability"

Really? Is there a better way to justify the title? I claim that there is.

Let’s experiment a little bit. What will happen if we will plot all the lines on the same graph? By doing so, we might create a stronger emphasize of the similarities and the differences.

Overlapping lines that show several repetitions in four different groups
Not bad. The separate lines create some excessive noise, and the legend isn’t the best way to label multiple lines, so let’s improve the graph even further.

Curves representing four different data groups. Shaded areas represent inter-group variability

Note that meaningful ticks on the X-axis. The 30, 180, and 365-day marks provide useful anchors.

Now, let us go back to our title. “Low intra- and high inter- group variability” is, in fact, two conclusions. If you have ever read any text about technical presentations, you should remember the “one point per slide” rule. How do we solve this problem? In cases like these, I like to use the same graph in two different slides, one for each conclusion.

Screenshot showing two slides. The first one is titled "low within-group variability". The second one is titled "High between-group variability". The graphs in the slides is the same

During a presentation, I would show this graph with the first conclusion as a title. I would talk about the implications of that conclusion. Next, I will say “wait! There is more”, will promote the slide and start talking about the second conclusion.

To sum up,

First, decide what is it that you want to say. Then ask whether your graph says what you want to say. Next, emphasize what you want to say, and finally, say what you want to say.

To be continued

The case that you see in this post is a relatively easy one because it only compares four groups. What will happen if you will need to compare six, sixteen or sixty groups? I will try answering this question in one of my next posts

C for Conclusion

From time to time, I give a lecture about most common mistakes in data visualization. In this lection, I say that not adding a graph’s conclusion as a title is an opportunity wasted

Screenshot. Slide deck. The slide says

In one of these lectures, a fresh university graduate commented that in her University, she was told to never write a conclusion in a graph. According to to the logic she was tought, a scientist is only supposed to show the information, and let his or her peer scientists draw the conclusions by themselves. This sounds like a valid demand except that it is, in my non-humble opinion, wrong. To understand why is that, let’s review the arguments in favor of spelling out the conclusions.

The cynical reason

We cannot “unlearn” how to read. If you show a piece of graphic for its aesthetic value, it is perfectly OK not to suggest any conclusions. However, most of the time, you will show a graph to persuade someone, to convince them that you did a good job, that your product is worth investing in, or that your opponent is ruining the world. You hope that your audience will reach the conclusion that you want them to reach, but you are not sure. Spelling out your conclusion ensures that the viewers use it as a starting point. In many cases, they will be too lazy to think of objections and will adopt your point of view. You don’t have to believe me on this one. The Nobel Prize winner Daniel Kahneman wrote a book about this phenomenon.

What if you want to hear genuine criticism? Use the same trick to ask for it. Write an open question instead of the conclusion to ensure everybody wakes up and start thinking critically.

The self-discipline reason

Some people are not comfortable with the cynical way I suggest to exploit the limitations of the human mind. Those people might be right. For them, I have another reason, self-discipline. Coming up with a short, concise and descriptive title requires effort. This effort slows you down and ensures that you start thinking critically and asking questions. “What does this graph really tells?” “Is this the best way to demonstrate this conclusion?” “Is this conclusion relevant to the topic of my talk, is it worth the time?”. These are very important questions that someone has to ask you. Sure, having a professional and devoted reviewer on your prep team is great but unless you are a Fortune-500 CEO, you are preparing your presentations by yourself.

The philosophical reason

You will notice that my two arguments sound like a hack. They do not talk about the “pure science attitude”, and seem to be detached from the theoretical picture of the idealized scientific process. That is why, when that student objected to my suggestion, I admitted defeat. Being a data scientist, I want to feel good about my scientific practice. It took me a while but at some point, I realized that writing a conclusion as the sole title of a graph or a slide is a good scientific practice and not a compromise.

According to the great philosopher Karl Popper, a mandatory characteristic of any scientific theory is that they make claims that future observations might show to be false. Popper claims that without taking a risk of being proved wrong,  a scientist misses the point  [ref]. And what is the best way to make a clear, risky statement, if not spelling it out as a clear, non-ambiguous title of your graph?

Don’t feel bad, your bases are covered

To sum up, whenever you create a graph or a slide, think hard about what conclusion you want your audience to make out of it. Use this conclusion as your title. This will help you check yourself, and will help your fellow scientists assess your theory. And if a purist professor says you shouldn’t write your conclusions, tell him or her that the great Karl Popper thought otherwise.

 

Meaningless slopes

That fact that you can doesn’t mean that you should! I will say it once again.That fact that you can doesn’t mean that you should! Look at this slopegraph that was featured by “Information is Beautiful”

What does it say? What do the slopes mean? It’s a slopegraph, its slopes should have a meaning. Sure, you can draw a line between one point to another but can you see the problem here? In this nonsense graph, the viewer is invited to look at slopes of lines that connect dollars with years. The proverbial “apples and oranges” are not even close to the nonsense degree of this graph. Not even close.

This page attributes this graph to National Geographic, which makes me even sadder.

 

In defense of three-dimensional graphs

“There is only one thing worse than a pie chart. It’s a 3-D pie chart”. This is what I used to think for quite a long time. Recently, I have revised my attitude towards pie charts, mainly due to the works of Rober Kosara from Tableau. I am no so convinced that pie charts can be a good visualization choice, I even included a session “Pie charts as an alternative to bar charts” in my recent workshop.

What about three-dimensional graphs? I’m not talking about the situations where the data is intrinsically three-dimensional. Such situations lie within the consensus. I’m talking about adding a third dimension to graphs that can work in two dimensions. Like the example below that is taken from a 2017 post by Deven Wisner.

Screenshot: a 3D pie chart with text "The only good thing about this pie chart is that it's starting to look more like [a] real pie"

Of course, this is not a hypothetical example. We all remember how the late Steve Jobs tried to create a false impression of Apple market share

Steve Jobs during a presentation, in front of a

Having said all that, can you think of a legitimate case where adding the third dimension adds more value than distraction? I worked hard, and I finally found it.

 

Take a look at the overlapping density plot (a.k.a “joy plot”).

Three joyplot examples

If you think of this, joyplots are nothing more than 3-d line graphs done well. Most of the time, they provide information-rich data overview that also enables digging into fine details. I really like joyplots. I included one in my recent workshop. Many libraries now provide ready-to-use implementations of joyplots. This is a good thing to have. The only reservation that I have about those implementations is the fact that many of them, including my favorite seaborn, add meaningless colors to the curves. But this is a topic for another rant.

Today’s workshop material

Today, I hosted a data visualization workshop, as a part of the workshop day adjacent to the fourth Israeli Data Science Summit. I really enjoyed this workshop, especially the follow-up questions. These questions are  the reason I volunteer talking about data visualization every time I can. It may sound strange, but I learn a lot from the questions people ask me.

If you want to see the code, you may find it on GitHub. The slide deck is available on Slideshare

Me in front of an audience

 

 

I will host a data visualization workshop at Israel’s biggest data science event

TL/DR

 

What: Data Visualization from default to outstanding. Test cases of tough data visualization

Why:  You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Sadly, most of you do.

When: May 27, 2018 (a day before the DataScience summit)/ 13:00 – 16:00

Where:  Interdisciplinary Center (IDC) at Herzliya.

More info: here.

Timeline:
1. Theoretical introduction: three most common mistakes in data visualization (45 minutes)
2. Test case (LAB): Plotting several radically different time series on a single graph (45 minutes)
3. Test case (LAB): Bar chart as an effective alternative to a pie chart (45 minutes)
4. Test case (LAB): Pie chart as an effective alternative to a bar chart (45 minutes)

More words

According to the conference organizers, the yearly Data Science Summit is the biggest data science event in Israel. This year, the conference will take place in Tel Aviv on Monday, May 28. One day before the main conference, there will be a workshop day, hosted at the Herzliya Interdisciplinary Center. I’m super excited to host one of the workshops, during the afternoon session. During this workshop, we will talk about the mistakes data scientist make while visualizing their data and the way to avoid them. We will also have some fun creating various charts, comparing the results, and trying to learn from each others’ mistakes.

Register here.

Three most common mistakes in data visualization 
and how to avoid them. Now, the slides

Yesterday, I talked in front of the Barcelona Data Science and Machine Learning Meetup about the most common mistakes in data visualization. I enjoyed talking with the local community very much. Judging by the feedback I received during and after the talk, they too, enjoyed my presentation. I uploaded my slides to Slideshare.

Me in front of a screen that shows a bar chart

Enjoy!

ASCII histograms are quick, easy to use and implement

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python was required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.

Tips on making remote presentations

Today, I made a presentation to the faculty of the Chisinau
Institute of Mathematics and Computer Science. The audience gathered in a conference room in Chisinau, and I was in my home office in Israel.

Me presenting in front of the computer

Following is a list of useful tips for this kind of presentations.

  • When presenting, it is very important to see your audience. Thus, use two monitors. Use one monitor for screen sharing, and the other one to see the audience
  • Put the (Skype) window that shows your audience under the camera. This way you’ll look most natural on the other side of the teleconference.
  • Starting a presentation in Powerpoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down- buttons in my presentation remote control work with the Reader. The “make screen black” button doesn’t.
  • I open a “lightable view” of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes using a “real” presentation program, but it is good enough.
  • Stand up! Usually, we stand up when we present in front of live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk which allows me to stand up and to raise the camera to my face level. If you can’t raise the camera, stay sitting. You don’t want your audience staring at your groin.

Auditorium in Chisinau showing me on their screen

 

The case of meaningless comparison

Exposé, an Australian-based data analytics company, published a use case in which they analyze the benefits of a custom-made machine learning solution. The only piece of data in their report [PDF] was a graph which shows the observed and the predicted

Screenshot that shows two time series curves: one for the observed and one for the predicted values

Graphs like this one provide an easy-to-digest overview of the data but are meaningless with respect to our ability to judge model accuracy. When predicting values of time series, it is customary to use all the available data to predict the next step. In cases like that, “predicting” the next value to be equal to the last available one will result in an impressive correlation. Below, for example, is my “prediction” of Apple stock price. In my model, I “predict” tomorrow’s prices to be equal to today’s closing price plus random noise.

Two curves representing two time series - Apple stock price and the same data shifted by one day

Look how impressive my prediction is!

I’m not saying that Exposé constructed a nonsense model. I have no idea what their model is. I do say, however, that their communication is meaningless. In many time series, such as consumption dynamics, stock price, etc, each value is a function of the previous ones. Thus, the “null hypothesis” of each modeling attempt should be that of a random walk, which means that we should not compare the actual values but rather the changes. And if we do that, we will see the real nature of the model. Below is such a graph for my pseudo-model (zoomed to the last 20 points)

diff_series

 

Suddenly, my bluff is evident.

To sum up, a direct comparison of observed and predicted time series can only be used as a starting point for a more detailed analysis. Without such an analysis, this comparison is nothing but a meaningless illustration.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads to a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to the Meuheded (the health care provider who run the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much, compared to the actual numbers, the green bar is 4:20 minutes, and the light-blue one is approximately seven minutes, and not 2:39, as the number says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know that they lied. And this lie can be quantified.

 

 

 

Does chart junk really damage the readability of your graph?

Screen Shot 2018-02-12 at 16.32.56Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink out of all the “ink” in a graph should be as high as possible, preferably, 100%.
Everyone who considers themselves serious about data visualization knows (either formally, or intuitively) about the importance to keep the data-ink ratio high, the merits of high signal-to-noise ratio, the need to keep the “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One of such studies was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

In the study presented by J.D. Kelly, the researchers presented a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was exposed to similar graphs that underwent rigorous removal of all the possible “chart junk.” One such an example is shown below

Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

Unexpectedly enough, there was no difference between the error rate the two groups made. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

I don’t remember how this report got into my “to read” folder. I am amazed I have never heard about it. So, what is my take out of this study? It doesn’t mean we need to abandon the data-ink ratio at all. It does not provide an excuse to add “chart junk” to your charts “just because”. It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has some limitations. Edward Tufte said, “Above all, show data.” My advice is “Show data, enough data, and mostly data.” Your goal is to convey a message, if some decoration (a.k.a chart junk) makes your message more easily digestible, so be it.

Why bar charts should always start at zero?

In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How come can anyone tell me how to start my bar chart? The Paper/Screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

Data visualization is a language. Like any language, data visualization has its set of rules,  grammar if you wish. Like in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try respecting grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

Natan Yau from flowingdata.com has a very informative post

Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

that explores this exact point. Read it.

Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

Also, do remember is that the zero point has to be a meaningful one. That is why, you cannot use a bar chart to depict the weather because, unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice the temperature scale.

Yet another thing to remember is that

It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

(citing Natan Yau)

Gender salary gap in the Israeli high-tech — now the code

Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech. Like 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings. The original one uses a slightly customized style.

 

In defense of double-scale and double Y axes

If you had a chance to talk to me about data visualization, you know that I dislike the use of double Y-axis for anything except for presenting different units of the same measurement (for example inches and meters). Of course, I’m far from being a special case.  Double axis ban is a standard stand among all the people in the field of data visualization education. Nevertheless, double-scale axes (mostly Y-axis) are commonly used both in popular and technical publications. One of my data visualization students in the Azrieli College of Engineering of Jerusalem told me that he continually uses double Y scales when he designs dashboards that are displayed on a tiny screen in a piece of sophisticated hardware. He claimed that it was impossible to split the data into separate graphs, due to space constraints, and that the engineers that consume those charts are professional enough to overcome the shortcomings of the double scales. I couldn’t find any counter-argument.

When I tried to clarify my position on that student’s problem, I found an interesting article by Financial Times commentator John Auther, called “Lies, Damned Lies and Statistics.” In this article, John Auther reviews the many problems a double scale can create. He also shows different alternatives (such as normalization). However, at the end of that article, John Auther also provides strong and valid arguments in favor of the moderate use of double scales. John Auther notices strange dynamics of two metrics

A chart with two Y axes - one for EURJPY exchange rate and the other for SPX Index
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

It is extraordinary that two measures with almost nothing in common with each other should move this closely together. In early 2007 I noticed how they were moving together, and ended up writing an entire book attempting to explain how this happened.

It is relatively easy to modify chart scales so that “two measures with almost nothing in common […] move […] closely together”. However, it is hard to argue with the fact that it was the double scale chart that triggered that spark in the commentator’s head.  He acknowledges that normalizing (or rebasing, as he puts it) would have resulted in a similar picture

Graph that depicts the dynamics of two metrics, brought to the same scale
Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

But

However, there is one problem with rebasing, which is that it does not take account of the fact that a stock market is naturally more variable than a foreign exchange market. Eye-balling this chart quickly, the main message might be that the S&P was falling faster than the euro against the yen. The more important point was that the two were as correlated as ever. Both stock markets and foreign exchange carry trades were suffering grievous losses, and they were moving together — even if the S&P was moving a little faster.

I am not a financial expert, so I can’t find an easy alternative that will provide the insight John Auther is looking for while satisfying my purist desire to refrain from using double Y axes. The real question, however, is whether such an alternative is something one should be looking for. In many fields, double scales are the accepted language. Thanks to the standard language, many domain experts are able to exchange ideas and discover novel insights.  Reinventing the language might bring more harm than good. Thus, my current recommendations regarding double scales are:

Avoid double scales when possible, unless its a commonly accepted practice. In which case, be persistent and don’t lie.

 

Don’t take career advises from people who mistreat graphs this badly

Recently, I stumbled upon a report called “Understanding Today’s Chief Data Scientist” published by an HR company called Heidrick & Struggles. This document tries to draw a profile of the modern chief data scientist in today’s Big Data Era. This document contains the ugliest pieces of data visualization I have seen in my life. I can’t think of a more insulting graphical treatment of data. Publishing graph like these ones in a document that tries to discuss careers in data science is like writing a profile of a Pope candidate while accompanying it with pornographic pictures.

Before explaining my harsh attitude, let’s first ask an important question.

What is the purpose of graphs in a report?

There are only two valid reasons to include graphs in a report. The first reason is to provide a meaningful glimpse into the document. Before a person decided whether he or she wants to read a long document, they want to know what is it about, what were the methods used, and what the results are. The best way to engage the potential reader to provide them with a set of relevant graphs (a good abstract or introduction paragraph help too). The second reason to include graphs in a document is to provide details that cannot be effectively communicating by text-only means.

That’s it! Only two reasons. Sometimes, we might add an illustration or two, to decorate a long piece of text. Adding illustrations might be a valid decision provided that they do not compete with the data and it is obvious to any reader that an illustration is an illustration.

Let the horror begin!

The first graph in the H&S report stroke me with its absurdness.

Example of a bad chart. I have no idea what it means

At first glance, it looks like an overly-artistic doughnut chart. Then, you want to understand what you are looking at. “OK”, you say to yourself, “there were 100 employees who belonged to five categories. But what are those categories? Can someone tell me? Please? Maybe the report references this figure with more explanations? Nope.  Nothing. This is just a doughnut chart without a caption or a title. Without a meaning.

I continued reading.

Two more bad charts. The graphs are meaningless!

OK, so the H&S geniuses decided to hide the origin or their bar charts. Had they been students in a dataviz course I teach, I would have given them a zero. Ooookeeyy, it’s not a college assignment, as long as we can reconstruct the meaning from the numbers and the labels, we are good, right? I tried to do just that and failed. I tried to use the numbers in the text to help me filling the missing information and failed. All in all, these two graphs are a meaningless graphical junk, exactly like the first one.

The fourth graph gave me some hope.

Not an ideal pie chart but at least we can understand it

Sure, this graph will not get the “best dataviz” award, but at least I understand what I’m looking at. My hope was too early. The next graph was as nonsense as the first three ones.

Screenshot with an example of another nonsense graph

Finally, the report authors decided that it wasn’t enough to draw smartly looking color segments enclosed in a circle. They decided to add some cool looking lines. The authors remained faithful to their decision to not let any meaning into their graphical aidsScreenshot with an example of a nonsense chart.

Can’t we treat these graphs as illustrations?

Before co-founding the life-changing StackOverflow, Joel Spolsky was, among other things, an avid blogger. His blog, JoelOnSoftware, was the first blog I started following. Joel writes mostly about the programming business and. In order not to intimidate the readers with endless text blocks, Joel tends to break the text with illustrations. In many posts, Joel uses pictures of a cute Husky as an illustration. Since JoelOnSoftware isn’t a cynology blog, nobody gets confused by the sudden appearance of a Husky. Which is exactly what an illustration is – a graphical relief that doesn’t disturb. But what would happen if Joel decided to include a meaningless class diagram? Sure a class diagram may impress the readers. The readers will also want to understand it and its connection to the text. Once they fail, they will feel angry, and rightfully so

Two screenshots of Joel's blog. One with a Husky, another one with a meaningless diagram

The bottom line

The bottom line is that people have to respect the rules of the domain they are writing about. If they don’t, their opinion cannot be trusted. That is why you should not take any pieces of advice related to data (or science) from H&S. Don’t get me wrong. It’s OK not to know the “grammar” of all the possible business domains. I, for example, know nothing about photography or dancing; my English is far from being perfect. That is why, I don’t write about photography, dancing or creative writing. I write about data science and visualization. It doesn’t mean I know everything about these fields. However, I did study a lot before I decided I could write something without ridiculing myself. So should everyone.

 

What’s the most important thing about communicating uncertainty?

Sigrid Keydana, in her post Plus/minus what? Let’s talk about uncertainty (talk) — recurrent null, said

What’s the most important thing about communicating uncertainty? You’re doing it

Really?

Here, for example, a graph from a blog post

Thousands of randomly looking points. From https://myscholarlygoop.wordpress.com/2017/11/20/the-all-encompassing-figure/

The graph clearly “communicates” the uncertainty but does it really convey it? Would you consider the lines and their corresponding confidence intervals very uncertain had you not seen the points?

What if I tell you that there’s a 30% Chance of Rain Tomorrow? Will you know what it means? Will a person who doesn’t operate on numbers know what it means? The answer, to both these questions, is “no”, as is shown by Gigerenzer and his collaborators in a 2005 paper.

Screenshot: many images for the 2016 US elections

Communicating uncertainty is not a new problem. Until recently, the biggest “clients” of uncertainty communication research were the weather forecasters.  However, the recent “data era” introduced uncertainty to every aspect of our personal and professional lives. From credit risk to insurance premiums, from user classification to content recommendation, the uncertainty is everywhere. Simply “doing” uncertainty communication, as Sigrid Keydana from the Recurrent Null blog suggested isn’t enough. The huge public surprise caused by the 2016 US presidential election is the best evidence for that. Proper uncertainty communication is a complex topic. A good starting point to this complex topic is a paper Visualizing Uncertainty About the Future by David Spiegelhalter.

The Y-axis doesn’t have to be on the left

Line charts are great to convey the evolution of a variable over the time. This is a typical chart. It has three key components, the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

A typical line chart. The Y-axis is on the left

Usually, you will see the Y-axis at the left part of the graph. Unless you design for a Right-To-Left language environment, placing the Y-axis on the left makes perfect sense. However, left-side Y-axis isn’t a hard rule.

In many cases, more importance is given to the most recent data point. For example, it might be interesting to know a stock price dynamics but today’s price is what determines how much money I can get by selling my stock portfolio.

What happens if we move the axis to the right?

A slightly improved version. The Y-axis is on the right, adjacent to the most recent data point

Now, today’s price of XYZ stock is visible more clearly. Let’s make the most important values explicitly clear:

The final version. The Y-axis is on the right, adjacent to the most recent data point. The axis ticks correspont to actual data points

There are two ways to obtain right-sided Y axis in matplotib. The first way uses a combination of

ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

The second one creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen. Thus, I perform the second option. Following is the code that I used to create the last version of the graph.

np.random.seed(123)
days = np.arange(1, 31)
price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_yticks([]) # Make 1st axis ticks disapear.
ax2 = ax.twinx() # Create a secondary axis
ax2.plot(days,price, '-', lw=3)
ax2.set_xlim(1, max(days))
sns.despine(ax=ax, left=True) # Remove 1st axis spines
sns.despine(ax=ax2, left=True, right=False)
tks = [min(price), max(price), price[-1]]
ax2.set_yticks(tks)
ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')
ixmin = np.argmin(price); ixmax = np.argmax(price);
ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
ax2.set_xticklabels(['Oct, 1',f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}' ])
ylm  = ax2.get_ylim()
bottom = ylm[0]
for ix in [ixmin, ixmax]:
    y = price[ix]
    x = days[ix]
    ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)
    ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)
ax2.set_ylim(ylm)

Next time when you create a “something” vs time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y axis to the left part of your graph and see whether it becomes more readable.

This post was triggered by a nice write-up by  Plotting a Course: Line Charts by a new blogger David (he didn’t mention his last name) from https://thenumberist.wordpress.com/

How to make a graph less readable? Rotate the text labels

This is my “because you can” rant.

Here, you can see a typical situation. You have some sales data that you want to represent using a bar plot.

01_default

Immediately, you notice a problem: the names on the X axis are not readable. One way to make the labels readable is to enlarge the graph.02_large_image

Making larger graphs isn’t always possible. So, the next default solution is to rotate the text labels.

03_rotated

However, there is a problem. Rotated text is read more slowly than standard horizontal text. Don’t believe me? This is not an opinion but rather a result of empirical studies [ref], [ref]. Sometimes, rotated text is unavoidable. Most of the time, it is not.

So, how do we make sure all the labels are readable without rotating them? One option is to move them up and down so that they don’t hinder each other. It is easily obtained with Python’s matplotlib

plt.bar(range(len(people)), sales)
plt.title('October sales')
plt.ylabel('$US', rotation=0, ha='right')
ticks_and_labels = plt.xticks(range(len(people)), people, rotation=0)
for i, label in enumerate(ticks_and_labels[1]):
    label.set_y(label.get_position()[1] - (i % 2) * 0.05)

(note, that I also rotated the Y axis label, for even more readability)

05_alternate_labels

Another approach that will work with even longer labels and that requires fewer code lines it to rotate the bars, not the labels.

07_horizontal_plot

… and if you don’t have a compelling reason for the data order, you might also consider sorting the bars. Doing so will not only make it prettier, it will also make it easier to compare between similar values. Use the graph above to tell whether Teresa Jackson’s sales were higher or lower than those of Marie Richardson’s. Now do the same comparison using the graph below.

08_horizontal_plot_sorted

To sum up: the fact you can does not mean you should. Sometimes, rotating text labels is the easiest solution. The additional effort needed to decipher the graph is the price your audience pays for your laziness. They might as well skip your graphs your message won’t stick.

This was my because you can rant.

Featured image by Flickr user gullevek

Good information + bad visualization = BAD

I went through my Machine Learning tag feed. Suddenly, I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey some information. The problem is that it is bad in so many ways. It is very hard to appreciate the information in a post that shows charts like that. Especially when the post talks about data science that relies so much on data visualization.

via Math required for machine learning — Youth Innovation

I would write a post about good practices in pie charts, but Robert Kosara, of https://eagereyes.org does this so well, I don’t really think I need to reinvent the weel. Pie charts are very powerful in conveying information. Make sure you use this tool well. I strongly suggest reading everything Robert Kosara has to say on this topic.

 

 

Do you REALLY need the colors?

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Look at this example from the seaborn documentation site

>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> tips = sns.load_dataset("tips")
>>> ax = sns.barplot(x="day", y="total_bill", data=tips)

Barplot example with colored bars

This example shows the default barplot and is the first barplot. Can you see how easy it is to add colors to the different columns? But WHY? What do those colors represent? It looks like the only information that is encoded by the color is the bar category. We already have this information in the form of bar location. Having this colorful image adds nothing but a distraction. It is sad that this is the default behavior that seaborn developers decided to adopt.

Look at the same example, without the colors

>>> ax = sns.barplot(x="day", y="total_bill", color='gray', data=tips)

Barplot example with gray bars

Isn’t it much better? The sad thing is that a better version requires memorizing additional arguments and more typing.

This was my because you can rant.

 

When scatterplots are better than bar charts, and why?

From time to time, you might hear that graphical method A is better at representing problem X than method B. While in case of problem Z, the method B is much better than A, but C is also a possibility. Did you ever ask yourselves (or the people who tell you that) “Says WHO?”

The guidelines like these come from theoretical and empirical studies. One such an example is a 1985 paper “Graphical perception and graphical methods for analyzing scientific data.” by Cleveland and McGill. I got the link to this paper from Varun Raj of https://varunrajweb.wordpress.com/.

It looks like a very interesting and relevant paper, despite the fact that it has been it was published 22 years go. I will certainly read it. Following is the reading list that I compiled for my data visualization students more than two years ago. Unfortunately, they didn’t want to read any of these papers. Maybe some of the readers of this blog will …

 

Because you can — a new series of data visualization rants

Here’s an old joke:

Q: Why do dogs lick their balls?
A: Because they can.

Canine behavior aside, the fact that you can do something doesn’t mean that you should to it. I already wrote about one such example, when I compared between chart legends to muttonchops.

Citing myself:

Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.

Stay tuned and check the because-you-can tag.

Featured image by Unsplash user Nicolas Tessari

Untitled

I teach data visualization to in Azrieli College of Engineering in Jerusalem. Yesterday, during my first lesson, I was talking about the different ways a chart design selection can lead to different conclusions, despite not affecting the actual data. One of the students hypothesized that the preception of a figure can change as a function of other graphs shown together. Which was exactly tested in a research I recently mentioned here. I felt very proud of that student, despite only meeting them one hour before that.

Who doesn’t like some merciless critique of others’ work?

Stephen Few is the author of (among others) “Show Me The Numbers“. Besides writing about what should be done, in the field of data visualization, Dr. Few also writes a lot about what should not be done. He does that in a sharp, merciless way which makes it very interesting reading (although, sometimes Dr. Few can be too harsh). This time, it was the turn of the Tableau blog team to be at the center of Stephen Few’s attention, and not for the good reason.

If Tableau wishes to call this research, then I must qualify it as bad research. It produced no reliable or useful findings. Rather than a research study, it would be more appropriate to call this “someone having fun with an eye tracker.”

Reading merciless critique by knowledgeable experts is an excellent way to develop that “inner voice” that questions all your decisions and makes sure you don’t make too many mistakes. Despite the fear to be fried, I really that some day I’ll be able to know what Stephen Few things of my work.

http://www.perceptualedge.com/blog/?p=2718

 

Disclaimer: Stephen Few was very generous to allow me using the illustrations from his book in my teaching.

Featured image is Public domain image by Alan Levine from here

Can the order in which graphs are shown change people’s conclusions?

When I teach data visualization, I love showing my students how simple changes in the way one visualizes his or her data may drive the potential audience to different conclusions. When done correctly, such changes can help the presenters making their point. They also can be used to mislead the audience. I keep reminding the students that it is up to them to keep their visualizations honest and fair.  In his recent post, Robert Kosara, the owner of https://eagereyes.org/, mentioned another possible way that may change the perceived conclusion. This time, not by changing a graph but by changing the order of graphs exposed to a person. Citing Robert Kosara:

Priming is when what you see first influences how you perceive what comes next. In a series of studies, [André Calero Valdez, Martina Ziefle, and Michael Sedlmair] showed that these effects also exist in the particular case of scatterplots that show separable or non-separable clusters. Seeing one kind of plot first changes the likelihood of you judging a subsequent plot as the same or another type.

via IEEE VIS 2017: Perception, Evaluation, Vision Science — eagereyes

As any tool, priming can be used for good or bad causes. Priming abuse can be a deliberate exposure to non-relevant information in order to manipulate the audience. A good way to use priming is to educate the listeners of its effect, and repeatedly exposing them to alternate contexts. Alternatively, reminding the audience of the “before” graph, before showing them the similar “after” situation will also create a plausible effect of context setting.

P.S. The paper mentioned by Kosara is noticeable not only by its results (they are not as astonishing as I expected from the featured image) but also by how the authors report their research, including the failures.

 

Featured image is Figure 1 from Calero Valdez et al. Priming and Anchoring Effects in Visualization

Before and after — the Hebrew holiday season chart

Sometimes, when I see a graph, I think “I could draw a better version.” From time to time, I even consider writing a blog post with the “before” and “after” versions of the plot. Last time I had this desire was when I read the repost of my own post about the crazy month of Hebrew holidays. I created this graph three years ago. Since then, I have learned A LOT. So I thought it would be a good opportunity to apply my over-criticism to my own work. This is the “before” version:

Graph: Tishrei is mostly a non-working month.

There are quite a few points worth fixing in that plot. Let’s review those problems:

  • The point of the original post is to emphasize the amount of NON-working days in Tishrei. However, the largest points represent the working days. As the result, the emphasis goes to the working days, thus reversing the semantics.
  • It is not absolutely clear what point I intended to make using this graph. A short and meaningful title is an effective way to lead the audience towards the desired conclusion.
  • There are three distinct colors in my graph, representing working, half-working and non-working days. The category order is clear. The color order, on the other hand, is absolutely arbitrary. Moreover, green and red are never a good color combination due to the significantly high prevalence of impaired color vision.
  • Y label is rotated. Rotated Y labels are the default option in all the plotting tools that I know. Why is that is beyond my understanding, given the numerous studies that show that reading rotated text takes more time and is more error-prone (for example, see ref, ref, and ref.)
  • One interesting piece of information that one might expect to read from a graph is how many working days are there in year X. One can obtain this information either by counting the dots or by looking at a separate graph. It would be a good idea to make this information readily available to the observer.
  • The frame around the plot is useless.

 

OK, now that we have identified the problems, let’s fix them

  • Emphasize the right things. I will use bigger points for the non-working days and small ones for the working days. I will also use squares instead of circles. Placing several squares one next to the other creates solid areas with less white space in-between. This lack of whitespace will help further emphasizing non-working chunks. I will make to leave some whitespace between the points, to enable counting.
  • What’s your point? I will add an explanatory title. Having given some thought, I came up with “How productive can you be?”. It is short, thought-provoking, and makes the point.
  • Reduce the number of colors. My intention was to use red for non-working days, and blue for the working ones. What color should I use for the half-working (Chol haMoed) days? I don’t want to introduce another color to the improved graph. Since in my case, those days are mostly non-working, I will use a shade of red for Chol haMoed.
  • Improve label readability. One way to solve the rotated Y label problem is to remove the Y label at all! After all, most people will correctly assume that “2006”, “2010”, “2020” and other values represent the years. However, the original post mentions two different methods to count the years, using the Hebrew and Christian traditions. To make it absolutely clear that the graph talks about the Christian (common) calendar, I decided to keep the legend and format it properly.
  • Add more info. I added the total number of working days as a separate column of properly aligned gray text labels. The gray color ensures that the labels don’t compete with the graph.  I also highlighted the current year using a subtle background rectangle.
  • Data-ink ratio. I removed the box around the graph and got rid of lines for the X and Y axes. I also removed the vertical grid lines. I wasn’t sure about the horizontal ones but I decided to keep them in place.

This is the result:

tishrei_working_days_after.png

I like it very much. I’m sure though, that if I revisit it in a year or two, I will find more ways to make it even better.

You may find the code that generates this figure here.

 

A successful failure

Almost half a year ago, I decided to create an online data visualization course. After investing hundreds of hours, I managed to release the first lecture and record another one. However, I decided not to publish new lectures and to remove the existing one from the net. Why? The short answer is a huge cost-to-benefit ratio. For a longer answer, you will have to keep reading this post.

Why creating a course?

It’s not that there are no good courses. There are. However, most of them are tightly coupled with one tool or another. Moreover, many of the courses I have reviewed online are act as an advanced tutorial of a certain plotting tool. The course that I wanted to create was supposed to be tool-neutral, full of theoretical knowledge and bits of practical advice. Another decision that I made was not to write a set of text files (online book, set of Jupyter notebooks, whatever) but to create a course in which the majority of the knowledge is brought to the audience by the means of frontal video lectures. I assumed that this kind of format will be the easiest for the audience to consume.

What went wrong?

So, what went wrong? First of all, you should remember that I work full time at Automattic, which means that every side project is a … side project, that I have to do during my free time. I realized that since the very beginning. However, since I already teach data visualization in different institutions in Israel, I already have a well-formed syllabus with accompanying slide decks full of examples. I assumed that it will take me not more than one hour to prepare every online lecture.

Green screen and a camera in a typical green room setup
Green room. All my friends were very impressed to see it

So, instead of verifying this assumption, I started solving the technical problems, such as buying a nice microphone (which turned out to be a crap), tripods, building a green room in my home office, etc. Once I was satisfied with my technical setup, I decided to record a promo video. Here, I faced a big problem. You see, talking to people and to the camera are completely different things. I feel pretty comfortable talking to people but when I face the camera, I almost freeze. Also, in person-to-person communication, we are somewhat tolerant to small studdering and longish pauses. However, when watching recorded video clips, we expect television quality narration. It turns out that achieving this kind of narration is very hard. Add the fact that English is my third language, and you get a huge time drain. To be able to record a two-minute promo video, I had to write the entire script, rehearse it for a dozen of times, and record it in front of a teleprompter. The filming session alone took around half an hour, as I had to repeat almost every line, time after time.

Screenshot of my YouTube video with 18 views
18 views.

Preparing slide decks for the lectures wasn’t an easy task either. Despite the fact that I had pretty good slide decks, I realized that they are good for an in-class lecture, where I can point to the screen, go back and forth within a presentation, open external URL’s etc. Once I had my slide decks ready, I faced the narration problem once again. So, I had to write the entire lesson’s script, edit it, rehearse for several days, and shoot. At this time, I became frustrated. I might have been more motivated had my first video received some real traffic. However, with 18 (that’s eighteen) views, most of which lasted not more than a minute or two, I hardly felt a YouTube super star. I know that it’s impossible to get a real traction in such a short period, without massive promotion. However, after I completed shooting the second lecture, I realized that I will not be able to do it much longer. Not without quitting my day job. So, I decided to quit.

What now?

Since I already have pretty good texts for the first two lectures, I might be able to convert them to posts in this blog. I also have material for some before-and-after videos that I planned to have as a part of this course. I will make convert them to posts, too, similar to this post on the data.blog.

Was it worth it?

It certainly was! During the preparations, I learned a lot. I learned new things about data visualization. I took a glimpse into the world of video production. I had a chance to restructure several of my presentations.


Featured image for this post by Nicolas Nova under the CC-by license.

 

The first lesson of the data visualization course is available

The first lesson of the course Data Visualization Beyond the Tutorial is online! Go to the lesson page to watch the lesson video. There’s also an assignment!

Do you know a friend, a colleague, a classmate who needs to communicate numbers as part of their works? Let them know about this course. They will thank you 🙂

 

I have created an online data visualization course

Free online course. Data Visualization Beyond the Tutorial. https://gorelik.net/course

If you create charts using your tool’s default settings and your intuition, chances are you’re doing it wrong.

Let me present you an online course that dives into the theory of data visualization and its practical aspects. Every lecture is accompanied by before & after case studies and learner assignments. The course is tool-neutral. It doesn’t matter if you use Python, R, Excel, or pen, and paper.

The first lecture will be published on July 7th. Future lectures will follow every two weeks. Meanwhile, you may visit the course page and watch the intro video. Follow this blog so that you don’t miss new lectures!

Please spread the word! Reblog this post, share it on Twitter (@gorelik_boris), Facebook, LinkedIn or any other network. Tell about this course to your colleagues and friends. The more learners will take this course, the happier I will be.

Chart legends and the Muttonchops

Adding legends to a graph is easy. With matplotlib, for example, you simply call plt.legend() and voilà, you have your legends. The fact that any major or minor visualization platform makes it super easy to add a legend doesn’t mean that it should be added. At least, not in graphs that are supposed to be shared with the public.

Take a look at this interesting graph taken from Reddit:

The chart provides fascinating information. However, to “decipher” it, the viewer needs to constantly switch between the chart and the legend to the right. Moreover, having to encode eight different categories, resulted in colors that are hard to distinguish. And if you happen to be a colorblind person, your chances to get the colors right are significantly lower.

What is the solution to this problem? Let’s reduce the distance between the labels and the data by putting the labels and the data together.

Notice the multiple advantages of the “after” version. First, the viewer doesn’t need to jump back-and-forth to decide which segment represents which data series. Secondly, by moving the legends inside the graph, we freed up valuable real estate area. But that’s not all. The new version is readable by the colorblind. Plus, the slightly bigger letters make the reading easier for the visually impaired. It is also readable and understandable when printed out using a black and white printer.

“Wait a minute,” you might say, “there’s not enough space for all the labels! We’ve lost some valuable information. After all,” you might say, “we now only have four labels, not eight”. Here’s the thing. I think that losing four categories is an advantage. By imposing restrictions, we are forced to decide what is it that we want to say, what is important and what is not. By forcing ourselves to only label larger chunks, we are forced to ask questions. Is the distinction between “Moustache with Muttonchops” and “Moustache with Sideburns” THAT important? If it is, make a graph about Muttonchops and Sideburns. If it’s not, combine them into a single category. Even better, combine them with “Mustache”.

Muttonchops
Muttonchops. By Flickr user GSK

Having the ability to add a legend with any number of categories, using only one code line is super convenient and useful, especially, during data exploration. However, when shared with the public, graphs need to contain as fewer legends as practically needed. Remove the legends, place the labels close to the data. If doing so results in unreadable overlapping labels, refine the graph, rethink your message, combine categories. This may take time and cause frustration, but the result might surprise you. If none of these is possible, put the legend back. At least you tried.

Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.

Evolution of a Plot: Better Data Visualization, One Step at a Time

My latest post on data.blog

Data for Breakfast

The goal of data visualization is to transform numbers into insights. However, default data visualization output often disappoints. Sometimes, the graph shows irrelevant data or misses important aspects; sometimes, the graph lacks context; sometimes, it’s difficult to read. Often, data practitioners “feel” that something isn’t right with the graph, but cannot pinpoint the problem.

In this post, I’ll share the process of visualizing a complex issue using a simple plot. Despite the fact that the final plot looks elementary and straightforward, it took me several hours and trial-and-error attempts to achieve the result. By sharing this process, I hope to accomplish two goals: to offer my perspectives and approaches to data visualization and to learn from other options you suggest. You’ll find the code and the data used in this post here.

Plotting power distribution in the Knesset

This post is devoted to a graph I created to explore…

View original post 1,738 more words