Data visualization as an engineering task – a methodological approach towards creating effective data visualization

In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

That was a very interesting conference, tight with interesting talks.

Next year, I plan to attend the Bucharest edition of NDR, where I will also give a talk with the working title “The biggest missed opportunity in data visualization”

Sometimes, you don’t really need a legend

This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python

plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.legend()

The last line in the code above does a small magic and adds a nice legend

This image has an empty alt attribute; its file name is image.png

In Excel, we don’t even need to do anything, the legend is added for us automatically.

This image has an empty alt attribute; its file name is image-1.png

So, what is the problem?

What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following. We either jump from the graph to the legends dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Think Fast and Slow” to learn more about how our brain works).

What would be the shortcut here? Well, note how the line for Israel lies mostly below the line for Italy which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the line order in these two pieces of information isn’t conserved. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

And if we have more lines in the graph, the situation is even worse.

This image has an empty alt attribute; its file name is image-2.png

Can we improve the graph?

Yes we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands.

plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.legend()
This image has an empty alt attribute; its file name is image-3.png

We still have to work hard but at least we can trust our brain’s shortcut.

If we have more time

If we have some more time, we may get rid of the (classical) legend altogether.

countries = [c for c in gdp.columns if c != 'Year']
fig, ax = plt.subplots()
for i, c in enumerate(countries):
    ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
    x = gdp.Year.max()
    y = gdp[c].iloc[-1]
    ax.text(x, y, c, color=f'C{i}', va='center')
seaborn.despine(ax=ax)

(if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

This image has an empty alt attribute; its file name is image-4.png

Isn’t it better? Now, the viewer doesn’t need to zap from the lines to the legend; we show them all the information at the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome.

This image has an empty alt attribute; its file name is image-5.png

This graph is much easier to digest, compared to the first one and it also provides more useful information.

.

This image has an empty alt attribute; its file name is image-6.png

I agree that this is a mess. The life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look what he did in a similar situation.

percent-bachelors-degrees-women-usa

I also recommend reading my older post where I compared graph legends to muttonchops.

In conclusion

Sometimes, no legend is better than legend.

This post, in Hebrew: [link]

What do we see when we look at slices of a pie chart?

What do we see when we look at slices of a pie chart? Angles? Areas? Arc length? The answer to this question isn’t clear and thus “experts” recommend avoiding pie charts at all.

Robert Kosara is a Senior Research Scientist at Tableau Software (you should follow his blog https://eagereyes.org), who is very active in studying pie charts. In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies. 

Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion

While this study suggests that the charts are read by area, itis not conclusive. In particular, the possibility of pie chart usersre-projecting the chart to read them cannot be ruled out. Furtherexperiments are therefore needed to zero in on the exact mechanismby which this common chart type is read.

Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

The previous Kosara’s studies had strong practical implications, the most important being that pie charts are not evil provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answer to the questions in the beginning of this post are still unclear. Maybe, the “real answer” to these questions is “a combination of thereof”.

Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On Sunday, I wrote about bootstrapping. On Monday, I wrote about visualization uncertainty. Let’s now talk about bootstrapping and uncertainty visualization.

Robert Grant is a data visualization expert who wrote a book about interactive data visualization (which I should read, BTW).

Robert runs an interesting blog from which I learned another approach to uncertainty visualization, bootstrapping.

Source: Robert Grant.

Read the entire post: Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

Error bars in bar charts. You probably shouldn’t

This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

It started with a paper by prof. Gerd Gigerenzer whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks” contained a simple graph that meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph mean to convey uncertainty. However, the data visualization selection that Gigerenzer and his team selected is simply wrong.

First of all, look at the leftmost bar, it demonstrates so many problems with error bars in general, and in error bars in barplots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in negative percentage of correct inferences?

The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated risk” from cover to cover. Twice.

Why is this important?

Communicating uncertainty is super important. Take a look at this 2018 study with the self-explaining title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

Image result for clinton trump polls
Source HuffPost

Also recall the surprise when Donald Trump won the presidential elections despite the fact that most of the polls predicted that Hillary Clinton had higher chances to win. Nobody cared about uncertainty, everyone saw the graphs!

Why not error bars?

Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

First of all, error bars tend to be symmetric (although they don’t have to) which might lead to the situation that we saw in the first example above: implying illegal values.

Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.

stacked round gold-colored coins on white surface

More specifically to bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is seen inside the physical bar. See?undefined

But that’s not all. The width of the error bars separates the error lines and makes the comparison even harder. Compare the readability of error lines in the two examples below

The proximity of the error lines in the second example (take from this site) makes the comparison easier.

Are there better alternatives?

Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why, but also surveys several alternatives

Nathan Yau from flowingdata.com had an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts and more.

Claus Wilke’s book “Fundamentals of Data Visualization” has a dedicated chapter to uncertainty with and even more detailed review [link].

Visualize uncertainty about the future” is a Science article that deals specifically with forecasts

Robert Kosara from Tableu experimented with visualizing uncertainty in parallel coordinates.

There are many more examples and experiments, but I think that I will stop right now.

The bottom line

Communicating uncertainty is important.

Know your tools.

Try avoiding error bars.

Bars and bars don’t combine well, therefore, try harder avoiding error bars in bar charts.

Visualizations with perceptual free-rides

Dr. Richard Brath is a data visualization expert who also blogs from time to time. Each post in Richard’s blog provides a deep, and often unexpected to me, insight into one dataviz aspect or another.

richardbrath

We create visualizations to aid viewers in making visual inferences. Different visualizations are suited to different inferences. Some visualizations offer more additional perceptual inferences over comparable visualizations. That is, the specific configuration enables additional inferences to be observed directly, without additional cognitive load. (e.g. see Gem Stapleton et al, Effective Representation of Information: Generalizing Free Rides2016).

Here’s an example from 1940, a bar chart where both bar length and width indicate data:

Walter_Weld__How_to_chart_data_1960_hathitrust2

The length of the bar (horizontally) is the percent increase in income in each industry.  Manufacturing has the biggest increase in income (18%), Contract Construction is second at 13%.

The width of the bar (vertically) is the relative size of that industry: Manufacturing is wide – it’s the biggest industry – it accounts for about 23% of all industry. Contract Construction is narrow, perhaps the third smallest industry, perhaps around 3-4%.

What’s really interesting is that

View original post 446 more words

Pseudochart. It’s like a pseudocode but for charts

Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. People write pseudocode to isolate the “bigger picture” of an algorithm. Pseudocode doesn’t care about the particular implementation details that are secondary to the problem, such as memory management, dealing with different encoding, etc. Writing out the pseudocode version of a function is frequently the first step in planning the implementation of complex logic.

Similarly, I use sketches when I plan non-trivial charts, and when I discuss data visualization alternatives with colleagues or students.

One can use a sheet of paper, a whiteboard, or a drawing application. You may recognize this approach as a form of “paper prototyping,” but it deserves its own term. I suggest calling such a sketch a “pseudochart”*. Like a piece of pseudocode, the purpose of a pseudochart is to show the visualization approach to the data, not the final graph itself.

* Initially, I wanted to use the term “pseudograph” but the network scientists already took it for themselves.

** The first sentence of this post is a taken from the Wikipedia.

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “is a nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in the Israeli society. This is a fascinating report. I recommend reading without any connection to data visualization.

But this post does not deal with the Isreali society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course the can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most of the cases, such a transformation would make a perfect sense. In most of the cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report that provides a broad overview of the Israeli society. The remaining of this report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigating through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables will be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided additional information layers that make the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Dumout. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why adding text labels to graph lines, when you can build graph lines using text labels? On microtext lines

richardbrath

Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?

unemployment_plain

Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifing visual attention…

View original post 1,323 more words