Boris Gorelik |

Data Science & Communication Consultant and Advisor | AI, Machine Learning, Data Visualization

A new phase in my professional life

May 2, 2021

I’m excited to announce that I’m joining MyBiotics Pharma Ltd as the company’s Head of Data and Bioinformatics. I have been working with this fantastic company and its remarkable people as a freelancer for fourteen fruitful months. But today, I join the MyBiotics family as a full-time member. Together, we will strive to better understanding the interactions between humans and their microbiome to improve health and well-being.

rbt

May 2, 2021 - 1 minute read -
announcement bioinformatics career mybiotics blog
Black lives matter. Lior Pachter

April 30, 2021

Almost one year after it was originally published, I stumbled upon this powerful post.

Today, June 10th 2020, black academic scientists are holding a strike in solidarity with Black Lives Matter protests. I strike with them and for them. This is why: I began to understand the enormity of racism against blacks thirty five years ago when I was 12 years old. A single event, in which I witnessed […]

Black lives matter

April 30, 2021 - 1 minute read -
blog
Super useful videos for advanced data visualizers

April 21, 2021

The great Robert Kosara, also known as the “eager eyes” has started publishing a series of videos he calls Chart Appreciation. In these videos, Robert takes a piece of data visualization from a reputable and known source, and discusses why this particular piece is so good, what decisions were made that made it possible, what alternatives are, and more. If you consider yourself an intermediate or advanced practitioner of data visualization, you should subscribe. Here’s one example.

April 21, 2021 - 1 minute read -
chart-appreciation data visualisation Data Visualization dataviz robert-kosara blog
Career advise. Upgrading data science career

April 11, 2021
From time to time, people send me emails asking for career advice. Here’s one recent exchange.

Hi Boris,

I am currently trying to decide on a career move and would like to ask for your advice.

I have a MSc from a leading university in ML, without thesis.

I have 5 years of experience in data science at , producing ML based pipelines for the products. I have experience with Big Data (Spark, …), ML, deploying models to production…

However, I feel that I missed doing real ML complicated stuff. Most of the work I did was to build pipelines, training simple models, do some basic feature engineering… and it worked good enough.

Well, this IS the real ML job for 91.4%* of data scientists. You were lucky to work in a company with access to data and has teams dedicated to keeping data flowing, neat, and organized. You worked in a company with good work ethics, surrounded by smart people, and, I guess, the computational power was never a big issue. Most of the data scientists that I know don’t have all these perks. Some have to work alone; others need to solve “dull” engineering problems, find ways to process data on suboptimal computers or fight with a completely unstandardized data collection process. In fact, I know a young data scientist who quit their first post-Uni job after less than six months because she couldn’t handle most of these problems.

However I don’t have any real research experience. I never published any paper, and feel like I always did easy stuff. Therefore, I lack confidence in the ML domain. I feel like what I’ve been doing is not complicated and I could be easily replaced.

This is a super valid concern. I am surprised how few people in our field think about it. On the one hand, most ML practitioners don’t publish papers because they are busy doing the job they are paid for. I am a big proponent of teaching as a means of professional growth. So, you can decide to teach a course in a local meetup, local college, in your workplace, or at a conference. Teaching is an excellent way to improve your communication skills, which are the best means for job security (see this post).

Since you work at XXXXX , I suggest talking to your manager and/or HR representative. I’m SURE that they will have some ideas for a research project that you can take full-time or part-time to help you grow and help your business unit. This brings me to your next question.

I feel like having a research experience/doing a PhD may be an essential part to stay relevant in the long term in the domain. Also, having an expertise in one of NLP/Computer Vision may be very valuable.

I agree. Being a Ph.D. and an Israeli (we have one of the largest Ph.D. percentages globally) makes me biased.

I got 2 offers:
- One with , to do research in NLP and Computer Vision. [...] which is focused on doing research and publishing papers [...]
- One with a very fast growing insurance startup, for a data scientist position, as a part of the founding team team. […] However, I feel it would be the continuation of my current position as a data scientist, and I would maybe miss on this research component in my career.
You can explore a third option: A Ph.D. while working at your current place of work. I know for a fact that this company allows some of their employees to pursue a Ph.D. while working. The research may or may not be connected to their day job.

I am very hesitant because
- I am not sure focusing on ML models in a research team would be a good use of my time as ML may be commoditised, and general DS may be more future-proof. Also I am concerned about my impact there.
- I am not sure that I would have such a great impact in the DS team of the startup, due to regulations in the pricing model [of that company], and the fact that business problems may be solved by outsourced tools.
These are hard questions to answer. First of all, one may see legal constraints as a “feature, not a bug,” as they force more creative thinking and novel approaches. Many business problems may indeed be solved by outsourcing, but this usually doesn’t happen in problems central to the company’s success since these problems are unique enough to not fit an off-the-shelf product. You also need to consider your personal preferences because it is hard to be good at something you hate doing.

From time to time, I give career advice. When the question or the answer is general enough, I publish them in a post like this. You may read all of these posts here.
April 11, 2021 - 4 minute read -
career data science careers blog Career advice
Interview 27: Racial discrimination and fair machine learning

March 7, 2021
I invited Dr. Charles Earl for this episode of my podcast “Job Interview” to talk about racial discrimination at the workplace and fairness in machine learning.

Dr. Charles Earl is a data scientist in Automattic, my previous place of work. Charles holds a Ph.D. in computer science, M.A. in education, M.Sc in Electrical engineering, and B.Sc in mathematics. His career covered a position of assistant professor and a wide range of hands-on, managerial, and consulting roles in the field that we like to call today “data science.”

But there is another aspect in Dr. Earl. His skin is brown. He was born to an African-American family in Atlanta, GA, in the 1960s when racial segregation was explicitly legal. I am sure that this fact affected Charles’ entire life, personal and professional.

Links
If you know Hebrew, follow my podcast Job Interview (Reayon Avoda), and This Week in the Middle East
March 7, 2021 - 1 minute read -
discrimination machine learning podcast race racial-discrimination blog
Five things I wish people knew about real-life machine learning

March 3, 2021

Deena Gergis is a data science lead at Bayer. I recently discovered Deena’s article on LinkedIn titled “Five Things I Wish I Knew About Real-Life AI.” I think that this article is a great piece of a career advice for all the current and aspiring data scientists, as well as for all the professionals who work with them. Let’ me take Deena’s headings and add my 2 cents.

One. It is all about the delivered value, not the method.

I fully agree with this one. Nobody cares whether you used a linear regression or recurrent neural network. Nobody really cares about p-values or r-squared. What people need are results, insights, or working products. Simple, right?

Two. Packaging does matter

Again, well said. The way you present your solution to your colleagues, customers, or stakeholders can determine whether your project will get more funds and resources or not.

Three. Doing the right things != doing things right.

Exactly. Citing Deena: “you might be perfectly predicting a KPI that no one cares about.” Enough said.

Four. Set realistic expectations.

Not everybody realizes that “machine learning” and “artificial intelligence” are not a synonym of “magic” but rather a form of statistics (I hope “real” statisticians won’t get mad at me here). The principle “garbage in - garbage out” holds in machine learning. Moreover, sometimes, ML systems amplify the garbage, resulting in “garbage in, tons of garbage out”.

Five. Keep humans in the loop.

Let me cite Deena again: “My customers are my partners, not just end-users.” Note that by “customers,” we don’t only mean walk-in clients, but also any internal customer, project manager, even a colleague who works on the same project. They are all partners with unique insights, domain knowledge, and experience. Use them to make your work better.

Read the original article here. Deena Gergis has several more articles on LinkedIn here. And if you know Arabic, you might want to watch Deena’s videos on YouTube here. Unfortunately, my Arabic is not good enough to understand her Egyptian accent, but I suspect that her videos are as good as her writings.

March 3, 2021 - 2 minute read -
communication data-scienc data science careers reblog blog Career advice
One of the first dataviz blogs that I used to follow is now a book. Better Posters

March 1, 2021

I started following data visualization news and opinions quite a few years ago. One of the first bloggers who were active in this area NeurDojo, by the (now) professor Zen Faulkes. On of Zen’s spin-off blogs was devoted to better posters. This poster blog is called, surprisingly enough, Better Posters. Since I’m not in academia anymore, stopped caring about posters many years ago. Today, I stumbled upon this blog and was pleasantly surprised to discover that Better Posters is still active and that it is also now a book.

March 1, 2021 - 1 minute read -
better-posters communication data visualisation Data Visualization posters blog
On startup porn

January 13, 2021

Danny Lieberman managed teams of programmers before I couldn’t read, so when Danny writes a post as bold and blunt as this, you should read it.

Click the picture to go to the full text.

Oh, if you speak Hebrew, you should listen to Danny Lieberman talking in my podcast [here].

January 13, 2021 - 1 minute read -
danny-lieberman startup startup-culture startup-porn blog
Working with the local filesystem and with S3 in the same code

January 4, 2021
As data people, we need to work with files: we use files to save and load data, models, configurations, images, and other things. When possible, I prefer working with local files because it’s fast and straightforward. However, sometimes, the production code needs to work with data stored on S3. What do we do? Until recently, you would have to rewrite multiple parts of the code. But not anymore. I created a sshalosh package that solves so many problems and spares a lot of code rewriting. Here’s how you work with it:
```
if work_with_s3:
    s3_config = {
      "s3": {
        "defaultBucket": "bucket",
        "accessKey": "ABCDEFGHIJKLMNOP",
        "accessSecret": "/accessSecretThatOnlyYouKnow"
      }
    }
    
else:
    s3_config = None
serializer = sshalosh.Serializer(s3_config)

# Done! From now on, you only need to deal with the business logic, not the house-keeping

# Load data & model
data = serializer.load_json('data.json')
model = serializer.load_pickle('model.pkl')

# Update
data = update_with_new_examples()
model.fit(data)

# Save updated objects
serializer.dump_json(data, 'data.json')
serializer.dump_pickle(model, 'model.pkl')
```
As simple as that.
The package provides the following functions.
- path_exists
- rm
- rmtree
- ls
- load_pickle, dump_pickle
- load_json, dump_json
There is also a multipurpose open function that can open a file in read, write or append mode, and returns a handler to it.

How to install? How to contribute?

The installation is very simple: pip install sshalosh-borisgorelik
and you’re done. The code lives on GitHub under http://github.com/bgbg/shalosh. You are welcome to contribute code, documentation, and bug reports.

The name is strange, isn’t it?

Well, naming is hard. In Hebrew, “shalosh” means “three”, so “sshalosh” means s3. Don’t overanalyze this. The GitHub repo doesn’t have the extra s. My bad
January 4, 2021 - 2 minute read -
code opensource python sshalosh blog
Book review. The Persuasion Slide by Richard Dooley

December 30, 2020

TL;DR Very shallow and uninformative. It could be an OK series of blog posts for complete novices, but not a book.

The Persuasion Slide by Richard Dooley was a disappointment for me. I love Dooley’s podcast Brainfluence, and I was sure that Richard’s book would full of in-depth knowledge and case studies. However, it contained neither.

The only contribution of this book is the analogy between a sale process and an amusement part slide. The theory behind the book is mostly presented as a ground truth with almost no explanation or support from research. One will gain much more knowledge and understanding by reading Kahneman’s “Thinking, Fast and Slow,” Arieli’s “Predictably irrational.” or Weisman’s “59 seconds.”

Should I read this book?

No

December 30, 2020 - 1 minute read -
book-re brainfluence dooley persuasion blog
Graphical comparison of changes in large populations with "volcano plots"

December 24, 2020

I recently rediscovered a volcano plot – a scatter plot that aims to visualize changes in large populations.

Volcano plots are very technical and specialized and, most probably, are not a good fit for explanatory data visualization. However, they can be useful during the exploration phase, and they come with a set of well-established metrics.

Moreover, if you are lucky enough to have well-behaved data, the plots look very cool

From here

Of course, in real life, the data is messy. Add bad visualization practices to the mess and you get a marvel like this one

From here

The bottom line: if you have two populations to compare, consider volcano plots. But do remember dataviz good practices.

December 24, 2020 - 1 minute read -
data visualisation Data Visualization datavis volcano-plot blog
Book review: Manager in shorts by Gal Zellermayer

December 23, 2020

TL;DR Nice’n’easy reading for novice managers

I read this book after hearing the author, Gal Zellermayer, in a podcast. Gal is an Israeli guy who has been working as a manager in several global companies’ Israeli offices. He brings a perspective that combines (what is perceived) the best practices of American managing style with the Israeli tendency to make things straight and simple.

The greater part of the book is devoted to helping the people in your team develop. The book serves as a good motivator and helps to keep the importance of “peopleware.” I wish, however, it would bring more practical advice and cite more research and external analyses.

Should you read this book?

If you are a beginning manager or want to be one - yes.

If you never read a book on management - maybe (although Peopleware might be a better read).

The bottom line: 4/5

December 23, 2020 - 1 minute read -
book review management peopleware blog
You might not love working at a distributed company if...

December 9, 2020

A couple of weeks go, I wrote a post about an unexpected hitch of working in a distributed team. Yesterday, my ex-coworker, Ann McCarthy wrote a related, more elaborative post on the same issue. It’s worth reading.

December 9, 2020 - 1 minute read -
automattic distributed work work-from-home blog
One idea per slide. It’s not that complicated

December 8, 2020

I wrote this post in 2009, I published it in March 2020, and am republishing it again

A lot of texts that talk about presentation design cite a very clear rule: each slide has to contain only one idea. Here’s a slide from a presentation deck that says just that.

And here’s the next slide in the same presentation

Can you count how many ideas there are on this slide? I see four of them.

Can we do better?

First of all, we need to remember that most of the time, the slides accompany the presenters and not replace them. This means that you don’t have to put everything you say as a slide. In our case, you can simply show the first slide and give more details orally. On the other hand, let’s face it, the presenters often use slides to remined themselves of what they want to say.

So, if you need to expand your idea, split the sub-ideas into slides.

You can add some nice illustrations to connect the information and emotion.

Making it more technical

“Yo!”, I can hear you saying, “Motivational slides are one thing, and technical presentation is a completely different thing! Also,” you continue, “We have things to do, we don’t have time searching the net for cute pics”. I hear you. So let me try improving a fairly technical slide, a slide that presents different types of machine learning.
Does slide like this look familiar to you?

First of all, the easiest solution is to split the ideas into individual slides.

It was simple, wasn’t it. The result is so much more digestible! Plus, the frequent changes of slides help your audience stay awake.

Here’s another, more graphical attempt

When I show the first slide in the deck above, I tell my audience that I am about to talk about different machine learning algorithms. Then, I switch to the next slide, talk about the first algorithm, then about the next one, and then mention the “others”. In this approach, each slide has only one idea. Notice also how the titles in these last slides are smaller than the contents. In these slides, they are used for navigation and are therefore less important. In the last slide, I got a bit crazy and added so much information that everybody understands that this information isn’t meant to be read but rather serves as an illustration. This is a risky approach, I admit, but it’s worth testing.

To sum up

“One idea per slide” means one idea per slide. The simplest way to enforce this rule is to devote one slide per a sentence. Remember, adding slides is free, the audience attention is not.

December 8, 2020 - 2 minute read -
powerpoint presentation presentation-tip technical-presentation blog
Innumeracy

December 3, 2020

Innumeracy is the “inability to deal comfortably with the fundamental notions of number and chance”.
I wish there was a better term for “innumeracy”, a term that would reflect the importance of analyzing risks, uncertainty, and chance. Unfortunately, I can’t find such a term. Nevertheless, the problem is huge. In this long post, Tom Breur reviews many important aspects of “numeracy”. I already shared this post a long time ago, but it’s worth sharing again.

https://tombreur.wordpress.com/2018/10/21/innumeracy/

December 3, 2020 - 1 minute read -
blog
Before and after — stacked bar charts

November 25, 2020

A fellow data analyst asked a question? What do we do when we need to draw a stacked bar chart that has too many colors? How do we select the colors so that they are nice but also are easily distinguishable? To answer this question, let’s look at the data similar to what appeared in the original question. I also tried to recreate the actual chart’s style

So, how do we select colors?
The answer to this question is pretty complicated. To have a set of easily distinguishable colors, one needs to model the color perception in a typical human being properly. Luckily, a tool called I Want Hue that’s based on a solid theory explained here. The problem, however, isn’t in colors.

This is not the right question

Distinguishing between eight colors in a graph is a challenging task. Selecting the right color scheme might help, but it won’t solve this fundamental problem. Moreover, stacked bar plots are tricky due to another complication.

We, the humans, are somewhat good are comparing positions but not as good at comparing sizes. This is why comparing the heights of the bars is relatively easy. It is easy because the bars start at the same line, and our task is to compare the bar end position, not the bar size. Reading the heights of the lowest segment in the bars is also an easy task for the same reason: we don’t compare the sizes but the heights.

However, comparing the sizes of the middle components is more challenging. As a result, the intermediate parts of a graph don’t add useful information but rather add noise. Thus, let us explain two options. First, we will reduce the number of groups. Next, we will explore what happens when reducing the number of groups is not an option.

Option 1. Reduce the number of categories

It is hard to advise about data visualization when I don’t know what conclusion the author wants to convey. However, I am sure that in many cases, the number of categories that are relevant to the viewer is much smaller than the number of types that are relevant to the analyst. The viewer might not care about all the hard job you did while collecting the data; what they are about is an insight. For example, if we reduce the discussion to two groups: the USA and non-USA data centers, the graph becomes much more readable.

Note how two groups in a stacked graph pose no problem in deciphering the sizes. If we take care of readability and improve the data-ink ratio, we get a nice data visualization piece.

Option 2. When reducing the number of categories is not an option

But what if reducing the number of categories is not an option? If you are absolutely sure that the audience absolutely needs to see all the information, you can split the different groups into separate subgraphs.

Have you noticed that the X-axis in our case represents time? In this case, we can replace the bars with an evolution plot and create a separate chart for each category in the data set. I took special care to keep the Y-axis scale equal between all the graphs so that the viewer can easily distinguish between data centers with a lot of errors and data centers with only a few of them. Here’s the result:

But what if the overall error rate is of greater importance than the individual groups. In that case, we can plot them in a larger graph and add the separate groups below, in smaller, un-emphasized subplots.

Summary – the Why and the What define the How

When you have a technical question about improving a graph, make sure you ask yourself “why.” Why is, does technical problems matter? Why will it improve the chart? To answer this question, you will have to ask another question: “what?”. “What is it that I want to say.” The easiest way to force yourself to ask these questions is to force yourself to add titles to every graph you create (see my how to suck less in data visualization post for more details).

Once you have your conclusion ready, you will notice that you don’t need a technical solution but rather a conceptual one. In this case, we solved the technical problem of looking for eight distinct colors by reducing the number of categories to two or splitting one elaborate graph into several straightforward ones.

So, remember, the Why and the What define the How

Python code that was used to generate all this graphs is available on [gist](https://gist.github.com/bgbg/6c645a5fc48e61b1a917c9d1d66fa72f)

November 25, 2020 - 4 minute read -
bar plot before-after data visualisation Data Visualization blog
The Problem With Slope Charts (by Nick Desbarats)

November 12, 2020

Slope charts are often suggested as a valid alternative to clustered bar charts, especially for “before and after” cases.

So, instead of a clustered bar char like this

we tend to recommend a slope chart (or slope graph) like this

However, a slope chart isn’t free of problems either. In the past, I already wrote about a case of a meaningless slopegraph [here]. Today, I stumbled upon an interesting blog post (and a video) that surveys the problems of slope chars and their alternatives

All the graphs here come from the original post by Nick Desbarats that can be found [here].

November 12, 2020 - 1 minute read -
bar plot Data Visualization slopegraph blog
Before and after: Alternatives to a radar chart (spider chart)

November 10, 2020

A radar chart (sometimes called “spider charts”) look cool but are, in fact,
pretty lame. So much so that when the data visualization author Stephen Few mentioned them in his book Show me the numbers, he did so in a chapter called “Silly graphs that are best forsaken.”

Here, I will demonstrate some of its problems, and will suggest an alternative

Before: The problems of a radar (spyder) plot

Above is my reconstruction of the original plot that I saw in a Facebook discussion. The graph looks pretty cool, I have to admit, but it is full of problems.
What are the problems of a spyder plot or a radar plot?
Let’s start with readability. Can you quickly tell the value of “Substance abuse” for the red series? Not that easy.

But a more significant problem emerges when one realizes that in most cases, the order of the categories is arbitrary and that different sorting options may result in entirely different visual pictures.

After: conclusion-based graph design

I have been continually preaching to add meaningful titles to all the graphs you are creating. (See How to suck less in data visualization and professional communication).

One of the byproducts of adding a title is the fact that when you write down your main takeaway of a graph, you force yourself to think, “does this graph show what it says it shows?” Thus, you guide yourself to better graph choices.

Let’s say that we conclude that there is no correlation between the two series of data. Is this conclusion evident from the graphs? I would say, not so much.

Instead of a radar chart, I suggest creating two aligned, horizontal graph plots. This way, we may sort one subplot according to the values, and then, correlation (or lack of thereof) will be evident.

But what if we noticed something interesting about the differences between A and B groups? If this is true, let’s show precisely this: the differences.

Notice how the bars in this version are sorted according to the difference. Sorting a bar chart is the easiest way to make it readable.

Python code that I used to create these graphs is available here https://gist.github.com/bgbg/db833db723998cd244b5049bfe01f5ac

November 10, 2020 - 2 minute read -
bar plot before-after data visualisation Data Visualization radar-chart spider-chart blog
Another language

November 5, 2020

بعد حوالي سنتين من الدراسة ، بحس حالي جاهز لإضافة اللغة العربية إلى قائمة اللغات في ال-LinkedIn

After about two years of study, I feel ready to add Arabic to LinkedIn’s language list

November 5, 2020 - 1 minute read -
acheivement arabic linkedin blog
Basic data visualization video course (in Hebrew)

October 26, 2020

I had the honor to record an introductory data visualization course for high school students as a part of the Israeli national distance learning project. The course is in Hebrew, and since it targets high schoolers, it does not require any prior knowledge.

I got paid for this job. However, when I divide the money that I received for this job by the time I spent on it, I get a ridiculously low rate. On the other hand, I enjoyed the process, and I view this as my humble donation to the public education system.
Since a government agency makes the course site, it’s UI is complete shit. For example, the site doesn’t support playlists, and the user is expected to search through the video clips by their titles. To fix that, I created a page that lists all the videos in the right order.

https://he.gorelik.net/course/

October 26, 2020 - 1 minute read -
data visualisation Data Visualization dataviz recording studio teaching blog
Text Visualization Browser

October 22, 2020

I’ve stumbled upon an exciting project – text visualization browser. It’s a web page that allows one to search for different text visualization techniques using keywords and publication time.

Text visualization browser https://textvis.lnu.se

The ability to limit the search to various years gives a nice historical perspective on this interesting topic

This site’s information is based on a 2015 paper Text visualization techniques: Taxonomy, visual survey, and community insights. I wish the authors updated it with more recent data, though.

October 22, 2020 - 1 minute read -
data visualisation Data Visualization dataviz site blog
Hands-on Data Visualization in Python

October 21, 2020

הקליקו כאן לקבלת פרטים והרשמה!

October 21, 2020 - 1 minute read -
announcement Data Visualization blog
Sharing the results of your Python code

October 20, 2020

If you work, but nobody knows about your results or cares about them, have you done any work at all?

A proverbial tree in the proverbial forest. Photo by veeterzy on Pexels.com

As a data scientist, the product of my work is usually an algorithm, an analysis, or a model. What is a good way to share these results with my clients?

Since 99% of my time, I write in Python, I fell in love with a framework called Panel (http://panel.holoviz.org/). Panel allows you to create and serve basic interactive UI around data, an analysis, or a method. It plays well with API frameworks such as FastAPI or Flask. The only problem is that to share this work. Sometimes, it is enough to run a local demo server, but if you want to share the work with someone who doesn’t sit next to you, you have to host it somewhere and to take care of access rights. For this purpose, I have a cheap cloud server ($5/month), which is more than enough for my personal needs.

If you can share the entire work publicly, some services can pick up your Jupyter notebooks from Github and interactively serve them. I know of voila and Binder)

Recently, Streamlit.io is entering this niche. It currently only allows sharing public repos, but promises to add a paid service for your private code. I’m eager to see that.

October 20, 2020 - 1 minute read -
panel python sharing streamlit blog
New notebook, new plans

October 8, 2020

This notebook is a part of my productivity system. Read more on productivity and procrastination here.

October 8, 2020 - 1 minute read -
procrastination productivity blog Productivity & Procrastination
The information is beautiful. The graphs are shit!

October 1, 2020

I apologize for my harsh language, but recently I was exposed to a bunch of graphs on the “information is beautiful” site, and I was offended (well, ot really, but let’s pretend I was). I mean, I’m a liberal person, and I don’t care what graphs people do in their own time. Many people visit that site because they try to learn good visualization practices, but some charts on that site are wrong. Very wrong.

Here’s the gem:

I deliberately don’t share the link to this site. I don’t want let Google think it’s valuable in any way.

Now, the geniuses from “Information is beautiful” (let’s call them IB for brevety) wanted to share with us some positive stats. How nice of them. So what they did? They gathered together nine pairs of metrics collected at two different time points: one in the past and one furthermore in history. They used nice colors to create some sleeky shapes. So, what’s the problem? What’s wrong with that?

Everything is wrong!

Let’s start from my guess that they cherry-picked the stats with “positive” changes. Secondly, the comparison of this sort is mostly meaningless if we compare points at different years. What stopped the authors of that tasteless “infographic” from collecting data from the same years? I guess, their laziness. That’s how we ended up comparing the number of death penalties in 1990 and 2016, but the malaria deaths numbers are for 2000 and 2016, and dying mothers are compared for years 2000 and 2017?

Now, let’s talk about data viz.

Take a look at this graph.

The only time we use shapes like that is when we want to convey information about uncertainty. To do that, the X-axis represents the thing we are measuring, and the Y-axis represents our certainty about the current value. When we compare to uncertain measurements, we may judge the difference between these measurements by the distance between the curve peaks, and the width of the curve represents the uncertainty.

Here’s a good example from [this link]:

Can you see how the metric of interest is on the X-axis? The width of each bell curve represents the uncertainty and the difference between any pair of cases is the difference on the horizontal (X) axis, not the vertical one.

Instead, what do the IB authors did? They obviously like sleek looking shapes but know nothing about how to use them. They could have used two bars and let the viewer compare their heights. But nooooo! Bars are not c3wl! Bars are boring! Instead, they took probability density curves (that’s how they are technically called) and made them pretend to be bars.

Bars. Is this THAT hard?

I can hear some of you saying, “Stop being so purist! What’s wrong with comparing the heights of bell curves?” I’ll tell you what’s wrong! Data visualization is a language. As with any language, it has some rules and traditions. If you hear me saying, “me go home,” you will understand me without any problem. However, you will silently judge me for my poor use of the English language. I know that, and since English is my third language, I use all the help to make as few mistakes as possible. The same is correct with data visualization. Please respect its rules and traditions, even if (and especially if) are not fluent in it.

I never write more than two sentences in English without Grammarly

Visit the worst practice tag in this blog to see more bad examples

October 1, 2020 - 3 minute read -
data visualisation Data Visualization dataviz blog
Older posts Newer posts

A new phase in my professional life

Black lives matter. Lior Pachter

Super useful videos for advanced data visualizers

Career advise. Upgrading data science career

Interview 27: Racial discrimination and fair machine learning

Five things I wish people knew about real-life machine learning

One. It is all about the delivered value, not the method.

Two. Packaging does matter

Three. Doing the right things != doing things right.

Four. Set realistic expectations.

Five. Keep humans in the loop.

One of the first dataviz blogs that I used to follow is now a book. Better Posters

On startup porn

Working with the local filesystem and with S3 in the same code

How to install? How to contribute?

The name is strange, isn’t it?

Book review. The Persuasion Slide by Richard Dooley

Should I read this book?

Graphical comparison of changes in large populations with "volcano plots"

Book review: Manager in shorts by Gal Zellermayer

Should you read this book?

You might not love working at a distributed company if...

One idea per slide. It’s not that complicated

Can we do better?

Making it more technical

To sum up

Innumeracy

Before and after — stacked bar charts

This is not the right question

Option 1. Reduce the number of categories

Option 2. When reducing the number of categories is not an option

Summary – the Why and the What define the How

The Problem With Slope Charts (by Nick Desbarats)

Before and after: Alternatives to a radar chart (spider chart)

Before: The problems of a radar (spyder) plot

After: conclusion-based graph design

Another language

Basic data visualization video course (in Hebrew)

Text Visualization Browser

Hands-on Data Visualization in Python

Sharing the results of your Python code

New notebook, new plans

The information is beautiful. The graphs are shit!

Everything is wrong!

Now, let’s talk about data viz.