Modern tools make your skills obsolete. So what?

Read this if you are a data scientist (or another professional) worried about your career.

So many people, including me, write about how fields such as copywriting, drawing, or data science are changing from being accessible only to a niche of highly skilled professionals into a mere commodity. I claim it’s a good thing, not only for humankind but also for the individual professional. Since I know nothing about drawing, I’ll talk about data science.

I started working as a data scientist a long time ago, even before the term data science was coined. Back then, my data science job included:

  • writing code that implements this optimization algorithm or the other
  • writing code that implements this statistical analysis or the other
  • writing code that implements this machine learning technique or the other
  • writing code that implements this quality metric or the other
  • writing code that handles named columns
  • writing code that deals with parallelization, caching, fetching data from the internet

Back then, around the time the term data scientist was coined, I used to say “data is data”. I claimed that it didn’t matter whether you wrote a model that detects cancer or one that detects online fraud, a model that simulates two molecules in a solution or one that simulates players in the electric appliances market. Data was data, and my job, as a data scientist, was to crunch it.

Time passed. I discovered one cool library, then another, and a third one. Suddenly, my job was to connect these libraries, which allowed me to be more expressive in what I could achieve. It also allowed me to concentrate better on “business logic.” Business logic is the term I use for all the knowledge required for the organization that pays your salary to keep doing so. If you work for a gaming company, “business logic” is gaming psychology, the competitor landscape, growth methods, and network effects. If you work for a biotech company, “business logic” is a deep understanding of disease mechanisms, biochemistry, genetics, or whatever else is needed to make the breakthrough. The fact that I no longer needed to deal with “low-level coding” made some of my skills obsolete and pushed me toward becoming more specialized.

These days, we are facing a new era in knowledge commoditization. This commoditization makes our skills obsolete but also makes us more efficient in tasks that we were slow at and lets us develop new skills. 

In 2017, Gartner predicted that more than 40% of data science tasks would be obsolete by 2020. Today, in 2023, I can safely say that they were right. I can also say that today, despite the recent layoffs, there are many more busy data scientists than there were in 2017 or 2020.

The bottom line: stop worrying.

Let me cite myself from 2017:

Data scientists won’t disappear as an occupation. They will be more specialized.

I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by data scientists will shift to business developers, programmers, accountants, and other domain owners who will learn another skill: operating with numbers using ready-to-use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

This is another piece of career advice. I have more of them on my blog.

Graphing Highly Skewed Data – Tom Hopper

My colleague, Charles Earl, pointed me to this interesting 2010 post that explores different ways to visualize categories of drastically different sizes.

The post author, Tom Hopper, experiments with different ways to deal with “Data Giraffes”. Some of his experiments are really interesting (such as splitting the graph area). In one experiment, Tom Hopper draws a bar chart on a log scale. Doing so is considered bad practice: a bar chart’s value (Y) axis must include a meaningful zero, which a log scale cannot have by definition.
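To illustrate that point, here is a minimal sketch of my own (it is not code from Hopper’s post, and the categories and counts are made up). It puts a highly skewed set of counts on a linear-scale bar chart next to a dot plot on a log scale; a dot plot encodes values by position rather than by bar length, so the missing zero is no longer a problem.

```python
# A sketch of the "Data Giraffe" problem with hypothetical numbers:
# one huge category dwarfs the rest on a linear bar chart, while a
# log-scale dot plot keeps every category readable.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D", "E"]   # hypothetical categories
values = [12, 45, 230, 1800, 95000]      # hypothetical, highly skewed counts

fig, (ax_bar, ax_dot) = plt.subplots(1, 2, figsize=(9, 3))

# Linear-scale bar chart: the "giraffe" (E) hides the differences among A-D.
ax_bar.bar(categories, values)
ax_bar.set_title("Bars, linear scale")

# Dot plot on a log scale: position, not length, encodes the value,
# so the lack of a meaningful zero does not distort anything.
ax_dot.scatter(values, categories)
ax_dot.set_xscale("log")
ax_dot.set_title("Dots, log scale")

fig.tight_layout()
plt.show()
```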

Other than that, it is a good read: Graphing Highly Skewed Data – Tom Hopper

Evolution of a complex graph. Part 1. What do you want to say?

From time to time, people ask me for help with non-trivial data visualization tasks. A couple of weeks ago, a friend-of-a-friend-of-a-friend showed me a set of graphs with the following note:

Each row is a different use case. Each use case was tested on three separate occasions – columns 1,2,3. We hope to show that the lines in each row behave similarly, but that there are differences between the different rows.

Before looking at the graphs, note the last sentence in the above comment. Knowing what you want to show is an essential and non-trivial part of any data visualization task. Specifying precisely what you want to say is the first required step in any communication attempt, technical or not.

For obvious reasons, I cannot share the original graphs that person gave me. I managed to re-create the spirit of those graphs using a combination of randomly generated arrays.
The original graph: A 3-by-4 panel of line charts
Notice how the X- and Y- axes are aligned between all the subplots. Such alignment is a smart move that provides a shared scale and allows faster and more natural comparison between the curves. You should always try aligning your axes. If aligning isn’t possible, make sure that it is absolutely, 100%, clear that the scales are different. Slight differences are very confusing.
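Here is a minimal sketch of how such a panel can be recreated with random data. The decay-curve shape, the noise level, and the variable name are my assumptions, not the original data; the point is that sharex=True and sharey=True are what give all subplots the common scale discussed above.

```python
# A sketch (with made-up data) of the original panel: four use cases
# (rows), three repetitions each (columns), shared axes everywhere.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = np.arange(0, 366)
n_groups, n_repeats = 4, 3

fig, axes = plt.subplots(n_groups, n_repeats, sharex=True, sharey=True,
                         figsize=(9, 8))
for row in range(n_groups):
    for col in range(n_repeats):
        # Each group gets its own decay rate; repetitions only add noise.
        decay = 0.002 * (row + 1)
        curve = np.exp(-decay * days) + rng.normal(0, 0.02, days.size)
        axes[row, col].plot(days, curve, label="event_prob")
        axes[row, col].legend()   # identical legends in every subplot, removed later
plt.show()
```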

There are several small things that we can do to improve this graph. First, the identical legends in every subplot are a useless waste of ink and, thus, of your viewers’ processing power. Since they are identical, these legends do nothing but distract the viewer. Moreover, while I understand how a variable name such as event_prob appeared on a graph, showing such names outside technical teams is a bad practice. People who don’t have intimate knowledge of the underlying data will find human-readable labels easier to comprehend, making your message “stickier.”
Let’s improve the signal-to-noise ratio of this plot.
An improved version of the 3-by-4 grid of line charts

According to our task, each row is a different use case. Notice that I accompanied each row with a human-readable label. I didn’t use cryptic codes such as group_001, age_0_10, or the like.
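Continuing the sketch above, the cleanup amounts to dropping the per-subplot legends and writing a human-readable name on each row. The group names below are, of course, invented for illustration.

```python
# Continuation of the previous sketch: remove the redundant legends and
# label each row with a human-readable (and hypothetical) group name.
row_labels = ["New users", "Returning users",
              "Trial accounts", "Enterprise accounts"]

for row in range(n_groups):
    axes[row, 0].set_ylabel(row_labels[row])   # one label per row
    for col in range(n_repeats):
        legend = axes[row, col].get_legend()
        if legend is not None:
            legend.remove()                    # identical legends add no information
```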
Now, let’s go back to the task specification. “We hope to show that the lines in each row behave similarly, but that there are differences between the separate rows.” Remember my advice to always use conclusions as graph titles? Let’s test how such a title would look.

A hypothetical screenshot. The title says: "low intra- & high inter- group variability"

Really? Is there a better way to justify the title? I claim that there is.

Let’s experiment a little bit. What will happen if we plot all the lines on the same graph? By doing so, we might create a stronger emphasis on the similarities and the differences.

Overlapping lines that show several repetitions in four different groups
Not bad. The separate lines create some excessive noise, and the legend isn’t the best way to label multiple lines, so let’s improve the graph even further.
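Here is a sketch of that overlay, again with random data standing in for the real curves: every repetition of every group goes on a single set of axes, with one color per group so that group membership, not repetition identity, is what stands out.

```python
# A sketch of the overlay version: every repetition of every group on one
# axes, one color per group (made-up data; the real curves would differ).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = np.arange(0, 366)

fig, ax = plt.subplots(figsize=(8, 4))
for group in range(4):
    decay = 0.002 * (group + 1)
    color = plt.cm.tab10.colors[group]
    for repeat in range(3):
        curve = np.exp(-decay * days) + rng.normal(0, 0.02, days.size)
        # Thin, semi-transparent lines keep individual repetitions visible
        # without letting any single line dominate the picture.
        ax.plot(days, curve, color=color, alpha=0.6, linewidth=1)
plt.show()
```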

Curves representing four different data groups. Shaded areas represent within-group variability

Note the meaningful ticks on the X-axis. The 30-, 180-, and 365-day marks provide useful anchors.
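A sketch of how this version can be built (the exact band and the group names are my assumptions about what the figure shows): plot each group’s mean across its repetitions, shade one standard deviation around it with fill_between, put the label directly on the curve instead of in a legend, and keep only the 30-, 180-, and 365-day ticks.

```python
# A sketch of the final form: per-group mean curve, a shaded band for the
# spread across repetitions, direct labels instead of a legend, and only
# a few meaningful tick marks on the X-axis.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = np.arange(0, 366)
labels = ["New users", "Returning users",
          "Trial accounts", "Enterprise accounts"]   # hypothetical group names

fig, ax = plt.subplots(figsize=(8, 4))
for group, label in enumerate(labels):
    decay = 0.002 * (group + 1)
    repeats = np.stack([np.exp(-decay * days) + rng.normal(0, 0.02, days.size)
                        for _ in range(3)])
    mean, spread = repeats.mean(axis=0), repeats.std(axis=0)
    color = plt.cm.tab10.colors[group]
    ax.plot(days, mean, color=color)
    ax.fill_between(days, mean - spread, mean + spread, color=color, alpha=0.3)
    ax.text(days[-1], mean[-1], " " + label, color=color, va="center")  # direct label

ax.set_xticks([30, 180, 365])   # a month, half a year, a full year
plt.show()
```

Direct labels placed at the ends of the curves do the legend’s job without forcing the viewer’s eyes to jump back and forth between the lines and a separate box.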

Now, let us go back to our title. “Low intra- and high inter- group variability” is, in fact, two conclusions. If you have ever read any text about technical presentations, you should remember the “one point per slide” rule. How do we solve this problem? In cases like these, I like to use the same graph in two different slides, one for each conclusion.

Screenshot showing two slides. The first one is titled "Low within-group variability". The second one is titled "High between-group variability". The graph in both slides is the same

During a presentation, I would show this graph with the first conclusion as its title and talk about the implications of that conclusion. Next, I would say, “wait! There is more,” advance to the next slide, and start talking about the second conclusion.
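In practice, this can be as simple as exporting the same figure twice with different titles. Continuing the sketch above (the file names are arbitrary):

```python
# Continuation of the previous sketch: one figure, two exported slides,
# each carrying exactly one conclusion as its title.
for title, filename in [("Low within-group variability", "slide_1.png"),
                        ("High between-group variability", "slide_2.png")]:
    ax.set_title(title, loc="left")
    fig.savefig(filename, dpi=200, bbox_inches="tight")
```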

To sum up,

First, decide what it is that you want to say. Then ask whether your graph says what you want to say. Next, emphasize what you want to say, and finally, say what you want to say.

To be continued

The case that you see in this post is a relatively easy one because it only compares four groups. What will happen if you need to compare six, sixteen, or sixty groups? I will try to answer this question in one of my next posts.