Prompt engineers, the sexiest job of the third decade of the 21st century (?), or Don’t study prompt engineering as a career move, you’ll waste your time

Do you recall when data scientists were the talk of the town? Dubbed the sexiest job of the 21st century, they boasted a unique blend of knowledge and skills. I still remember the excitement I felt when I realized that the work I did had a name, and the warm feeling I got when I saw those cool Venn diagrams showing just how awesome data scientists were. Well, it’s time for data scientists to step aside and make way for the new heroes in town: the Prompt Engineers!

The demand for prompt engineers is soaring, and it seems like everyone is trying to become one. But what exactly is a prompt engineer, and what are my thoughts on this new profession?

Let’s take a step back in time: we started with assembly languages, and then a language called Formula Translator (better known as Fortran), which significantly lowered the barrier of entry into the field. I’m sure back then, people rolled their eyes and said that with the emergence of high-level programming languages, anyone could now take any formula and get an output, without understanding how semiconductors worked.

Fast forward to today. What do prompt engineers do? They essentially translate their domain knowledge, language understanding, and AI algorithm expertise into computer output (sounds like “ForTran,” right?). Prompt engineering is, in essence, a super-high-level programming language. Over time, I believe we’ll see dedicated tools and established standards emerge. But for now, it’s a wild, untamed frontier.

In 2017, I wrote a blog post titled “Don’t study data science as a career move; you’ll waste your time!”. To this day, it is the most-read post on my blog. Now, it’s time for a new warning: “Don’t study prompt engineering as a career move; you’ll waste your time!”

Meanwhile, here’s a nice Venn diagram for you 🙂

Modern tools make your skills obsolete. So what?

Read this if you are a data scientist (or another professional) worried about your career.

So many people, including me, write about how fields such as copywriting, drawing, or data science change from being accessible to a niche of highly professional individuals to a mere commodity. I claim it’s a good thing, not only for humankind but for the individual professional. Since I know nothing about drawing, I’ll talk about data science.

I started working as a data scientist a long time ago, even before the term data science was coined. Back then, my data science job included:

  • writing code that implements this optimization algorithm or the other
  • writing code that implements this statistical analysis or the other
  • writing code that implements this machine learning technique or the other
  • writing code that implements this quality metric or the other
  • writing code that handles named columns
  • writing code that deals with parallelization, caching, fetching data from the internet

Back then, exactly when the term data scientist was coined, I used to say “data is data”. I claimed that it didn’t matter whether you write a model that detects cancer or detects online fraud, a model that simulates two molecules in a solution or a model that simulates players in the electric appliances market. Data was data, and my job, as a data scientist, was to crunch it.

Time passed by. Suddenly, I discovered one cool library, the other, and a third one … Suddenly, my job was to connect these libraries, which allowed me to be more expressive in what I could achieve. It also allowed me to concentrate better on “business logic.” Business logic is the term I use to describe all the knowledge required for the organization that pays your salary to keep doing so. If you work for a gaming company, “business logic” is the gaming psychology, competitor landscape, growth methods, and network effect. If you work for a biotech company, “business logic” is the deep understanding of disease mechanisms, biochemistry, genetics, or whatever is needed to perform the breakthrough. The fact that I don’t need to deal with “low-level coding” made me obsolete and drove me to a state where I became more specialized.

These days, we are facing a new era in knowledge commoditization. This commoditization makes our skills obsolete but also makes us more efficient in tasks that we were slow at and lets us develop new skills. 

In 2017, Gartner predicted that more than 40% of data science tasks would be obsolete by 2020. Today, in 2023, I can safely say that they were right. I can also say that today, despite the recent layoffs, there are many more busy data scientists than there were in 2017 or 2020.

The bottom line. Stop worrying.

Let me cite myself from 2017:

Data scientists won’t disappear as an occupation. They will be more specialized.

I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by the data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill — operating with numbers using ready to use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

This is another piece of career advice. I have more of them in my blog

14-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar. It starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles)**. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the period between the first and the last Sukkot days is mostly considered half working days. Also, the children are at home since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) over a period of several years.
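For the curious, here is a minimal sketch of how such a count could be done. It is not the code I used for the analysis: the function name, the toy window, and the dates are all made up for illustration and are not real holiday dates. The only assumptions taken from the text are that Friday and Saturday are the weekend and that half working days count as 0.5.

```python
from datetime import date, timedelta

# Friday and Saturday, as numbered by date.weekday() (Monday is 0)
WEEKEND = {4, 5}


def count_working_days(start, n_days, rest_days, half_days):
    """Count working days in a window, treating half days as 0.5."""
    total = 0.0
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        if day.weekday() in WEEKEND or day in rest_days:
            continue  # weekend, holiday, or holiday eve
        total += 0.5 if day in half_days else 1.0
    return total


# Toy two-week window with made-up dates (NOT real holiday dates)
start = date(2024, 1, 1)  # a Monday
rest_days = {date(2024, 1, 2), date(2024, 1, 3)}
half_days = {date(2024, 1, 8), date(2024, 1, 9)}
working_days = count_working_days(start, 14, rest_days, half_days)
```

To reproduce the real numbers, you would pass the actual holiday, holiday-eve, and half-day dates for the year in question.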

Overall, this period contains between 14 and 17 working days in a single month (31 days, mind you). This year, we only have 14 working days during the Tishrei holiday period. This is what the working/not-working time during this month looks like:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted workday, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.

(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

New position, new challenge

I will skip the usual “I’m thrilled and excited…”. I’ll just say it.
As of today, I am the CTO of wizer.me, a platform for teachers and educators to create and share interactive worksheets.

On a scale of 1 to 10, how thrilled am I? 10
On a scale of 1 to 10, how terrified am I? 10
On a scale of 1 to 10, how confident am I that wizer.me will become the “next big thing” and the most significant chapter in my career? You won’t believe me, but also 10.

Back to in-person presentations

Today, I gave my first in-person presentation since the pandemic. It was awesome! I was talking about the study I performed with Nabeel Sulieman about data visualization in environments that use right-to-left writing systems.

I wrote about this study in the past [one, two]. Today, you may find the results of our study at http://direction-matters.com/. I hope to be able to publish the video recording of this presentation really soon.

An example of a very bad graph

Nature Medicine is a peer-reviewed journal that belongs to the very prestigious Nature group. Today, I was reading a paper that included THIS GEM.

These two graphs are so bad. It looks as if the authors had a target to squeeze as many data visualization mistakes as possible in a single piece of graphics.

Let’s take a look at the problems.

  • Double Y axes. Don’t! Double axes are bad in 99% of cases (exceptions do exist, but they are rare).
  • Two subgraphs that are meant to work together have different category orders and different Y-axis scales. These differences make the comparison much harder.
  • Inverted Y scale in a bar chart. Wow! This is very strange. Bizarre! It took me a while to spot this. First, I tried to understand why the line of P<0.05 (the magic value of statistics) is above 0.1. Then, I realized that the right Y-axis is reversed. At first, I thought, “WTF?!” but then I understood why the authors made this decision. You see, according to the widespread statistical ritual, the lower the “P-value” is, the more significant it is considered. The value of 1 is deemed not significant at all, and the value of 0 is considered “as significant as one can have.” So, in theory, the authors could have renamed the axis to “Significance” and reversed the numbers. Still, the result would not be a real “significance,” nor would the name be intuitive to anyone familiar with statistical analysis. On the other hand, they really wanted more “significant” values to be bigger than less significant ones. So, what the heck? Let’s invert the scale! Well, no, this is not a good idea.
  • Slanted category labels. This might be a matter of taste, but I dislike rotated and slanted labels. Turning the graph solves the need for label rotation, thus making it more readable and having zero drawbacks.

What can be done?

I don’t like criticism without improvement suggestions. Let’s see what I would have done with this graph. To make this decision, I first need to decide what I want to show. According to my understanding of the paper, the authors wish to show that the two data sets are very different in determining a specific outcome. To show that, we don’t need to depict both the P-value and variance (mainly since these two values are very much correlated). Thus, I will show only one metric. I will stick with the P-value.

I will keep the category order the same between the two subgraphs. Doing so will create a “table lens” effect; it will show the individual values while demonstrating the lack of correlations between the two groups. Finally, I will convert the bars into points, primarily to improve the data-ink ratio. Two additional arguments against bar charts, in this case, are the facts that the P-values of a statistical test cannot possibly be zero and that bar charts don’t allow log-scale, in case we’ll want to use it.

The result should look like this sketch.

Opening a new notebook in my productivity system

Those who know me know that I always carry with me a cheap, thin notebook, which I use as an extension of my mind. Today, I opened a new notebook, and this is a good opportunity to share some links about my productivity system.

  • Start with the post “The best productivity system I know”
  • Failed attempt with tangible boards is here. This approach has an interesting idea behind it, but I couldn’t stick with it. YMMV
  • Failed attempt with digital/analog/tangible combo is here.

Another example of the power of data visualization

I stumbled upon a great graph that tells a complex story compellingly.

Comparison of two COVID-19 waves in the UK, taken from here.

This graph compares the last two waves of COVID-19 in the United Kingdom and shows clearly that the new wave (which is supposedly composed of the Delta variant) is much more infectious on the one hand but, on the other hand, causes much less damage. Whether the more moderate damage is the result of the nature of the Delta variant or of the protective effect of vaccination is still an open question, but the difference is striking.

A new phase in my professional life

I’m excited to announce that I’m joining MyBiotics Pharma Ltd as the company’s Head of Data and Bioinformatics. I have been working with this fantastic company and its remarkable people as a freelancer for fourteen fruitful months. But today, I join the MyBiotics family as a full-time member. Together, we will strive to better understand the interactions between humans and their microbiome to improve health and well-being.

Super useful videos for advanced data visualizers

The great Robert Kosara, also known as “eager eyes,” has started publishing a series of videos he calls Chart Appreciation. In these videos, Robert takes a piece of data visualization from a reputable and known source and discusses why this particular piece is so good, what decisions made it possible, what the alternatives are, and more. If you consider yourself an intermediate or advanced practitioner of data visualization, you should subscribe. Here’s one example.

Career advice. Upgrading data science career

Photo by Kelly Lacy on Pexels.com

From time to time, people send me emails asking for career advice. Here’s one recent exchange.

Hi Boris,

I am currently trying to decide on a career move and would like to ask for your advice.

I have a MSc from a leading university in ML, without thesis.

I have 5 years of experience in data science at <XXX Multinational Company> , producing ML based pipelines for the products. I have experience with Big Data (Spark, …), ML, deploying models to production…

However, I feel that I missed doing real ML complicated stuff. Most of the work I did was to build pipelines, training simple models, do some basic feature engineering… and it worked good enough.

Well, this IS the real ML job for 91.4%* of data scientists. You were lucky to work in a company that has access to data and has teams dedicated to keeping the data flowing, neat, and organized. You worked in a company with good work ethics, surrounded by smart people, and, I guess, the computational power was never a big issue. Most of the data scientists that I know don’t have all these perks. Some have to work alone; others need to solve “dull” engineering problems, find ways to process data on suboptimal computers, or fight with a completely unstandardized data collection process. In fact, I know a young data scientist who quit her first post-Uni job after less than six months because she couldn’t handle most of these problems.

However I don’t have any real research experience. I never published any paper, and feel like I always did easy stuff. Therefore, I lack confidence in the ML domain. I feel like what I’ve been doing is not complicated and I could be easily replaced.

This is a super valid concern. I am surprised how few people in our field think about it. On the one hand, most ML practitioners don’t publish papers because they are busy doing the job they are paid for. I am a big proponent of teaching as a means of professional growth. So, you can decide to teach a course in a local meetup, local college, in your workplace, or at a conference. Teaching is an excellent way to improve your communication skills, which are the best means for job security (see this post).

Since you work at XXXXX , I suggest talking to your manager and/or HR representative. I’m SURE that they will have some ideas for a research project that you can take full-time or part-time to help you grow and help your business unit. This brings me to your next question.

I feel like having a research experience/doing a PhD may be an essential part to stay relevant in the long term in the domain. Also, having an expertise in one of NLP/Computer Vision may be very valuable.

I agree. Being a Ph.D. and an Israeli (we have one of the largest Ph.D. percentages globally) makes me biased.

I got 2 offers:

– One with <YYY Multinational company> , to do research in NLP and Computer Vision. […] which is focused on doing research and publishing papers […]

– One with a very fast growing insurance startup, for a data scientist position, as a part of the founding team. […] However, I feel it would be the continuation of my current position as a data scientist, and I would maybe miss on this research component in my career.

You can explore a third option: A Ph.D. while working at your current place of work. I know for a fact that this company allows some of their employees to pursue a Ph.D. while working. The research may or may not be connected to their day job.

I am very hesitant because

– I am not sure focusing on ML models in a research team would be a good use of my time as ML may be commoditised, and general DS may be more future-proof. Also I am concerned about my impact there.

– I am not sure that I would have such a great impact in the DS team of the startup, due to regulations in the pricing model [of that company], and the fact that business problems may be solved by outsourced tools.

These are hard questions to answer. First of all, one may see legal constraints as a “feature, not a bug,” as they force more creative thinking and novel approaches. Many business problems may indeed be solved by outsourcing, but this usually doesn’t happen in problems central to the company’s success since these problems are unique enough to not fit an off-the-shelf product. You also need to consider your personal preferences because it is hard to be good at something you hate doing.

From time to time, I give career advice. When the question or the answer is general enough, I publish them in a post like this. You may read all of these posts here.

Interview 27: Racial discrimination and fair machine learning

I invited Dr. Charles Earl for this episode of my podcast “Job Interview” to talk about racial discrimination at the workplace and fairness in machine learning.

Dr. Charles Earl is a data scientist at Automattic, my previous place of work. Charles holds a Ph.D. in computer science, an M.A. in education, an M.Sc. in electrical engineering, and a B.Sc. in mathematics. His career has covered a position of assistant professor and a wide range of hands-on, managerial, and consulting roles in the field that we like to call today “data science.”

But there is another aspect to Dr. Earl. His skin is brown. He was born to an African-American family in Atlanta, GA, in the 1960s when racial segregation was explicitly legal. I am sure that this fact affected Charles’ entire life, personal and professional.

Links

If you know Hebrew, follow my podcast Job Interview (Reayon Avoda), and This Week in the Middle East

One of the first dataviz blogs that I used to follow is now a book. Better Posters

I started following data visualization news and opinions quite a few years ago. One of the first blogs that were active in this area was NeuroDojo, by the (now) professor Zen Faulkes. One of Zen’s spin-off blogs was devoted to better posters. This poster blog is called, surprisingly enough, Better Posters. Since I’m not in academia anymore, I stopped caring about posters many years ago. Today, I stumbled upon this blog and was pleasantly surprised to discover that Better Posters is still active and that it is also now a book.

Working with the local filesystem and with S3 in the same code

Photo by Ekrulila on Pexels.com

As data people, we need to work with files: we use files to save and load data, models, configurations, images, and other things. When possible, I prefer working with local files because it’s fast and straightforward. However, sometimes, the production code needs to work with data stored on S3. What do we do? Until recently, you would have to rewrite multiple parts of the code. But not anymore. I created a sshalosh package that solves so many problems and spares a lot of code rewriting. Here’s how you work with it:

import sshalosh

work_with_s3 = True  # set to False to work with the local filesystem

if work_with_s3:
    s3_config = {
      "s3": {
        "defaultBucket": "bucket",
        "accessKey": "ABCDEFGHIJKLMNOP",
        "accessSecret": "/accessSecretThatOnlyYouKnow"
      }
    }
else:
    s3_config = None
serializer = sshalosh.Serializer(s3_config)

# Done! From now on, you only need to deal with the business logic, not the house-keeping

# Load data & model
data = serializer.load_json('data.json')
model = serializer.load_pickle('model.pkl')

# Update
data = update_with_new_examples()
model.fit(data)

# Save updated objects
serializer.dump_json(data, 'data.json')
serializer.dump_pickle(model, 'model.pkl')

As simple as that.
The package provides the following functions.

  • path_exists
  • rm
  • rmtree
  • ls
  • load_pickle, dump_pickle
  • load_json, dump_json

There is also a multipurpose open function that can open a file in read, write or append mode, and returns a handler to it.
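If you wonder why a single config object is enough to switch between storage types, here is a toy sketch of the general idea: one interface, two interchangeable backends. This is not how sshalosh is implemented. Every name below (ToySerializer, LocalBackend, InMemoryBackend) is hypothetical, and the in-memory backend merely stands in for a remote store such as S3 so that the example runs without credentials.

```python
import json
import os
import tempfile


class LocalBackend:
    """Reads and writes text files on the local filesystem."""

    def read_text(self, path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    def write_text(self, path, text):
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


class InMemoryBackend:
    """Hypothetical stand-in for a remote object store; keeps text in a dict."""

    def __init__(self):
        self._objects = {}

    def read_text(self, path):
        return self._objects[path]

    def write_text(self, path, text):
        self._objects[path] = text


class ToySerializer:
    """Picks a backend once; the calling code never needs to know which."""

    def __init__(self, remote_config=None):
        self._backend = InMemoryBackend() if remote_config else LocalBackend()

    def dump_json(self, obj, path):
        self._backend.write_text(path, json.dumps(obj))

    def load_json(self, path):
        return json.loads(self._backend.read_text(path))


# The same two calls work against both backends
with tempfile.TemporaryDirectory() as tmp:
    local = ToySerializer(None)
    local_path = os.path.join(tmp, "data.json")
    local.dump_json({"rows": 3}, local_path)
    local_result = local.load_json(local_path)

remote = ToySerializer({"placeholder": True})
remote.dump_json({"rows": 3}, "data.json")
remote_result = remote.load_json("data.json")
```

The point of the design is that the business logic only ever talks to the serializer, so moving from local files to remote storage is a configuration change and not a code rewrite.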

How to install? How to contribute?

The installation is very simple: pip install sshalosh-borisgorelik
and you’re done. The code lives on GitHub under http://github.com/bgbg/shalosh. You are welcome to contribute code, documentation, and bug reports.

The name is strange, isn’t it?

Well, naming is hard. In Hebrew, “shalosh” means “three”, so “sshalosh” means s3. Don’t overanalyze this. The GitHub repo doesn’t have the extra s. My bad.

Book review. The Persuasion Slide by Roger Dooley

TL;DR Very shallow and uninformative. It could be an OK series of blog posts for complete novices, but not a book.

The Persuasion Slide by Roger Dooley was a disappointment for me. I love Dooley’s podcast Brainfluence, and I was sure that Roger’s book would be full of in-depth knowledge and case studies. However, it contained neither.

The only contribution of this book is the analogy between a sales process and an amusement park slide. The theory behind the book is mostly presented as a ground truth with almost no explanation or support from research. One will gain much more knowledge and understanding by reading Kahneman’s “Thinking, Fast and Slow,” Ariely’s “Predictably Irrational,” or Wiseman’s “59 Seconds.”

Should I read this book?

No

Graphical comparison of changes in large populations with “volcano plots”

I recently rediscovered the volcano plot, a scatter plot that aims to visualize changes in large populations.

Volcano plots are very technical and specialized and, most probably, are not a good fit for explanatory data visualization. However, they can be useful during the exploration phase, and they come with a set of well-established metrics.
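In case you have never met one: conventionally, each point in a volcano plot is one feature (a gene, for example), the X-axis shows the log2 fold change between two conditions, and the Y-axis shows the negative log10 of the P-value. Below is a minimal sketch of computing these coordinates. The function names and the measurements are made up, and the thresholds are common defaults that I chose for illustration, not values taken from any of the plots in this post.

```python
import math


def volcano_coordinates(mean_a, mean_b, p_value):
    """Return (x, y) for one feature: log2 fold change and -log10(p)."""
    return math.log2(mean_b / mean_a), -math.log10(p_value)


def is_highlighted(x, y, min_abs_log2_fc=1.0, max_p=0.05):
    """A point stands out when the change is both large and significant."""
    return abs(x) >= min_abs_log2_fc and y >= -math.log10(max_p)


# Made-up measurements: (name, mean in condition A, mean in condition B, p-value)
features = [
    ("gene_1", 10.0, 40.0, 0.001),
    ("gene_2", 10.0, 11.0, 0.5),
    ("gene_3", 32.0, 8.0, 0.01),
]

points = {name: volcano_coordinates(a, b, p) for name, a, b, p in features}
highlighted = [name for name, (x, y) in points.items() if is_highlighted(x, y)]
```

Large changes with small P-values end up in the upper left and upper right corners, which is what gives the plot its eruption-like shape.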

Moreover, if you are lucky enough to have well-behaved data, the plots look very cool.

Visualization of RNA-Seq results with Volcano Plot
From here

Of course, in real life, the data is messy. Add bad visualization practices to the mess, and you get a marvel like this one.

From here

The bottom line: if you have two populations to compare, consider volcano plots. But do remember dataviz good practices.

Book review: Manager in shorts by Gal Zellermayer

TL;DR Nice’n’easy reading for novice managers

I read this book after hearing the author, Gal Zellermayer, in a podcast. Gal is an Israeli guy who has been working as a manager in several global companies’ Israeli offices. He brings a perspective that combines (what are perceived as) the best practices of the American managing style with the Israeli tendency to make things straight and simple.

The greater part of the book is devoted to helping the people in your team develop. The book serves as a good motivator and helps to keep the importance of “peopleware” in mind. I wish, however, it offered more practical advice and cited more research and external analyses.

Should you read this book?

If you are a beginning manager or want to be one – yes. 

If you never read a book on management – maybe (although Peopleware might be a better read).

The bottom line: 4/5

One idea per slide. It’s not that complicated


I wrote this post in 2009, I published it in March 2020, and am republishing it again


A lot of texts that talk about presentation design cite a very clear rule: each slide has to contain only one idea. Here’s a slide from a presentation deck that says just that.

And here’s the next slide in the same presentation

Can you count how many ideas there are on this slide? I see four of them.

Can we do better?

First of all, we need to remember that most of the time, the slides accompany the presenters and do not replace them. This means that you don’t have to put everything you say on a slide. In our case, you can simply show the first slide and give more details orally. On the other hand, let’s face it, presenters often use slides to remind themselves of what they want to say.

So, if you need to expand your idea, split the sub-ideas into slides.

You can add some nice illustrations to connect the information and emotion. 

Making it more technical

“Yo!”, I can hear you saying, “Motivational slides are one thing, and technical presentation is a completely different thing! Also,” you continue, “We have things to do, we don’t have time searching the net for cute pics”. I hear you. So let me try improving a fairly technical slide, a slide that presents different types of machine learning.
Does a slide like this look familiar to you?

First of all, the easiest solution is to split the ideas into individual slides.

It was simple, wasn’t it? The result is so much more digestible! Plus, the frequent changes of slides help your audience stay awake.

Here’s another, more graphical attempt

When I show the first slide in the deck above, I tell my audience that I am about to talk about different machine learning algorithms. Then, I switch to the next slide, talk about the first algorithm, then about the next one, and then mention the “others”. In this approach, each slide has only one idea. Notice also how the titles in these last slides are smaller than the contents. In these slides, they are used for navigation and are therefore less important.  In the last slide, I got a bit crazy and added so much information that everybody understands that this information isn’t meant to be read but rather serves as an illustration. This is a risky approach, I admit, but it’s worth testing.

To sum up

“One idea per slide” means one idea per slide. The simplest way to enforce this rule is to devote one slide per sentence. Remember, adding slides is free; the audience’s attention is not.

Before and after — stacked bar charts

A fellow data analyst asked a question: what do we do when we need to draw a stacked bar chart that has too many colors? How do we select the colors so that they are nice but also easily distinguishable? To answer this question, let’s look at data similar to what appeared in the original question. I also tried to recreate the actual chart’s style.

So, how do we select colors?
The answer to this question is pretty complicated. To have a set of easily distinguishable colors, one needs to properly model color perception in a typical human being. Luckily, there is a tool called I Want Hue that’s based on a solid theory explained here. The problem, however, isn’t in the colors.

This is not the right question

Distinguishing between eight colors in a graph is a challenging task. Selecting the right color scheme might help, but it won’t solve this fundamental problem. Moreover, stacked bar plots are tricky due to another complication.

We, the humans, are somewhat good at comparing positions but not as good at comparing sizes. This is why comparing the heights of the bars is relatively easy. It is easy because the bars start at the same line, and our task is to compare the bar end position, not the bar size. Reading the heights of the lowest segment in the bars is also an easy task for the same reason: we don’t compare the sizes but the heights.

However, comparing the sizes of the middle components is more challenging. As a result, the intermediate parts of a graph don’t add useful information but rather add noise. Thus, let us explore two options. First, we will reduce the number of groups. Next, we will explore what happens when reducing the number of groups is not an option.

Option 1. Reduce the number of categories

It is hard to advise about data visualization when I don’t know what conclusion the author wants to convey. However, I am sure that in many cases, the number of categories that are relevant to the viewer is much smaller than the number of categories that are relevant to the analyst. The viewer might not care about all the hard work you did while collecting the data; what they care about is an insight. For example, if we reduce the discussion to two groups, the USA and non-USA data centers, the graph becomes much more readable.
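Collapsing categories is a data preparation step that happens before any plotting. Here is a minimal sketch of it. The error counts, the data center names, and the "us-" naming convention are all invented for this example; the actual data behind the graphs is different.

```python
from collections import defaultdict

# Hypothetical error counts per month and data center
errors = {
    "2020-01": {"us-east": 5, "us-west": 3, "eu-central": 4, "asia-south": 2},
    "2020-02": {"us-east": 6, "us-west": 1, "eu-central": 2, "asia-south": 7},
}


def collapse(errors_by_period, group_of):
    """Sum the values of all categories that map to the same group."""
    collapsed = {}
    for period, counts in errors_by_period.items():
        totals = defaultdict(int)
        for category, value in counts.items():
            totals[group_of(category)] += value
        collapsed[period] = dict(totals)
    return collapsed


def usa_or_not(data_center):
    # Assumes US data centers are named with a "us-" prefix
    return "USA" if data_center.startswith("us-") else "non-USA"


two_groups = collapse(errors, usa_or_not)
```

With only two segments per bar, both can be read by position: one from the baseline and the other from the top of the bar.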

Note how two groups in a stacked graph pose no problem in deciphering the sizes. If we take care of readability and improve the data-ink ratio, we get a nice data visualization piece.

Option 2. When reducing the number of categories is not an option

But what if reducing the number of categories is not an option? If you are absolutely sure that the audience absolutely needs to see all the information, you can split the different groups into separate subgraphs.

Have you noticed that the X-axis in our case represents time? In this case, we can replace the bars with an evolution plot and create a separate chart for each category in the data set. I took special care to keep the Y-axis scale equal between all the graphs so that the viewer can easily distinguish between data centers with a lot of errors and data centers with only a few of them. Here’s the result:

But what if the overall error rate is of greater importance than the individual groups? In that case, we can plot it in a larger graph and add the separate groups below, in smaller, un-emphasized subplots.

Summary — the Why and the What define the How

When you have a technical question about improving a graph, make sure you ask yourself “why.” Why does this technical problem matter? Why will solving it improve the chart? To answer this question, you will have to ask another question: “what?” What is it that I want to say? The easiest way to force yourself to ask these questions is to force yourself to add titles to every graph you create (see my how to suck less in data visualization post for more details).

Once you have your conclusion ready, you will notice that you don’t need a technical solution but rather a conceptual one. In this case, we solved the technical problem of looking for eight distinct colors by reducing the number of categories to two or splitting one elaborate graph into several straightforward ones.

So, remember, the Why and the What define the How.

The Python code that was used to generate all these graphs is available at https://gist.github.com/bgbg/6c645a5fc48e61b1a917c9d1d66fa72f

The Problem With Slope Charts (by Nick Desbarats)

Slope charts are often suggested as a valid alternative to clustered bar charts, especially for “before and after” cases.

So, instead of a clustered bar chart like this

we tend to recommend a slope chart (or slope graph) like this

However, a slope chart isn’t free of problems either. In the past, I already wrote about a case of a meaningless slopegraph [here]. Today, I stumbled upon an interesting blog post (and a video) that surveys the problems of slope charts and their alternatives.

All the graphs here come from the original post by Nick Desbarats that can be found [here].

Before and after: Alternatives to a radar chart (spider chart)

Radar charts (sometimes called “spider charts”) look cool but are, in fact, pretty lame. So much so that when the data visualization author Stephen Few mentioned them in his book Show me the numbers, he did so in a chapter called “Silly graphs that are best forsaken.”

Here, I will demonstrate some of their problems and suggest an alternative.

Before: The problems of a radar (spider) plot

Above is my reconstruction of the original plot that I saw in a Facebook discussion. The graph looks pretty cool, I have to admit, but it is full of problems.
What are the problems of a spider plot or a radar plot?
Let’s start with readability. Can you quickly tell the value of “Substance abuse” for the red series? Not that easy.

But a more significant problem emerges when one realizes that in most cases, the order of the categories is arbitrary and that different sorting options may result in entirely different visual pictures.

After: conclusion-based graph design

I have been continually preaching to add meaningful titles to all the graphs you are creating. (See How to suck less in data visualization and professional communication).

One of the byproducts of adding a title is the fact that when you write down your main takeaway of a graph, you force yourself to think, “does this graph show what it says it shows?” Thus, you guide yourself to better graph choices.

Let’s say that we conclude that there is no correlation between the two series of data. Is this conclusion evident from the graphs? I would say, not so much.

Instead of a radar chart, I suggest creating two aligned, horizontal bar plots. This way, we may sort one subplot according to the values, and then the correlation (or lack thereof) will be evident.

But what if we noticed something interesting about the differences between A and B groups? If this is true, let’s show precisely this: the differences.

Notice how the bars in this version are sorted according to the difference. Sorting a bar chart is the easiest way to make it readable.
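To make the sorting step concrete, here is a small sketch that computes the A minus B difference for every category and sorts the categories by it. The category names and the values are hypothetical, and the text "bars" are only a stand-in for a real plot; the code I actually used is in the gist linked below.

```python
# Hypothetical scores for two groups; names and values are illustrative only.
group_a = {"Sleep": 31, "Diet": 40, "Exercise": 22, "Stress": 36}
group_b = {"Sleep": 25, "Diet": 44, "Exercise": 35, "Stress": 35}


def sorted_differences(a, b):
    """Return (category, a - b) pairs sorted from the most negative to the most positive."""
    diffs = {category: a[category] - b[category] for category in a}
    return sorted(diffs.items(), key=lambda item: item[1])


ordered = sorted_differences(group_a, group_b)
for category, diff in ordered:
    # One character per unit of difference; "-" marks B > A, "+" marks A > B.
    bar = ("-" if diff < 0 else "+") * abs(diff)
    print(f"{category:>10} {diff:+4d} {bar}")
```

Once the rows are ordered by the difference, the categories where the two groups diverge the most end up at the two ends of the chart, which is exactly where the eye goes first.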

Python code that I used to create these graphs is available here https://gist.github.com/bgbg/db833db723998cd244b5049bfe01f5ac

Another language

بعد حوالي سنتين من الدراسة ، بحس حالي جاهز لإضافة اللغة العربية إلى قائمة اللغات في ال-LinkedIn 

After about two years of study, I feel ready to add Arabic to LinkedIn’s language list

Basic data visualization video course (in Hebrew)

I had the honor to record an introductory data visualization course for high school students as a part of the Israeli national distance learning project. The course is in Hebrew, and since it targets high schoolers, it does not require any prior knowledge.

I got paid for this job. However, when I divide the money that I received for this job by the time I spent on it, I get a ridiculously low rate. On the other hand, I enjoyed the process, and I view this as my humble donation to the public education system.
Since a government agency makes the course site, its UI is complete shit. For example, the site doesn’t support playlists, and the user is expected to search through the video clips by their titles. To fix that, I created a page that lists all the videos in the right order.

Text Visualization Browser

I’ve stumbled upon an exciting project — text visualization browser. It’s a web page that allows one to search for different text visualization techniques using keywords and publication time. 

Text visualization browser https://textvis.lnu.se

The ability to limit the search to various years gives a nice historical perspective on this interesting topic

This site’s information is based on a 2015 paper Text visualization techniques: Taxonomy, visual survey, and community insights. I wish the authors updated it with more recent data, though. 

Sharing the results of your Python code


If you work, but nobody knows about your results or cares about them, have you done any work at all? 

A proverbial tree in the proverbial forest. Photo by veeterzy on Pexels.com

As a data scientist, the product of my work is usually an algorithm, an analysis, or a model. What is a good way to share these results with my clients? 

Since I write in Python 99% of the time, I fell in love with a framework called Panel (http://panel.holoviz.org/). Panel allows you to create and serve a basic interactive UI around data, an analysis, or a method. It plays well with API frameworks such as FastAPI or Flask. The only problem is how to share this work. Sometimes, it is enough to run a local demo server, but if you want to share the work with someone who doesn’t sit next to you, you have to host it somewhere and take care of access rights. For this purpose, I have a cheap cloud server ($5/month), which is more than enough for my personal needs.

If you can share the entire work publicly, some services can pick up your Jupyter notebooks from GitHub and serve them interactively (I know of Voilà and Binder).

Recently, Streamlit.io has been entering this niche. It currently only allows sharing public repos, but promises to add a paid service for your private code. I’m eager to see that.

The information is beautiful. The graphs are shit!

I apologize for my harsh language, but recently I was exposed to a bunch of graphs on the “information is beautiful” site, and I was offended (well, not really, but let’s pretend I was). I mean, I’m a liberal person, and I don’t care what graphs people make in their own time. But many people visit that site because they try to learn good visualization practices, and some charts on that site are wrong. Very wrong.

Here’s the gem:

I deliberately don’t share the link to this site. I don’t want to let Google think it’s valuable in any way.

Now, the geniuses from “Information is beautiful” (let’s call them IB for brevity) wanted to share with us some positive stats. How nice of them. So what did they do? They gathered together nine pairs of metrics collected at two different time points: one in the past and one further back in history. They used nice colors to create some sleek shapes. So, what’s the problem? What’s wrong with that?

Everything is wrong!

Let’s start with my guess that they cherry-picked the stats with “positive” changes. Secondly, a comparison of this sort is mostly meaningless if we compare points from different years. What stopped the authors of that tasteless “infographic” from collecting data from the same years? Their laziness, I guess. That’s how we ended up comparing the number of death penalties in 1990 and 2016, while the malaria death numbers are for 2000 and 2016, and dying mothers are compared for the years 2000 and 2017.

Now, let’s talk about data viz.

Take a look at this graph.

The only time we use shapes like that is when we want to convey information about uncertainty. To do that, the X-axis represents the thing we are measuring, and the Y-axis represents our certainty about the current value. When we compare two uncertain measurements, we may judge the difference between these measurements by the distance between the curve peaks, and the width of the curve represents the uncertainty.

Here’s a good example from [this link]:

Can you see how the metric of interest is on the X-axis? The width of each bell curve represents the uncertainty and the difference between any pair of cases is the difference on the horizontal (X) axis, not the vertical one.

Instead, what did the IB authors do? They obviously like sleek-looking shapes but know nothing about how to use them. They could have used two bars and let the viewer compare their heights. But nooooo! Bars are not c3wl! Bars are boring! Instead, they took probability density curves (that’s what they are technically called) and made them pretend to be bars.

Bars. Is this THAT hard?

I can hear some of you saying, “Stop being such a purist! What’s wrong with comparing the heights of bell curves?” I’ll tell you what’s wrong! Data visualization is a language. As with any language, it has some rules and traditions. If you hear me saying, “me go home,” you will understand me without any problem. However, you will silently judge me for my poor use of the English language. I know that, and since English is my third language, I use all the help I can get to make as few mistakes as possible. The same is true of data visualization. Please respect its rules and traditions, even if (and especially if) you are not fluent in it.

I never write more than two sentences in English without Grammarly

Visit the worst practice tag in this blog to see more bad examples

15-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar, and it starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles) **. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last Sukkot days are mostly considered half working days. Also, the children are at home, since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) over a period of several years.

Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is what the working/non-working time during this month looks like:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to the constantly interrupted workday, but at a different scale.
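If you want to repeat this kind of count yourself, the logic can be sketched as follows. This is not the code behind the numbers above, just an illustration that assumes a Friday/Saturday weekend and takes the holiday and half-day dates as inputs. The example week and its "holiday" are arbitrary and do not represent a real holiday calendar.

```python
from datetime import date, timedelta


def count_days(start, n_days, holidays, half_days=frozenset()):
    """Return (non_working, half_working, working) day counts.

    Assumes the Israeli weekend, Friday and Saturday. `holidays` should
    include holiday eves if you treat them as rest days.
    """
    non_working = half_working = working = 0
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        # date.weekday(): Monday is 0, so Friday is 4 and Saturday is 5.
        if day.weekday() in (4, 5) or day in holidays:
            non_working += 1
        elif day in half_days:
            half_working += 1
        else:
            working += 1
    return non_working, half_working, working


# Arbitrary example: one week starting on Sunday, September 13, 2020,
# with a made-up holiday on Tuesday and a made-up half day on Wednesday.
result = count_days(
    start=date(2020, 9, 13),
    n_days=7,
    holidays={date(2020, 9, 15)},
    half_days={date(2020, 9, 16)},
)
print(result)
```

A weekend day that is also a holiday is counted once, which is why the number of non-working days changes from year to year depending on where the holidays fall in the week.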

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.
(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

Career advice. Becoming a freelancer immediately after finishing a master’s degree

Photo by Miguel Á. Padriñán on Pexels.com

Will Cray [link] is a fresh M.Sc. in Computer Science and considers becoming a freelancer in the Machine Learning / Artificial Intelligence / Data Science field. Will asked for advice on the LocallyOptimistic.com community Slack channel. Here’s Will’s question (all the names in this post are used with people’s permission).

Read more career advice [here].

Let’s begin.

Will Cray 

I’m hoping to start a career as a freelancer in the AI space after finishing my Master’s in CS with a focus in AI. I don’t, however, have any industry experience in AI or data science. Do you all think it’s feasible to start a freelancing career without any industry experience? If so, do you have any tips on how to do it successfully?
[I worked for] two years at a major tech company, but I was a systems engineer. It was experience that isn’t necessarily relevant to what I want to work on as a freelancer.

Let’s divide the response to Will’s questions into two parts that correspond to Slack’s two discussion threads.

Thread #1 – Michael Kaminsky

This is a copy/paste from Slack.

Michael Kaminsky 

LocallyOptimistic.com — a valuable source for data folks

My hunch is that it’s going to be pretty tough to get started, though not impossible. You’re probably looking at a pretty lean year or two to build up a reputation out of the gate

Michael Kaminsky 

AI work in general is sort of difficult to contract out — so you might have more luck if you team up with a larger consulting outfit that can handle the other non-AI parts of the work

Michael Kaminsky 

very rarely is someone like “we have all of the data pipeline and pieces working, now we just need to hire someone to do the AI part” — in general, the model-fitting part of an AI project is the easiest and fastest

Will Cray 

Thank you so much for the info–it’s really helping me getting a better understanding of the landscape. Would your opinion, especially regarding that last message, change if the AI work I was doing was more custom model/agent design and training, rather than doing something quick like .fit() in sklearn?

Michael Kaminsky

ummm maybe? but like who needs custom model/agent design and training that doesn’t already have in-house data scientists working on it?

Michael Kaminsky

I don’t want to dissuade you, but my point is that you should think about who your customers are, and how you can market your services in such a way that it will provide them value. If you don’t have a clear map of the three concepts in italics, it could get rough — you can definitely figure it out by doing it, but that’s what you’ll be up against

Will Cray

You mentioned “larger consulting outfits” earlier–do you have any examples of organizations that you think could be a good fit?

Michael Kaminsky

so Brooklyn Data Company and 4 mile consulting are the two that jump to my mind — they specialize in BI and data but might want flex capacity into DS — they might be able to give you deal flow, etc. I know there are a number of others, maybe even folks in this channel

Thread #2 – Boris Gorelik

This is a copy/paste from Slack with some later edits and additions. 

Boris Gorelik 

Another thing to consider is what your risks are. If there are people who depend on you financially, starting with a freelance career might be too risky, especially if you don’t have 1-2 (better 2) customers who already committed to paying you for your services.

If you can afford several months without a steady income, or no income at all, being a freelancer might expose you to a larger variety of companies and business models in the market. I know some people who used to work as freelancers and gradually “adopted” one customer and moved to full employment. In these cases, freelance projects were, in fact, mutual trial periods where both sides decided whether there is a good fit.

Will Cray 

I greatly appreciate this insight. I have little risks. I’m single, my living expenses are low, and I have some financial runway. Part of the reason I like the idea of freelancing is for the reason you stated–I’ll get to see many different business models. As an aspiring entrepreneur, I think diversity of experiences and exposure would be useful to me. I also think being flexible in how many hours I work will allow me to allocate more time to developing my own ideas/projects; although, I understand that’s a luxury that comes with being an established freelancer. I don’t have any clients currently. Do you have any recommendations for channels to try and garner clients?

Boris Gorelik

> As an aspiring entrepreneur, I think ….

Even though a freelancer and an entrepreneur’s legal status may be the same, they are different occupations and careers. An entrepreneur creates and realizes business models; a freelancer sells their time and expertise to fulfill someone else’s ideas. It’s true that most of the time (not always), combining freelance with entrepreneurship is easier than combining entrepreneurship with being a full-time employee in a traditional company.

> Do you have any recommendations for channels to try and garner clients?

Nothing except the regular facebook/linkedin/ but mostly friends and former coworkers and, in your case, teachers/lecturers. I got my first job interview via my Ph.D. advisor. Later, when I helped in hiring processes, I asked him and other professors to refer me to proper candidates. So yeah, make sure your professors know your status.

Exploring alternatives to population pyramids

A population pyramid, also called an “age-gender pyramid,” is a graphical illustration that shows the distribution of various age groups in a population (typically that of a country or region of the world), which forms the shape of a pyramid when the population is growing [citation from Wikipedia].

In some cases, the pyramid provides interesting insights into the entire population. In this post, I will explore ways to make some of these insights more visible. 

The basic case

Let’s start with the basic case. If you have two-three hours of spare time, you can go to the site devoted to population pyramids — https://www.populationpyramid.net. There, you will find population pyramids for every country in the world. The site provides present and past data, as well as future forecasts. To understand how insightful age pyramids can be, look at the graph that represents the entire world.

(this and most other images in this post are from the site http://populationpyramid.net/)

You can clearly see that the world is mostly young, that the number of people declines as the age progresses, and that there is a rough balance between men and women in the world, at least before the ages of 70+.

Now, examine the stark difference between the populations of Western Africa and Western Europe. Citing the late professor Hans Rosling, we can still see two worlds, one with large families and short lives, and one with small families and long lives. 

Another striking example of an age pyramid is the following:

Do you want to guess which country this is? This particular graph shows the age distribution of the United Arab Emirates. Such a vast distortion in symmetry and age distribution stems from the fact that more than 80% of the UAE’s population is composed of expats who come to this rich country to work. The pyramid below (taken from [this article]) sheds some light on the population composition of the UAE. (Note that the genders in this graph are reversed).

Whose bar is longer?

The male-female imbalance in the UAE and some other Gulf countries is very striking and cannot be missed. But what about other, more subtle cases? Take a look at the world graph above. If you follow the numbers on the bars, you will notice that more boys are born than girls, but there are more old ladies than old gents in the world. Can we make such differences less subtle?

To answer this question, we need to understand why we find it hard to compare almost equal bars. The reason for that is that our eyes (or brains) are not so good at comparing sizes. They do, however, do a much better job comparing positions. Thus, if we overlap these bars, we will see the small differences in a much more precise manner. 

(I thank the data visualization expert Bella Graf from InfoServiz.co.il for the idea of this graph).

Now, the subtle differences in gender composition are more visible. 

What am I looking at?

When I teach data visualization, I always tell my students to add a meaningful title to the graph. By “meaningful,” I mean a title that does not answer the question “what” but rather “so what”? (See my posts “How to suck less in data visualization,” and “C for conclusion“). What would a good title for this graph be? Let’s try the following

OK, so now, when we have a title, we can ask ourselves, “does the graph show what it says it shows”? And the answer is no. Right now, the title talks about differences, but we don’t see the differences. We see the differences and other stuff. Let’s look only at the differences.

I don’t like this.

What about this?

Now, this is not an age pyramid. That’s for sure. This graph doesn’t show the wealth of data that the classical pyramid shows. On the other hand, it does offer one thing, and it does it very well. Look, for example, at the male/female distortion in China in 1990.

You may find the code I used to create the graphs in this post [on GitHub].

The Mysterious Status of .blog Domains

Photo by Bruno Bueno on Pexels.com

When the .blog TLD was started by Automattic, employees were given the option to reserve a domain for free. In return […], they asked that the domain be used as a primary domain (no forwarding to a different site), and that the site be updated with new content at least once a month. This requirement was the last argument for me NOT taking boris.blog — I didn’t want to make this commitment, plus I like gorelik.net a lot.

Recently, there were some not-so-nice developments around the .blog names that were given away to Automatticians. The complaints about this situation are usually anonymous, but I think that in this case, anonymity isn’t the right approach. That is why I decided to share here an anonymous post from the Antimattic blog. Although I am not the author of the original post, and I don’t share the views of some of the posts written there, I do share the concerns expressed in this particular article. Posting in return for a domain name might have been a reasonable request at the beginning of the .blog TLD to help promote its adoption. But now, several years after this TLD became active, this requirement is simply not OK. To read the original post, click the screenshot below.

The first paragraph of this post is a verbatim copy from Antimattic.

ASCII histograms are quick, easy to use and to implement

From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python required too many lines of boilerplate code, I found a neat function that produced histograms using ASCII characters.

Recently, I updated the Python function that I use to create ASCII histograms. The updated function [link] uses more modern formatting and includes several signal-to-noise improvements. One can also use it with custom output functions, such as logging.info.
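To give a feeling of how little code such a function needs, here is a minimal sketch of the idea. It is not the function linked above; the name `ascii_histogram` and its parameters (`bins`, `width`, `output`) are my assumptions for this illustration.

```python
def ascii_histogram(values, bins=5, width=40, output=print):
    """Print a histogram of `values` using '#' characters.

    `output` is any callable that accepts a string, for example
    print or logging.info. Returns the list of bin counts.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    counts = [0] * bins
    for value in values:
        # The maximum value would land in bin number `bins`, so clamp it.
        index = min(int((value - lo) / span * bins), bins - 1)
        counts[index] += 1
    peak = max(counts)
    for i, count in enumerate(counts):
        left_edge = lo + span * i / bins
        bar = "#" * round(width * count / peak)
        output(f"{left_edge:10.2f} | {bar} {count}")
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
collected = []
counts = ascii_histogram(values, bins=5, width=30, output=collected.append)
print("\n".join(collected))
```

Passing `output=logging.info` instead of a list's `append` sends every histogram row to the log, which is handy in long-running scripts where there is no console to look at.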

Book review: The Abyss: Bridging the Divide between Israel and the Arab World

TL;DR If you are an Israeli and don’t feel like learning the behind-the-scenes stories, skip it. Otherwise, I do recommend reading this book. I enjoyed it a lot. 4.5/5

The Abyss: Bridging the Divide between Israel and the Arab World went to print slightly after the outbreak of the “Arab Spring.” The author, Eli Avidar, is a former Israeli intelligence officer and diplomat. Among other things, Eli Avidar served as the head of the Israeli diplomatic mission to Qatar in 1999. Today, Eli Avidar is a Knesset member for the right-wing Yisrael Beiteinu party. Even though so many things have changed since the book was published, I didn’t find any claim by Eli Avidar that turned out to be wrong, nine years after the publication.

I enjoyed reading this book a lot despite the fact that most of Eli Avidar’s claims are not new to me. Most of them are widely known to all the Israelis, and the real question is not whether you are aware of these claims, but whether you agree with them and what conclusions you make out of them.

On the other hand, The Abyss is an interesting storybook full of behind-the-scenes anecdotes and gossip. All who know me know how much I like gossip. It also provides a great insight into how the (Jewish-)Israeli society sees the Arab-Israeli conflict, and what it feels towards it.

Should you read the book? If you are an Israeli and don’t feel like learning the behind-the-scenes stories, you may skip it. Otherwise, I do recommend reading this book. I don’t know how accurate Avidar’s description of the Arab world is, but his analysis of the Israeli behavior and attitude is very accurate. If you ever caught yourself wondering “What the fuck do the Israelis think?”, this book might shed some light for you. That is why I write this review in English, despite my tendency to review Hebrew books in my Hebrew blog.

Fun fact. I finished reading this book on August the 13th. I closed the book, opened Twitter, and saw my feed FULL of news about the upcoming normalization treaty between Israel and the UAE.

What is the biggest problem of the Jet and Rainbow color maps, and why is it not as evil as I thought?

There was a consensus among the data visualization purists that the rainbow color map and its close cousin Jet are bad. Really bad. These colormaps used to be popular at the beginning of the computational data visualization era. However, their popularity decreased in the last five years or so. The sentiment isn’t as bad as it used to be a couple of years ago, but still.

A screenshot from circa 2016. Today we are less fanatic than that

What is the biggest problem of the rainbow colormap? The most apparent problem with this particular colormap is that it is not perceptually uniform. By “perceptually uniform,” I mean that equal changes in the value that we encode using a colormap should correspond to equal changes in the color perception. This is not the case with the rainbow or the Jet colormaps. They have distinct bright and dark stripes within the number range, making them the wrong choice to encode numerical data. The situation is even worse for people with impaired color vision.
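The bright and dark stripes are easy to demonstrate with a few lines of Python. The sketch below is only a rough check, not a full test of perceptual uniformity: it computes an approximate CIE L* lightness along a colormap and counts how many times the lightness changes direction. Note that `naive_rainbow` is a plain HSV hue sweep that I use as a stand-in; it is not matplotlib’s Jet, and all the function names here are made up for this example.

```python
import colorsys


def srgb_to_linear(c):
    """Undo the sRGB gamma for a single channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lightness(r, g, b):
    """Approximate CIE L* (0 to 100) of an sRGB color with channels in [0, 1]."""
    y = (0.2126 * srgb_to_linear(r)
         + 0.7152 * srgb_to_linear(g)
         + 0.0722 * srgb_to_linear(b))
    return 116 * y ** (1 / 3) - 16 if y > 0.008856 else 903.3 * y


def naive_rainbow(t):
    """Blue-to-red HSV hue sweep for t in [0, 1]; a stand-in, NOT Jet itself."""
    return colorsys.hsv_to_rgb((1 - t) * 2 / 3, 1.0, 1.0)


def gray_ramp(t):
    """Black-to-white ramp, used here as a well-behaved reference."""
    return (t, t, t)


def direction_changes(colormap, steps=64):
    """Count how many times the lightness switches between rising and falling."""
    values = [lightness(*colormap(i / (steps - 1))) for i in range(steps)]
    diffs = [b - a for a, b in zip(values, values[1:])]
    rising = [d > 0 for d in diffs if abs(d) > 1e-9]
    return sum(1 for prev, cur in zip(rising, rising[1:]) if prev != cur)


print("rainbow:", direction_changes(naive_rainbow))
print("gray:   ", direction_changes(gray_ramp))
```

Every change of direction corresponds to a bright or a dark band that has nothing to do with the data, while the gray ramp only gets lighter as the value grows.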

Can you be less perceptually uniform?

The solution to this problem was proposed in the form of better colormaps. The first one that I know of is Parula by MATLAB, and its open-source alternative Viridis that is available in matplotlib and many other plotting libraries. (Watch this video about viridis to get a good introduction to color perception and color maps).

Viridis, the new rainbow

Everything was nice and good, and I was trashing the rainbow colormap whenever I could. Until yesterday, when I read about Turbo, the improved rainbow colormap developed by Google.

In the long and interesting blog post that describes Turbo, Anton Mikhailov, a software engineer at Google, describes several relevant applications of a “good rainbow” scheme.

According to Anton, “Because of rapid color and lightness changes, Jet accentuates detail in the background that is less apparent with Viridis and even Inferno. Depending on the data, some detail may be lost entirely to the naked eye. The background in the following images is barely distinguishable with Inferno (which is already punchier than Viridis), but clear with Turbo.”

I must admit that I’m convinced. 

The biggest problem usually mentioned concerning the original rainbow scheme is that its brightness varies too much. However, it turns out that the color saturation and hue attract our attention more than the lightness (here’s the reference which I haven’t read yet). As such, it makes sense to construct a colormap that relies more on saturation and hue changes.

Moreover, in many cases, the interesting details appear in the extreme values of the data range, not in the middle. In these cases, a properly applied rainbow-like color scheme becomes a valid choice.

The bottom line is that one should not refrain from using rainbow(-like) color maps in their visualizations anymore, provided that they use a modern implementation. Luckily, Turbo is even available in matplotlib.

If you don’t teach yet, start! It will make you a better professional.

Many people know me as a data scientist. However, I also teach, which is sort of unnoticed to many of my friends and colleagues. I created a page dedicated to my teaching activity. Talk to me if you want to organize a course or a workshop.

I also highly recommend teaching as a way of learning. So, if you don’t teach yet, start! It will make you a better professional.

How to suck less in data visualization and professional communication

In technical communication, the main thing is to keep the main thing the main thing. There are multiple ways to ensure this principle. Some of these ways require careful chart fine-tuning. However, there is one tool that is easy to master, fast to apply, and that provides a high rate of return on investment. I refer to chart titles. In this talk, I had two main theses. My first thesis is that most of you suck at communication (and not only data visualization).

My second thesis is that you can quickly improve your graphs by merely adding a good title. The importance of good titles is not new to my preaching, but I thought it was an excellent thing to formalize this thesis a bit, and I’m thankful to the NDR organizers for giving me this opportunity.

Following is the slide stack from my NDR presentation.

Unexpected hitch of working in a distributed team

Photo by Porapak Apichodilok on Pexels.com

It has been about half a year since I became a freelance data scientist. Before my career change, I worked in a distributed team for more than five years. Today, I suddenly realized that working in a distributed team has a significant problem, inherent to its distributed, multinational nature.

My team was always spread over multiple time zones. Sometimes, the time zone span was so broad that we could never find a time slot where all the team members were ordinarily awake. Automattic, the company I used to work for, is a firm believer in asynchronous communication, but from time to time, you HAVE to meet over a Zoom/Slack/Whatever call. Since I wasn’t a manager, the number of live calls that I had to attend was kept to a minimum, and yet, I found myself at least twice a week in a 10 pm Zoom call. I don’t know about you, but my brain keeps working for at least two hours after I log off. Thus, twice a week, I would find myself going to bed after one o’clock at night. As a result, I was sleep deprived for the majority of the week.

Only now have I noticed the fact that my sleep has improved so much after the career change. I know that people who work in “colocated” teams also find themselves in late night phone calls, but working in a distributed group means that you’ll do it regularly.

Hybrid digital/analog tangible week planning

Here’s a neat method that helps me organize my week, increase my productivity and fight procrastination. 

As a freelance data scientist, I’m involved in three hands-on projects for two clients. I also manage/mentor two data scientists in two other projects, and participate in strategic discussions for a customer of mine, and in a startup in which I invest. Oh, I am also in the final stages of writing a paper. I never imagined I would be in a situation with so many balls to keep in the air. How do I manage to keep my sanity?

This is what I do. Following the advice in “15 Secrets Successful People Know About Time Management“, I try to keep as many items in my calendar as possible. When my workweek starts, I print out the weekly schedule on a sheet of paper. Then, I apply the tangible GTD hack that I learned from another book [link] and write out all my projects on a bunch of small post-it notes. These notes allow me to “dump” all my brain contents into an external medium, which frees up my brain to spend more CPU cycles on processing, rather than remembering and worrying.

Next comes the fun part: I get to play with my cards by arranging them on the weekly schedule. The geometry of the post-it notes and the sheet of paper ensures that I allocate reasonably large chunks of time for each “big thing.” It also reminds me that the amount of time each day is limited, and I can’t stick too many plans into a day or a week. (No, I won’t be able to finalize the paper, complete the analysis for a retail shop, and learn a chapter in a Bayesian statistics book before the end of today.)

After I’m done, I copy each post-it note into my calendar. Thanks to the integration with Todoist (an excellent productivity tool), all these tasks end up in my todo list, where I can further work with them.

To sum up:

  • Global week overview – check
  • Prioritization and honesty – check.
  • Fun playing with sticky notes – check.
  • Work gets done – (I wish!).

Oh, did you notice the appointments between 5 and 6 am? This is my sports activity. Sometimes working out charges me for the entire day. Sometimes, all I want to do for the entire day is to have a nap 🙂

Before and after. Even excellent graphs can be improved

Being a data visualization consultant, I can’t help looking for dataviz problems in graphs that I see. Even if the graph is good. Even if I know that I would not be able to create a graph that good. Even if the overall graph is excellent, and the problems are minor, or maybe especially when the graph is excellent, and the problems are minor.

This is a nice graph published by Nevo Benita on LinkedIn.

The graph presents the gap between the men and the women in the Israeli job market. As I said, the graph is excellent. However, there are several small problems that, like grains of sand in a chocolate mousse, stand in the way. Let’s take a look at them.

The time-series line in the upper right part of the graph shows good use of the real estate. The problem is that the X-axis ticks (the years) look as if they belong to the chart below. It takes some time to realize that the numbers are years of the upper graph, and not the X-axis of the graph below. Moving the numbers upwards by several pixels would have fixed that.

Now, it is clearer that “1990” and “2018” relate to the time-series graph above.
Before (left) and after (right).

Let’s talk about the left-side bar chart. It took me a while to understand what it is. As a matter of fact, I managed to write a critique paragraph about that bar chart, how it is unclear what the percentages are, and how they were computed. Only then did I notice the explanation below. Such confusion isn’t the viewer’s fault. Since we usually scan images from top to bottom, moving the title to the top of the chart will reduce this confusion. The word “percent” is also redundant in that title since it comes after the percent sign.

Moving the explanation to the top makes it easier to notice. Before (left) and after (right)

The last point that is worth optimizing is the color order. Consistent element order in an image makes navigation and comprehension much easier. When the order is preserved, our brain can use mental shortcuts without losing much information. When these shortcuts are broken, the brain has to work harder. What am I talking about? The graph author made the correct decision to use different font colors in the graph title to specify which color stands for which gender. This way, we don’t need a separate legend, and this is good. The title is an ordered sequence of words. The visualizer could use this order to create the order heuristic that is so helpful. Such a heuristic isn’t always possible. Fortunately for the visualizer (and sadly for society), the salary gap in all the occupations in this graph has the same direction: men earn more than women. As a result, the rightmost part has all the green dots on the right and the purple dots on the left. This direction is opposite to the gender direction in the title and the color direction in the bar chart. To fix this situation, I made sure that the color that stands for the women (purple) is always to the left of the color that designates the men (green).

Keeping the color order. Before (left) and after (right)

So, this is the final result. I hope you can see why I like it better.

That’s how I took an excellent graph and made it even more awesome.

Data visualization is not only dots, bars, and pies

Look at this wonderful piece of data visualization (taken from here). If you know the terms “tertiary structure” and “glycan”, there is NO way you miss the message that the author of this figure wanted to convey.

Also, note how, by using appropriate colors in the title, the authors got rid of the graph legend.

How to become a Python professional in 42 hours?

Here’s an appealing ad that I saw


How to become a Python professional in 42 hours? I’ll tell you how. There is no way. I don’t know any field of knowledge in which one can become a professional after 42 hours. Certainly not Python. Not even after 42 days. Maybe after 42 weeks, if that’s mostly what you do and you are already a programmer.

Book review. Five Stars by Carmine Gallo

TL;DR Good motivation to improve communication. Inadequate source of information on how to achieve that 

The central premise of Five Stars: The Communication Secrets to Get from Good to Great by Carmine Gallo is that professionals who don’t invest in communication skills are at high risk of being replaced by computers and robots. One of the book’s sections bears a title that summarises this premise very well: “Storytelling isn’t a soft skill; it’s the equivalent of hard cash.” I firmly believe in this premise. That is why I invest so much time in learning and teaching data visualization, in public speaking, and blogging.

When I started reading this book, I got excited. I kept marking one passage after another. Gallo packed the first part of the book with numerous citations and explanations on how a lack of communication skills is the most severe risk factor in the career of a modern professional, team, or company. One example led to another, and one smart conclusion followed another.

Then, I started noticing that the book tries to convince me more and more, but I didn’t need that convincing in the first place. More than half of the book is evangelism. The author tells you how essential communication skills are, then he gives you some examples of people who did it right, and then again talks about their importance. Again, and again, and again. Where are all those “secrets to get from good to great”???

When, finally, we get to the practical parts, the reader is left mostly with shallow, almost trivial bits of advice. 

Some of the most important points I took from this book

Slight feeling of a hamster-wheel while reading this book

Adopt the three-act storytelling approach to presentations. The three-act storytelling approach worked for Homer, Shakespeare, and Tarantino, and there is no reason it should fail you in your technical presentations. Fair enough. On the other hand, this 2012 article by Nancy Duarte provides more depth and more actionable information on this approach (follow Duarte’s blog if presentation skills are something you are interested in).

“In the first two to three minutes of a presentation, I want people to lean forward in their chairs.” I like this citation by Avinash Kaushik, Google’s digital marketing evangelist. I will undoubtedly try this approach in my next presentations.

Should you read this book?

If you read these lines, your job depends on your communication and presentation skills. If you believe this premise, you can skip the first 60% of the book. If you want to improve your communication skills, I suggest reading Jean-luc Doumont’s “Trees, Maps, and Theorems,” which is much shorter, but also much denser in methods and practical advice. 

The bottom line

3.5/5

The delicate art of fine trolling

Photo by Pixabay on Pexels.com

I’m reading a 1991 paper by Barbara Tversky that deals with the directional representation of time. One sentence in the paper says

“There does not seem to be strong universal cognitive associations of quantity or quality to left or right”

Whenever I make a similar statement in the context of data visualization, I frequently get a self-assured response “of course there is – smaller numbers appear on the left!”. To answer this remark, Barbara Tversky added a small footnote that says

“Anyone in doubt should consult politicians on both the left and the right.”

Photo by Pixabay on Pexels.com

So gentle, yet so powerful.

Lie factor in ad graphs

It’s fun to look at the visit statistics and to discover old stories. I wrote this post in 2016. For a reason I don’t know, it has been one of the most viewed posts on my blog during the last week.

So, I decided to publish it again. I won’t add any new examples, but if you want to see more stuff, type [lying with data visualization] in your favorite search engine.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph, for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:


The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice the actual number. The green bar corresponds to 4:20 minutes, and the light-blue one to approximately seven minutes, not 2:39, as the label says.


I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
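Here is a minimal sketch of how that quantification could look. It assumes Edward Tufte’s usual definition of the lie factor (the size of the effect shown in the graphic divided by the size of the effect in the data), which this post doesn’t spell out. The “drawn” values are the ones I read off the bars above; the function and variable names are mine, made up for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '2:39' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def lie_factor(base_real, other_real, base_shown, other_shown):
    """Tufte-style lie factor: effect shown in the graphic / effect in the data.

    The effect size is the relative change with respect to the baseline.
    """
    effect_in_data = (other_real - base_real) / base_real
    effect_in_graphic = (other_shown - base_shown) / base_shown
    return effect_in_graphic / effect_in_data


# The advertiser's own bar (1:01) is the baseline and is assumed to be drawn correctly.
baseline = to_seconds("1:01")

# Waiting times printed on the competitors' bars.
claimed = ["1:03", "1:35", "2:39"]
# Waiting times implied by the bar heights (digitized, approximate).
drawn = ["2:03", "4:20", "7:00"]

factors = [
    lie_factor(baseline, to_seconds(c), baseline, to_seconds(d))
    for c, d in zip(claimed, drawn)
]

for c, d, f in zip(claimed, drawn, factors):
    print(f"label {c}, bar height {d}: lie factor {f:.1f}")
```

With these readings, the lie factors come out at roughly 31, 5.9, and 3.7. A faithful graph would have a lie factor close to 1.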


StellarGraph — another promising network analysis library for Python and Scala

Network (graph) analysis is a complicated topic. There are several tools available for this task, each with its pros and cons. Recently, I stumbled upon another tool, StellarGraph. The StellarGraph authors claim to provide excellent performance, integration with NumPy, Pandas, and TensorFlow, an impressive set of algorithms, interoperability with Neo4j (THE graph database), and much more. The documentation looks very clear and extensive too.

I haven’t used it yet, but I certainly plan to.

https://www.stellargraph.io

The hazard of being a wizard. On balance between specialization and the risk to become obsolete.

A wizard is a person who continually improves his or her professional skill in a particular and defined field. I learned about this definition of wizardness from the book “Managing Projects, People, and Yourself” by Nikolay Toverosky (the book is in Russian).

Recently, Nikolay published an interesting post about the hazards of becoming a wizard. The gist of the idea is that while you are polishing your single skill to perfection, the world changes. You may find that your super-skill is no longer relevant (see my Soviet Shoemaker story).

Nikolay doesn’t give any suggestions. Neither do I. 

Below is the link to the original post. The post is in Russian, and you can use Google Translate to read it.

A page about wizards. In my book, there is a chapter about commanders and wizards. At its end, I sum up: despite all their coolness, a wizard is vulnerable. A wizard is useful only if their skill fits the task. 658 more words

Why it is dangerous to be a wizard (Почему опасно быть магом) — On project management and design (Об управлении проектами и дизайне)

Bioinformatics career advice and a story about a Soviet shoemaker

When I was in elementary school (back in the USSR of the mid-80s), I had a friend whose father was a shoemaker. Due to the crazy stupid way the Soviet economy worked, a Soviet shoemaker was much richer than a physician or an engineer. But this is not the story. The story is that one day this friend’s father had a chat with me about selecting a profession. This man’s point was that for as long as people have feet and need shoes on their feet, shoemaking would be a needed and well-paid occupation. Guess what? People still have feet, and still wear shoes, but I don’t see too many successful shoemakers anymore.

Common wisdom says, “It is very hard to predict, especially the future.” And I will add, “even more especially, about the job market.” Nevertheless, people need to decide what to do with their lives, how to live, and what career paths to pursue. Some of them ask me, and I’m glad to answer. If you have any career-related questions, don’t be shy! Write to boris@gorelik.net, and I’ll see what wisdom I will be able to share with you.

Anyhow, this is a letter that I got from another pharmacist looking for a data science career.

Hope you are doing well. I saw your posts on Quora and thought of asking a doubt.
First let me tell my background. I am from India, I completed my Doctor of Pharmacy program (Pharm D). I am familiar with computer programming. I have intermediate knowledge in python and R programming.  So I thought taking up Bioinformatics and computational biology Masters program so that I can connect Pharma industry and my knowledge in computer science. 
What do you think? 
I have applied to University XYZ and got offer letter. I have to take a decision within 2 weeks.
Please let me know your thoughts on this.

To which I replied

Obviously, since the path you are describing is similar to the one I took, I will think that it is a good idea. Moreover, as you might have read in my blog (for example, here), my opinion is that advanced degrees give much more stable foundations, compared to the “fast and easy” courses. Having said that, your life is yours, not mine, and the job market today is not the job market in 2001 when I graduated with my B.Pharm.

Thank you so much for replying to my silly question. I am honoured to get a response from you. 

First of all, I don’t believe in “there are no silly questions” bullshit, but asking a silly question is better than not asking at all. Secondly, these questions are not silly at all.

I have a question, in your post dated 2017, you have mentioned that Bioinformatics was booming in 2001 and now it has lost its significance. Are you still have the same thoughts? 

I think that this person refers to the most visited post of mine “Don’t study data science as a career move; you’ll waste your time!”.  There is also a 2019 follow-up.

If that is the case then me taking a master’s in bioinformatics and computational genomics would be a bad idea, right ?

Here’s what I responded. Keep in mind that I wrote this before the COVID-19 outbreak.

Look, the markets in different countries are different. 

Back in the old days, there was a worldwide wave of closing bioinfo companies. All the Israeli ones were either closing or counting weeks before closing. One anecdote: I was interviewing at a company. Two weeks later, I called the person who interviewed me to ask whether I got the job or not, and the secretary told me that that person was fired due to layoffs. 

Right now, Israel sees a renaissance of bioinformatics companies, but I don’t know what will happen in the future. These companies live mostly out of investors’ money and are subject to strict regulations. However, if you get a good education, your head will be full of useful mental models, relevant basic knowledge, and good practices. 

End of quote. One of the side effects of the COVID-19 madness is the massive influx of money into biotech companies. Is this a short-term anecdote, or will it become a sustainable trend? I have no idea.

Do you have any career-related questions for me? You don’t have to be a pharmacist to ask :-). Write to boris@gorelik.net. I promise to respond, even if by sending a link to my blog posts.

The difference between statistically meaningful and practically meaningful. An interview with me

Recently, I gave an interview to the Techie Leadership site. Andrei Crudu, the interviewer, made a helpful outline of the conversation. I marked the most important parts in bold.

  • Academic views on leadership;
  • Managing people isn’t for everyone;
  • Lessons from a practical approach;
  • Data Science is predominantly about data cleaning;
  • The difference between statistically meaningful and practically meaningful;
  • How sometimes companies tweak results to match expectations;
  • Bad managers make you appreciate the good managers;
  • Giving credit, being decent and not cheating;
  • All good teamwork starts with effective communication;
  • You don’t know that the stuff that you know is unknown to others;

Overall, I enjoyed chatting with Andrei, and I hope you’ll enjoy listening to the interview. If you have any comments, feel free to share them here or on the Techie Leadership site.

Is Distributed Work a Divide and Conquer Strategy?

Photo by Markus Spiske on Pexels.com

Before becoming a freelance data scientist, I used to work at Automattic, which I used to regard as my dream job. Not every current and ex-Automattician shares that rosy point of view. Antimattic is an anonymous blog that allows ex-Automattic employees to vent their feelings about what used to be their workplace. One recent post on that blog raises a fascinating question about distributed (or work from home, or remote) companies: “Is Distributed Work a Divide and Conquer Strategy?” I have to admit that I haven’t thought about this perspective before. It looks like we will see more and more companies switching to remote work. It’s an interesting interpretation of the “future of work.”

Obviously this site exists because people have had negative experiences at Automattic. But many people have also had very positive experiences at the company. Could it be that the distributed nature of Automattic allows for such varying experiences? 45 more words

Is Distributed Work a Divide and Conquer Strategy? — Antimattic

Logarithmic scale misinforms. Period

Being a data scientist and a self-proclaimed data visualization expert, I like using log scale graphs when I find them appropriate. However, as a speaker and a communicator, I refrain from using them in presentations as much as possible. From my experience as a data visualization lecturer, I noticed that even “technical” people struggle to grasp the concept of log scale graphs.

One of the Coronavirus side effects was the introduction of the term “exponential growth” to every living room. Naturally (to some of us), exponential growth is best presented using a semi-log graph, where the X-axis represents the time (linear), and the Y-axis represents the order of magnitude of a value (log scale).
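For those who prefer numbers to words, here is a small sketch of why a semi-log graph suits exponential growth: the day-to-day steps of the raw values keep growing, while the steps of their logarithms stay constant, which is exactly what a straight line is. The starting value and the 25% daily growth rate are made up for illustration.

```python
import math


def exponential_series(start, growth_rate, n_days):
    """Values that grow by a fixed percentage every day."""
    return [start * (1 + growth_rate) ** day for day in range(n_days)]


# Hypothetical outbreak: 10 cases on day 0, growing by 25% a day.
cases = exponential_series(start=10, growth_rate=0.25, n_days=8)

# On a linear Y-axis, each step is bigger than the previous one (a curve bending upwards).
linear_steps = [b - a for a, b in zip(cases, cases[1:])]

# On a log Y-axis, every step has the same size (a straight line).
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(cases, cases[1:])]

print("linear steps:", [round(s, 1) for s in linear_steps])
print("log10 steps: ", [round(s, 3) for s in log_steps])
```

The catch is that the reader has to translate equal distances on the Y-axis back into equal ratios, and that is the part most people stumble on.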

A recent study (link) tested and demonstrated how bad log-scale is. The research title is “The Logarithmic Scale Misinforms the Public and Affects Policy Preferences.” From my experience, log scale graphs misinform everybody. Except for experienced data scientists. Nothing can confuse or misinform us, obviously 😉

It is a bummer though that data visualization in that paper sucks so much.

Don’t publish graphs like this. Especially not in data visualization papers.

Thanks to Bella Graph who pointed me to the original study.

Book review: The Year Without Pants. WordPress.com and the future of work by Scott Berkun

TL;DR Interesting “history of work” book (definitely not “future of work”) with insights on transition-state organizations. Read it if history of work is your thing, or if you work in a small company that grows rapidly. 4.5/5 (due to the personal connection)



I got The Year Without Pants in 2014 as an onboarding present when I joined Automattic. The author, Scott Berkun, used to work as a manager at Microsoft (and maybe more places) before he quit and started a career as an adviser and an author. In 2011, the Automattic founder brought Scott to work at the company. About seventy people were working in the company back then, and the company was growing rapidly. Automattic had just introduced the concept of teams, and the idea was that Scott would work as a team leader, consulting the management on how to deal with the transition.

Being an ex-Microsoft manager, Scott was fascinated by the small distributed company, and wrote a book on it, proclaiming that the way Automattic worked was “the future of work”.

The book was published in 2012. Today, in post-COVID 2020, nobody is surprised by people who don’t need to go to the office every day. Automattic now has more than 1,000 employees and has adopted many of the rituals big companies have, such as endless meetings, tedious coordination, name tags, and corporate speak.

Why, then, did I enjoy the book? First, for me, it was a pleasant “time travel.” I enjoyed reading about people I knew, teams I worked with, and practices I used to love or hate. Secondly, this book provides insights on a transition from a small group of like-thinkers to a formalized organization.

“Why it burns when you P” and other statistics rants

Do you sometimes Google for something only to find stuff written by yourself?
I teach a course called “data-based decision making.” While googling for examples of statistics misuse, I stumbled upon an interesting blog post that I wrote about one and a half years ago.

The post is so good; I decided to post it again.

——————————

“Sunday grumpiness”* is an SFW translation of a Hebrew phrase that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.

“Why it burns when you P” is a five-year-old rant about P values. It’s funny, informative, and easy to read.

“Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in the form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”

“Everything wrong with statistics (and how to fix it)” is a one-hour video lecture by Dr. Kristin Lennox, who talks about the same problems. I saw this video and two more talks by Dr. Lennox on a flight. I highly recommend all her videos on YouTube.

“Do You Hate Statistics as Much as Everyone Else?” is an attempt by Nathan Yau (from flowingdata.com) to get thoughtful comments from his knowledgeable readers.

This list will not be complete without the classics:

“Why Most Published Research Findings Are False“, “Mindless Statistics“, and “Cargo Cult Science“. If you haven’t read these three pieces of wisdom, you absolutely should. They will change the way you look at numbers and research.

*The literal meaning of שביזות יום א is Sunday dick-brokenness.

Visualising Odds Ratio — Henry Lau

Besides being a freelance data scientist and visualization expert, I teach. One of the toughest concepts to teach and to visualize is the odds ratio. Today, I stumbled upon a very interesting post that deals exactly with that.
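A quick reminder of what makes the odds ratio so slippery: it is a ratio of odds, not of probabilities, so it is easily mistaken for a risk ratio. The sketch below uses a made-up 2×2 table (the numbers are purely hypothetical and have nothing to do with the ONS data) to show how the two differ.

```python
def risk(events, non_events):
    """Probability of the event: events / everyone."""
    return events / (events + non_events)


def odds(events, non_events):
    """Odds of the event: events / non-events."""
    return events / non_events


# Hypothetical 2x2 table: (events, non-events) in each group.
exposed = (30, 70)
unexposed = (10, 90)

risk_ratio = risk(*exposed) / risk(*unexposed)
odds_ratio = odds(*exposed) / odds(*unexposed)

print(f"risk ratio: {risk_ratio:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")
```

In this toy table, the exposed group is three times as likely to experience the event, yet the odds ratio is about 3.9. The two numbers get close to each other only when the event is rare.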

On Thursday 7 May, the ONS published analysis comparing deaths involving COVID-19 by ethnicity. There’s an excellent summary on twitter but the headline is that when taking into account age and other socio-demographic factors, such as deprivation, household composition, education, health and disability, there is higher risk for some ethnic groups of a COVID related…

Visualising Odds Ratio — Henry Lau

Bad advice from a reputable source is bad advice.

Would you buy a grammar book with a clear spelling mistake on its cover? I hope not. That’s what happened to IBM when it published its new data visualization guide. I didn’t bother reading the manual because of what IBM decided to use as the first image of their guide.

We use graphs to transfer information into images that are supposed to be later transformed in our brains back into information. What visual attributes do we use to interpret the information behind a pie chart? Is it the segment angle, its area, or maybe the arc length? Most probably, the answer is “all of the above” (see Robert Kosara’s works for more info). When done right, the three attributes of pie segments are linearly connected to one another, which allows synergism between the visual cues.
But what did our friends at IBM do? They deliberately distorted the data! I took the screenshot from the guide homepage and made some measurements.
The purple segment has an angle of 182 degrees, and the angle of the black segment is 75 degrees, which gives us a ratio of 2.43. However, while the radius of the purple segment is 135 pixels, the radius of the black one is only 110 pixels. Why is this a problem? Well, due to the radius differences, the ratio between the arc lengths is 2.98, and the ratio between the areas is 3.66. So now, let me ask you: what is the ratio between the numbers represented by the purple and the black segments?
True, the colors that the IBM people used in their guide are neat, but data visualization that distorts information is not visualization but a piece of garbage. I assume that IBM produces decent computers, but don’t learn data visualization from them.
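If you want to check the arithmetic yourself, here is a short sketch that recomputes the three ratios from the measurements quoted above (angles in degrees, radii in pixels). The function name is mine, and the results may differ in the last digit from the figures in the text, depending on rounding.

```python
import math


def sector_metrics(angle_deg, radius):
    """Angle, arc length, and area of a pie segment (a circular sector)."""
    theta = math.radians(angle_deg)
    return {
        "angle": angle_deg,
        "arc": radius * theta,              # arc length = r * theta
        "area": 0.5 * radius ** 2 * theta,  # sector area = r^2 * theta / 2
    }


# Measurements taken from the screenshot of the guide's first image.
purple = sector_metrics(angle_deg=182, radius=135)
black = sector_metrics(angle_deg=75, radius=110)

ratios = {name: purple[name] / black[name] for name in purple}

for name, value in ratios.items():
    print(f"{name} ratio: {value:.2f}")
```

With equal radii, all three ratios would be identical. Here, the arc ratio is inflated by the radius ratio (135/110), and the area ratio by its square, so the reader gets three different answers to the same question.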

Finally We May Have a Path to the Fundamental Theory of Physics… and It’s Beautiful — Stephen Wolfram Blog

OK, so Stephen Wolfram (a mega celebrity in the computational intelligence world and, among other things, a physicist) claims that he may have found a path to the Fundamental Theory of Physics. The blog post is long, and I hope to be able to finish reading it in a week or two. The accompanying technical text is a 450-page tome available on a dedicated site.

Also, it turns out that Stephen Wolfram has a Twitch.tv channel in which he talks about science.

Website: Wolfram Physics Project Technical Intro: A Class of Models with the Potential to Represent Fundamental Physics How We Got Here: The Backstory of the Wolfram Physics Project… 26,455 more words

Finally We May Have a Path to the Fundamental Theory of Physics… and It’s Beautiful — Stephen Wolfram Blog

The quintessence of data visualization usefulness

I have to admit, I was skeptical at the beginning of the COVID-19 crisis. I started becoming skeptical again now that it seems that the crisis didn’t hit my country too hard. But then I saw the graphs in this Financial Times article, and the skepticism disappeared. The graphs are accompanied by hundreds of words, but there is no need to read the text to understand almost everything.

These graphs are so good, so convincing, so well performed, they don’t leave any place for doubt or misunderstanding of the message the author wants to convey.

If you study data visualization, look at these graphs. Look at the color choice, legend location, and design. Look at the ticks on the X- and Y-axes, how they are spaced and typeset. Note the amount of detail on the axes, specifically how sparse it is.

Book review: Never Split the Difference by Chris Voss

TL;DR: Dull on the surface but has a lot of good points


I read Never Split the Difference following a friend’s recommendation. While reading the book, I kept feeling a constant sense of disappointment and mental eye-rolling. The author, Chris Voss, is a former FBI negotiator. The book is full of FBI war stories and pieces of advice that, on the face of it, sound either trivial or well known. HOWEVER, when the book was over, I sat summarizing my Kindle notes. Forty-five minutes later, I found myself staring at six handwritten pages of notes and takeaways. Which, surely, is a good sign.

What I didn’t like: too many “war stories” from the author’s past as an FBI negotiator; their connection to the business world sometimes seems too far-fetched.

What I liked: I liked the overall approach. Sometimes, the author cites academic research. Again, the fact that I took so many notes, is very impressive (to me).

The bottom line: 4/5 Read it, even if you already read a negotiation book.

Why is forecasting s-curves hard?

Constance Crozier (@clcrozier on Twitter) shared an interesting simulation in which she tried to fit a sigmoid curve (s-curve) to predict a plateau in a time-series. The result was a very intuitive and convincing animation that shows how wrong her initial forecasts were.

As a matter of fact, this phenomenon is not new at all. My first post-University job involved fitting numerous pharmacodynamics models. We always had to keep in mind that if the available data does not account for at least 95% of the maximum effect, the model will be very much suboptimal. It took me a while, but I managed to find the reference for this phenomenon [here]. Maybe, when I have some time, I will repeat Constance Crozier’s analysis, and add confidence intervals to emphasize the point.

EDIT: I came to the conclusion that the most important takeaway message of this demonstration is the necessity of reporting uncertainty with any forecast, and how small the value of a forecast is without uncertainty estimations.
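To get a feeling for why early data says so little about the plateau, here is a toy sketch. It is not Constance Crozier’s simulation; it simply compares two logistic curves that share the same starting value and growth rate (both made up) but whose plateaus differ tenfold.

```python
import math


def logistic(t, plateau, y0=1.0, rate=0.5):
    """Logistic (s-curve) value at time t, starting at y0 and levelling off at plateau."""
    return plateau / (1 + (plateau / y0 - 1) * math.exp(-rate * t))


early_days = range(5)

# Same start, same growth rate, but a tenfold difference in the final plateau.
low = [logistic(t, plateau=100) for t in early_days]
high = [logistic(t, plateau=1000) for t in early_days]

# Largest relative gap between the two curves over the early observations.
max_rel_gap = max(abs(h - l) / h for l, h in zip(low, high))

for t, l, h in zip(early_days, low, high):
    print(f"t={t}: plateau 100 -> {l:.2f}, plateau 1000 -> {h:.2f}")
print(f"largest relative gap in the early data: {max_rel_gap:.1%}")
```

During the first five time points, the two curves differ by only a few percent, which a modest amount of noise would easily hide. Any fit based on such data can land on either plateau, which is why a forecast without uncertainty estimates is worth so little.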

S-curves (or sigmoid functions) are commonly used to model the evolution of social or biological systems over time [1]. These functions start with exponential growth, then increase linearly, and finally level off (therefore end up looking like a wonky s). Many things that we think of as exponential functions will actually follow an s-curve (otherwise […]

Forecasting s-curves is hard — Constance Crozier

On organizing a data org in a company, job titles, and more

Photo by Khimish Sharma on Pexels.com

My colleague, Simon Ouderkirk, recorded a REALLY interesting interview with Stephen Levin of Zapier and Emilie Schario of GitLab on organizing a data org in a company, job titles, career ladders, and other important stuff.

As y’all may recall, last year I was lucky enough to spens some time working with the fine folks at Locally Optimistic to produce and run some AMA content for them – they ended up being more similar to traditional interviews, but folks seemed to enjoy them! You can find those all here! These were […]

I’m Giving Video Content a Try! — Simon Ouderkirk

If there is only one document you can read about data visualization, this is the one

I’m sorting my teaching material, and I found this gem. The UK Government Statistical Service published a guideline for effective data visualization and tables. If you know a busy person who doesn’t have time to study data visualization and can only read one document, this document is for them (it has fewer than 40 pages, full of examples). Click on the image above to go to the guideline.

Everything is NOT just fine (repost)

My job wasn’t affected by the COVID madness in almost any way. I used to work from home before, and I work from home now. None of my customers cancelled any projects, the health system in Israel is still functioning, all of my relatives are in good health, everything is just fine! I know how unusual I am in the current world, with the skyrocketing unemployment, non-functioning governments, and three-digit body counts. I was about to write about that, but then I read AnnMaria’s post.

You should read it too

I’ve read a lot of cheery tweets that said something like, “Buffy, Biff and I are isolated at home with our terrier, Boo. Here’s a picture. Isn’t he cute? We played card games, then I baked this three-course meal I saw on Pinterest. Biff is taking this time to finally become proficient in Mandarin with…

Everything is NOT just fine — AnnMaria’s Blog

Blogging isn’t what it used to be. Podcasting is on the rise

Photo by Magda Ehlers on Pexels.com

More than two years ago, I took a look at Google Trends for three phrases “start a blog”, “create a blog”, and  “create a site”. I was surprised by the high volume of blog searches, compared to “create a site”.

Today, I decided to go back to Google Trends and to add the new rising star: podcasting. 

It looks like podcasting is starting its exponential growth, while blogging continues its slow but steady decline. I won’t be surprised if, in 2022, the green podcasting line surpasses the other lines in this graph. Let’s wait and see.

A super-important read on the COVID-19 situation. I'm finally convinced

Until now, I was very sceptical about the COVID-19 measures taken by many governments around the world, especially the Israeli one. Today, finally, I read a post that addressed the issues I was pointing to:

  1. This first lockdown will last for months, which seems unacceptable for many people.
  2. A months-long lockdown would destroy the economy.
  3. It wouldn’t even solve the problem, because we would be just postponing the epidemic: later on, once we release the social distancing measures, people will still get infected in the millions and die.
  4. My biggest concern: Either a lot of people die soon and we don’t hurt the economy today, or we hurt the economy today, just to postpone the deaths.

There’s no point in rephrasing the original post here; just go and read it. I’m convinced. Thank you, Tomas Pueyo.

Go and read. The image is clickable

The single most important thing about remote 1:1 meetings

The COVID-19 lockdown forced many organizations into a remote work mode. Recently, I spoke with three managers from three “conventional” companies, and all three told me how surprisingly efficient their 1:1 meetings had become. This is how one of them described the situation: “I prepare the agenda, we log in, boom, boom, boom, and we are done”.

The effectiveness of distributed work doesn’t surprise me. After all, I have been working in a distributed mode for about six years now. However, this super-efficiency has its own problems that one needs to be aware of. Here’s the thing. We humans are social creatures. We depend on social interactions for our mental and physical well-being. When people share the same physical office, they have enough social interactions “in-between”: in the hallway, next to the watercooler, and in the parking lot. However, working in a distributed team creates isolation. That is why it is very important to start and end every meeting with a personal conversation. It is also important to make sure that the meeting feels as personal as possible. To do so, place the chat window below the camera, so that the person feels as if you are looking at them. During the conversation, resist the urge to check emails, read your Facebook feed, or check my blog. Make the personal meeting personal, even if it’s remote.

I have been working in distributed teams for about six years. If you need advice on how to make the transition easier for your organization, I’ll be glad to give one (or two).

An interesting solution to the data giraffe problem

Photo by Pixabay on Pexels.com

A data giraffe is a situation where a very prominent data point overshadows everything else. I learned this term from a post by Pini Yakuel and immediately liked it a lot.

Taken from https://www.optimove.com/blog/beware-the-giraffes-in-your-data

Dealing with data giraffes is hard, especially when dealing with bar charts. Today I saw one interesting approach to this problem.

Katherine S. Rowell is a co-founder of a Boston firm that specializes in data visualization. In December, she published a post dedicated to one of the most popular but also most abused graph types, the bar chart. One of the examples in her post demonstrates a nice treatment of data giraffes.

http://ksrowell.com/blog-visualizing-data/2019/12/18/bar-humbug/

In this example, Katherine draws the graph twice. The zoomed-out version shows the giraffes in all their glory, while the zoomed-in one gives the spotlight to the foxes, hyenas, and mice.
Also, note how these graphs respect the rule that every bar chart has to include zero.

Another piece of career advice

Here’s another email that I got with a question about switching to a data science career:

Hello, my name is X. I saw your blog, and to be honest, I said, “Wow, is this me :)” I’m a pharmacist 5th-grade student currently working on a project in computational drug design. I started programming, and I loved it. After that, I heard the term “Data Science” and started to do some research […]

Basically, I loved being on a computer and solving problems its a good career option for me (at least for now, you can’t predict future) my mom has a pharmacy I worked there (internship), and it is not for me (i am counting the time when I’m in a pharmacy.) so I have a few questions for you

I don’t have any degree in statistics or CS or something equivalent I am determined to learn these topics, but some people want to see the degree, and probably no one accept a pharmacist to a master degree in statistics (I also wish to do my Ms in computational drug design because, in the end, I don’t want to be a data scientist in social sciences or economics, at least for now, I want to use that knowledge in my field which is drugs and pharmaceuticals)

Ph.D. on Bioinformatics would help ? or Biostatistics ( is it easier for us to be accepted in biostatistics rather than statistics? To be honest, I don’t know the difference much, I took a biostatistics class, but it was just one semester and probably not enough for Ph.D. :))

Do I really need a degree in CS or statistics to be a pharmaceutical data scientist? I want to do my Ph.D. but also want to be realistic, it sounds amazing doing online masters in statistics while you are doing Computational drug design or Bİoinformatics Ph.D., but it is very hard and frustrating and also decrease your productivity in both fields.

I asked a lot of questions, sorry, but I have many :). You can reply when you have time. Thank you, and I loved your blog. I read and watched tons of things, but yours was the best suited for me because being a pharmacist, computational drug design, considering bioinformatics, it is all fits. By the way, I also considering cybersecurity (not working in a company but learning). I see that as a “martial arts of the future,” maybe I am wrong, but a person should know it to protect him/her self. Thank you again 🙂

Indeed, X’s background sounds very much like mine.
I’m not sure I have too much to add to what I already wrote here, in this blog. The only thing that I have to say is that in my biased opinion, a Ph.D. is something worth pursuing. The more time passes by, the more Ph.Ds there are, and the lack of a degree might be a problem in the future job market. On the other hand, there are many smart and rich people who claim that university degrees are a waste of time. Go figure 🙂

I hope that this helps.

No signs (yet?) of the COVID-19 pandemic on StackOverflow job postings

I suppose that you know that THE software development Q&A site has its own job board. I suspected that the Corona pandemic would lead to a sharp decrease in the number of job postings on that board. I scraped the data, and it looks like, for now, there are no drastic changes in the number of postings published in the last couple of days.

The cardiovascular safety of antiobesity drugs—analysis of signals in the FDA Adverse Event Report System Database

I am glad and proud to announce that a paper which I helped to prepare and publish is available on the Nature group’s site.

The paper, The cardiovascular safety of antiobesity drugs—analysis of signals in the FDA Adverse Event Report System Database, by Einat Gorelik et al. (including myself) analyzes the data in the FDA Adverse Event Reporting System (FAERS). In this study, we found interesting and relevant safety information about the long-term safety of the antiobesity drug Lorcaserin. Due to the interdisciplinary nature of the paper, the review process took about a year. Interestingly enough, the FDA requested the withdrawal of Lorcaserin due to long-term safety issues but not the ones we studied.

https://doi.org/10.1038/s41366-020-0544-4

Tips for making remote presentations

Before becoming a freelance data scientist, I used to work in a distributed company. Remote communication, including remote presentations, was the norm for me long before the remote work experiment no one asked for. In this post, I share some tips for delivering better presentations remotely.

Me presenting in front of the computer

  • Stand up! Usually, we stand up when we present in front of a live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk which allows me to stand up and to raise the camera to my face level.
  • If you can’t raise the camera, stay sitting. You don’t want your audience staring at your groin.
  • I always use a presentation remote control. It frees me up and lets me move more naturally. My remote is almost ten years old, and I have a strong emotional attachment to it.
  • When presenting, it is very important to see your audience. Use two monitors. Use one monitor for screen sharing, and the other one to see the audience.
  • Put the Skype/Zoom/whatever window that shows your audience under the camera. This way you’ll look most natural on the other side of the teleconference.
  • Starting a presentation in PowerPoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up and down buttons on my presentation remote control work with the Reader. The “make screen black” button doesn’t.
  • I open a “lightable view” of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes using a “real” presentation program, but it is good enough.

Auditorium in Chisinau showing me on their screen

  • Make a dry run. Ideally, the dry run should be a day or two before the event, to make sure all the technical problems are fixed.
  • Go online at least five minutes before the scheduled time. Be in front of the camera; don’t let the audience stare at your empty room.
  • Make sure nothing in your background will embarrass you. This risk is especially high if you present from home or a hotel. Nobody needs to see your bed during a business meeting.

One idea per slide. It’s not that complicated

A lot of texts that talk about presentation design cite a very clear rule: each slide has to contain only one idea. Here’s a slide from a presentation deck that says just that.

And here’s the next slide in the same presentation

Can you count how many ideas there are on this slide? I see four of them.

Can we do better?

First of all, we need to remember that most of the time, the slides accompany the presenters and do not replace them. This means that you don’t have to put everything you say on a slide. In our case, you can simply show the first slide and give more details orally. On the other hand, let’s face it, presenters often use slides to remind themselves of what they want to say.

So, if you need to expand your idea, split the sub-ideas into slides.

You can add some nice illustrations to connect the information and emotion. 

Making it more technical

“Yo!”, I can hear you saying, “Motivational slides are one thing, and a technical presentation is a completely different thing! Also,” you continue, “We have things to do, we don’t have time to search the net for cute pics”. I hear you. So let me try improving a fairly technical slide, a slide that presents different types of machine learning.
Does a slide like this look familiar to you?

First of all, the easiest solution is to split the ideas into individual slides.

It was simple, wasn’t it? The result is so much more digestible! Plus, the frequent changes of slides help your audience stay awake.

Here’s another, more graphical attempt

When I show the first slide in the deck above, I tell my audience that I am about to talk about different machine learning algorithms. Then, I switch to the next slide, talk about the first algorithm, then about the next one, and then mention the “others”. In this approach, each slide has only one idea. Notice also how the titles in these last slides are smaller than the contents. In these slides, they are used for navigation and are therefore less important.  In the last slide, I got a bit crazy and added so much information that everybody understands that this information isn’t meant to be read but rather serves as an illustration. This is a risky approach, I admit, but it’s worth testing.

To sum up

“One idea per slide” means one idea per slide. The simplest way to enforce this rule is to devote one slide to each sentence. Remember, adding slides is free; the audience’s attention is not.

5 Basics of Consulting Success: Part 1

Being a data science freelancer and a long-time fan of AnnMaria, I HAVE to repost her latest post on consulting success.

Last week, I mentioned that successful consultants have five categories of skills; communication, testing, statistics, programming and generalist. COMMUNICATION Communication is the number one most important skill. All five are necessary to some extent, but a terrific communicator with mediocre statistical analysis skills will get more business than a stellar statistician that can’t communicate. Communication…

5 Basics of Consulting Success: Part 1 — AnnMaria’s Blog

Career advice. A clinical pharmacist, epidemiologist, and a Ph.D. student wants to become a data scientist.

Photo by Pixabay on Pexels.com

From time to time, I get emails from people who seek advice in their career paths. If I have time, I write them an extended reply and if they agree, I publish the questions and my replies here, in my blog. Here’s one such email exchange. All similar pieces of advice, as well as other rants about a career in data science, can be found here.

“Hi Boris 🙂
My name is XXXXX. I came across your blog while searching for people with a mix of pharmacy and data science skillsets. Your blog has been so informative to me so far but I was compelled to write to you to ask for your advice.
I am a clinical pharmacist by background but decided to leave the clinical pharmacy to pursue public health. Whilst doing my MPH, I fell in love with epidemiology and statistics and am now doing a Ph.D. in biostatistics. Your blog has made me feel very happy that I made this career move <…>  I feel better about my decision to leave the pharmacy and pursue a quant Ph.D. I have gone from pharmacy, to internships at <YYYY> as I wanted to pursue a career in <ZZZZZ> and now I am thinking of data science in the tech industry…my background is a bit confusing!”

In the past, I also felt that the pharmacy degree was confusing many potential employers, and since I wanted to leave the bio/pharma world and move to “pure data” positions, I omitted the B.Pharm title & studies from my CV. Ten years ago, the salaries in the bio sector, here in Israel, were much lower than the salaries in the “high tech” field. I think that today this situation is more or less normalized and that the people got used to the fact that a typical “data scientist” can have a very wide range of degrees.

“I was just wondering if I could get your opinion on the three questions I have. 
1. I work part-time as a clinical pharmacist to not forget my clinical skills. What do you think about the future of the pharmacy career overall?”

My last shift as a pharmacist

This is a huge question and I don’t have answers to it. Moreover, the answer depends heavily on legal regulations in the given country. I say that if you enjoy treating people, and can afford this time, why not? I, personally, was a very lousy pharmacist 🙂 so I was very happy to leave the pharmacy.

“I am wondering if I should keep up my pharmacist title or pursue data science full-time.”

Again, it depends. For many years, I didn’t have my pharmacy title in my CV because it felt unrelated to what I was doing. It was also a nice icebreaker to tell people with whom I worked “by the way, I’m a pharmacist”, and it was fun to see their reactions. If I were you, I would ask two or three HR people or people who recruit employees what they think. Different countries may behave differently.

“2. At what point can someone call themselves a data scientist?”

In my opinion, as long as you are comfortable enough to call yourself a data scientist, you are good to go. Note that unlike many people who got their data science “title” after taking some online courses, you already have a very strong theoretical base. Not only are your Master’s and the future Ph.D. degree relevant to data science, but they also give you strong and unique advantages. 

“I am looking at DS jobs at large tech companies. I am not sure how qualified and experienced I have to be for these jobs. I code in R using regression, clustering and time series methods and I am quite fluent in this language. I have just started to learn ML algorithms. I have a basic foundation in Python and SQL. I use Tableau for visualization and love communicating my research at any opportunity I get. I was wondering…how good do I have to be able to apply to DS jobs? What are the methods that data scientists use mostly? Would I be able to learn on the job?”

It sounds like a good combination of techniques. I am not recruiting, but if I were, I would definitely like this list of skills. Personally, I don’t like R too much and prefer Python. But once you program in one language, moving to another one is a doable task. As to which methods data scientists use most, this hugely depends on your job. Most of my time, I clean data and write wrapper functions around known algorithms. The tasks that I have faced during my professional life required regression, classification, and network analysis. I never did real deep learning stuff, but I know people who only do deep learning for image and sound analysis. Also, in many cases, the data science part takes only 10% of your time because the “customer” doesn’t care about an algorithm; they want a solution. See this post for a nice example.

“3. If you had the opportunity to start your career again, say you were in your early twenties, what would you study and why? What advice would you have for your younger self? I would be so keen to hear what you think.”

It’s a philosophical task which I never like doing. What is done is done. The fact that I am a pretty successful data scientist may mean that I took the right decisions or that I was super lucky. 

Not a wasted time

Photo by Pixabay on Pexels.com

Being a freelance data scientist, I get to talk to people about proposals that don’t materialize into projects. These conversations take time, but strangely enough, I enjoy them very much. I also find them educational. How else could I have learned about business model X, or what really happens behind the scenes of company Y?

Which coffee is this?

Photo by samer daboul on Pexels.com

Gilad Almosnino is an internationalization expert. I’m reading his post “Eight emojis that will create a more inclusive experience for Middle Eastern markets,” in which he mentions “Turkish or Arabic Coffee,” which reminded me of my last visit to Athens. When, in one restaurant, I asked for a Turkish coffee, the waiter looked at me harshly and said: “It’s not Turkish coffee; it’s Greek coffee!”

Turkish, Arabic, or Greek

Further Research is Needed

Do you believe in telepathy? Yesterday, I submitted the final proofs of a paper in which I actively participated. During the proofreading, I noticed that our abstract ends with “further research is needed” and scratched my head. I submitted the proofs, and then I saw this pearl in my blog feed.

Further Research is Needed — xkcd.com

Book review: Great mental models by Shane Parrish

TL;DR shallow and disappointing

The Great Mental Models by Shane Parrish was highly praised by Automattic’s CEO Matt Mullenweg. Since I appreciate Matt’s opinion a lot, I decided to buy the book. I read it and was disappointed.

This book is very ambitious yet shallow and non-engaging. If you consider reading a book on mental models, then chances are you already know some of them. I expected the book to shed light on aspects I didn’t know or didn’t think of. Nothing like that happened. I didn’t learn new facts, nor was I impressed by a new way of thinking. I also think that this book won’t do the job for teenagers who don’t yet have an arsenal of mental models; for them, this book is full of unclear shortcuts.

The book is based on the materials of a highly praised blog, fs.blog, and is a good example of how some stuff can work well as a blog post but feel bad as a book.

The bottom line: 2/5 Skip it.

TicToc — a flexible and straightforward stopwatch library for Python.

Photo by Brett Sayles on Pexels.com

Many years ago, I needed a way to measure execution times. I didn’t like the existing solutions, so I wrote my own class. As time passed, I added small changes and improvements, and recently, I decided to publish the code on GitHub, first as a gist, and now as a full-featured GitHub repository and a pip package.

TicToc – a simple way to measure execution time

TicToc provides a simple mechanism to measure the wall time (a stopwatch) with reasonable accuracy.

Create an object. Run tic() to start the timer, toc() to stop it. Repeated tic-tocs will accumulate time. The tic-toc pair is useful in interactive environments such as the shell or a notebook. Whenever toc is called, a useful message is automatically printed to stdout. For non-interactive purposes, use start and stop, as they are less verbose.

Following is an example of how to use TicToc:

Usage examples

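import time

from tictoc import TicToc  # assumed import path; check the package README


def leibniz_pi(n):
    # Approximate pi using the first n million terms of the Leibniz series
    ret = 0
    for i in range(n * 1000000):
        ret += ((4.0 * (-1) ** i) / (2 * i + 1))
    return ret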

tt_overall = TicToc('overall')  # started  by default
tt_cumulative = TicToc('cumulative', start=False)
for iteration in range(1, 4):
    tt_cumulative.start()
    tt_current = TicToc('current')
    pi = leibniz_pi(iteration)
    tt_current.stop()
    tt_cumulative.stop()
    time.sleep(0.01)  # this interval will not be accounted for by `tt_cumulative`
    print(
        f'Iteration {iteration}: pi={pi:.9}. '
        f'The computation took {tt_current.running_time():.2f} seconds. '
        f'Running time is {tt_overall.running_time():.2f} seconds'
    )
tt_overall.stop()
print(tt_overall)
print(tt_cumulative)

TicToc objects are created in a “running” state, i.e., you don’t have to start them using tic. To change this default behaviour, use

tt = TicToc(start=False)
# do some stuff
# when ready
tt.tic()
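
If you are curious how such a stopwatch can accumulate time across repeated start-stop pairs, here is a minimal sketch of the idea. This is not the actual TicToc code: the class name `SimpleStopwatch` and its internals are made up for illustration, and only the method names `start`, `stop`, and `running_time` mirror the ones used in the example above.

```python
import time


class SimpleStopwatch:
    """Illustrative accumulating stopwatch (not the real TicToc class)."""

    def __init__(self, name='', start=True):
        self.name = name
        self._elapsed = 0.0       # time accumulated by completed start-stop pairs
        self._started_at = None   # None means "not running"
        if start:
            self.start()

    def start(self):
        # Starting an already running stopwatch is a no-op
        if self._started_at is None:
            self._started_at = time.perf_counter()

    def stop(self):
        # Add the current interval to the accumulated total
        if self._started_at is not None:
            self._elapsed += time.perf_counter() - self._started_at
            self._started_at = None
        return self._elapsed

    def running_time(self):
        # Accumulated time plus the interval in progress, if any
        if self._started_at is None:
            return self._elapsed
        return self._elapsed + (time.perf_counter() - self._started_at)


sw = SimpleStopwatch('demo', start=False)
for _ in range(3):
    sw.start()
    time.sleep(0.01)   # counted
    sw.stop()
    time.sleep(0.01)   # not counted: the stopwatch is stopped
print(f'{sw.name}: {sw.running_time():.3f} seconds')
```

The sketch uses `time.perf_counter()` because it is a monotonic, high-resolution clock, which suits wall-time measurements of short intervals.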

Installation

Install the package using pip

pip install tictoc-borisgorelik

Dispute for the sake of Heaven, or why it’s OK to have a loud argument with your co-worker

Any dispute that is for the sake of Heaven is destined to endure; one that is not for the sake of Heaven is not destined to endure
Chapters of the Fathers 5:27

One day, I had an intense argument with a colleague at my previous place of work, Automattic. Since most of the communication in Automattic happens in internal blogs that are visible to the entire company, this was a public dispute. In a matter of a couple of hours, some people contacted me privately on Slack. They told me that the message exchange sounded aggressive, both from my side and from the side of my counterpart. I didn’t feel that way. In this post, I want to explain why it is OK to have a loud argument with your co-workers.

How did it all begin?

I’m a data scientist and algorithm developer. I like doing data science and developing algorithms. Sometimes, to be better at my job, I need to show my work to my colleagues. In a “regular” company, I would ask my colleagues to step into my office and play with my models. Automattic isn’t a “regular” company. At Automattic, people work from more than sixty countries, in every possible time zone. So, I wanted to start a server that would be visible to everyone in the company (and only to them), that would have access to the relevant data, and that would be able to run any software I installed on it.

Two bees fighting

X is a system administrator. He likes administrating the systems that serve more than 2000,000,000 unique visitors in the US alone. To be good at his job, X needs to make sure no bad things happen to the systems. That’s why when X saw my request for the new setup (made on a company-visible blog page), his response was, more or less, “Please tell me why do you think you need this, and why can’t you manage with what you already have.”

Frankly, I was furious. Usually, they tell you to count to ten before answering someone who made you angry. Instead, I went to my mother-in-law’s birthday party, and then I wrote an answer (again, in a company-visible blog). The answer was, more or less, “because I know what I’m doing.” To which X replied, more or less, “I know what I’m doing too.”

How did it get resolved?

At this point, I started realizing that X is not expected to jeopardize his professional reputation for the sake of my professional aspirations. It is true that I wanted to test a new algorithm that would bring a lot of value to the company for which I worked. It is also true that X doesn’t refuse developers’ requests out of caprice. His job is to keep the entire system working. Coincidentally, X contacted me over Slack, so I took the opportunity to apologize for something that sounded like aggression from my side. I was pleased to hear that X didn’t notice any hostility, so we were good.

What eventually happened and was the dispute avoidable?

I don’t know whether it was possible to achieve the same or a better result without the loud argument. I admit: I was angry when I wrote some of the things that I wrote. However, I wasn’t mad at X as a person. I was angry because I thought I knew what was best for the company, and someone interfered with my plans.

I assume that X was angry when he wrote some of the things he wrote. I also believe that he wasn’t angry at me as a person but because he knew what was best for the company, and someone tried to interfere with his plans.

I’m sure, though, that it was this argument that enabled us to define the main “pain” points for both sides of the dispute. As long as the dispute was about ideas, not people, and as long as the dispute’s goal was the common good, it was worth it. To my current and future colleagues: if you hear me arguing loudly, please know that this is a “dispute that is for the sake of Heaven [that] is destined to endure.”


Featured image: Source: http://mimiandeunice.com/; Bees image: Photo by Flickr user silangel, modified. Under the CC-BY-NC license.

The difference between Python decorators and inheritance that cost me three hours of hair-pulling

Photo by Genaro Servín on Pexels.com

I don’t have much hair on my head, but recently, I encountered a funny peculiarity in Python that had me pulling my hair out for a couple of hours. In retrospect, this feature makes a lot of sense. In retrospect.

First, let’s start with the mental model that I had in my head: inheritance.

Let’s say you have a base class that defines a function `f`

Now, you inherit from that class and rewrite f

What happens? The fact that you defined f in ClassB means that, to a rough approximation, the old definition of f from ClassA does not exist in any ClassB object.
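
Here is a minimal sketch of this mental model. The class and function names follow the text; the returned strings are made up for illustration.

```python
class ClassA:
    def f(self):
        return 'f from ClassA'


class ClassB(ClassA):
    def f(self):  # overrides ClassA.f for every ClassB object
        return 'f from ClassB'


print(ClassA().f())  # f from ClassA
print(ClassB().f())  # f from ClassB
```

The “rough approximation” part: the old definition is still reachable explicitly, for example through `super().f()` inside ClassB, but a plain call to `f` on a ClassB object never lands on it.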

Now, let’s go to decorators.

from dataclasses import dataclass

from dataclasses_json import dataclass_json


@dataclass_json
@dataclass
class Message2:
    message: str
    weight: int
    def to_dict(self, encode_json=False):
        print('Custom to_dict')
        ret = {'MESSAGE': self.message, 'WEIGHT': self.weight}
        return ret
m2 = Message2('m2', 2)

What happened here? I used a decorator `dataclass_json` that, among other things, provides a `to_dict` function to Python’s data classes. I created a class `Message2`, but I needed a custom `to_dict` definition. So, naturally, I defined a new version of `to_dict` only to discover several hours later that the new `to_dict` doesn’t exist.

Do you get the point already? In inheritance, the custom implementations are added ON TOP of the base class. However, when you apply a decorator to a class, your class’s custom code is BELOW the one provided by the decorator. Therefore, you don’t override the decorating code but rather “underride” it (i.e., give it something it can replace).
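
The same effect can be reproduced without any third-party package. The sketch below uses a made-up class decorator, `add_to_dict`, that attaches a generic `to_dict` to whatever class it receives. It only illustrates the mechanism and makes no claims about how `dataclass_json` is implemented internally.

```python
def add_to_dict(cls):
    """Hypothetical class decorator that attaches a generic to_dict."""
    def to_dict(self):
        return dict(vars(self))
    # The decorator runs AFTER the class body, so this assignment
    # replaces any to_dict that the class defined for itself.
    cls.to_dict = to_dict
    return cls


@add_to_dict
class Decorated:
    def __init__(self, message):
        self.message = message

    def to_dict(self):  # "underridden": the decorator replaces it
        return {'MESSAGE': self.message}


# One possible workaround: decorate a base class, customize in a subclass
@add_to_dict
class GeneratedBase:
    def __init__(self, message):
        self.message = message


class Workaround(GeneratedBase):
    def to_dict(self):  # a regular override, on top of the generated one
        return {'MESSAGE': self.message}


print(Decorated('hi').to_dict())   # {'message': 'hi'}
print(Workaround('hi').to_dict())  # {'MESSAGE': 'hi'}
```

The order of events explains everything: the class body is executed first, and only then is the finished class handed to the decorator, which is free to overwrite whatever it finds there.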

As I said, it makes perfect sense, but still, I missed it. I don’t know whether I would have managed to find the solution without Stack Overflow.

The first things a statistical consultant needs to know — AnnMaria’s Blog

You know that I’m a data science consultant now, don’t you? You know that AnnMaria De Mars, Ph.D. (the statistician, game developer, and world judo champion) is one of my favorite bloggers, and that her blog is the second blog I started to follow, don’t you?

A couple of months ago, AnnMaria wrote an extensive post about 30 things she learned in 30 years as a statistical consultant. One week ago, she wrote another great piece of advice.

I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/ April. While I will be talking a little bit about factor analysis, repeated measures ANOVA and logistic regression, that is the end of my talk. The first things a statistical consultant should know don’t have much to do with…

The first things a statistical consultant needs to know — AnnMaria’s Blog

Book review. Replay by Ken Grimwood

TL;DR: excellent fiction reading, makes you think about your life choices. 5/5

book cover of "Replay" by Ken Grimwood

“Replay” by Ken Grimwood is the first fiction book that I have read in ages. The book is about a forty-three-year-old man with a failing family and a boring career. The man suddenly dies and re-appears in his own eighteen-year-old body. He then lives his life again, using the knowledge of his future self. Then he dies again, and again, and again.
I liked the concept (reminded me of the Groundhog Day movie). The book managed to “suck me in,” and I finished it in two days. It also made me think hard about my life choices. I think that my decision to quit and become a freelancer was partially affected by this book.

What did I not like? Some parts of the book are somewhat pornographic. It doesn’t bother me per se, but I think the plot would stay as good as it is without those parts. Also, I find it a little bit sad that every reincarnation in “Replay” starts with making easy money. Not that I don’t like money; it just makes me sad.

Photo of my kindle with text from "Replay" by Ken Grimwood

Bottom line: Read! 5/5

(Read in Nov 2019)

ASCII histograms are quick, easy to use and to implement

From time to time, we need to look at the distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when we did most of our work in the console, and when creating a plot from Python required too many lines of boilerplate code, I found a neat function that produced histograms using ASCII characters.

Of course, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII histograms are useful is when you write a log file for an iterative process. A quick glance at the log file will tell you when the distribution of some scoring function has converged.

That is why I have kept my version of asciihist updated since 2005. You can find it on GitHub here.
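To illustrate the idea, here is a minimal sketch of an ASCII histogram in plain Python. This is not the asciihist code mentioned above; the function name ascii_hist and its parameters (bins, width, char) are assumptions made up for this example.

```python
import random


def ascii_hist(values, bins=10, width=40, char="#"):
    """Return a multi-line ASCII histogram of `values` as a string.

    Hypothetical helper for illustration; not the asciihist implementation.
    """
    values = list(values)
    if not values:
        return "(no data)"
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    counts = [0] * bins
    for v in values:
        # Map the value to a bin index; the maximum falls into the last bin.
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        left = lo + span * i / bins  # left edge of the bin
        bar = char * round(width * count / peak)  # bar scaled to the tallest bin
        lines.append(f"{left:10.3f} | {bar} {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    random.seed(0)
    scores = [random.gauss(0, 1) for _ in range(1000)]
    print(ascii_hist(scores, bins=12))
```

Because the function returns an ordinary string, the same output can be written to a log file at every iteration of a long-running process and compared by eye later.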

How I got a dream job in a distributed company and why I am leaving it

One night, in January 2014, I came back home from work after spending two hours commuting in each direction. I was frustrated and started Googling for “work from home” companies. After a couple of minutes, I arrived at https://automattic.com/work-with-us/. To my surprise, I couldn’t find any job postings for data scientists, and a quick LinkedIn search revealed no data scientists at Automattic. So I decided to write a somewhat arrogant letter titled “Why you should call me?”. After reading the draft, I decided that it was too arrogant and kept it in my Drafts folder so that I could sleep on it. A couple of days later, I decided to delete that email. HOWEVER, entirely unintentionally, I hit the send button. That’s how I became the first data scientist hired by Automattic (Carly Staumbach, the data scientist and the musician, was already an Automattician, but she arrived there through an acquisition).

Screenshot of my email
The email is pretty long.
I even forgot to remove a link that I planned to read BEFORE sending that email.

The past five and a half years have been the best five and a half years in my professional life. I met a TON of fascinating people from different cultural and professional backgrounds. I re-discovered blogging. My idea of what a workplace is has changed tremendously and for good.

What happened?

Until now, every time I left a workplace, I did so for external reasons. I simply had to. I left either due to the company’s poor financial situation, due to a long commute, or both. Now, for the first time, I am leaving a place of work entirely for internal reasons: despite, and maybe a little bit because of, the fact that everything was so good. (Of course, there are some problems and disruptions, but nothing is ideal, right?)

What happened? In June, I left for a sabbatical. The sabbatical was so good that I already started making plans for another one. However, I also started thinking about my professional growth, the opportunities I have, and the opportunities I previously missed. I realized that right now, I am in the ideal position to exit the comfort zone and to take calculated professional risks. That’s how, after about four sleepless weeks, I decided to quit my dream job and to start a freelance career.

On January 22, I will become an Automattic alumnus.

BTW, Automattic is constantly looking for new people. Visit their careers page and see whether there is something for you. And if not, find the chutzpah and write them anyhow.

A group photo of about 600 people -- Automattic 2018 grand meetup
2018 Grand Meetup.
A group photo of about 800 people. 2019 Automattic Grand Meetup
2019 Grand Meetup. I have no idea where I am in this picture

Career advice. A research pharmacist wants to become a data scientist.

Recently, I received an email from a pharmacist who is considering becoming a data scientist. Since this is not the first (or the last) such email that I have received, I think others will find this message exchange interesting.

Here’s the original email with minor edits, followed by my response.

The question

Hi Boris, 


My name is XXXXX, and I came across your information and your advice on data science as I was researching career opportunities.

I currently work at a hospital as a research pharmacist, mainly involved in managing drugs for clinical trials.
Initially, I wanted to become a clinical pharmacist and pursued 1-year post-graduate residency training. However, it was not something I could envision myself enjoying for the rest of my career.

I then turned towards obtaining a Ph.D. in translational research, bridging the benchwork research to the bedside so that I could be at the forefront of clinical trial development and benefit patients from the rigorous stages of pre-clinical research outcomes. I much appreciate learning all the meticulous work dedicated before the development of Phase I clinical trials. However, Ph.D. in pharmaceutical sciences was overkill for what I wanted to achieve in my career (in my opinion), and I ended up completing with master’s in pharmaceutical sciences.

Since I wanted to be involved in both research and pharmacy areas in my career, I ended up where I am now, a research pharmacist.

My main job description is not any different from typical hospital pharmacists. I do have a chance of handling investigational medications, learning about new medications and clinical protocols, overseeing side effects that may be a crucial barrier in marketing the trial medications, and sometimes participating in development of drug preparation and handling for investigator-initiated trials. This does keep my job interesting and brings variety in what I do. However, I do still feel I am merely following the guidelines to prepare medications and not critically thinking to make interventions or manipulate data to see the outcomes. At this point, I am preparing to find career opportunities in the pharmaceutical industry where I will be more actively involved in clinical trial development, exchanging information about targeting the diseases and analyzing data. I believe gaining knowledge and experiences in critical characteristics for the data science field would broaden my career opportunities and interest. Still, unfortunately, I only have pharmacy background and have little to no experience in computer science, bioinformatics, or machine learning.

The answer

First of all, thank you for asking me. I’m genuinely flattered. I assume that you found me through my blog posts, and if not, I suggest that you read at least the following posts

All my thoughts on the career path of a data scientist appear on this page: https://gorelik.net/category/career-advice/

Now, specifically to your questions.

My path towards data science was one of gradual evolution. Every new phase in my career used my previous experience and knowledge: from B.Sc. studies in pharmacy to doctoral studies in computational drug design, from computational drug design to biomathematical modeling, from that to bioinformatics, and from that to cybersecurity. Of course, my path is not unique. I know at least three people who followed a similar career from pharmacy to data science. Maybe other people made different choices and are even more successful than I am.

My first piece of advice to everyone who wants to transition into data science is not to (see the first link in the list above). I was lucky to enter the field before it was a field, but today, we live in the age of specialization. Today we have data analysts, data engineers, machine learning engineers, NLP scientists, image processing specialists, etc. If computational modeling is something that a person likes and sees themselves doing for a living, I suggest pursuing a related advanced degree with a project that involves massive modeling efforts. Examples of such degrees for a pharmacist are computational chemistry, pharmacoepidemiology, pharmacovigilance, and bioinformatics. This way, one can use the knowledge that they already have to expand their expertise, build a reputation, and gain new knowledge. If staying in academia is not an option, consider taking on a relevant real-life project. For example, if you work in a hospital, you could try identifying patterns in antibiotics usage, a correlation between demographics and hospital re-admission, … you get the idea.

Whatever you do, you will not be able to work as a data scientist if you can’t write computer programs. Modifying tutorial scripts is not enough; knowing how to feed data into models is not enough.

Also, my most significant knowledge gap is in maths. If you do go back to academia, I strongly suggest taking advantage of the opportunity and taking several math classes: at least calculus and linear algebra and, of course, statistics. 

Do you have a question for me?

If you have questions, feel free to write them here in the comments section or to email boris@gorelik.net