Not a feature but a bug. Why having only superstars in your team can be a disaster.

Read this to learn about well-rounded teams that can effectively collaborate and communicate. I am an experienced team leader and builder; contact me to learn more about my services and how I can help you achieve better outcomes.

As a freelancer and a manager, I have worked with many companies and teams. Recently, I talked to a CEO who built a data science team that consisted of several “wonder kids” who obtained university degrees before graduating high school. The CEO was very proud of them. However, he complained that they didn’t deliver as expected. This made me realize that having only superstars is not a feature but a bug.

The fact is that most of us are average; even geniuses are average in most aspects. Richard Feynman, the Nobel laureate physicist, was also a painter, a musician, and an excellent teacher, but he was an exception. I, for example, tend to think of myself as an excellent generalizer, leader, and communicator. However, I need help with attention to detail and deep domain-specific knowledge. To work well, I need to have pedantic specialists on my team. Why? Because, on average, I’m average.

Most “geniuses” are extremely talented in one field but still need help in others. Many tend to be individual workers, meaning their team communication is often suboptimal. Additionally, the fact that the entire team is very young means they lack expertise in project management, inter-team communication, and business orientation, and often lack real-life experience too. The result: a disaster. That company got a team of solo players who don’t communicate within the team, don’t communicate with other teams, and don’t deliver on time.

What do I suggest? They say that “A’s hire A’s”. However, this doesn’t mean that each “A person” must ace the same field. A good team needs an A generalizer, an A specialist, an A communicator, and an A business expert. If you only hire “A++ specialists,” you risk ending up with a group of individuals who are “C-” communicators.

As another CEO I consulted once told me, “genius developers can do 10x job. They also tend to enter rabbit holes, and if unattended, they can do 10x damage.” If you build a team, you cannot afford to have unbalanced expertise sets. 

The bottom line is to ensure your team is diverse in its capabilities. Hiring only superstars may seem like a good idea, but it can result in a lack of collaboration, communication, and the necessary skills to succeed as a team. A diverse team with various skills and expertise is essential for achieving better outcomes.

In conclusion, avoid falling into the trap of thinking that only superstars can make a great team. Instead, focus on creating a diverse team with various skills, and you’ll be surprised at how much your team can achieve.

Modern tools make your skills obsolete. So what?

Read this if you are a data scientist (or another professional) worried about your career.

So many people, including me, write about how fields such as copywriting, drawing, or data science change from being accessible to a niche of highly professional individuals to a mere commodity. I claim it’s a good thing, not only for humankind but for the individual professional. Since I know nothing about drawing, I’ll talk about data science.

I started working as a data scientist a long time ago, even before the term data science was coined. Back then, my data science job included:

  • writing code that implements this optimization algorithm or the other
  • writing code that implements this statistical analysis or the other
  • writing code that implements this machine learning technique or the other
  • writing code that implements this quality metric or the other
  • writing code that handles named columns
  • writing code that deals with parallelization, caching, fetching data from the internet

Back then, exactly when the term data scientist was coined, I used to say “data is data”. I claimed that it didn’t matter whether you write a model that detects cancer or detects online fraud, a model that simulates two molecules in a solution or a model that simulates players in the electric appliances market. Data was data, and my job, as a data scientist, was to crunch it.

Time passed by. Suddenly, I discovered one cool library, then another, and then a third one … Suddenly, my job was to connect these libraries, which allowed me to be more expressive in what I could achieve. It also allowed me to concentrate better on “business logic.” Business logic is the term I use to describe all the knowledge required for the organization that pays your salary to keep doing so. If you work for a gaming company, “business logic” is the gaming psychology, competitor landscape, growth methods, and network effect. If you work for a biotech company, “business logic” is the deep understanding of disease mechanisms, biochemistry, genetics, or whatever is needed to perform the breakthrough. The fact that I no longer needed to deal with “low-level coding” made my old skills obsolete and pushed me to become more specialized.

These days, we are facing a new era in knowledge commoditization. This commoditization makes our skills obsolete but also makes us more efficient in tasks that we were slow at and lets us develop new skills. 

In 2017, Gartner predicted that more than 40% of data science tasks would be obsolete by 2020. Today, in 2023, I can safely say that they were right. I can also say that today, despite the recent layoffs, there are many more busy data scientists than there were in 2017 or 2020.

The bottom line. Stop worrying.

Let me cite myself from 2017:

Data scientists won’t disappear as an occupation. They will be more specialized.

I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by the data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill — operating with numbers using ready to use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

This is another piece of career advice. I have more of them on my blog.

Chances are that you don’t need a data scientist, and three things to consider before hiring one.

Read this if you are considering hiring data scientists

I already wrote about how data science becomes a commodity.

If you read this, I guess data science is not the core part of your business. If this is the case, consider the following before you hire data scientists.

Data engineers

Your data scientists can only be as good as the data you provide them. You must collect the correct data, validate it, store it well, and be able to access it easily. I have hours of “war stories” about how each of these steps went wrong and the company burned tons of money as a result. Data piping is a serious challenge. So, before you hire a data scientist, ask yourself whether your data engineering needs are covered.

Data analysts

Data analysts mainly focus on the organization and interpretation of data. Unlike data scientists, analysts don’t build predictive models or create unique algorithms. Instead, they identify trends and insights and present their findings clearly and understandably. Not being required to build novel models and algorithms allows them to better connect with stakeholders’ business needs and practical questions. A good data analyst will take the business problem, translate it into a data-based question, estimate its potential value, and, in many cases, answer it.

Boxed Solutions

Data Science as a Service is a term for boxed solutions that are constantly becoming more versatile, flexible, and affordable. I was a freelancer for a company that built its data-based product on an open-source implementation of a single optimization algorithm. They managed to run a successful company without a single data scientist for more than five years, and they started thinking of better solutions when they squeezed everything they could from their MRE. At this point, they had their data storage pipelines (data engineering), a better picture of their business (data analysts), and paying customers to finance the development of new algorithms.

How to work with data scientists?
I’ll write separate posts on this topic, but the gist is to make sure they know your business needs. Communicate your needs and problems to them, and make sure they share their efforts with you. I have seen many failed data science projects in my life. Most failed due to a lack of alignment, communication, or both.

This was another career advice post. Read more of them here.

Data Science Reality Check: My Predictions Come True (or, A Piece of Advice to Young Data Scientists)

Read this if you’re a data scientist or consider becoming one.

Almost six years ago, when Data Scientist was named the “sexiest job of the 21st century”, I wrote a blog post telling young professionals not to learn data science as a career move. My claim was that the data science field will get commoditized, and if you don’t possess deep (I mean DEEP) knowledge of either algorithms or the business you are working in, you will end up a mediocre coder.

Look what happened. Data science has indeed become commoditized in many fields. Many data-intensive businesses work just fine without data scientists. Even I, a very experienced data scientist, got laid off because I couldn’t bring the company value that would justify my salary. People like Matthew Yglesias from https://www.slowboring.com suggest that data scientists learn how to roll a burrito or mine lithium.

Why did this happen? Well, I was right. Data science has become a commodity. Each self-respecting platform offers AI tools (I hate the term AI, by the way) such as keyword extraction, insights, predictions, anomaly detection, recommendations, and many more. Tableau, PowerBI, and even Google Sheets or Excel offer tools that were once only available through custom data and code fiddling. The Data-Science-As-A-Service niche is full of products such as https://www.pecan.ai and https://www.anodot.com. And we haven’t even started talking about the new word of the day: the GPT.

Because I am an experienced data scientist, people often ask for my advice and help. In the past, when this happened, I used to discuss possible custom-tailored solutions. Now, I find myself suggesting that the person look at product X or Y, which will solve their problem at a fraction of the time and cost.

So, what do we have? What does all that mean?

Data science has become a commodity. In the past, to get a nice salary and a sexy title, it was enough to know what training, testing, and cross-validation were. Today, you absolutely have to know the theory and be a fast and good coder. But most of all, you must hone your communication skills and learn the business of the company where you work. Only this way will you be able to ensure your efforts are always aligned with the stakeholders and that you can consistently deliver value.

This is a career advice post. Check out the career tag and the Career Advice category of this blog.

14-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar, and it starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually falls in September-October. One interesting feature of Tishrei is that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles)**. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last days of Sukkot are mostly considered half working days. The children are also at home, since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) over a period of several years.
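The counting rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the original count: the function name `count_working_days` and its parameters are made up for this example, Friday/Saturday are treated as the weekend, holidays and holiday eves count as zero, and the intermediate Sukkot days count as half a day. The 2021 Gregorian dates are my own illustrative input and should be checked against a calendar before relying on the result.

```python
from datetime import date, timedelta


def count_working_days(start, n_days, rest_days, half_days, weekend=(4, 5)):
    """Count working days in a period, weighting half days as 0.5.

    `weekend` uses date.weekday() numbering (Monday=0),
    so (4, 5) means Friday and Saturday.
    """
    rest_days, half_days = set(rest_days), set(half_days)
    total = 0.0
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        if day.weekday() in weekend or day in rest_days:
            continue  # weekend, holiday, or holiday eve: no work
        total += 0.5 if day in half_days else 1.0
    return total


# Illustrative input (an assumption, not the author's data): Tishrei of 2021.
start = date(2021, 9, 6)  # the day before Rosh-HaShana
holidays = [date(2021, 9, d) for d in (7, 8, 16, 21, 28)]
eves = [date(2021, 9, d) for d in (6, 15, 20, 27)]
sukkot_half_days = [date(2021, 9, d) for d in range(22, 27)]

example = count_working_days(start, 31, holidays + eves, sukkot_half_days)
print(example)
```

Whether a holiday eve or an intermediate day counts as zero, half, or a full working day differs between workplaces, so the weights are the first thing to adjust.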

Overall, this period consists of between 14 and 17 working days in a single month (31 days, mind you). This year, we only have 14 working days during the Tishrei holiday period. This is what the working/not-working time during this month looks like:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted workday, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh month? I know this is confusing. That’s because Nissan, the month of the Exodus from Egypt, is counted as the first month.

(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

Book review: Extreme ownership

TL;DR Own your wins, own your failures, stay calm and make decisions. Read it. 5/5

“Extreme Ownership” is a book about leadership in business written by two ex-SEAL fighters. This book is full of war stories, as in actual stories from a real war. I read this book on the recommendation (an instruction, really) of the serial entrepreneur Danny Lieberman. After three years in the Israeli Border Police and after a cumulative year-and-a-half in active IDF reserve over almost twenty years, I learned to dislike war stories strongly. Had Danny not told me, “you have to read this book,” I would have ditched it after the first couple of pages. The war stories are self-bragging, and the business case studies are oversimplified and always have a happy ending. Moreover, the connection between a war story and a business case is sometimes very artificial.

Nevertheless, I’m glad that I read this book. It has several powerful messages and shows leadership aspects that I haven’t managed to formalize in my head before.

Key points

The best leaders don’t just take responsibility for their job. They take Extreme Ownership of everything that impacts their mission. When subordinates aren’t doing what they should, leaders that exercise Extreme Ownership cannot blame the subordinates. They must first look in the mirror at themselves.

  • It’s not what you preach; it’s what you tolerate
  • “Relax, look around, make a call.” 

This point takes me back to my days as the chief combat medic in an IDF infantry battalion (here we come, more war stories!). One day, an instructor, a very experienced paramedic, told me that the first thing medics should do when they arrive at a scene is to take a pulse: not the pulse of the victims, but their own, to make sure they are calm and make the right decisions.

  • Prioritize your problems and take care of them one at a time, the highest priority first. 
  • Leadership doesn’t just flow down the chain of command, but up as well.

This is a super valuable and insightful message.

The bottom line: Read it 5/5

New position, new challenge

I will skip the usual “I’m thrilled and excited…”. I’ll just say it.
As of today, I am the CTO of wizer.me, a platform for teachers and educators to create and share interactive worksheets.

On a scale of 1 to 10, how thrilled am I? 10
On a scale of 1 to 10, how terrified am I? 10
On a scale of 1 to 10, how confident am I that wizer.me will become the “next big thing” and the most significant chapter in my career? You won’t believe me, but also 10.

Back to in-person presentations

Today, I gave my first in-person presentation since the pandemic. It was awesome! I was talking about the study I performed with Nabeel Sulieman about data visualization in environments that use right-to-left writing systems.

I wrote about this study in the past [one, two]. Today, you may find the results of our study at http://direction-matters.com/. I hope to be able to publish the video recording of this presentation really soon.

An example of a very bad graph

Nature Medicine is a peer-reviewed journal that belongs to the very prestigious Nature group. Today, I was reading a paper that included THIS GEM.

These two graphs are so bad. It looks as if the authors set out to squeeze as many data visualization mistakes as possible into a single piece of graphics.

Let’s take a look at the problems.

  • Double Y axes. Don’t! Double axes are bad in 99% of cases (exceptions do exist, but they are rare).
  • Two subgraphs that are meant to work together have different category orders and different Y-axis scales. These differences make the comparison much harder.
  • Inverted Y scale in a bar chart. Wow! This is very strange. Bizarre! It took me a while to spot this. First, I tried to understand why the line of P<0.05 (the magic value of statistics) is above 0.1. Then, I realized that the right Y-axis is reversed. At first, I thought, “WTF?!” but then I understood why the authors made this decision. You see, according to the widespread statistical ritual, the lower the “P-value” is, the more significant it is considered. The value of 1 is deemed to be non-significant at all, and the value of 0 is considered “as significant as one can have.” So, in theory, the authors could have renamed the axis to “Significance” and reversed the numbers. Still, the result would not be a real “significance,” nor would the name be intuitive to anyone familiar with statistical analysis. On the other hand, they really wanted more “significant” values to be bigger than less significant ones. So, what the heck? Let’s invert the scale! Well, no, this is not a good idea.
  • Slanted category labels. This might be a matter of taste, but I dislike rotated and slanted labels. Turning the graph solves the need for label rotation, thus making it more readable and having zero drawbacks.

What can be done?

I don’t like criticism without improvement suggestions. Let’s see what I would have done with this graph. To make this decision, I first need to decide what I want to show. According to my understanding of the paper, the authors wish to show that the two data sets are very different in determining a specific outcome. To show that, we don’t need to depict both the P-value and variance (mainly since these two values are very much correlated). Thus, I will show only one metric. I will stick with the P-value.

I will keep the category order the same between the two subgraphs. Doing so will create a “table lens” effect; it will show the individual values while demonstrating the lack of correlation between the two groups. Finally, I will convert the bars into points, primarily to improve the data-ink ratio. Two additional arguments against bar charts, in this case, are the facts that the P-values of a statistical test cannot possibly be zero and that bar charts don’t work well with a log scale, in case we want to use it.

The result should look like this sketch.
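For readers who prefer code to a drawing, the same redesign can be approximated roughly as follows. This is a hedged sketch, not the code behind the figure: the function names `shared_order` and `plot_p_values` are made up for this example, the P-values are invented placeholders rather than numbers from the paper, and matplotlib is assumed to be installed. It shows two side-by-side dot plots with horizontal category labels, a common category order, and a log-scaled P-value axis.

```python
def shared_order(p_first, p_second):
    """Return category names sorted by P-value in the first data set.

    Only categories present in both dictionaries are kept, so that
    the two panels can share a single category order.
    """
    common = [name for name in p_first if name in p_second]
    return sorted(common, key=lambda name: p_first[name])


def plot_p_values(p_first, p_second, titles=("Data set A", "Data set B")):
    """Draw two dot plots that share category order and use a log x-axis."""
    import matplotlib.pyplot as plt  # imported lazily; assumed to be installed

    order = shared_order(p_first, p_second)
    positions = list(range(len(order)))
    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(8, 4))
    for ax, data, title in zip(axes, (p_first, p_second), titles):
        # Points instead of bars: a P-value has no meaningful zero baseline.
        ax.plot([data[name] for name in order], positions, "o")
        ax.axvline(0.05, linestyle="--", linewidth=1)  # conventional threshold
        ax.set_xscale("log")
        ax.set_xlabel("P-value (log scale)")
        ax.set_title(title)
    # Horizontal layout: category labels sit on the y-axis, no rotation needed.
    axes[0].set_yticks(positions)
    axes[0].set_yticklabels(order)
    return fig


# Hypothetical values, invented purely to exercise the functions.
p_a = {"feature 1": 0.001, "feature 2": 0.2, "feature 3": 0.04}
p_b = {"feature 1": 0.6, "feature 2": 0.03, "feature 3": 0.3}
print(shared_order(p_a, p_b))
```

Calling `plot_p_values(p_a, p_b)` returns a figure that can be saved with `fig.savefig(...)`. Sorting by the first data set is one arbitrary choice; any fixed order works as long as both panels use it.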

On proper selection of colors in graphs

Photo by Sharon McCutcheon on Pexels.com

How do you properly select a colormap for a graph? What makes the rainbow color map a wrong choice, and what are the proper alternatives?

Today, I stumbled upon a lengthy post that provides an in-depth review of the theory behind our color perception. The article concentrates on quantitative colormaps but also includes information relevant to selecting proper colors for categories. 

If you never learned the theory behind color and are interested in data visualization, I strongly suggest investing 45-60 minutes of your life in reading this post.

Book review: The Hard Thing About Hard Things by Ben Horowitz

TL;DR War stories and pieces of advice from a high-tech industry veteran.

I read this book following a recommendation by Reem Sherman, the host of the excellent (!!!) podcast Geekonomy (in Hebrew).

Ben Horowitz is a veteran manager and entrepreneur who founded the company Opsware, which Hewlett-Packard acquired in 2007. This book describes Horowitz’s journey at Opsware from its founding to the sale. The book’s second part is a collection of advice to working and aspiring CEOs. The last part is, actually, an advertisement for Horowitz’s new project, a VC company.

Things that I liked

The behind-the-scenes stories are interesting and inspiring.
Ben Horowitz devoted the second part of the book to sharing his experience as a CEO with other actual or aspiring CEOs. I don’t work as a CEO, nor do I see myself in that position in the future. However, this part is valuable for people like me because it provides insights into how CEOs think. Moreover, “The Hard Thing About Hard Things” is a popular book, and many managers learn from it.

Things that I didn’t like

Ben Horowitz was a manager during the early days of the high-tech industry. As such, parts of his attitude are outdated. The most prominent example of this problem is a story that Horowitz tells, in which he asked the entire company to work 12+ hours a day, seven days a week for several months. He was very proud of this, but IMO, employees will not accept such a request in today’s climate.

The bottom line: 4/5

:-(

Usually, I keep my blog for professional news only, but this time, I’ll make an exception.

This frame is from a video that was taken a couple of days ago, less than one hour away from my home. Note how many people are there. 

Some people will claim that what we see is a peaceful protest by Palestinians against the Israeli occupation. As a son and a grandson of Holocaust survivors, I find it hard to connect to the peacefulness of what I see. I don’t have to hear them chanting “from the River to the Sea Palestine will be free” to understand that what they, and many thousands more, really mean is “free of Jews”.

Published
Categorized as blog

Opening a new notebook in my productivity system

Those who know me know that I always carry with me a cheap, thin notebook that I use as an extension of my mind. Today, I opened a new notebook, and this is a good opportunity to share some links about my productivity system.

  • Start with the post “The best productivity system I know”
  • Failed attempt with tangible boards is here. This approach has an interesting idea behind it, but I couldn’t stick with it. YMMV
  • Failed attempt with digital/analog/tangible combo is here.

Another example of the power of data visualization

I stumbled upon a great graph that tells a complex story compellingly.

Comparison of two COVID-19 waves in the UK, taken from here.

This graph compares the last two waves of COVID-19 in the United Kingdom, and it shows so clearly that the new wave (which is supposedly composed of the Delta variant) is much more infectious on the one hand but, on the other hand, causes much less damage. Whether the more moderate damage is the result of the nature of the Delta variant or of the protective effect of the vaccination is still an open question, but the difference is still striking.

Managing remotely. A podcast interview with Martin Remy

My podcast is mostly in Hebrew, but this interview was recorded in English. I hope you will enjoy it.

Martin Remy has been managing teams of data engineers and data scientists for more than a decade, and he has been doing so remotely. What lessons can we learn from Martin? Important links: https://marting.blog https://martinremy.com The podcast’s Facebook page: https://www.facebook.com/reayonavodapodcast/ My home page: https://gorelik.net/about Subscribe to the podcast on Google Podcasts, Spotify, Apple Music, Podbean, and on any platform […]

Idea 38: Managing remotely (Boris Gorelik)

Another evolution of my offline productivity system


This week, I mark an important milestone in my professional life. It is an excellent opportunity to start a new productivity notebook and tell you about the latest evolution of the best productivity system I know.

To sum up, I use a custom variant of Mark Forster’s Final Version productivity system that uses a plain notebook to track, prioritize, and eliminate tasks. Using a physical notebook, as opposed to an electronic tool, is a massive boost in productivity, as it forces you to process your priorities in an unplugged mode, without any distractions.

When I was a freelancer, I felt forced to use a combination of a physical notebook and an electronic system (http://todoist.com/), but that didn’t work too well for me: the connected nature of this (and any other) app kept distracting me. I also played with a combination of a notebook and a portable kanban board. That didn’t work out for me either. So, right now, I’m back to a physical notebook with a small addition.

I now have two notebooks. The first one is a small (80 pages) soft notebook that I use to track and prioritize tasks (as in Mark Forster’s system). I also use this notebook to reflect on what’s going on, write questions to my future self, and document my decisions.

The second, larger notebook is used for note keeping, drafts, and sketches. The fact that the notebook is vertically bound allows me to switch seamlessly between Hebrew (which is written from right to left) and English. When a sketch or a draft isn’t relevant anymore, I tear the draft pages away, and I use a small binder to keep the note pages together for future reference.

Overall, I like this combo very much and it fits my workflow well.

Experiment report

In January 2020, I started a new experiment. I quit what was a dream job and became a freelancer. Today, the experiment is over. This post serves as omphaloskepsis – a short reflection on what went well and what could have worked better.

What worked well?

To sum up, I declare this experiment successful. I had a chance to work with several very interesting companies. I got exposed to business models of which I wasn’t aware. Most importantly, I met new intelligent and ambitious people. I also had a chance to experience firsthand how it feels to be self-employed and to see the behind-the-scenes of several freelancers and entrepreneurs. I learned to appreciate the audacity and the courage of people who don’t rely on monthly paychecks and take much more responsibility for their lives than the vast majority of the “salarymen.”

Let’s talk about money. Was it worth it in terms of $$$$$ (or ₪₪₪₪₪₪)? Objectively speaking, my financial situation remained approximately unchanged. Towards the end of the experiment, I found myself overbooked, which means that, in theory, I could have increased my income substantially. But this is only in theory. In practice, I decided to end the freelance experiment and “settle down”.

What could have been better?

So, was it peachy? Not at all. For me, being a freelancer is much more stressful than being a hired employee. The stress does not come exclusively from the need to make sure one has enough projects in the pipeline (I had enough of them, most of the time). The more significant source of stress came from the lack of focus, the need for EXTREME context switching, and the lack of a team. 

I did receive one suggestion to mitigate this source of stress; however, when I heard it, I already had several job offers and was already 90% committed to accepting the position at MyBiotics.

To sum up

I am very happy I did this experiment. I learned a lot, I enjoyed a lot (and suffered a lot too), I met new people, and I changed the way I think about many things. Was it a good idea? Yes, it was. Should you try becoming a freelancer? How the hell can I know that? It’s your life; you enjoy the success and take the risk of failure.

A new phase in my professional life

I’m excited to announce that I’m joining MyBiotics Pharma Ltd as the company’s Head of Data and Bioinformatics. I have been working with this fantastic company and its remarkable people as a freelancer for fourteen fruitful months. But today, I join the MyBiotics family as a full-time member. Together, we will strive to better understand the interactions between humans and their microbiome to improve health and well-being.

Black lives matter. Lior Pachter

Almost one year after it was originally published, I stumbled upon this powerful post.

Today, June 10th 2020, black academic scientists are holding a strike in solidarity with Black Lives Matter protests. I strike with them and for them. This is why: I began to understand the enormity of racism against blacks thirty five years ago when I was 12 years old. A single event, in which I witnessed […]

Black lives matter

Super useful videos for advanced data visualizers

The great Robert Kosara, also known as “eager eyes,” has started publishing a series of videos he calls Chart Appreciation. In these videos, Robert takes a piece of data visualization from a reputable and known source and discusses why this particular piece is so good, what decisions made it possible, what the alternatives are, and more. If you consider yourself an intermediate or advanced practitioner of data visualization, you should subscribe. Here’s one example.

Career advice. Upgrading data science career

Photo by Kelly Lacy on Pexels.com

From time to time, people send me emails asking for career advice. Here’s one recent exchange.

Hi Boris,

I am currently trying to decide on a career move and would like to ask for your advice.

I have a MSc from a leading university in ML, without thesis.

I have 5 years of experience in data science at <XXX Multinational Company> , producing ML based pipelines for the products. I have experience with Big Data (Spark, …), ML, deploying models to production…

However, I feel that I missed doing real ML complicated stuff. Most of the work I did was to build pipelines, training simple models, do some basic feature engineering… and it worked good enough.

Well, this IS the real ML job for 91.4%* of data scientists. You were lucky to work in a company that has access to data and teams dedicated to keeping that data flowing, neat, and organized. You worked in a company with good work ethics, surrounded by smart people, and, I guess, the computational power was never a big issue. Most of the data scientists that I know don’t have all these perks. Some have to work alone; others need to solve “dull” engineering problems, find ways to process data on suboptimal computers, or fight with a completely unstandardized data collection process. In fact, I know a young data scientist who quit her first post-Uni job after less than six months because she couldn’t handle most of these problems.

However I don’t have any real research experience. I never published any paper, and feel like I always did easy stuff. Therefore, I lack confidence in the ML domain. I feel like what I’ve been doing is not complicated and I could be easily replaced.

This is a super valid concern. I am surprised how few people in our field think about it. On the one hand, most ML practitioners don’t publish papers because they are busy doing the job they are paid for. On the other hand, publishing is not the only way to grow: I am a big proponent of teaching as a means of professional growth. So, you can decide to teach a course at a local meetup, at a local college, in your workplace, or at a conference. Teaching is an excellent way to improve your communication skills, which are the best means for job security (see this post).

Since you work at XXXXX , I suggest talking to your manager and/or HR representative. I’m SURE that they will have some ideas for a research project that you can take full-time or part-time to help you grow and help your business unit. This brings me to your next question.

I feel like having a research experience/doing a PhD may be an essential part to stay relevant in the long term in the domain. Also, having an expertise in one of NLP/Computer Vision may be very valuable.

I agree. Being a Ph.D. and an Israeli (we have one of the largest Ph.D. percentages globally) makes me biased.

I got 2 offers:

– One with <YYY Multinational company> , to do research in NLP and Computer Vision. […] which is focused on doing research and publishing papers […]

– One with a very fast growing insurance startup, for a data scientist position, as a part of the founding team. […] However, I feel it would be the continuation of my current position as a data scientist, and I would maybe miss on this research component in my career.

You can explore a third option: A Ph.D. while working at your current place of work. I know for a fact that this company allows some of their employees to pursue a Ph.D. while working. The research may or may not be connected to their day job.

I am very hesitant because

– I am not sure focusing on ML models in a research team would be a good use of my time as ML may be commoditised, and general DS may be more future-proof. Also I am concerned about my impact there.

– I am not sure that I would have such a great impact in the DS team of the startup, due to regulations in the pricing model [of that company], and the fact that business problems may be solved by outsourced tools.

These are hard questions to answer. First of all, one may see legal constraints as a “feature, not a bug,” as they force more creative thinking and novel approaches. Many business problems may indeed be solved by outsourcing, but this usually doesn’t happen in problems central to the company’s success since these problems are unique enough to not fit an off-the-shelf product. You also need to consider your personal preferences because it is hard to be good at something you hate doing.

From time to time, I give career advice. When the question or the answer is general enough, I publish them in a post like this. You may read all of these posts here.

Interview 27: Racial discrimination and fair machine learning

I invited Dr. Charles Earl for this episode of my podcast “Job Interview” to talk about racial discrimination at the workplace and fairness in machine learning.

Dr. Charles Earl is a data scientist at Automattic, my previous place of work. Charles holds a Ph.D. in computer science, an M.A. in education, an M.Sc. in electrical engineering, and a B.Sc. in mathematics. His career covered a position of assistant professor and a wide range of hands-on, managerial, and consulting roles in the field that we like to call today “data science.”

But there is another aspect to Dr. Earl. His skin is brown. He was born to an African-American family in Atlanta, GA, in the 1960s when racial segregation was explicitly legal. I am sure that this fact affected Charles’ entire life, personal and professional.

Links

If you know Hebrew, follow my podcast Job Interview (Reayon Avoda), and This Week in the Middle East

Five things I wish people knew about real-life machine learning

Deena Gergis is a data science lead at Bayer. I recently discovered Deena’s article on LinkedIn titled “Five Things I Wish I Knew About Real-Life AI.” I think that this article is a great piece of career advice for all current and aspiring data scientists, as well as for all the professionals who work with them. Let me take Deena’s headings and add my 2 cents.

One. It is all about the delivered value, not the method.

I fully agree with this one. Nobody cares whether you used a linear regression or recurrent neural network. Nobody really cares about p-values or r-squared. What people need are results, insights, or working products. Simple, right?

Two. Packaging does matter

Again, well said. The way you present your solution to your colleagues, customers, or stakeholders can determine whether your project will get more funds and resources or not. 

Three. Doing the right things != doing things right.

Exactly. Citing Deena: “you might be perfectly predicting a KPI that no one cares about.” Enough said. 

Four. Set realistic expectations.

Not everybody realizes that “machine learning” and “artificial intelligence” are not synonyms for “magic” but rather a form of statistics (I hope “real” statisticians won’t get mad at me here). The principle “garbage in – garbage out” holds in machine learning. Moreover, sometimes, ML systems amplify the garbage, resulting in “garbage in, tons of garbage out”.

Five. Keep humans in the loop.

Let me cite Deena again: “My customers are my partners, not just end-users.” Note that by “customers,” we don’t only mean walk-in clients, but also any internal customer, project manager, even a colleague who works on the same project. They are all partners with unique insights, domain knowledge, and experience. Use them to make your work better. 

Read the original article here. Deena Gergis has several more articles on LinkedIn here. And if you know Arabic, you might want to watch Deena’s videos on YouTube here. Unfortunately, my Arabic is not good enough to understand her Egyptian accent, but I suspect that her videos are as good as her writings.

One of the first dataviz blogs that I used to follow is now a book. Better Posters

I started following data visualization news and opinions quite a few years ago. One of the first blogs that were active in this area was NeuroDojo, by the (now) professor Zen Faulkes. One of Zen’s spin-off blogs was devoted to better posters. This poster blog is called, surprisingly enough, Better Posters. Since I’m not in academia anymore, I stopped caring about posters many years ago. Today, I stumbled upon this blog and was pleasantly surprised to discover that Better Posters is still active and that it is also now a book.

Working with the local filesystem and with S3 in the same code

Photo by Ekrulila on Pexels.com

As data people, we need to work with files: we use files to save and load data, models, configurations, images, and other things. When possible, I prefer working with local files because it’s fast and straightforward. However, sometimes, the production code needs to work with data stored on S3. What do we do? Until recently, you would have to rewrite multiple parts of the code. But not anymore. I created the sshalosh package, which solves this problem and spares a lot of code rewriting. Here’s how you work with it:

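import sshalosh

# work_with_s3 is a flag that you set according to your environment
if work_with_s3:
    s3_config = {
      "s3": {
        "defaultBucket": "bucket",
        "accessKey": "ABCDEFGHIJKLMNOP",
        "accessSecret": "/accessSecretThatOnlyYouKnow"
      }
    }
else:
    # No S3 configuration: the serializer works with the local filesystem
    s3_config = None
serializer = sshalosh.Serializer(s3_config)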

# Done! From now on, you only need to deal with the business logic, not the house-keeping

# Load data & model
data = serializer.load_json('data.json')
model = serializer.load_pickle('model.pkl')

# Update (update_with_new_examples stands for your own business logic)
data = update_with_new_examples()
model.fit(data)

# Save updated objects
serializer.dump_json(data, 'data.json')
serializer.dump_pickle(model, 'model.pkl')

As simple as that.
The package provides the following functions.

  • path_exists
  • rm
  • rmtree
  • ls
  • load_pickle, dump_pickle
  • load_json, dump_json

There is also a multipurpose open function that can open a file in read, write or append mode, and returns a handler to it.
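To illustrate the idea behind such a wrapper, here is a minimal local-only sketch. It is not the sshalosh implementation: the class name `LocalSerializer` and the `roundtrip_demo` helper are hypothetical, and only the method names mirror the ones listed above. The point is that the business logic talks to one interface, so a backend with the same methods (for example, one that talks to S3) could be swapped in without rewriting the calling code.

```python
import json
import os
import tempfile


class LocalSerializer:
    """Hypothetical local-filesystem backend; NOT part of sshalosh."""

    def __init__(self, root):
        # Every path is resolved relative to this directory
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path)

    def path_exists(self, path):
        return os.path.exists(self._full(path))

    def dump_json(self, obj, path):
        with open(self._full(path), "w", encoding="utf-8") as f:
            json.dump(obj, f)

    def load_json(self, path):
        with open(self._full(path), encoding="utf-8") as f:
            return json.load(f)


def roundtrip_demo():
    """Save and reload a small object; the caller never touches file handles."""
    with tempfile.TemporaryDirectory() as tmp:
        serializer = LocalSerializer(tmp)
        serializer.dump_json({"examples": [1, 2, 3]}, "data.json")
        return serializer.path_exists("data.json"), serializer.load_json("data.json")
```

The calling code in `roundtrip_demo` only knows about `dump_json`, `load_json`, and `path_exists`, which is what makes the storage backend replaceable.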

How to install? How to contribute?

The installation is very simple: pip install sshalosh-borisgorelik
and you’re done. The code lives on GitHub under http://github.com/bgbg/shalosh. You are welcome to contribute code, documentation, and bug reports.

The name is strange, isn’t it?

Well, naming is hard. In Hebrew, “shalosh” means “three”, so “sshalosh” means s3. Don’t overanalyze this. The GitHub repo doesn’t have the extra s. My bad.

Book review. The Persuasion Slide by Roger Dooley

TL;DR Very shallow and uninformative. It could be an OK series of blog posts for complete novices, but not a book.

The Persuasion Slide by Roger Dooley was a disappointment for me. I love Dooley’s podcast Brainfluence, and I was sure that Roger’s book would be full of in-depth knowledge and case studies. However, it contained neither.

The only contribution of this book is the analogy between a sales process and an amusement park slide. The theory behind the book is mostly presented as a ground truth with almost no explanation or support from research. One will gain much more knowledge and understanding by reading Kahneman’s “Thinking, Fast and Slow,” Ariely’s “Predictably Irrational,” or Wiseman’s “59 Seconds.”

Should I read this book?

No

Graphical comparison of changes in large populations with “volcano plots”

I recently rediscovered a volcano plot — a scatter plot that aims to visualize changes in large populations.

Volcano plots are very technical and specialized and, most probably, are not a good fit for explanatory data visualization. However, they can be useful during the exploration phase, and they come with a set of well-established metrics.

Moreover, if you are lucky enough to have well-behaved data, the plots look very cool

Visualization of RNA-Seq results with Volcano Plot
From here

Of course, in real life, the data is messy. Add bad visualization practices to the mess and you get a marvel like this one

From here

The bottom line: if you have two populations to compare, consider volcano plots. But do remember dataviz good practices.
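For readers who have not met the plot before, here is a minimal sketch of how a single point on a volcano plot is usually computed: the horizontal axis is the log2 fold change between the two populations, and the vertical axis is the negative log10 of the p-value. The function names and the default cut-offs (a two-fold change and p ≤ 0.05) are my assumptions for illustration; they are common conventions, not something taken from the plots above.

```python
import math


def volcano_point(mean_a, mean_b, p_value):
    """Return (x, y) for one feature: log2 fold change and -log10(p)."""
    log2_fold_change = math.log2(mean_b / mean_a)
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


def is_highlighted(point, min_abs_lfc=1.0, max_p=0.05):
    """True if the point is both large in effect and statistically notable.

    The thresholds are illustrative assumptions; choose your own.
    """
    x, y = point
    return abs(x) >= min_abs_lfc and y >= -math.log10(max_p)


# A feature whose mean grows four-fold with p = 0.001 lands near (2, 3)
example = volcano_point(mean_a=10.0, mean_b=40.0, p_value=0.001)
```

Points with a large change and a small p-value end up in the upper left and upper right corners, which is what gives the plot its "eruption" shape.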

Book review: Manager in shorts by Gal Zellermayer

TL;DR Nice’n’easy reading for novice managers

I read this book after hearing the author, Gal Zellermayer, in a podcast. Gal is an Israeli guy who has been working as a manager in several global companies’ Israeli offices. He brings a perspective that combines (what are perceived as) the best practices of the American management style with the Israeli tendency to make things straight and simple.

The greater part of the book is devoted to helping the people in your team develop. The book serves as a good motivator and helps to keep in mind the importance of “peopleware.” I wish, however, it offered more practical advice and cited more research and external analyses.

Should you read this book?

If you are a beginning manager or want to be one – yes. 

If you never read a book on management – maybe (although Peopleware might be a better read).

The bottom line: 4/5

One idea per slide. It’s not that complicated


I wrote this post in 2009, I published it in March 2020, and am republishing it again


A lot of texts that talk about presentation design cite a very clear rule: each slide has to contain only one idea. Here’s a slide from a presentation deck that says just that.

And here’s the next slide in the same presentation

Can you count how many ideas there are on this slide? I see four of them.

Can we do better?

First of all, we need to remember that most of the time, the slides accompany the presenters and do not replace them. This means that you don’t have to put everything you say on a slide. In our case, you can simply show the first slide and give more details orally. On the other hand, let’s face it, the presenters often use slides to remind themselves of what they want to say.

So, if you need to expand your idea, split the sub-ideas into slides.

You can add some nice illustrations to connect the information and emotion. 

Making it more technical

“Yo!”, I can hear you saying, “Motivational slides are one thing, and technical presentation is a completely different thing! Also,” you continue, “We have things to do, we don’t have time to search the net for cute pics”. I hear you. So let me try improving a fairly technical slide, a slide that presents different types of machine learning.
Does a slide like this look familiar to you?

First of all, the easiest solution is to split the ideas into individual slides.

It was simple, wasn’t it? The result is so much more digestible! Plus, the frequent changes of slides help your audience stay awake.

Here’s another, more graphical attempt

When I show the first slide in the deck above, I tell my audience that I am about to talk about different machine learning algorithms. Then, I switch to the next slide, talk about the first algorithm, then about the next one, and then mention the “others”. In this approach, each slide has only one idea. Notice also how the titles in these last slides are smaller than the contents. In these slides, they are used for navigation and are therefore less important.  In the last slide, I got a bit crazy and added so much information that everybody understands that this information isn’t meant to be read but rather serves as an illustration. This is a risky approach, I admit, but it’s worth testing.

To sum up

“One idea per slide” means one idea per slide. The simplest way to enforce this rule is to devote one slide to each sentence. Remember, adding slides is free; the audience’s attention is not.

Innumeracy

Innumeracy is the “inability to deal comfortably with the fundamental notions of number and chance”.
I wish there was a better term for “innumeracy”, a term that would reflect the importance of analyzing risks, uncertainty, and chance. Unfortunately, I can’t find such a term. Nevertheless, the problem is huge. In this long post, Tom Breur reviews many important aspects of “numeracy”. I already shared this post a long time ago, but it’s worth sharing again.

https://tombreur.wordpress.com/2018/10/21/innumeracy/


Before and after — stacked bar charts

A fellow data analyst asked a question: what do we do when we need to draw a stacked bar chart that has too many colors? How do we select the colors so that they are nice but also easily distinguishable? To answer this question, let’s look at data similar to what appeared in the original question. I also tried to recreate the actual chart’s style.

So, how do we select colors?
The answer to this question is pretty complicated. To have a set of easily distinguishable colors, one needs to properly model color perception in a typical human being. Luckily, there is a tool called I Want Hue that is based on a solid theory, explained here. The problem, however, isn’t in colors.

This is not the right question

Distinguishing between eight colors in a graph is a challenging task. Selecting the right color scheme might help, but it won’t solve this fundamental problem. Moreover, stacked bar plots are tricky due to another complication.

We, the humans, are somewhat good at comparing positions but not as good at comparing sizes. This is why comparing the heights of the bars is relatively easy. It is easy because the bars start at the same line, and our task is to compare the bar end position, not the bar size. Reading the heights of the lowest segment in the bars is also an easy task for the same reason: we don’t compare the sizes but the heights.

However, comparing the sizes of the middle components is more challenging. As a result, the intermediate parts of a graph don’t add useful information but rather add noise. Thus, let us explore two options. First, we will reduce the number of groups. Next, we will explore what happens when reducing the number of groups is not an option.

Option 1. Reduce the number of categories

It is hard to advise about data visualization when I don’t know what conclusion the author wants to convey. However, I am sure that in many cases, the number of categories that are relevant to the viewer is much smaller than the number of categories that are relevant to the analyst. The viewer might not care about all the hard work you did while collecting the data; what they care about is an insight. For example, if we reduce the discussion to two groups, the USA and non-USA data centers, the graph becomes much more readable.

Note how two groups in a stacked graph pose no problem in deciphering the sizes. If we take care of readability and improve the data-ink ratio, we get a nice data visualization piece.
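The regrouping step itself is tiny. Below is a minimal sketch, assuming the data is a mapping from data center to error count; the data center names and the numbers are made up for illustration and do not come from the original chart (the gist linked at the end of this post has the code that produced the actual graphs).

```python
from collections import defaultdict

# Hypothetical error counts per data center (names and numbers are made up)
errors_by_datacenter = {
    "Dallas": 12,
    "Chicago": 7,
    "Amsterdam": 5,
    "Singapore": 3,
    "Sydney": 2,
}

# Hypothetical set of data centers located in the USA
usa_centers = {"Dallas", "Chicago"}


def collapse_to_two_groups(counts, usa):
    """Sum the per-center counts into just two groups: USA and non-USA."""
    totals = defaultdict(int)
    for center, n_errors in counts.items():
        group = "USA" if center in usa else "non-USA"
        totals[group] += n_errors
    return dict(totals)


two_groups = collapse_to_two_groups(errors_by_datacenter, usa_centers)
```

With only two series left, the stacked bar chart needs two colors, and the question of finding eight distinguishable hues disappears.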

Option 2. When reducing the number of categories is not an option

But what if reducing the number of categories is not an option? If you are absolutely sure that the audience absolutely needs to see all the information, you can split the different groups into separate subgraphs.

Have you noticed that the X-axis in our case represents time? In this case, we can replace the bars with an evolution plot and create a separate chart for each category in the data set. I took special care to keep the Y-axis scale equal between all the graphs so that the viewer can easily distinguish between data centers with a lot of errors and data centers with only a few of them. Here’s the result:

But what if the overall error rate is of greater importance than the individual groups? In that case, we can plot the overall rate in a larger graph and add the separate groups below, in smaller, un-emphasized subplots.

Summary — the Why and the What define the How

When you have a technical question about improving a graph, make sure you ask yourself “why.” Why does this technical problem matter? Why will solving it improve the chart? To answer this question, you will have to ask another question: “what?”. “What is it that I want to say?” The easiest way to force yourself to ask these questions is to force yourself to add titles to every graph you create (see my how to suck less in data visualization post for more details).

Once you have your conclusion ready, you will notice that you don’t need a technical solution but rather a conceptual one. In this case, we solved the technical problem of looking for eight distinct colors by reducing the number of categories to two or splitting one elaborate graph into several straightforward ones.

So, remember, the Why and the What define the How

The Python code that was used to generate all these graphs is available here: https://gist.github.com/bgbg/6c645a5fc48e61b1a917c9d1d66fa72f

The Problem With Slope Charts (by Nick Desbarats)

Slope charts are often suggested as a valid alternative to clustered bar charts, especially for “before and after” cases.

So, instead of a clustered bar chart like this

we tend to recommend a slope chart (or slope graph) like this

However, a slope chart isn’t free of problems either. In the past, I already wrote about a case of a meaningless slopegraph [here]. Today, I stumbled upon an interesting blog post (and a video) that surveys the problems of slope charts and their alternatives.

All the graphs here come from the original post by Nick Desbarats that can be found [here].

Before and after: Alternatives to a radar chart (spider chart)

Radar charts (sometimes called “spider charts”) look cool but are, in fact, pretty lame. So much so that when the data visualization author Stephen Few mentioned them in his book Show Me the Numbers, he did so in a chapter called “Silly graphs that are best forsaken.”

Here, I will demonstrate some of their problems and suggest an alternative.

Before: The problems of a radar (spider) plot

Above is my reconstruction of the original plot that I saw in a Facebook discussion. The graph looks pretty cool, I have to admit, but it is full of problems.
What are the problems of a spider plot or a radar plot?
Let’s start with readability. Can you quickly tell the value of “Substance abuse” for the red series? Not that easy.

But a more significant problem emerges when one realizes that in most cases, the order of the categories is arbitrary and that different sorting options may result in entirely different visual pictures.
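A quick way to see how much the category order matters is to compute the area of the polygon that a radar chart draws. The sketch below assumes equally spaced spokes, so the area is the sum of the triangles between neighboring spokes; the function name and the toy values are mine, not taken from the original plot.

```python
import math


def radar_area(values):
    """Area of the radar polygon for values placed on equally spaced spokes.

    Each pair of neighboring spokes forms a triangle with area
    0.5 * r_i * r_next * sin(angle between spokes).
    """
    n = len(values)
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )


# The same four numbers in two different category orders (toy data)
alternating = [1, 5, 1, 5]
grouped = [1, 1, 5, 5]

area_alternating = radar_area(alternating)  # 10.0
area_grouped = radar_area(grouped)          # 18.0
```

The data is identical, yet one ordering produces a shape almost twice as large as the other, so the viewer's visual impression depends on an arbitrary choice.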

After: conclusion-based graph design

I have been continually preaching to add meaningful titles to all the graphs you are creating. (See How to suck less in data visualization and professional communication).

One of the byproducts of adding a title is the fact that when you write down your main takeaway of a graph, you force yourself to think, “does this graph show what it says it shows?” Thus, you guide yourself to better graph choices.

Let’s say that we conclude that there is no correlation between the two series of data. Is this conclusion evident from the graphs? I would say, not so much.

Instead of a radar chart, I suggest creating two aligned, horizontal bar plots. This way, we may sort one subplot according to the values, and then the correlation (or lack thereof) will be evident.

But what if we noticed something interesting about the differences between A and B groups? If this is true, let’s show precisely this: the differences.

Notice how the bars in this version are sorted according to the difference. Sorting a bar chart is the easiest way to make it readable.
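The sorting step can be sketched in a few lines. The category “Substance abuse” appears in the plot above; the other category names and all the numbers here are hypothetical, and the function name is mine (the gist linked below has the code behind the actual graphs).

```python
# Hypothetical scores for two groups (numbers are made up)
group_a = {"Substance abuse": 7, "Sleep": 4, "Diet": 6}
group_b = {"Substance abuse": 3, "Sleep": 5, "Diet": 6}


def sorted_by_difference(a, b):
    """Return (category, a - b) pairs, largest difference first."""
    differences = {category: a[category] - b[category] for category in a}
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


bars = sorted_by_difference(group_a, group_b)
# [('Substance abuse', 4), ('Diet', 0), ('Sleep', -1)]
```

Plotting the bars in this order puts the largest gaps at one end of the chart, so the reader sees the ranking without having to search for it.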

Python code that I used to create these graphs is available here https://gist.github.com/bgbg/db833db723998cd244b5049bfe01f5ac

Another language

بعد حوالي سنتين من الدراسة ، بحس حالي جاهز لإضافة اللغة العربية إلى قائمة اللغات في ال-LinkedIn 

After about two years of study, I feel ready to add Arabic to LinkedIn’s language list

Basic data visualization video course (in Hebrew)

I had the honor to record an introductory data visualization course for high school students as a part of the Israeli national distance learning project. The course is in Hebrew, and since it targets high schoolers, it does not require any prior knowledge.

I got paid for this job. However, when I divide the money that I received for this job by the time I spent on it, I get a ridiculously low rate. On the other hand, I enjoyed the process, and I view this as my humble donation to the public education system.
Since a government agency makes the course site, its UI is complete shit. For example, the site doesn’t support playlists, and the user is expected to search through the video clips by their titles. To fix that, I created a page that lists all the videos in the right order.

Text Visualization Browser

I’ve stumbled upon an exciting project — text visualization browser. It’s a web page that allows one to search for different text visualization techniques using keywords and publication time. 

Text visualization browser https://textvis.lnu.se

The ability to limit the search to various years gives a nice historical perspective on this interesting topic

This site’s information is based on a 2015 paper Text visualization techniques: Taxonomy, visual survey, and community insights. I wish the authors updated it with more recent data, though. 

Sharing the results of your Python code


If you work, but nobody knows about your results or cares about them, have you done any work at all? 

A proverbial tree in the proverbial forest. Photo by veeterzy on Pexels.com

As a data scientist, the product of my work is usually an algorithm, an analysis, or a model. What is a good way to share these results with my clients? 

Since I write in Python 99% of the time, I fell in love with a framework called Panel (http://panel.holoviz.org/). Panel allows you to create and serve a basic interactive UI around data, an analysis, or a method. It plays well with API frameworks such as FastAPI or Flask. The only problem is how to share this work. Sometimes, it is enough to run a local demo server, but if you want to share the work with someone who doesn’t sit next to you, you have to host it somewhere and take care of access rights. For this purpose, I have a cheap cloud server ($5/month), which is more than enough for my personal needs.

If you can share the entire work publicly, some services can pick up your Jupyter notebooks from GitHub and serve them interactively. I know of Voilà and Binder.

Recently, Streamlit.io has been entering this niche. It currently only allows sharing public repos, but promises to add a paid service for your private code. I’m eager to see that.

The information is beautiful. The graphs are shit!

I apologize for my harsh language, but recently I was exposed to a bunch of graphs on the “information is beautiful” site, and I was offended (well, not really, but let’s pretend I was). I mean, I’m a liberal person, and I don’t care what graphs people do in their own time. Many people visit that site because they try to learn good visualization practices, but some charts on that site are wrong. Very wrong.

Here’s the gem:

I deliberately don’t share the link to this site. I don’t want to let Google think it’s valuable in any way.

Now, the geniuses from “Information is beautiful” (let’s call them IB for brevity) wanted to share with us some positive stats. How nice of them. So what did they do? They gathered together nine pairs of metrics collected at two different time points: one in the past and one further back in history. They used nice colors to create some sleek shapes. So, what’s the problem? What’s wrong with that?

Everything is wrong!

Let’s start with my guess that they cherry-picked the stats with “positive” changes. Secondly, a comparison of this sort is mostly meaningless if we compare points from different years. What stopped the authors of that tasteless “infographic” from collecting data from the same years? I guess, their laziness. That’s how we ended up comparing the number of death penalties in 1990 and 2016, while the malaria death numbers are for 2000 and 2016, and dying mothers are compared for the years 2000 and 2017.

Now, let’s talk about data viz.

Take a look at this graph.

The only time we use shapes like that is when we want to convey information about uncertainty. To do that, the X-axis represents the thing we are measuring, and the Y-axis represents our certainty about the current value. When we compare two uncertain measurements, we may judge the difference between these measurements by the distance between the curve peaks, and the width of the curve represents the uncertainty.

Here’s a good example from [this link]:

Can you see how the metric of interest is on the X-axis? The width of each bell curve represents the uncertainty and the difference between any pair of cases is the difference on the horizontal (X) axis, not the vertical one.
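Here is a small sketch of why the height of such a curve is the wrong thing to compare. It uses `statistics.NormalDist` from the Python standard library; the two measurements and their numbers are made up for illustration, and I assume a normal distribution purely for simplicity.

```python
from statistics import NormalDist

# Two hypothetical measurements of the SAME value with different uncertainty
precise = NormalDist(mu=10.0, sigma=1.0)
vague = NormalDist(mu=10.0, sigma=3.0)


def peak_height(dist):
    """Height of the bell curve at its peak (the density at the mean)."""
    return dist.pdf(dist.mean)


def measured_difference(first, second):
    """The difference between two measurements is read on the X-axis."""
    return second.mean - first.mean


height_precise = peak_height(precise)  # roughly 0.399
height_vague = peak_height(vague)      # roughly 0.133
difference = measured_difference(precise, vague)  # 0.0
```

Both curves describe the same measured value, yet one is three times taller than the other. The height tells us how certain we are, not how large the measured value is.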

Instead, what did the IB authors do? They obviously like sleek looking shapes but know nothing about how to use them. They could have used two bars and let the viewer compare their heights. But nooooo! Bars are not c3wl! Bars are boring! Instead, they took probability density curves (that’s what they are technically called) and made them pretend to be bars.

Bars. Is this THAT hard?

I can hear some of you saying, “Stop being so purist! What’s wrong with comparing the heights of bell curves?” I’ll tell you what’s wrong! Data visualization is a language. As with any language, it has some rules and traditions. If you hear me saying, “me go home,” you will understand me without any problem. However, you will silently judge me for my poor use of the English language. I know that, and since English is my third language, I use all the help I can get to make as few mistakes as possible. The same is true of data visualization. Please respect its rules and traditions, even if (and especially if) you are not fluent in it.

I never write more than two sentences in English without Grammarly

Visit the worst practice tag in this blog to see more bad examples

The Empirical Metamathematics of Euclid and Beyond — Stephen Wolfram Blog

I am seldom jealous of people, but when I am, I’m jealous of Stephen Wolfram

Towards a Science of Metamathematics One of the many surprising things about our Wolfram Physics Project is that it seems to have implications even beyond physics. In our effort to develop a fundamental theory of physics it seems as if the tower of ideas and formalism that we’ve ended up inventing are actually quite general,…

The Empirical Metamathematics of Euclid and Beyond — Stephen Wolfram Blog

Boris Gorelik on the biggest missed opportunity in data visualization — Data for Breakfast

My guest talk at Automattic.

Boris Gorelik recently joined us to present on The Biggest Missed Opportunity in Data Visualization based on his recent talk at the NDR conference. Boris was a data scientist at Automattic, is now a data science consultant, and blogs regularly on data visualization and productivity.  Some of highlights (along with a handy timestamp) include: Keep […]

Boris Gorelik on the biggest missed opportunity in data visualization — Data for Breakfast

15-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar that starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles)**. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 resting days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last Sukkot days are mostly considered half working days. Also, the children are at home since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) over a period of several years.
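The counting itself can be sketched as follows. This is not the code I used for the analysis; the function name is hypothetical, and the example week below uses made-up rest and half days rather than actual holiday dates. The only calendar fact it relies on is that the Israeli weekend is Friday and Saturday.

```python
from datetime import date, timedelta


def classify_days(start, n_days, rest_days, half_days):
    """Count off, half, and full working days in a window of n_days.

    rest_days: holidays and holiday eves (sets of dates supplied by the caller)
    half_days: dates treated as half working days
    """
    counts = {"off": 0, "half": 0, "full": 0}
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        # In Python, Monday is 0, so Friday is 4 and Saturday is 5
        if day.weekday() in (4, 5) or day in rest_days:
            counts["off"] += 1
        elif day in half_days:
            counts["half"] += 1
        else:
            counts["full"] += 1
    return counts


# A made-up week (Sunday to Saturday) with one rest day and one half day
example = classify_days(
    start=date(2021, 1, 3),
    n_days=7,
    rest_days={date(2021, 1, 5)},
    half_days={date(2021, 1, 6)},
)
```

For the real analysis, `start` would be the day before Rosh-HaShana, `n_days` would be 31, and the two sets would hold the actual holiday dates of the year in question.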

Overall, this period consists of between 15 and 17 non-working days in a single month (31 days, mind you). This is what the working/non-working time during this month looks like:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to the constantly interrupted workday, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) The New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.
(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

Career advice. Becoming a freelancer immediately after finishing a master’s degree

Photo by Miguel Á. Padriñán on Pexels.com

Will Cray [link] is a fresh M.Sc. in Computer Science who is considering becoming a freelancer in the Machine Learning / Artificial Intelligence / Data Science field. Will asked for advice on the LocallyOptimistic.com community Slack channel. Here’s Will’s question (all the names in this post are used with people’s permission).

Read more career advice [here].

Let’s begin.

Will Cray 

I’m hoping to start a career as a freelancer in the AI space after finishing my Master’s in CS with a focus in AI. I don’t, however, have any industry experience in AI or data science. Do you all think it’s feasible to start a freelancing career without any industry experience? If so, do you have any tips on how to do it successfully?
[I worked for] two years at a major tech company, but I was a systems engineer. It was experience that isn’t necessarily relevant to what I want to work on as a freelancer.

Let’s divide the response to Will’s questions into two parts that correspond to Slack’s two discussion threads.

Thread #1 – Michael Kaminsky

This is a copy/paste from Slack.

Michael Kaminsky 

LocallyOptimistic.com — a valuable source for data folks

My hunch is that it’s going to be pretty tough to get started, though not impossible. You’re probably looking at a pretty lean year or two to build up a reputation out of the gate

Michael Kaminsky 

AI work in general is sort of difficult to contract out — so you might have more luck if you team up with a larger consulting outfit that can handle the other non-AI parts of the work

Michael Kaminsky 

very rarely is someone like “we have all of the data pipeline and pieces working, now we just need to hire someone to do the AI part” — in general, the model-fitting part of an AI project is the easiest and fastest

Will Cray 

Thank you so much for the info–it’s really helping me getting a better understanding of the landscape. Would your opinion, especially regarding that last message, change if the AI work I was doing was more custom model/agent design and training, rather than doing something quick like .fit() in sklearn?

Michael Kaminsky

ummm maybe? but like who needs custom model/agent design and training that doesn’t already have in-house data scientists working on it?

Michael Kaminsky

I don’t want to dissuade you, but my point is that you should think about who your customers are, and how you can market your services in such a way that it will provide them value. If you don’t have a clear map of the three concepts in italics, it could get rough — you can definitely figure it out by doing it, but that’s what you’ll be up against

Will Cray

You mentioned “larger consulting outfits” earlier–do you have any examples of organizations that you think could be a good fit?

Michael Kaminsky

so Brooklyn Data Company and 4 mile consulting are the two that jump to my mind — they specialize in BI and data but might want flex capacity into DS — they might be able to give you deal flow, etc. I know there are a number of others, maybe even folks in this channel

Thread #2 – Boris Gorelik

This is a copy/paste from Slack with some later edits and additions. 

Boris Gorelik 

Another thing to consider is what your risks are. If there are people who depend on you financially, starting with a freelance career might be too risky, especially if you don’t have 1-2 (better 2) customers who already committed to paying you for your services.

If you can afford several months without a steady income, or no income at all, being a freelancer might expose you to a larger variety of companies and business models in the market. I know some people who used to work as freelancers and gradually “adopted” one customer and moved to full employment. In these cases, freelance projects were, in fact, mutual trial periods where both sides decided whether there is a good fit.

Will Cray 

I greatly appreciate this insight. I have little risks. I’m single, my living expenses are low, and I have some financial runway. Part of the reason I like the idea of freelancing is for the reason you stated–I’ll get to see many different business models. As an aspiring entrepreneur, I think diversity of experiences and exposure would be useful to me. I also think being flexible in how many hours I work will allow me to allocate more time to developing my own ideas/projects; although, I understand that’s a luxury that comes with being an established freelancer. I don’t have any clients currently. Do you have any recommendations for channels to try and garner clients?

Boris Gorelik

> As an aspiring entrepreneur, I think ….

Even though a freelancer’s and an entrepreneur’s legal status may be the same, they are different occupations and careers. An entrepreneur creates and realizes business models; a freelancer sells their time and expertise to fulfill someone else’s ideas. It’s true that most of the time (not always), combining freelance work with entrepreneurship is easier than combining entrepreneurship with being a full-time employee in a traditional company.

 > Do you have any recommendations for channels to try and garner clients?

Nothing except the regular facebook/linkedin/ but mostly friends and former coworkers and, in your case, teachers/lecturers. I got my first job interview via my Ph.D. advisor. Later, when I helped in hiring processes, I asked him and other professors to refer me to proper candidates. So yeah, make sure your professors know your status.

Exploring alternatives to population pyramids

A population pyramid, also called an “age-gender-pyramid”, is a graphical illustration that shows the distribution of various age groups in a population (typically that of a country or region of the world), which forms the shape of a pyramid when the population is growing [citation from Wikipedia].

In some cases, the pyramid provides interesting insights into the entire population. In this post, I will explore ways to make some of these insights more visible. 

The basic case

Let’s start with the basic case. If you have two-three hours of spare time, you can go to the site devoted to population pyramids — https://www.populationpyramid.net. There, you will find population pyramids for every country in the world. The site provides present and past data, as well as future forecasts. To understand how insightful age pyramids can be, look at the graph that represents the entire world.

(this and most other images in this post are from the site http://populationpyramid.net/)

You can clearly see that the world is mostly young, that the number of people declines as age increases, and that there is a rough balance between men and women in the world, at least before the ages of 70+.

Now, examine the stark difference between the populations of Western Africa and Western Europe. Citing the late professor Hans Rosling, we can still see two worlds, one with large families and short lives, and one with small families and long lives. 

Another striking example of an age pyramid is the following:

Do you want to guess which country this is? This particular graph shows the age distribution of the United Arab Emirates. Such a vast distortion in symmetry and age distribution stems from the fact that more than 80% of the UAE’s population is composed of expats who come to this rich country to work. The pyramid below (taken from [this article]) sheds some light on the population composition of the UAE. (Note that the genders in this graph are reversed).

Whose bar is longer?

The male-female imbalance in the UAE and some other Gulf countries is very striking and cannot be missed. But what about other, more subtle cases? Take a look at the world graph above. If you follow the numbers on the bars, you will notice that more boys are born than girls, but there are more old ladies than old gents in the world. Can we make such differences less subtle?

To answer this question, we need to understand why we find it hard to compare almost equal bars. The reason for that is that our eyes (or brains) are not so good at comparing sizes. They do, however, do a much better job comparing positions. Thus, if we overlap these bars, we will see the small differences in a much more precise manner. 

(I thank the data visualization expert Bella Graf from InfoServiz.co.il for the idea of this graph).

Now, the subtle differences in gender composition are more visible. 

What am I looking at?

When I teach data visualization, I always tell my students to add a meaningful title to the graph. By “meaningful,” I mean a title that does not answer the question “what” but rather “so what”? (See my posts “How to suck less in data visualization,” and “C for conclusion“). What would a good title for this graph be? Let’s try the following

OK, so now, when we have a title, we can ask ourselves, “does the graph show what it says it shows”? And the answer is no. Right now, the title talks about differences, but we don’t see the differences. We see the differences and other stuff. Let’s look only at the differences.

I don’t like this.

What about this?

Now, this is not an age pyramid. That’s for sure. This graph doesn’t show the wealth of data that the classical pyramid shows. On the other hand, it does offer one thing, and it does it very well. Look, for example, at the male/female distortion in China in 1990.
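The "show only the differences" idea can be sketched without any plotting library. The snippet below is a minimal illustration and not the code from the GitHub repository mentioned in this post. The function names are hypothetical, and the population numbers are made up purely to demonstrate the mechanics. Bars to the right of the zero line mean more males, and bars to the left mean more females.

```python
def gender_gap_percent(males, females):
    """Signed gap per age group: positive means more males, negative more females."""
    return [100 * (m - f) / (m + f) for m, f in zip(males, females)]


def ascii_gap_chart(labels, gaps, chars_per_percent=2, half_width=20):
    """Draw one diverging bar per age group around a common zero line."""
    lines = []
    for label, gap in zip(labels, gaps):
        bar = "#" * min(half_width, round(abs(gap) * chars_per_percent))
        left = bar.rjust(half_width) if gap < 0 else " " * half_width
        right = bar if gap > 0 else ""
        lines.append(f"{label:>6} {left}|{right}")
    return "\n".join(lines)


# Made-up population counts (in thousands), for illustration only.
labels = ["0-19", "20-39", "40-59", "60-79", "80+"]
males = [105, 102, 100, 90, 40]
females = [100, 100, 100, 100, 60]

gaps = gender_gap_percent(males, females)
print(ascii_gap_chart(labels, gaps))
```

Because every bar starts from the same zero line, the reader compares positions rather than the lengths of two almost equal bars.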

You may find the code I used to create the graphs in this post [on GitHub].

The Mysterious Status of .blog Domains

Photo by Bruno Bueno on Pexels.com

When the .blog TLD was started by Automattic, employees were given the option to reserve a domain for free. In return […], they asked that the domain be used as a primary domain (no forwarding to a different site), and that the site be updated with new content at least once a month. This requirement was the last argument for me NOT taking boris.blog — I didn’t want to make this commitment, plus I like gorelik.net a lot.

Recently, there were some not-so-nice developments regarding .blog names that were given away to Automatticians. The complaints about this situation are usually anonymous, but I think that in this case, anonymity isn’t the right approach. That is why I decided to share here an anonymous post from the Antimattic blog. Although I am not the author of the original post, and I don’t share the views of some of the posts written there, I do share the concerns expressed in this particular article. Posting in return for a domain name might have been a reasonable request at the beginning of the .blog TLD to help promote its adoption. But now, several years after this TLD became active, this requirement is simply not OK. To read the original post, click the screenshot below.

The first paragraph of this post is a verbatim copy from Antimattic.

ASCII histograms are quick, easy to use and to implement

From time to time, we need to look at a distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when most of my work was done in the console, and when creating a plot from Python required too many lines of boilerplate code, I found a neat function that produced histograms using ASCII characters.

Recently, I updated the Python function that I use to create ASCII histograms. The updated function [link] uses more modern formatting and includes several signal-to-noise improvements. One can also use it with custom output functions, such as logging.info.
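To give a feeling of how little code such a histogram needs, here is a minimal sketch. It is not the function linked above. The name `ascii_hist` and its parameters are hypothetical, and the binning is deliberately naive (equal-width bins between the minimum and the maximum).

```python
import logging


def ascii_hist(values, bins=5, width=40, out=print):
    """Print an equal-width histogram of `values`, one text line per bin.

    `out` is any callable that accepts a string, e.g. print or logging.info.
    Returns the list of bin counts.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / span * bins), bins - 1)  # the maximum goes to the last bin
        counts[idx] += 1
    peak = max(counts)
    step = span / bins
    for i, c in enumerate(counts):
        bar = "#" * round(c / peak * width)
        out(f"{lo + i * step:8.2f} | {bar} {c}")
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
ascii_hist(data)

# The same histogram, sent to the log instead of the console.
logging.basicConfig(level=logging.INFO)
ascii_hist(data, out=logging.info)
```

Passing the output function as a parameter is what makes the same code usable both in an interactive console and inside a logged batch job.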

A short compilation of productivity blog posts

Photo by Mike on Pexels.com

This post contains a bunch of links to blogs that write about productivity.

  1. Musings of Brown Girls

This is not exclusively a productivity blog. The authors of this collective effort write about other interesting things. I read some posts, and I liked them.

2. Self care

Do you know that feeling when you feel bad and don’t have the energy to do anything about that? This post is for you.

3. Saying NO

Being a freelancer, I have to practice saying NO. Saying NO isn’t only good for productivity but also for your mental health. Interesting post.

Many is not enough: Counting simulations to bootstrap the right way — Yanir Seroussi

An interesting post by my former coworker, Yanir Seroussi.

Previously, I encouraged readers to test different approaches to bootstrapped confidence interval (CI) estimation. Such testing can done by relying on the definition of CIs: Given an infinite number of independent samples from the same population, we expect a ci_level CI to contain the population parameter in exactly ci_level percent of the samples. Therefore, we […]
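The definition quoted above translates directly into a counting simulation. The snippet below is a minimal sketch under stated assumptions and is not code from Yanir’s post: it uses the plain percentile bootstrap for the mean, draws the data from a standard normal population, and runs only a few hundred simulations. All function names and parameter values are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(sample, ci_level=0.9, n_boot=200, rng=random):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample))) for _ in range(n_boot)
    )
    alpha = (1 - ci_level) / 2
    lower = means[int(alpha * n_boot)]
    upper = means[int((1 - alpha) * n_boot) - 1]
    return lower, upper


def coverage(true_mean=0.0, sample_size=20, n_sims=200, ci_level=0.9, seed=0):
    """Fraction of simulated samples whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(sample_size)]
        lower, upper = percentile_bootstrap_ci(sample, ci_level, rng=rng)
        hits += lower <= true_mean <= upper
    return hits / n_sims


print(f"nominal level: 0.90, observed coverage: {coverage():.3f}")
```

The observed coverage can then be compared with the nominal level. With so few simulations the estimate itself is noisy, which is exactly the point made in the title of the original post.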

Many is not enough: Counting simulations to bootstrap the right way — Yanir Seroussi

Book review: The Abyss: Bridging the Divide between Israel and the Arab World

TL;DR If you are an Israeli and don’t feel like learning the behind-the-scenes stories, skip it. Otherwise, I do recommend reading this book. I enjoyed it a lot. 4.5/5

The Abyss: Bridging the Divide between Israel and the Arab World went to print slightly after the outbreak of the “Arab Spring.” The author, Eli Avidar, is a former Israeli intelligence officer and diplomat. Among other things, Eli Avidar served as the head of the Israeli diplomatic mission to Qatar in 1999. Today, Eli Avidar is a Knesset member for the right-wing Yisrael Beiteinu party. Even though so many things have changed since the book was published, I didn’t find any claim by Eli Avidar that turned out to be wrong nine years after the publication.

I enjoyed reading this book a lot despite the fact that most of Eli Avidar’s claims are not new to me. Most of them are widely known to all Israelis, and the real question is not whether you are aware of these claims, but whether you agree with them and what conclusions you draw from them.

On the other hand, The Abyss is an interesting storybook full of behind-the-scenes anecdotes and gossip. All who know me know how much I like gossip. It also provides great insight into how (Jewish-)Israeli society sees the Arab-Israeli conflict, and how it feels about it.

Should you read the book? If you are an Israeli and don’t feel like learning the behind-the-scenes stories, you may skip it. Otherwise, I do recommend reading this book. I don’t know how accurate Avidar’s description of the Arab world is, but his analysis of Israeli behavior and attitudes is very accurate. If you ever caught yourself wondering “What the fuck do the Israelis think?”, this book might shed some light for you. That is why I write this review in English, despite my tendency to review Hebrew books on my Hebrew blog.

Fun fact. I finished reading this book on August the 13th. I closed the book, opened Twitter, and saw my feed FULL of news about the upcoming normalization treaty between Israel and the UAE.

What is the biggest problem of the Jet and Rainbow color maps, and why is it not as evil as I thought?

There was a consensus among the data visualization purists that the rainbow color map and its close cousin, Jet, are bad. Really bad. These colormaps used to be popular at the beginning of the computational data visualization era. However, their popularity decreased in the last five years or so. The sentiment isn’t as bad as it used to be a couple of years ago, but still.

A screenshot from circa 2016. Today we are less fanatic than that

What is the biggest problem of the rainbow colormap? The most apparent problem with this particular colormap is that it is not perceptually uniform. By “perceptually uniform,” I mean that equal changes in the value that we encode using a colormap should correspond to equal changes in the color perception. This is not the case with the rainbow or the Jet colormaps. They have distinct bright and dark stripes within the number range, making them the wrong choice to encode numerical data. The situation is even worse for people with impaired color vision.
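The bright and dark stripes are easy to demonstrate numerically. The snippet below is a rough sketch under simplifying assumptions: it uses a plain HSV hue sweep from blue to red as a stand-in for a rainbow colormap (it is not Jet itself), and it uses the sRGB relative luminance as a crude proxy for perceived lightness. The function names are hypothetical.

```python
import colorsys


def relative_luminance(r, g, b):
    """Relative luminance of an sRGB color with components in [0, 1]."""
    def linearize(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


def rainbow_luminance(steps=9):
    """Luminance along a blue-to-red hue sweep at full saturation and value."""
    result = []
    for i in range(steps):
        hue = (2 / 3) * (1 - i / (steps - 1))  # 2/3 is blue, 0 is red
        result.append(relative_luminance(*colorsys.hsv_to_rgb(hue, 1.0, 1.0)))
    return result


for name, lum in zip(["blue", "cyan", "green", "yellow", "red"], rainbow_luminance(5)):
    print(f"{name:>6}: {lum:.2f} {'#' * round(lum * 40)}")
```

The luminance goes up at cyan, dips at green, peaks at yellow, and drops sharply at red, while the encoded value grows steadily the whole time. A perceptually uniform colormap would show a steady change here.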

Can you be less perceptually uniform?

The solution to this problem was proposed in the form of better colormaps. The first one that I know of is Parula by MATLAB, and its open-source alternative, Viridis, which is available in matplotlib and many other plotting libraries. (Watch this video about Viridis to get a good introduction to color perception and color maps).

Viridis, the new rainbow

Everything was nice and good, and I was trashing the rainbow colormap whenever I could. Until yesterday, when I read about Turbo, the improved rainbow colormap developed by Google.

In the long and interesting blog post that describes Turbo, Anton Mikhailov, a software engineer at Google, describes several relevant applications of a “good rainbow” scheme.

According to Anton, “Because of rapid color and lightness changes, Jet accentuates detail in the background that is less apparent with Viridis and even Inferno. Depending on the data, some detail may be lost entirely to the naked eye. The background in the following images is barely distinguishable with Inferno (which is already punchier than Viridis), but clear with Turbo.”

I must admit that I’m convinced. 

The biggest problem that is usually mentioned concerning the original rainbow scheme is that its brightness varies too much. However, it turns out that the color saturation and hue attract our attention more than the lightness (here’s the reference which I haven’t read yet). As such, it makes sense to construct a colormap that relies more on color and hue changes.

Moreover, in many cases, the interesting details appear in the extreme values of the data range, not in the middle. In these cases, a properly applied rainbow-like color scheme becomes a valid choice.

The bottom line is that one should not refrain from using rainbow(-like) color maps in their visualizations anymore, provided that they use a modern implementation. Luckily, it’s even available in matplotlib.

If you don’t teach yet, start! It will make you a better professional.

Many people know me as a data scientist. However, I also teach, which is sort of unnoticed to many of my friends and colleagues. I created a page dedicated to my teaching activity. Talk to me if you want to organize a course or a workshop.

I also highly recommend teaching as a way of learning. So, if you don’t teach yet, start! It will make you a better professional.

How to suck less in data visualization and professional communication

In technical communication, the main thing is to keep the main thing the main thing. There are multiple ways to follow this principle. Some of these ways require careful chart fine-tuning. However, there is one tool that is easy to master, fast to apply, and that provides a high return on investment. I refer to chart titles. In this talk, I had two main theses. My first thesis is that most of you suck in communication (and not only data visualization).

My second thesis is that you can quickly improve your graphs by merely adding a good title. The importance of good titles is not new to my preaching, but I thought it was an excellent thing to formalize this thesis a bit, and I’m thankful to the NDR organizers for giving me this opportunity.

Following is the slide stack from my NDR presentation.

35 (and more) Ways Data Go Bad — Stats With Cats Blog

If you plan to work in data analysis or processing, read the excellent post on the “Stats With Cats” blog titled “35 Ways Data Go Bad”. I did experience each and every one of the 35 problems. However, this list is far from being complete. One should add the comprehensive list of Falsehoods Programmers Believe About Time.

When you take your first statistics class, your professor will be a kind person who cares about your mental well-being. OK, maybe not, but what the professor won’t do is give you real-world data sets. The data may represent things you find in the real world but the data set will be free of errors. […]

35 Ways Data Go Bad — Stats With Cats Blog

Unexpected hitch of working in a distributed team

Photo by Porapak Apichodilok on Pexels.com

It has been about half a year since I became a freelance data scientist. Before my career change, I worked in a distributed team for more than five years. Today, I suddenly realized that working in a distributed team has a significant problem, inherent to its distributed, multinational nature.

My team was always spread over multiple time zones. Sometimes, the time zone span was so broad that we could never find a time slot where all the team members were ordinarily awake. Automattic, the company I used to work for, is a firm believer in asynchronous communication, but from time to time, you HAVE to meet over a Zoom/Slack/Whatever call. Since I wasn’t a manager, the number of live calls that I had to attend was kept to a minimum, and yet, I found myself in a 10 pm Zoom call at least twice a week. I don’t know about you, but my brain keeps working for at least two hours after I log off. Thus, twice a week, I would find myself going to bed after one o’clock at night. As a result, I was sleep deprived for the majority of the week.

Only now have I noticed the fact that my sleep has improved so much after the career change. I know that people who work in “colocated” teams also find themselves in late night phone calls, but working in a distributed group means that you’ll do it regularly.

Hybrid digital/analog tangible week planning

Here’s a neat method that helps me organize my week, increase my productivity and fight procrastination. 

As a freelance data scientist, I’m involved in three hands-on projects for two clients. I also manage/mentor two data scientists in two other projects, and participate in strategic discussions for a customer of mine, and in a startup in which I invest. Oh, I am also in the final stages of writing a paper. I never imagined I would be in a situation with so many balls that I need to keep in the air. How do I manage to keep my sanity?

This is what I do. Following the advice in “15 Secrets Successful People Know About Time Management“, I try to keep as many items in my calendar as possible. When my workweek starts, I print out the weekly schedule on a sheet of paper. Then, I apply the tangible GTD hack that I learned from another book [link] and write out all my projects on a bunch of small post-it notes. These notes allow me to “dump” all my brain contents into an external medium, which frees up my brain to spend more CPU cycles on processing, rather than remembering and worrying.

Next comes the fun part: I get to play with my cards by arranging them on the weekly schedule. The geometry of the post-it notes and the sheet of paper ensures that I allocate reasonably large chunks of time for each “big thing.” It also reminds me that the amount of time each day is limited, and I can’t stick too many plans into a day or a week. (No, I won’t be able to finalize the paper, complete the analysis for a retail shop, and learn a chapter in a Bayesian statistics book before the end of today).

After I’m done, I copy each post-it note into my calendar. Thanks to the integration with Todoist (an excellent productivity tool), all these tasks end up in my todo list, where I can further work with them.

To sum up:

  • Global week overview – check
  • Prioritization and honesty – check.
  • Fun playing with sticky notes – check.
  • Work gets done – (I wish!).

Oh, did you notice the appointments between 5 and 6 am? This is my sports activity. Sometimes working out charges me for the entire day. Sometimes, all I want to do for the entire day is to have a nap 🙂

Before and after. Even excellent graphs can be improved

Being a data visualization consultant, I can’t help looking for dataviz problems in graphs that I see. Even if the graph is good. Even if I know that I would not be able to create a graph that good. Even if the overall graph is excellent, and the problems are minor, or maybe especially when the graph is excellent, and the problems are minor.

This is a nice graph published by Nevo Benita on LinkedIn.

The graph presents the gap between the men and the women in the Israeli job market. As I said, the graph is excellent. However, there are several small problems that, like grains of sand in a chocolate mousse, stand in the way. Let’s take a look at them.

The time-series line in the upper right part of the graph shows good use of the real estate. The problem is that the X-axis ticks (the years) look as if they belong to the chart below. It takes some time to realize that the numbers are years of the upper graph, and not the X-axis of the graph below. Moving the numbers upwards by several pixels would have fixed that.

Now, it is clearer that “1990” and “2018” relate to the time-series graph above.
Before (left) and after (right).

Let’s talk about the left-side bar chart. It took me a while to understand what it is. As a matter of fact, I managed to write a critique paragraph about that bar chart, about how it is unclear what the percentages are and how they were computed. Only then did I notice the explanation below. Such confusion isn’t the viewer’s fault. Since we usually scan images from top to bottom, moving the title to the top of the chart will reduce this confusion. The word “percent” is also redundant in that title since it comes after the percent sign.

Moving the explanation to the top makes it easier to notice. Before (left) and after (right)

The last point that is worth optimizing is the color order. Consistent element order in an image makes navigation and comprehension much easier. When the order is preserved, our brain can use mental shortcuts without losing much information. When these shortcuts are broken, the brain has to work harder. What am I talking about? The graph author made the correct decision to use different font colors in the graph title to specify which color stands for which gender. This way, we don’t need a separate legend, and this is good. The title is an ordered sequence of words. The visualizer could use this order to create the order heuristic that is so helpful. Such a heuristic isn’t always possible. Fortunately for the visualizer (and sadly for society), the salary gaps in all the occupations in this graph have the same direction: men earn more than women. As a result, the rightmost part has all the green dots on the right and all the purple dots on the left. This direction is opposite to the gender direction in the title and the color direction in the bar chart. To fix this situation, I made sure that the color that stands for the women (purple) is always to the left of the color that designates the men (green).

Keeping the color order. Before (left) and after (right)

So, this is the final result. I hope you can see why I like it better.

That’s how I took an excellent graph and made it even more awesome.

Data visualization is not only dots, bars, and pies

Look at this wonderful piece of data visualization (taken from here). If you know the terms “tertiary structure” and “glycan”, there is NO way you miss the message that the author of this figure wanted to convey.

Also, note how, by using appropriate colors in the title, the authors got rid of the graph legend.

How to become a Python professional in 42 hours?

Here’s an appealing ad that I saw


How to become a Python professional in 42 hours? I’ll tell you how. There is no way. I don’t know any field of knowledge in which one can become a professional after 42 hours. Certainly not Python. Not even after 42 days. Maybe after 42 weeks, if that’s mostly what you do and you are already a programmer.

Book review. Five Stars by Carmine Gallo

TL;DR Good motivation to improve communication. Inadequate source of information on how to achieve that 

The central premise of Five Stars Communication Secrets to Get from Good to Great by Carmine Gallo is that professionals who don’t invest in communication skills are at high risk of being replaced by computers and robots. One of the book’s sections bears a title that summarizes this premise very well: “Storytelling isn’t a soft skill; it’s the equivalent of hard cash.” I firmly believe in this premise. That is why I invest so much time in learning and teaching data visualization, in public speaking, and blogging.

When I started reading this book, I got excited. I kept marking one passage after another. Gallo packed the first part of the book with numerous citations and explanations on how a lack of communication skills is the most severe risk factor in the career of a modern professional, team, or company. One example led to another, and one smart conclusion followed another.

Then, I started noticing that the book tries to convince me more and more, but I didn’t need that convincing in the first place. More than half of the book is evangelism. The author tells you how essential communication skills are, then he gives you some examples of people who did it right, and then again talks about their importance. Again, and again, and again. Where are all those “secrets to get from good to great”???

When, finally, we get to the practical parts, the reader is left mostly with shallow, almost trivial bits of advice. 

Some of the most important points I took from this book

Slight feeling of a hamster-wheel while reading this book

Adopt the three-act storytelling approach to presentations. The three-act storytelling approach worked for Homer, Shakespeare, and Tarantino, and there is no reason it should fail you in your technical presentations. Fair enough. On the other hand, this 2012 article by Nancy Duarte provides more depth and more actionable information on this approach (follow Duarte’s blog if presentation skills are something you are interested in).

“In the first two to three minutes of a presentation, I want people to lean forward in their chairs.” I like this citation by Avinash Kaushik, Google’s digital marketing evangelist. I will undoubtedly try this approach in my next presentations.

Should you read this book?

If you read these lines, your job depends on your communication and presentation skills. If you believe this premise, you can skip the first 60% of the book. If you want to improve your communication skills, I suggest reading Jean-luc Doumont’s “Trees, Maps, and Theorems,” which is much shorter, but also much denser in methods and practical advice. 

The bottom line

3.5/5

The delicate art of fine trolling

Photo by Pixabay on Pexels.com

I’m reading a 1991 paper by Barbara Tversky that deals with the directional representation of time. One sentence in the paper says

“There does not seem to be strong universal cognitive associations of quantity or quality to left or right”

Whenever I make a similar statement in the context of data visualization, I frequently get a self-assured response “of course there is – smaller numbers appear on the left!”. To answer this remark, Barbara Tversky added a small footnote that says

“Anyone in doubt should consult politicians on both the left and the right.”

Photo by Pixabay on Pexels.com

So gentle, yet so powerful.

Lie factor in ad graphs

It’s fun to look at the visit statistics and to discover old stories. I wrote this post in 2016. For a reason I don’t know, this post has been one of the most viewed posts in my blogs during the last week. 

So, I decided to publish it again. I won’t add any new examples, but if you want to see more stuff, type [lying with data visualization] in your favorite search engine.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times in their phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice the stated 1:03. By the same measure, the green bar is 4:20 minutes rather than 1:35, and the light-blue one is approximately seven minutes, not 2:39, as the number says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
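One way to put a number on it is sketched below. I am assuming the common definition of the lie factor (the relative change shown in the graphic divided by the relative change in the data), with the first bar as the baseline. The competitor labels are hypothetical placeholders, and 420 seconds stands in for “approximately seven minutes.”

```python
def effect_size(value, baseline):
    """Relative change of a value with respect to a baseline."""
    return (value - baseline) / baseline


def lie_factor(drawn, stated, drawn_baseline, stated_baseline):
    """Effect shown in the graphic divided by the effect present in the data."""
    return effect_size(drawn, drawn_baseline) / effect_size(stated, stated_baseline)


# Waiting times in seconds.
# "stated" holds the numbers printed in the ad (1:01, 1:03, 1:35, 2:39).
# "drawn" holds what the bar heights imply if the first bar equals 61 seconds.
# The competitor names are hypothetical labels, not taken from the ad.
stated = {"Meuhedet": 61, "competitor A": 63, "competitor B": 95, "competitor C": 159}
drawn = {"Meuhedet": 61, "competitor A": 123, "competitor B": 260, "competitor C": 420}

lie_factors = {
    name: lie_factor(drawn[name], stated[name], drawn["Meuhedet"], stated["Meuhedet"])
    for name in stated
    if name != "Meuhedet"  # the baseline bar has no effect size of its own
}

for name, factor in lie_factors.items():
    print(f"{name}: lie factor = {factor:.1f}")
```

A lie factor of 1 means an honest graph. Under these assumptions, every competitor bar comes out well above 1.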

 

 

 

StellarGraph — another promising network analysis library for Python and Scala

Network (graph) analysis is a complicated topic. There are several tools available for this task, each with different pros and cons. Recently, I stumbled upon another tool, StellarGraph. The StellarGraph authors claim to provide excellent performance; integration with NumPy, Pandas, and TensorFlow; an impressive set of algorithms; interoperability with Neo4j (THE graph database); and much more. The documentation looks very clear and extensive, too.

I didn’t use it yet, but I certainly plan to.

https://www.stellargraph.io

The hazard of being a wizard. On the balance between specialization and the risk of becoming obsolete.

A wizard is a person who continually improves his or her professional skill in a particular and defined field. I learned about this definition of wizardness from the book “Managing project, people and yourself” by Nikolay Toverosky (the book is in Russian).  

Recently, Nikolay published an interesting post about the hazards of becoming a wizard. The gist of the idea is that while you are polishing your single skill to perfection, the world changes. You may find that your super-skill is no longer relevant (see my Soviet Shoemaker story).

Nikolay doesn’t give any suggestions. Neither do I. 

Below is the link to the original post. The post is in Russian, and you can use Google Translate to read it.

A page about wizards. My book has a chapter about commanders and wizards. At the end of it, I sum up: Despite all his coolness, a wizard is vulnerable. He is useful only if his skill fits the task. 658 more words

Почему опасно быть магом — Об управлении проектами и дизайне

Bioinformatics career advice and a story about a Soviet shoemaker

When I was in elementary school (back in the USSR of the mid-80s), I had a friend whose father was a shoemaker. Due to the crazy stupid way the Soviet economy worked, a Soviet shoemaker was much richer than a physician or an engineer. But this is not the story. The story is that one day this friend’s father had a chat with me about selecting a profession. This man’s point was that for as long as people have feet and need shoes on their feet, shoemaking would be a needed and well-paid occupation. Guess what? People still have feet and still wear shoes, but I don’t see too many successful shoemakers anymore.

Common wisdom says, “It is very hard to predict, especially the future.” And I will add, “even more especially, about the job market.” Nevertheless, people need to decide what to do with their lives, how to live, and what career paths to pursue. Some of them ask me, and I’m glad to answer. If you have any career-related questions, don’t be shy! Write to boris@gorelik.net, and I’ll see what wisdom I will be able to share with you.

Anyhow, this is a letter that I got from another pharmacist looking for a data science career.

Hope you are doing well. I saw your posts on Quora and thought of asking a doubt.
First let me tell my background. I am from India, I completed my Doctor of Pharmacy program (Pharm D). I am familiar with computer programming. I have intermediate knowledge in python and R programming.  So I thought taking up Bioinformatics and computational biology Masters program so that I can connect Pharma industry and my knowledge in computer science. 
What do you think? 
I have applied to University XYZ and got offer letter. I have to take a decision within 2 weeks.
Please let me know your thoughts on this.

To which I replied

Obviously, since the path you are describing is similar to the one I took, I think that it is a good idea. Moreover, as you might have read in my blog (for example, here), my opinion is that advanced degrees give much more stable foundations, compared to the “fast and easy” courses. Having said that, your life is yours, not mine, and the job market today is not the job market of 2001, when I received my B.Pharm.

Thank you so much for replying to my silly question. I am honoured to get a response from you. 

First of all, I don’t believe in “there are no silly questions” bullshit, but asking a silly question is better than not asking at all. Secondly, these questions are not silly at all.

I have a question, in your post dated 2017, you have mentioned that Bioinformatics was booming in 2001 and now it has lost its significance. Are you still have the same thoughts? 

I think that this person refers to the most visited post of mine “Don’t study data science as a career move; you’ll waste your time!”.  There is also a 2019 follow-up.

If that is the case then me taking a master’s in bioinformatics and computational genomics would be a bad idea, right ?

Here’s what I responded. Keep in mind that I wrote this before the COVID-19 outbreak.

Look, the markets in different countries are different. 

Back in the old days, there was a worldwide wave of closing bioinfo companies. All the Israeli ones were either closing or counting weeks before closing. One anecdote: I was interviewing at a company. Two weeks later, I called the person who interviewed me to ask whether I got the job or not, and the secretary told me that that person was fired due to layoffs. 

Right now, Israel sees a renaissance of bioinformatics companies, but I don’t know what will happen in the future. These companies live mostly out of investors’ money and are subject to strict regulations. However, if you get a good education, your head will be full of useful mental models, relevant basic knowledge, and good practices. 

End of quote. One of the side effects of the COVID-19 madness is the massive influx of money into biotech companies. Is this a short-term anecdote, or will it become a sustainable trend? I have no idea.

Do you have any career-related questions for me? You don’t have to be a pharmacist to ask :-). Write to boris@gorelik.net. I promise to respond, even if only by sending a link to my blog posts.

The difference between statistically meaningful and practically meaningful. An interview with me

Recently, I gave an interview to the Techie Leadership site. Andrei Crudu, the interviewer, made a helpful outline of the conversation. I marked the most important parts in bold.

  • Academic views on leadership;
  • Managing people isn’t for everyone;
  • Lessons from a practical approach;
  • Data Science is predominantly about data cleaning;
  • The difference between statistically meaningful and practically meaningful;
  • How sometimes companies tweak results to match expectations;
  • Bad managers make you appreciate the good managers;
  • Giving credit, being decent and not cheating;
  • All good teamwork starts with effective communication;
  • You don’t know that the stuff that you know is unknown to others;

Overall, I enjoyed chatting with Andrei, and I hope you’ll enjoy listening to the interview. If you have any comments, feel free to share them here or on the Techie Leadership site.

Is Distributed Work a Divide and Conquer Strategy?

Photo by Markus Spiske on Pexels.com

Before becoming a freelance data scientist, I used to work at Automattic, which I used to regard as my dream job. Not every current and ex-Automattician shares that rosy point of view. Antimattic is an anonymous blog that allows ex-Automattic employees to vent their feelings about what used to be their workplace. One recent post on that blog raises a fascinating question about distributed (or work from home, or remote) companies: “Is Distributed Work a Divide and Conquer Strategy?” I have to admit that I hadn’t thought about this perspective before. It looks like we will see more and more companies switching to remote work. It’s an interesting interpretation of the “future of work.”

Obviously this site exists because people have had negative experiences at Automattic. But many people have also had very positive experiences at the company. Could it be that the distributed nature of Automattic allows for such varying experiences? 45 more words

Is Distributed Work a Divide and Conquer Strategy? — Antimattic

Logarithmic scale misinforms. Period

Being a data scientist and a self-proclaimed data visualization expert, I like using log scale graphs when I find them appropriate. However, as a speaker and a communicator, I refrain from using them in presentations as much as possible. From my experience as a data visualization lecturer, I noticed that even “technical” people struggle to grasp the concept of log scale graphs.

One of the Coronavirus side effects was the introduction of the term “exponential growth” to every living room. Naturally (to some of us), exponential growth is best presented using a semi-log graph, where the X-axis represents time (linear), and the Y-axis represents the order of magnitude of a value (log scale).
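To see why a semi-log graph suits exponential growth, here is a small sketch with made-up numbers: a hypothetical case count that starts at 10 and doubles every three days. On a linear scale, the daily steps keep growing. After taking the logarithm, every daily step is the same size, which is what a straight line on a semi-log graph means.

```python
import math


def exponential_series(initial, doubling_time, days):
    """Values of a quantity that doubles every `doubling_time` days."""
    return [initial * 2 ** (day / doubling_time) for day in range(days)]


# Hypothetical numbers, chosen only for illustration.
cases = exponential_series(initial=10, doubling_time=3, days=10)
log_cases = [math.log10(value) for value in cases]

# Day-to-day differences on the linear scale and on the log scale.
linear_steps = [later - earlier for earlier, later in zip(cases, cases[1:])]
log_steps = [later - earlier for earlier, later in zip(log_cases, log_cases[1:])]

print("linear steps:", [round(step, 1) for step in linear_steps])
print("log10 steps: ", [round(step, 3) for step in log_steps])
```

The constant log step is exactly the property that many readers fail to translate back into “the numbers are exploding.”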

A recent study (link) tested and demonstrated how bad log-scale is. The research title is “The Logarithmic Scale Misinforms the Public and Affects Policy Preferences.” From my experience, log scale graphs misinform everybody. Except for experienced data scientists. Nothing can confuse or misinform us, obviously 😉

It is a bummer though that data visualization in that paper sucks so much.

Don’t publish graphs like this. Especially not in data visualization papers.

Thanks to Bella Graph who pointed me to the original study.

Book review: The Year Without Pants. WordPress.com and the future of work by Scott Berkun

TL;DR Interesting “history of work” book (definitely not “future of work”) with insights on transition-state organizations. Read it if history of work is your thing, or if you work in a small company that grows rapidly. 4.5/5 (due to the personal connection)



I got The Year Without Pants in 2014 as an onboarding present when I joined Automattic. The author, Scott Berkun, used to work as a manager at Microsoft (and maybe more places) before he quit and started a career as an adviser and an author. In 2011, the Automattic founder brought Scott to work at the company. About seventy people were working in the company back then, and the company was growing rapidly. Automattic had just introduced the concept of teams, and the idea was that Scott would work as a team leader while consulting the management on how to deal with the transition.

Being an ex-Microsoft manager, Scott was fascinated by the small distributed company, and wrote a book on it, proclaiming that the way Automattic worked was “the future of work”.

The book was published in 2013. Today, in post-COVID 2020, nobody is surprised by people who don’t need to go to the office every day. Automattic now has more than 1,000 employees and has adopted many of the rituals big companies have, such as endless meetings, tedious coordination, name tags, and corporate speak.

Why, then, did I enjoy the book? First, for me, it was a pleasant “time travel.” I enjoyed reading about people I knew, teams I worked with, and practices I used to love or hate. Secondly, this book provides insights on a transition from a small group of like-thinkers to a formalized organization.

“Why it burns when you P” and other statistics rants

Do you sometimes Google for something only to find stuff written by yourself?
I teach a course called “data-based decision making.” While googling for examples of statistics misuse, I stumbled upon an interesting blog post that I wrote about one and a half years ago.

The post is so good that I decided to post it again.

——————————

“Sunday grumpiness” is an SFW translation of a Hebrew phrase* that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.

“Why it burns when you P” is a five-year-old rant about p-values. It’s funny, informative, and easy to read.

“Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in the form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”

“Everything wrong with statistics (and how to fix it)” is a one-hour video lecture by Dr. Kristin Lennox, who talks about the same problems. I watched this video and two more talks by Dr. Lennox on a flight. I highly recommend all her videos on YouTube.

“Do You Hate Statistics as Much as Everyone Else?” is Nathan Yau’s (of flowingdata.com) attempt to get thoughtful comments from his knowledgeable readers.

This list will not be complete without the classics:

“Why Most Published Research Findings Are False”, “Mindless Statistics”, and “Cargo Cult Science”. If you haven’t read these three pieces of wisdom, you absolutely should. They will change the way you look at numbers and research.

*The literal meaning of שביזות יום א is Sunday dick-brokenness.

Visualising Odds Ratio — Henry Lau

Besides being a freelance data scientist and visualization expert, I teach. One of the toughest concepts to teach and to visualize is the odds ratio. Today, I stumbled upon a very interesting post that deals exactly with that.

On Thursday 7 May, the ONS published analysis comparing deaths involving COVID-19 by ethnicity. There’s an excellent summary on twitter but the headline is that when taking into account age and other socio-demographic factors, such as deprivation, household composition, education, health and disability, there is higher risk for some ethnic groups of a COVID related…

Visualising Odds Ratio — Henry Lau

Calling bullshit on “persistence leads to success”

Did you know that J.K. Rowling, the author of Harry Potter, submitted her book 13 times before it was accepted? Did you know that Thomas Edison tried again and again, even though his teachers thought he was “too stupid to learn anything”? Did you know that Lior Raz (Fauda’s creator and lead actor) was an anonymous actor for more than ten years before he broke the barrier of anonymity? What do all these people have in common? They persisted, and they succeeded. BUT, and there is a big but.


People keep telling us: follow your dream, and if you persist, it will come true. You will learn from your mistakes, improve, and adapt, and finally, you will reach your goal. I call bullshit.

Think of the Martingale betting strategy. In theory, it works. Why doesn’t it work in practice? Because nobody has infinite time and infinitely deep pockets. The same is true of chasing your dream. We need to pay for the roof over our heads, the food on our tables, and the clothes that we wear. Often, other people depend on us. Time passes by. I hate to be a party pooper, but some people who chase their dreams will eat up all their savings and will either have to give up or declare bankruptcy (and then give up).
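For those who like to see it in numbers, here is a minimal simulation sketch of the Martingale idea: double the bet after every loss and go back to the base bet after every win. The bankroll of 100, the fair coin, the 1,000 rounds, and the 1,000 players are arbitrary assumptions for illustration. A player “goes bust” when the next doubled bet is larger than what is left in the pocket.

```python
import random


def play_martingale(bankroll, base_bet=1, p_win=0.5, max_rounds=1_000, rng=None):
    """Double the bet after each loss; reset it after each win.

    Returns (final_bankroll, went_bust). A player goes bust when the
    bankroll can no longer cover the next bet.
    """
    rng = rng or random.Random()
    bet = base_bet
    for _ in range(max_rounds):
        if bet > bankroll:
            return bankroll, True  # finite pockets: cannot double again
        if rng.random() < p_win:
            bankroll += bet
            bet = base_bet
        else:
            bankroll -= bet
            bet *= 2
    return bankroll, False


# A crowd of hypothetical players with identical, finite pockets.
rng = random.Random(42)
results = [play_martingale(bankroll=100, rng=rng) for _ in range(1_000)]
bust_share = sum(went_bust for _, went_bust in results) / len(results)
print(f"Share of players who went bust: {bust_share:.0%}")
```

Try changing the bankroll or the number of rounds. Deeper pockets postpone the bust, but the longer you play, the more likely you are to meet a losing streak that you cannot afford.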

Survivorship bias

But what about all those successful failers? What we see is a typical example of survivorship bias, the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. We know the names Rowling, Edison, Raz, and others not because of their multiple failures but DESPITE them. For every Rowling, Edison, and Raz, there are thousands of failed writers, engineers, and actors who ended up broke and caused sorrow to their families.

So, should I quit?

I don’t know. Maybe. Maybe not. It’s your life, your decision.

On a person that falls into the water. Or why thinking short-term is a good strategy in times of crisis

Photo by Life Of Pix on Pexels.com

At the beginning of the COVID-19 crisis, I tried to explain to my daughter (and to myself) the rationale behind the draconian measures that governments take to fight the crisis. One rationalization that I found was an analogy of a person that falls into the water. In this situation, the person needs to act FAST to stabilize the situation. Only then can he or she start planning the next steps.

I have been very vocal in criticizing the dramatic measures that many governments took at the beginning of this crisis. It looks like these measures were more or less correct, and that the countries that didn’t implement them are now in a much worse situation, compared to the countries that did impose severe limitations. But even if, in retrospect, it turns out that one could have done much better without the many “hammers,” I tend to think that those hammers were inevitable.

The conclusion? One day or another, we will all need to act very fast. This means that we need to be prepared, have plan B’s, work on resilience, and maybe perform emergency drills.

Bad advice from a reputable source is bad advice.

Would you buy a grammar book with a clear spelling mistake on its cover? I hope not. That’s what happened to IBM when it published its new data visualization guide. I didn’t bother reading the manual because of what IBM decided to use as the first image of their guide.

We use graphs to turn information into images that our brains are later supposed to turn back into information. What visual attributes do we use to interpret the information behind a pie chart? Is it the segment angle, its area, or maybe the arc length? Most probably, the answer is “all of the above” (see Robert Kosara’s works for more info). When done right, the three attributes of pie segments are linearly related to one another, which allows synergy between the visual cues.
But what did our friends at IBM do? They deliberately distorted the data! I took the screenshot from the guide homepage and made some measurements.
The purple segment has an angle of 182 degrees, and the angle of the black segment is 75 degrees, which gives us a ratio of 2.43. However, while the radius of the purple segment is 135 pixels, the radius of the black one is only 110 pixels. Why is this a problem? Well, due to the radius difference, the ratio between the arc lengths is 2.98, and the ratio between the areas is 3.66. So now, let me ask you: what is the ratio between the numbers represented by the purple and the black segments?
It is true that the colors that the IBM people used in their guide are neat, but a data visualization that distorts information is not visualization but a piece of garbage. I assume that IBM produces decent computers, but don’t learn data visualization from them.
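If you want to check the arithmetic, here is a short sketch that recomputes the three ratios from the angles and radii quoted above. It relies only on textbook sector geometry (arc length is radius times angle, area is half the squared radius times the angle, with the angle in radians). The function and variable names are mine.

```python
import math


def sector_metrics(angle_deg, radius):
    """Angle, arc length, and area of a circular sector."""
    theta = math.radians(angle_deg)
    return {
        "angle": angle_deg,
        "arc_length": radius * theta,
        "area": 0.5 * radius ** 2 * theta,
    }


# Measurements quoted in the text: angle in degrees, radius in pixels.
purple = sector_metrics(angle_deg=182, radius=135)
black = sector_metrics(angle_deg=75, radius=110)

ratios = {name: purple[name] / black[name] for name in purple}

for name, ratio in ratios.items():
    print(f"purple/black {name} ratio: {ratio:.2f}")
```

When both segments share the same radius, all three ratios are identical. Three different answers to the same question are the signature of a distorted pie.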

Why is it (almost) impossible to set deadlines for data science projects?

I wrote this post in 2017. For some reason, it started gaining traffic in the last two weeks. I reviewed this post and couldn’t find any new insights. But maybe you can help me.

Boris Gorelik

In many cases, attempts to set a deadline for a data science project result in a complete fiasco. Why is that? Why can managers in many software projects come up with a reasonable time estimate for completion, while in most data science projects they can’t? The keys to answering this question are complexity and, to a greater extent, missing information. By “complexity” I don’t (only) mean the computational complexity. By “missing information” I don’t mean dirty data. Let us take a look at these two factors, one by one.

Complexity

Illustration: famous xkcd comic. Two programmers play during the compilation time
Think of this. Why do most properly built bridges remain functional for decades and sometimes for centuries, while the rule in every non-trivial program is that “there is always another bug”? I read this analogy in Joel Spolsky’s post written in 2001. The answer Joel provides is:

Once you’ve written a subroutine, you can call it as often as you…

View original post 665 more words


Finally We May Have a Path to the Fundamental Theory of Physics… and It’s Beautiful — Stephen Wolfram Blog

OK, so Stephen Wolfram (a mega celebrity in the computational intelligence world and, among other things, a physicist) claims that he may have found a path to the Fundamental Theory of Physics. The blog post is long, and I hope to be able to finish reading it in a week or two. The accompanying technical text is a 450-page tome available on a dedicated site.

Also, it turns out that Stephen Wolfram has a Twitch.tv channel in which he talks about science.

Website: Wolfram Physics Project Technical Intro: A Class of Models with the Potential to Represent Fundamental Physics How We Got Here: The Backstory of the Wolfram Physics Project… 26,455 more words

Finally We May Have a Path to the Fundamental Theory of Physics… and It’s Beautiful — Stephen Wolfram Blog

The quintessence of data visualization usefulness

I have to admit, I was skeptical at the beginning of the COVID-19 crisis. I started becoming skeptical again now that it seems that the crisis didn’t hit my country too hard. But then I saw the graphs in this Financial Times article, and the skepticism disappeared. The graphs are accompanied by hundreds of words, but there is no need to read the text to understand almost everything.

These graphs are so good, so convincing, so well executed that they don’t leave any room for doubt or misunderstanding of the message the author wants to convey.

If you study data visualization, look at these graphs. Look at the color choice, legend location, and design. Look at the ticks on the X- and Y-axes, how they are spaced and typeset. Note the amount of detail on the axes, specifically how sparse it is.

Book review: Never Split the Difference by Chris Voss

TL;DR: Dull on the surface but has a lot of good points


I read Never Split the Difference following a friend’s recommendation. While reading the book, I kept feeling a constant sense of disappointment and mental eye-rolling. The author, Chris Voss, is a former FBI negotiator. The book is full of FBI war stories and pieces of advice that, on the face of it, sound either trivial or well known. HOWEVER, when the book was over, I sat down to summarize my Kindle notes. Forty-five minutes later, I found myself staring at six pages of handwritten notes and takeaways. Which, surely, is a good sign.

What I didn’t like: too many “war stories” from the author’s past as an FBI negotiator; their connection to the business world sometimes seems too far-fetched.

What I liked: I liked the overall approach. Sometimes, the author cites academic research. Again, the fact that I took so many notes is very impressive (to me).

The bottom line: 4/5 Read it, even if you already read a negotiation book.

The missing graves

Today, Israel marks Holocaust Day. Many words have been written about the Holocaust, and I want to write about missing graves.
If you visit a Jewish cemetery, you might see a lot of gravestones with additional memorial plates.

I took this picture in the Chișinău (Kishinev) Jewish cemetery. Burial of the deceased is considered the final act of kindness a person can perform for the dead. Erecting a “reminder and a name” (Yad-va-Shem), i.e., a gravestone, is an intrinsic part of the burial. The Hebrew term for this act of kindness is “Chesed shel emet”, the truthful kindness. Many people died during the Holocaust without a grave, without a gravestone, and without any sign of kindness around them. That is why, when the Holocaust survivors started passing away after the war, their relatives decided to perform this final act of kindness by adding the names of those who did not have the fortune to have their own grave.

This is the gravestone of my grandmother’s sister Etl (Ester). The lower plate is a list of eleven relatives who never had a grave.

Why is forecasting s-curves hard?

Constance Crozier (@clcrozier on Twitter) shared an interesting simulation in which she tried to fit a sigmoid curve (s-curve) to predict a plateau in a time-series. The result was a very intuitive and convincing animation that shows how wrong her initial forecasts were.

As a matter of fact, this phenomenon is not new at all. My first post-University job involved fitting numerous pharmacodynamics models. We always had to keep in mind that if the available data does not account for at least 95% of the maximum effect, the model will be very much suboptimal. It took me a while, but I managed to find the reference for this phenomenon [here]. Maybe, when I have some time, I will repeat Constance Crozier’s analysis and add confidence intervals to emphasize the point.

EDIT: I came to the conclusion that the most important takeaway message of this demonstration is the necessity of reporting uncertainty with any forecast, and how small the value of a forecast is without uncertainty estimations.
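Here is a small sketch of why early data says so little about the plateau. It is not a reproduction of Constance Crozier’s simulation. I assume a plain logistic curve, and all parameter values are made up. The two curves below have plateaus that differ by a factor of two, yet during the first ten time units they are practically indistinguishable, because in the early phase a higher plateau can be compensated by a later midpoint.

```python
import math


def logistic(t, plateau, rate, midpoint):
    """Plain logistic s-curve; the functional form is my assumption."""
    return plateau / (1 + math.exp(-rate * (t - midpoint)))


rate = 0.5
# Curve A: plateau of 100. Curve B: twice the plateau, with the midpoint
# shifted later by ln(2) / rate, so the early exponential phases coincide.
curve_a = dict(plateau=100, rate=rate, midpoint=20)
curve_b = dict(plateau=200, rate=rate, midpoint=20 + math.log(2) / rate)

early_times = range(0, 11)  # the data we would have "today"
max_early_gap = max(
    abs(logistic(t, **curve_b) / logistic(t, **curve_a) - 1) for t in early_times
)
late_ratio = logistic(40, **curve_b) / logistic(40, **curve_a)

print(f"Largest relative gap for t <= 10: {max_early_gap:.2%}")
print(f"Ratio of the two curves at t = 40: {late_ratio:.2f}")
```

Any realistic amount of noise will swamp the early difference, so a fit to early data cannot tell these two futures apart. That is one more reason to report uncertainty with every forecast.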

S-curves (or sigmoid functions) are commonly used to model the evolution of social or biological systems over time [1]. These functions start with exponential growth, then increase linearly, and finally level off (therefore end up looking like a wonky s). Many things that we think of as exponential functions will actually follow an s-curve (otherwise […]

Forecasting s-curves is hard — Constance Crozier

On organizing a data org in a company, job titles, and more

Photo by Khimish Sharma on Pexels.com

My colleague, Simon Ouderkirk, recorded a REALLY interesting interview with Stephen Levin of Zapier and Emilie Schario of GitLab on organizing a data org in a company, job titles, career ladders, and other important stuff.

As y’all may recall, last year I was lucky enough to spens some time working with the fine folks at Locally Optimistic to produce and run some AMA content for them – they ended up being more similar to traditional interviews, but folks seemed to enjoy them! You can find those all here! These were […]

I’m Giving Video Content a Try! — Simon Ouderkirk

If there is only one document you can read about data visualization, this is the one

I’m sorting my teaching material, and I found this gem. The UK Government Statistical Service published a guideline for effective data visualization and tables. If you know a busy person who doesn’t have time to study data visualization and can only read one document, this document is for them (it has fewer than 40 pages full of examples). Click on the image above to go to the guideline.

Data giraffe is sometimes a feature, not a problem

I wrote about data giraffes two weeks ago. Usually, “data giraffes” are a problem, and we need to work hard to solve it. Sometimes, they are a useful feature. Take a look at this NYT front page that shows the number of new unemployment applications in the United States over time.

And this is the pseudochart version of the same data.

Credits: I found these examples on Scott Berkun’s page.

Everything is NOT just fine (repost)

My job wasn’t affected by the COVID madness in almost any way. I used to work from home before, and I work from home now. None of my customers cancelled any projects, the health system in Israel is still functioning, and all of my relatives are in good health. Everything is just fine! I know how unusual I am in the current world, with its skyrocketing unemployment, non-functioning governments, and three-digit body counts. I was about to write about that, but then I read AnnMaria’s post.

You should read it too

I’ve read a lot of cheery tweets that said something like, “Buffy, Biff and I are isolated at home with our terrier, Boo. Here’s a picture. Isn’t he cute? We played card games, then I baked this three-course meal I saw on Pinterest. Biff is taking this time to finally become proficient in Mandarin with…

Everything is NOT just fine — AnnMaria’s Blog

Blogging isn’t what it used to be. Podcasting is on the rise

Photo by Magda Ehlers on Pexels.com

More than two years ago, I took a look at Google Trends for three phrases “start a blog”, “create a blog”, and  “create a site”. I was surprised by the high volume of blog searches, compared to “create a site”.

Today, I decided to go back to Google Trends and to add the new rising star: podcasting. 

It looks like podcasting is starting its exponential growth, while blogging continues its slow but steady decline. I will be unsurprised if, in 2022, the green podcasting line surpasses the other lines in this graph. Let’s wait and see.

A super-important read on the COVID-19 situation. I'm finally convinced

Until now, I was very sceptical about the COVID-19 measures taken by many governments around the world, especially the Israeli one. Today, finally, I read a post that addressed the issues I was pointing to:

  1. This first lockdown will last for months, which seems unacceptable for many people.
  2. A months-long lockdown would destroy the economy.
  3. It wouldn’t even solve the problem, because we would be just postponing the epidemic: later on, once we release the social distancing measures, people will still get infected in the millions and die.
  4. My biggest concern: Either a lot of people die soon and we don’t hurt the economy today, or we hurt the economy today, just to postpone the deaths.

There’s no point in rephrasing the original post here; just go and read it. I’m convinced. Thank you, Tomas Pueyo.

Go and read. The image is clickable