The third wave data scientist – a useful point of view

diagram that shows "business mindset" in the middle, surrounded by three segments: "soft skills" "statistics toolbox" and "software engineering craftsmanship"

In 2019, it’s hard to find a data-related blogger who doesn’t write about the essence and the future of data science as a profession. Most of these posts (like this one for example) are mostly useless both for existing data scientists who think about their professional plans and for people who consider data science as their career.

Today I saw yet another post which I find very useful. In this post, Dominik Haitz identifies a “third wave data scientist.” In Dominik’s opinion, a successful data scientist has to combine four features: (1) Business mindset (2) Software engineering craftsmanship (3) Statistics and algorithmic toolbox, and (4) Soft skills. In Dominik’s classification, the business mindset is not “another skill” but the central pillar.

The professional challenges that I have been facing during the past eighteen months or so, made me realize the importance of points 1, 2, and 3 from Dominik’s list (number 4 was already very important on my personal list). However, it took reading his post to put the puzzle parts in place.

Dominik’s additional contribution to the discussion is ditching the famous data science Venn Diagram in favor of another, “business-oriented” visual which I used as the “featured image” to this post.

Painting: sailors in a wavy sea
A fragment from an 1850 painting by the Russian Armenian marine painter Ivan Aivazovsky named “The Ninth Wave.” I wonder what the “ninth wave data scientist” will be.

To specialize, or not to specialize, that is the data scientists’ question

In my last post on data science career, I heavily promoted the idea that a data scientist needs to find his or her specialization. I back my opinion with my experience and by citing other people opinions. However, keep in mind that I am not a career advisor, I never surveyed the job market, and I might not know what I’m talking about. Moreover, despite the fact that I advocate for specialization, I think that I am more of a generalist.

Since I published the last post, I was pointed to some other posts and articles that either support or contradict my point of view. The most interesting ones are: “Why you shouldn’t be a data science generalist” and “Why Data Science Teams Need Generalists, Not Specialists“, both are very recent and very articulated but promote different points of view. Go figure

The featured image is based on a photo by Tom Parsons on Unsplash

The data science umbrella or should you study data science as a career move (the 2019 edition)?

Illustration: photo of people under umbrellas

TL/DR: Studying data science is OK as long as you know that it’s only a starting point.

Almost two years ago, I wrote a post titled “Don’t study data science as a career move.” Even today, this post is the most visited post on my blog. I was reminded about this post a couple of days ago during a team meeting in which we discussed what does a “data scientist” mean today. I re-read my original post, and I think that I was generally right, but there is a but…

The term “data science” was born as an umbrella term that meant to describe people who know programming, statistics, and business logic. We all saw those numerous Venn diagrams that tried to describe the perfect data scientist. Since its beginning, the field of “data science” has finally matured. There are more and more people that question the mere definition of data science.

Here’s what an entrepreneur Chuck Russel has to say:

Now don’t get me wrong — some of these folks are legit Data Scientists but the majority is not. I guess I’m a purist –calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test the hypothesis with experimental results and after proving or disproving the conjecture move on or iterate.

Screenshot of a Google image search showing many Venn diagrams
There can’t be enough Venn diagrams

Now, “create and test hypotheses” is a very vague requirement. After all, any A/B test is a process of “creating and testing hypotheses” using data. Is anyone who performs A/B tests a data scientist? I think not.
Moreover, a couple of years ago, if you wanted to run an A/B test, perform a regression analysis, build a classifier, you would have to write numerous lines of code, debug and tune it. This tedious and intriguing process certainly felt very “sciency,” and if it worked, you would have been very proud of our job. Today, on the other hand, we are lucky to have general-purpose tools that require less and less coding. I don’t remember the last time I had to implement an analysis or an algorithm from the first principles. With the vast amount of verified tools and libraries, writing an algorithm from scratch feels like a huge waste of time.
On the other hand, I spend more and more time trying to understand the “business logic” that I try to improve: why has this test fail? Who will use this algorithm and what will make them like the results? Does effort justify the potential improvement?

I (a data scientist) have all this extra time to think of a business logic thanks to the huge arsenal of generalized tools to choose from. These tools were created mostly by those data scientists whose primary job is to implement, verify, and tune algorithms. My job and the job of these data scientists is different and requires different sets of skills.

There is another ever-growing group of professionals who work hard to make sure someone can apply all those algorithms to any amount of data they feel suitable. These people know that any model is at most as good as the data it is based on. Therefore, they build systems that deliver the right information on time, distribute the data among computation nodes, and make sure no crazy “scientist” sends a production server to a non-responsive state due to a bad choice of parameters. We already have a term for professionals whose job is to build fail-proof systems. We call them engineers, or “data engineers” in this case.

The bottom line

Up till now, I mentioned three major activities that used to be covered by the data science umbrella: building new algorithms, applying algorithms to business logic, and engineering reliable data systems. I’m sure there are other areas under that umbrella that I forgot. In 2019, we reached the point where one has to decide what field of data science does one want to practice. If you consider stying data science think of it as studying medicine. The vast majority of physicians don’t end up general practitioners but rather invest at least five more years of their lives professionalize. Treat your data science studies as an entry ticket into the life-long learning process, and you’ll be OK. Otherwise, (I’m citing myself here): You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.

PS. Here’s a one-week-old article on Forbes.com with very similar theses: link.

Chișinău Jewish cemetery

Two years ago I visited Chișinău (Kishinev), the city in Moldova where I was born and where I grew up until the age of fifteen. Today I saw a post with photos from the ancient Chișinău Jewish cemetery and recalled that I too, took many pictures from that sad place. Less than half of the original cemetery survived to these days. The bigger part of it was demolished in the 1960s in favor of a park and a residential area. If you scroll through the pictures below, you will be able to see how they used tombstones to build the park walls.

Another notable feature of many Jewish cemeteries is memorial plates in memoriam of the relatives who don’t have their own graves — the relatives who were murdered over the course of the Jewish history.

בניית אתרים עם תמיכה בארץ

מדי פעם אנשים ששומעים שאני עובד בחברה שמפעילה את וורדפרקס.קום מבקשים ממני עזרה אם בניית האתר שלהם. אני חוקר נתונים, לא בונה אתרים. ברור שהחברה בה אני עובד עושה המון מאמצים כדי לאפשר לאנשים לבנות אתרים בעצמם, אבל לפעםמים אנשים צריכים להאציל את הסמכות הזאת למומחים, רוצים גמישות ושליטה וגם תמיכה. אני מכיר אישית את דידי אריאלי מהאתר ״קליקי בניית אתרים״ שעושה בדיוק את זה: בנייה ותחזוקת אתרים מותאמים אישית. מה שנחמד הוא שדידי נאמן לעקרונות הקוד הפתוח: הלקוח לא קשור אליו ושומר על השליטה בתוכן ובקוד של האתר.

דרך אגב, באתר של ״קליקי״ יש גם בלוג עם פרטי מידע שימושיים לבוני האתרים בוורדפרס

נ.ב. אני מכיר את דידי אישית אבל אין לי אתו קשרי עסקים. אני לא מרוויח שום דבר מהפוסט הזה.

How to Increase Retention and Revenue in 1,000 Nontrivial Steps

The journey of a thousand miles begins with one step. My coworker, Yanir Seroussi, wrote about the work of data scientists in the marketing team.

Data for Breakfast

Recently, Automattic created a Marketing Data team to support marketing efforts with dedicated data capabilities. As we got started, one important question loomed for me and my teammate Demet Dagdelen: What should we data scientists do as part of this team?

Even though the term data science has been heavily used in the past few years, its meaning still lacks clarity. My current definition for data science is: “a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.” This is a very broad definition that offers a vague direction for what marketing data scientists should do. Indeed, many ideas for data science work were thrown around when the team was formed. Because Demet and I wanted our work to be proactive and influential, we suggested a long-term marketing data science…

View original post 2,068 more words

On procrastination, or why too good can be bad

Illustration: a plastic gun and letters that say "bang"

I’m a terrible procrastinator. A couple of years ago, I installed RescueTime to fight this procrastination. The idea behind RescueTime is simple — it tracks the sites you visit and the application you use and classifies them according to how productive you are. Using this information, RescueTime provides a regular report of your productivity. You can also trigger the productivity mode, in which RescueTime will block all the distractive sites such as Facebook, Twitter, news sites, etc. You can also configure RescueTime to trigger this mode according to different settings. This sounded like a killer feature for me and was the main reason behind my decision to purchase a RescueTime subscription. Yesterday, I realized how wrong I was.

RescueTime logo

When I installed RescueTime, I was full of good intentions. That is why I configured it to block all the distractive sites for one hour every time I accumulate more than 10 minutes of surfing such sites. However, from time to time, I managed to find a good excuse to procrastinate. Although RescueTime allows you to open a “bad” site after a certain delay, I found this delay annoying and ended up killing the RescueTime process (killing a process is faster than temporary disabling a filter). As a result, most of my workday stayed untracked, unmonitored, and unfiltered.

So, I decided to end this absurd situation. As of today, RescueTime will never block any sites. Instead of blocking, I configured it to show a reminder and to open my RescueTime dashboard, as a reminder to behave myself. I don’t know whether this non-intrusive reminder will be effective or not but at least I will have correct information about my day.

“Why it burns when you P” and other statistics rants

Illustration: computer terminal window that says "p-value=0.55001 is GREATER than 0.05"

“Sunday grumpiness” is an SFW translation of Hebrew phrase that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.

Why it burns when you P” is a five-years-old rant about P values. It’s funny, informative and easy to read

Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in a form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”

Everything wrong with statistics (and how to fix it)” is a one-hour video lecture by Dr. Kristin Lennox who talks about the same problems. I saw this video, and two more talks by Dr. Lennox on a flight I highly recommend all her videos on YouTube.

Do You Hate Statistics as Much as Everyone Else?” — A Natan Yau’s (from flowingdata.com) attempt to get thoughtful comments from his knowledgeable readers.

This list will not be complete without the classics:

Why Most Published Research Findings Are False“, “Mindless Statistics“, and “Cargo Cult Science“. If you haven’t read these three pieces of wisdom, you absolutely should, they will change the way you look at numbers and research.

*The literal meaning of שביזות יום א is Sunday dick-brokenness.

Hackers beware: Bootstrap sampling may be harmful

Anything is better when bootstrapped. Read my co-worker’s post on bootstrapping. Also make sure following the links Yanir gives to support his claims

Yanir Seroussi

Bootstrap sampling techniques are very appealing, as they don’t require knowing much about statistics and opaque formulas. Instead, all one needs to do is resample the given data many times, and calculate the desired statistics. Therefore, bootstrapping has been promoted as an easy way of modelling uncertainty to hackers who don’t have much statistical knowledge. For example, the main thesis of the excellent Statistics for Hackers talk by Jake VanderPlas is: “If you can write a for-loop, you can do statistics”. Similar ground was covered by Erik Bernhardsson in The Hacker’s Guide to Uncertainty Estimates, which provides more use cases for bootstrapping (with code examples). However, I’ve learned in the past few weeks that there are quite a few pitfalls in bootstrapping. Much of what I’ve learned is summarised in a paper titled What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum by Tim…

View original post 1,462 more words