In 2019, it’s hard to find a data-related blogger who doesn’t write about the essence and the future of data science as a profession. Most of these posts (like this one for example) are mostly useless both for existing data scientists who think about their professional plans and for people who consider data science as their career.
Today I saw yet another post which I find very useful. In this post, Dominik Haitz identifies a “third wave data scientist.” In Dominik’s opinion, a successful data scientist has to combine four features: (1) Business mindset (2) Software engineering craftsmanship (3) Statistics and algorithmic toolbox, and (4) Soft skills. In Dominik’s classification, the business mindset is not “another skill” but the central pillar.
The professional challenges that I have been facing during the past eighteen months or so, made me realize the importance of points 1, 2, and 3 from Dominik’s list (number 4 was already very important on my personal list). However, it took reading his post to put the puzzle parts in place.
Dominik’s additional contribution to the discussion is ditching the famous data science Venn Diagram in favor of another, “business-oriented” visual which I used as the “featured image” to this post.
In my last post on data science career, I heavily promoted the idea that a data scientist needs to find his or her specialization. I back my opinion with my experience and by citing other people opinions. However, keep in mind that I am not a career advisor, I never surveyed the job market, and I might not know what I’m talking about. Moreover, despite the fact that I advocate for specialization, I think that I am more of a generalist.
TL/DR: Studying data science is OK as long as you know that it’s only a starting point.
Almost two years ago, I wrote a post titled “Don’t study data science as a career move.” Even today, this post is the most visited post on my blog. I was reminded about this post a couple of days ago during a team meeting in which we discussed what does a “data scientist” mean today. I re-read my original post, and I think that I was generally right, but there is a but…
The term “data science” was born as an umbrella term that meant to describe people who know programming, statistics, and business logic. We all saw those numerous Venn diagrams that tried to describe the perfect data scientist. Since its beginning, the field of “data science” has finally matured. There are more and more people that question the mere definition of data science.
Here’s what an entrepreneur Chuck Russel has to say:
Now don’t get me wrong — some of these folks are legit Data Scientists but the majority is not. I guess I’m a purist –calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test the hypothesis with experimental results and after proving or disproving the conjecture move on or iterate.
Now, “create and test hypotheses” is a very vague requirement. After all, any A/B test is a process of “creating and testing hypotheses” using data. Is anyone who performs A/B tests a data scientist? I think not. Moreover, a couple of years ago, if you wanted to run an A/B test, perform a regression analysis, build a classifier, you would have to write numerous lines of code, debug and tune it. This tedious and intriguing process certainly felt very “sciency,” and if it worked, you would have been very proud of our job. Today, on the other hand, we are lucky to have general-purpose tools that require less and less coding. I don’t remember the last time I had to implement an analysis or an algorithm from the first principles. With the vast amount of verified tools and libraries, writing an algorithm from scratch feels like a huge waste of time. On the other hand, I spend more and more time trying to understand the “business logic” that I try to improve: why has this test fail? Who will use this algorithm and what will make them like the results? Does effort justify the potential improvement?
I (a data scientist) have all this extra time to think of a business logic thanks to the huge arsenal of generalized tools to choose from. These tools were created mostly by those data scientists whose primary job is to implement, verify, and tune algorithms. My job and the job of these data scientists is different and requires different sets of skills.
There is another ever-growing group of professionals who work hard to make sure someone can apply all those algorithms to any amount of data they feel suitable. These people know that any model is at most as good as the data it is based on. Therefore, they build systems that deliver the right information on time, distribute the data among computation nodes, and make sure no crazy “scientist” sends a production server to a non-responsive state due to a bad choice of parameters. We already have a term for professionals whose job is to build fail-proof systems. We call them engineers, or “data engineers” in this case.
The bottom line
Up till now, I mentioned three major activities that used to be covered by the data science umbrella: building new algorithms, applying algorithms to business logic, and engineering reliable data systems. I’m sure there are other areas under that umbrella that I forgot. In 2019, we reached the point where one has to decide what field of data science does one want to practice. If you consider stying data science think of it as studying medicine. The vast majority of physicians don’t end up general practitioners but rather invest at least five more years of their lives professionalize. Treat your data science studies as an entry ticket into the life-long learning process, and you’ll be OK. Otherwise, (I’m citing myself here): You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.
PS. Here’s a one-week-old article on Forbes.com with very similar theses: link.
Please leave a comment to this post. It doesn’t matter what, it can be a simple Hi or an interesting link. It doesn’t matter when or where you see it. I want to see how many real people are actually reading this blog.
Two years ago I visited Chișinău (Kishinev), the city in Moldova where I was born and where I grew up until the age of fifteen. Today I saw a post with photos from the ancient Chișinău Jewish cemetery and recalled that I too, took many pictures from that sad place. Less than half of the original cemetery survived to these days. The bigger part of it was demolished in the 1960s in favor of a park and a residential area. If you scroll through the pictures below, you will be able to see how they used tombstones to build the park walls.
Another notable feature of many Jewish cemeteries is memorial plates in memoriam of the relatives who don’t have their own graves — the relatives who were murdered over the course of the Jewish history.
מדי פעם אנשים ששומעים שאני עובד בחברה שמפעילה את וורדפרקס.קום מבקשים ממני עזרה אם בניית האתר שלהם. אני חוקר נתונים, לא בונה אתרים. ברור שהחברה בה אני עובד עושה המון מאמצים כדי לאפשר לאנשים לבנות אתרים בעצמם, אבל לפעםמים אנשים צריכים להאציל את הסמכות הזאת למומחים, רוצים גמישות ושליטה וגם תמיכה. אני מכיר אישית את דידי אריאלי מהאתר ״קליקי בניית אתרים״ שעושה בדיוק את זה: בנייה ותחזוקת אתרים מותאמים אישית. מה שנחמד הוא שדידי נאמן לעקרונות הקוד הפתוח: הלקוח לא קשור אליו ושומר על השליטה בתוכן ובקוד של האתר.
Recently, Automattic created a Marketing Data team to support marketing efforts with dedicated data capabilities. As we got started, one important question loomed for me and my teammate Demet Dagdelen: What should we data scientists do as part of this team?
Even though the term data science has been heavily used in the past few years, its meaning still lacks clarity. My current definition for data science is: “a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.” This is a very broad definition that offers a vague direction for what marketing data scientists should do. Indeed, many ideas for data science work were thrown around when the team was formed. Because Demet and I wanted our work to be proactive and influential, we suggested a long-term marketing data science…
I’m a terrible procrastinator. A couple of years ago, I installed RescueTime to fight this procrastination. The idea behind RescueTime is simple — it tracks the sites you visit and the application you use and classifies them according to how productive you are. Using this information, RescueTime provides a regular report of your productivity. You can also trigger the productivity mode, in which RescueTime will block all the distractive sites such as Facebook, Twitter, news sites, etc. You can also configure RescueTime to trigger this mode according to different settings. This sounded like a killer feature for me and was the main reason behind my decision to purchase a RescueTime subscription. Yesterday, I realized how wrong I was.
When I installed RescueTime, I was full of good intentions. That is why I configured it to block all the distractive sites for one hour every time I accumulate more than 10 minutes of surfing such sites. However, from time to time, I managed to find a good excuse to procrastinate. Although RescueTime allows you to open a “bad” site after a certain delay, I found this delay annoying and ended up killing the RescueTime process (killing a process is faster than temporary disabling a filter). As a result, most of my workday stayed untracked, unmonitored, and unfiltered.
So, I decided to end this absurd situation. As of today, RescueTime will never block any sites. Instead of blocking, I configured it to show a reminder and to open my RescueTime dashboard, as a reminder to behave myself. I don’t know whether this non-intrusive reminder will be effective or not but at least I will have correct information about my day.
“Sunday grumpiness” is an SFW translation of Hebrew phrase that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.
“Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in a form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”