Compiled by my teammate Yanir Seroussi, the following is a reading list on A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s. The reviews are mine. Collective intelligence in action 🙂
If you don’t pay attention, data can drive you off a cliff
In this post, Yanir lists seven mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir’s points are trivial doesn’t make them less correct. Awareness doesn’t exist without knowledge. Unfortunately, knowledge doesn’t assure awareness. That is why reading trivial truths is a good thing to do from time to time.
How to identify your marketing lies and start telling the truth
This post was written by Tiberio Caetano, a data science professor at the University of Sydney. If I had to summarize this post with a single phrase, it would be “confounding factors”. A confounding variable is a variable hidden from your eye that influences a measured effect. Here is one example: you start an ad campaign for ice cream, your sales go up, and you conclude that the ad campaign was effective. What you forgot was that the ad campaign started at the beginning of the summer, when people start buying more ice cream anyhow.
See this link for a detailed textbook-quality review of confounding variables.
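The ice cream example can be made concrete with a toy calculation. This is only a sketch: every number below is invented, and the second region that never sees the ads is my own assumption, not something taken from Tiberio's post. Subtracting the change in a comparison group is usually called a difference-in-differences estimate.

```python
# Hypothetical weekly ice cream sales (units); every number is made up.
BASELINE = 1000        # spring sales without ads
SUMMER_EFFECT = 300    # seasonal lift that happens with or without ads
AD_EFFECT = 50         # true lift caused by the campaign


def weekly_sales(summer: bool, ads: bool) -> int:
    """Deterministic toy model: baseline plus seasonal and ad lifts."""
    return BASELINE + SUMMER_EFFECT * summer + AD_EFFECT * ads


# Region A gets the campaign at the start of summer; region B never does.
a_before = weekly_sales(summer=False, ads=False)
a_after = weekly_sales(summer=True, ads=True)
b_before = weekly_sales(summer=False, ads=False)
b_after = weekly_sales(summer=True, ads=False)

# The naive before/after comparison credits the ads with the seasonal lift.
naive_estimate = a_after - a_before

# Subtracting the change in the ad-free region removes the seasonal lift.
adjusted_estimate = (a_after - a_before) - (b_after - b_before)

print(f"naive estimate of the ad effect:    {naive_estimate}")
print(f"adjusted estimate of the ad effect: {adjusted_estimate}")
```

In this toy setup the naive comparison overstates the campaign's effect sevenfold, because the season and the campaign started together. Real data is noisier, and a comparison region is not always available.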
Seven rules of thumb for web site experimenters
I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog). The post is dedicated to overfitting, the scariest problem in machine learning. Overfitting is easy to do and hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.
Quoting Tom Breur:
Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!
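To show what overfitting looks like in practice, here is a minimal sketch of my own; it is not taken from Tom Breur's post. It assumes a made-up process in which the true label is "x is greater than 0.5" and 30% of the labels are flipped at random. A model that memorizes the training set (one nearest neighbor) is compared with a single fixed cut at 0.5, which we pretend to know in advance.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable


def make_data(n, noise=0.3):
    """Points x in [0, 1); the true label x > 0.5 is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def predict_threshold(x):
    """Simple model: one fixed cut at 0.5 (assumed to be known here)."""
    return x > 0.5


def accuracy(predict, data):
    """Fraction of points for which the prediction matches the label."""
    return sum(predict(x) == label for x, label in data) / len(data)


train = make_data(30)    # "small data"
test = make_data(1000)   # fresh data from the same process

train_acc_1nn = accuracy(lambda x: predict_1nn(train, x), train)
test_acc_1nn = accuracy(lambda x: predict_1nn(train, x), test)
test_acc_simple = accuracy(predict_threshold, test)

print(f"1-NN accuracy on training data: {train_acc_1nn:.2f}")
print(f"1-NN accuracy on fresh data:    {test_acc_1nn:.2f}")
print(f"fixed cut accuracy, fresh data: {test_acc_simple:.2f}")
```

The memorizing model is perfect on the data it has already seen, because every training point is its own nearest neighbor, and it loses that advantage on fresh data. The gap between the two numbers is the symptom to look for.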
The current shortage of data scientists will go away, as more and more general purpose tools are developed.
When this happens, you’d better be an expert in the underlying domain, or in the research methods. The many programs that exist today are too shallow to provide any of these.
Recently, the research company Gartner published a press release in which they claim that “More than 40 percent of data science tasks will be automated by 2020, resulting in increased productivity and broader usage of data and analytics by citizen data scientists, according to Gartner, Inc.” Gartner’s main argument is similar to mine: the emergence of ready-to-use tools, algorithm-as-a-service platforms and the like will reduce the amount of tedious work that many data scientists perform for the majority of their workday: data processing, cleaning, and transformation. There are also more and more prediction-as-a-service platforms that provide black boxes that can perform predictive tasks of ever-increasing complexity. Once good plug-and-play tools are available, more and more domain owners, who are not necessarily data scientists, will be able to use them to obtain reasonably good results without the need to employ a dedicated data scientist.
Data scientists won’t disappear as an occupation. They will be more specialized.
I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill: working with numbers using ready-to-use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.
So, what is the future of the data science occupation? Will the emergence of out-of-the-box data science tools make data scientists obsolete? The answer depends on the data scientist, and on how sustainable his or her toolbox is. In the past, bookkeeping used to rely on manual computations. Has the emergence of calculators, and later, spreadsheet programs, resulted in the extinction of bookkeepers as a profession? No, but most of them are now busy with tasks that require more expertise than just adding the numbers.
A similar thing will happen, IMHO, with data scientists. Some of us will develop a specialization in a business domain and gain a better understanding of some aspect of a company’s activity. Others will specialize in algorithm optimization and development and will join the companies for which algorithm development is the core business. Others will have to look for another career. The destiny of a particular person depends mostly on their ability to adapt. Basic science, a solid math foundation, and good research methodology are the key factors that determine one’s career sustainability. The many “learn data science in 3 weeks” courses might be the right step towards a career in data science. A right, small step in a very long journey.
I’ve just realized that Ikigai is what happened to my career as a data scientist. There was no point in my professional life where I felt boredom or lack of motivation. Some people think that I’m good at what I’m doing. If they are right (which I hope they are), it is due to my love for what I have been doing since 2001. I am so thankful for being able to do things that I love, that I care about, and that I am good at. Not only that, I’m being paid for it! The chart shared by Sinan Aral in his tweet should guide anyone in their career choices.
Featured image is taken from this article. Original image credit: Toronto Star Graphic
It seems that a career in data science is the hottest topic many data scientists are asked about. To help aspiring data scientists, I’m reposting here a FAQ by my teammate Yanir Seroussi.
Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time). How do I become a data scientist? It depends on your situation. Before we get into it, have you thought about why you want […]
If you know me in person or follow my blog, you know that I have a keen interest in teaching. Indeed, besides being a full-time data scientist at Automattic, I teach data visualization anywhere I can. Since I started teaching, I have become much better at communication, which is one of the required skills of a good data scientist.
In my constant striving to improve what I do, I joined the Data Carpentry instructor training. Recently, I got my certification as a Data Carpentry instructor.
Software Carpentry (and its sibling project Data Carpentry) aims to teach researchers the computing skills they need to get more done in less time and with less pain. “Carpentry” instructors are volunteers who receive pretty extensive training and who are committed to evidence-based teaching techniques. The instructor training had a powerful impact on how I approach teaching. If teaching is something that you do or plan to do, invest three hours of your life watching this video in which Greg Wilson, the “Carpentries” founder, talks about evidence-based teaching and his “Carpentries” project.
I also recommend reading these papers, which provide a brief overview of some evidence-based results in teaching:
The first section in the original post is called “You’ll learn new things”. This is a universal truth. If you don’t “learn new things” every day, your professional career is stalling. To borrow from the world of classification models, telling a universal truth has a very high sensitivity but very low specificity. In other words, it’s a useless waste of ink.
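For readers who don't work with classifiers, here is a small sketch of the analogy. The labels are hypothetical: imagine eight statements, three of which are actually useful advice, and a "universal truth" classifier that says yes to all of them.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for two lists of boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    fn = sum(t and not p for t, p in zip(y_true, y_pred))      # false negatives
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth: is the statement actually useful advice?
y_true = [True, False, False, True, False, False, False, True]

# A "universal truth" says yes to everything.
y_pred = [True] * len(y_true)

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity: {sensitivity:.2f}")
print(f"specificity: {specificity:.2f}")
```

A classifier that always answers yes never misses a positive case, so its sensitivity is perfect, and it never rejects a negative one, so its specificity is zero. It carries no information, which is the point of the analogy.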
3. Not for developers only
The first section starts as follows: “When transitioning from a role as a developer to a position focused on data, …”. Most of the data scientists I know were never developers. I, for example, started as a pharmacist, computational chemist, and bioinformatician. I know several physicists, a historian and a math teacher who are now successful data scientists.
4. SQL skills are overrated
Another quote from the post: “Strong SQL skills are table stakes for data scientists and data engineers”. The thing is that in many cases, we use SQL mostly to retrieve data. Most of the “data scienc-y” work requires analytical tools and flexibility that are not available in most SQL environments. Good familiarity with industry-standard tools and libraries is more important than knowing SQL. Statistics is way more important than knowing SQL. Julia Silge did indeed mention the tools (numpy/R) but didn’t emphasize them enough.
5. Communication importance is hard to overestimate
Again, quoting the post:
The ability to communicate effectively with people from diverse backgrounds is important.
Yes, Yes, and one thousand times Yes. Effective communication is a non-trivial task that is often overlooked by many professionals. Some people are born natural communicators. Some, like me, are not. If there’s one book that you can afford to buy to improve your communication skills, I recommend “Trees, maps and theorems” by Jean-luc Doumont. This is a small, very expensive book that changed the way I communicate in my professional life.
6. It’s not that simple (part 2)
After giving some very general tips, Julia proceeds to suggest that her readers check out the data science jobs at the StackOverflow Jobs site. The impression that’s made is that becoming a data scientist is a relatively simple task. It is not. At the bare minimum, I would mention several educational options that are designed for people trying to become data scientists. One such option is Thinkful (I’m a mentor at Thinkful). Udacity and Coursera both have data science programs too. The point is that to become a data scientist, you have to study a lot. You might notice a potential contradiction between point 1 above and this paragraph. A short explanation is that becoming a data scientist takes a lot of time and effort. The post “Teach Yourself Programming in Ten Years,” which was written in 2001 about programming, is just as relevant in 2017 to data science.
Featured image is based on a photo by Jase Ess on Unsplash
Almost half a year ago, I decided to create an online data visualization course. After investing hundreds of hours, I managed to release the first lecture and record another one. However, I decided not to publish new lectures and to remove the existing one from the net. Why? The short answer is a huge cost-to-benefit ratio. For a longer answer, you will have to keep reading this post.
Why create a course?
It’s not that there are no good courses. There are. However, most of them are tightly coupled with one tool or another. Moreover, many of the courses I have reviewed online act as advanced tutorials for a certain plotting tool. The course that I wanted to create was supposed to be tool-neutral, full of theoretical knowledge and bits of practical advice. Another decision that I made was not to write a set of text files (online book, set of Jupyter notebooks, whatever) but to create a course in which the majority of the knowledge is brought to the audience by means of frontal video lectures. I assumed that this kind of format would be the easiest for the audience to consume.
What went wrong?
So, what went wrong? First of all, you should remember that I work full time at Automattic, which means that every side project is a … side project that I have to do during my free time. I realized that from the very beginning. However, since I already teach data visualization in different institutions in Israel, I already have a well-formed syllabus with accompanying slide decks full of examples. I assumed that it would take me no more than one hour to prepare every online lecture.
So, instead of verifying this assumption, I started solving the technical problems, such as buying a nice microphone (which turned out to be crap) and tripods, building a green room in my home office, etc. Once I was satisfied with my technical setup, I decided to record a promo video. Here, I faced a big problem. You see, talking to people and talking to the camera are completely different things. I feel pretty comfortable talking to people but when I face the camera, I almost freeze. Also, in person-to-person communication, we are somewhat tolerant of small stutters and longish pauses. However, when watching recorded video clips, we expect television-quality narration. It turns out that achieving this kind of narration is very hard. Add the fact that English is my third language, and you get a huge time drain. To be able to record a two-minute promo video, I had to write the entire script, rehearse it a dozen times, and record it in front of a teleprompter. The filming session alone took around half an hour, as I had to repeat almost every line, time after time.
Preparing slide decks for the lectures wasn’t an easy task either. Despite the fact that I had pretty good slide decks, I realized that they are good for an in-class lecture, where I can point to the screen, go back and forth within a presentation, open external URLs, etc. Once I had my slide decks ready, I faced the narration problem once again. So, I had to write the entire lesson’s script, edit it, rehearse for several days, and shoot. At this point, I became frustrated. I might have been more motivated had my first video received some real traffic. However, with 18 (that’s eighteen) views, most of which lasted no more than a minute or two, I hardly felt like a YouTube superstar. I know that it’s impossible to get real traction in such a short period without massive promotion. However, after I completed shooting the second lecture, I realized that I would not be able to do it much longer. Not without quitting my day job. So, I decided to quit.
Since I already have pretty good texts for the first two lectures, I might be able to convert them to posts in this blog. I also have material for some before-and-after videos that I planned to have as a part of this course. I will convert them to posts, too, similar to this post on the data.blog.
Was it worth it?
It certainly was! During the preparations, I learned a lot. I learned new things about data visualization. I took a glimpse into the world of video production. I had a chance to restructure several of my presentations.
Featured image for this post by Nicolas Nova under the CC-by license.
I am a pharmacist. I am interested in becoming a data scientist. My interests are pharmacoeconomics and other areas of health economics. What do I need to study to become a data scientist?
To answer this question, I described how I gradually transformed from a pharmacist to a data scientist by continuous adaptation to the new challenges of my professional career. In the end, I invited anyone to ask personal questions via e-mail (it’s firstname.lastname@example.org). Two days ago, I received a follow-up question:
I would like to know how to learn data science. Would you suggest a master’s degree in analytics? Or is there another way to add “data scientist” label on my resume?
Here’s my answer that will explain why, in my opinion, studying data science won’t give you job security.
Data scientists are real. Data science isn’t.
I think that while “data scientists” are real, “data science” isn’t. We, the data scientists, analyze data using the scientific methods we know and the tools we have mastered. The term “data scientist” was coined about five years ago for the job market. It was meant to help bring the expertise and the positions together. How else would you describe a person who knows scientific analysis and machine learning, writes computer code, and isn’t too abstract a thinker to understand the business needs of a company? Before “data scientist,” there was a less catchy “dataist” http://www.dataists.com/. However, “data scientist” sounded better. It is only after the “data scientist” became a reality that people started searching for “data science.” In the future, data science may become a scientific field, similar to statistics. Currently, though, it is not mature enough. Right now, data science is an attempt to merge different disciplines to answer practical questions. Sometimes, this attempt is successful, which makes my life and the lives of many of my colleagues so exciting.
One standard feature of most, if not all, data science tasks is the requirement to understand the underlying domain. A data scientist in a cyber security team needs to have an understanding of data security, a bioinformatician needs to understand the biological processes, and a data scientist in a financial institution needs to know how money works.
That is why, career-wise, I think that the best strategy is to study an applied field that requires data-intense solutions. By doing so, you will learn how to use the various data analysis techniques. More importantly, you will also learn how to conduct complicated research, and how the analysis and the underlying domain interact. Then, one of two things will happen. You will either specialize in your domain and become an expert, or you will switch between several domains and learn to build bridges between the domains and the tools. Both paths are valuable. I took the second path, and it looks like most of today’s data scientists took that route too. However, sometimes, I am jealous of the specialization I could have gained had I not left computational chemistry about ten years ago.
Who can use the “data scientist” title?
Who can use the “data scientist” title? I started presenting myself as a “data scientist and algorithm developer” not because I passed some licensing exams, or had a diploma. I did so because I was developing algorithms to answer data-intense questions. Saying “I’m a data scientist” is like saying “I’m an expert,” or “I’m an analyst,” or “I’m a manager.” If you feel comfortable enough calling yourself so, and if you can defend this title before your peers, do so. Out of the six data scientists in my current team, we have a pharmacist (me), a physicist, an electrical engineer, a CS major, and two mathematicians. We all have advanced degrees (M.A. or Ph.D.), but none of us had any formal “data science” training. I think that the many existing data science courses and programs are only good for people with deep domain knowledge who need to learn the data tools. Managers can benefit from these courses too. However, by taking such a program alone, you will lack the experience in scientific methodology, which is central to any data research project. Such a program will not provide you with the computer science knowledge and expertise to make you a good data engineer. You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.
Lessons from the past
When I started my Ph.D. (in 2001), bioinformatics was HUGE. Many companies had bioinformatics departments that consisted of dozens, sometimes hundreds, of people. Every university in Israel (where I live) had a bioinformatics program. I knew at least five bioinformatics startups in my geographic area. Where is it now? What do these bioinformaticians do? I don’t know any bioinformatician who kept their job description. Most of those I know moved into data science, and some became managers. Others work as governmental clerks.
Existing tools like Tableau have already sweated much of the complexity out of the once-very-hard task of data visualization, said Raghuram. And there are more higher-level tools on the way … that will improve workflow and automate how data interpretations are presented. “That’s the sort of automation that eliminates the need for data scientists to a large degree,” … And as the technology solves more of these problems, there will also be a lot more human job candidates from the 100 graduate programs worldwide dedicated to churning out data scientists. Supply, meet demand. And bye-bye perks.
My point is, you have to be versatile and expert. The best way to become one isn’t to take a crash course but to solve hard problems, preferably under supervision. Usually, you do so by obtaining an advanced degree. By completing an advanced degree, you learn, you learn to learn, and you prove to yourself and your potential employers that you’re capable of bridging the knowledge gaps that will always be there. That is why I advocate obtaining a degree in an existing field, keeping data science as a tool, not a goal.
I might be wrong.
Giving advice is easy. Living the life is not. The path I’m advocating for worked for me. I might be completely wrong here.
I may be completely wrong about data science not being a mature scientific field. For example, deep learning may be the defining concept of data science as a scientific field on its own.
Credits: The crowd image is by Flickr user Amy West. Hilary Mason's photo is from her site https://hilarymason.com/about/