Yesterday, a new episode was published in the Popcorn podcast, where the host, Lior Frenkel, interviewed me. Everyone who knows me knows how much I love talking about myself and what I do. I definitely used this opportunity to talk about the world of data. Some people who listened to this episode told me that they enjoyed it a lot. If you know Hebrew, I recommend that you listen to this episode
A Quora user asked about data science tools with a graphical user interface. Here’s my answer. I should mention though that I don’t usually use GUI for data science. Not that I think GUIs are bad, I simply couldn’t find a tool that works well for me.
Of the many tools that exist, I like the most Orange (https://orange.biolab.si/). Orange allows the user creating data pipelines for exploration, visualization, and production but also allows editing the “raw” python code. The combination of these features makes is a powerful and flexible tool.
The major drawback of Orange (in my opinion) is that is uses its own data format and its own set of models that are not 100% compatible with the Numpy/Pandas/Sklearn ecosystem.
I have made a modest contribution to Orange by adding a six-lines function that computes Matthews correlation coefficient.
There is also RapidMinder but I never used it.
Many years ago, I terribly overfit a model which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping in a wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeping into the future is a bad thing.
Anyhow, Yanir Seroussi, my coworker data scientist, gave a very good talk on bootstrapping.
Bootstrapping the right way is a talk I gave earlier this year at the YOW! Data conference in Sydney. You can now watch the video of the talk and have a look through the slides. The content of the talk is similar to a post I published on bootstrapping pitfalls, with some additional simulations.
The main takeaways shared in the talk are:
- Don’t compare single-sample confidence intervals by eye
- Use enough resamples (15K?)
- Use a solid bootstrapping package (e.g., Python ARCH)
- Use the right bootstrap for the job
- Consider going parametric Bayesian
- Test all the things
Testing all the things typically requires writing code, which I did for the talk. You can browse through it in this notebook. The most interesting findings from my tests are summarised by the following figure.
The figure shows how the accuracy of confidence interval estimation varies by algorithm, sample size…
View original post 405 more words
In 2019, it’s hard to find a data-related blogger who doesn’t write about the essence and the future of data science as a profession. Most of these posts (like this one for example) are mostly useless both for existing data scientists who think about their professional plans and for people who consider data science as their career.
Today I saw yet another post which I find very useful. In this post, Dominik Haitz identifies a “third wave data scientist.” In Dominik’s opinion, a successful data scientist has to combine four features: (1) Business mindset (2) Software engineering craftsmanship (3) Statistics and algorithmic toolbox, and (4) Soft skills. In Dominik’s classification, the business mindset is not “another skill” but the central pillar.
The professional challenges that I have been facing during the past eighteen months or so, made me realize the importance of points 1, 2, and 3 from Dominik’s list (number 4 was already very important on my personal list). However, it took reading his post to put the puzzle parts in place.
Dominik’s additional contribution to the discussion is ditching the famous data science Venn Diagram in favor of another, “business-oriented” visual which I used as the “featured image” to this post.
In my last post on data science career, I heavily promoted the idea that a data scientist needs to find his or her specialization. I back my opinion with my experience and by citing other people opinions. However, keep in mind that I am not a career advisor, I never surveyed the job market, and I might not know what I’m talking about. Moreover, despite the fact that I advocate for specialization, I think that I am more of a generalist.
Since I published the last post, I was pointed to some other posts and articles that either support or contradict my point of view. The most interesting ones are: “Why you shouldn’t be a data science generalist” and “Why Data Science Teams Need Generalists, Not Specialists“, both are very recent and very articulated but promote different points of view. Go figure
TL/DR: Studying data science is OK as long as you know that it’s only a starting point.
Almost two years ago, I wrote a post titled “Don’t study data science as a career move.” Even today, this post is the most visited post on my blog. I was reminded about this post a couple of days ago during a team meeting in which we discussed what does a “data scientist” mean today. I re-read my original post, and I think that I was generally right, but there is a but…
The term “data science” was born as an umbrella term that meant to describe people who know programming, statistics, and business logic. We all saw those numerous Venn diagrams that tried to describe the perfect data scientist. Since its beginning, the field of “data science” has finally matured. There are more and more people that question the mere definition of data science.
Here’s what an entrepreneur Chuck Russel has to say:
Now don’t get me wrong — some of these folks are legit Data Scientists but the majority is not. I guess I’m a purist –calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test the hypothesis with experimental results and after proving or disproving the conjecture move on or iterate.
Now, “create and test hypotheses” is a very vague requirement. After all, any A/B test is a process of “creating and testing hypotheses” using data. Is anyone who performs A/B tests a data scientist? I think not.
Moreover, a couple of years ago, if you wanted to run an A/B test, perform a regression analysis, build a classifier, you would have to write numerous lines of code, debug and tune it. This tedious and intriguing process certainly felt very “sciency,” and if it worked, you would have been very proud of our job. Today, on the other hand, we are lucky to have general-purpose tools that require less and less coding. I don’t remember the last time I had to implement an analysis or an algorithm from the first principles. With the vast amount of verified tools and libraries, writing an algorithm from scratch feels like a huge waste of time.
On the other hand, I spend more and more time trying to understand the “business logic” that I try to improve: why has this test fail? Who will use this algorithm and what will make them like the results? Does effort justify the potential improvement?
I (a data scientist) have all this extra time to think of a business logic thanks to the huge arsenal of generalized tools to choose from. These tools were created mostly by those data scientists whose primary job is to implement, verify, and tune algorithms. My job and the job of these data scientists is different and requires different sets of skills.
There is another ever-growing group of professionals who work hard to make sure someone can apply all those algorithms to any amount of data they feel suitable. These people know that any model is at most as good as the data it is based on. Therefore, they build systems that deliver the right information on time, distribute the data among computation nodes, and make sure no crazy “scientist” sends a production server to a non-responsive state due to a bad choice of parameters. We already have a term for professionals whose job is to build fail-proof systems. We call them engineers, or “data engineers” in this case.
The bottom line
Up till now, I mentioned three major activities that used to be covered by the data science umbrella: building new algorithms, applying algorithms to business logic, and engineering reliable data systems. I’m sure there are other areas under that umbrella that I forgot. In 2019, we reached the point where one has to decide what field of data science does one want to practice. If you consider stying data science think of it as studying medicine. The vast majority of physicians don’t end up general practitioners but rather invest at least five more years of their lives professionalize. Treat your data science studies as an entry ticket into the life-long learning process, and you’ll be OK. Otherwise, (I’m citing myself here): You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.
PS. Here’s a one-week-old article on Forbes.com with very similar theses: link.
Traditional A/B testsing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherentlyinferior to choosing a targeted mix of A and B.Michael Kaminsky locallyoptimistic.com
The quote above is from a post by Michael Kaminsky “Against A/B tests“. I’m still not fully convinced by Michael’s thesis but it is very interesting and thought-provoking.
I wonder how this analysis of remained unnoticed by the social media
The recent election of Doug Jones […] got me thinking: What if the Black populations of Southern cities were to experience a dramatic increase? How many other elections would be impacted?
From time to time, people ask me for help with non-trivial data visualization tasks. A couple of weeks ago, a friend-of-a-friend-of-a-friend showed me a set of graphs with the following note:
Each row is a different use case. Each use case was tested on three separate occasions – columns 1,2,3. We hope to show that the lines in each row behave similarly, but that there are differences between the different rows.
Before looking at the graphs, note the last sentence in the above comment. Knowing what you want to show is an essential and not trivial part of a data visualization task. Specifying what is it precisely that you want to say is the first required task in any communication attempt, technical or not.
For the obvious reasons, I cannot share the original graphs that that person gave me. I managed to re-create the spirit of those graphs using a combination of randomly generated arrays.
Notice how the X- and Y- axes are aligned between all the subplots. Such alignment is a smart move that provides a shared scale and allows faster and more natural comparison between the curves. You should always try aligning your axes. If aligning isn’t possible, make sure that it is absolutely, 100%, clear that the scales are different. Slight differences are very confusing.
There are several small things that we can do to improve this graph. First, the identical legends in every subplot are a useless waste of ink and thus, of your viewers’ processing power. Since they are identical, these legends do nothing but distracting the viewer. Moreover, while I understand how a variable name such as
event_prob appeared on a graph, showing such names outside technical teams is a bad practice. People who don’t share intimate knowledge with the underlying data will find human-readable labels easier to comprehend, making your message “stickier.”
Let’s improve the signal-to-noise ratio of this plot.
According to our task, each row is a different use case. Notice that I accompanied each row with a human-readable label. I didn’t use cryptic code such as
age_0_10 or the such.
Now, let’s go back to the task specification. “We hope to show that the lines in each row behave similarly, but that there are differences between the separate rows.” Remember my advice to always use conclusions as graph titles? Let’s test how such a title will look like
Really? Is there a better way to justify the title? I claim that there is.
Let’s experiment a little bit. What will happen if we will plot all the lines on the same graph? By doing so, we might create a stronger emphasize of the similarities and the differences.
Not bad. The separate lines create some excessive noise, and the legend isn’t the best way to label multiple lines, so let’s improve the graph even further.
Note that meaningful ticks on the X-axis. The 30, 180, and 365-day marks provide useful anchors.
Now, let us go back to our title. “Low intra- and high inter- group variability” is, in fact, two conclusions. If you have ever read any text about technical presentations, you should remember the “one point per slide” rule. How do we solve this problem? In cases like these, I like to use the same graph in two different slides, one for each conclusion.
During a presentation, I would show this graph with the first conclusion as a title. I would talk about the implications of that conclusion. Next, I will say “wait! There is more”, will promote the slide and start talking about the second conclusion.
To sum up,
First, decide what is it that you want to say. Then ask whether your graph says what you want to say. Next, emphasize what you want to say, and finally, say what you want to say.
To be continued
The case that you see in this post is a relatively easy one because it only compares four groups. What will happen if you will need to compare six, sixteen or sixty groups? I will try answering this question in one of my next posts
What: Data Visualization from default to outstanding. Test cases of tough data visualization
Why: You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Sadly, most of you do.
When: May 27, 2018 (a day before the DataScience summit)/ 13:00 – 16:00
Where: Interdisciplinary Center (IDC) at Herzliya.
More info: here.
1. Theoretical introduction: three most common mistakes in data visualization (45 minutes)
2. Test case (LAB): Plotting several radically different time series on a single graph (45 minutes)
3. Test case (LAB): Bar chart as an effective alternative to a pie chart (45 minutes)
4. Test case (LAB): Pie chart as an effective alternative to a bar chart (45 minutes)
According to the conference organizers, the yearly Data Science Summit is the biggest data science event in Israel. This year, the conference will take place in Tel Aviv on Monday, May 28. One day before the main conference, there will be a workshop day, hosted at the Herzliya Interdisciplinary Center. I’m super excited to host one of the workshops, during the afternoon session. During this workshop, we will talk about the mistakes data scientist make while visualizing their data and the way to avoid them. We will also have some fun creating various charts, comparing the results, and trying to learn from each others’ mistakes.