• Although it is easy to lie with statistics, it is easier to lie without

    October 30, 2017

    I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog). The post is dedicated to overfitting – the scariest problem in machine learning. Overfitting is easy to do and hard to avoid. It is a serious problem when working with “small data” but remains a problem in the big data era, too. Read “Data Dredging” for an overview of the problem and its possible cures.
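
    To see just how easy, here is a minimal sketch (scikit-learn, with synthetic data of my own invention): an unconstrained decision tree memorizes thirty rows of pure noise perfectly, then performs at chance level on new data.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(30, 5)), rng.integers(0, 2, size=30)          # pure noise
    X_test, y_test = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

    model = DecisionTreeClassifier().fit(X, y)
    print("train accuracy:", model.score(X, y))           # 1.0 (memorized the noise)
    print("test accuracy:", model.score(X_test, y_test))  # about 0.5, chance level
    ```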

    Quoting Tom Breur:

    Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!

    Happy dredging indeed.

    October 30, 2017 - 1 minute read -
    advice big-data data science machine learning overfitting small-data blog
  • Gartner: More than 40% of data science tasks will be automated by 2020. So what?

    October 25, 2017

    Recently, I gave some data science career advice, in which I suggested that prospective data scientists not study data science as a career move. Two of my main arguments were (and still are):

    * The current shortage of data scientists will go away as more and more general-purpose tools are developed.
    * When this happens, you’d better be an expert in the underlying domain or in the research methods. The many programs that exist today are too shallow to provide either.
    

    Recently, the research company Gartner published a press release in which they claim that “More than 40 percent of data science tasks will be automated by 2020, resulting in increased productivity and broader usage of data and analytics by citizen data scientists, according to Gartner, Inc.” Gartner’s main argument is similar to mine: the emergence of ready-to-use tools, algorithm-as-a-service platforms, and the like will reduce the amount of tedious work many data scientists perform for the majority of their workday: data processing, cleaning, and transformation. There are also more and more prediction-as-a-service platforms that provide black boxes that can perform predictive tasks of ever-increasing complexity. Once good plug-and-play tools are available, more and more domain owners, who are not necessarily data scientists, will be able to use them to obtain reasonably good results, without the need to employ a dedicated data scientist.

    Data scientists won’t disappear as an occupation. They will be more specialized.

    I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by the data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill – operating with numbers using ready to use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

    So, what is the future of the data science occupation? Will the emergence of out-of-the-box data science tools make data scientists obsolete? The answer depends on the data scientist, and on how sustainable his or her toolbox is. In the past, bookkeeping used to rely on manual computations. Did the emergence of calculators, and later of spreadsheet programs, result in the extinction of bookkeepers as a profession? No, but most of them are now busy with tasks that require more expertise than just adding up numbers.

    A similar thing will happen, IMHO, with data scientists. Some of us will develop a specialization in a business domain – gain a better understanding of some aspect of a company’s activity. Others will specialize in algorithm optimization and development and will join the companies for which algorithm development is the core business. Others will have to look for another career. The destiny of a particular person will depend mostly on their ability to adapt. Basic science, a solid math foundation, and good research methodology are the key factors that determine one’s career sustainability. The many “learn data science in 3 weeks” courses might be the right step towards a career in data science. A right, small step in a very long journey.

    Featured image: Alex Knight on Unsplash

    October 25, 2017 - 3 minute read -
    advice career career-advise courses data science blog Career advice
  • 1461

    October 24, 2017

    I teach data visualization at Azrieli College of Engineering in Jerusalem. Yesterday, during my first lesson, I was talking about the different ways a chart design choice can lead to different conclusions, despite not affecting the actual data. One of the students hypothesized that the perception of a figure can change as a function of the other graphs shown alongside it. This is exactly what was tested in a study I recently mentioned here. I felt very proud of that student, despite having met them only an hour before.

    October 24, 2017 - 1 minute read -
    Data Visualization dataviz teaching blog
  • Who doesn't like some merciless critique of others' work?

    October 23, 2017

    Stephen Few is the author of (among others) “Show Me The Numbers”. Besides writing about what **should** be done in the field of data visualization, Dr. Few also writes a lot about what should **not** be done. He does that in a sharp, merciless way, which makes for very interesting reading (although sometimes Dr. Few can be too harsh). This time, it was the turn of the Tableau blog team to be at the center of Stephen Few’s attention, and not for a good reason.

    If Tableau wishes to call this research, then I must qualify it as bad research. It produced no reliable or useful findings. Rather than a research study, it would be more appropriate to call this “someone having fun with an eye tracker.”

    Reading merciless critique by knowledgeable experts is an excellent way to develop that “inner voice” that questions all your decisions and makes sure you don’t make too many mistakes. Despite the fear of being fried, I really hope that some day I’ll get to know what Stephen Few thinks of my work.

    http://www.perceptualedge.com/blog/?p=2718

    Disclaimer: Stephen Few was generous enough to allow me to use the illustrations from his book in my teaching.

    Featured image: public domain image by Alan Levine, from here.

    October 23, 2017 - 1 minute read -
    argument Data Visualization dataviz blog
  • Why is it (almost) impossible to set deadlines for data science projects?

    October 19, 2017

    In many cases, attempts to set a deadline for a data science project result in a complete fiasco. Why is that? Why can managers obtain reasonable time estimates for many software projects, but not for most data science projects? The key points in answering this question are complexity and, to a greater extent, missing information. By “complexity” I don’t (only) mean computational complexity. By “missing information” I don’t mean dirty data. Let us take a look at these two factors, one by one.

    Complexity

    Illustration: the famous xkcd comic in which two programmers play while their code compiles

    Think about it: why do most properly built bridges remain functional for decades, and sometimes centuries, while the rule for every non-trivial program is that “there is always another bug”? I read this analogy in a post Joel Spolsky wrote in 2001. The answer Joel provides is:

    Once you’ve written a subroutine, you can call it as often as you want. This means that almost everything we do as software developers is something that has never been done before. This is very different than what construction workers do.

    There has been substantial progress in computer engineering theory since 2001, when Joel wrote his post. We have better static analysis tools, better coverage tools, and better standard practices. Nevertheless, bug-free software only exists in Programming 101 books.

    What about data science projects? Aren’t they essentially a sort of software project? Yes, they are, and as such, the above quote is relevant to them too. However, we can add another statement:

    Once you’ve collected data, you can process it as often as you want. This means that almost everything we do as data scientists is something that has never been done before.

    You see, to account for project uncertainty, we need to multiply the uncertainty factors of a software project by the uncertainty factors associated with the data itself. The bottom line is exponential growth in complexity.

    Missing information

    Now, let’s talk about another, even bigger problem: missing information. I’m not talking about “dirty data” – a situation where some values in the dataset are missing, there are input errors, or fields change their meaning over time. These are severe problems, but not as tough as the one I’m about to discuss.

    When a software engineer writes a plotting program, they know when it doesn’t work: the image is either created or not. And if the image isn’t created, the programmer knows that something is wrong and has to be fixed. When a programmer writes a compression program, they know when they have made a mistake: the program does not compress a file, or the result isn’t readable. The programmer knows that there must be a fixable bug in his or her code.

    What about a data science project? Let’s say you’re starting an advertisement targeting project. The project manager gives you the information source and the performance metric. A successful model has to have a performance of 80 or more (the nature of the performance score isn’t important here). You start working. You clean your data, normalize it, build a nice decision tree, and get a score of 60, which is way too low. You explore your data, discover problems in it, retrain the tree, and get 63. You talk to the team that collects the data, find more problems, build a random forest, train it, and get a score of 66. You buy some computation time, create a deep learning network on AWS, train it for a week, and get 66 again.
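
    In code, that frustrating loop might look something like the following sketch (scikit-learn with synthetic data; the models, metric, and numbers are illustrative, not taken from a real project):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    candidates = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }

    for name, model in candidates.items():
        # One validation score per fold; report the mean and hope it clears the bar.
        score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: {score:.2f}")
    ```

    Each step of this loop is cheap to write; knowing when to stop is the hard part.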

    Illustration: a blindfolded man wandering around

    What do you do now? Is it possible that somewhere in your code there is a bug? It certainly is. Is it possible that you can improve the performance by deploying a better model? Probably. However, it is also possible that the data simply does not contain enough information. The problem, of course, is that you don’t know which it is. In practice, you hit your head against the wall until you get the results, give up, or get fired. And this is THE most significant problem with data science (and any research) projects: your problem is a black box. You only know what you know, but you have no idea what you don’t. A research project is like exploring a forest with your eyes shut: when you hit a tree, you don’t know whether this was the last tree in the forest and you’re out, or you’re in the middle of a tropical jungle.

    I hope that theoretical data science research will narrow this gap. Meanwhile, project managers will have to live with this great degree of uncertainty.

    PS. As in any opinion post, I may be mistaken. If you think I am, please let me know in the comments section below.

    The xkcd image: https://xkcd.com/303/ under CC-nc. The wandering man image is by Flickr user Molly under CC-by-nc-nd.

    October 19, 2017 - 4 minute read -
    complexity data science problem project-management blog
  • What is the best thing that can happen to your career?

    October 19, 2017

    Today, I read a tweet by Sinan Aral (@sinanaral) from MIT:

    https://twitter.com/sinanaral/status/917162872362463232

    I’ve just realized that Ikigai is what happened to my career as a data scientist. There was no point in my professional life where I felt boredom or a lack of motivation. Some people think that I’m good at what I’m doing. If they are right (and I hope they are), it is due to my love for what I have been doing since 2001. I am so thankful for being able to do things that I love, care about, and am good at. Not only that, I’m being paid for it! The chart shared by Sinan Aral in his tweet should guide anyone in their career choices.

    Featured image is taken from this article. Original image credit: Toronto Star Graphic

    October 19, 2017 - 1 minute read -
    advice career data science life blog Career advice
  • We're Reading About Bias in AI, SpaceX, and More

    October 18, 2017

    Reading list from the curators of data.blog

    October 18, 2017 - 1 minute read -
    blog
  • Can the order in which graphs are shown change people's conclusions?

    October 17, 2017

    When I teach data visualization, I love showing my students how simple changes in the way one visualizes his or her data may drive the potential audience to different conclusions. When done correctly, such changes can help the presenters make their point. They can also be used to mislead the audience. I keep reminding the students that it is up to them to keep their visualizations honest and fair. In his recent post, Robert Kosara, the owner of https://eagereyes.org/, mentioned another possible way to change the perceived conclusion. This time, not by changing a graph but by changing the order of the graphs exposed to a person. Citing Robert Kosara:

    Priming is when what you see first influences how you perceive what comes next. In a series of studies, [André Calero Valdez, Martina Ziefle, and Michael Sedlmair] showed that these effects also exist in the particular case of scatterplots that show separable or non-separable clusters. Seeing one kind of plot first changes the likelihood of you judging a subsequent plot as the same or another type.

    via IEEE VIS 2017: Perception, Evaluation, Vision Science — eagereyes

    Like any tool, priming can be used for good or bad causes. Abusing priming means deliberately exposing the audience to irrelevant information in order to manipulate them. A good way to use priming is to educate the listeners about its effect and to repeatedly expose them to alternate contexts. Alternatively, reminding the audience of the “before” graph before showing them the similar “after” situation will also create a plausible context-setting effect.

    P.S. The paper mentioned by Kosara is notable not only for its results (they are not as astonishing as I expected from the featured image) but also for how the authors report their research, including the failures.

    Featured image is Figure 1 from Calero Valdez et al. Priming and Anchoring Effects in Visualization

    October 17, 2017 - 2 minute read -
    Data Visualization dataviz manipulation presenting priming psychology teaching blog
  • Advice for aspiring data scientists and other FAQs — Yanir Seroussi

    October 15, 2017

    It seems that a career in data science is the hottest topic many data scientists are asked about. To help aspiring data scientists, I’m reposting here a FAQ by my teammate Yanir Seroussi.

    Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time). How do I become a data scientist? It depends on your situation. Before we get into it, have you thought about why you want […]

    via Advice for aspiring data scientists and other FAQs — Yanir Seroussi

    October 15, 2017 - 1 minute read -
    advice career data science blog Career advice
  • How to be a better teacher?

    October 12, 2017

    If you know me in person or follow my blog, you know that I have a keen interest in teaching. Indeed, besides being a full-time data scientist at Automattic, I teach data visualization anywhere I can. Since I started teaching, I have become much better at communication, which is one of the required skills of a good data scientist.
    In my constant striving to improve what I do, I joined the Data Carpentry instructor training. Recently, I received my certification as a Data Carpentry instructor.

    Certificate of achievement. Data Carpentry instructor

    Software Carpentry (and its sibling project, Data Carpentry) aims to teach researchers the computing skills they need to get more done in less time and with less pain. “Carpentry” instructors are volunteers who receive pretty extensive training and who are committed to evidence-based teaching techniques. The instructor training had a powerful impact on how I approach teaching. If teaching is something you do or plan to do, invest three hours of your life in watching this video, in which Greg Wilson, the “Carpentries” founder, talks about evidence-based teaching and his “Carpentries” project.

    https://www.youtube.com/watch?v=kmVKGxPlTvc

    I also recommend reading these papers, which provide a brief overview of some evidence-based results in teaching:

    * "[The Science of Learning](https://swcarpentry.github.io/instructor-training/files/papers/science-of-learning-2015.pdf)"
    * "[Success in Introductory Programming: What Works?](https://swcarpentry.github.io/instructor-training/files/papers/porter-what-works-2013.pdf)"
    * "[What Can I Do Today to Create a More Inclusive Community in CS?](https://swcarpentry.github.io/instructor-training/files/papers/lee-create-inclusive-community-2015.pdf)"
    
    October 12, 2017 - 1 minute read -
    advice career teaching video work blog Career advice
  • What you need to know to start a career as a data scientist

    October 11, 2017

    It’s hard to overestimate how much I adore StackOverflow. One of the recent posts on StackOverflow.blog is “What you need to know to start a career as a data scientist” by Julia Silge. Here are my reservations about that post:

    1. It’s not that simple (part 1)

    You might have seen my post “Don’t study data science as a career move; you’ll waste your time!”. Becoming a good data scientist is much more than making a decision and “studying it”.

    2. Universal truths mean nothing

    The first section of the original post is called “You’ll learn new things”. This is a universal truth. If you don’t “learn new things” every day, your professional career is stalling. To borrow from the world of classification models, telling a universal truth has very high sensitivity but very low specificity. In other words, it’s a useless waste of ink.

    3. Not for developers only

    The first section starts as follows: “When transitioning from a role as a developer to a position focused on data, …”. Most of the data scientists I know were never developers. I, for example, started as a pharmacist, computational chemist, and bioinformatician. I know several physicists, a historian, and a math teacher who are now successful data scientists.

    4. SQL skills are overrated

    Another quote from the post: “Strong SQL skills are table stakes for data scientists and data engineers”. The thing is that, in many cases, we use SQL mostly to retrieve data. Most “data scienc-y” work requires analytical tools and a flexibility that are not available in most SQL environments. Good familiarity with industry-standard tools and libraries is more important than knowing SQL. Statistics is way more important than knowing SQL. Julia Silge did indeed mention the tools (numpy/R) but didn’t emphasize them enough.

    5. Communication importance is hard to overestimate

    Again, quoting the post:

    The ability to communicate effectively with people from diverse backgrounds is important.

    Yes, yes, and one thousand times yes. Effective communication is a non-trivial task that is often overlooked by many professionals. Some people are born natural communicators. Some, like me, are not. If there’s one book that you can afford to buy to improve your communication skills, I recommend “Trees, maps and theorems” by Jean-luc Doumont. It is a small, very expensive book that changed the way I communicate in my professional life.

    6. It’s not that simple (part 2)

    After giving some very general tips, Julia proceeds to suggest that her readers check out the data science jobs at the StackOverflow Jobs site. The impression this creates is that becoming a data scientist is a relatively simple task. It is not. At the bare minimum, I would mention several educational options that are designed for people trying to become data scientists. One such option is Thinkful (I’m a mentor at Thinkful). Udacity and Coursera both have data science programs too. The point is that to become a data scientist, you have to study a lot. You might notice a potential contradiction between point 1 above and this paragraph. The short explanation is that becoming a data scientist takes a lot of time and effort. The post “Teach Yourself Programming in Ten Years”, which was written in 2001 about programming, is just as relevant in 2017 to data science.

    Featured image is based on a photo by Jase Ess on Unsplash

    October 11, 2017 - 3 minute read -
    advice career data science life opinion blog Career advice
  • Graffiti from Chișinău, Moldova

    October 10, 2017

    I stumbled upon a nice post by Jackie Hadel in which she shared some graffiti pictures from Chișinău, the town I was born in. I left Chișinău in 1990 and visited it for the first time this March. I also took several graffiti pictures, which I will share here. Chișinău is also known by its Russian name, Kishinev.

    Graffiti in Chisinau. Kishinevers, put all your efforts into rebuilding your native city

    This is a partially restored post-WWII inscription that says “Kishinevers, put all your efforts into rebuilding [your] native town”. Kishinev was ruined almost completely during World War II. Now, more than 25 years after the collapse of the USSR, the city still looks as if it needs to be rebuilt.

    Graffiti in Chisinau. Pythagorean theorem.

    Being a data scientist, I liked this graffiti for the maths. It’s the Pythagorean theorem, in case you missed it.

    Swastika on a tombstone in Chisinau

    Swastika on a tombstone in the old Jewish cemetery. One of the saddest places I visited in this city.

    Graffiti in Chisinau. Building-size graffiti.

    A mega-graffiti?

    Graffiti in Chisinau. Writing that says "I love Moldova" (in Romanian)

    “I love Moldova”. I love it too.

    See the original post that prompted me to share these pictures: CHISINAU, MOLDOVA GRAFFITI: LEFT IN RUIN, YOU MAKE ME HAPPY — TOKIDOKI (NOMAD)

    15july17 Chisinau, Moldova 🇲🇩

    October 10, 2017 - 2 minute read -
    chisinau graffiti kishinev moldova travel blog
  • Identifying and overcoming bias in machine learning

    October 8, 2017

    Data scientists build models using data. Real-life data captures real-life injustice and stereotypes. Are data scientists observers whose job is to describe the world, no matter how unjust it is? Charles Earl, an excellent data scientist and my teammate, says that the answer to this question is a firm “NO.” Read the latest data.blog post to learn Charles’ arguments and best practices.

    https://videopress.com/embed/jckHrKeF?hd=0&autoPlay=0&permalink=0&loop=0

    Charles Earl on identifying and overcoming bias in machine learning.

    via Data Speaker Series: Charles Earl on Discriminatory Artificial Intelligence — Data for Breakfast

    October 8, 2017 - 1 minute read -
    data science diversity inclusion blog
  • Before and after — the Hebrew holiday season chart

    October 8, 2017

    Sometimes, when I see a graph, I think, “I could draw a better version.” From time to time, I even consider writing a blog post with the “before” and “after” versions of the plot. The last time I had this desire was when I read the repost of my own post about the crazy month of Hebrew holidays. I created that graph three years ago. Since then, I have learned A LOT. So I thought it would be a good opportunity to apply my over-criticism to my own work. This is the “before” version:

    Graph: Tishrei is mostly a non-working month.

    There are quite a few points worth fixing in that plot. Let’s review those problems:

    * The point of the original post is to emphasize the number of NON-working days in Tishrei. However, the largest points represent the working days. As a result, the emphasis goes to the working days, thus reversing the semantics.
    * It is not absolutely clear what point I intended to make with this graph. A short and meaningful title is an effective way to lead the audience towards the desired conclusion.
    * There are three distinct colors in my graph, representing working, half-working, and non-working days. The category order is clear. The color order, on the other hand, is absolutely arbitrary. Moreover, green and red are never a good color combination, due to the high prevalence of impaired color vision.
    * The Y label is rotated. Rotated Y labels are the default option in all the plotting tools that I know. Why that is so is beyond my understanding, given the numerous studies that show that reading rotated text takes more time and is more error-prone (for example, see [ref](http://journals.sagepub.com/doi/abs/10.1177/154193120204601722), [ref](http://jov.arvojournals.org/article.aspx?articleid=2121153), and [ref](http://psycnet.apa.org/record/1986-10970-001)).
    * One interesting piece of information that one might expect to read from such a graph is how many working days there are in year X. One can obtain this information either by counting the dots or by looking at a separate graph. It would be a good idea to make this information readily available to the observer.
    * The frame around the plot is useless.
    

    OK, now that we have identified the problems, let’s fix them one by one (a short code sketch follows the list):

    * **Emphasize the right things.** I will use bigger points for the non-working days and small ones for the working days. I will also use squares instead of circles. Placing several squares next to one another creates solid areas with less white space in between, which will further emphasize the non-working chunks. I will make sure to leave *some* whitespace between the points, to enable counting.
    * **What's your point?** I will add an explanatory title. Having given it some thought, I came up with "How productive can you be?". It is short, thought-provoking, and makes the point.
    * **Reduce the number of colors.** My intention was to use red for non-working days and blue for the working ones. What color should I use for the half-working ([Chol haMoed](https://en.wikipedia.org/wiki/Chol_HaMoed)) days? I don't want to introduce another color to the improved graph. Since, in my case, those days are mostly non-working, I will use a shade of red for Chol haMoed.
    * **Improve label readability.** One way to solve the rotated Y label problem is to remove the Y label altogether! After all, most people will correctly assume that "2006", "2010", "2020" and other values represent the years. However, the original post mentions two different methods of counting the years, following the Hebrew and Christian traditions. To make it absolutely clear that the graph talks about the Christian (common) calendar, I decided to keep the legend and format it properly.
    * **Add more info.** I added the total number of working days as a separate column of properly aligned gray text labels. The gray color ensures that the labels don't compete with the graph. I also highlighted the current year using a subtle background rectangle.
    * **Data-ink ratio.** I removed the box around the graph and got rid of the lines for the X and Y axes. I also removed the vertical grid lines. I wasn't sure about the horizontal ones, but I decided to keep them in place.
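
    Here is a stripped-down matplotlib sketch of a few of these fixes (square markers with reversed emphasis, no frame, horizontal-only grid lines, a gray totals column). The data below is fabricated for illustration; the code that generates the real figure is linked at the end of this post.

    ```python
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(42)
    years = np.arange(2014, 2021)
    days = np.arange(1, 32)
    # 0 = working, 1 = half-working, 2 = non-working (fabricated pattern)
    status = rng.choice([0, 1, 2], size=(len(years), len(days)), p=[0.5, 0.2, 0.3])

    fig, ax = plt.subplots(figsize=(8, 3))
    colors = {0: "#1f77b4", 1: "#f4a582", 2: "#ca0020"}  # blue, light red, red
    sizes = {0: 15, 1: 80, 2: 80}                        # emphasize the NON-working days

    for i, year in enumerate(years):
        for j, day in enumerate(days):
            s = status[i, j]
            ax.scatter(day, year, marker="s", s=sizes[s], color=colors[s])

    ax.set_title("How productive can you be?", loc="left")
    ax.set_yticks(years)            # horizontal year labels, no Y axis title at all
    ax.set_xlim(0, 36)              # leave room for the totals column
    for spine in ax.spines.values():
        spine.set_visible(False)    # data-ink: no frame, no axis lines
    ax.grid(axis="y", color="0.9")  # keep only the horizontal grid lines
    # Total working days per year, as an aligned column of gray labels.
    for year, total in zip(years, (status == 0).sum(axis=1)):
        ax.text(33, year, str(total), color="gray", va="center")
    plt.show()
    ```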
    

    This is the result:

    tishrei_working_days_after.png

    I like it very much. I’m sure though, that if I revisit it in a year or two, I will find more ways to make it even better.

    You may find the code that generates this figure here.

    October 8, 2017 - 4 minute read -
    before-after Data Visualization dataviz blog
  • Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP

    October 2, 2017

    Frequently, training a machine learning model in a single session is impossible. Most commonly, this happens when one needs to update a model with newly obtained observations. The generic term for such an update is “online learning.” In the scikit-learn world, this concept is also known as partial fit. The problem is that some models or their implementations don’t allow for partial fitting. Even when partial fitting is technically possible, the weight assigned to the new observations may not be under your control. What happens when you re-train a model from scratch, or when the new observations are assigned weights that are too high? Recently, I stumbled upon an interesting concept called pseudo-rehearsal that addresses this problem. Citing Matthew Honnibal:

    Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the “catastrophic forgetting” problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.

    The post is written by Matthew Honnibal from the team behind the excellent spaCy NLP library. It is valuable in many respects. First, it demonstrates a simple-to-implement technique. More importantly, it provides the True Name for a problem I encounter from time to time: catastrophic forgetting.
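
    The idea is simple enough to sketch in a few lines. Below is a minimal, hypothetical version for a scikit-learn-style model that supports partial_fit (Honnibal’s post gives the spaCy-specific recipe; the function name and parameters here are my own):

    ```python
    import numpy as np

    def fine_tune_with_rehearsal(model, X_new, y_new, X_unlabeled, rehearsal_size=1000):
        """Fine-tune `model` on new data while mixing in 'rehearsal' examples
        labeled by the current model, to reduce catastrophic forgetting."""
        # 1. Let the existing model label a sample of unlabeled data.
        idx = np.random.choice(len(X_unlabeled), size=rehearsal_size, replace=False)
        X_rehearsal = X_unlabeled[idx]
        y_rehearsal = model.predict(X_rehearsal)  # pseudo-labels
        # 2. Mix the pseudo-labeled examples with the genuinely new ones.
        X_mix = np.concatenate([X_new, X_rehearsal])
        y_mix = np.concatenate([y_new, y_rehearsal])
        # 3. Update on the mixture: the pseudo-labels anchor the model to its
        #    previous behavior while it learns from the new observations.
        model.partial_fit(X_mix, y_mix)
        return model
    ```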

    Featured image is by Flickr user Herr Olsen under CC-by-nc-2.0


    October 2, 2017 - 1 minute read -
    machine learning blog
  • 16-days work month — The joys of the Hebrew calendar

    September 27, 2017

    Tishrei is the seventh month (*) of the Hebrew calendar, and it starts with Rosh-HaShana — the Hebrew New Year. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles) (**). All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days on top of the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last days of Sukkot are mostly treated as half working days. Also, the children are at home, since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.
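
    This bookkeeping is easy to sketch with Python’s standard library. The holiday dates below are the (approximate) 2017 season and are here only to illustrate the counting; the real dates shift every year with the Hebrew calendar:

    ```python
    from datetime import date, timedelta

    # Full rest days: the holidays themselves plus their eves (de facto rest days).
    rest_days = {
        date(2017, 9, 20), date(2017, 9, 21), date(2017, 9, 22),  # Rosh-HaShana (eve + 2 days)
        date(2017, 9, 29), date(2017, 9, 30),                     # Yom Kippur (eve + day)
        date(2017, 10, 4), date(2017, 10, 5),                     # first day of Sukkot (eve + day)
        date(2017, 10, 11), date(2017, 10, 12),                   # last day of Sukkot (eve + day)
    }
    # Half working days: Chol haMoed, between the first and last days of Sukkot.
    half_days = {date(2017, 10, 6) + timedelta(days=d) for d in range(5)}

    def working_value(day):
        if day.weekday() in (4, 5) or day in rest_days:  # Friday/Saturday weekend in Israel
            return 0.0
        return 0.5 if day in half_days else 1.0

    start = date(2017, 9, 20)  # one day before the New Year
    total = sum(working_value(start + timedelta(days=d)) for d in range(31))
    print(f"{total:g} working days out of 31")
    ```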

    I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 1993 and 2020 CE, and this is what we get:

    tishrei_working_days

    Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is how the working/non-working time during this month looks:

    tishrei_workign_weeks.png

    Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted work day, just at a different scale.

    So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

    (*) The New Year starts in the seventh month? I know this is confusing. That’s because Nissan – the month of the Exodus from Egypt – is numbered as the first month.
    (**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

    September 27, 2017 - 2 minute read -
    Israel blog
  • On data beauty and communication style

    August 18, 2017

    There’s an interesting mini-drama going on in the data visualization world. The moderators of DataIsBeautiful invited Stephen Few for an ask-me-anything (AMA) session. Stephen Few is a data visualization researcher and an opinionated blogger. I use his book “Show Me the Numbers” when I teach data visualization. Both in his book and, even more so, on his blog, Dr. Few is not afraid of criticizing practices that fail to meet his standards of quality. That is why I wasn’t surprised when I read Stephen Few’s public response to the AMA invitation:

    I stridently object to the work of lazy, unskilled creators of meaningless, difficult to read, or misleading data displays. … Many data visualizations that are labeled “beautiful” are anything but. Instead, they pander to the base interests of those who seek superficial, effortless pleasure rather than understanding, which always involves effort.

    This response triggered some backlash. Randal Olson (a prominent data scientist and blogger), for example, called the response “petty”:

    https://twitter.com/randal_olson/status/898244310600228865

    I have to respectfully disagree with Randy. Don’t get me wrong: Stephen Few’s response style is indeed harsh. However, I have to agree with him. Many (although not all) of the data visualization cases I have seen on DataIsBeautiful look like data visualization for the sake of data visualization. They are, basically, collections of lines and colors that demonstrate cool features of plotting libraries but do not provide any insight or tell any (data-based) story. From time to time, we see pieces of “data art,” in which the data plays a secondary role and which have nothing to do with “data visualization”, where the data is the “king.” I don’t consider myself an artistic person, and I don’t appreciate the “art” part of most of the data art pieces I see.

    So, I do understand Stephen Few’s criticism. What I don’t understand is why he decided to pass up the opportunity to preach to the best target audience he could hope for. It seems to me that if you don’t like someone’s actions and they ask you for advice, you should be eager to give it to them. Certainly not attack them. Hillel, an ancient Jewish scholar, said:

    He who is bashful can’t learn, and he who is harsh can’t teach

    Although I don’t have a fraction of the teaching experience that Dr. Few has, I’m sure he would’ve achieved better results had he chosen to accept that invitation.

    Disclaimer: Stephen Few was generous enough to allow me to use the illustrations from his book in my teaching.

    August 18, 2017 - 2 minute read -
    argument Data Visualization dataviz teaching blog
  • Accepting payments on a WordPress.com site? Easy!

    August 18, 2017

    This is an exciting feature available to all WordPress.com Premium and Business users, and on Jetpack sites running version 5.2 or higher. The button looks like this:

    [simple-payment id=”757”]

    This page has all the information you need to know about the PayPal button.

    August 18, 2017 - 1 minute read -
    blogging feature wordpress-com blog
  • On procrastination

    August 17, 2017

    I don’t know anyone, except my wife, who doesn’t consider themselves a procrastinator. I procrastinate a lot. Sometimes, when procrastinating, I read about procrastination. Here’s a list of several recent blog posts on this topic. Read these posts if you have something more important to do*.

    procrastination_quote

    An Ode to the Deadlines competes with An Ode to Procrastination.

    I’ll Think of a Title Tomorrow talks about procrastination from a designer’s point of view. Although it is full of known truths, such as “stop thinking, start doing”, “fear is the mind killer”, and others, it is nevertheless a refreshing read.

    The entire blog called Unblock Results is written by Nancy Linnerooth, who seems to position herself as a productivity coach. I liked her latest post, The Done List, which talks about a nice psychological trick of keeping Done lists instead of Todo lists. This trick plays well with the productivity system that I use in my everyday life. One day, I might describe my system in this blog.

    We all know that reading can sometimes be hard. Thus, let me suggest a TED talk titled Inside the mind of a master procrastinator. You’ll be able to enjoy it with minimal mental effort.


    *The pun is intended
    Featured image is by Flickr user Vic under CC-by-2.0 (cropped)
    The graffiti image is by Flickr user katphotos under CC-by-nc-nd

    August 17, 2017 - 2 minute read -
    procrastination productivity blog Productivity & Procrastination
  • Fashion, data, science

    August 16, 2017

    Zalando is an e-commerce company that sells shoes, clothing, and other fashion items. Zalando isn’t a small company: according to Wikipedia, its 2015 revenue was almost 3 billion Euro. As you might imagine, you don’t run this kind of business without proper data analysis. Recently, we had Thorsten Dietzsch, a product manager for personalization at the fashion e-commerce company Zalando, joining our team meeting to tell us how data science works at Zalando. It was an interesting conversation, which is now publicly available online.

    [wpvideo 9BSbPlBe]

    In the first of our Data Speaker Series posts, Thorsten Dietzsch shares how data products are managed at Zalando, a fashion ecommerce company.

    via Data Speaker Series: Thorsten Dietzsch on Building Data Products at Zalando — Data for Breakfast

    Featured image: By Flickr user sweetjessie from here. Under the CC BY-NC 2.0 license

    August 16, 2017 - 1 minute read -
    data science fashion industry blog
  • Anomaly detection in time series — now the video

    August 14, 2017

    Two months ago, at the PyCon-IL conference, I gave a lecture called “Time Series Analysis: When ‘Good Enough’ is Good Enough”. You may find the written version of this talk here. Today, the conference organizers published all the conference talks on YouTube. Here’s mine:

    https://youtu.be/UwkNmXhWmfI?t=15s

    August 14, 2017 - 1 minute read -
    a2f2 anomaly-detection conference presenting talking video blog
  • Heave ho! How not to abandon your blog

    August 12, 2017

    Sad as it is, most beginning bloggers abandon their blogs soon after starting them. What distinguishes successful (persistent?) bloggers from those who fail to keep going? Is it worth running a collective blog, and if so, how important is the division of labor between the authors?
    In this talk, we try to shed light on these questions by analyzing the behavior of more than five million WordPress.com users.

    The presentation slides can be found here.

    This link leads to the English-language post I wrote when I first published the results of this study.

    August 12, 2017 - 1 minute read -
    blogging research blog
  • This Week in Data Reading

    July 26, 2017
    July 26, 2017 - 1 minute read -
    blog
  • Avoiding being a 'trophy' data scientist

    July 24, 2017

    In this post, Peadar Coyle lists several anti-patterns in running a data science team. It is an excellent read (and his blog is worth following).

    July 24, 2017 - 1 minute read -
    blog
  • A successful failure

    July 23, 2017

    Almost half a year ago, I decided to create an online data visualization course. After investing hundreds of hours, I managed to release the first lecture and record another one. However, I decided not to publish new lectures and to remove the existing one from the net. Why? The short answer is a huge cost-to-benefit ratio. For a longer answer, you will have to keep reading this post.

    Why create a course?

    It’s not that there are no good courses. There are. However, most of them are tightly coupled with one tool or another. Moreover, many of the courses I have reviewed online act as advanced tutorials for a particular plotting tool. The course I wanted to create was supposed to be tool-neutral, full of theoretical knowledge and bits of practical advice. Another decision I made was not to write a set of text files (an online book, a set of Jupyter notebooks, whatever) but to create a course in which the majority of the knowledge is brought to the audience by means of frontal video lectures. I assumed that this kind of format would be the easiest for the audience to consume.

    What went wrong?

    So, what went wrong? First of all, you should remember that I work full time at Automattic, which means that every side project is a … side project that I have to do in my free time. I realized that from the very beginning. However, since I already teach data visualization in different institutions in Israel, I had a well-formed syllabus with accompanying slide decks full of examples. I assumed that it would take me no more than one hour to prepare each online lecture.

    Illustration: green screen and a camera in a typical green room setup. Caption: “Green room. All my friends were very impressed to see it.”

    So, instead of verifying this assumption, I started solving the technical problems, such as buying a nice microphone (which turned out to be crap) and tripods, building a green room in my home office, etc. Once I was satisfied with my technical setup, I decided to record a promo video. Here, I faced a big problem. You see, talking to people and talking to a camera are completely different things. I feel pretty comfortable talking to people, but when I face the camera, I almost freeze. Also, in person-to-person communication, we are somewhat tolerant of small stutters and longish pauses. However, when watching recorded video clips, we expect television-quality narration. It turns out that achieving this kind of narration is very hard. Add the fact that English is my third language, and you get a huge time drain. To be able to record a two-minute promo video, I had to write the entire script, rehearse it a dozen times, and record it in front of a teleprompter. The filming session alone took around half an hour, as I had to repeat almost every line, time after time.

    Illustration: screenshot of my YouTube video with 18 views. Caption: “18 views.”

    Preparing the slide decks for the lectures wasn’t an easy task either. Despite the fact that I had pretty good slide decks, I realized that they were good for an in-class lecture, where I can point to the screen, go back and forth within a presentation, open external URLs, etc. Once I had my slide decks ready, I faced the narration problem once again. So, I had to write the entire lesson’s script, edit it, rehearse for several days, and shoot. By this time, I had become frustrated. I might have been more motivated had my first video received some real traffic. However, with 18 (that’s eighteen) views, most of which lasted no more than a minute or two, I hardly felt like a YouTube superstar. I know that it’s impossible to get real traction in such a short period without massive promotion. However, after I completed shooting the second lecture, I realized that I would not be able to keep doing this much longer. Not without quitting my day job. So, I decided to quit.

    What now?

    Since I already have pretty good texts for the first two lectures, I might be able to convert them to posts on this blog. I also have material for some before-and-after videos that I planned to include as part of this course. I will convert them to posts, too, similar to this post on data.blog.

    Was it worth it?

    It certainly was! During the preparations, I learned a lot. I learned new things about data visualization. I took a glimpse into the world of video production. I had a chance to restructure several of my presentations.


    Featured image for this post by Nicolas Nova under the CC-by license.

    July 23, 2017 - 4 minute read -
    advice Data Visualization dataviz failure online-education blog