• Although it is easy to lie with statistics, it is easier to lie without

    October 30, 2017

    I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog). The post is dedicated to overfitting – the scariest problem in machine learning. Overfitting is easy to do and hard to avoid. It is a serious problem when working with “small data” but remains a problem in the big data era, too. Read “Data Dredging” for an overview of the problem and its possible cures.
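
    To see just how easy, here is a minimal sketch (scikit-learn, with synthetic data of my own invention): an unconstrained decision tree memorizes thirty rows of pure noise perfectly, then performs at chance level on new data.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(30, 5)), rng.integers(0, 2, size=30)          # pure noise
    X_test, y_test = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

    model = DecisionTreeClassifier().fit(X, y)
    print("train accuracy:", model.score(X, y))           # 1.0 (memorized the noise)
    print("test accuracy:", model.score(X_test, y_test))  # about 0.5, chance level
    ```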

    Quoting Tom Breur:

    Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!

    Happy dredging indeed.

    October 30, 2017 - 1 minute read -
    advice big-data data science machine learning overfitting small-data blog
  • Gartner: More than 40% of data science tasks will be automated by 2020. So what?

    October 25, 2017

    Recently, I gave some data science career advice, in which I suggested that prospective data scientists not study data science as a career move. Two of my main arguments were (and still are):

    * The current shortage of data scientists will go away as more and more general-purpose tools are developed.
    * When this happens, you’d better be an expert in the underlying domain or in the research methods. The many programs that exist today are too shallow to provide either.
    

    Recently, the research company Gartner published a press release in which they claim that “More than 40 percent of data science tasks will be automated by 2020, resulting in increased productivity and broader usage of data and analytics by citizen data scientists, according to Gartner, Inc.” Gartner’s main argument is similar to mine: the emergence of ready-to-use tools, algorithm-as-a-service platforms, and the like will reduce the amount of tedious work many data scientists perform for the majority of their workday: data processing, cleaning, and transformation. There are also more and more prediction-as-a-service platforms that provide black boxes that can perform predictive tasks of ever-increasing complexity. Once good plug-and-play tools are available, more and more domain owners, who are not necessarily data scientists, will be able to use them to obtain reasonably good results, without the need to employ a dedicated data scientist.

    Data scientists won’t disappear as an occupation. They will be more specialized.

    I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by the data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill – operating with numbers using ready to use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

    So, what is the future of the data science occupation? Will the emergence of out-of-the-box data science tools make data scientists obsolete? The answer depends on the data scientist, and on how sustainable his or her toolbox is. In the past, bookkeeping used to rely on manual computations. Did the emergence of calculators, and later of spreadsheet programs, result in the extinction of bookkeepers as a profession? No, but most of them are now busy with tasks that require more expertise than just adding up numbers.

    A similar thing will happen, IMHO, with data scientists. Some of us will develop a specialization in a business domain – gain a better understanding of some aspect of a company’s activity. Others will specialize in algorithm optimization and development and will join the companies for which algorithm development is the core business. Others will have to look for another career. The destiny of a particular person will depend mostly on their ability to adapt. Basic science, a solid math foundation, and good research methodology are the key factors that determine one’s career sustainability. The many “learn data science in 3 weeks” courses might be the right step towards a career in data science. A right, small step in a very long journey.

    Featured image: Alex Knight on Unsplash

    October 25, 2017 - 3 minute read -
    advice career career-advise courses data science blog Career advice
  • 1461

    October 24, 2017

    I teach data visualization at Azrieli College of Engineering in Jerusalem. Yesterday, during my first lesson, I was talking about the different ways a chart design choice can lead to different conclusions, despite not affecting the actual data. One of the students hypothesized that the perception of a figure can change as a function of the other graphs shown alongside it. This is exactly what was tested in a study I recently mentioned here. I felt very proud of that student, despite having met them only an hour before.

    October 24, 2017 - 1 minute read -
    Data Visualization dataviz teaching blog
  • Who doesn't like some merciless critique of others' work?

    October 23, 2017

    Stephen Few is the author of (among others) “Show Me The Numbers”. Besides writing about what **should** be done in the field of data visualization, Dr. Few also writes a lot about what should **not** be done. He does that in a sharp, merciless way, which makes for very interesting reading (although sometimes Dr. Few can be too harsh). This time, it was the turn of the Tableau blog team to be at the center of Stephen Few’s attention, and not for a good reason.

    If Tableau wishes to call this research, then I must qualify it as bad research. It produced no reliable or useful findings. Rather than a research study, it would be more appropriate to call this “someone having fun with an eye tracker.”

    Reading merciless critique by knowledgeable experts is an excellent way to develop that “inner voice” that questions all your decisions and makes sure you don’t make too many mistakes. Despite the fear of being fried, I really hope that some day I’ll get to know what Stephen Few thinks of my work.

    http://www.perceptualedge.com/blog/?p=2718

    Disclaimer: Stephen Few was generous enough to allow me to use the illustrations from his book in my teaching.

    Featured image: public domain image by Alan Levine, from here.

    October 23, 2017 - 1 minute read -
    argument Data Visualization dataviz blog
  • Why is it (almost) impossible to set deadlines for data science projects?

    October 19, 2017

    In many cases, attempts to set a deadline for a data science project result in a complete fiasco. Why is that? Why can managers obtain reasonable time estimates for many software projects, but not for most data science projects? The key points in answering this question are complexity and, to a greater extent, missing information. By “complexity” I don’t (only) mean computational complexity. By “missing information” I don’t mean dirty data. Let us take a look at these two factors, one by one.

    Complexity

    Illustration: the famous xkcd comic in which two programmers play while their code compiles

    Think about it: why do most properly built bridges remain functional for decades, and sometimes centuries, while the rule for every non-trivial program is that “there is always another bug”? I read this analogy in a post Joel Spolsky wrote in 2001. The answer Joel provides is:

    Once you’ve written a subroutine, you can call it as often as you want. This means that almost everything we do as software developers is something that has never been done before. This is very different than what construction workers do.

    There has been substantial progress in computer engineering theory since 2001, when Joel wrote his post. We have better static analysis tools, better coverage tools, and better standard practices. Nevertheless, bug-free software only exists in Programming 101 books.

    What about data science projects? Aren’t they essentially a sort of software project? Yes, they are, and as such, the above quote is relevant to them too. However, we can add another statement:

    Once you’ve collected data, you can process it as often as you want. This means that almost everything we do as data scientists is something that has never been done before.

    You see, to account for project uncertainty, we need to multiply the uncertainty factors of a software project by the uncertainty factors associated with the data itself. The bottom line is exponential growth in complexity.

    Missing information

    Now, let’s talk about another, even bigger problem: missing information. I’m not talking about “dirty data” – a situation where some values in the dataset are missing, there are input errors, or fields change their meaning over time. These are severe problems, but not as tough as the one I’m about to discuss.

    When a software engineer writes a plotting program, they know when it doesn’t work: the image is either created or not. And if the image isn’t created, the programmer knows that something is wrong and has to be fixed. When a programmer writes a compression program, they know when they have made a mistake: the program does not compress a file, or the result isn’t readable. The programmer knows that there must be a fixable bug in his or her code.

    What about a data science project? Let’s say you’re starting an advertisement targeting project. The project manager gives you the information source and the performance metric. A successful model has to have a performance of 80 or more (the nature of the performance score isn’t important here). You start working. You clean your data, normalize it, build a nice decision tree, and get a score of 60, which is way too low. You explore your data, discover problems in it, retrain the tree, and get 63. You talk to the team that collects the data, find more problems, build a random forest, train it, and get a score of 66. You buy some computation time, create a deep learning network on AWS, train it for a week, and get 66 again.
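
    In code, that frustrating loop might look something like the following sketch (scikit-learn with synthetic data; the models, metric, and numbers are illustrative, not taken from a real project):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    candidates = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }

    for name, model in candidates.items():
        # One validation score per fold; report the mean and hope it clears the bar.
        score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: {score:.2f}")
    ```

    Each step of this loop is cheap to write; knowing when to stop is the hard part.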

    Illustration: a blindfolded man wandering around

    What do you do now? Is it possible that somewhere in your code there is a bug? It certainly is. Is it possible that you can improve the performance by deploying a better model? Probably. However, it is also possible that the data simply does not contain enough information. The problem, of course, is that you don’t know which it is. In practice, you hit your head against the wall until you get the results, give up, or get fired. And this is THE most significant problem with data science (and any research) projects: your problem is a black box. You only know what you know, but you have no idea what you don’t. A research project is like exploring a forest with your eyes shut: when you hit a tree, you don’t know whether this was the last tree in the forest and you’re out, or you’re in the middle of a tropical jungle.

    I hope that theoretical data science research will narrow this gap. Meanwhile, project managers will have to live with this great degree of uncertainty.

    PS. As in any opinion post, I may be mistaken. If you think I am, please let me know in the comments section below.

    The xkcd image: https://xkcd.com/303/ under CC-nc. The wandering man image is by Flickr user Molly under CC-by-nc-nd.

    October 19, 2017 - 4 minute read -
    complexity data science problem project-management blog
  • What is the best thing that can happen to your career?

    October 19, 2017

    Today, I read a tweet by Sinan Aral (@sinanaral) from MIT:

    https://twitter.com/sinanaral/status/917162872362463232

    I’ve just realized that Ikigai is what happened to my career as a data scientist. There was no point in my professional life where I felt boredom or a lack of motivation. Some people think that I’m good at what I’m doing. If they are right (and I hope they are), it is due to my love for what I have been doing since 2001. I am so thankful for being able to do things that I love, care about, and am good at. Not only that, I’m being paid for it! The chart shared by Sinan Aral in his tweet should guide anyone in their career choices.

    Featured image is taken from this article. Original image credit: Toronto Star Graphic

    October 19, 2017 - 1 minute read -
    advice career data science life blog Career advice
  • We're Reading About Bias in AI, SpaceX, and More

    October 18, 2017

    Reading list from the curators of data.blog

    October 18, 2017 - 1 minute read -
    blog
  • Can the order in which graphs are shown change people's conclusions?

    October 17, 2017

    When I teach data visualization, I love showing my students how simple changes in the way one visualizes his or her data may drive the potential audience to different conclusions. When done correctly, such changes can help the presenters make their point. They can also be used to mislead the audience. I keep reminding the students that it is up to them to keep their visualizations honest and fair. In his recent post, Robert Kosara, the owner of https://eagereyes.org/, mentioned another possible way to change the perceived conclusion. This time, not by changing a graph but by changing the order of the graphs exposed to a person. Citing Robert Kosara:

    Priming is when what you see first influences how you perceive what comes next. In a series of studies, [André Calero Valdez, Martina Ziefle, and Michael Sedlmair] showed that these effects also exist in the particular case of scatterplots that show separable or non-separable clusters. Seeing one kind of plot first changes the likelihood of you judging a subsequent plot as the same or another type.

    via IEEE VIS 2017: Perception, Evaluation, Vision Science — eagereyes

    Like any tool, priming can be used for good or bad causes. Abusing priming means deliberately exposing the audience to irrelevant information in order to manipulate them. A good way to use priming is to educate the listeners about its effect and to repeatedly expose them to alternate contexts. Alternatively, reminding the audience of the “before” graph before showing them the similar “after” situation will also create a plausible context-setting effect.

    P.S. The paper mentioned by Kosara is notable not only for its results (they are not as astonishing as I expected from the featured image) but also for how the authors report their research, including the failures.

    Featured image is Figure 1 from Calero Valdez et al. Priming and Anchoring Effects in Visualization

    October 17, 2017 - 2 minute read -
    Data Visualization dataviz manipulation presenting priming psychology teaching blog
  • Advice for aspiring data scientists and other FAQs — Yanir Seroussi

    October 15, 2017

    It seems that a career in data science is the hottest topic many data scientists are asked about. To help aspiring data scientists, I’m reposting here a FAQ by my teammate Yanir Seroussi.

    Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time). How do I become a data scientist? It depends on your situation. Before we get into it, have you thought about why you want […]

    via Advice for aspiring data scientists and other FAQs — Yanir Seroussi

    October 15, 2017 - 1 minute read -
    advice career data science blog Career advice
  • How to be a better teacher?

    October 12, 2017

    If you know me in person or follow my blog, you know that I have a keen interest in teaching. Indeed, besides being a full-time data scientist at Automattic, I teach data visualization anywhere I can. Since I started teaching, I have become much better at communication, which is one of the required skills of a good data scientist.
    In my constant striving to improve what I do, I joined the Data Carpentry instructor training. Recently, I received my certification as a Data Carpentry instructor.

    Certificate of achievement. Data Carpentry instructor

    Software Carpentry (and its sibling project, Data Carpentry) aims to teach researchers the computing skills they need to get more done in less time and with less pain. “Carpentry” instructors are volunteers who receive pretty extensive training and who are committed to evidence-based teaching techniques. The instructor training had a powerful impact on how I approach teaching. If teaching is something you do or plan to do, invest three hours of your life in watching this video, in which Greg Wilson, the “Carpentries” founder, talks about evidence-based teaching and his “Carpentries” project.

    https://www.youtube.com/watch?v=kmVKGxPlTvc

    I also recommend reading these papers, which provide a brief overview of some evidence-based results in teaching:

    * "[The Science of Learning](https://swcarpentry.github.io/instructor-training/files/papers/science-of-learning-2015.pdf)"
    * "[Success in Introductory Programming: What Works?](https://swcarpentry.github.io/instructor-training/files/papers/porter-what-works-2013.pdf)"
    * "[What Can I Do Today to Create a More Inclusive Community in CS?](https://swcarpentry.github.io/instructor-training/files/papers/lee-create-inclusive-community-2015.pdf)"
    
    October 12, 2017 - 1 minute read -
    advice career teaching video work blog Career advice
  • What you need to know to start a career as a data scientist

    October 11, 2017

    It’s hard to overestimate how much I adore StackOverflow. One of the recent posts on StackOverflow.blog is “What you need to know to start a career as a data scientist” by Julia Silge. Here are my reservations about that post:

    1. It’s not that simple (part 1)

    You might have seen my post “Don’t study data science as a career move; you’ll waste your time!”. Becoming a good data scientist is much more than making a decision and “studying it”.

    2. Universal truths mean nothing

    The first section of the original post is called “You’ll learn new things”. This is a universal truth. If you don’t “learn new things” every day, your professional career is stalling. To borrow from the world of classification models, telling a universal truth has very high sensitivity but very low specificity. In other words, it’s a useless waste of ink.

    3. Not for developers only

    The first section starts as follows: “When transitioning from a role as a developer to a position focused on data, …”. Most of the data scientists I know were never developers. I, for example, started as a pharmacist, computational chemist, and bioinformatician. I know several physicists, a historian, and a math teacher who are now successful data scientists.

    4. SQL skills are overrated

    Another quote from the post: “Strong SQL skills are table stakes for data scientists and data engineers”. The thing is that, in many cases, we use SQL mostly to retrieve data. Most “data scienc-y” work requires analytical tools and a flexibility that are not available in most SQL environments. Good familiarity with industry-standard tools and libraries is more important than knowing SQL. Statistics is way more important than knowing SQL. Julia Silge did indeed mention the tools (numpy/R) but didn’t emphasize them enough.

    5. Communication importance is hard to overestimate

    Again, quoting the post:

    The ability to communicate effectively with people from diverse backgrounds is important.

    Yes, yes, and one thousand times yes. Effective communication is a non-trivial task that is often overlooked by many professionals. Some people are born natural communicators. Some, like me, are not. If there’s one book that you can afford to buy to improve your communication skills, I recommend “Trees, maps and theorems” by Jean-luc Doumont. It is a small, very expensive book that changed the way I communicate in my professional life.

    6. It’s not that simple (part 2)

    After giving some very general tips, Julia proceeds to suggest that her readers check out the data science jobs at the StackOverflow Jobs site. The impression this creates is that becoming a data scientist is a relatively simple task. It is not. At the bare minimum, I would mention several educational options that are designed for people trying to become data scientists. One such option is Thinkful (I’m a mentor at Thinkful). Udacity and Coursera both have data science programs too. The point is that to become a data scientist, you have to study a lot. You might notice a potential contradiction between point 1 above and this paragraph. The short explanation is that becoming a data scientist takes a lot of time and effort. The post “Teach Yourself Programming in Ten Years”, which was written in 2001 about programming, is just as relevant in 2017 to data science.

    Featured image is based on a photo by Jase Ess on Unsplash

    October 11, 2017 - 3 minute read -
    advice career data science life opinion blog Career advice
  • Graffiti from Chișinău, Moldova

    October 10, 2017

    I stumbled upon a nice post by Jackie Hadel in which she shared some graffiti pictures from Chișinău, the town I was born in. I left Chișinău in 1990 and visited it for the first time this March. I also took several graffiti pictures, which I will share here. Chișinău is also known by its Russian name, Kishinev.

    Graffiti in Chisinau. Kishinevers, put all your efforts into rebuilding your native city

    This is a partially restored post-WWII inscription that says “Kishinevers, put all your efforts into rebuilding [your] native town”. Kishinev was ruined almost completely during World War II. Now, more than 25 years after the collapse of the USSR, the city still looks as if it needs to be rebuilt.

    Graffiti in Chisinau. Pythagorean theorem.

    Being a data scientist, I liked this graffiti for the maths. It’s the Pythagorean theorem, in case you missed it.

    Swastika on a tombstone in Chisinau

    Swastika on a tombstone in the old Jewish cemetery. One of the saddest places I visited in this city.

    Graffiti in Chisinau. Building-size graffiti.

    A mega-graffiti?

    Graffiti in Chisinau. Writing that says "I love Moldova" (in Romanian)

    “I love Moldova”. I love it too.

    See the original post that prompted me to share these pictures: CHISINAU, MOLDOVA GRAFFITI: LEFT IN RUIN, YOU MAKE ME HAPPY — TOKIDOKI (NOMAD)

    15july17 Chisinau, Moldova 🇲🇩

    October 10, 2017 - 2 minute read -
    chisinau graffiti kishinev moldova travel blog
  • Identifying and overcoming bias in machine learning

    October 8, 2017

    Data scientists build models using data. Real-life data captures real-life injustice and stereotypes. Are data scientists observers whose job is to describe the world, no matter how unjust it is? Charles Earl, an excellent data scientist and my teammate, says that the answer to this question is a firm “NO.” Read the latest data.blog post to learn Charles’ arguments and best practices.

    https://videopress.com/embed/jckHrKeF?hd=0&autoPlay=0&permalink=0&loop=0

    Charles Earl on identifying and overcoming bias in machine learning.

    via Data Speaker Series: Charles Earl on Discriminatory Artificial Intelligence — Data for Breakfast

    October 8, 2017 - 1 minute read -
    data science diversity inclusion blog
  • Before and after — the Hebrew holiday season chart

    October 8, 2017

    Sometimes, when I see a graph, I think, “I could draw a better version.” From time to time, I even consider writing a blog post with the “before” and “after” versions of the plot. The last time I had this desire was when I read the repost of my own post about the crazy month of Hebrew holidays. I created that graph three years ago. Since then, I have learned A LOT. So I thought it would be a good opportunity to apply my over-criticism to my own work. This is the “before” version:

    Graph: Tishrei is mostly a non-working month.

    There are quite a few points worth fixing in that plot. Let’s review those problems:

    * The point of the original post is to emphasize the number of NON-working days in Tishrei. However, the largest points represent the working days. As a result, the emphasis goes to the working days, thus reversing the semantics.
    * It is not absolutely clear what point I intended to make with this graph. A short and meaningful title is an effective way to lead the audience towards the desired conclusion.
    * There are three distinct colors in my graph, representing working, half-working, and non-working days. The category order is clear. The color order, on the other hand, is absolutely arbitrary. Moreover, green and red are never a good color combination, due to the high prevalence of impaired color vision.
    * The Y label is rotated. Rotated Y labels are the default option in all the plotting tools that I know. Why that is so is beyond my understanding, given the numerous studies that show that reading rotated text takes more time and is more error-prone (for example, see [ref](http://journals.sagepub.com/doi/abs/10.1177/154193120204601722), [ref](http://jov.arvojournals.org/article.aspx?articleid=2121153), and [ref](http://psycnet.apa.org/record/1986-10970-001)).
    * One interesting piece of information that one might expect to read from such a graph is how many working days there are in year X. One can obtain this information either by counting the dots or by looking at a separate graph. It would be a good idea to make this information readily available to the observer.
    * The frame around the plot is useless.
    

    OK, now that we have identified the problems, let’s fix them one by one (a short code sketch follows the list):

    * **Emphasize the right things.** I will use bigger points for the non-working days and small ones for the working days. I will also use squares instead of circles. Placing several squares next to one another creates solid areas with less white space in between, which will further emphasize the non-working chunks. I will make sure to leave *some* whitespace between the points, to enable counting.
    * **What's your point?** I will add an explanatory title. Having given it some thought, I came up with "How productive can you be?". It is short, thought-provoking, and makes the point.
    * **Reduce the number of colors.** My intention was to use red for non-working days and blue for the working ones. What color should I use for the half-working ([Chol haMoed](https://en.wikipedia.org/wiki/Chol_HaMoed)) days? I don't want to introduce another color to the improved graph. Since, in my case, those days are mostly non-working, I will use a shade of red for Chol haMoed.
    * **Improve label readability.** One way to solve the rotated Y label problem is to remove the Y label altogether! After all, most people will correctly assume that "2006", "2010", "2020" and other values represent the years. However, the original post mentions two different methods of counting the years, following the Hebrew and Christian traditions. To make it absolutely clear that the graph talks about the Christian (common) calendar, I decided to keep the legend and format it properly.
    * **Add more info.** I added the total number of working days as a separate column of properly aligned gray text labels. The gray color ensures that the labels don't compete with the graph. I also highlighted the current year using a subtle background rectangle.
    * **Data-ink ratio.** I removed the box around the graph and got rid of the lines for the X and Y axes. I also removed the vertical grid lines. I wasn't sure about the horizontal ones, but I decided to keep them in place.
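
    Here is a stripped-down matplotlib sketch of a few of these fixes (square markers with reversed emphasis, no frame, horizontal-only grid lines, a gray totals column). The data below is fabricated for illustration; the code that generates the real figure is linked at the end of this post.

    ```python
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(42)
    years = np.arange(2014, 2021)
    days = np.arange(1, 32)
    # 0 = working, 1 = half-working, 2 = non-working (fabricated pattern)
    status = rng.choice([0, 1, 2], size=(len(years), len(days)), p=[0.5, 0.2, 0.3])

    fig, ax = plt.subplots(figsize=(8, 3))
    colors = {0: "#1f77b4", 1: "#f4a582", 2: "#ca0020"}  # blue, light red, red
    sizes = {0: 15, 1: 80, 2: 80}                        # emphasize the NON-working days

    for i, year in enumerate(years):
        for j, day in enumerate(days):
            s = status[i, j]
            ax.scatter(day, year, marker="s", s=sizes[s], color=colors[s])

    ax.set_title("How productive can you be?", loc="left")
    ax.set_yticks(years)            # horizontal year labels, no Y axis title at all
    ax.set_xlim(0, 36)              # leave room for the totals column
    for spine in ax.spines.values():
        spine.set_visible(False)    # data-ink: no frame, no axis lines
    ax.grid(axis="y", color="0.9")  # keep only the horizontal grid lines
    # Total working days per year, as an aligned column of gray labels.
    for year, total in zip(years, (status == 0).sum(axis=1)):
        ax.text(33, year, str(total), color="gray", va="center")
    plt.show()
    ```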
    

    This is the result:

    tishrei_working_days_after.png

    I like it very much. I’m sure though, that if I revisit it in a year or two, I will find more ways to make it even better.

    You may find the code that generates this figure here.

    October 8, 2017 - 4 minute read -
    before-after Data Visualization dataviz blog
  • Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP

    October 2, 2017

    Frequently, training a machine learning model in a single session is impossible. Most commonly, this happens when one needs to update a model with newly obtained observations. The generic term for such an update is “online learning.” In the scikit-learn world, this concept is also known as partial fit. The problem is that some models or their implementations don’t allow for partial fitting. Even when partial fitting is technically possible, the weight assigned to the new observations may not be under your control. What happens when you re-train a model from scratch, or when the new observations are assigned weights that are too high? Recently, I stumbled upon an interesting concept called pseudo-rehearsal that addresses this problem. Citing Matthew Honnibal:

    Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the “catastrophic forgetting” problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.

    The post is written by Matthew Honnibal from the team behind the excellent spaCy NLP library. It is valuable in many respects. First, it demonstrates a simple-to-implement technique. More importantly, it provides the True Name for a problem I encounter from time to time: catastrophic forgetting.
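
    The idea is simple enough to sketch in a few lines. Below is a minimal, hypothetical version for a scikit-learn-style model that supports partial_fit (Honnibal’s post gives the spaCy-specific recipe; the function name and parameters here are my own):

    ```python
    import numpy as np

    def fine_tune_with_rehearsal(model, X_new, y_new, X_unlabeled, rehearsal_size=1000):
        """Fine-tune `model` on new data while mixing in 'rehearsal' examples
        labeled by the current model, to reduce catastrophic forgetting."""
        # 1. Let the existing model label a sample of unlabeled data.
        idx = np.random.choice(len(X_unlabeled), size=rehearsal_size, replace=False)
        X_rehearsal = X_unlabeled[idx]
        y_rehearsal = model.predict(X_rehearsal)  # pseudo-labels
        # 2. Mix the pseudo-labeled examples with the genuinely new ones.
        X_mix = np.concatenate([X_new, X_rehearsal])
        y_mix = np.concatenate([y_new, y_rehearsal])
        # 3. Update on the mixture: the pseudo-labels anchor the model to its
        #    previous behavior while it learns from the new observations.
        model.partial_fit(X_mix, y_mix)
        return model
    ```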

    Featured image is by Flickr user Herr Olsen under CC-by-nc-2.0


    October 2, 2017 - 1 minute read -
    machine learning blog
  • 16-days work month — The joys of the Hebrew calendar

    September 27, 2017

    Tishrei is the seventh month (*) of the Hebrew calendar, and it starts with Rosh-HaShana — the Hebrew New Year. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), and the first and last days of Sukkot (Feast of Tabernacles) (**). All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 rest days on top of the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last days of Sukkot are mostly treated as half working days. Also, the children are at home, since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.
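
    This bookkeeping is easy to sketch with Python’s standard library. The holiday dates below are the (approximate) 2017 season and are here only to illustrate the counting; the real dates shift every year with the Hebrew calendar:

    ```python
    from datetime import date, timedelta

    # Full rest days: the holidays themselves plus their eves (de facto rest days).
    rest_days = {
        date(2017, 9, 20), date(2017, 9, 21), date(2017, 9, 22),  # Rosh-HaShana (eve + 2 days)
        date(2017, 9, 29), date(2017, 9, 30),                     # Yom Kippur (eve + day)
        date(2017, 10, 4), date(2017, 10, 5),                     # first day of Sukkot (eve + day)
        date(2017, 10, 11), date(2017, 10, 12),                   # last day of Sukkot (eve + day)
    }
    # Half working days: Chol haMoed, between the first and last days of Sukkot.
    half_days = {date(2017, 10, 6) + timedelta(days=d) for d in range(5)}

    def working_value(day):
        if day.weekday() in (4, 5) or day in rest_days:  # Friday/Saturday weekend in Israel
            return 0.0
        return 0.5 if day in half_days else 1.0

    start = date(2017, 9, 20)  # one day before the New Year
    total = sum(working_value(start + timedelta(days=d)) for d in range(31))
    print(f"{total:g} working days out of 31")
    ```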

    I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 1993 and 2020 CE, and this is what we get:

    tishrei_working_days

    Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is how the working/non-working time during this month looks:

    tishrei_workign_weeks.png

    Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted work day, just at a different scale.

    So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

    (*) The New Year starts in the seventh month? I know this is confusing. That’s because Nissan – the month of the Exodus from Egypt – is numbered as the first month.
    (**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.

    September 27, 2017 - 2 minute read -
    Israel blog
  • On data beauty and communication style

    August 18, 2017

    There’s an interesting mini-drama going on in the data visualization world. The moderators of DataIsBeautiful invited Stephen Few for an ask-me-anything (AMA) session. Stephen Few is a data visualization researcher and an opinionated blogger. I use his book “Show Me the Numbers” when I teach data visualization. Both in his book and, even more so, on his blog, Dr. Few is not afraid of criticizing practices that fail to meet his standards of quality. That is why I wasn’t surprised when I read Stephen Few’s public response to the AMA invitation:

    I stridently object to the work of lazy, unskilled creators of meaningless, difficult to read, or misleading data displays. … Many data visualizations that are labeled “beautiful” are anything but. Instead, they pander to the base interests of those who seek superficial, effortless pleasure rather than understanding, which always involves effort.

    This response triggered some backlash. Randal Olson (a prominent data scientist and blogger), for example, called the response “petty”:

    https://twitter.com/randal_olson/status/898244310600228865

    I have to respectfully disagree with Randy. Don’t get me wrong: Stephen Few’s response style is indeed harsh. However, I have to agree with him. Many (although not all) of the data visualization cases I have seen on DataIsBeautiful look like data visualization for the sake of data visualization. They are, basically, collections of lines and colors that demonstrate cool features of plotting libraries but do not provide any insight or tell any (data-based) story. From time to time, we see pieces of “data art,” in which the data plays a secondary role and which have nothing to do with “data visualization”, where the data is the “king.” I don’t consider myself an artistic person, and I don’t appreciate the “art” part of most of the data art pieces I see.

    So, I do understand Stephen Few’s criticism. What I don’t understand is why he decided to pass up the opportunity to preach to the best target audience he could hope for. It seems to me that if you don’t like someone’s actions and they ask you for advice, you should be eager to give it to them. Certainly not attack them. Hillel, an ancient Jewish scholar, said:

    He who is bashful can’t learn, and he who is harsh can’t teach

    Although I don’t have a fraction of the teaching experience that Dr. Few has, I’m sure he would’ve achieved better results had he chosen to accept that invitation.

    Disclaimer: Stephen Few was generous enough to allow me to use the illustrations from his book in my teaching.

    August 18, 2017 - 2 minute read -
    argument Data Visualization dataviz teaching blog
  • Accepting payments on a WordPress.com site? Easy!

    August 18, 2017

    This is an exciting feature available to all WordPress.com Premium and Business users, and on Jetpack sites running version 5.2 or higher. The button looks like this:

    [simple-payment id=”757”]

    This page has all the information you need to know about the PayPal button.

    August 18, 2017 - 1 minute read -
    blogging feature wordpress-com blog
  • On procrastination

    August 17, 2017

    I don’t know anyone, except my wife, who doesn’t consider themselves a procrastinator. I procrastinate a lot. Sometimes, when procrastinating, I read about procrastination. Here’s a list of several recent blog posts on this topic. Read these posts if you have something more important to do*.

    procrastination_quote

    An Ode to the Deadlines competes with An Ode to Procrastination.

    I’ll Think of a Title Tomorrow talks about procrastination from a designer’s point of view. Although it is full of known truths, such as “stop thinking, start doing”, “fear is the mind killer”, and others, it is nevertheless a refreshing read.

    The entire blog called Unblock Results is written by Nancy Linnerooth, who seems to position herself as a productivity coach. I liked her latest post, The Done List, which talks about a nice psychological trick of keeping Done lists instead of Todo lists. This trick plays well with the productivity system that I use in my everyday life. One day, I might describe my system in this blog.

    We all know that reading can sometimes be hard. Thus, let me suggest a TED talk titled Inside the mind of a master procrastinator. You’ll be able to enjoy it with minimal mental effort.


    *The pun is intended
    Featured image is by Flickr user Vic under CC-by-2.0 (cropped)
    The graffiti image is by Flickr user katphotos under CC-by-nc-nd

    August 17, 2017 - 2 minute read -
    procrastination productivity blog Productivity & Procrastination
  • Fashion, data, science

    August 16, 2017

    Zalando is an e-commerce company that sells shoes, clothing, and other fashion items. Zalando isn’t a small company: according to Wikipedia, its 2015 revenue was almost 3 billion Euro. As you might imagine, you don’t run this kind of business without proper data analysis. Recently, we had Thorsten Dietzsch, a product manager for personalization at the fashion e-commerce company Zalando, joining our team meeting to tell us how data science works at Zalando. It was an interesting conversation, which is now publicly available online.

    [wpvideo 9BSbPlBe]

    In the first of our Data Speaker Series posts, Thorsten Dietzsch shares how data products are managed at Zalando, a fashion ecommerce company.

    via Data Speaker Series: Thorsten Dietzsch on Building Data Products at Zalando — Data for Breakfast

    Featured image: By Flickr user sweetjessie from here. Under the CC BY-NC 2.0 license

    August 16, 2017 - 1 minute read -
    data science fashion industry blog
  • Anomaly detection in time series — now the video

    August 14, 2017

    Two months ago, at the PyCon-IL conference, I gave a lecture called “Time Series Analysis: When ‘Good Enough’ is Good Enough”. You may find the written version of this talk here. Today, the conference organizers published all the conference talks on YouTube. Here’s mine:

    https://youtu.be/UwkNmXhWmfI?t=15s

    August 14, 2017 - 1 minute read -
    a2f2 anomaly-detection conference presenting talking video blog
  • Heave ho! How not to abandon your blog

    August 12, 2017

    Sad as it is, most beginning bloggers abandon their blogs soon after starting them. What distinguishes successful (persistent?) bloggers from those who fail to keep going? Is it worth running a collective blog, and if so, how important is the division of labor between the authors?
    In this talk, we try to shed light on these questions by analyzing the behavior of more than five million WordPress.com users.

    The presentation slides can be found here.

    This link leads to the English-language post I wrote when I first published the results of this study.

    August 12, 2017 - 1 minute read -
    blogging research blog
  • This Week in Data Reading

    July 26, 2017
    July 26, 2017 - 1 minute read -
    blog
  • Avoiding being a 'trophy' data scientist

    July 24, 2017

    In this post, Peadar Coyle lists several anti-patterns in running a data science team. It is an excellent read (and his blog is worth following).

    July 24, 2017 - 1 minute read -
    blog
  • A successful failure

    July 23, 2017

    Almost half a year ago, I decided to create an online data visualization course. After investing hundreds of hours, I managed to release the first lecture and record another one. However, I decided not to publish new lectures and to remove the existing one from the net. Why? The short answer is a huge cost-to-benefit ratio. For a longer answer, you will have to keep reading this post.

    Why create a course?

    It’s not that there are no good courses. There are. However, most of them are tightly coupled with one tool or another. Moreover, many of the courses I have reviewed online act as advanced tutorials for a particular plotting tool. The course I wanted to create was supposed to be tool-neutral, full of theoretical knowledge and bits of practical advice. Another decision I made was not to write a set of text files (an online book, a set of Jupyter notebooks, whatever) but to create a course in which the majority of the knowledge is brought to the audience by means of frontal video lectures. I assumed that this kind of format would be the easiest for the audience to consume.

    What went wrong?

    So, what went wrong? First of all, you should remember that I work full time at Automattic, which means that every side project is a … side project that I have to do in my free time. I realized that from the very beginning. However, since I already teach data visualization in different institutions in Israel, I had a well-formed syllabus with accompanying slide decks full of examples. I assumed that it would take me no more than one hour to prepare each online lecture.

    Illustration: green screen and a camera in a typical green room setup. Caption: “Green room. All my friends were very impressed to see it.”

    So, instead of verifying this assumption, I started solving the technical problems, such as buying a nice microphone (which turned out to be crap) and tripods, building a green room in my home office, etc. Once I was satisfied with my technical setup, I decided to record a promo video. Here, I faced a big problem. You see, talking to people and talking to a camera are completely different things. I feel pretty comfortable talking to people, but when I face the camera, I almost freeze. Also, in person-to-person communication, we are somewhat tolerant of small stutters and longish pauses. However, when watching recorded video clips, we expect television-quality narration. It turns out that achieving this kind of narration is very hard. Add the fact that English is my third language, and you get a huge time drain. To be able to record a two-minute promo video, I had to write the entire script, rehearse it a dozen times, and record it in front of a teleprompter. The filming session alone took around half an hour, as I had to repeat almost every line, time after time.

    Illustration: screenshot of my YouTube video with 18 views. Caption: “18 views.”

    Preparing the slide decks for the lectures wasn’t an easy task either. Despite the fact that I had pretty good slide decks, I realized that they were good for an in-class lecture, where I can point to the screen, go back and forth within a presentation, open external URLs, etc. Once I had my slide decks ready, I faced the narration problem once again. So, I had to write the entire lesson’s script, edit it, rehearse for several days, and shoot. By this time, I had become frustrated. I might have been more motivated had my first video received some real traffic. However, with 18 (that’s eighteen) views, most of which lasted no more than a minute or two, I hardly felt like a YouTube superstar. I know that it’s impossible to get real traction in such a short period without massive promotion. However, after I completed shooting the second lecture, I realized that I would not be able to keep doing this much longer. Not without quitting my day job. So, I decided to quit.

    What now?

    Since I already have pretty good texts for the first two lectures, I might be able to convert them to posts on this blog. I also have material for some before-and-after videos that I planned to include as part of this course. I will convert them to posts, too, similar to this post on data.blog.

    Was it worth it?

    It certainly was! During the preparations, I learned a lot. I learned new things about data visualization. I took a glimpse into the world of video production. I had a chance to restructure several of my presentations.


    Featured image for this post by Nicolas Nova under the CC-by license.

    July 23, 2017 - 4 minute read -
    advice Data Visualization dataviz failure online-education blog