Another set of ruthless critique pieces

You know that I like reading a ruthless critique of others’ work — I like telling myself that by doing so I learn good practices (in reality, I suspect I’m just a case what we call in Hebrew שמחה לאיד — the joy of some else’s failure).

Anyhow, I’d like to share a set of posts by Lior Patcher in which he calls bullshit on several reputable people and concepts. Calling bullshit is easy. Doing so with arguments is not so. Lior Patcher worked hard to justify his opinion.

 

Unfortunately, I don’t publish academic papers. But if I do, I will definitely want prof. Patcher read it, and let the world know what he thinks about it. For good and for bad.

Speaking of calling bullshit. Believe it or not, University of Washington has a course with this exact title. The course is available online http://callingbullshit.org/ and is worth watching. I watched all the course’s videos during my last flight from Canada to Israel. The featured image of this post is a screenshot of this course’s homepage.

 

 

 

We’re Reading About Simplifying Without Distortion and Adversarial Image Classification

Weekly reading list from the data.blog team

Data for Breakfast

Boris Gorelik

Recently, I heard an interview with Desmond Morris, the author of The Naked Ape. He reveals that the goal of his writing has always been to “simplify without distortion.” This interview reminded me of the (EXCELLENT) blog “Math with Bad Drawings” by Ben Orlin, a math teacher from Birmingham, England. At his blog, Ben Orlin does exactly that: He simplifies without distorting. I highly suggest following this blog. At the bare minimum, read the latest post, “5 Ways to Troll Your Neural Network.”

Do you know of other blogs that educate readers about mathematics, statistics, machine learning, and other, related fields? Share your favorites in the comments.

Carly Stambaugh

In image classification tasks, an adversarial example is one that has been altered in small ways which are imperceptible to the human eye, in a targeted manner with the intention of “fooling” the classifier. Over at…

View original post 43 more words

Good information + bad visualization = BAD

I went through my Machine Learning tag feed. Suddenly, I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey some information. The problem is that it is bad in so many ways. It is very hard to appreciate the information in a post that shows charts like that. Especially when the post talks about data science that relies so much on data visualization.

via Math required for machine learning — Youth Innovation

I would write a post about good practices in pie charts, but Robert Kosara, of https://eagereyes.org does this so well, I don’t really think I need to reinvent the weel. Pie charts are very powerful in conveying information. Make sure you use this tool well. I strongly suggest reading everything Robert Kosara has to say on this topic.

 

 

What are the best practices in planning & interpreting A/B tests?

Screenshots of the reading mentioned in this post

Compiled by my teammate Yanir Serourssi, the following is a reading list an A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s. The reviews are mine. Collective intelligence in action 🙂

  • If you don’t pay attention, data can drive you off a cliff
    In this post, Yanir lists seven common mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir’s points are trivial doesn’t make them less correct. Awareness doesn’t exist without knowledge. Unfortunately, knowledge doesn’t assure awareness. Which is why reading trivial truths is a good thing to do from time to time.
  • How to identify your marketing lies and start telling the truth
    This post was written by Tiberio Caetano, a data science professor at the University of Sidney. If I had to summarize this post with a single phrase, that would be “confounding factors”. A confounding variable is a variable hidden from your eye that influences a measured effect. One example of a confounding variable is when you start an ad campaign for ice cream, your sales go up, and you conclude that the ad campaign was effective. What you forgot was that the ad campaign started at the beginning of the summer, when people start buying more ice cream anyhow.
    See this link for a detailed textbook-quality review of confounding variables.
  • Seven rules of thumb for web site experimenters
    I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
  • A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments
    Another academic paper by Microsoft researchers. This one lists a lot of “dont’s”. Like in the previous link, every advice the authors give is based on established theory and backed up by real data.

How to make a racist AI without really trying (a reblog)

Perhaps you heard about Tay, Microsoft’s experimental Twitter chat-bot, and how within a day it became so offensive that Microsoft had to shut it down and never speak of it again. And you assumed that you would never make such a thing, because you’re not doing anything weird like letting random jerks on Twitter re-train […]

via How to make a racist AI without really trying — ConceptNet blog

Data Science or Data Hype?

In his blog post Big Data Or Big Hype? , Rick Ciesla is asking a question whether the “Big Data” phenomenon is “a real thing” or just a hype? I must admit that, until recently, I was sure that the term “Data Science” was a hype too — an overbroad term to describe various engineering and scientific activities. As time passes by, I become more and more confident that Data Science matures into a separate profession. I haven’t’ yet decided whether the word “science” is fully appropriate in this case is.

We have certainly heard a lot about Big Data in recent years, especially with regards to data science and machine learning. Just how large of a data set constitutes Big Data? What amount of data science and machine learning work involves truly stratospheric volumes of bits and bytes? There’s a survey for that, courtesy of […]

via Big Data Or Big Hype? — VenaData

Do you REALLY need the colors?

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Look at this example from the seaborn documentation site

>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> tips = sns.load_dataset("tips")
>>> ax = sns.barplot(x="day", y="total_bill", data=tips)

Barplot example with colored bars

This example shows the default barplot and is the first barplot. Can you see how easy it is to add colors to the different columns? But WHY? What do those colors represent? It looks like the only information that is encoded by the color is the bar category. We already have this information in the form of bar location. Having this colorful image adds nothing but a distraction. It is sad that this is the default behavior that seaborn developers decided to adopt.

Look at the same example, without the colors

>>> ax = sns.barplot(x="day", y="total_bill", color='gray', data=tips)

Barplot example with gray bars

Isn’t it much better? The sad thing is that a better version requires memorizing additional arguments and more typing.

This was my because you can rant.

 

Numpy vs. Pandas: functions that look the same, share the same code but behave differently

Screenshot that shows that two variables have different standard deviation despite being equal

I can’t imagine how my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason for that is that, under the hood, pandas uses numpy. This similarity is very convenient as it allows passing numpy arrays to many pandas functions and vice versa. However, sometimes it sabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.

Let’s create a numpy vector with a single element in it:

>>> import numpy as np

>>> v = np.array([3.14]) 

Now, let's compute the standard deviaiton of this vector. According to the definition, we expect it to be equal zero.
>>> np.std(v)
0.0

So far so good. No surprises.

Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation should be now?

>>> import pandas as pd
>>> s = pd.Series(v)
>>> s.std()
nan

What? Not a number? What the hell? It’s not an empty vector! I didn’t ask to perform the corrected sample standard deviation. Wait a second…

>> s.std(ddof=0)
 0.0

Now I start getting it. Compare this

>>> print(np.std.__doc__)
Compute the standard deviation along the specified axis.
....
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.

… to this

>>> print(pd.Series.std.__doc__)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument
....
ddof : int, default 1
degrees of freedom

Formally, the pandas developers did nothing wrong. They decided that it makes sense to default for normalized standard deviation when working with data tables, unlike numpy that is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know that even after working with both the libraries for so long.

To sum up:

> s.std()
nan
>> v.std()
0.0
>> s == v
0 True
dtype: bool

 

Beware.