I should read more about procrastination. Maybe tomorrow.

Screenshot of "The Nature of Procrastination" paper

You’ve been there: you need to complete a project, submit a report, or document your code. You know how important all these tasks are, but you can’t find the strength to do them. Instead, you look up those nice pictures the Opportunity rover sent to Earth, type random letters into Google to see where they lead you, tidy up your desk, or make another cup of coffee. You are procrastinating.

Because I procrastinate a lot, and because I have several important tasks to complete, I decided to read more about the psychological background of procrastination. I went to Google Scholar and typed “procrastination.” One of the first results was a paper with a promising title: “The Nature of Procrastination: A Meta-Analytic and Theoretical Review of Quintessential Self-Regulatory Failure” by Piers Steel. Why was I intrigued by this paper? First of all, it’s a meta-analysis, meaning that it reviews many previous quantitative studies. Secondly, it promises a theoretical review, which is also a good thing. So, I decided to read it. I started with the abstract, and here’s what I saw:

Strong and consistent predictors of procrastination were task aversiveness, task delay, self-efficacy, and impulsiveness, as well as conscientiousness and its facets of self-control, distractibility, organization, and achievement motivation.

Hmmm, isn’t this the very definition of procrastination? Isn’t this sentence similar to “A strong predictor of obesity is a high ratio of a person’s weight to their height”? Now, I’m really intrigued. I am sure that reading this paper will shed some light, not only on procrastination itself but also on this self-evident sentence. I definitely need to read this paper. Maybe tomorrow.

P.S. After writing this post, I discovered that the paper’s author, Piers Steel, has a blog dedicated to “procrastination and science”: https://procrastinus.com/. I will read that blog too. But not today.

Lie factor in ad graphs

What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph, for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times for their phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

Screen Shot 2018-02-16 at 18.34.38

The problem?

If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice the number printed on it. By the same measure, the green bar stands for 4:20 minutes, and the light-blue one for approximately seven minutes, not 2:39 as its label says.

Screen Shot 2018-02-16 at 18.32.53

I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
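As a rough sketch of that quantification, the snippet below applies Tufte’s lie factor (the size of the effect shown in the graphic divided by the size of the effect in the data) to the numbers quoted above. Matching each bar to a printed waiting time by increasing order is my assumption, and the function and variable names are only illustrative.

```python
def effect_size(baseline, value):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


def lie_factor(data_baseline, data_value, shown_baseline, shown_value):
    """Tufte's lie factor: effect shown in the graphic / effect in the data."""
    return effect_size(shown_baseline, shown_value) / effect_size(data_baseline, data_value)


# Waiting times in seconds. "stated" holds the numbers printed in the ad;
# "drawn" holds what the bar heights imply when the orange bar is 61 s.
# ASSUMPTION: bars are matched to printed numbers by increasing order.
stated = {"dark blue": 63, "green": 95, "light blue": 159}
drawn = {"dark blue": 123, "green": 260, "light blue": 420}
baseline = 61  # the advertiser's own bar, used as the reference

for bar in stated:
    factor = lie_factor(baseline, stated[bar], baseline, drawn[bar])
    print(f"{bar}: lie factor = {factor:.1f}")
```

A lie factor of 1 means that the graphic is faithful to the data. Under the assumptions above, the three competitor bars come out at roughly 31, 5.9, and 3.7.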

Never read reviews before reading a book (except for this one). On “Surely You’re Joking, Mr. Feynman!”

Several people suggested that I read “Surely You’re Joking, Mr. Feynman!”. That is why, when I got my new Kindle, “Surely You’re Joking, Mr. Feynman!” was the first book I bought.
Richard Feynman was a theoretical physicist by training who shared the Nobel Prize. From reading the book, I discovered that Feynman was also a drummer, a painter, an expert on Native American mathematics, a safecracker, a samba player, and an educator. The more I read this book, the more astonished I was by Feynman’s personality and his story.

When I was halfway through the book, I decided to read the Amazon reviews. When reading reviews, I tend to look for the one- and two-star ones, to seed my critical thinking. I wish I hadn’t done that. The reviewers were talking about what an arrogant and boastful man Feynman was, and how terrible it must have been to work with him. I almost stopped reading the book after being exposed to those reviews.

Admittedly, Richard Feynman never missed an opportunity to brag about himself and to emphasize how many of his achievements came without his meaning to, almost by accident. Every once in a while, he mentioned the many people who were much better than him in the particular field that he had managed to conquer. I call this pattern self-bragging modesty, and it is typical of many successful people. Nevertheless, given all his achievements, I think that Feynman has earned the right to brag. Being proud of your accomplishments isn’t arrogance; it is a natural thing to do. “Surely You’re Joking, Mr. Feynman!” is fun to read, very informative, and inspirational. I think that everyone who calls themselves a scientist or considers becoming a scientist should read this book.

P.S. After completing the book, I took some time to watch several of Feynman’s lectures on YouTube. It turned out that besides being a good physicist, Feynman was also a great teacher.

Is Data Science a Science?

Richard Feynman in front of a blackboard that says: Guess ⟶ Compute Consequences ⟶ Compare to Experiment

Is Data Science a Science? I think that there is no data scientist who doesn’t ask himself or herself this question once in a while. I recalled this question today when I watched a fascinating lecture, “Theory, Prediction, Observation,” given by Richard Feynman in 1964. For those who don’t know, Richard Feynman was a physicist who won the Nobel Prize, and who is considered one of the greatest explainers. In that particular lecture, Prof. Feynman talked about science as a sequence of Guess ⟶ Compute Consequences ⟶ Compare to Experiment.

This is exactly what we do when we build models: we first guess what the model should be, then compute the consequences (i.e., fit the parameters), and finally evaluate our models against observations.
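To make the analogy concrete, here is a minimal sketch of the three steps on made-up data. The straight-line “guess”, the toy observations, and all the names are assumptions chosen for illustration, not something taken from the lecture.

```python
import random

random.seed(0)  # reproducible toy data

# Toy "observations": y depends linearly on x, plus noise (made-up data).
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

# Hold out every fourth point for the final comparison.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]


def fit_line(points):
    """Compute consequences: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(points, slope, intercept):
    """Compare to experiment: average absolute prediction error."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Guess: the relationship is a straight line.
slope, intercept = fit_line(train)
error = mean_abs_error(test, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, held-out error={error:.2f}")
```

If the held-out error were large, the guess would be wrong, no matter how elegant it looked.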

My favorite quote from that lecture is

… and therefore, experiment produces troubles, every once in a while …

I strongly recommend watching this lecture. It’s one hour long, so if you don’t have time, you may listen to it while commuting. Feynman is so clear that you can get most of the information by ear alone.

Why deeply caring about the analysis isn’t always a good thing?

Illustration: a person looks at sheets of paper and thinks

Does Caring About the Analysis Matter?

The simplystatistics.org blog had an interesting discussion about a podcast that Roger Peng recorded on A/B testing at Etsy. One of Roger Peng’s later conclusions is as follows:
“Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”

A hypothetical graph that show that $$ potential is lower as

Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep an involvement is also a bad idea, because it creates a bias. Such a bias, which can be conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems, until we get the results we expect. We rarely keep looking for long after that.

There are more mechanisms that may cause false findings. For a good review, I suggest reading “Why Most Published Research Findings Are False” by John P. A. Ioannidis.
Image source: Data Analysis and Engagement – Does Caring About the Analysis Matter? — Simply Statistics

Does chart junk really damage the readability of your graph?

Screen Shot 2018-02-12 at 16.32.56

Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink to all the “ink” in a graph should be as high as possible, preferably 100%.
Everyone who considers themselves serious about data visualization knows (either formally or intuitively) about the importance of keeping the data-ink ratio high, the merits of a high signal-to-noise ratio, and the need to keep the “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One such study was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

In the study presented by J.D. Kelly, the researchers showed a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was shown similar graphs that had undergone rigorous removal of all the possible “chart junk.” One such example is shown below:

Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

Unexpectedly enough, there was no difference between the error rates of the two groups. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

I don’t remember how this report got into my “to read” folder. I am amazed I had never heard about it. So, what is my takeaway from this study? It doesn’t mean we need to abandon the data-ink ratio altogether. It does not provide an excuse to add “chart junk” to your charts “just because”. It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has some limitations. Edward Tufte said, “Above all, show data.” My advice is “Show data, enough data, and mostly data.” Your goal is to convey a message. If some decoration (a.k.a. chart junk) makes your message more easily digestible, so be it.

On statistics and democracy, or why exposing a fraud may mean nothing

The “stat” in the word “statistics” means “state”, as in “government/sovereignty”. Statistics was born as a state effort to use data to rule a country. Even today, every country I know has its own statistics authority. For many years, many governments have been hiding the true statistics from the public, under the assumption that knowledge means power. I was reminded of this after reading the post “Mathematicians, rock the vote!” by Charles Earl (my teammate), in which he encourages mathematicians to fight gerrymandering. Gerrymandering is a dubious practice in the American voting system, where a regulatory body forms voting districts in such a way that the party that appointed that body has the highest chance to win. Citing Charles:

It is really heartening that discrete geometry and other branches of advanced mathematics can be used to preserve democracy

I can’t share Charles’s optimism. In the past, statistics has been successfully used several times to expose election fraud in Russia (see, for example, these two links, [one] [two], but there are many more). People went to the streets, waving posters such as “We don’t believe Churov [a Russian politician], we believe Gauss.”

Demonstration in Russia. Poster: "We don't believe Churov. We believe Gauss"
“We don’t believe Churov. We believe Gauss”. Taken from Anatoly Karlin’s site http://akarlin.com/2011/12/measuring-churovs-beard/

Why, then, am I not optimistic? After all, even the great Terminator, one of my favorite Americans, Arnold Schwarzenegger, fights gerrymandering.

schwarznegger-on-the-gerrymandering-problem-00025416-super-169.jpg

The problem is not that Americans don’t know how to eliminate gerrymandering. The information is there, and the solution is known [ref, as an example]. In theory, it is a very easy problem. In practice, however, power, even more than drugs and sex, is addictive. People don’t tend to give up their power easily. What happened in Russia after election fraud was exposed using statistics? Another election fraud. And then yet another. What will happen in the US? I’m afraid that nothing will change there either.

 

What is the best way to handle command line arguments in Python?

The best way to handle command line arguments with Python is defopt. It works like magic. You write a function, add a proper docstring using any standard format (I use [numpy doc]), and see the magic:


import defopt

def main(greeting, *, count=1):
    """Display a friendly greeting.

    :param str greeting: Greeting to display
    :param int count: Number of times to display the greeting
    """
    for _ in range(count):
        print(greeting)

if __name__ == '__main__':
    defopt.run(main)

You have:

  • help string generation
  • data type conversion
  • default arguments
  • zero boilerplate code

Magic!
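For comparison, here is roughly what the same interface looks like when written by hand with the standard library’s argparse. This is only a sketch: the `-c` short flag and the file name `greet.py` are my assumptions, and the help text is copied from the docstring in the defopt example.

```python
import argparse


def build_parser():
    """Hand-written parser mirroring the greeting example."""
    parser = argparse.ArgumentParser(description="Display a friendly greeting.")
    parser.add_argument("greeting", type=str, help="Greeting to display")
    parser.add_argument("-c", "--count", type=int, default=1,
                        help="Number of times to display the greeting")
    return parser


def main(argv=None):
    # argv=None makes argparse read the arguments from sys.argv[1:]
    args = build_parser().parse_args(argv)
    for _ in range(args.count):
        print(args.greeting)


# Same as running: python greet.py Hello --count 2  (greet.py is a hypothetical file name)
main(["Hello", "--count", "2"])
```

Every argument is described twice here, once in the parser and once where it is used. That duplication is the boilerplate that defopt derives from the function signature and the docstring.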

Illustration: the famous XKCD

Measuring the wall time in Python programs

Illustration: a watch

Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that performs time measurement was many years ago, when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
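The gist itself is not reproduced here. As an illustration of the idea only, a minimal tic-toc stopwatch might look like the sketch below; the class layout and method names are my assumptions and not necessarily those of the gist.

```python
import time


class TicToc:
    """Minimal Matlab-style stopwatch (an illustrative sketch, not the gist's code)."""

    def __init__(self):
        self._start = None

    def tic(self):
        """Start (or restart) the timer."""
        self._start = time.perf_counter()

    def toc(self):
        """Return the seconds elapsed since the last tic()."""
        if self._start is None:
            raise RuntimeError("toc() called before tic()")
        return time.perf_counter() - self._start


timer = TicToc()
timer.tic()
total = sum(i * i for i in range(100_000))  # the code being timed
print(f"elapsed: {timer.toc():.4f} s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations, so the result is not affected by system clock adjustments.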

Why bar charts should always start at zero

Illustration: a paper sheet with graphs in someone's hand

In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How can anyone tell me how to start my bar chart? The paper/screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

Data visualization is a language. Like any language, data visualization has its set of rules, a grammar if you wish. As in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try to respect the grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

Nathan Yau from flowingdata.com has a very informative post

Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

that explores this exact point. Read it.

Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

Also, do remember that the zero point has to be a meaningful one. That is why you cannot use a bar chart to depict the weather: unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice of the temperature scale.
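A small calculation shows the problem. The two temperatures below are made up; the point is that the ratio of the bar heights, which is what a bar chart invites the reader to compare, depends on the scale you happen to use.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


cold_c, warm_c = 10.0, 20.0  # two made-up daily temperatures, in Celsius

# Ratio of the bar heights when both bars start at the scale's zero.
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"Celsius:    {ratio_celsius:.2f}")
print(f"Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"Kelvin:     {ratio_kelvin:.2f}")
```

The same two days look “twice as warm” in Celsius, about 1.36 times as warm in Fahrenheit, and nearly identical in Kelvin. Only the Kelvin ratio has a physical meaning.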

Yet another thing to remember is that

It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

(citing Nathan Yau)