• Tips on making remote presentations


    February 21, 2018

    Today, I made a presentation to the faculty of the Chisinau
    Institute of Mathematics and Computer Science. The audience gathered in a conference room in Chisinau, and I was in my home office in Israel.

    Me presenting in front of the computer

    Here is a list of useful tips for this kind of presentation.

    * When presenting, it is very important to see your audience. Thus, use two monitors: one for screen sharing, and the other to see the audience.
    * Put the (Skype) window that shows your audience under the camera. This way you'll look most natural on the other side of the teleconference.
    * Starting a presentation in PowerPoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. Instead, I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down-buttons on my presentation remote control work with the Reader; the “make screen black” button doesn’t.
    * I open a “light table” view of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes in a “real” presentation program, but it is good enough.
    * Stand up! Usually, we stand up when we present in front of a live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk, which allows me to stand up and raise the camera to my face level. If you can’t raise the camera, stay seated. You don’t want your audience staring at your groin.
    

    Auditorium in Chisinau showing me on their screen

    February 21, 2018 - 2 minute read -
    chisinau kishinev moldova presentation presenting remote skype blog Data Visualization
  • The best productivity system I know


    February 20, 2018

    I am an awful procrastinator. I realized that many years ago. Once I did, I started searching for productivity tips and systems. Of course, most of these searches are just another form of procrastination. After all, it’s much more fun to read about productivity than to write that boring report. In 2012, I discovered a TiddlyWiki that implements AutoFocus, a system developed by Mark Forster (AutoFocus instructions: link; TiddlyWiki page: link).

    I loved the simplicity of that system and used it for a while. I also started following Mark Forster’s blog. Pretty soon after that, Mark published another, even simpler version of that system, which he called “The Final Version.” I liked it even better and readily adopted it. For many reasons, I moved from TiddlyWiki to Trello and made several personal adjustments to the system.

    At some point, I read “59 Seconds”, in which the psychologist Richard Wiseman summarizes many psychological studies in the fields of happiness, productivity, decision making, etc. From that book, I learned about the power of writing things down. It turns out that when you write things down, your brain gets a better chance to analyze your thoughts and to make better decisions. I also learned from other sources about the importance of disconnecting from the Internet several times a day. So, in November 2016, I made the transition from an electronic productivity system to an old-school notebook. In the beginning, I planned to keep that notebook as a month-long experiment, but I loved it very much. Since then, I have always had my analog productivity system and introspection device with me. Today, I started my sixth notebook. I love my system so much that I actually consider writing a book about it.

    Blank notebook page with #1 in the page corner

    The first page of my new notebook. The notebook is right-to-left since I write in Hebrew.

    February 20, 2018 - 2 minute read -
    procrastination productivity psychology blog Productivity & Procrastination
  • Once again on becoming a data scientist


    February 19, 2018

    My stance on learning data science is known: I think that learning “data science” as a career move is a mistake. You may read this long rant of mine to learn why I think so. This doesn’t mean that I think that studying data science, in general, is a waste of time.

    Let me explain this confusion. Take this blogger, for example: https://thegirlyscientist.com/. As of this writing, “thegirlyscientist” has only two posts: “Is my finance degree useless?” and “How in the world do I learn data science?”. This person (whom I don’t know) seems to be a perfect example of someone who may learn data science tools to solve problems in their professional domain. This is exactly how my own professional career evolved, and I consider myself very lucky in that respect. I strongly believe that successful data scientists outside academia should evolve either from domain knowledge to data skills or from statistical/CS knowledge to domain-specific skills. Learning “data science” as a collection of short courses, without deep knowledge of some domain, is, in my opinion, a waste of time. I constantly doubt myself in this respect, but I haven’t seen enough evidence to change my mind. If you think I’m missing a point, please correct me.

    February 19, 2018 - 1 minute read -
    career data science machine learning blog Career advice
  • The case of meaningless comparison


    February 18, 2018

    Exposé, an Australia-based data analytics company, published a use case in which they analyze the benefits of a custom-made machine learning solution. The only piece of data in their report [PDF] was a graph that shows the observed and the predicted values.

    Screenshot that shows two time series curves: one for the observed and one for the predicted values

    Graphs like this one provide an easy-to-digest overview of the data but are meaningless with respect to our ability to judge model accuracy. When predicting values of time series, it is customary to use all the available data to predict the next step. In cases like that, “predicting” the next value to be equal to the last available one will result in an impressive correlation. Below, for example, is my “prediction” of Apple stock price. In my model, I “predict” tomorrow’s prices to be equal to today’s closing price plus random noise.

    Two curves representing two time series - Apple stock price and the same data shifted by one day

    Look how impressive my prediction is!
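    The naive “prediction” is easy to reproduce. Below is a minimal sketch of the trick; note that it uses a simulated random walk as a stand-in for the real Apple quotes, and the noise scale is my arbitrary choice:

```python
import numpy as np

# A random walk standing in for a stock price series (synthetic, not real quotes).
rng = np.random.default_rng(42)
observed = 100 + np.cumsum(rng.normal(0, 1, 500))

# "Predict" each value as the previous one plus a little random noise.
predicted = np.roll(observed, 1) + rng.normal(0, 0.1, 500)
observed, predicted = observed[1:], predicted[1:]  # drop the undefined first step

# The naive forecast tracks the observations almost perfectly.
print(np.corrcoef(observed, predicted)[0, 1])  # close to 1
```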

    I’m not saying that Exposé constructed a nonsense model. I have no idea what their model is. I do say, however, that their communication is meaningless. In many time series, such as consumption dynamics, stock prices, etc., each value is a function of the previous ones. Thus, the “null hypothesis” of each modeling attempt should be that of a random walk, which means that we should compare not the actual values but rather the changes. And if we do that, we will see the real nature of the model. Below is such a graph for my pseudo-model (zoomed in to the last 20 points).

    A graph comparing the step-to-step changes (diffs) of the observed and the “predicted” series

    Suddenly, my bluff is evident.
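    The diff trick can be sketched in code as well. Continuing with the same synthetic stand-in data (not the real stock prices), comparing the changes instead of the levels makes the bluff obvious:

```python
import numpy as np

# Same synthetic setup: a random walk and its naive one-step "forecast".
rng = np.random.default_rng(42)
observed = 100 + np.cumsum(rng.normal(0, 1, 500))
predicted = np.roll(observed, 1) + rng.normal(0, 0.1, 500)
observed, predicted = observed[1:], predicted[1:]

# Compare the step-to-step changes rather than the raw levels.
obs_diff = np.diff(observed)
pred_diff = np.diff(predicted)

print(np.corrcoef(observed, predicted)[0, 1])  # levels: close to 1
print(np.corrcoef(obs_diff, pred_diff)[0, 1])  # changes: close to 0
```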

    To sum up, a direct comparison of observed and predicted time series can only be used as a starting point for a more detailed analysis. Without such an analysis, this comparison is nothing but a meaningless illustration.

    February 18, 2018 - 2 minute read -
    data visualisation Data Visualization dataviz time-series blog
  • I should read more about procrastination. Maybe tomorrow.


    February 17, 2018

    You’ve been there: you need to complete a project, submit a report, or document your code. You know how important all these tasks are, but you can’t find the power to do them. Instead, you’re researching those nice pictures the Opportunity rover sent to Earth, typing random letters into Google to see where they will lead you, tidying up your desk, or making another cup of coffee. You are procrastinating.

    Because I procrastinate a lot, and because I have several important tasks to complete, I decided to read more about the psychological background of procrastination. I went to Google Scholar and typed “procrastination.” One of the first results was a paper with a promising title: “The Nature of Procrastination: A Meta-Analytic and Theoretical Review of Quintessential Self-Regulatory Failure” by Piers Steel. Why was I intrigued by this paper? First of all, it’s a meta-analysis, meaning that it reviews many previous quantitative studies. Secondly, it promises a theoretical review, which is also a good thing. So, I decided to read it. I started with the abstract, and here’s what I saw:
    Strong and consistent predictors of procrastination were task aversiveness, task delay, self-efficacy, and impulsiveness, as well as conscientiousness and its facets of self-control, distractibility, organization, and achievement motivation.

    Hmmm, isn’t this the very definition of procrastination? Isn’t this sentence similar to “A strong predictor of obesity is a high ratio of a person’s weight to their height”? Now, I’m really intrigued. I am sure that reading this paper will shed some light not only on procrastination itself but also on that self-assuring sentence. I definitely need to read this paper. Maybe tomorrow.

    PS. After writing this post, I discovered that the paper’s author, Piers Steel, has a blog dedicated to “procrastination and science”: https://procrastinus.com/. I will read that blog too. But not today.

    February 17, 2018 - 2 minute read -
    paper procrastination productivity blog Productivity & Procrastination
  • Lie factor in ad graphs


    February 16, 2018

    What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph, for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times of its phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

    The ad: a bar chart comparing the waiting times of the four health care providers

    The problem?

    If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much as the actual number. Likewise, the green bar corresponds to 4:20 minutes, and the light-blue one to approximately seven minutes, not 1:35 and 2:39, as the numbers say.

    The same bar chart, annotated with the durations the bars actually represent

    I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
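    The quantification is Tufte’s “lie factor”: the size of the effect shown in the graphic divided by the size of the effect in the data. A sketch with the numbers above (the “shown” durations are my estimates of what the bars encode, as read off the chart):

```python
# Tufte's lie factor: (effect shown in the graphic) / (effect present in the data).
# Waiting times in seconds: the ad's stated numbers vs. what the bars appear to encode.
actual = {'Meuhedet': 61, 'competitor A': 63, 'competitor B': 95, 'competitor C': 159}
shown = {'Meuhedet': 61, 'competitor A': 123, 'competitor B': 260, 'competitor C': 420}

base = 'Meuhedet'
for name in actual:
    if name == base:
        continue
    effect_in_data = (actual[name] - actual[base]) / actual[base]
    effect_in_graph = (shown[name] - shown[base]) / shown[base]
    print(f'{name}: lie factor = {effect_in_graph / effect_in_data:.1f}')
```

For the first competitor, the graph shows a difference of 62 seconds where the data has only 2: a lie factor of about 31.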

    February 16, 2018 - 1 minute read -
    data visualisation Data Visualization dataviz lie lie-factor blog
  • Never read reviews before reading a book (except for this one). On "Surely You're Joking, Mr. Feynman!"

    February 15, 2018

    Several people suggested that I read “Surely You’re Joking, Mr. Feynman!”. That is why, when I got my new Kindle, “Surely You’re Joking, Mr. Feynman!” was the first book I bought.
    Richard Feynman was a trained theoretical physicist who co-won the Nobel Prize. From reading the book, I discovered that Feynman was also a drummer, a painter, an expert on Native American mathematics, a safecracker, a samba player, and an educator. The more I read this book, the more astonished I was by Feynman’s personality and his story.

    When I was halfway through the book, I decided to read the Amazon reviews. When reading reviews, I tend to look at the one- and two-star ones, to seed my critical thinking. I wish I hadn’t done that. The reviewers were talking about what an arrogant, self-bragging man Feynman was, and how terrible it must have been to work with him. I almost stopped reading the book after being exposed to those reviews.

    Admittedly, Richard Feynman never missed an opportunity to brag about himself and to emphasize how many achievements he made without meaning to do so, almost by accident. Every once in a while, he mentioned the many people who were much better than him in the particular field he managed to conquer. I call this pattern self-bragging modesty, and it is a pattern typical of many successful people. Nevertheless, given all his achievements, I think that Feynman deserves the right to brag. Being proud of your accomplishments isn’t arrogance; it is a natural thing to do. “Surely You’re Joking, Mr. Feynman!” is fun to read, very informative, and inspirational. I think that everyone who calls themselves a scientist, or considers becoming one, should read this book.

    P.S. After completing the book, I took some time to watch several of Feynman’s lectures on YouTube. It turned out that besides being a good physicist, Feynman was also a great teacher.

    February 15, 2018 - 2 minute read -
    book review feynman physics blog
  • Is Data Science a Science?


    February 14, 2018

    Is Data Science a Science? I think that there is no data scientist who doesn’t ask themselves this question once in a while. I recalled this question today when I watched “Theory, Prediction, Observation”, a fascinating lecture given by Richard Feynman in 1964. For those who don’t know, Richard Feynman was a physicist who won the Nobel Prize, and who is considered one of the greatest explainers. In that particular lecture, Prof. Feynman talked about science as a sequence of Guess ⟶ Compute Consequences ⟶ Compare to Experiment.

    Richard Feynman in front of a blackboard that says: Guess ⟶ Compute Consequences ⟶ Compare to Experiment

    This is exactly what we do when we build models: we first guess what the model should be, then compute the consequences (i.e., fit the parameters), and finally evaluate our models against observations.

    My favorite quote from that lecture is

    … and therefore, experiment produces troubles, every once in a while …

    I strongly recommend watching this lecture. It’s one hour long, so if you don’t have time, you may listen to it while commuting. Feynman is so clear, you can get most of the information by ear only.

    https://www.youtube.com/watch?v=OX1EK5IBSdw

    February 14, 2018 - 1 minute read -
    data science feynman philosophy philosophy-of-science science blog
  • Why deeply caring about the analysis isn't always a good thing?


    February 13, 2018

    Does Caring About the Analysis Matter?

    The simplystatistics.org blog had an interesting discussion about a podcast Roger Peng from simplystatistics.org recorded on A/B testing at Etsy. One of the conclusions Roger Peng reached is as follows:
    “Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”

    A hypothetical graph that shows that $$ potential is lower as

    Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep an involvement is also a bad idea: it creates a bias. Such a bias, conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems, until we get the results we expect, but rarely for long after that.

    There are more mechanisms that may cause false findings. For a good review, I suggest reading Why Most Published Research Findings Are False by John P. A. Ioannidis.
    Image source: Data Analysis and Engagement - Does Caring About the Analysis Matter? — Simply Statistics

    February 13, 2018 - 2 minute read -
    best-practice debugging overfitting statistics blog
  • Does chart junk really damage the readability of your graph?


    February 12, 2018


    Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink out of all the “ink” in a graph should be as high as possible, preferably, 100%.
    Everyone who considers themselves serious about data visualization knows (either formally or intuitively) about the importance of keeping the data-ink ratio high, the merits of a high signal-to-noise ratio, and the need to keep “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One such study was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

    In the study presented by J.D. Kelly, the researchers showed a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was exposed to similar graphs that had undergone a rigorous removal of all possible “chart junk.” One such example is shown below.

    Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

    Unexpectedly enough, there was no difference between the error rates the two groups achieved. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

    I don’t remember how this report got into my “to read” folder. I am amazed I had never heard about it. So, what is my takeaway from this study? It doesn’t mean we need to abandon the data-ink ratio altogether. It does not provide an excuse to add “chart junk” to your charts “just because.” It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has its limitations. Edward Tufte said, “Above all, show data.” My advice is: “Show data, enough data, and mostly data.” Your goal is to convey a message; if some decoration (a.k.a. chart junk) makes your message more easily digestible, so be it.

    February 12, 2018 - 2 minute read -
    chart-junk data-ink-ratio data visualisation Data Visualization dataviz research blog
  • On statistics and democracy, or why exposing a fraud may mean nothing


    February 11, 2018

    The “stat” in the word “statistics” means “state”, as in “government/sovereignty”. Statistics was born as a state effort to use data to rule a country. Even today, every country I know of has its own statistics authority. For many years, many governments have been hiding the true statistics from the public, under the assumption that knowledge means power. I was reminded of this after reading “Mathematicians, rock the vote!”, a post by Charles Earl (my teammate), in which he encourages mathematicians to fight gerrymandering. Gerrymandering is a dubious practice in the American voting system, whereby a regulatory body forms voting districts in such a way that the party that appointed that body has the highest chance to win. Citing Charles:

    It is really heartening that discrete geometry and other branches of advanced mathematics can be used to preserve democracy

    I can’t share Charles’s optimism. In the past, statistics has been successfully used several times to expose election frauds in Russia (see, for example, these two links [one] [two], but there are many more). People went to the streets, waving posters such as “We don’t believe Churov [a Russian politician], we believe Gauss.”

    Demonstration in Russia. Poster: "We don't believe Churov. We believe Gauss"

    “We don’t believe Churov. We believe Gauss”. Taken from Anatoly Karlin’s site http://akarlin.com/2011/12/measuring-churovs-beard/

    Why, then, am I not optimistic? After all, even the great Terminator, one of my favorite Americans, Arnold Schwarzenegger, fights gerrymandering.

    Arnold Schwarzenegger speaking about the gerrymandering problem (video screenshot)

    The problem is not that the Americans don’t know how to eliminate gerrymandering. The information is there; the solution is known [ref, as an example]. In theory, it is a very easy problem. In practice, however, power, even more than drugs and sex, is addictive. People don’t tend to give up their power easily. What happened in Russia after an election fraud was exposed using statistics? Another election fraud. And then yet another. What will happen in the US? I’m afraid that nothing will change there either.

    February 11, 2018 - 2 minute read -
    gerrymandering politics russia statistics usa blog
  • What is the best way to handle command line arguments in Python?


    February 10, 2018

    The best way to handle command line arguments in Python is defopt (github.com/evanunderscore/defopt: “Effortless argument parser”). It works like magic. You write a function, add a proper docstring using any standard format (I use numpy doc), and watch the magic:

    [code language="python"]
    import defopt

    def main(greeting, *, count=1):
        """Display a friendly greeting.

        :param str greeting: Greeting to display
        :param int count: Number of times to display the greeting
        """
        for _ in range(count):
            print(greeting)

    if __name__ == '__main__':
        defopt.run(main)
    [/code]

    You get:

    * help string generation
    * data type conversion
    * default arguments
    * zero boilerplate code
    

    Magic!

    Illustration: the famous XKCD

    February 10, 2018 - 1 minute read -
    cli python blog
  • Measuring the wall time in python programs


    February 9, 2018

    Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that performed time measurement was many years ago, when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
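    The gist itself is linked above; here is only a rough sketch of what such a Tictoc class may look like (the names and API are my own, not necessarily those of the gist):

```python
import time

class Tictoc:
    """A minimal Matlab-style tic-toc stopwatch."""

    def tic(self):
        # Remember the current time.
        self._start = time.perf_counter()
        return self

    def toc(self, message='Elapsed'):
        # Report the wall time passed since the last tic().
        elapsed = time.perf_counter() - self._start
        print(f'{message}: {elapsed:.3f} sec')
        return elapsed

# Usage:
timer = Tictoc().tic()
time.sleep(0.1)  # the code being measured
timer.toc()
```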

    February 9, 2018 - 1 minute read -
    gist python stopwatch timing blog
  • Why bar charts should always start at zero?


    February 8, 2018

    In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How can anyone tell me how to start my bar chart? The paper/screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

    Data visualization is a language. Like any language, data visualization has its set of rules, grammar if you wish. Like in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try respecting grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

    Nathan Yau from flowingdata.com has a very informative post

    Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

    that explores this exact point. Read it.

    Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

    Also, remember that the zero point has to be a meaningful one. That is why you cannot use a bar chart to depict the weather (temperatures): unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice of the temperature scale.

    Yet another thing to remember is that

    It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

    (citing Nathan Yau)

    February 8, 2018 - 2 minute read -
    bar plot data visualisation Data Visualization dataviz blog
  • Gender salary gap in the Israeli high-tech — now the code


    February 7, 2018

    Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech industry. As with 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings; the original one uses a slightly customized style.

    February 7, 2018 - 1 minute read -
    code data visualisation Data Visualization dataviz jupyter matplotlib python seaborn blog
  • The Monty Hall Problem simulator


    February 6, 2018

    A couple of days ago, I told my oldest daughter about the Monty Hall problem, the famous probability puzzle with a counter-intuitive solution. My daughter didn’t believe me. Even when I told her all about the probabilities, the added information, and the other stuff, she still couldn’t “feel” it. I looked for an online simulator and couldn’t find anything that I liked. So, I decided to create a simulation Jupyter notebook.

    Illustration: Screenshot of a Jupyter notebook that shows the output of one round of Monty Hall simulation

    I’ve uploaded the notebook to GitHub, in case someone else wants to play with it [link].
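    For those who prefer reading code to opening a notebook, the core of such a simulation boils down to a few lines (a sketch of the idea, not the notebook itself):

```python
import random

def monty_hall(switch, n_rounds=10_000):
    """Play n_rounds of the Monty Hall game; return the fraction of wins."""
    wins = 0
    for _ in range(n_rounds):
        doors = [0, 1, 2]
        prize = random.choice(doors)
        choice = random.choice(doors)
        # The host opens a door that hides no prize and wasn't chosen.
        opened = random.choice([d for d in doors if d != prize and d != choice])
        if switch:
            # Switch to the only remaining closed door.
            choice = next(d for d in doors if d != choice and d != opened)
        wins += (choice == prize)
    return wins / n_rounds

print(f'stay:   {monty_hall(switch=False):.2f}')  # about 0.33
print(f'switch: {monty_hall(switch=True):.2f}')   # about 0.67
```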

    February 6, 2018 - 1 minute read -
    gambling jupyter jupyter-notebook monty-hall-problem probability statistics blog
  • In defense of double-scale and double Y axes


    February 5, 2018

    If you had a chance to talk to me about data visualization, you know that I dislike the use of a double Y-axis for anything except presenting different units of the same measurement (for example, inches and meters). Of course, I’m far from being a special case. Banning double axes is a standard stance among people in the field of data visualization education. Nevertheless, double-scale axes (mostly Y-axes) are commonly used in both popular and technical publications. One of my data visualization students at the Azrieli College of Engineering of Jerusalem told me that he continually uses double Y scales when he designs dashboards that are displayed on the tiny screen of a piece of sophisticated hardware. He claimed that it was impossible to split the data into separate graphs, due to space constraints, and that the engineers who consume those charts are professional enough to overcome the shortcomings of the double scales. I couldn’t find any counter-argument.

    When I tried to clarify my position on that student’s problem, I found an interesting article by the Financial Times commentator John Authers, called “Lies, Damned Lies and Statistics.” In this article, John Authers reviews the many problems a double scale can create. He also shows different alternatives (such as normalization). However, at the end of that article, he also provides strong and valid arguments in favor of a moderate use of double scales. John Authers notices the strange dynamics of two metrics:

    A chart with two Y axes: one for the EURJPY exchange rate and the other for the SPX index

    Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

    It is extraordinary that two measures with almost nothing in common with each other should move this closely together. In early 2007 I noticed how they were moving together, and ended up writing an entire book attempting to explain how this happened.

    It is relatively easy to modify chart scales so that “two measures with almost nothing in common […] move […] closely together”. However, it is hard to argue with the fact that it was the double-scale chart that triggered that spark in the commentator’s head. He acknowledges that normalizing (or rebasing, as he puts it) would have resulted in a similar picture.

    A graph that depicts the dynamics of the two metrics, brought to the same scale

    Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

    But

    However, there is one problem with rebasing, which is that it does not take account of the fact that a stock market is naturally more variable than a foreign exchange market. Eye-balling this chart quickly, the main message might be that the S&P was falling faster than the euro against the yen. The more important point was that the two were as correlated as ever. Both stock markets and foreign exchange carry trades were suffering grievous losses, and they were moving together — even if the S&P was moving a little faster.
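    Rebasing is simple to sketch: divide each series by its first value and multiply by 100, so both fit on a single axis. The data below is synthetic (random walks standing in for the stock index and the exchange rate), and the volatilities are my arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: a "stock index" with larger daily moves than the "FX rate".
spx = 2700 * np.cumprod(1 + rng.normal(0, 0.010, 250))
eurjpy = 130 * np.cumprod(1 + rng.normal(0, 0.003, 250))

# Rebase both series to 100 at the first observation.
rebased_spx = 100 * spx / spx[0]
rebased_eurjpy = 100 * eurjpy / eurjpy[0]

# Both now start at 100 and share one Y axis, yet, as the commentator notes,
# the naturally higher volatility of the stock index remains visible.
print(round(rebased_spx[0], 6), round(rebased_eurjpy[0], 6))
```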

    I am not a financial expert, so I can’t find an easy alternative that would provide the insight John Authers is looking for while satisfying my purist desire to refrain from using double Y axes. The real question, however, is whether such an alternative is something one should be looking for. In many fields, double scales are the accepted language. Thanks to that standard language, many domain experts are able to exchange ideas and discover novel insights. Reinventing the language might bring more harm than good. Thus, my current recommendations regarding double scales are:

    Avoid double scales when possible, unless it’s a commonly accepted practice. In that case, be persistent and don’t lie.

    February 5, 2018 - 3 minute read -
    best-practice data visualisation Data Visualization dataviz double-scale opinion blog
  • What is the best way to collect feedback after a lecture or a presentation?


    February 4, 2018

    I consider teaching and presenting an integral part of my job as a data scientist. One way to become better at teaching is to collect feedback from the learners. I have tried different ways of collecting feedback: passing around a questionnaire, Polldaddy surveys, Google forms, or simply asking (no, begging) the learners to send me an e-mail with their feedback. Nothing really worked. The response rate was pretty low. Moreover, most of the feedback was a useless set of responses such as “it was OK”, “thank you for your time”, “really enjoyed it”. You can’t translate this kind of feedback into any action.

    Recently, I figured out how to collect the feedback correctly. My recipe consists of three simple ingredients.

    Collecting feedback. The recipe.

    working time: 5 minutes

    Ingredients

    * Open-ended mandatory questions: 1 or 2
    * Post-it notes: 1-2 per learner
    * Preventive amnesty: to taste
    

    Procedure

    Our goal is to collect constructive feedback. We want to improve and, thus, are mainly interested in the aspects that didn’t work well. In other words, we want the learners to provide constructive criticism. Sometimes we may also learn from things that worked well. You should decide whether you have enough time to ask for positive feedback. If your time is limited, skip it. Criticism is more valuable than praise.

    Pass post-it notes to your learners.

    Next, start with preventive amnesty, followed by mandatory questions, followed by another portion of preventive amnesty. This is what I say to my learners.

    [Preventive amnesty] Criticising isn’t easy. We all tend to see criticism as an attack and to react accordingly. Nobody likes to be attacked, and nobody likes to attack. I know that you mean well. I know that you don’t want to attack me. However, I need to improve.

    [Mandatory question] Please write at least two things you would improve about this lecture/class. You cannot pass on this task. You are not allowed to say “everything is OK”. You will not leave this room unless you hand me a post-it note with the two things you liked the least about this class/lecture.

    [Preventive amnesty] I promise, I know that you mean well. You are not attacking me; you are giving me a chance to improve.

    That’s it.

    When I teach using the Data Carpentry methods, each of my learners already has two post-it notes that they use to signal whether they are done with an assignment (green) or are stuck with it (red). In these cases, I ask them to use these notes to fill in their responses – one post-it note for the positive feedback, and another one for the criticism. It always works like a charm.

    A pile of green and red post-it notes with feedback on them

    February 4, 2018 - 2 minute read -
    data science feedback presentation presenting teaching blog
  • Data is the new

    Data is the new

    February 3, 2018

    I stumbled upon a rant titled Data is not the new oil — Tech Insights

    You’ve heard it many times and so have I: “Data is the new oil” Well it isn’t. At least not yet. I don’t care how I get oil for my car or heating. I simply decide what to cook and where to drive when I want. I’m unconcerned which mechanism is used to refine oil […]

    Funny, in my own rant, “data is not the new gold”, I claimed that “oil” was a better analogy for data than gold. Obviously, any “X is the new Y” sentence is problematic, but it’s still funny how much we like them.

    February 3, 2018 - 1 minute read -
    data science rant blog
  • Yes, your friends are more successful than you are. On "The Majority Illusion in Social Networks"

    Yes, your friends are more successful than you are. On "The Majority Illusion in Social Networks"

    February 2, 2018

    Recently, I re-read “The Majority Illusion in Social Networks” (by Lerman, Yan and Wu).

    The starting point of this paper is the friendship paradox – a situation in which a node in a network has fewer friends than its friends have, on average. The authors expand this paradox to what they call “the majority illusion” – a situation in which a node may observe that the majority of its friends have a particular property, despite that property being rare in the entire network.

    An illustration of the “majority illusion” paradox. The two networks are identical, except for which three nodes are colored. These are the “active” nodes and the rest are “inactive.” In the network on the left, all “inactive” nodes observe that at least half of their neighbors are “active,” while in the network on the right, no “inactive” node makes this observation.

    Besides pointing out the existence of the majority illusion phenomenon, the authors used synthetic networks to characterize the situations in which it is most prevalent.

    Quoting the authors:

    the paradox is stronger in networks in which the better-connected nodes are active, and also in networks with a heterogeneous degree distribution. […] The paradox is strongest in networks where low degree nodes have the tendency to connect to high degree nodes. […] Activating the high degree nodes in such networks biases the local observations of many nodes, which in turn impacts collective phenomena

    The conditions listed in the quote above describe many known social networks. The last sentence in that quote is of special interest. It explains the contagious nature of many actions, from sharing a meme to buying a new car.
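
    The paradox is easy to reproduce in a toy network. The sketch below (my own illustration, not the authors’ code) builds a small star network in which only the well-connected hub is “active”: the property is rare in the network as a whole, yet every inactive node sees it in 100% of its neighbors.

```python
# A tiny star network: the hub (node 0) is connected to every other node.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Only the hub is "active" -- a rare property (20% of the network).
active = {0}

# Fraction of active neighbors, as observed by each inactive node.
observed = {
    node: sum(n in active for n in neighbors[node]) / len(neighbors[node])
    for node in neighbors if node not in active
}
print(observed)  # every inactive node sees 100% active neighbors
```

    This is exactly the configuration the authors describe: a high-degree node that is active biases the local observations of many low-degree nodes at once.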

    February 2, 2018 - 2 minute read -
    life social-network-analysis blog
  • Analysis of A Beautiful Storm: Internal Communication at Automattic

    Analysis of A Beautiful Storm: Internal Communication at Automattic

    February 2, 2018

    My teammate’s post on data.blog

    February 2, 2018 - 1 minute read -
    blog
  • Gender salary gap in the Israeli high-tech

    Gender salary gap in the Israeli high-tech

    February 1, 2018

    A large and popular Israeli Facebook group, “The High-Tech Troubles,” has recently surveyed its participants. The responders provided personal, demographic, and professional information. The group owners have published the aggregated results of that survey. In this post, I analyze a particular aspect of these findings, namely, how the responders’ gender and experience affect their salary. It is worth noting that this survey is by no means a representative one. Its most noticeable, but not its only, problem is participation bias. Another problem is the fact that the result tables do not contain any information about the number of responders in each group. Without this information, it is impossible to compute confidence intervals for any findings. Despite these problems, the results are interesting and worth noting.

    The data that I used in my analysis is available in this spreadsheet. The survey organizers promise that they excluded groups and categories with too few answers, and we have to trust them on that. The results are divided into twenty professional categories such as ‘Account Management,’ ‘Data Science,’ ‘Support,’ and ‘CXO’ (which stands for an executive role). The salary groups are organized in exponential bins according to the years of experience: 0–1, 1–2, 2–4, 4–7, and more than seven years. Some of the cell values are missing; I assume that these are the categories with too few responders. I took a look at the gap between the salary reported by women and the compensation reported by men.

    Let’s take a look at the most complete set of data – the groups of people with 1-2 years of experience. As we may see from the figure below, in thirteen out of twenty groups (65%), women get less money than men.
    Gender compensation gap, 1-2 years of experience. Women earn less in 13 of 20 categories

    Among the workers with 1–2 years of experience, the most discriminatory fields are executive roles and security research. It is interesting to note the difference between two closely related fields: Data Science and BI/Data Analysis. The former is considered the more lucrative position. On average, male data scientists get 11% more than their female colleagues, while male data analysts get 13% less than their female counterparts. I wonder how this difference relates to my (very limited) observation that most of the people who call themselves BI experts are women, while most of the data scientists whom I know are men.
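
    For concreteness, this is the computation behind the percentages above: the gap is the men’s reported salary relative to the women’s, per category. The salary figures below are made up to mirror the two quoted percentages; the survey’s actual numbers live in the linked spreadsheet.

```python
# Hypothetical salaries (thousands of ILS), chosen only to reproduce
# the two percentages quoted above -- not the survey's real figures.
salaries = {
    "Data Science": {"men": 30.0, "women": 27.0},
    "BI/Data Analyst": {"men": 20.0, "women": 23.0},
}

# Gap as the men's premium (or penalty) relative to the women's salary.
gaps = {
    category: 100.0 * (pay["men"] - pay["women"]) / pay["women"]
    for category, pay in salaries.items()
}
for category, gap in gaps.items():
    print(f"{category}: men report {gap:+.0f}% relative to women")
```
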

    As we have seen, there is not much gender equality for young professionals. What happens when people gain experience? How does the gender compensation gap look after more than seven years of professional life? The situation is even worse. In fourteen out of sixteen available fields, women get less money than men. The only field in which it pays to be a woman is the executive roles, where women get 19% more than men.

    Gender compensation gap, more than 7 years of experience. Women earn less in 14 of 16 categories

    To complete the picture, let’s look at the gap dynamics over the years in all the occupation fields in that report.

    Gender gap dynamics. 20 professional fields over different experience bins

    What do we learn from these findings?

    These findings are real. We cannot use the non-representativity of these data and the lack of confidence intervals to dismiss them. I don’t expect the participants to lie, nor do I expect the participation patterns to introduce a bias strong enough to reverse the results. It is true that I can’t obtain confidence intervals for these results. However, the fact that the vast majority of the groups lie on one side of the equality line suggests the overall validity of the gender gap notion. How can we fix this gap? I frankly don’t know. As a father of three daughters (9, 12, and 14 years old), I talk to them about this gap. I make sure they are aware of this problem so that, when it’s their turn to negotiate compensation, they know about the systematic bias. I hope that this knowledge will give them the right tools to fight for justice.

    February 1, 2018 - 3 minute read -
    Data Visualization gender gender-inequality Israel salary work blog
  • Don't take career advice from people who mistreat graphs this badly

    Don't take career advice from people who mistreat graphs this badly

    January 4, 2018

    Recently, I stumbled upon a report called “Understanding Today’s Chief Data Scientist” published by an HR company called Heidrick & Struggles. This document tries to draw a profile of the modern chief data scientist in today’s Big Data Era. This document contains the ugliest pieces of data visualization I have ever seen. I can’t think of a more insulting graphical treatment of data. Publishing graphs like these in a document that tries to discuss careers in data science is like writing a profile of a papal candidate and accompanying it with pornographic pictures.

    Before explaining my harsh attitude, let’s first ask an important question.

    What is the purpose of graphs in a report?

    There are only two valid reasons to include graphs in a report. The first reason is to provide a meaningful glimpse into the document. Before a person decides whether he or she wants to read a long document, they want to know what it is about, what methods were used, and what the results are. The best way to engage the potential reader is to provide them with a set of relevant graphs (a good abstract or introduction paragraph helps too). The second reason to include graphs in a document is to provide details that cannot be effectively communicated by text-only means.

    That’s it! Only two reasons. Sometimes, we might add an illustration or two to decorate a long piece of text. Adding illustrations might be a valid decision, provided that they do not compete with the data and it is obvious to any reader that an illustration is an illustration.

    Let the horror begin!

    The first graph in the H&S report struck me with its absurdity.

    Example of a bad chart. I have no idea what it means

    At first glance, it looks like an overly artistic doughnut chart. Then, you try to understand what you are looking at. “OK,” you say to yourself, “there were 100 employees who belonged to five categories. But what are those categories? Can someone tell me? Please?” Maybe the report references this figure with more explanations? Nope. Nothing. This is just a doughnut chart without a caption or a title. Without a meaning.

    I continued reading.

    Two more bad charts. The graphs are meaningless!

    OK, so the H&S geniuses decided to hide the origin of their bar charts. Had they been students in a dataviz course I teach, I would have given them a zero. Ooookeeyy, it’s not a college assignment; as long as we can reconstruct the meaning from the numbers and the labels, we are good, right? I tried to do just that and failed. I tried to use the numbers in the text to help me fill in the missing information and failed. All in all, these two graphs are meaningless graphical junk, exactly like the first one.

    The fourth graph gave me some hope.

    Not an ideal pie chart but at least we can understand it

    Sure, this graph will not get the “best dataviz” award, but at least I understand what I’m looking at. My hope was premature. The next graph was as nonsensical as the first three.

    Screenshot with an example of another nonsense graph

    Finally, the report authors decided that it wasn’t enough to draw smart-looking color segments enclosed in a circle. They decided to add some cool-looking lines. The authors remained faithful to their decision not to let any meaning into their graphical aids.

    Screenshot with an example of a nonsense chart

    Can’t we treat these graphs as illustrations?

    Before co-founding the life-changing StackOverflow, Joel Spolsky was, among other things, an avid blogger. His blog, JoelOnSoftware, was the first blog I started following. Joel writes mostly about the programming business. In order not to intimidate the readers with endless text blocks, Joel tends to break the text with illustrations. In many posts, Joel uses pictures of a cute Husky as an illustration. Since JoelOnSoftware isn’t a cynology blog, nobody gets confused by the sudden appearance of a Husky. Which is exactly what an illustration is: a graphical relief that doesn’t disturb. But what would happen if Joel decided to include a meaningless class diagram? Sure, a class diagram may impress the readers. The readers will also want to understand it and its connection to the text. Once they fail, they will feel angry, and rightfully so.

    Two screenshots of Joel's blog. One with a Husky, another one with a meaningless diagram

    The bottom line

    The bottom line is that people have to respect the rules of the domain they are writing about. If they don’t, their opinion cannot be trusted. That is why you should not take any advice related to data (or science) from H&S. Don’t get me wrong: it’s OK not to know the “grammar” of every possible business domain. I, for example, know nothing about photography or dancing; my English is far from perfect. That is why I don’t write about photography, dancing, or creative writing. I write about data science and visualization. It doesn’t mean I know everything about these fields. However, I did study a lot before I decided I could write something without ridiculing myself. So should everyone.

    January 4, 2018 - 4 minute read -
    best-practice career critique data science Data Visualization dataviz blog Career advice
  • AI and the War on Poverty, by Charles Earl

    AI and the War on Poverty, by Charles Earl

    January 2, 2018

    It’s such a joy to work with smart and interesting people. My teammate, Charles Earl, wrote a post about machine learning and poverty. It’s not short, but it’s worth reading.

    A.I. and Big Data Could Power a New War on Poverty is the title of an op-ed in today’s New York Times by Elisabeth Mason. I fear that AI and Big Data is more likely to fuel a new War on the Poor unless a radical rethinking occurs. In fact this algorithmic War on the Poor […]

    via AI and the War on Poverty — Charlescearl’s Weblog

    January 2, 2018 - 1 minute read -
    ai artificial-intelligence machine learning blog
  • Two heads are better than one, or how not to abandon your blog

    Two heads are better than one, or how not to abandon your blog

    December 31, 2017

    The recording of my presentation at WordCamp Moscow (August 2017) is finally available online: Two Heads are Better Than One – on blogging persistence (Russian)

    https://videopress.com/v/QEUQ1aKw

    December 31, 2017 - 1 minute read -
    blogging persistence presentation research russian video blog