Against A/B tests

Traditional A/B testing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherently inferior to choosing a targeted mix of A and B.

Michael Kaminsky locallyoptimistic.com

The quote above is from Michael Kaminsky’s post “Against A/B tests”. I’m still not fully convinced by Michael’s thesis, but it is interesting and thought-provoking.

Can error correction cause more error? (The answer is yes)

This is an interesting thought experiment. Suppose that you have a device whose output is normally distributed around the point you aim at. For example, a nerf gun. Let’s say now that you aim and fire the gun, and you miss by some amount X. Should you correct your aim in the opposite direction? My intuition says “yes.” So does the intuition of many other people with whom I talked about this problem. However, once we start thinking about the problem, we realize that the intuition is wrong. Since we aimed the gun, our assumption should be that the systematic deviation is zero. A single observation is not sufficient to reject this assumption. By continually adjusting the data-generating process based on a single observation, we reduce the precision (increase the dispersion).
Below is a simulation of adjusted and non-adjusted processes (the code is here). The broader spread of the adjusted data (blue line) is evident.

Two curves. Blue: high dispersion of values when adjustments are performed after every observation. Orange: smaller dispersion when no adjustments are done.

Due to the nature of the normal random variable, a single large accidental deviation can cause an extreme “correction,” which in turn will create a prolonged period of highly inaccurate points. This is precisely what you see in my simulation.
The moral of this simple experiment is that you shouldn’t let a single observation affect your actions.
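The experiment can be sketched in a few lines of Python. This is a minimal illustration, not the original code linked above. It assumes one specific correction rule (after every shot, the aim is moved by the full size of the last miss, in the opposite direction), and the names `simulate`, `sigma`, and `seed` are made up for this sketch.

```python
import random
import statistics


def simulate(n_shots, adjust, sigma=1.0, seed=0):
    """Return the landing positions of n_shots shots at a target located at 0."""
    rng = random.Random(seed)
    aim = 0.0  # we start by aiming exactly at the target
    hits = []
    for _ in range(n_shots):
        # Each shot lands at the aim point plus normally distributed noise
        hit = aim + rng.gauss(0.0, sigma)
        hits.append(hit)
        if adjust:
            # "Correct" the aim by the size of the last miss, in the opposite direction
            aim -= hit
    return hits


fixed = simulate(10_000, adjust=False)
adjusted = simulate(10_000, adjust=True)

print("spread without adjustments:", statistics.pstdev(fixed))
print("spread with adjustments:   ", statistics.pstdev(adjusted))
```

Under this particular rule, every shot carries its own noise plus the noise of the previous shot, so the spread of the adjusted process comes out roughly √2 times larger than that of the untouched one. Other correction rules will behave differently, but none of them can beat leaving a well-aimed gun alone.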

 

Mammogram, breast cancer, and manipulative statistics

Here’s a quiz

A healthy woman with no risk factors gets a positive mammogram result during a routine annual check. What is the probability that she actually has breast cancer?
Baseline data: The probability that a woman has breast cancer is 0.8%. If she has breast cancer, the probability that a mammogram will show a positive result is 90%. If a woman does not have breast cancer, the probability of a positive result is 7%.

Prof. Gerd Gigerenzer gave this quiz to numerous students, physicians, and professors. Most of them failed it. The correct answer is 9%. The probability that such a woman has breast cancer, given a positive mammogram test, is only nine percent! This means that about ninety percent of the women who get a positive result will undergo a stressful and painful series of tests only to discover that it was a false alarm. In his book “Calculated Risks”, Prof. Gigerenzer uses this low probability as a starting (but not the only) argument against the common practice of routine population-wide mammogram tests. However, I would like to propose another way to look at this problem.
To understand my concern, let me first explain how we get the 9% figure.
There are several ways to get to this result. One of them is as follows. Eighty out of 10,000 women have breast cancer. Of those women, 72 (90% of 80) will test positive during a mammogram. Of the remaining 9,920 healthy women, about 694 (7%) will also have a positive mammogram test. The total number of women with a positive test is 766. Of those 766 women, only 72 have breast cancer, which is about 9%. The following diagram will help you track the numbers.

Diagram that presents natural occurrence of breast cancer, and the statistics of mammogram tests
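The same natural-frequency calculation can be written as a short Python sketch. It uses only the baseline figures from the quiz; the variable names are chosen for this illustration.

```python
# Baseline data from the quiz
population = 10_000
prevalence = 0.008           # P(cancer)
sensitivity = 0.90           # P(positive | cancer)
false_positive_rate = 0.07   # P(positive | no cancer)

sick = population * prevalence                    # about 80 women
healthy = population - sick                       # about 9,920 women
true_positives = sick * sensitivity               # about 72 women
false_positives = healthy * false_positive_rate   # about 694 women

# Share of women with cancer among all women who tested positive
p_cancer_given_positive = true_positives / (true_positives + false_positives)
print(f"P(cancer | positive) = {p_cancer_given_positive:.1%}")
```

Counting people instead of multiplying conditional probabilities is exactly the trick that makes the result easier to follow: the final step is just "72 out of roughly 766".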

Nine percent is indeed a low number. If a woman gets ten mammogram tests in her lifetime, there is a roughly 50% chance that she will have at least one false positive test (1 − 0.93^10 ≈ 52%, assuming the tests are independent). This is not something that can be easily ignored.

However

Let’s think about another way to look at this problem. Yes, the probability that a woman has breast cancer given a positive mammogram result is nine percent (72 out of 694+72=766). However, the probability that a woman has breast cancer given a negative mammogram result is 8 out of (9,226+8)=9,234, which is approximately 0.09%. That means that a woman with a positive mammogram test is about 100 times more likely to have breast cancer than a woman with a negative result. An increase by a factor of 100 sounds like a serious threat. Much more serious than the nine percent! Moreover, a woman with a negative mammogram result knows that she is approximately ten times less likely to have breast cancer than an average woman who didn’t undergo the test (0.09% vs 0.8%).
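Both ways of presenting the risk can be computed from the same three baseline numbers. The following is a minimal sketch with variable names invented for this illustration. Because it does not round the intermediate counts, the ratio comes out slightly above the round figure of 100.

```python
# Baseline data from the quiz
prevalence = 0.008           # P(cancer)
sensitivity = 0.90           # P(positive | cancer)
false_positive_rate = 0.07   # P(positive | no cancer)

# Overall probability of a positive test
p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate

# Absolute risk after a positive and after a negative result (Bayes' rule)
p_cancer_given_positive = prevalence * sensitivity / p_positive
p_cancer_given_negative = prevalence * (1 - sensitivity) / (1 - p_positive)

# How many times riskier a positive result is compared to a negative one
relative_risk = p_cancer_given_positive / p_cancer_given_negative

print(f"P(cancer | positive) = {p_cancer_given_positive:.2%}")
print(f"P(cancer | negative) = {p_cancer_given_negative:.3%}")
print(f"relative risk        = {relative_risk:.0f}")
```

The three printed numbers describe the same test on the same population. Which one sounds alarming depends entirely on which one is put in the headline.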

Conclusion?

Frankly, I don’t know. One thing is for sure: one can use statistics to steer an “average person” towards the desired decision. If my goal is to reduce the number of women who undergo routine mammogram tests, I will talk in terms of absolute risk (9%). If, on the other hand, I’m selling mammogram equipment, I will definitely talk in terms of relative risk, i.e., the 100-times risk increase. Think about this every time someone is talking to you about hazards.

Overfitting reading list

Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, when each data point was expensive, and researchers operated on datasets that could fit on a single A4 sheet of paper. Today, with mega-, giga-, and terabyte datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.

I would like to start with Mehmet Suzen (mllib.wordpress.com), who treats overfitting as an “inaccurate meme in supervised learning”:

cross-validation does not prevent your model to overfit and good out-of-sample performance does not guarantee not-overfitted model.

Another blogger, whose name I couldn’t find, has two very detailed posts on overfitting:

Understanding overfitting from bias-variance trade-off and Understanding overfitting from Haussler 1988 theorem

Finally, Adrian from the “morning paper” (please don’t tell me you don’t follow that blog) has a summary of another paper, titled “Understanding deep learning requires re-thinking generalization” (I only read Adrian’s summary).

Conclusion

No conclusions here. It’s a reading list.

Featured image credit: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg

Why deeply caring about the analysis isn’t always a good thing?

Illustration: a person looks at sheets of paper and thinks

Does Caring About the Analysis Matter?

The simplystatistics.org blog had an interesting discussion about a podcast that Roger Peng recorded on A/B testing at Etsy. One of Roger Peng’s later conclusions is as follows:
“Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”

A hypothetical graph that show that $$ potential is lower as

Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep an involvement is also a bad idea, because it creates a bias. This bias, which can be conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems, until we get the results we expect, and rarely much longer than that.

There are more mechanisms that may cause false findings. For a good review, I suggest reading “Why Most Published Research Findings Are False” by John P. A. Ioannidis.
Image source: Data Analysis and Engagement – Does Caring About the Analysis Matter? — Simply Statistics

On statistics and democracy, or why exposing a fraud may mean nothing

The “stat” in the word “statistics” means “state”, as in “government/sovereignty”. Statistics was born as a state effort to use data to rule a country. Even today, every country I know has its own statistics authority. For many years, many governments have been hiding the true statistics from the public, under the assumption that knowledge means power. I was reminded of this after reading the post “Mathematicians, rock the vote!” by my teammate Charles Earl, in which he encourages mathematicians to fight gerrymandering. Gerrymandering is a dubious practice in the American voting system, where a regulatory body forms voting districts in such a way that the party that appointed that body has the highest chance to win. Citing Charles:

It is really heartening that discrete geometry and other branches of advanced mathematics can be used to preserve democracy

I can’t share Charles’s optimism. In the past, statistics has been used successfully several times to expose election fraud in Russia (see, for example, these two links, but there are many more [one] [two]). People took to the streets, waving posters such as “We don’t believe Churov [a Russian politician], we believe Gauss.”

Demonstration in Russia. Poster: "We don't believe Churov. We believe Gauss"
“We don’t believe Churov. We believe Gauss”. Taken from Anatoly Karlin’s site http://akarlin.com/2011/12/measuring-churovs-beard/

Why, then, am I not optimistic? After all, even the great Terminator, one of my favorite Americans, Arnold Schwarzenegger, fights gerrymandering.

Illustration: Arnold Schwarzenegger on the gerrymandering problem

The problem is not that Americans don’t know how to eliminate gerrymandering. The information is there, and the solution is known [ref, as an example]. In theory, it is a very easy problem. In practice, however, power, even more than drugs and sex, is addictive. People don’t tend to give up their power easily. What happened in Russia after an election fraud was exposed using statistics? Another election fraud. And then yet another. What will happen in the US? I’m afraid that nothing will change there either.

 

The Monty Hall Problem simulator

Illustration: a cute goat

A couple of days ago, I told my oldest daughter about the Monty Hall problem, the famous probability puzzle with a counter-intuitive solution. My daughter didn’t believe me. Even when I told her all about the probabilities, the added information, and the other stuff, she still couldn’t “feel” it. I looked for an online simulator and couldn’t find anything that I liked. So, I decided to create a simulation in a Jupyter notebook.

Illustration: Screenshot of a Jupyter notebook that shows the output of one round of Monty Hall simulation

I’ve uploaded the notebook to GitHub, in case someone else wants to play with it [link].
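For readers who want to "feel" the result without opening the notebook, here is a minimal standalone sketch of such a simulation. It is not the notebook's code. It assumes the standard rules of the puzzle (three doors, and a host who always opens a goat door that the player did not pick), and the function names `play` and `win_rate` are invented for this illustration.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # the door hiding the car
    pick = rng.choice(doors)   # the player's initial choice
    # The host opens a door that is neither the player's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_rounds=100_000, seed=0):
    """Fraction of rounds won with a fixed strategy (always stay or always switch)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_rounds)) / n_rounds


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"win rate when staying:   {stay_rate:.3f}")
print(f"win rate when switching: {switch_rate:.3f}")
```

Over many rounds, staying wins about one third of the time and switching about two thirds, which is the counter-intuitive answer.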

What are the best practices in planning & interpreting A/B tests?

Screenshots of the reading mentioned in this post

Compiled by my teammate Yanir Seroussi, the following is a reading list on A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s. The reviews are mine. Collective intelligence in action 🙂

  • If you don’t pay attention, data can drive you off a cliff
    In this post, Yanir lists seven mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir’s points are trivial doesn’t make them less correct. Awareness doesn’t exist without knowledge. Unfortunately, knowledge doesn’t assure awareness. Which is why reading trivial truths is a good thing to do from time to time.
  • How to identify your marketing lies and start telling the truth
    This post was written by Tiberio Caetano, a data science professor at the University of Sydney. If I had to summarize this post with a single phrase, that would be “confounding factors”. A confounding variable is a variable hidden from your eye that influences a measured effect. For example, you start an ad campaign for ice cream, your sales go up, and you conclude that the ad campaign was effective. What you forgot was that the ad campaign started at the beginning of the summer, when people start buying more ice cream anyhow.
    See this link for a detailed textbook-quality review of confounding variables.
  • Seven rules of thumb for web site experimenters
    I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
  • A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments
    Another academic paper by Microsoft researchers. This one lists a lot of “don’ts”. As in the previous link, every piece of advice the authors give is based on established theory and backed up by real data.