Bootstrapping the right way?

Many years ago, I terribly overfit a model which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping in a wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeping into the future is a bad thing.

Anyhow, Yanir Seroussi, my coworker data scientist, gave a very good talk on bootstrapping.

Yanir Seroussi

Bootstrapping the right way is a talk I gave earlier this year at the YOW! Data conference in Sydney. You can now watch the video of the talk and have a look through the slides. The content of the talk is similar to a post I published on bootstrapping pitfalls, with some additional simulations.

The main takeaways shared in the talk are:

  • Don’t compare single-sample confidence intervals by eye
  • Use enough resamples (15K?)
  • Use a solid bootstrapping package (e.g., Python ARCH)
  • Use the right bootstrap for the job
  • Consider going parametric Bayesian
  • Test all the things

Testing all the things typically requires writing code, which I did for the talk. You can browse through it in this notebook. The most interesting findings from my tests are summarised by the following figure.

Revenue confidence intervals

The figure shows how the accuracy of confidence interval estimation varies by algorithm, sample size…

View original post 405 more words

Overfitting reading list

Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, where each data point was expensive, and researchers operated on datasets that could fit a single A4 sheet of paper. Today, with mega- giga- and tera-bytes datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.

I would like to start with Mehmet Suzen mllib.wordpress.com who treats overfitting as “inaccurate meme in supervised learning

cross-validation does not prevent your model to overfit and good out-of-sample performance does not guarantee not-overfitted model.

Another blogger, whose name I couldn’t find, has two very detailed posts on overfitting:

Understanding overfitting from bias-variance trade-off and Understanding overfitting from Haussler 1988 theorem

Finally, Adrian from the “morning paper” (please don’t tell me you don’t follow that blog) has a summary of another paper, titled “Understanding deep learning requires re-thinking generalization” (I only read Adrian’s summary).

Conclusion

No conclusions here. It’s a reading list.

Featured image credit: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg

Why deeply caring about the analysis isn’t always a good thing?

Illustration: a person looks at sheets of paper and thinks

Does Caring About the Analysis Matter?

The simplystatistics.org blog had an interesting discussion about podcast Roger Peng from simplystatistics.org recorded on A/B testing on Etsy. One of the late conclusions Roger Peng had is as follows
“Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”

A hypothetical graph that show that $$ potential is lower as

Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep involvement is also a bad idea. Too deep involvement creates a bias. Such a bias, that can be conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems until we get the results we expect. But mostly, not long after that.

There are more mechanisms that may cause false findings. For a good review, I suggest reading  Why Most Published Research Findings Are False by John P. A. Ioannidis.
Image source: Data Analysis and Engagement – Does Caring About the Analysis Matter? — Simply Statistics

Although it is easy to lie with statistics, it is easier to lie without

I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog. The post is dedicated to overfitting — the most scaring problem in machine learning. Overfitting is easy to do and is hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.

Quoting Tom Breur:

Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!

Happy dredging indeed.