Many years ago, I terribly overfit a model which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping in a wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeping into the future is a bad thing.
Anyhow, Yanir Seroussi, my coworker data scientist, gave a very good talk on bootstrapping.
Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, where each data point was expensive, and researchers operated on datasets that could fit a single A4 sheet of paper. Today, with mega- giga- and tera-bytes datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.
I would like to start with Mehmet Suzen mllib.wordpress.com who treats overfitting as “inaccurate meme in supervised learning”
cross-validation does not prevent your model to overfit and good out-of-sample performance does not guarantee not-overfitted model.
Another blogger, whose name I couldn’t find, has two very detailed posts on overfitting:
Understanding overfitting from bias-variance trade-off and Understanding overfitting from Haussler 1988 theorem
Finally, Adrian from the “morning paper” (please don’t tell me you don’t follow that blog) has a summary of another paper, titled “Understanding deep learning requires re-thinking generalization” (I only read Adrian’s summary).
No conclusions here. It’s a reading list.
Featured image credit: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
Does Caring About the Analysis Matter?
The simplystatistics.org blog had an interesting discussion about podcast Roger Peng from simplystatistics.org recorded on A/B testing on Etsy. One of the late conclusions Roger Peng had is as follows
“Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”
Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep involvement is also a bad idea. Too deep involvement creates a bias. Such a bias, that can be conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems until we get the results we expect. But mostly, not long after that.
There are more mechanisms that may cause false findings. For a good review, I suggest reading Why Most Published Research Findings Are False by John P. A. Ioannidis.
Image source: Data Analysis and Engagement – Does Caring About the Analysis Matter? — Simply Statistics
I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog. The post is dedicated to overfitting — the most scaring problem in machine learning. Overfitting is easy to do and is hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.
Quoting Tom Breur:
Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!
Happy dredging indeed.