Reading through the conference proceedings, I found a very interesting paper titled “The Cost of Fairness in Binary Classification.” This paper talks about the measures one needs to take in order not use sensitive features (such as race) as the means to discrimination, with a reasonable accuracy tradeoff.
Skimming through this paper, I recalled a conversation I had about a year ago with a chief data scientist in a startup that provides short-term loans to people who need some money now. The major job of the data science team in that company was to assess the risk of a customer. From the explanation the chief data scientist gave, and from the data sources she described, it was clear that they train their model on the information whether a person is likely to receive a loan from a financial institution. When I pointed out that they exclude categories of people that are rejected but are likely to return the money. “Yes?” she said in a tone as if she couldn’t see what the problem that I tried to raise was. “Well,” I said, it’s unfair for many customers, plus you’re missing the chance to recruit customers who were rejected by others”. “We have enough potential customers,” she said. She didn’t think fairness was an issue worth talking about.
Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, where each data point was expensive, and researchers operated on datasets that could fit a single A4 sheet of paper. Today, with mega- giga- and tera-bytes datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.
My stand on learning data science is known: I think that learning “data science” as a career move is a mistake. You may read this long rant of mine to learn why I think so. This doesn’t mean that I think that studying data science, in general, is a waste of time.
Let me explain this confusion. Take this blogger for example https://thegirlyscientist.com/. As of this writing, “thegirlyscientst” has only two posts: “Is my finance degree useless?” and “How in the world do I learn data science?“. This person (whom I don’t know) seems to be a perfect example of someone may learn data science tools to solve problems in their professional domain. This is exactly how my professional career evolved, and I consider myself very lucky about that. I’m a strong believer that successful data scientists outside the academia should evolve either from domain knowledge to data skills or from statistical/CS knowledge to domain-specific skills. Learning “data science” as a collection of short courses, without deep knowledge in some domain, is in my opinion, a waste of time. I’m constantly doubting myself with this respect but I haven’t seen enough evidence to change my mind. If you think I miss some point, please correct me.
It’s such a joy to work with smart and interesting people. My teammate, Charles Earl, wrote a post about machine learning and poverty. It’s not short, but it’s worth reading.
A.I. and Big Data Could Power a New War on Poverty is the title of on op-ed in today’s New York Times by Elisabeth Mason. I fear that AI and Big Data is more likely to fuel a new War on the Poor unless a radical rethinking occurs. In fact this algorithmic War on the Poor […]
Many years ago, I tried to build something that today would have been called “Google Trends for Pubmed”. One thing that I’ve found during that process was how the emergence of HIV-related research reduced the number of cancer studies and how, several years later, the HIV research boom settled down, while letting the cancer research back.
I recalled about that project of mine when I took a look at the Google Trends data for, a once popular buzz-phrases, “data mining” and pattern recognition. Sic transit gloria mundi.
It’s not surprising that “Data Science” was the less popular term in 2004. As I already mentioned, “Data Science” is a relatively new term. What does surprise me is the fact that in the past, “Machine Learning” was so less popular that “Data Mining”. Even more surprising is the fact that Google Trends ranks “Machine Learning” almost twice as high, as “Data Science”. I was expecting to see the opposite.
“Pattern Recognition,” that, in 2004, was as (not) popular as “Machine Learning” become even less popular today. Does that mean that nobody is searching for patterns anymore? Not at all. The 2004 pattern recognition experts are now machine learning professors senior data scientists or if they work in academia, machine learning professors.
PS: does anybody know the reason behind the apparent seasonality in “Data Mining” trends?
I developed an anomaly detection system for Automattic internal dashboard. When presenting this system (“When good enough is just good enough“), I used to tell that in our particular case, the cost of false alerts was almost zero. I used to explain this claim by the fact that no automatic decisions were made based on the alerts, and that the only subscribers of the alert messages were a limited group of colleagues of mine. Automattic CFO, Stu West, who was the biggest stakeholder in this project, asked me not to stop claiming the “zero cost” claim. When the CFO of the company you work for asks you to do something, you comply. So, I stopped saying “zero cost” but I still listed the error costs as a problem I can safely ignore for the time being. I didn’t fully believe Stu, which is evident from the speaker notes of my presentation deck:
I recalled about Stu’s request to stop talking about “zero cost” of false alerts today. Today, I noticed more than 10 unread messages in the Slack channel that receives my anomaly alerts. The oldest unread message was two weeks old. The only reason this could happen is that I stopped caring about the alerts because there were too many of them. I witnessed the classical case of “alert fatigue”, described in “The Boy Who Cried Wolf”, many centuries ago.
The lesson of this story is that there is no such a thing as zero-cost false alarms. Lack of specificity is a serious problem.
If you are a data scientist, I am sure you wondered whether deep neural networks will replace you at your job one day. Every time I read about reports of researchers who managed to trick neural networks, I wonder whether the researchers were thinking about their job security, or their professional pride while performing the experiments. I think that the first example of such a report is a 2014 paper by Christian Szegedy and his colleagues called “Intriguing properties of neural networks“. The main goal of this paper, so it seems, was to peek into the black box of neural networks. In one of the experiments, the authors designed minor, invisible perturbation of the original images. These perturbations diminished the classification accuracy of a trained model.
I went through my Machine Learning tag feed. Suddenly, I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey some information. The problem is that it is bad in so many ways. It is very hard to appreciate the information in a post that shows charts like that. Especially when the post talks about data science that relies so much on data visualization.
I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog. The post is dedicated to overfitting — the most scaring problem in machine learning. Overfitting is easy to do and is hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.
Quoting Tom Breur:
Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!
Frequently, training a machine learning model in a single session is impossible. Most commonly, this happens when one needs to update a model with newly obtained observations. The generic term for such an update is “online learning.” In the scikit-learn world, this concept is also known as partial fit. The problem is that some models or their implementations don’t allow for partial fit. Even if the partial fitting is technically possible, the weight assigned to the new observations is may not be under your control. What happens when you re-train a model from scratch, or when the new observations are assigned too high weights? Recently, I stumbled upon an interesting concept of Pseudo-rehearsal that addresses this problem. Citing Matthew Honnibal:
Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the “catastrophic forgetting” problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.
This post is written by Matthew Honnibal from the team behind the excellent Spacy NLP library. This post is valuable in many aspects. First, it demonstrates a simple-to-implement technique. More importantly, it provides the True Name for a problem I encounter from time to time: Catastrophic forgetting.
Featured image is by Flickr user Herr Olsen under CC-by-nc-2.0