Tag: machine learning
-
Interview 27: Racial discrimination and fair machine learning
-
Online data science conference on May, 28
NDR is a family of machine learning/data science conferences. Their next conference will be held online on May, 28 and the agenda looks great.
-
Is security through obscurity back?
HBR published an opinion post by Andrew Burt, called “The AI Transparency Paradox.” This post talks about the problems that were created by tools that open up the “black box” of a machine learning model.
-
On algorithmic fairness & transparency
My teammate, Charles Earl has recently attended the Conference on Fairness, Accountability, andTransparency (FAT*). The conference site is full of very interesting material, including proceedings and video recording of lectures and tutorials.
-
Overfitting reading list
Overfitting is a situation in which a model accurately describes some data but not the phenomenon that generates that data. Overfitting was a huge problem in the good old times, where each data point was expensive, and researchers operated on datasets that could fit a single A4 sheet of paper. Today, with mega- giga- and tera-bytes datasets, overfitting is … still a problem. A very painful one. Following is a short reading list on overfitting.
-
Once again on becoming a data scientist
My stand on learning data science is known: I think that learning “data science” as a career move is a mistake. You may read this long rant of mine to learn why I think so. This doesn’t mean that I think that studying data science, in general, is a waste of time.
-
AI and the War on Poverty, by Charles Earl
It’s such a joy to work with smart and interesting people. My teammate, Charles Earl, wrote a post about machine learning and poverty. It’s not short, but it’s worth reading.
-
Buzzword shift
Many years ago, I tried to build something that today would have been called “Google Trends for Pubmed”. One thing that I’ve found during that process was how the emergence of HIV-related research reduced the number of cancer studies and how, several years later, the HIV research boom settled down, while letting the cancer research back.
-
On alert fatigue
I developed an anomaly detection system for Automattic internal dashboard. When presenting this system (“When good enough is just good enough”), I used to tell that in our particular case, the cost of false alerts was almost zero. I used to explain this claim by the fact that no automatic decisions were made based on the alerts, and that the only subscribers of the alert messages were a limited group of colleagues of mine. Automattic CFO, Stu West, who was the biggest stakeholder in this project, asked me not to stop claiming the “zero cost” claim. When the CFO of the company you work for asks you to do something, you comply. So, I stopped saying “zero cost” but I still listed the error costs as a problem I can safely ignore for the time being. I didn’t fully believe Stu, which is evident from the speaker notes of my presentation deck:
-
On machine learning, job security, professional pride, and network trolling
If you are a data scientist, I am sure you wondered whether deep neural networks will replace you at your job one day. Every time I read about reports of researchers who managed to trick neural networks, I wonder whether the researchers were thinking about their job security, or their professional pride while performing the experiments. I think that the first example of such a report is a 2014 paper by Christian Szegedy and his colleagues called “Intriguing properties of neural networks”. The main goal of this paper, so it seems, was to peek into the black box of neural networks. In one of the experiments, the authors designed minor, invisible perturbation of the original images. These perturbations diminished the classification accuracy of a trained model.
-
Good information + bad visualization = BAD
I went through my Machine Learning tag feed. Suddenly, I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey some information. The problem is that it is bad in so many ways. It is very hard to appreciate the information in a post that shows charts like that. Especially when the post talks about data science that relies so much on data visualization.
-
Although it is easy to lie with statistics, it is easier to lie without
I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog. The post is dedicated to overfitting – the most scaring problem in machine learning. Overfitting is easy to do and is hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.
-
Pseudo-rehearsal: A simple solution to catastrophic forgetting for NLP
Frequently, training a machine learning model in a single session is impossible. Most commonly, this happens when one needs to update a model with newly obtained observations. The generic term for such an update is “online learning.” In the scikit-learn world, this concept is also known as partial fit. The problem is that some models or their implementations don’t allow for partial fit. Even if the partial fitting is technically possible, the weight assigned to the new observations is may not be under your control. What happens when you re-train a model from scratch, or when the new observations are assigned too high weights? Recently, I stumbled upon an interesting concept of Pseudo-rehearsal that addresses this problem. Citing Matthew Honnibal:
Sometimes you want to fine-tune a pre-trained model to add a new label or correct some specific errors. This can introduce the “catastrophic forgetting” problem. Pseudo-rehearsal is a good solution: use the original model to label examples, and mix them through your fine-tuning updates.
-
Time Series Analysis: When “Good Enough” is Good Enough
My today’s talk at PyCon Israel in a post format.
-
Come to PyData at the Bar Ilan University to hear me talking about anomaly detection
On June 12th, I’ll be talking about anomaly detection and future forecasting when “good enough” is good enough. This lecture is a part of PyCon Israel that takes place between June 11 and 14 in the Bar Ilan University. The conference agenda is very impressive. If “python” or “data” are parts of your professional life, come to this conference!