Dispute for the sake of Heaven, or why it’s OK to have a loud argument with your co-worker

Any dispute that is for the sake of Heaven is destined to endure; one that is not for the sake of Heaven is not destined to endure
Chapters of the Fathers 5:27

One day, I had an intense argument with a colleague at my previous place of work, Automattic. Since most of the communication in Automattic happens in internal blogs that are visible to the entire company, this was a public dispute. In a matter of a couple of hours, some people contacted me privately on Slack. They told me that the message exchange sounded aggressive, both from my side and from the side of my counterpart. I didn’t feel that way. In this post, I want to explain why it is OK to have a loud argument with your co-workers.

How it all began?

I’m a data scientist and algorithm developer. I like doing data science and developing algorithms. Sometimes, to be better at my job, I need to show my work to my colleagues. In a “regular” company, I would ask my colleagues to step into my office and play with my models. Automattic isn’t a “regular” company. At Automattic, people from more than sixty countries from in every possible time zone. So, I wanted to start a server that will be visible by everyone in the company (and only by them), that will have access to the relevant data, and that will be able to run any software I install on it.

Two bees fighting

X is a system administrator. He likes administrating the systems that serve more than 2000,000,000 unique visitors in the US alone. To be good at his job, X needs to make sure no bad things happen to the systems. That’s why when X saw my request for the new setup (made on a company-visible blog page), his response was, more or less, “Please tell me why do you think you need this, and why can’t you manage with what you already have.”

Frankly, I was furious. Usually, they tell you to count to ten before answering to someone who made you angry. Instead, I went to my mother-in-law’s birthday party, and then I wrote an answer (again, in a company-visible blog). The answer was, more or less, “because I know what I’m doing.” For which, X replied, more or less, “I know what I do too.”

How it got resolved?

At this point, I started realizing that X is not expected to jeopardize his professional reputation for the sake of my professional aspirations. It was true that I wanted to test a new algorithm that will bring a lot of value to the company for which I work. It is also true that X doesn’t resent to comply with every developers’ request out of caprice. His job is to keep the entire system working. Coincidentally, X contacted me over Slack, so I took the opportunity to apologize for something that sounded as aggression from my side. I was pleased to hear that X didn’t notice any hostility, so we were good.

What eventually happened and was the dispute avoidable?

I don’t know whether it was possible to achieve the same or a better result without the loud argument. I admit: I was angry when I wrote some of the things that I wrote. However, I wasn’t mad at X as a person. I was angry because I thought I knew what was best for the company, and someone interfered with my plans.

I assume that X was angry when he wrote some of the things he wrote. I also believe that he wasn’t angry at me as a person but because he knew what was best for the company, and someone tried to interfere with his plans.

I’m sure though that it was this argument that enabled us to define the main “pain” points for both sides of the dispute. As long as the dispute was about ideas, not personas, and as long as the dispute’s goal was for the sake of the common good, it was worth it. To my current and future colleagues: if you hear me arguing loudly, please know that this is a “dispute that is for the sake of Heaven [that] is destined to endure.”

Featured image: Source: http://mimiandeunice.com/; Bees image: Photo by Flickr user silangel, modified. Under the CC-BY-NC license.

The Keys to Effective Data Science Projects — Operationalize

Recently, I’ve stumbled upon an interesting series of posts about effective management of data science projects.  One of the posts in the series says:

 “Operationalization” – a term only a marketer could love. It really just means “people using your solution”.

The main claim of that post is that, at some point, bringing actual users to your data science project may be more important than improving the model. This is exactly what I meant in my “when good enough is good enough” post (also available on YouTube)

Why is it (almost) impossible to set deadlines for data science projects?

In many cases, attempts to set a deadline to a data science project result in a complete fiasco. Why is that? Why, in many software projects, managers can have a reasonable time estimate for the completion but in most data science projects they can’t? The key points to answer this question are complexity and, to a greater extent, missing information. By “complexity” I don’t (only) mean the computational complexity. By “missing information” I don’t mean dirty data. Let us take a look at these two factors, one by one.


Illustration: famous xkcd comic. Two programmers play during the compilation time
Think of this. Why most properly built bridges remain functional for decades and sometimes for centuries, while the rule in every non-trivial program is that “there is always another bug?”. I read this analogy in Joel Spolsky’s post written in 2001. The answer Joel provides is:

Once you’ve written a subroutine, you can call it as often as you want. This means that almost everything we do as software developers is something that has never been done before. This is very different than what construction workers do.

There was a substantial progress in the computer engineering theory since 2001 when Joel wrote its post. We have better static analysis tools, better coverage tools, and better standard practices. Nevertheless, bug-free software only exists in Programming 101 books.

What about data science projects? Aren’t they essentially a sort of software project? Yes, they are, and as such, the above quote is relevant for them too. However, we can add another statement:

Once you’ve collected data, you can process it as often as you want. This means that almost everything we do as data scientists is something that has never been done before.

You see, to account for project uncertainty, we need to multiply the number of uncertainty factors of a software project by the number of uncertainty factors associated with the data itself. The bottom line is an exponential complexity growth.

Missing information

Now, let’s talk about another, even bigger problem, the missing information. I’m not talking about “dirty data” — a situation where some values in the dataset are missing, input errors, or fields that change their meaning over the time. These are severe problems but not as tough as the problem I’m about to talk about in this post.

When a software engineer writes a plotting program, they know when it doesn’t work: the image is either created or not. And if the image isn’t created, the programmer knows that something wrong and has to be fixed. When a programmer writes a compression program, they know when they made a mistake: if the program does not compress a file, or if the result isn’t readable. The programmer knows that there must be a fixable bug in his or her code.

What about a data science project? Let’s say you’re starting an advertisement targetting project. The project manager gives you the information source and the performance metric. A successful model has to have a performance of 80 or more (the nature of the performance score isn’t important here). You start working. You clean your data, normalize it, build a nice decision tree, and get a score of 60, which is way too low. You explore your data, discover problems in it, retrain the tree and get 63. You talk to the team that collects the data, find more problems, build a random forest, train it and get a score of 66. You buy some computation time, create a deep learning network on AWS, train it for a week, and get 66 again.

Illustration: a blindfolded man wandering around

What do you do now? Is it possible that somewhere in your code there is a bug? It certainly is. Is it possible that you can improve the performance by deploying a better model? Probably. However, it is also possible that the data does not contain enough information. The problem, of course, is that you don’t know that. In practice, you hit your head against the wall until you get the results, or give up, or fired. And this is THE most significant problem with data science (and any research) project: your problem is a black box. You only know what you know, but you have no idea what you don’t. A research project is like exploring a forest with your eyes shut: when you hit a tree, you don’t know whether this is the last tree in the forest and you’re out, or you’re in the middle of a tropical jungle.

I hope that the theoretical data science research will narrow this gap. Meanwhile, the project managers will have to live with this great degree of uncertainty.


PS. As in any opinion post, I may be mistaken. If you think I am, please let me know in the comments section below.

The xckd image: https://xkcd.com/303/ under CC-nc. The wandering man image: Illustration: a blindfolded man wandering around. By Flickr user Molly under CC-by-nc-nd