• Career advice. A research pharmacist wants to become a data scientist.

    Career advice. A research pharmacist wants to become a data scientist.

    January 9, 2020

    Recently, I received an email from a pharmacist who is considering becoming a data scientist. Since this is neither the first (nor, probably, the last) such email that I receive, I think others will find this exchange interesting.

    Here’s the original email with minor edits, followed by my response.

    The question

    Hi Boris,

    My name is XXXXX, and I came across your information and your advice on data science as I was researching career opportunities.

    I currently work at a hospital as a research pharmacist, mainly involved in managing drugs for clinical trials.
    Initially, I wanted to become a clinical pharmacist and pursued a one-year post-graduate residency training. However, it was not something I could envision myself enjoying for the rest of my career.

    I then turned towards obtaining a Ph.D. in translational research, bridging benchwork research to the bedside, so that I could be at the forefront of clinical trial development and bring the outcomes of rigorous pre-clinical research to patients. I very much appreciate learning about all the meticulous work that goes into the development of Phase I clinical trials. However, a Ph.D. in pharmaceutical sciences was overkill for what I wanted to achieve in my career (in my opinion), and I ended up graduating with a master’s in pharmaceutical sciences.

    Since I wanted to be involved in both research and pharmacy areas in my career, I ended up where I am now, a research pharmacist.

    My main job description is not any different from that of typical hospital pharmacists. I do get to handle investigational medications, learn about new medications and clinical protocols, oversee side effects that may be a crucial barrier to marketing the trial medications, and sometimes participate in developing drug preparation and handling procedures for investigator-initiated trials. This keeps my job interesting and brings variety to what I do. However, I still feel that I am merely following guidelines to prepare medications, not thinking critically to make interventions or working with data to see the outcomes.

    At this point, I am preparing to find career opportunities in the pharmaceutical industry, where I will be more actively involved in clinical trial development, exchanging information about targeting diseases, and analyzing data. I believe that gaining knowledge and experience in the skills critical to the data science field would broaden my career opportunities and interests. Still, unfortunately, I only have a pharmacy background and little to no experience in computer science, bioinformatics, or machine learning.

    The answer

    First of all, thank you for asking me. I’m genuinely flattered. I assume that you found me through my blog posts, and if not, I suggest that you read at least the following posts

    All my thoughts on the career path of a data scientist appear on this page: https://gorelik.net/category/career-advice/

    Now, specifically to your questions.

    My path towards data science was a gradual evolution. Every new phase in my career built on my previous experience and knowledge: from B.Sc. studies in pharmacy to doctorate studies in computational drug design, from computational drug design to biomathematical modeling, from that to bioinformatics, and from that to cybersecurity. Of course, my path is not unique. I know at least three people who followed a similar career from pharmacy to data science. Maybe other people made different choices and are even more successful than I am.

    My first advice to everyone who wants to transition into data science is not to (see the first link in the list above). I was lucky to enter the field before it was a field, but today we live in the age of specialization. Today we have data analysts, data engineers, machine learning engineers, NLP scientists, image processing specialists, etc.

    If computational modeling is something that a person likes and sees themselves doing for a living, I suggest pursuing a related advanced degree with a project that involves massive modeling efforts. Examples of such degrees for a pharmacist are computational chemistry, pharmacoepidemiology, pharmacovigilance, and bioinformatics. This way, one can use the knowledge one already has to expand one’s expertise, build a reputation, and gain new knowledge. If staying in academia is not an option, consider taking on a relevant real-life project. For example, if you work in a hospital, you could try identifying patterns in antibiotics usage, a correlation between demographics and hospital re-admission, … you get the idea.

    Whatever you do, you will not be able to work as a data scientist if you can’t write computer programs. Modifying tutorial scripts is not enough; knowing how to feed data into models is not enough.

    My most significant knowledge gap is in maths. If you do go back to academia, I strongly suggest taking advantage of the opportunity to take several math classes: at least calculus and linear algebra and, of course, statistics.

    Do you have a question for me?

    If you have questions, feel free to write them here, in the comments section, or to boris@gorelik.net

    January 9, 2020 - 4 minute read -
    advice data science data science careers pharmacist blog Career advice
  • Athens, Greece

    Athens, Greece

    January 8, 2020

    January 8, 2020 - 1 minute read -
    athens graffiti greece photo blog
  • New year, new notebook

    New year, new notebook

    January 1, 2020

    On November 7, 2016, I started an experiment in personal productivity. I decided to use a notebook for thirty days to manage all of my tasks. The thirty days ended more than three years ago, and I still use notebooks to manage myself. Today, I started the thirteenth notebook.

    Read about my time management system here.

    January 1, 2020 - 1 minute read -
    procrastination productivity blog Productivity & Procrastination
  • Don't we all like a good contradiction?

    Don't we all like a good contradiction?

    December 31, 2019

    I am a huge fan of Gerd Gigerenzer who preaches numeracy and uncertainty education. One of Prof. Gigerenzer’s pivotal theses is “Fast and Frugal Heuristics” which is also popularized in his book “Gut Feelings” (listen to this podcast if you don’t want to read the book). I like this approach.

    Today, I listened to the latest episode of the Brainfluence podcast, which hosted the psychologist Dr. Gleb Tsipursky, who wrote an extensive book called “Never Trust your Gut” with a seemingly contradictory thesis. I added this book to my TOREAD list.

    December 31, 2019 - 1 minute read -
    book contradiction gigerenzer intuition uncertainty blog
  • Staying employable and relevant as a data scientist

    Staying employable and relevant as a data scientist

    December 23, 2019

    One piece of common wisdom is that creative jobs are immune to becoming irrelevant. This is what Brian Solis, the author of “Lifescale,” says on this matter:

    On the positive side, historically, with every technological advancement, new jobs are created. Incredible opportunity opens up for individuals to learn new skills and create in new ways. It is your mindset, the new in-demand skills you learn, and your creativity that will assure you a bright future in the age of automation. This is not just my opinion. A thoughtful article in Harvard Business Review by Joseph Pistrui was titled, “The Future of Human Work Is Imagination, Creativity, and Strategy.” He cites research by McKinsey […]. In their research, they discovered that the more technical the work, the more replaceable it is by technology. However, work that requires imagination, creative thinking, analysis, and strategic thinking is not only more difficult to automate; it is those capabilities that are needed to guide and govern the machines.

    Many people think that data science falls into the category of “creative thinking and analysis”. However, as time passes, this becomes less true. Here’s why.

    As time passes, tools become stronger, smarter, and faster. This means that a problem that once required cutting-edge algorithms run by cutting-edge scientists on cutting-edge computers will be solvable using a commodity product. “All you have to do” is apply domain knowledge, select a “good enough” tool, get the results, and act upon them. You’ll notice that I put two phrases in quotation marks. First, “all you have to do”. I know that it’s not as simple as “just add water”, but it gets simpler.

    “Good enough” is also a tricky part. Selecting the right algorithm for a problem has a dramatic effect on tough cases but is less important with easy ones. Think of a sorting algorithm. I remember how my algorithms professor used to talk about how important it was to match the right sorting algorithm to the right problem. That was almost twenty years ago. Today, I simply write list.sort() and I’m done. Maybe one day I will have to sort billions of data points in less than a second on a tiny CPU with little RAM, which will force me into developing a specialized solution. But in 99.999% of cases, list.sort() is enough.
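
    To make the point concrete, here is a minimal sketch (the random data and timing are mine, for illustration; the exact numbers will vary by machine):

```python
import random
import time

# One million random floats: a "tough case" twenty years ago.
data = [random.random() for _ in range(1_000_000)]

start = time.perf_counter()
data.sort()  # Timsort: adaptive, stable, O(n log n) in the worst case
elapsed = time.perf_counter() - start

# On commodity hardware, this finishes in a fraction of a second,
# with no algorithm selection on our part.
print(f"sorted {len(data):,} numbers in {elapsed:.3f} seconds")
```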

    Back to data science. I think that in the near future, we will see more and more analogs of list.sort(). What does that mean for us, data scientists? I am not sure. What I am sure of is that in order to stay relevant, we have to learn and evolve.

    Featured image by Héctor López on Unsplash

    December 23, 2019 - 2 minute read -
    creativity data science data science careers development employability blog Career advice
  • Is security through obscurity back?

    Is security through obscurity back?

    December 15, 2019

    HBR published an opinion post by Andrew Burt, called “The AI Transparency Paradox.” This post talks about the problems that were created by tools that open up the “black box” of a machine learning model.

    “Black box” refers to the situation where one can’t explain why a machine learning model predicted whatever it predicted. Explainability is not only important when one wants to improve the model or to pinpoint mistakes; it is also an essential feature in many fields. For example, when I was developing a cancer detection model, every physician requested to know why we thought a particular patient had cancer. That is why I’m happy that so many people develop tools that allow peeking into the black box.

    I was very surprised to read the “transparency paradox” post. Not because I couldn’t imagine that people would use the insights to hack the models. I was surprised because the post reads like a case for security through obscurity, an ancient practice that was mostly eradicated from the mainstream.

    Yes, ML transparency opens opportunities for hacking and abuse. However, this is EXACTLY the reason why such openness is needed. Hacking attempts will not disappear if transparency is removed; they will just be harder to defend against.

    December 15, 2019 - 1 minute read -
    blackbox hbr machine learning opinion transparency blog
  • I will speak at the NDR conference in Bucharest

    I will speak at the NDR conference in Bucharest

    December 11, 2019

    NDR is a family of machine learning conferences in Romania. Last year, I attended the Iași edition of that conference, gave a data visualization talk, and enjoyed every moment. All the lectures (including mine, obviously) were interesting and relevant. That is why, when Vlad Iliescu, one of the NDR organizers, asked me whether I wanted to talk in Bucharest at NDR 2020, I didn’t think twice.

    Since the organizers haven’t published the talk topics yet, I will not ruin the surprise for you, but I promise to be interesting and relevant. I definitely think that NDR is worth the trip to Bucharest for many data practitioners, even the ones who don’t live in Romania. Visit the conference site to register.

    December 11, 2019 - 1 minute read -
    bucharest conference romania speaking blog
  • Book review. A Short History of Nearly Everything by Bill Bryson

    Book review. A Short History of Nearly Everything by Bill Bryson

    December 2, 2019

    TL;DR: a nice popular science book that covers many aspects of modern science

    A Short History of Nearly Everything by Bill Bryson is a popular science book. I didn’t learn anything fundamental out of this book, but it was worth reading. I was particularly impressed by the intrigues, lies, and manipulations behind so many scientific discoveries and discoverers.

    The main “selling point” of this book is that it answers the question, “how do scientists know what they know?” How, for example, do we know the age of Earth or the skin color of the dinosaurs? The author indeed provides some insight. However, because the book tries to talk about “nearly everything,” the answer isn’t focused enough. Simon Singh’s book “Big Bang” concentrates on cosmology and provides a better insight into the question of “how do we know what we know.”

    Interesting takeaways and highlights

    • On the problem that our Universe is unlikely to have been created by chance: “Although the creation of the Universe is very unlikely, nobody knows about the failed attempts.”
    • The Universe is unlimited but finite (think of a circle)
    • Developments in chemistry were the driving force of the industrial revolution. Nevertheless, chemistry wasn’t recognized as a scientific field in its own right for several decades

    The bottom line: Read if you have time 3.5/5.

    December 2, 2019 - 1 minute read -
    book review popular-science blog
  • Cow shit, virtual patient, big data, and the future of the human species

    Cow shit, virtual patient, big data, and the future of the human species

    November 28, 2019

    Yesterday, a new episode was published in the Popcorn podcast, where the host, Lior Frenkel, interviewed me. Everyone who knows me knows how much I love talking about myself and what I do. I definitely used this opportunity to talk about the world of data. Some people who listened to this episode told me that they enjoyed it a lot. If you know Hebrew, I recommend that you listen to this episode

    https://soundcloud.com/hamutsi/142-boris-gorelik

    November 28, 2019 - 1 minute read -
    data science interview me podcast speaking blog
  • Data visualization as an engineering task - a methodological approach towards creating effective data visualization

    Data visualization as an engineering task - a methodological approach towards creating effective data visualization

    November 20, 2019

    In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

    November 20, 2019 - 1 minute read -
    bucharest conference data visualisation Data Visualization dataviz iasi public speaking romania speaking video blog
  • A tangible productivity tool (and a book review)

    A tangible productivity tool (and a book review)

    November 11, 2019
    One month ago, I stumbled upon a book called “[Personal Kanban: Mapping Work Navigating Life](https://amzn.to/33DM4l4)” by Jim Benson (all the book links use my affiliate code). Never before have I seen a bigger discrepancy between the value that a book gave me and its actual content.

    Even before finishing the first chapter of this book, I realized that I wanted to incorporate “personal kanban” into my productivity system. The problem is that the entire book could be summarized in a blog post or a YouTube video (such as this one). The rest of the book contains endless repetitions and praises. I recommend not reading this book, even though it strongly affected the way I work.

    So, what is Personal Kanban anyhow? Kanban is a productivity approach that puts all of a person’s tasks in front of them on a board. Usually, Kanban boards are physical whiteboards with post-it notes, but software Kanban boards are also widely used (Trello is one of them). Following are the claims Jim Benson makes in his book that resonated with me.

    • Many productivity approaches view personal and professional life separately. The reality is that these two aspects of our lives are not separate at all. Therefore, a productivity method needs to combine them.
    • Having all the critical tasks in front of your eyes helps to get the global picture. It also helps to group the tasks according to their contexts.
    • The act of moving notes from one place to another gives valuable tangible feedback. This feedback has many psychological benefits.
    • One should limit the number of work-in-progress tasks.
    • There are three different types of “productivity.” You are Productive when you work hard. You are Efficient when your work is actually getting done. Finally, you are Effective when you do the right job at the right time, and can repeat this process if needed.

    I’m a long-time user of a productivity method that I adopted from Mark Forster. You may read about my process here. Having read Personal Kanban, I decided to combine it with my approach. According to the plan, I keep the more significant tasks on my Kanban board, which I use to make daily, weekly, and long-term plans. For the day-to-day (and hour-to-hour) tasks, I still use my notebooks.

    Initially, I used my whiteboard for this purpose, but something wasn’t right about it.

    Having my Kanban on my home office whiteboard had two significant drawbacks. First, the whiteboard isn’t with me all the time, and what is the point of putting your tasks on a board if you can’t see it? Secondly, listing everything on a whiteboard has some privacy issues. After some thought, I decided to migrate the Kanban to my notebook.

    In this notebook, I have two spreads. The first spread is used for the backlog and “this week” tasks. The second spread has the “today,” “doing,” “wait,” and “done” columns. The fact that the notebook is smaller than the whiteboard turned out to be a useful feature: the physical limitation caps the number of tasks I put on my “today” and “doing” lists.

    I organize the tasks at the beginning of my working day. The rest of the system remains unchanged. After more than a month, I’m happy with this new tangible productivity method.

    November 11, 2019 - 3 minute read -
    book review kanban procrastination productivity blog Productivity & Procrastination
  • Knowledge Graphs & NLP @ EMNLP

    Knowledge Graphs & NLP @ EMNLP

    November 10, 2019

    I stumbled upon a very detailed and useful summary of a recent conference on empirical methods in natural language processing. I have to say, Michael Galkin, the author of this review, did an excellent job. His blog, https://medium.com/@mgalkin, is worth following.

    https://medium.com/@mgalkin/knowledge-graphs-nlp-emnlp-2019-part-i-e4e69fd7957c

    November 10, 2019 - 1 minute read -
    repost blog
  • Data science tools with a graphical user interface

    Data science tools with a graphical user interface

    November 5, 2019

    A Quora user asked about data science tools with a graphical user interface. Here’s my answer. I should mention, though, that I don’t usually use a GUI for data science. Not that I think GUIs are bad; I simply couldn’t find a tool that works well for me.

    Of the many tools that exist, the one I like most is Orange (https://orange.biolab.si/). Orange allows the user to create data pipelines for exploration, visualization, and production, but also allows editing the “raw” Python code. The combination of these features makes it a powerful and flexible tool.

    The major drawback of Orange (in my opinion) is that it uses its own data format and its own set of models that are not 100% compatible with the Numpy/Pandas/Sklearn ecosystem.

    I have made a modest contribution to Orange by adding a six-line function that computes the Matthews correlation coefficient.
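
    For the curious, the computation does fit in a few lines. This is not the code I contributed to Orange, just an illustrative sketch of the same coefficient computed from confusion-matrix counts:

```python
from math import sqrt

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:  # degenerate case: an empty row or column in the matrix
        return 0.0
    return (tp * tn - fp * fn) / denom

# A reasonably strong classifier scores close to +1:
print(matthews_corrcoef(tp=90, tn=85, fp=15, fn=10))
```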

    Other tools are KNIME and Weka (neither of them is natively Python).

    There is also RapidMiner, but I have never used it.

    November 5, 2019 - 1 minute read -
    data science gui knime orange tools weka blog
  • Working in a distributed company. Communication styles

    Working in a distributed company. Communication styles

    October 30, 2019

    I work at Automattic, one of the largest distributed companies in the world. Working in a distributed company means that everybody in this company works remotely. There are currently about one thousand people working in this company from about seventy countries. As you might expect, the international nature of the company poses a communication challenge. Recently, I had a fun experience that demonstrates how different people are.

    Remote work means that we use text as our primary communication tool. Moreover, since the company spans all the time zones of the world, we mostly use asynchronous communication, which takes the form of posts on internal blogs. A couple of weeks ago, I completed a lengthy analysis and summarized it in a post that was meant to be read by the majority of the company. Being a responsible professional, I asked several people to review the draft of my report.

    To my embarrassment, I discovered that I made a typo in the report title, and not just a typo: I misspelled the company name :-(. A couple of minutes after asking for a review, two of my coworkers pinged me on Slack and told me about the typo. One message was, “There is a typo in the title.” Short, simple, and concise.

    The second message was much longer.

    Do you want to guess what the difference between the two coworkers is? . . . . . Here’s the answer . . . . The author of the first (short) message grew up and lives in Germany. The author of the second message is American. Germany, United States, and Israel (where I am from) have very different cultural codes. Being an Israeli, I tend to communicate in a more direct and less “sweetened” way. For me, the American communication style sounds a little bit “artificial,” even though I don’t doubt the sincerity of this particular American coworker. I think that the opposite situation is even more problematic. It happened several times: I made a remark that, in my opinion, was neutral and well-intended, and later I heard comments about how I sounded too aggressive. Interestingly, all the commenters were Americans.

    To sum up: people from different cultural backgrounds have different communication styles. In theory, we all know that these differences exist. In practice, we are usually unaware of them.

    Featured photo by Stock Photography on Unsplash

    October 30, 2019 - 2 minute read -
    communication-style distributed work remote working working-remotely blog
  • Sometimes, you don't really need a legend

    Sometimes, you don't really need a legend

    October 28, 2019

    This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

    This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python

    import matplotlib.pyplot as plt  # gdp is a DataFrame with a Year column and one column per country

    plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
    plt.plot(gdp.Year, gdp.France, '-', label='France')
    plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
    plt.legend()
    

    The last line in the code above does a bit of magic and adds a nice legend.

    [Figure: the three-line chart with the legend added by matplotlib]

    In Excel, we don’t even need to do anything; the legend is added for us automatically.

    [Figure: the same chart produced in Excel, with its default legend]

    So, what is the problem?

    What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following: we either jump from the graph to the legend dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Thinking, Fast and Slow” to learn more about how our brains work).

    What would be the shortcut here? Note how the line for Israel lies mostly below the line for Italy, which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the order of the lines isn’t conserved between these two pieces of information. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

    And if we have more lines in the graph, the situation is even worse.

    [Figure: a chart with five lines, where the line order doesn’t match the legend order]

    Can we improve the graph?

    Yes we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands.

    plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
    plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
    plt.plot(gdp.Year, gdp.France, '-', label='France')
    plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
    plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
    plt.legend()
    

    [Figure: the five-line chart with the plotting order matching the legend order]

    We still have to work hard but at least we can trust our brain’s shortcut.

    If we have more time

    If we have some more time, we may get rid of the (classical) legend altogether.

    import matplotlib.pyplot as plt
    import seaborn

    countries = [c for c in gdp.columns if c != 'Year']
    fig, ax = plt.subplots()
    for i, c in enumerate(countries):
        ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
        x = gdp.Year.max()          # rightmost point of the line
        y = gdp[c].iloc[-1]         # the line's last value
        ax.text(x, y, c, color=f'C{i}', va='center')  # label the line directly
    seaborn.despine(ax=ax)
    

    (if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

    [Figure: the chart with each line labeled directly at its right end, no legend]

    Isn’t it better? Now the viewer doesn’t need to zap between the lines and the legend; we show all the information in the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome?

    [Figure: the polished version of the directly labeled chart]

    This graph is much easier to digest than the first one, and it also provides more useful information.


    [Figure: a case where the direct labels overlap and create a mess]

    I agree that this is a mess. Life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look at what he did in a similar situation.

    [Figure: Randy Olson’s directly labeled chart of the percentage of bachelor’s degrees earned by women in the USA]

    I also recommend reading my older post where I compared graph legends to muttonchops.

    In conclusion

    Sometimes, no legend is better than a legend.

    This post, in Hebrew: [link]

    October 28, 2019 - 3 minute read -
    because you can data visualisation data-visualization dataviz legend blog Data Visualization
  • What do we see when we look at slices of a pie chart?

    What do we see when we look at slices of a pie chart?

    October 21, 2019

    What do we see when we look at slices of a pie chart? Angles? Areas? Arc lengths? The answer to this question isn’t clear, and thus “experts” recommend avoiding pie charts altogether.

    Robert Kosara, a Senior Research Scientist at Tableau Software, is very active in studying pie charts (you should follow his blog, https://eagereyes.org). In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies.

    Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion

    While this study suggests that the charts are read by area, it is not conclusive. In particular, the possibility of pie chart users re-projecting the chart to read them cannot be ruled out. Further experiments are therefore needed to zero in on the exact mechanism by which this common chart type is read.

    Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

    Kosara’s previous studies had strong practical implications, the most important being that pie charts are not evil, provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answers to the questions at the beginning of this post are still unclear. Maybe the “real answer” to these questions is “a combination thereof.”

    October 21, 2019 - 2 minute read -
    data visualisation Data Visualization dataviz kosara pie-chart research blog
  • The problem with citation count as an impact metric

    The problem with citation count as an impact metric

    October 18, 2019

    Inspired by A citation is not a citation is not a citation by Lior Pachter, this rant is about metrics.

    Lior Pachter is a researcher at Caltech. Like many other researchers in academia, Dr. Pachter is measured by, among other things, publications and their impact as measured by citations. In his post, Lior Pachter criticised both the current impact metrics and their effect on citation patterns in the academic community.

    The problem he points out: citations don’t really measure “actual” citations. Most of the apparent citations are “hit and run citations,” i.e., people mention other people’s research without taking anything from that research.

    In fact this author has cited [a certain] work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.

    via A citation is not a citation is not a citation — Bits of DNA

    I think that the biggest problem with citation counts is that it costs nothing to cite a paper. When you add a paper (or a post, for that matter) to your reference list, you know that most probably nobody will check whether you actually read it, that nobody will check whether you represented that publication correctly, and that the chances are super (SUUPER) low that anybody will check whether your conclusions are right. All it takes is a click of a button.

    October 18, 2019 - 2 minute read -
    barabasi impact blog
  • Book review. The War of Art by S. Pressfield

    Book review. The War of Art by S. Pressfield

    October 10, 2019

    TL;DR: This is a long motivational book that is “too spiritual” for the cynical materialist that I am.


    The War of Art is a strange book. I read it because “everybody” recommended it. This is what Derek Sivers’ book recommendation page says about this book

    Have you experienced a vision of the person you might become, the work you could accomplish, the realized being you were meant to be? Are you a writer who doesn’t write, a painter who doesn’t paint, an entrepreneur who never starts a venture? Then you know what “Resistance” is.
    

    As a known procrastinator, I was intrigued and started reading. In the beginning, the book was pretty promising. The first (and, I think, the biggest) part of the book is about “Resistance,” the force behind procrastination. I immediately noticed that almost every sentence in this chapter could serve as a motivational poster. For example:

    • It’s not the writing part that’s hard. What’s hard is sitting down to write.
    • The danger is greatest when the finish line is in sight.
    • The most pernicious aspect of procrastination is that it can become a habit.
    • The more scared we are of a work or calling, the more sure we can be that we have to do it.

    Individually, each sentence makes sense, but their concentration was a bit too much for me. The way Pressfield talks about Resistance resembles the way Jewish preachers talk about Yetzer Hara: it sits everywhere, waiting for you to fail. I don’t like this approach.

    The next chapters were even harder for me to digest. Pressfield started talking about Muses, gods, prayers, and other “spiritual” stuff; I almost gave up. But I fought the Resistance and finished the book.

    My main takeaways:

    • Resistance is real
    • It’s a problem
    • The more critical the task is, the stronger is the Resistance. OK, I kind of agree with this. But then Pressfield continues to something I do not agree with: thus (according to the author), we can measure the importance of a task by the Resistance it creates.
    • Justifying not pursuing a task by commitments to the family, job, etc. is a form of Resistance.
    • The Pro does stuff.
    • The Artist is a Pro (see above) who does stuff even if nobody cares.
    October 10, 2019 - 2 minute read -
    book review pressfield procrastination resistance the-war-of-art blog
  • Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz - Stats - Bayes

    Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz - Stats - Bayes

    October 8, 2019

    On Sunday, I wrote about bootstrapping. On Monday, I wrote about visualizing uncertainty. Let’s now talk about bootstrapping and uncertainty visualization.

    Robert Grant is a data visualization expert who wrote a book about interactive data visualization (which I should read, BTW).

    Robert runs an interesting blog from which I learned another approach to uncertainty visualization, bootstrapping.
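    To make the idea concrete, here is a minimal sketch of my own (not code from Robert’s post, and the data is made up): resample the data with replacement, recompute the statistic for each resample, and let the spread of the recomputed values show the uncertainty.

    ```python
    # Bootstrap-based uncertainty: instead of a single error bar, draw many
    # re-computed statistics and let their spread show the uncertainty.
    import numpy as np

    rng = np.random.default_rng(42)
    sample = rng.normal(loc=100, scale=15, size=50)  # hypothetical measurements

    # Resample the data with replacement many times and recompute the mean.
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(1000)
    ])

    # Plotting each bootstrapped mean as a faint dot (or line) shows the
    # uncertainty directly; a 95% percentile interval summarizes it.
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"bootstrap 95% interval: [{low:.1f}, {high:.1f}]")
    ```

    Drawing all one thousand `boot_means` values as faint marks, rather than collapsing them into one error bar, is exactly the kind of bootstrap-based uncertainty display the post describes.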

    Source: Robert Grant.

    Read the entire post: Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz - Stats - Bayes

    October 8, 2019 - 1 minute read -
    bootstrapping data visualisation Data Visualization dataviz repost uncertainty blog
  • On MOOCs

    On MOOCs

    October 7, 2019

    When Massive Online Open Courses (a.k.a. MOOCs) emerged some X years ago, I was ecstatic. I was sure that MOOCs were the Big Boom of higher education. Unfortunately, the MOOC impact turned out to be very modest. This modest impact, combined with the high production cost, was one of the reasons I quit making my online course after producing two or three lectures. Nevertheless, I don’t think MOOCs are dead yet. Following are some links I recently read that provide interesting insights into MOOC production and consumption.

    • A systematic study of academic engagement in MOOCs that is scheduled for publication in the November issue of Erudit.org. This 20+ page-long survey summarizes everything we know about MOOCs today (I have to admit, I only skimmed through this paper, I didn’t read all of it)
    • A Science Magazine article from January 2019. The article, “The MOOC pivot,” sheds light on the very low retention numbers in MOOCs.

    • On MOOCs and video lectures. Prof. Lorena Barba from George Washington University explains why her MOOCs are not built on video. If you consider creating an online class, you should read this.
    • The economic consequences of MOOCs. A concise summary of a 2018 study that suggests that MOOCs’ economic impact is high despite the high churn rates.
    • Thinkful.com, an online platform that provides personalized training to aspiring data professionals, got in the news three weeks ago after being purchased for $80 million. Thinkful isn’t a MOOC per se, but I have a special relationship with it: a couple of years ago, I was accepted as a mentor at Thinkful but couldn’t find time to actually mentor anyone.

    The bottom line

    We still don’t know what this future will look like and how MOOCs will interplay with the legacy education system, but I’m sure that MOOCs are the future.

    October 7, 2019 - 2 minute read -
    education future mooc thinkful blog Career advice
  • Error bars in bar charts. You probably shouldn't

    Error bars in bar charts. You probably shouldn't

    October 7, 2019

    This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

    It started with a paper by Prof. Gerd Gigerenzer, whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks,” contained a simple graph that was meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph are meant to convey uncertainty. However, the visualization that Gigerenzer and his team selected is simply wrong.

    First of all, look at the leftmost bar; it demonstrates so many problems with error bars in general, and with error bars in bar plots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in a negative percentage of correct inferences?

    The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated Risks” from cover to cover. Twice.

    Why is this important?

    Communicating uncertainty is super important. Take a look at this 2018 study with the self-explanatory title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

    Source: HuffPost

    Also, recall the surprise when Donald Trump won the 2016 presidential election despite the fact that most of the polls predicted that Hillary Clinton had higher chances to win. Nobody cared about uncertainty; everyone saw the graphs!

    Why not error bars?

    Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

    First of all, error bars tend to be symmetric (although they don’t have to be), which might lead to the situation that we saw in the first example above: implying illegal values.
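    Here is a small sketch of this problem (my own made-up numbers, not Gigerenzer’s data): a symmetric ±1.96·SE bar around a small proportion dips below zero, while an interval designed for proportions, such as the Wilson score interval, stays within the legal [0, 1] range.

    ```python
    # Symmetric error bars around a small proportion can imply illegal
    # (negative) values; the Wilson score interval cannot.
    import math

    def wilson_interval(successes, n, z=1.96):
        """95% Wilson score interval for a proportion; stays within [0, 1]."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    n, successes = 20, 1          # say, 5% correct inferences in a small task
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)

    naive = (p - 1.96 * se, p + 1.96 * se)   # symmetric bar: dips below zero
    wilson = wilson_interval(successes, n)   # asymmetric: stays legal

    print(f"naive:  [{naive[0]:.3f}, {naive[1]:.3f}]")
    print(f"wilson: [{wilson[0]:.3f}, {wilson[1]:.3f}]")
    ```

    The naive lower bound is negative, exactly the error-bar-below-the-axis artifact in the leftmost bar of Gigerenzer’s graph.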

    Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.


    Specifically for bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

    The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

    Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is drawn inside the physical bar. See?

    But that’s not all. The width of the error bars separates the error lines and makes the comparison even harder. Compare the readability of the error lines in the two examples below.

    The proximity of the error lines in the second example (taken from this site) makes the comparison easier.

    Are there better alternatives?

    Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why, but also surveys several alternatives.

    Nathan Yau from flowingdata.com has an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts, and more.

    Claus Wilke’s book “Fundamentals of Data Visualization” has a chapter dedicated to uncertainty, with an even more detailed review [link].

    “Visualize uncertainty about the future” is a Science article that deals specifically with forecasts.

    Robert Kosara from Tableau experimented with visualizing uncertainty in parallel coordinates.

    There are many more examples and experiments, but I think that I will stop right now.

    The bottom line

    Communicating uncertainty is important.

    Know your tools.

    Try avoiding error bars.

    Error bars and bar charts don’t combine well; therefore, try harder to avoid error bars in bar charts.

    October 7, 2019 - 3 minute read -
    because you can data visualisation Data Visualization dataviz gigerenzer uncertainty blog
  • You don't need a fast way to increase your reading speed by 25%. Or, don't suppress subvocalization

    You don't need a fast way to increase your reading speed by 25%. Or, don't suppress subvocalization

    October 6, 2019

    Not long ago, I wrote a post about a fast hack that increased my reading speed by tracking the reading with a finger. I think that the logic behind using a tracking finger is to suppress subvocalization. I noticed that, at least in my case, suppressing subvocalization reduces the fun of reading. I actually enjoy hearing the inner voice that reads the book “with me”.

    October 6, 2019 - 1 minute read -
    reading reading-speed blog
  • Bootstrapping the right way?

    Bootstrapping the right way?

    October 6, 2019

    Many years ago, I terribly overfit a model, which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied bootstrapping in the wrong way. My particular problem was that I “forgot” about confounding parameters and that peeking into the future is a bad thing.
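    For illustration only (this is my reconstruction of the mistake, not the actual model or the content of Yanir’s talk): a naive bootstrap over time-ordered data happily trains on rows that come after the test rows, while a walk-forward split cannot peek into the future.

    ```python
    # Naive bootstrap on time-ordered data leaks the future into training;
    # a walk-forward split (fit on the past, test on the future) does not.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    t = np.arange(n)  # time index of each observation

    # Naive bootstrap: sample rows with replacement, ignoring time.
    train = rng.choice(n, size=n, replace=True)
    test = np.setdiff1d(t, train)  # out-of-bag rows used for evaluation
    # Some training rows come later than some test rows -> leakage.
    leaks = any(tr > te for tr in train for te in test)

    # Time-aware alternative: always fit on the past, test on the future.
    split = int(n * 0.8)
    train_tw, test_tw = t[:split], t[split:]
    leaks_tw = train_tw.max() > test_tw.min()

    print(f"naive bootstrap leaks future data: {leaks}")
    print(f"walk-forward split leaks future data: {leaks_tw}")
    ```

    The same trap applies to any resampling scheme that ignores the structure of the data, which is the kind of subtlety the talk covers.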

    Anyhow, Yanir Seroussi, a fellow data scientist and my coworker, gave a very good talk on bootstrapping.

    October 6, 2019 - 1 minute read -
    bootstrapping data science overfitting reblog blog
  • What do I look like?

    What do I look like?

    October 3, 2019

    From time to time, people (mostly conference organizers) ask me for a picture. Feel free to use any of these images:

    • Me in front of a whiteboard, pointing at a graph
    • Me in front of a screen that shows a bar chart
    • Me speaking on a stage
    October 3, 2019 - 1 minute read -
    me photo blog
  • Visualizations with perceptual free-rides

    Visualizations with perceptual free-rides

    October 2, 2019

    Dr. Richard Brath is a data visualization expert who also blogs from time to time. Each post in Richard’s blog provides a deep, and often unexpected to me, insight into one dataviz aspect or another.

    October 2, 2019 - 1 minute read -
    bar plot data visualisation Data Visualization dataviz reblog richard-brath blog