A super-important read on the COVID-19 situation. I'm finally convinced

Until now I was very sceptical about the COVID-19 measures taken by many the governments around the world, especially the Israeli one. Today, finally, I read a post that addressed the three issues I was pointing to:

  1. This first lockdown will last for months, which seems unacceptable for many people.
  2. A months-long lockdown would destroy the economy.
  3. It wouldn’t even solve the problem, because we would be just postponing the epidemic: later on, once we release the social distancing measures, people will still get infected in the millions and die.
  4. My biggest concern: Either a lot of people die soon and we don’t hurt the economy today, or we hurt the economy today, just to postpone the deaths.

There’s no point rephrasing here the original post, just go and read it. I’m convinced. Thank you, Tomas Pueyo

Go and read. The image is clickable

The single most important thing about remove 1:1 meetings

The COVID-19 lockdown forced many organizations to a remote work mode. Recently, I spoke with three managers from three “conventional” companies and all the three told me how surprisingly efficient their 1:1 meetings became. This is how one of them described the situation “I prepare the agenda, we log in, boom, boom, boom, and we are done”.

The effectiveness of distributed work doesn’t surprise me, after all, I have been working in a distributed mode for about six years now. However, this super-efficiency has its own problems that one needs to know. Here’s the thing. We, humans, are social creatures. We depend on social interactions for our mental and physical well being. When people share the same physical office, they have enough social interactions “in-between” — in the hallway, next to the watercooler and in the parking lot. However, working in a distributed team creates isolation. That is why it is very important to start and end every meeting with a personal conversation. It is also important to make sure that the meeting feels as personal as possible. To do so, place the chat window below the camera, so that the person feels as if you are looking at them. During the conversation, resist the urge to check emails, read your Facebook feed or check my blog. Make the personal meeting personal, even if it’s remote.

I have been working in distributed teams for about six years. If you need advice on how to make the transition easier for your organization, I’ll be glad to give one (or two).

An interesting solution of the data giraffe problem

Photo by Pixabay on Pexels.com

A data giraffe is a situation where a very prominent data point shades everything else. I learned this term from a post by Pini Yakuel and immediately liked it a lot.

Taken from https://www.optimove.com/blog/beware-the-giraffes-in-your-data
Taken from https://www.optimove.com/blog/beware-the-giraffes-in-your-data

Dealing with data giraffes is hard, especially when dealing with bar charts. Today I saw one interesting approach to this problem

Katherine S. Rowell is a co-funder of a Boston firm that specializes in data visualization. In December, she published a post dedicated to one of the most popular but also most abused graph types, the bar charts. One of the examples in her post demonstrates a nice treatment of data giraffes

http://ksrowell.com/blog-visualizing-data/2019/12/18/bar-humbug/

In this example, Katherine draws the graph twice. The zoomed-out version shows the giraffes in all their glory, while the zoomed-in one gives the spotlight to the foxes, hyenas, and mice.
Also, note how these graphs respect the rules that every bar chart has to include the zero.

Another piece of career advice

Here’s another email that I got with the question about switching to the data science career

Hello, my name is X. I saw your blog, and to be honest, I said, “Wow, is this me :)” I’m a pharmacist 5th-grade student currently working on a project in computational drug design. I started programming, and I loved it. After that, I heard the term “Data Science” and started to do some research […]

Basically, I loved being on a computer and solving problems its a good career option for me (at least for now, you can’t predict future) my mom has a pharmacy I worked there (internship), and it is not for me (i am counting the time when I’m in a pharmacy.) so I have a few questions for you

I don’t have any degree in statistics or CS or something equivalent I am determined to learn these topics, but some people want to see the degree, and probably no one accept a pharmacist to a master degree in statistics (I also wish to do my Ms in computational drug design because, in the end, I don’t want to be a data scientist in social sciences or economics, at least for now, I want to use that knowledge in my field which is drugs and pharmaceuticals)

Ph.D. on Bioinformatics would help ? or Biostatistics ( is it easier for us to be accepted in biostatistics rather than statistics? To be honest, I don’t know the difference much, I took a biostatistics class, but it was just one semester and probably not enough for Ph.D. :))

Do I really need a degree in CS or statistics to be a pharmaceutical data scientist? I want to do my Ph.D. but also want to be realistic, it sounds amazing doing online masters in statistics while you are doing Computational drug design or Bİoinformatics Ph.D., but it is very hard and frustrating and also decrease your productivity in both fields.

I asked a lot of questions, sorry, but I have many :). You can reply when you have time. Thank you, and I loved your blog. I read and watched tons of things, but yours was the best suited for me because being a pharmacist, computational drug design, considering bioinformatics, it is all fits. By the way, I also considering cybersecurity (not working in a company but learning). I see that as a “martial arts of the future,” maybe I am wrong, but a person should know it to protect him/her self. Thank you again 🙂

Indeed, X’s background sounds very much like mine.
I’m not sure I have too much to add to what I already wrote here, in this blog. The only thing that I have to say is that in my biased opinion, a Ph.D. is something worth pursuing. The more time passes by, the more Ph.Ds there are, and the lack of a degree might be a problem in the future job market. On the other hand, there are many smart and rich people who claim that university degrees are a waste of time. Go figure 🙂

I hope that this helps.

No signs (yet?) of the COVID-19 pandemic on StackOverflow job postings

I suppose that you knot that THE software developement Q&A site has its own job board. I suspected that the Corona pandemic would lead to a sharp decrease in the number of job postings on that board. I scraped the data, and it looks like for now, there are no drastic changes in the amount of postings published in the last couple of days.

The cardiovascular safety of antiobesity drugs—analysis of signals in the FDA Adverse Event Report System Database

I am glad and proud to announce that a paper which I helped to prepare and publish is available on the Nature’s group site.

The paper, The cardiovascular safety of antiobesity drugs—analysis of signals in the FDA Adverse Event Report System Database, by Einat Gorelik et al. (including myself) analyzes the data in the FDA Adverse Event Reporting System (FAERS). In this study, we found interesting and relevant safety information about the long-term safety of the antiobesity drug Lorcaserin. Due to the interdisciplinary nature of the paper, the review process took about a year. Interestingly enough, the FDA requested the withdrawal of Lorcaserin due to long-term safety issues but not the ones we studied.

https://doi.org/10.1038/s41366-020-0544-4
https://doi.org/10.1038/s41366-020-0544-4

Please leave a comment to this post

Photo by Pixabay on Pexels.com

Please leave a comment to this post. It doesn’t matter what, it can be a simple Hi or an interesting link. It doesn’t matter when or where you see it. I want to see how many real people are actually reading this blog.

close up of text
Photo by Pixabay on Pexels.com

Tips for making remote presentations

Before becoming a freelancer data scientist, I used to work in a distributed company. Remote communication, including remote presentations were the norm for me, long before the remote work experiment no one asked for. In this post, I share some tips for delivering better presentations remotely.

Me presenting in front of the computer

  • Stand up! Usually, we stand up when we present in front of live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk which allows me to stand up and to raise the camera to my face level.
  • If you can’t raise the camera, stay sitting. You don’t want your audience staring at your groin.
  • I always use a presentation remote control. It frees me up and lets me move more naturally. My remote is almost ten years old and I have a strong emotional attachment to it
  • When presenting, it is very important to see your audience. Use two monitors. Use one monitor for screen sharing, and the other one to see the audience.
  • Put the Skype/Zoom/whatever window that shows your audience under the camera. This way you’ll look most natural on the other side of the teleconference.
  • Starting a presentation in Powerpoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down- buttons in my presentation remote control work with the Reader. The “make screen black” button doesn’t.
  • I open a “lightable view” of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes using a “real” presentation program, but it is good enough.

Auditorium in Chisinau showing me on their screen

  • Make a dry run. Ideally, the try run should be a day or two before the event, to make sure all the technical problems are fixed.
  • Go online at least five minutes before the schedule. Be in front of the camera, don’t let the audience stare at your empty room
  • Make sure nothing in your background will embarrass you. This risk is especially high if you present from home or a hotel. Nobody needs to see your bed during a business meeting.

תרשים עוגה כחלופה הולמת לגרף עמודות

קראתי היום פוסט המדגים איך תרשימי עוגה יכולים להיות יותר יעילים מחלופות. מעניין שהפוסט משתמש פרלמנט הגרמני כמקרה דוגמה.
https://serialmentor.com/dataviz/visualizing-proportions.html

בוריס גורליק

תרשים עוגה כחלופה הולמת לגרף עמודות

במהלך חיי המקצועיים שמעתי רבות בגנות תרשימי עוגה. הסיבה לכך נעוצה בעובדה שקל מאוד לייצר זוועות עם תרשימים אלו. לא עזרה העובדה שבמשך המון זמן ברירת המחדל של תרשימי עוגה, בכל כלי ההדמיה העיקריים, ייצרה תרשימים מעוותים לגמרי. מצדדי החרם על תרשים עוגה מציעים את גרף העמודות כחלופה ראשונה. יחד עם זאת צריך לזכור שלא מתקנים עוול אחד בעוול אחר. לפעמים גרף עוגה דווקא מתאים יותר מגרף העמודות. בואו נראה דוגמה למקרה כזה.

הכנסת המתפקדת האחרונה שהייתה לנו במדינת ישראל הייתה הכנסת ה־20. בואו נראה איך התפלגו מושבי הכנסת בין המפלגות השונות.

זה הקוד ליצירת תרשים עוגה בסיסי בשפת פייתון עם שימוב בספריות סטנדרטיות

fig, ax = plt.subplots()
ax.pie(
    x=tbl_knesset20['מושבים'],
    labels=tbl_knesset20['מפלגה']
)

עד לא מזמן, התוצאה הייתה נראית ככה:

עזבו לרגע את העברית ההפוכה (נטפל בזה בהמשך), הגרף הזה פשוט נורא. הבעיה הגדולה ביותר שלו זה עווית המציאות. כאשר אנחנו מסתכלים על תרשים…

View original post 343 more words

Published
Categorized as blog

One idea per slide. It’s not that complicated

A lot of texts that talk about presentation design cite a very clear rule: each slide has to contain only one idea. Here’s a slide from a presentation deck that says just that.

And here’s the next slide in the same presentation

Can you count how many ideas there are on this slide? I see four of them.

Can we do better?

First of all, we need to remember that most of the time, the slides accompany the presenters and not replace them. This means that you don’t have to put everything you say as a slide. In our case, you can simply show the first slide and give more details orally. On the other hand, let’s face it, the presenters often use slides to remined themselves of what they want to say. 

So, if you need to expand your idea, split the sub-ideas into slides.

You can add some nice illustrations to connect the information and emotion. 

Making it more technical

“Yo!”, I can hear you saying, “Motivational slides are one thing, and technical presentation is a completely different thing! Also,” you continue, “We have things to do, we don’t have time searching the net for cute pics”. I hear you. So let me try improving a fairly technical slide, a slide that presents different types of machine learning.
Does slide like this look familiar to you?

First of all, the easiest solution is to split the ideas into individual slides.

It was simple, wasn’t it. The result is so much more digestible! Plus, the frequent changes of slides help your audience stay awake.

Here’s another, more graphical attempt

When I show the first slide in the deck above, I tell my audience that I am about to talk about different machine learning algorithms. Then, I switch to the next slide, talk about the first algorithm, then about the next one, and then mention the “others”. In this approach, each slide has only one idea. Notice also how the titles in these last slides are smaller than the contents. In these slides, they are used for navigation and are therefore less important.  In the last slide, I got a bit crazy and added so much information that everybody understands that this information isn’t meant to be read but rather serves as an illustration. This is a risky approach, I admit, but it’s worth testing.

To sum up

“One idea per slide” means one idea per slide. The simplest way to enforce this rule is to devote one slide per a sentence. Remember, adding slides is free, the audience attention is not.

5 Basics of Consulting Success: Part 1

Being a data science freelancer, and a long-time AnnMaria’s fan, I HAVE to repost here latest post on consulting success

Last week, I mentioned that successful consultants have five categories of skills; communication, testing, statistics, programming and generalist. COMMUNICATION Communication is the number one most important skill. All five are necessary to some extent, but a terrific communicator with mediocre statistical analysis skills will get more business than a stellar statistician that can’t communicate. Communication…

5 Basics of Consulting Success: Part 1 — AnnMaria’s Blog

Career advice. A clinical pharmacist, epidemiologist, and a Ph.D. student wants to become a data scientist.

Photo by Pixabay on Pexels.com

From time to time, I get emails from people who seek advice in their career paths. If I have time, I write them an extended reply and if they agree, I publish the questions and my replies here, in my blog. Here’s one such email exchange. All similar pieces of advice, as well as other rants about a career in data science, can be found here.

“Hi Boris 🙂
My name is XXXXX. I came across your blog while searching for people with a mix of pharmacy and data science skillsets. Your blog has been so informative to me so far but I was compelled to write to you to ask for your advice.
I am a clinical pharmacist by background but decided to leave the clinical pharmacy to pursue public health. Whilst doing my MPH, I fell in love with epidemiology and statistics and am now doing a Ph.D. in biostatistics. Your blog has made me feel very happy that I made this career move <…>  I feel better about my decision to leave the pharmacy and pursue a quant Ph.D. I have gone from pharmacy, to internships at <YYYY> as I wanted to pursue a career in <ZZZZZ> and now I am thinking of data science in the tech industry…my background is a bit confusing!”

In the past, I also felt that the pharmacy degree was confusing many potential employers, and since I wanted to leave the bio/pharma world and move to “pure data” positions, I omitted the B.Pharm title & studies from my CV. Ten years ago, the salaries in the bio sector, here in Israel, were much lower than the salaries in the “high tech” field. I think that today this situation is more or less normalized and that the people got used to the fact that a typical “data scientist” can have a very wide range of degrees.

“I was just wondering if I could get your opinion on the three questions I have. 
1. I work part-time as a clinical pharmacist to not forget my clinical skills. What do you think about the future of the pharmacy career overall?”

My last shift as a pharmacist

This is a huge question and I don’t have answers to it. Moreover, the answer depends heavily on legal regulations in the given country. I say that if you enjoy treating people, and can afford this time, why not? I, personally, was a very lousy pharmacist 🙂 so I was very happy to leave the pharmacy.

“I am wondering if I should keep up my pharmacist title or pursue data science full-time.”

Again, it depends. For many years, I didn’t have my pharmacy title in my CV because it felt unrelated to what I was doing. It was also a nice icebreaker to tell people with whom I worked “by the way, I’m a pharmacist” and it was fun to see their reactions. If I were you, I would ask two-three HR people or people who recruit employees what they think. Different countries may behave differently. 

“2. At what point can someone call themselves a data scientist?”

In my opinion, as long as you are comfortable enough to call yourself a data scientist, you are good to go. Note that unlike many people who got their data science “title” after taking some online courses, you already have a very strong theoretical base. Not only are your Master’s and the future Ph.D. degree relevant to data science, but they also give you strong and unique advantages. 

“I am looking at DS jobs at large tech companies. I am not sure how qualified and experienced I have to be for these jobs. I code in R using regression, clustering and time series methods and I am quite fluent in this language. I have just started to learn ML algorithms. I have a basic foundation in Python and SQL. I use Tableau for visualization and love communicating my research at any opportunity I get. I was wondering…how good do I have to be able to apply to DS jobs? What are the methods that data scientists use mostly? Would I be able to learn on the job?”

It sounds like a good combination of techniques. I am not recruiting but if I would, I would definitely like this list of skills. Personally, I don’t like R too much and prefer Python. But once you program one language, moving to another one is a doable task. As to what methods do data scientists use mostly, this hugely depends on your job. Most of my time, I clean data and write wrapper functions around known algorithms. The task that I have been facing during my professional life required regression, classification, network analysis. I never did real deep learning stuff, but I know people who only do deep learning for image and sound analysis. Also, in many cases, the data science part takes only 10% of your time because the “customer” doesn’t care about an algorithm, they want a solution. See this post for a nice example.

“3. If you had the opportunity to start your career again, say you were in your early twenties, what would you study and why? What advice would you have for your younger self? I would be so keen to hear what you think.”

It’s a philosophical task which I never like doing. What is done is done. The fact that I am a pretty successful data scientist may mean that I took the right decisions or that I was super lucky. 

Not a wasted time

Photo by Pixabay on Pexels.com

Being a freelancer data scientist, I get to talk to people about proposals that don’t materialize into projects. These conversations take time, but strangely enough, I enjoy them very much, I also find these conversations educating. How else could I have learned about a business model X, or what really happens behind the scenes of company Y?

Which coffee is this?

Photo by samer daboul on Pexels.com

Gilad Almosnino is an internationalization expert. I’m reading his post “Eight emojis that will create a more inclusive experience for Middle Eastern markets,” in which he mentions “Turkish or Arabic Coffee,” which reminded me of my last visit to Athens. When, in one restaurant, I asked for a Turkish coffee, the waiter looked at me harshly and said: “It’s not Turkish coffee; it’s Greek coffee!”

Turkish, Arabic, or Greek

Further Research is Needed

Do you believe in telepathy? Yesterday, I submitted final proofs of a paper in which I actively participated. During the proofreading, I noticed that our abstract ends with “further research is needed” and scratched my head. I submitted the proofs and then then, I saw this pearl in my blog feed

Further Research is Needed — xkcd.com

Book review: Great mental models by Shane Parrish

TL;DR shallow and disappointing

The Great Mental Models by Shane Parrish was highly praised by Automattic’s CEO Matt Mullenweg. Since I appreciate Matt’s opinion a lot, I decided to buy the book. I read it and was disappointed.

Image result for great mental models

This book is very ambitious but yet shallow and non-engaging. If you consider reading a book on mental models, then chances are you already know some of them. I expected the book to shed light on aspects I didn’t know or didn’t think of. Nothing like that happened. I didn’t learn new facts, neither was I impressed by a new way of thinking. I also think that this book won’t do the job with teenagers who still don’t have the arsenal of mental models, for them this book is full of unclear shortcuts.

The book is based on the materials of a highly praised blog fs.blog and is a good example how some stuff can work well as a blog post but feel bad as a book.

The bottom line: 2/5 Skip it.

TicToc — a flexible and straightforward stopwatch library for Python.

Photo by Brett Sayles on Pexels.com

Many years ago, I needed a way to measure execution times. I didn’t like the existing solutions so I wrote my own class. As time passed by, I added small changes and improvements, and recently, I decided to publish the code on GitHub, first as a gist, and now as a full-featured Github repository, and a pip package.

TicToc – a simple way to measure execution time

TicToc provides a simple mechanism to measure the wall time (a stopwatch) with reasonable accuracy.

Crete an object. Run tic() to start the timer, toc() to stop it. Repeated tic-toc’s will accumulate time. The tic-toc pair is useful in interactive environments such as the shell or a notebook. Whenever toc is called, a useful message is automatically printed to stdout. For non-interactive purposes, use start and stop, as they are less verbose.

Following is an example of how to use TicToc:

Usage examples

def leibniz_pi(n):
    ret = 0
    for i in range(n * 1000000):
        ret += ((4.0 * (-1) ** i) / (2 * i + 1))
    return ret

tt_overall = TicToc('overall')  # started  by default
tt_cumulative = TicToc('cumulative', start=False)
for iteration in range(1, 4):
    tt_cumulative.start()
    tt_current = TicToc('current')
    pi = leibniz_pi(iteration)
    tt_current.stop()
    tt_cumulative.stop()
    time.sleep(0.01)  # this inteval will not be accounted for by `tt_cumulative`
    print(
        f'Iteration {iteration}: pi={pi:.9}. '
        f'The computation took {tt_current.running_time():.2f} seconds. '
        f'Running time is {tt_overall.running_time():.2} seconds'
    )
tt_overall.stop()
print(tt_overall)
print(tt_cumulative)

TicToc objects are created in a “running” state, i.e you don’t have to start them using tic. To change this default behaviour, use

tt = TicToc(start=False)
# do some stuff
# when ready
tt.tic()

Installation

Install the package using pip

pip install tictoc-borisgorelik

Dispute for the sake of Heaven, or why it’s OK to have a loud argument with your co-worker

Any dispute that is for the sake of Heaven is destined to endure; one that is not for the sake of Heaven is not destined to endure
Chapters of the Fathers 5:27

One day, I had an intense argument with a colleague at my previous place of work, Automattic. Since most of the communication in Automattic happens in internal blogs that are visible to the entire company, this was a public dispute. In a matter of a couple of hours, some people contacted me privately on Slack. They told me that the message exchange sounded aggressive, both from my side and from the side of my counterpart. I didn’t feel that way. In this post, I want to explain why it is OK to have a loud argument with your co-workers.

How it all began?

I’m a data scientist and algorithm developer. I like doing data science and developing algorithms. Sometimes, to be better at my job, I need to show my work to my colleagues. In a “regular” company, I would ask my colleagues to step into my office and play with my models. Automattic isn’t a “regular” company. At Automattic, people from more than sixty countries from in every possible time zone. So, I wanted to start a server that will be visible by everyone in the company (and only by them), that will have access to the relevant data, and that will be able to run any software I install on it.

Two bees fighting

X is a system administrator. He likes administrating the systems that serve more than 2000,000,000 unique visitors in the US alone. To be good at his job, X needs to make sure no bad things happen to the systems. That’s why when X saw my request for the new setup (made on a company-visible blog page), his response was, more or less, “Please tell me why do you think you need this, and why can’t you manage with what you already have.”

Frankly, I was furious. Usually, they tell you to count to ten before answering to someone who made you angry. Instead, I went to my mother-in-law’s birthday party, and then I wrote an answer (again, in a company-visible blog). The answer was, more or less, “because I know what I’m doing.” For which, X replied, more or less, “I know what I do too.”

How it got resolved?

At this point, I started realizing that X is not expected to jeopardize his professional reputation for the sake of my professional aspirations. It was true that I wanted to test a new algorithm that will bring a lot of value to the company for which I work. It is also true that X doesn’t resent to comply with every developers’ request out of caprice. His job is to keep the entire system working. Coincidentally, X contacted me over Slack, so I took the opportunity to apologize for something that sounded as aggression from my side. I was pleased to hear that X didn’t notice any hostility, so we were good.

What eventually happened and was the dispute avoidable?

I don’t know whether it was possible to achieve the same or a better result without the loud argument. I admit: I was angry when I wrote some of the things that I wrote. However, I wasn’t mad at X as a person. I was angry because I thought I knew what was best for the company, and someone interfered with my plans.

I assume that X was angry when he wrote some of the things he wrote. I also believe that he wasn’t angry at me as a person but because he knew what was best for the company, and someone tried to interfere with his plans.

I’m sure though that it was this argument that enabled us to define the main “pain” points for both sides of the dispute. As long as the dispute was about ideas, not personas, and as long as the dispute’s goal was for the sake of the common good, it was worth it. To my current and future colleagues: if you hear me arguing loudly, please know that this is a “dispute that is for the sake of Heaven [that] is destined to endure.”


Featured image: Source: http://mimiandeunice.com/; Bees image: Photo by Flickr user silangel, modified. Under the CC-BY-NC license.

The difference between python decorators and inheritance that cost me three hours of hair-pulling

Photo by Genaro Servu00edn on Pexels.com

I don’t have much hair on my head, but recently, I encountered a funny peculiarity in Python due to which I have been pulling my hair for a couple of hours. In retrospect, this feature makes a lot of sense. In retrospect.

First, let’s start with the mental model that I had in my head: inheritance.

Let’s say you have a base class that defines a function `f`

Now, you inherit from that class and rewrite f

What happens? The fact that you defined f in ClassB means that, to a rough approximation, the old definition of f from ClassA does not exist in all the ClassB objects.

Now, let’s go to decorators.

@dataclass_json
@dataclass
class Message2:
    message: str
    weight: int
    def to_dict(self, encode_json=False):
        print('Custom to_dict')
        ret = {'MESSAGE': self.message, 'WEIGHT': self.weight}
        return ret
m2 = Message2('m2', 2)

What happened here? I used a decorator `dataclass_json` that, among other things, provides a `to_dict` function to Python’s data classes. I created a class `Message2`, but I needed s custom `to_dict` definition. So, naturally, I defined a new version of `to_dict` only to discover several hours later that the new `to_dict` doesn’t exist.

Do you get the point already? In inheritence, the custom implementations are added ON TOP of the base class. However, when you apply a decorator to a class, your class’s custom code is BELOW the one provided by the decorator. Therefore, you don’t override the decorating code but rather “underride” it (i.e., give it something it can replace).

As I said, it makes perfect sense, but still, I missed it. I don’t know whether I would have managed to find the solution without Stackoverflow.

Does Zipf’s Law Apply to Alzheimer’s Patients?

Today, I read a post about Ziph’s law and Alzheimer’s disease. I liked the post very much and decided to press the “like” button only to discover that I already “liked” this post more than two years ago.

Indeed this is an interesting post.

Akshay Budhkar

Introduction

I was fascinated by Zipf’s Law when I came across it on a VSauce video. It is an empirical law that states that the frequency of occurrence of a word in a large text corpus is inversely proportional to its rank in its frequency table. The frequency distribution will resemble a Pareto distribution so that the 2nd word will occur 1/2 times the first, the 3rd word 1/3 times and the nth word 1/n times. The law applies to all languages, even the ones which we do not understand yet. Curious, I decided to test it out on a text corpus of Alzheimer’s patients describing a picture.

Alzheimer’s Disease (AD) is a neurodegenerative condition that usually occurs in older people over 65 years old and worsens over time.

AD kills more people than breast and prostate cancer combined.

There is no cure for AD yet, and it…

View original post 902 more words

Published
Categorized as blog

The first things a statistical consultant needs to know — AnnMaria’s Blog

You know that I’m a data science consultant now, don’t you? You know that AnnMaria De Mars, Ph.D. (the statistician, game developer, the world Judo champion) is one of my favorite bloggers, and her blog is the second blog I started to follow don’t you? 

A couple of months ago, AnnMaria wrote an extensive post about 30 things she learned in 30 years as a statistical consultant. One week ago, she wrote another great piece of advice.

I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/ April. While I will be talking a little bit about factor analysis, repeated measures ANOVA and logistic regression, that is the end of my talk. The first things a statistical consultant should know don’t have much to do with…

The first things a statistical consultant needs to know — AnnMaria’s Blog

Book review. Replay by Ken Grimwood

TL;DR: excellent fiction reading, makes you think about your life choices. 5/5

book cover of "Replay" by Ken Grimwood

“Replay” by Ken Grimwood is the first fiction book that I read in ages. The book is about a forty-three years old man with a failing family and a boring career. The man suddenly dies and re-appears in his own eighteen-years old body. He then lives his life again, using the knowledge of his future self. Then he dies again, and again, and again.
I liked the concept (reminded me of the Groundhog Day movie). The book managed to “suck me in,” and I finished it in two days. It also made me think hard about my life choices. I think that my decision to quit and become a freelancer was partially affected by this book.

What did I not like? Some parts of the book are somewhat pornographic. It doesn’t bother me per se, but I think the plot would stay as good as it is without those parts. Also, I find it a little bit sad that every reincarnation in “Replay” starts with making easy money. Not that I don’t like money; it just makes me sad.

Photo of my kindle with text from "Replay" by Ken Grimwood

Bottom line: Read! 5/5

(Read in Nov 2019)

ASCII histograms are quick, easy to use and to implement

Screen Shot 2018-02-25 at 21.25.32From time to time, we need to look at the distribution of a group of values. Histograms are, I think, the most popular way to visualize distributions. “Back in the old days,” when we did most of our work in the console, and when creating a plot from Python required too many boilerplate code lines, I found a neat function that produced histograms using ASCII characters.

Surely, today, when most of us work in a notebook environment, ASCII histograms aren’t as useful as they used to be. However, they are still helpful. One scenario in which ASCII diagrams are useful is when you write a log file for an iterative process. A quick glimpse at the log file will let you know when the distribution of some scoring function reached convergence.

That is why I keep my version of asciihist updated since 2005. You may find it on Github here.

The tombs of the righteous

Some people, in face of important changes visit tombs of the righteous for a blessing. I went to see WEIZAC — Israel’s first computer (and one of the first ones in the world) that was built in 1955.

Me in front of the memory unit of WEIZAC
Published
Categorized as blog Tagged

How I got a dream job in a distributed company and why I am leaving it

One night, in January 2014, I came back home from work after spending two hours commuting in each direction. I was frustrated and started Googling for “work from home” companies. After a couple of minutes, I arrived at https://automattic.com/work-with-us/. Surprisingly to me, I couldn’t find any job postings for data scientists, and a quick LinkedIn search revealed no data scientists at Automattic. So I decided to write a somewhat arrogant letter titled “Why you should call me?”. After reading the draft, I decided that it was too arrogant and kept it in my Drafts folder so that I can sleep over it. A couple of days later, I decided to delete that mail. HOWEVER, entirely unintentionally, I hit the send button. That’s how I became the first data scientist hired by Automattic (Carly Staumbach, the data scientist and the musician, was already Automattician, but she arrived there by an acquisition).

Screenshot of my email
The email is pretty long.
I even forgot to remove a link that I planned to read BEFORE sending that email.

The past five and a half years have been the best five and a half years in my professional life. I met a TON of fascinating people from different cultural and professional backgrounds. I re-discovered blogging. My idea of what a workplace is has changed tremendously and for good.

What happened?

Until now, every time I left a workplace, I did that for external reasons. I simply had to. I left either due to company’s poor financial situation, due to long commute time, or both. Now, it’s the first time I am leaving a place of work entirely for internal reasons: despite, and maybe a little bit because, the fact that everything was so good. (Of course, there are some problems and disruptions, but nothing is ideal, right?)

What happened? In June, I left for a sabbatical. The sabbatical was so good that I already started making plans for another one. However, I also started thinking about my professional growth, the opportunities I have, and the opportunities I previously missed. I realized that right now, I am in the ideal position to exit the comfort zone and to take calculated professional risks. That’s how, after about four sleepless weeks, I decided to quit my dream job and to start a freelance career.

On January 22, I will become an Automattic alumnus.

BTW, Automattic is constantly looking for new people. Visit their careers page and see whether there is something for you. And if not, find the chutzpah and write them anyhow.

A group photo of about 600 people -- Automattic 2018 grand meetup
2018 Grand Meetup.
A group photo of about 800 people. 2019 Automattic Grand Meetup
2019 Grand Meetup. I have no idea where I am at this picture

Software commodities are eating interesting data science work — Yanir Seroussi

If you read my shortish post about staying employable as a data scientist, you might like a longer post by a colleague, Yanir Seroussi. In his post, Yanir lists four possible paths for a data scientist: (1) become an engineer; (2) reinvent the wheel; (3) search for niches; and (4) expand the cutting edge.

To this list, I would also add two other options.

(5) Manage. Managing is not developing, it’s a different profession. However, some developers and data scientists that I know choose this path. I am not a manager myself, so I hope I don’t insult the managers who read these lines, but I think that it is much easier for a good manager to stay good, than for a good developer or data scientist.

(6) Teach. I teach as a part-time job. One reason for teaching is that I sometimes enjoy it. Another reason is that I feel that at some point, I might not be good enough to stay on the cutting edge but still sharp enough to teach the new generations the basics.

Anyhow, read Yanir’s post linked below.

The passage of time makes wizards of us all. Today, any dullard can make bells ring across the ocean by tapping out phone numbers, cause inanimate toys to march by barking an order, or activate remote devices by touching a wireless screen. Thomas Edison couldn’t have managed any of this at his peak—and shortly before […]

Software commodities are eating interesting data science work — Yanir Seroussi

Career advice. A research pharmacist wants to become a data scientist.

Recently, I received an email from a pharmacist who considers becoming a data scientist. Since this is not a first (or last) similar email that I receive, I think others will find this message exchange interesting.

Here’s the original email with minor edits, followed by my response.

The question

Hi Boris, 


My name is XXXXX, and I came across your information and your advice on data science as I was researching career opportunities.

I currently work at a hospital as a research pharmacist, mainly involved in managing drugs for clinical trials.
Initially, I wanted to become a clinical pharmacist and pursued 1-year post-graduate residency training. However, it was not something I could envision myself enjoying for the rest of my career.

I then turned towards obtaining a Ph.D. in translational research, bridging the benchwork research to the bedside so that I could be at the forefront of clinical trial development and benefit patients from the rigorous stages of pre-clinical research outcomes. I much appreciate learning all the meticulous work dedicated before the development of Phase I clinical trials. However, Ph.D. in pharmaceutical sciences was overkill for what I wanted to achieve in my career (in my opinion), and I ended up completing with master’s in pharmaceutical sciences.

Since I wanted to be involved in both research and pharmacy areas in my career, I ended up where I am now, a research pharmacist.

My main job description is not any different from typical hospital pharmacists. I do have a chance of handling investigational medications, learning about new medications and clinical protocols, overseeing side effects that may be a crucial barrier in marketing the trial medications, and sometimes participating in development of drug preparation and handling for investigator-initiated trials. This does keep my job interesting and brings variety in what I do. However, I do still feel I am merely following the guidelines to prepare medications and not critically thinking to make interventions or manipulate data to see the outcomes. At this point, I am preparing to find career opportunities in the pharmaceutical industry where I will be more actively involved in clinical trial development, exchanging information about targeting the diseases and analyzing data. I believe gaining knowledge and experiences in critical characteristics for the data science field would broaden my career opportunities and interest. Still, unfortunately, I only have pharmacy background and have little to no experience in computer science, bioinformatics, or machine learning.

The answer

First of all, thank you for asking me. I’m genuinely flattered. I assume that you found me through my blog posts, and if not, I suggest that you read at least the following posts

All my thoughts on the career path of a data scientist appear in this page https://gorelik.net/category/career-advice/

Now, specifically to your questions.

My path towards data science was through gradual evolution. Every new phase in my career used my previous experience and knowledge. From B.Sc studies in pharmacy to doctorate studies in computational drug design, from computational drug design to biomathematical modeling, from that to bioinformatics, and from that to cybersecurity. Of course, my path is not unique. I know at least three people who followed a similar career from pharmacy to data science. Maybe other people made different choices and are even more successful than I am. My first advice to everyone who wants to transition into data science is not to (see the first link in the list above). I was lucky to enter the field before it was a field, but today, we live in the age of specialization. Today we have data analysts, data engineers, machine learning engineers, NLP scientists, image processing specialists, etc. If computational modeling is something that a person likes and sees themselves doing for living, I suggest pursuing a related advanced degree with a project that involves massive modeling efforts. Examples of such degrees for a pharmacist are computational chemistry, pharmacoepidemiology, pharmacovigilance, bioinformatics. This way, one can utilize the knowledge that they already have to expand the expertise, build a reputation, and gain new knowledge. If staying in academia is not an option, consider taking a relevant real-life project. For example, if you work in a hospital, you could try identifying patterns in antibiotics usage, a correlation between demographics and hospital re-admission, … you get the idea.

Whatever you do, you will not be able to work as a data scientist if you can’t write computer programs. Modifying tutorial scripts is not enough; knowing how to feed data into models is not enough.

Also, my most significant knowledge gap is in maths. If you do go back to academia, I strongly suggest taking advantage of the opportunity and taking several math classes: at least calculus and linear algebra and, of course, statistics. 

Do you have a question for me?

If you have questions, feel free writing them here, in the comments section or writing to boris@gorelik.net

New year, new notebook

On November 7, 2016, I started an experiment in personal productivity. I decided to use a notebook for thirty days to manage all of my tasks. The thirty days ended more than three years ago, and I still use notebooks to manage myself. Today, I started the thirteenth notebook.

Read about my time management system here.

Don’t we all like a good contradiction?

I am a huge fan of Gerd Gigerenzer who preaches numeracy and uncertainty education. One of Prof. Gigerenzer’s pivotal theses is “Fast and Frugal Heuristics” which is also popularized in his book “Gut Feelings” (listen to this podcast if you don’t want to read the book). I like this approach.

Today, I listened to the latest episode of the Brainfluence podcast that hosted the psychologist Dr. Gleb Tsipursky who wrote an extensive book called “Never Trust your Gut” with a seemingly contradicting thesis. I added this book to my TOREAD list.

Staying employable and relevant as a data scientist

One common wisdom is that creative jobs are immune to becoming irrelevant. This is what Brian Solis, the author of “Lifescale” says on this matter

On the positive side, historically, with every technological advancement, new jobs are created. Incredible opportunity opens up for individuals to learn new skills and create in new ways. It is your mindset, the new in-demand skills you learn, and your creativity that will assure you a bright future in the age of automation. This is not just my opinion. A thoughtful article in Harvard Business Review by Joseph Pistrui was titled, “The Future of Human Work Is Imagination, Creativity, and Strategy.” He cites research by McKinsey […]. In their research, they discovered that the more technical the work, the more replaceable it is by technology. However, work that requires imagination, creative thinking, analysis, and strategic thinking is not only more difficult to automate; it is those capabilities that are needed to guide and govern the machines.

Many people think that data science falls into the category of “creative thinking and analysis”. However, as time passes by this becomes less true. Here’s why.

As time passes by, tools become stronger, smarter, and faster. This means that a problem that could have been solved using cutting edge algorithms running by cutting edge scientists on cutting edge computers, will be solvable using a commodity product. “All you have to do” is to apply domain knowledge, select a “good enough” tool, get the results and act upon them. You’ll notice that I included two phases in quotation marks. First, “all you have to do”. I know that it’s not that simple as “just add water” but it gets simpler.

“Good enough” is also a tricky part. Selecting the right algorithm for a problem has dramatic effect on tough cases but is less important with easy ones. Think of a sorting algorithm. I remember my algorithm class professor used to talk how important it was to select the right sorting algorithm to the right problem. That was almost twenty years ago. Today, I simply write list.sort() and I’m done. Maybe, one day I will have to sort billions of data points in less than a second on a tiny CPU without RAM, which will force me into developing a specialized solution. But in 99.999% of cases, list.sort() is enough.

Back to data science. I think that in the near future, we will see more and more analogs of list.sort(). What does that mean to us, data scientists? I am not sure. What I’m sure is that in order to stay relevant we have to learn and evolve.

Featured image by Héctor López on Unsplash

Is security through obscurity back?

HBR published an opinion post by Andrew Burt, called “The AI Transparency Paradox.” This post talks about the problems that were created by tools that open up the “black box” of a machine learning model.

“Black box” refers to the situation where one can’t explain why a machine learning model predicted whatever it predicted. Predictability is not only important when one wants to improve the model or to pinpoint mistakes, but it is also an essential feature in many fields. For example, when I was developing a cancer detection model, every physician requested to know why we thought a particular patient had cancer. That is why I’m happy, so many people develop tools that allow peeking into the black box.

I was very surprised to read the “transparency paradox” post. Not because I couldn’t imagine that people will use the insights to hack the models. I was surprised because the post reads like a case for security through obscurity — an ancient practice that was mostly eradicated from the mainstream. 

Yes, ML transparency opens opportunities for hacking and abuse. However, this is EXACTLY the reason why such openness is needed. Hacking attempts will not disappear with transparency removal; they will be harder to defend. 

I will speak at the NDR conference in Bucharest

NDR is a family of machine learning conferences in Romania. Last year, I attended the Iași edition of that conference, gave a data visualization talk, and enjoyed every moment. All the lectures (including mine, obviously) were interesting and relevant. That is why, when Vlad Iliescu, one of the NDR organizers, asked me whether I wanted to talk in Bucharest at NDR 2020, I didn’t think twice. 

Since the organizers didn’t publish the talk topics yet, I will not ruin the surprise for you, but I promise to be interesting and relevant. I definitely think that NDR is worth the trip to Bucharest to many data practitioners, even the ones who don’t live in Romania. Visit the conference site to register.

Book review. A Short History of Nearly Everything by Bill Bryson

TL;DR: a nice popular science book that covers many aspects of the modern science

A Short History of Nearly Everything by Bill Bryson is a popular science book. I didn’t learn anything fundamental out of this book, but it was worth reading. I was particularly impressed by the intrigues, lies, and manipulations behind so many scientific discoveries and discoverers. 

The main “selling point” of this book is that it answers the question, “how do the scientists know what they know”? How, for example, do we know the age of Earth or the skin color of the dinosaurs? The author indeed provides some insight. However, because the book tries to talk about “nearly everything,” the answer isn’t focused enough. Simon Singh’s book “Big Bang” concentrates on the cosmology and provides a better insight into the question of “how do we know what we know.” 

Interesting takeaways and highlights

  • Of the problem that our Universe is unlikely to be created by chance: “Although the creation of Universe is very unlikely, nobody knows about failed attempts.”
  • The Universe is unlimited but finite (think of a circle)
  • Developments in chemistry were the driving force of the industrial revolution. Nevertheless, chemistry wasn’t recognized as a scientific field in its own for several decades

The bottom line: Read if you have time 3.5/5. 

Cow shit, virtual patient, big data, and the future of the human species

Yesterday, a new episode was published in the Popcorn podcast, where the host, Lior Frenkel, interviewed me. Everyone who knows me knows how much I love talking about myself and what I do. I definitely used this opportunity to talk about the world of data. Some people who listened to this episode told me that they enjoyed it a lot. If you know Hebrew, I recommend that you listen to this episode

Data visualization as an engineering task – a methodological approach towards creating effective data visualization

In June 2019, I attended the NDR AI conference in Iași, Romania where I also gave a talk. Recently, the organizers uploaded the video recording to YouTube.

That was a very interesting conference, tight with interesting talks.

Next year, I plan to attend the Bucharest edition of NDR, where I will also give a talk with the working title “The biggest missed opportunity in data visualization”

A tangible productivity tool (and a book review)

One month ago, I stumbled upon a book called “Personal Kanban: Mapping Work | Navigating Life” by Jim Benson (all the book links use my affiliate code). Never before, I saw a more significant discrepancy between the value that the book gave me and its actual content. 

Even before finishing the first chapter of this book, I realized that I wanted to incorporate “personal kanban” into my productivity system. The problem was that the entire book could be summarized by a blog post or by a Youtube video (such as this one). The rest of the book contains endless repetitions and praises. I recommend not reading this book, even though it strongly affected the way I work

So, what is Personal Kanban anyhow? Kanban is a productivity approach that puts all the tasks in front of a person on a whiteboard. Usually, Kanban boards are physical boards with post-it notes, but software Kanban boards are also widely known (Trello is one of them). Following are the claims that Jim Benson makes in his book that resonated with me

  • Many productivity approaches view personal and professional life separately. The reality is that these two aspects of our lives are not separate at all. Therefore, a productivity method needs to combine them.
  • Having all the critical tasks in front of your eyes helps to get the global picture. It also helps to group the tasks according to their contexts. 
  • The act of moving notes from one place to another gives valuable tangible feedback. This feedback has many psychological benefits.
  • One should limit the number of work-in-progress tasks.
  • There are three different types of “productivity.” You are Productive when you work hard. You are Efficient when your work is actually getting done. Finally, you are Effective when you do the right job at the right time, and can repeat this process if needed. 

I’m a long user of a productivity method that I adopted from Mark Forster. You may read about my process here. Having read Personal Kanban, I decided to combine it with my approach. According to the plan, I have more significant tasks on my Kanban board, which I use to make daily, weekly, and long-term plans. For the day-to-day (and hour-to-hour) taks, I still use my notebooks. 

Initially, I used my whiteboard for this purpose, but something wasn’t right about it.

Having my Kanban on my home office whiteboard had two significant drawbacks. First, the whiteboard isn’t with me all the time. And what is the point of putting your tasks on board if you can’t see it? Secondly, listing everything on a whiteboard has some privacy issues. After some thoughts, I decided to migrate the Kanban to my notebook.

In this notebook, I have two spreads. The first spread is used for the backlog, and “this week” taks. The second spread has the “today,” “doing,” “wait,” and “done” columns. The fact that the notebook is smaller than the whiteboard turned out to be a useful feature. This physical limitation limits the number of tasks I put on my “today” and “doing” lists. 

I organize the tasks at the beginning of my working day. The rest of the system remains unchanged. After more than a month, I’m happy with this new tangible productivity method.

Data science tools with a graphical user interface

A Quora user asked about data science tools with a graphical user interface. Here’s my answer. I should mention though that I don’t usually use GUI for data science. Not that I think GUIs are bad, I simply couldn’t find a tool that works well for me.

Of the many tools that exist, I like the most Orange (https://orange.biolab.si/). Orange allows the user creating data pipelines for exploration, visualization, and production but also allows editing the “raw” python code. The combination of these features makes is a powerful and flexible tool.

The major drawback of Orange (in my opinion) is that is uses its own data format and its own set of models that are not 100% compatible with the Numpy/Pandas/Sklearn ecosystem.

I have made a modest contribution to Orange by adding a six-lines function that computes Matthews correlation coefficient.

Other tools are KNIME and Weka (none of them is natively Python).

There is also RapidMinder but I never used it.

Working in a distributed company. Communication styles

I work at Automattic, one of the largest distributed companies in the world. Working in a distributed company means that everybody in this company works remotely. There are currently about one thousand people working in this company from about seventy countries. As you might expect, the international nature of the company poses a communication challenge. Recently, I had a fun experience that demonstrates how different people are.

Remote work means that we use text as our primary communication tool. Moreover, since the company spans over all the time zones in the world, we mostly use asynchronous communication, which takes the form of posts in internal blogs. A couple of weeks ago, I completed a lengthy analysis and summarized it in a post that was meant to be read by the majority of the company. Being a responsible professional, I asked several people to review the draft of my report.

To my embarrassment, I discovered that I made a typo in the report title, and not just a typo: I misspelled the company name :-(. A couple of minutes after asking for a review, two of my coworkers pinged me on Slack and told me about the typo. One message was, “There is a typo in the title.” Short, simple, and concise.

The second message was much longer.

Do you want to guess what the difference between the two coworkers is?
.
.
.
.
.
Here’s the answer
.
.
.
.
The author of the first (short) message grew up and lives in Germany. The author of the second message is American. Germany, United States, and Israel (where I am from) have very different cultural codes. Being an Israeli, I tend to communicate in a more direct and less “sweetened” way. For me, the American communication style sounds a little bit “artificial,” even though I don’t doubt the sincerity of this particular American coworker. I think that the opposite situation is even more problematic. It happened several times: I made a remark that, in my opinion, was neutral and well-intended, and later I heard comments about how I sounded too aggressive. Interestingly, all the commenters were Americans.

To sum up. People from different cultural backgrounds have different communication styles. In theory, we all know that these differences exist. In practice, we usually are unaware of them.

Featured photo by Stock Photography on Unsplash

Sometimes, you don’t really need a legend

This is another “because you can” rant, where I claim that the fact that you can do something doesn’t mean that you necessarily need to.

This time, I will claim that sometimes, you don’t really need a legend in your graph. Let’s take a look at an example. We will plot the GDP per capita for three countries: Israel, France, and Italy. Plotting three lines isn’t a tricky task. Here’s how we do this in Python

plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.legend()

The last line in the code above does a small magic and adds a nice legend

This image has an empty alt attribute; its file name is image.png

In Excel, we don’t even need to do anything, the legend is added for us automatically.

This image has an empty alt attribute; its file name is image-1.png

So, what is the problem?

What happens when a person wants to know which line represents which country? That person needs to compare the line color to the colors in the legend. Since our working memory has a limited capacity, we do one of the following. We either jump from the graph to the legends dozens of times, or we try to find a heuristic (a shortcut). Human brains don’t like working hard and always search for shortcuts (I recommend reading Daniel Kahneman’s “Think Fast and Slow” to learn more about how our brain works).

What would be the shortcut here? Well, note how the line for Israel lies mostly below the line for Italy which lies mostly below the line for France. The lines in the legend also lie one below the other. However, the line order in these two pieces of information isn’t conserved. This results in a cognitive mess; the viewer needs to work hard to decipher the graph and misses the point that you want to convey.

And if we have more lines in the graph, the situation is even worse.

This image has an empty alt attribute; its file name is image-2.png

Can we improve the graph?

Yes we can. The simplest way to improve the graph is to keep the right order. In Python, we do that by reordering the plotting commands.

plt.plot(gdp.Year, gdp.Australia, '-', label='Australia')
plt.plot(gdp.Year, gdp.Belgium, '-', label='Belgium')
plt.plot(gdp.Year, gdp.France, '-', label='France')
plt.plot(gdp.Year, gdp.Italy, '-', label='Italy')
plt.plot(gdp.Year, gdp.Israel, '-', label='Israel')
plt.legend()
This image has an empty alt attribute; its file name is image-3.png

We still have to work hard but at least we can trust our brain’s shortcut.

If we have more time

If we have some more time, we may get rid of the (classical) legend altogether.

countries = [c for c in gdp.columns if c != 'Year']
fig, ax = plt.subplots()
for i, c in enumerate(countries):
    ax.plot(gdp.Year, gdp[c], '-', color=f'C{i}')
    x = gdp.Year.max()
    y = gdp[c].iloc[-1]
    ax.text(x, y, c, color=f'C{i}', va='center')
seaborn.despine(ax=ax)

(if you don’t understand the Python in this code, I feel your pain but I won’t explain it here)

This image has an empty alt attribute; its file name is image-4.png

Isn’t it better? Now, the viewer doesn’t need to zap from the lines to the legend; we show them all the information at the same place. And since we already invested three minutes in making the graph prettier, why not add one more minute and make it even more awesome.

This image has an empty alt attribute; its file name is image-5.png

This graph is much easier to digest, compared to the first one and it also provides more useful information.

.

This image has an empty alt attribute; its file name is image-6.png

I agree that this is a mess. The life is tough. But if you have time, you can fix this mess too. I don’t, so I won’t bother, but Randy Olson had time. Look what he did in a similar situation.

percent-bachelors-degrees-women-usa

I also recommend reading my older post where I compared graph legends to muttonchops.

In conclusion

Sometimes, no legend is better than legend.

This post, in Hebrew: [link]

What do we see when we look at slices of a pie chart?

What do we see when we look at slices of a pie chart? Angles? Areas? Arc length? The answer to this question isn’t clear and thus “experts” recommend avoiding pie charts at all.

Robert Kosara is a Senior Research Scientist at Tableau Software (you should follow his blog https://eagereyes.org), who is very active in studying pie charts. In 2016, Robert Kosara and his collaborators published a series of studies about pie charts. There is a nice post called “An Illustrated Tour of the Pie Chart Study Results” that summarizes these studies. 

Last week, Robert published another paper with a pretty confident title (“Evidence for Area as the Primary Visual Cue in Pie Charts”) and a very inconclusive conclusion

While this study suggests that the charts are read by area, itis not conclusive. In particular, the possibility of pie chart usersre-projecting the chart to read them cannot be ruled out. Furtherexperiments are therefore needed to zero in on the exact mechanismby which this common chart type is read.

Kosara. “Evidence for Area as the Primary Visual Cue in Pie Charts.” OSF, 17 Oct. 2019. Web.

The previous Kosara’s studies had strong practical implications, the most important being that pie charts are not evil provided they are done correctly. However, I’m not sure what I can take from this one. As far as I understand the data, the answer to the questions in the beginning of this post are still unclear. Maybe, the “real answer” to these questions is “a combination of thereof”.

The problem with citation count as an impact metric

Inspired by A citation is not a citation is not a citation by Lior Patcher, this rant is about metrics.

Lior Patcher is a researcher in Caltech. As many other researchers in the academy, Dr. Patcher is measured by, among other things, publications and their impact as measured by citations. In his post, Lior Patcher criticised both the current impact metrics and also their effect on citation patterns in the academic community.

PROBLEM POINTED: citations don’t really measure “actual” citations. Most of the appeared citations are “hit and run citations” i.e: people mention other people’s research without taking anything from that research.

In fact this author has cited [a certain] work in exactly the same way in several other papers which appear to be copies of each other for a total of 7 citations all of which are placed in dubious “papers”. I suppose one may call this sort of thing hit and run citation.

via A citation is not a citation is not a citation — Bits of DNA

I think that the biggest problem with citation counts is that it costs nothing to cite a paper. When you add a research (or a post, for that matter) to your reference list, you know that most probably nobody will check whether actually read it, that nobody will check whether you got that publication correctly and that nobody will that the chances are super (SUUPER) low nobody will check whether you conclusions are right. All it takes is to click a button.

Book review. The War of Art by S. Pressfield

TL;DR: This is a long motivational book that is “too spiritual” for the cynic materialist that I am.

The War of Art by [Pressfield, Steven]

The War of Art is a strange book. I read it because “everybody” recommended it. This is what Derek Sivers’ book recommendation page says about this book

Have you experienced a vision of the person you might become, the work you could accomplish, the realized being you were meant to be? Are you a writer who doesn’t write, a painter who doesn’t paint, an entrepreneur who never starts a venture? Then you know what “Resistance” is.

As a known procrastinator, I was intrigued and started reading. In the beginning, the book was pretty promising. The first (and, I think, the biggest) part of the book is about “Resistance” — the force behind the procrastination. I immediately noticed that almost every sentence in this chapter could serve a motivational poster. For example

  • It’s not the writing part that’s hard. What’s hard is sitting down to write.
  • The danger is greatest when the finish line is in sight.
  • The most pernicious aspect of procrastination is that it can become a habit.
  • The more scared we are of a work or calling, the more sure we can be that we have to do it.

Individually, each sentence makes sense, but their concentration was a bit too much for me. The way Pressfield talks about Resistance resembles the way Jewish preachers talk about Yetzer Hara: it sits everywhere, waiting for you to fail. I’ tdon’t like this approach.

The next chapters were even harder for me to digest. Pressfield started talking about Muses, gods, prayers, and other “spiritual” stuff; I almost gave up. But I fought the Resistance and finished the book.

My main takeaways:

  • Resistance is real
  • It’s a problem
  • The more critical the task is, the stronger is the Resistance. OK, I kind of agree with this. Pressfield continues to something do not agree with: thus (according to the author), we can measure the importance of a task by the Resistance it creates.
  • Justifying not pursuing a task by commitments to the family, job, etc. is a form of Resistance.
  • The Pro does stuff.
  • The Artist is a Pro (see above) who does stuff even if nobody cares.

Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On Sunday, I wrote about bootstrapping. On Monday, I wrote about visualization uncertainty. Let’s now talk about bootstrapping and uncertainty visualization.

Robert Grant is a data visualization expert who wrote a book about interactive data visualization (which I should read, BTW).

Robert runs an interesting blog from which I learned another approach to uncertainty visualization, bootstrapping.

Source: Robert Grant.

Read the entire post: Data visualization with statistical reasoning: seeing uncertainty with the bootstrap — Dataviz – Stats – Bayes

On MOOCs

When Massive Online Open Courses (a.k.a MOOCs) emerged some X years ago, I was ecstatic. I was sure that MOOCs were the Big Boom of higher education. Unfortunately, the MOOC impact turned out to be very modest. This modest impact, combined with the high production cost was one of the reasons I quit making my online course after producing two or three lectures. Nevertheless, I don’t think MOOCs are dead yet. Following are some links I recently read that provide interesting insights to MOOC production and consumption.

  • A systematic study of academic engagement in MOOCs that is scheduled for publication in the November issue of Erudit.org. This 20+ page-long survey summarizes everything we know about MOOCs today (I have to admit, I only skimmed through this paper, I didn’t read all of it)
  • A Science Magazine article from January, 2019. The article, “The MOOC pivot,” sheds light to the very low retention numbers in MOOCs.
  • On MOOCs and video lectures. Prof. Loren Barbara from George Washington University explains why her MOOCs are not built for video. If you consider creating an online class, you should read this.
  • The economic consequences of MOOCs. A concise summary of a 2018 study that suggest that MOOC’s economic impact is high despite the high churn rates.
  • Thinkful.com, an online platform that provides personalized training to aspiring data professionals, got in the news three weeks ago after being purchased for $80 million. Thinkful isn’t a MOOC per-se but I have a special relationship with it: a couple of years ago I was accepted as a mentor at Thinkful but couldn’t find time to actually mentor anyone.

The bottom line

We still don’t know how this future will look like and how MOOCs will interplay with the legacy education system but I’m sure the MOOCs are the future

Error bars in bar charts. You probably shouldn’t

This is another post in the series Because You Can. This time, I will claim that the fact that you can put error bars on a bar chart doesn’t mean you should.

It started with a paper by prof. Gerd Gigerenzer whose work in promoting numeracy I adore. The paper, “Natural frequencies improve Bayesian reasoning in simple and complex inference tasks” contained a simple graph that meant to convince the reader that natural frequencies lead to more accurate understanding (read the paper, it explains these terms). The error bars in the graph mean to convey uncertainty. However, the data visualization selection that Gigerenzer and his team selected is simply wrong.

First of all, look at the leftmost bar, it demonstrates so many problems with error bars in general, and in error bars in barplots in particular. Can you see how the error bar crosses the X-axis, implying that Task 1 might have resulted in negative percentage of correct inferences?

The irony is that Prof. Gigerenzer is a worldwide expert in communicating uncertainty. I read his book “Calculated risk” from cover to cover. Twice.

Why is this important?

Communicating uncertainty is super important. Take a look at this 2018 study with the self-explaining title “Uncertainty Visualization Influences how Humans Aggregate Discrepant Information.” From the paper: “Our study repeatedly presented two [GPS] sensor measurements with varying degrees of inconsistency to participants who indicated their best guess of the “true” value. We found that uncertainty information improves users’ estimates, especially if sensors differ largely in their associated variability”.

Image result for clinton trump polls
Source HuffPost

Also recall the surprise when Donald Trump won the presidential elections despite the fact that most of the polls predicted that Hillary Clinton had higher chances to win. Nobody cared about uncertainty, everyone saw the graphs!

Why not error bars?

Keep in mind that error bars are considered harmful, and I have a reference to support this claim. But why?

First of all, error bars tend to be symmetric (although they don’t have to) which might lead to the situation that we saw in the first example above: implying illegal values.

Secondly, error bars are “rigid”, implying that there is a certain hard threshold. Sometimes the threshold indeed exists, for example a threshold of H0 rejection. But most of the time, it doesn’t.

stacked round gold-colored coins on white surface

More specifically to bar plots, error lines break the bar analogy and are hard to read. First, let me explain the “bar analogy” part.

The thing with bar charts is that they are meant to represent physical bars. A physical bar doesn’t have soft edges and adding error lines simply breaks the visual analogy.

Another problem is that the upper part of the error line is more visible to the eye than the lower one, the one that is seen inside the physical bar. See?undefined

But that’s not all. The width of the error bars separates the error lines and makes the comparison even harder. Compare the readability of error lines in the two examples below

The proximity of the error lines in the second example (take from this site) makes the comparison easier.

Are there better alternatives?

Yes. First, I recommend reading the “Error bars considered harmful” paper that I already mentioned above. It not only explains why, but also surveys several alternatives

Nathan Yau from flowingdata.com had an extensive post about different ways to visualize uncertainty. He reviewed ranges, shades, rectangles, spaghetti charts and more.

Claus Wilke’s book “Fundamentals of Data Visualization” has a dedicated chapter to uncertainty with and even more detailed review [link].

Visualize uncertainty about the future” is a Science article that deals specifically with forecasts

Robert Kosara from Tableu experimented with visualizing uncertainty in parallel coordinates.

There are many more examples and experiments, but I think that I will stop right now.

The bottom line

Communicating uncertainty is important.

Know your tools.

Try avoiding error bars.

Bars and bars don’t combine well, therefore, try harder avoiding error bars in bar charts.

You don’t need a fast way to increase your reading speed by 25%. Or, don’t suppress subvocalization

Not long ago, I wrote a post about a fast hack that increased my reading speed by tracking the reading with a finger. I think that the logic behind using a tracking finger is to suppress subvocalization. I noticed that, at least in my case, suppressing subvocalization reduces the fun of reading. I actually enjoy hearing the inner voice that reads the book “with me”.

Bootstrapping the right way?

Many years ago, I terribly overfit a model which caused losses of a lot of shekels (a LOT). It’s not that I wasn’t aware of the potential overfitting. I was. Among other things, I used several bootstrapping simulations. It turns out that I applied the bootstrapping in a wrong way. My particular problem was that I “forgot” about confounding parameters and that I “forgot” that peeping into the future is a bad thing.

Anyhow, Yanir Seroussi, my coworker data scientist, gave a very good talk on bootstrapping.

How do I look like?

From time to time, people (mostly conference organizers) ask for a picture of mine. Feel free using any of these images

Visualizations with perceptual free-rides

Dr. Richard Brath is a data visualization expert who also blogs from time to time. Each post in Richard’s blog provides a deep, and often unexpected to me, insight into one dataviz aspect or another.

richardbrath

We create visualizations to aid viewers in making visual inferences. Different visualizations are suited to different inferences. Some visualizations offer more additional perceptual inferences over comparable visualizations. That is, the specific configuration enables additional inferences to be observed directly, without additional cognitive load. (e.g. see Gem Stapleton et al, Effective Representation of Information: Generalizing Free Rides2016).

Here’s an example from 1940, a bar chart where both bar length and width indicate data:

Walter_Weld__How_to_chart_data_1960_hathitrust2

The length of the bar (horizontally) is the percent increase in income in each industry.  Manufacturing has the biggest increase in income (18%), Contract Construction is second at 13%.

The width of the bar (vertically) is the relative size of that industry: Manufacturing is wide – it’s the biggest industry – it accounts for about 23% of all industry. Contract Construction is narrow, perhaps the third smallest industry, perhaps around 3-4%.

What’s really interesting is that

View original post 446 more words

Book review. Indistractable by Nir Eyal

Nir Eyal is known for his book “Hooked” in which he teaches how to create addictive products. In his new book “Indistractable“, Nir teaches how to live in the world full of addictive products. The book itself isn’t bad. It provides interesting information and, more importantly, practical tips and action items. Nir covers topics such as digital distraction, productivity and procrastination.

Indistractable Control Your Attention Choose Your Life Nir Eyal 3D cover

I liked the fact that the author “gives permission” to spend time on Facebook, Instagram, Youtube etc, as long as it is what you planned to do. Paraphrasing Nir, distraction isn’t distraction unless you know what it distracts you from. In other words, anything you do is a potential distraction unless you know what, why and when you are doing it.

My biggest problem with this book is that I already knew almost everything that Nir wrote. Maybe I already read too many similar books and articles, maybe I’m just that smart (not really) but for me, most of Indistractable wasn’t valuable.

Until I got to the chapter that deals with raising children (“Part 6, how to raise indistractable children”). I have to admit, when it comes to speaking about raising kids in the digital era, Nir is a refreshing voice. He doesn’t join the global hysteria of “the screens make zombies of our kids”. Moreover, Nir brings a nice collection of hysterical prophecies from the 15th, 18th and 20th centuries in which “experts” warned about the bad influence new inventions (such as printed books, affordable education, radio) had on the kids.

Another nice touch is the fact that each chapter has a short summary that consists of three-four bullet points. Even nicer is the fact that Nir copied all the “Remember this” lists at the end of the book, which is very kind of him.

The Bottom line. 4/5. Read.

14-days-work-month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar that starts with Rosh-HaShana — the Hebrew New Year*. It is a 30 days month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), first and last days of Sukkot (Feast of Tabernacles) **. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 resting days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the period between the first and the last Sukkot days are mostly considered as half working days. Also, the children are at home since all the schools and kindergartens are on vacation so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 2008 and 2023 CE, and this is what we get:

Overall, this period consists of between 15 to 17 non-working days in a single month (31 days, mind you). This is how the working/not-working time during this month looks like this:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to the constantly interrupt work day, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh’s month? I know this is confusing. That’s because we number Nissan — the month of the Exodus from Egypt as the first month.
(**)If you are an observing Jew, you should add to this list Fast of Gedalia, but we will omit it from this discussion

A fast way to increase your reading speed by 25%

I was sceptic but I tried, measured, and arrived to the conclusion. First, I set a timer to 60 seconds and read some text. I managed to read seventeen lines. Then, I used my finger to guide my eyes the same way kids do when they learn reading. It turned out that I was able to read lines of text. By simply using my finger. Impressive.

Book review: The Formula by A. L Barabasi

The bottom line: read it but use your best judgement 4/5

I recently completed reading “The Formula. The Universal Laws of Success” by Albert-László Barabási. Barabási is a network science professor who co-authored the “preferential attachment” paper (a.k.a. the Barabási-Albert model). People who follow him closely are ether vivid fabs or haters accusing him of nonsense science.

For several years, A-L Barabási is talking and writing about the “science of success” (yeah, I can hear some of my colleagues laughing right now). Recently, he summarized the research in this area in an easy-to-read book with the promising title “The Formula. The Universal Laws of Success.” The main takeaways that I took from this book are:

  • Success is about us, not about you. In other words, it doesn’t matter how hard you work and how good your work is, if “we” (i.e., the public) don’t know about it, or don’t see it, or attribute it to someone else.
  • Be known for your expertise. Talk passionately about your job. The people who talk about an idea will get the credit for it. Consider the following example from the book. Let’s say, prof. Barabasi and the Pope write a joint scientific paper. If the article is about network science, it will be perceived as if the Pope helped Barabasi with writing an essay. If, on the other hand, if it is a theosophical book, we will immediately assume that the Pope was the leading force behind it.
  • It doesn’t matter how old you are; the success can come to you at any age. It is a well-known fact that most successful people broke into success at a young age. What Barabási claims is that the reason for that is not a form of ageism but the fact that the older people try less. According to this claim, as long as you are creative and work hard, your most significant success is ahead of you.
  • Persistence pays. This is another claim that Barabasi makes in his book. It is related to the previous one but is based on a different set of observations (did you know that Harry Potter was rejected twelve times before it was published?). I must say that I’m very skeptical about this one. Right now, I don’t have the time to explain my reasons, and I promise to write a dedicated post.

Keep in mind that the author uses academic success (the Nobel prize, citation index, etc.) as the metric for most of his conclusions. This limitation doesn’t bother him, after all, Barabási is a full-time University professor, but most of us should add another grain of salt to the conclusions. 

Overall, if you find yourself thinking about your professional future, or if you are looking for a good career advice, I recommend reading this book. 

My blog in Hebrew

As much as I love thinking that I live in a global world, most people whom I know speak Hebrew. From time to time, someone would tell me “nice post, but why not in Hebrew?”. So, from now on, I will try to translate all my new posts to Hebrew. I will try. Not promising anything. My Hebrew blog lives at https://he.gorelik.net/blog-feed

Published
Categorized as blog

Pseudochart. It’s like a pseudocode but for charts

Pseudocode is an informal high-level description of the operating principle of a computer program or other algorithm. People write pseudocode to isolate the “bigger picture” of an algorithm. Pseudocode doesn’t care about the particular implementation details that are secondary to the problem, such as memory management, dealing with different encoding, etc. Writing out the pseudocode version of a function is frequently the first step in planning the implementation of complex logic.

Similarly, I use sketches when I plan non-trivial charts, and when I discuss data visualization alternatives with colleagues or students.

One can use a sheet of paper, a whiteboard, or a drawing application. You may recognize this approach as a form of “paper prototyping,” but it deserves its own term. I suggest calling such a sketch a “pseudochart”*. Like a piece of pseudocode, the purpose of a pseudochart is to show the visualization approach to the data, not the final graph itself.

* Initially, I wanted to use the term “pseudograph” but the network scientists already took it for themselves.

** The first sentence of this post is a taken from the Wikipedia.

Please leave a comment to this post

Photo by Pixabay on Pexels.com

Please leave a comment to this post. It doesn’t matter what, it can be a simple Hi or an interesting link. It doesn’t matter when or where you see it. I want to see how many real people are actually reading this blog.

close up of text
Photo by Pixabay on Pexels.com

Word Sequentialization

Martin Remy

In some ways, “data visualization” is a terrible term. It seems to reduce the construction of good charts to a mechanical procedure. It evokes the tools and methodology required to create rather than the creation itself. It’s like calling Moby-Dick a “word sequentialization” or The Starry Night a “pigment distribution.”

It also reflects an ongoing obsession in the dataviz world with process over outcomes. Visualization is merely a process. What we actually do when we make a good chart is get at some truth and move people to feel it—to see what couldn’t be seen before. To change minds. To cause action.

— Scott Berinato, Visualizations that Really Work, HBR.org.

View original post

Published
Categorized as blog

Why you should speak at conferences?

In this post, I will try to convince you that speaking at a conference is an essential tool for professional development.

Many people are afraid of public speaking, they avoid the need to speak in front of an audience and only do that when someone forces them to. This fear has deep evolutional origins (thousands of years ago, if dozens of people were staring at you that would probably mean that you were about to become their meal). However, if you work in a knowledge-based industry, your professional career can gain a lot if you force yourself to speak.

Two days ago, I spoke at NDR, a machine learning/AI conference in Iași, Romania. That was a very interesting conference, with a diverse panel of speakers from different branches of the data-related industry. However, the talk that I enjoyed the most was mine. Not because I’m a narcist self-loving egoist. What I enjoyed the most were the questions that the attendees asked me during the talk, and in the coffee breaks after it. First of all, these questions were a clear signal that my message resonated with the audience, and they cared about what I had to say. This is a nice touch to one’s ego. But more importantly, these questions pointed out that there are several topics that I need to learn to become more professional in what I’m doing. Since most of the time, we don’t know what we don’t know, such an insight is almost priceless.

That is why even (and especially) if you are afraid of public speaking, you should jump into the cold water and do it. Find a call for presentations and submit a proposal TODAY.

And if you are afraid of that awkward silence when you ask “are there any questions” and nobody reacts, you should read my post “Any Questions? How to fight the awkward silence at the end of the presentation“.

Curated list of established remote tech companies

Someone asked me about distributed companies or companies that offer remote positions. Of course, my first response was Automattic but that person didn’t think that Automattic was a good fit for them. So I googled and was surprised to discover that my colleague, Yanir Seroussi, maintains a list of companies that offer remote jobs.

I work at Automattic, one of the biggest distributed-only companies in the world (if not the biggest one). Recently, Automattic founder and CEO, Matt Mullenweg started a new podcast called (surprise) Distributed.

כוון הציר האפקי במסמכים הנכתבים מימין לשמאל

אני מחפש דוגמאות נוספות

יש לכם דוגמה של גרף עברי ״הפוך״? גרפים בערבית או פארסי? שלחו לי.

Talking about productivity methods

The best way to procrastinate is to research productivity.

Boris Gorelik

This week, the majority of Automattic Data Division meets in person in Vienna. During one of the sessions I presented my productivity method to my friends and coworkers.

Presenting this method was a fun and enjoyable experience for me. I decided to try doing this again, in a more formal and structured way. If you know of a productivity-oriented meetups that might be interested in hearing me, let me know.

Some post-talk notes

It turns out that the method I’m using much closer to Mark Forster’s “Final Version” than to his AutoFocus

During the years, Mark Forster created and tested many time management approaches. Scan through this page http://markforster.squarespace.com/tm-systems to find something that might work for you to find something that might work for you.

An interesting way to beat procrastination when working from home

Working from home (or a coffee shop, or a library) is great. However, there is one tiny problem: the temptation not to work is sometimes much bigger than the temptation in a traditional office. In the traditional office you are expected to look busy which is the first step to do an actual work. When you work from home, nobody cares if you get up to have a cup of coffee or water the plants. This is GREAT but sometimes this freedom is too much. Sometimes, you wish someone would give you that look to encourage you to keep working.

This is the exact problem that Taylor Jacobson, the founder of https://focusmate.com is trying to solve. Here’s how Focusmate works. You schedule a fifty-minutes appointment with a random partner. During the session, you and your partner have exactly sixty seconds to tell each other what you want to achieve during the next fifty minutes and then start working, keeping the camera on. At the end of t the session, you and your partner tell each other how was your session. That’s it.

I signed up for this service and participated in two such session. I really liked the result. During that hour, I had the urge to get up for a coffee, to make phone calls, etc. But the fact that I saw someone on my screen, and the fact that they saw me stopped me. The result — 50 minutes of uninterrupted work. I even didn’t check Twitter, despite the fact that my buddy couldn’t see my screen.

I heard about this service in a podcast episode that was recommended to me by my coworker Ian Dunn. Focusmate is absolutely free for now. In that podcast show, Taylor (the founder) talks about the possible business models. Interestingly, when Taylor tried to crowd-fund this project he managed to get almost five time more money than he eventually planned to ([ref]).

One more thing. This podcast show, https://productivitycast.net, looks like an interesting podcast to follow if you are interested in productivity and procrastination.

The third wave data scientist – a useful point of view

In 2019, it’s hard to find a data-related blogger who doesn’t write about the essence and the future of data science as a profession. Most of these posts (like this one for example) are mostly useless both for existing data scientists who think about their professional plans and for people who consider data science as their career.

Today I saw yet another post which I find very useful. In this post, Dominik Haitz identifies a “third wave data scientist.” In Dominik’s opinion, a successful data scientist has to combine four features: (1) Business mindset (2) Software engineering craftsmanship (3) Statistics and algorithmic toolbox, and (4) Soft skills. In Dominik’s classification, the business mindset is not “another skill” but the central pillar.

The professional challenges that I have been facing during the past eighteen months or so, made me realize the importance of points 1, 2, and 3 from Dominik’s list (number 4 was already very important on my personal list). However, it took reading his post to put the puzzle parts in place.

Dominik’s additional contribution to the discussion is ditching the famous data science Venn Diagram in favor of another, “business-oriented” visual which I used as the “featured image” to this post.

Painting: sailors in a wavy sea
A fragment from an 1850 painting by the Russian Armenian marine painter Ivan Aivazovsky named “The Ninth Wave.” I wonder what the “ninth wave data scientist” will be.

To specialize, or not to specialize, that is the data scientists’ question

In my last post on data science career, I heavily promoted the idea that a data scientist needs to find his or her specialization. I back my opinion with my experience and by citing other people opinions. However, keep in mind that I am not a career advisor, I never surveyed the job market, and I might not know what I’m talking about. Moreover, despite the fact that I advocate for specialization, I think that I am more of a generalist.

Since I published the last post, I was pointed to some other posts and articles that either support or contradict my point of view. The most interesting ones are: “Why you shouldn’t be a data science generalist” and “Why Data Science Teams Need Generalists, Not Specialists“, both are very recent and very articulated but promote different points of view. Go figure

The featured image is based on a photo by Tom Parsons on Unsplash

The data science umbrella or should you study data science as a career move (the 2019 edition)?

TL/DR: Studying data science is OK as long as you know that it’s only a starting point.

Almost two years ago, I wrote a post titled “Don’t study data science as a career move.” Even today, this post is the most visited post on my blog. I was reminded about this post a couple of days ago during a team meeting in which we discussed what does a “data scientist” mean today. I re-read my original post, and I think that I was generally right, but there is a but…

The term “data science” was born as an umbrella term that meant to describe people who know programming, statistics, and business logic. We all saw those numerous Venn diagrams that tried to describe the perfect data scientist. Since its beginning, the field of “data science” has finally matured. There are more and more people that question the mere definition of data science.

Here’s what an entrepreneur Chuck Russel has to say:

Now don’t get me wrong — some of these folks are legit Data Scientists but the majority is not. I guess I’m a purist –calling yourself a scientist indicates that you practice science following a scientific method. You create hypotheses, test the hypothesis with experimental results and after proving or disproving the conjecture move on or iterate.

Screenshot of a Google image search showing many Venn diagrams
There can’t be enough Venn diagrams

Now, “create and test hypotheses” is a very vague requirement. After all, any A/B test is a process of “creating and testing hypotheses” using data. Is anyone who performs A/B tests a data scientist? I think not.
Moreover, a couple of years ago, if you wanted to run an A/B test, perform a regression analysis, build a classifier, you would have to write numerous lines of code, debug and tune it. This tedious and intriguing process certainly felt very “sciency,” and if it worked, you would have been very proud of our job. Today, on the other hand, we are lucky to have general-purpose tools that require less and less coding. I don’t remember the last time I had to implement an analysis or an algorithm from the first principles. With the vast amount of verified tools and libraries, writing an algorithm from scratch feels like a huge waste of time.
On the other hand, I spend more and more time trying to understand the “business logic” that I try to improve: why has this test fail? Who will use this algorithm and what will make them like the results? Does effort justify the potential improvement?

I (a data scientist) have all this extra time to think of a business logic thanks to the huge arsenal of generalized tools to choose from. These tools were created mostly by those data scientists whose primary job is to implement, verify, and tune algorithms. My job and the job of these data scientists is different and requires different sets of skills.

There is another ever-growing group of professionals who work hard to make sure someone can apply all those algorithms to any amount of data they feel suitable. These people know that any model is at most as good as the data it is based on. Therefore, they build systems that deliver the right information on time, distribute the data among computation nodes, and make sure no crazy “scientist” sends a production server to a non-responsive state due to a bad choice of parameters. We already have a term for professionals whose job is to build fail-proof systems. We call them engineers, or “data engineers” in this case.

The bottom line

Up till now, I mentioned three major activities that used to be covered by the data science umbrella: building new algorithms, applying algorithms to business logic, and engineering reliable data systems. I’m sure there are other areas under that umbrella that I forgot. In 2019, we reached the point where one has to decide what field of data science does one want to practice. If you consider stying data science think of it as studying medicine. The vast majority of physicians don’t end up general practitioners but rather invest at least five more years of their lives professionalize. Treat your data science studies as an entry ticket into the life-long learning process, and you’ll be OK. Otherwise, (I’m citing myself here): You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.

PS. Here’s a one-week-old article on Forbes.com with very similar theses: link.

Please leave a comment to this post

Photo by Pixabay on Pexels.com

Please leave a comment to this post. It doesn’t matter what, it can be a simple Hi or an interesting link. It doesn’t matter when or where you see it. I want to see how many real people are actually reading this blog.

close up of text
Photo by Pixabay on Pexels.com

Chișinău Jewish cemetery

Two years ago I visited Chișinău (Kishinev), the city in Moldova where I was born and where I grew up until the age of fifteen. Today I saw a post with photos from the ancient Chișinău Jewish cemetery and recalled that I too, took many pictures from that sad place. Less than half of the original cemetery survived to these days. The bigger part of it was demolished in the 1960s in favor of a park and a residential area. If you scroll through the pictures below, you will be able to see how they used tombstones to build the park walls.

Another notable feature of many Jewish cemeteries is memorial plates in memoriam of the relatives who don’t have their own graves — the relatives who were murdered over the course of the Jewish history.

בניית אתרים עם תמיכה בארץ

מדי פעם אנשים ששומעים שאני עובד בחברה שמפעילה את וורדפרקס.קום מבקשים ממני עזרה אם בניית האתר שלהם. אני חוקר נתונים, לא בונה אתרים. ברור שהחברה בה אני עובד עושה המון מאמצים כדי לאפשר לאנשים לבנות אתרים בעצמם, אבל לפעםמים אנשים צריכים להאציל את הסמכות הזאת למומחים, רוצים גמישות ושליטה וגם תמיכה. אני מכיר אישית את דידי אריאלי מהאתר ״קליקי בניית אתרים״ שעושה בדיוק את זה: בנייה ותחזוקת אתרים מותאמים אישית. מה שנחמד הוא שדידי נאמן לעקרונות הקוד הפתוח: הלקוח לא קשור אליו ושומר על השליטה בתוכן ובקוד של האתר.

דרך אגב, באתר של ״קליקי״ יש גם בלוג עם פרטי מידע שימושיים לבוני האתרים בוורדפרס

נ.ב. אני מכיר את דידי אישית אבל אין לי אתו קשרי עסקים. אני לא מרוויח שום דבר מהפוסט הזה.

Published
Categorized as blog

How to Increase Retention and Revenue in 1,000 Nontrivial Steps

The journey of a thousand miles begins with one step. My coworker, Yanir Seroussi, wrote about the work of data scientists in the marketing team.

Data for Breakfast

Recently, Automattic created a Marketing Data team to support marketing efforts with dedicated data capabilities. As we got started, one important question loomed for me and my teammate Demet Dagdelen: What should we data scientists do as part of this team?

Even though the term data science has been heavily used in the past few years, its meaning still lacks clarity. My current definition for data science is: “a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.” This is a very broad definition that offers a vague direction for what marketing data scientists should do. Indeed, many ideas for data science work were thrown around when the team was formed. Because Demet and I wanted our work to be proactive and influential, we suggested a long-term marketing data science…

View original post 2,068 more words

Published
Categorized as blog

On procrastination, or why too good can be bad

I’m a terrible procrastinator. A couple of years ago, I installed RescueTime to fight this procrastination. The idea behind RescueTime is simple — it tracks the sites you visit and the application you use and classifies them according to how productive you are. Using this information, RescueTime provides a regular report of your productivity. You can also trigger the productivity mode, in which RescueTime will block all the distractive sites such as Facebook, Twitter, news sites, etc. You can also configure RescueTime to trigger this mode according to different settings. This sounded like a killer feature for me and was the main reason behind my decision to purchase a RescueTime subscription. Yesterday, I realized how wrong I was.

RescueTime logo

When I installed RescueTime, I was full of good intentions. That is why I configured it to block all the distractive sites for one hour every time I accumulate more than 10 minutes of surfing such sites. However, from time to time, I managed to find a good excuse to procrastinate. Although RescueTime allows you to open a “bad” site after a certain delay, I found this delay annoying and ended up killing the RescueTime process (killing a process is faster than temporary disabling a filter). As a result, most of my workday stayed untracked, unmonitored, and unfiltered.

So, I decided to end this absurd situation. As of today, RescueTime will never block any sites. Instead of blocking, I configured it to show a reminder and to open my RescueTime dashboard, as a reminder to behave myself. I don’t know whether this non-intrusive reminder will be effective or not but at least I will have correct information about my day.

“Why it burns when you P” and other statistics rants

“Sunday grumpiness” is an SFW translation of Hebrew phrase that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.

Why it burns when you P” is a five-years-old rant about P values. It’s funny, informative and easy to read

Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in a form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”

Everything wrong with statistics (and how to fix it)” is a one-hour video lecture by Dr. Kristin Lennox who talks about the same problems. I saw this video, and two more talks by Dr. Lennox on a flight I highly recommend all her videos on YouTube.

Do You Hate Statistics as Much as Everyone Else?” — A Natan Yau’s (from flowingdata.com) attempt to get thoughtful comments from his knowledgeable readers.

This list will not be complete without the classics:

Why Most Published Research Findings Are False“, “Mindless Statistics“, and “Cargo Cult Science“. If you haven’t read these three pieces of wisdom, you absolutely should, they will change the way you look at numbers and research.

*The literal meaning of שביזות יום א is Sunday dick-brokenness.

Published
Categorized as blog

I have 101 followers!

Yesterday, the follower list of my blog exceeded one hundred followers! Even though I know that some of these followers are bots, this number makes me happy! Thank you all (humans and bots) for clicking the “follow” button.

A Brand Image Analysis of WordPress and Automattic on Twitter

My coworker analyzed Twitter social network around Automattic, WordPress, and other related projects.

Data for Breakfast

As a data scientist, I spend a lot of time analyzing how our users interact with WordPress.com. However, WordPress.com isn’t the only place to gain insight into how people use and talk about our services. Many WordPress.org and WordPress.com discussions take place on social media. Analyzing these discussions can help us understand what our users are saying about WordPress[*] and Automattic, the topics closely associated with our services, and who is leading these discussions.

In every social network, there are people who steer the topic and sentiment of the conversation. These influencers usually have large followings and are positioned centrally within the network. Brands often reach out to influencers to organize focus groups or invite them to events, since they’re usually knowledgeable about the brand and can offer insight into how consumers use the product and potential improvements.

At Automattic, we don’t do traditional influencer marketing. However, since the discussions…

View original post 1,060 more words

Published
Categorized as blog

Against A/B tests

Traditional A/B testsing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherentlyinferior to choosing a targeted mix of A and B.

Michael Kaminsky locallyoptimistic.com

The quote above is from a post by Michael Kaminsky “Against A/B tests“. I’m still not fully convinced by Michael’s thesis but it is very interesting and thought-provoking. 

Links Worth Sharing: What Makes People Successful

Data for Breakfast

Boris Gorelik

The renown network scientist, Albert-László Barabási, has been applying scientific methods to study the factors that make people successful. Science has published an intriguing paper called Quantifying reputation and success in art written by Prof. Barabási and his collaborators. Prof. Barabási talks about the findings of his research in an interview with The HumanCurrent podcast.

(The featured image is a portion from Figure 1 in Fraiberger et al., Science 10.1126/science.aau7224 (2018)).

View original post

Published
Categorized as blog

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “is a nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in the Israeli society. This is a fascinating report. I recommend reading without any connection to data visualization.

But this post does not deal with the Isreali society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course the can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most of the cases, such a transformation would make a perfect sense. In most of the cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report that provides a broad overview of the Israeli society. The remaining of this report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigating through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables will be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided additional information layers that make the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Dumout. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why adding text labels to graph lines, when you can build graph lines using text labels? On microtext lines

richardbrath

Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?

unemployment_plain

Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifing visual attention…

View original post 1,323 more words

On the importance of perspective

Stalin was a relatively short man, his height was 1.65 m. Khrushchev was even shorter, his height was 1.60. It seems that the difference wasn’t enough for the official Soviet propaganda of that time. Take a look at this photo. We can clearly see that Stalin is taller than Khrushchev.

stalin.png

Do you notice something strange? Take a look at the windows in the background. I added horizontal and vertical guides for your convenience.

Screen Shot 2018-11-05 at 8.38.08

Now, look what happens when we fix the horizontal and vertical lines

Screen Shot 2018-11-05 at 8.39.03

Now, Khrushchev is still shorter than Stalin but not by that much.

איך אומרים דאטה ויזואליזיישן בעברית?

This post is written in Hebrew about a Hebrew issue. I won’t translate it to English.

אני מלמד data visualization בשתי מכללות בישראלבמכללת עזריאלי להנדסה בירושלים ובמכון הטכנולוגי בחולון. כשכתבתי את הסילבוס הראשון שלי הייתי צריך למצוא מונח ל־data visualization וכתבתיהדמיית נתונים״ אומנם זה הזכיר לי קצת תהליך של סימולציה, אבל האופציה האחרת ששקלתי היתה ״דימות״ וידעתי שהיא שמורה ל־imaging, דהיינו תהליך של יצירת דמות או צורה של עצם, בעיקר בעולם הרפואה.

הבנתי שהמונח בעייתי בשיעור הראשון שהעברתי. מסתברששניים מארבעת הסטודנטים שהגיעו לשיעור חשבו שקורס ״הדמיית נתונים בתהליך מחקר ופיתוח״ מדבר על סימולציות.

מתישהו שמעתי מחבר של חבר שהמונח הנכון ל־visualization זה הדמאה, אבל זה נשמע לי פלצני מדי, אז השארתי את ה־״הדמיה״ בשם הקורס והוספתי “data visualization” בסוגריים.

היום, שלוש שנים אחרי ההרצאה הראשונה שהעברתי, ויומיים לפני פתיחת הסמסטר הבא, החלטתי לגגל (יש מילה כזאת? יש!) את התשובה. ומה מסתבר? עלון ״למד לשונך״ מס׳ 109 של האקדמיה ללשון עברית שיצא לאור בשנת 2015 קובע שהמונח ל־visualization הוא הַחְזָיָה. לא יודע מה אתכם, אבל אני לא משתגע על החזיה. עוד משהו שאני לא משתגע עליו הוא שבתור הדוגמא להחזיה, האקדמיה החלטיה לשים תרשים עוגה עם כל כך הרבה שגיאות!

Screen Shot 2018-10-23 at 20.35.52

נראה לי שאני אשאר עם הדמיה. ויקימילון מרשה לי.

נ.ב. שמתם לב שפוסט זה השתמשתי במקף עברי? אני מאוד אוהב את המקף העברי.

Innumeracy

Innumeracy is “inability to deal comfortably with the fundamental notions of number and chance”.
I which there was a better term for “innumeracy”, a term that would reflect the importance of analyzing risks, uncertainty, and chance. Unfortunately, I can’t find such a term. Nevertheless, the problem is huge. In this long post, Tom Breur reviews many important aspects of “numeracy”.

Data, Analytics and beyond

Tom Breur

21 October 2018

It has long been known that the general public is sometimes remarkably out of tune with math and numbers. In 1988 mathematician John Allan Paulos wrote a classic “Innumeracy” that is chockful of striking examples of misinterpretation of numeric evidence. Paulos refers to innumeracy as “… inability to deal comfortably with the fundamental notions of number and chance …” Personally, I consider it the mathematical equivalent to illiteracy. Another classic from Paulos is “A Mathematician Reads the Newspaper” (1995) which contains a lot of satire, debunking ridiculous claims in the press. It highlights more spectacular examples of innumeracy.

Paulos illustrates innumeracy with lighthearted anecdotes and many common, everyday scenarios. These examples highlight how readers might be fooled by misleading quantitative evidence. His examples span diverse topics like probability and coincidence, misguessing extremely small or very large numbers, pseudoscience and superstition…

View original post 1,450 more words

Published
Categorized as blog

Working Remotely and the Virtue of Aggressive Transparency

Excellent post by my colleague Simon Ouderkirk on working in a distributed company. It’s a three-year-old post. I wonder how I missed it.

Simon Ouderkirk

public-domain-images-free-stock-photos-bicycle-bike-black-and-white

One of the things that it has taken me quite a long time to figure out, when it comes to this remote work gig, is this idea I’ve taken to calling aggressive transparency.

I’ve been chewing on this idea quite a lot, and in chatting with my team and other folks whose opinions I respect, I think I’m starting to feel like it’s something I should articulate in greater detail.

View original post 1,077 more words

Published
Categorized as blog

Data visualization in right-to-left languages

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does.

Right-to-left (RTL) languages such as Hebrew, Arabic, and Farsi are used by roughly 1.8 billion people around the world. Many of them consume data in their native languages. Nevertheless, I have never seen any research or study that explores data visualization in RTL languages. Until a couple of days ago, when I saw this interesting observation by Nick Doiron “Charts when you read right-to-left“.

I teach data visualization in Israeli colleges. Whenever a student asks me RTL-related questions, I always answer something like “it’s complicated, let’s not deal with that”. Moreover, in the assignments, I even allow my students to submit graphs in English, even if they write the report in Hebrew.

Nick’s post made me wonder about data visualization do’s and don’ts in RTL environments. Should Hebrew charts differ from Arabic or Farsi? What are the accepted practices?

If you speak Arabic or Farsi, I need your help. If you don’t speak, share this post with someone who does. I want to collect as many examples of data visualization in RTL languages. Links to research articles are more than welcome. You can leave your comments here or send them to boris@gorelik.net.

Thank you.

The image at the top of this post is a modified version of a graph that appears in the post that I cite. Unfortunately, I wasn’t able to find the original publication.

Can error correction cause more error? (The answer is yes)

This is an interesting thought experiment. Suppose that you have some appliance that acts in a normally distributed way. For example, a nerf gun. Let’s say now that you aim and fire the gun. What happens if you miss by some amount of X? Should you correct your aim in the opposite direction? My intuition says “yes.” So does the intuition of many other people with whom I talked about this problem. However, when we start thinking about this problem, we realize that the intuition is wrong. Since we aim the gun, our assumption should be that the deviation is zero. A single observation is not sufficient to reject this assumption. By continually adjusting the data generating process based on a single observation, we reduce the precision (increase the dispersion).
Below is a simulation of adjusted and non-adjusted processes (the code is here). The broader spread of the adjusted data (blue line) is evident.

Two curves. Blues: high dispersion of values when adjustments are performed after every observation. Orange: smaller dispersion when no adjustments are done.

Due to the nature of the normal random variable, a single large accidental deviation can cause an extreme “correction,” which in turn will create a prolonged period of highly inaccurate points. This is precisely what you see in my simulation.
The moral of this simple experiment is that you shouldn’t let a single affect your actions.

 

Me

Published
Categorized as blog Tagged

“Any questions?” How to fight the awkward silence at the end of a presentation?

If you ever gave or attended a presentation, you are familiar with this situation: the presenter asks whether there are any questions and … nobody asks anything. This is an awkward situation. Why aren’t there any questions? Is it because everything is clear? Not likely. Everything is never clear. Is it because nobody cares? Well, maybe. There are certainly many people that don’t care. It’s a fact of life. Study your audience, work hard to make the presentation relevant and exciting but still, some people won’t care. Deal with it.

However, the bigger reasons for lack of the questions are human laziness and the fear of being stupid. Nobody likes asking a question that someone will perceive as a stupid one. Sometimes, some people don’t mind asking a question but are embarrassed and prefer not being the first one to break the silence.

What can you do? Usually, I prepare one or two questions by myself. In this case, if nobody asks anything, I say something like “Some people, when they see these results ask me whether it is possible to scale this method to larger sets.”. Then, depending on how confident you are, you may provide the answer or ask “What do you think?”.

You can even prepare a slide that answers your question. In the screenshot below, you may see the slide deck of the presentation I gave in Trento. The blue slide at the end of the deck is the final slide, where I thank the audience for the attention and ask whether there are any questions.

My plan was that if nobody asks me anything, I would say “Thank you again. If you want to learn more practical advises about data visualization, watch the recording of my tutorial, where I present this method  <SLIDE TRANSFER, show the mockup of the “book”>. Also, many people ask me about reading suggestions, this is what I suggest you read: <SLIDE TRANSFER, show the reading pointers>

Screen Shot 2018-09-17 at 10.10.21

Luckily for me, there were questions after my talk. Luckily, one of these questions was about practical advice so I had a perfect excuse to show the next, pre-prepared, slide. Watch this moment on YouTube here.

Graphing Highly Skewed Data – Tom Hopper

My colleague, Chares Earl, pointed me to this interesting 2010 post that explores different ways to visualize categories of drastically different sizes.

The post author, Tom Hopper, experiments with different ways to deal with “Data Giraffes”. Some of his experiments are really interesting (such as splitting the graph area). In one experiment, Tom Hopper draws bar chart on a log scale. Doing so is considered as a bad practice. Bar charts value (Y) axis must include meaningful zero, which log scale can’t have by its definition.

Other than that, a good read Graphing Highly Skewed Data – Tom Hopper