How to Increase Retention and Revenue in 1,000 Nontrivial Steps

The journey of a thousand miles begins with one step. My coworker, Yanir Seroussi, wrote about the work of data scientists in the marketing team.

Data for Breakfast

Recently, Automattic created a Marketing Data team to support marketing efforts with dedicated data capabilities. As we got started, one important question loomed for me and my teammate Demet Dagdelen: What should we data scientists do as part of this team?

Even though the term data science has been heavily used in the past few years, its meaning still lacks clarity. My current definition for data science is: “a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.” This is a very broad definition that offers a vague direction for what marketing data scientists should do. Indeed, many ideas for data science work were thrown around when the team was formed. Because Demet and I wanted our work to be proactive and influential, we suggested a long-term marketing data science…

View original post 2,068 more words

On procrastination, or why too good can be bad

Illustration: a plastic gun and letters that say "bang"

I’m a terrible procrastinator. A couple of years ago, I installed RescueTime to fight this procrastination. The idea behind RescueTime is simple — it tracks the sites you visit and the application you use and classifies them according to how productive you are. Using this information, RescueTime provides a regular report of your productivity. You can also trigger the productivity mode, in which RescueTime will block all the distractive sites such as Facebook, Twitter, news sites, etc. You can also configure RescueTime to trigger this mode according to different settings. This sounded like a killer feature for me and was the main reason behind my decision to purchase a RescueTime subscription. Yesterday, I realized how wrong I was.

RescueTime logo

When I installed RescueTime, I was full of good intentions. That is why I configured it to block all the distractive sites for one hour every time I accumulate more than 10 minutes of surfing such sites. However, from time to time, I managed to find a good excuse to procrastinate. Although RescueTime allows you to open a “bad” site after a certain delay, I found this delay annoying and ended up killing the RescueTime process (killing a process is faster than temporary disabling a filter). As a result, most of my workday stayed untracked, unmonitored, and unfiltered.

So, I decided to end this absurd situation. As of today, RescueTime will never block any sites. Instead of blocking, I configured it to show a reminder and to open my RescueTime dashboard, as a reminder to behave myself. I don’t know whether this non-intrusive reminder will be effective or not but at least I will have correct information about my day.

“Why it burns when you P” and other statistics rants

Illustration: computer terminal window that says "p-value=0.55001 is GREATER than 0.05"

“Sunday grumpiness” is an SFW translation of Hebrew phrase that describes the most common state of mind people experience on their first work weekday. My grumpiness causes procrastination. Today, I tried to steer this procrastination to something more productive, so I searched for some statistics-related terms and stumbled upon a couple of interesting links in which people bitch about p-values.

Why it burns when you P” is a five-years-old rant about P values. It’s funny, informative and easy to read

Everything Wrong With P-Values Under One Roof” is a recent rant about p-values written in a form of a scientific paper. William M. Briggs, the author of this paper, ends it with an encouraging statement: “No, confidence intervals are not better. That for another day.”

Everything wrong with statistics (and how to fix it)” is a one-hour video lecture by Dr. Kristin Lennox who talks about the same problems. I saw this video, and two more talks by Dr. Lennox on a flight I highly recommend all her videos on YouTube.

Do You Hate Statistics as Much as Everyone Else?” — A Natan Yau’s (from attempt to get thoughtful comments from his knowledgeable readers.

This list will not be complete without the classics:

Why Most Published Research Findings Are False“, “Mindless Statistics“, and “Cargo Cult Science“. If you haven’t read these three pieces of wisdom, you absolutely should, they will change the way you look at numbers and research.

*The literal meaning of שביזות יום א is Sunday dick-brokenness.

Hackers beware: Bootstrap sampling may be harmful

Anything is better when bootstrapped. Read my co-worker’s post on bootstrapping. Also make sure following the links Yanir gives to support his claims

Yanir Seroussi

Bootstrap sampling techniques are very appealing, as they don’t require knowing much about statistics and opaque formulas. Instead, all one needs to do is resample the given data many times, and calculate the desired statistics. Therefore, bootstrapping has been promoted as an easy way of modelling uncertainty to hackers who don’t have much statistical knowledge. For example, the main thesis of the excellent Statistics for Hackers talk by Jake VanderPlas is: “If you can write a for-loop, you can do statistics”. Similar ground was covered by Erik Bernhardsson in The Hacker’s Guide to Uncertainty Estimates, which provides more use cases for bootstrapping (with code examples). However, I’ve learned in the past few weeks that there are quite a few pitfalls in bootstrapping. Much of what I’ve learned is summarised in a paper titled What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum by Tim…

View original post 1,462 more words

A Brand Image Analysis of WordPress and Automattic on Twitter

My coworker analyzed Twitter social network around Automattic, WordPress, and other related projects.

Data for Breakfast

As a data scientist, I spend a lot of time analyzing how our users interact with However, isn’t the only place to gain insight into how people use and talk about our services. Many and discussions take place on social media. Analyzing these discussions can help us understand what our users are saying about WordPress[*] and Automattic, the topics closely associated with our services, and who is leading these discussions.

In every social network, there are people who steer the topic and sentiment of the conversation. These influencers usually have large followings and are positioned centrally within the network. Brands often reach out to influencers to organize focus groups or invite them to events, since they’re usually knowledgeable about the brand and can offer insight into how consumers use the product and potential improvements.

At Automattic, we don’t do traditional influencer marketing. However, since the discussions…

View original post 1,060 more words

Against A/B tests

Traditional A/B testsing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherentlyinferior to choosing a targeted mix of A and B.

Michael Kaminsky

The quote above is from a post by Michael Kaminsky “Against A/B tests“. I’m still not fully convinced by Michael’s thesis but it is very interesting and thought-provoking. 

Links Worth Sharing: What Makes People Successful

Data for Breakfast

Boris Gorelik

The renown network scientist, Albert-László Barabási, has been applying scientific methods to study the factors that make people successful. Science has published an intriguing paper called Quantifying reputation and success in art written by Prof. Barabási and his collaborators. Prof. Barabási talks about the findings of his research in an interview with The HumanCurrent podcast.

(The featured image is a portion from Figure 1 in Fraiberger et al., Science 10.1126/science.aau7224 (2018)).

View original post

Useful redundancy — when using colors is not completely useless

The maximum data-ink ratio principle implies that one should not use colors in their graphs if the graph is understandable without the colors. The fact that you can do something, such as adding colors, doesn’t mean you should do it. I know it. I even have a dedicated tag on this blog for that. Sometimes, however, consistent use of colors serves as a useful navigation tool in a long discussion. Keep reading to learn about the justified use of colors.

Pew Research Center is a “is a nonpartisan American fact tank based in Washington, D.C. It provides information on social issues, public opinion, and demographic trends shaping the United States and the world.” Recently, I read a report prepared by the Pew Center on the religious divide in the Israeli society. This is a fascinating report. I recommend reading without any connection to data visualization.

But this post does not deal with the Isreali society but with graphs and colors.

Look at the first chart in that report. You may see a tidy pie chart with several colored segments. 

Pie chart: Religious composition of Israeli society. The chart uses several colored segments

Aha! Can’t they use a single color without losing the details? Of course the can! A monochrome pie chart would contain the same information:

Pie chart: Religious composition of Israeli society. The chart uses monochrome segments

In most of the cases, such a transformation would make a perfect sense. In most of the cases, but not in this report. This report is a multipage research document packed with many facts and analyses. The pie chart above is the first graph in that report that provides a broad overview of the Israeli society. The remaining of this report is dedicated to the relationships between and within the groups represented by the colorful segments in that pie chart. To help the reader navigating through this long report, its authors use a consistent color scheme that anchors every subsequent graph to the relevant sections of the original pie chart.

All these graphs and tables will be readable without the use of colors. Despite the fact that the colors here are redundant, this is a useful redundancy. By using the colors, the authors provided additional information layers that make the navigation within the document easier. I learned about the concept of useful redundancy from “Trees, Maps, and Theorems” by Jean-luc Dumout. If you can only read one book about data communication, it should be this book.

Microtext Line Charts

Why adding text labels to graph lines, when you can build graph lines using text labels? On microtext lines


Tangled Lines

Line charts are a staple of data visualization. They’ve existed at least since William Playfair and possibly earlier. Like many charts, they can be very powerful and also have their limitations. One limitation is the number of lines that can be displayed. One line works well: you can see trend, volatility, highs, lows, reversals. Two lines provides opportunity for comparison. 5 lines might be getting crowded. 10 lines and you’re starting to run out of colors. But what if the task is to compare across a peer group of 30 or 40 items? Lines get jumbled, there aren’t enough discrete colors, legends can’t clearly distinguish between them. Consider this example looking at unemployment across 37 countries from the OECD: which country had the lowest unemployment in 2010?


Tooltips are an obvious way to solve this, but tooltips have problems – they are much slower than just shifing visual attention…

View original post 1,323 more words