This week, Sirin, Boris, and Demet have some recommended reading for you in the fields of descriptive data analysis, machine learning, and ethics in artificial intelligence. Have you recently read anything thought-provoking in the field of data science? Written anything thought-provoking? Be sure to comment and share your recommendations with us.
Seth Stephens-Davidowitz studies publicly available, anonymous Google Search data. His work reveals prejudices and sheds light on aspects of demography that are hard to tackle with surveys. It’s a long yet captivating read and a great example of data storytelling that shows how insightful descriptive data analysis can be. It’s also deeply infuriating because, among other things, his work implies that open racism and biases against girls are widespread.
Almost half a year ago, I decided to create an online data visualization course. After investing hundreds of hours, I managed to release the first lecture and record another one. However, I decided not to publish new lectures and to remove the existing one from the net. Why? The short answer is a huge cost-to-benefit ratio. For a longer answer, you will have to keep reading this post.
Why create a course?
It’s not that there are no good courses. There are. However, most of them are tightly coupled with one tool or another. Moreover, many of the courses I have reviewed online act as advanced tutorials for a certain plotting tool. The course that I wanted to create was supposed to be tool-neutral, full of theoretical knowledge and bits of practical advice. Another decision that I made was not to write a set of text files (online book, set of Jupyter notebooks, whatever) but to create a course in which the majority of the knowledge is brought to the audience by means of frontal video lectures. I assumed that this kind of format would be the easiest for the audience to consume.
What went wrong?
So, what went wrong? First of all, you should remember that I work full time at Automattic, which means that every side project is a … side project that I have to do during my free time. I realized that from the very beginning. However, since I already teach data visualization in different institutions in Israel, I already have a well-formed syllabus with accompanying slide decks full of examples. I assumed that it would take me no more than one hour to prepare every online lecture.
So, instead of verifying this assumption, I started solving the technical problems, such as buying a nice microphone (which turned out to be crap), tripods, building a green room in my home office, etc. Once I was satisfied with my technical setup, I decided to record a promo video. Here, I faced a big problem. You see, talking to people and talking to the camera are completely different things. I feel pretty comfortable talking to people, but when I face the camera, I almost freeze. Also, in person-to-person communication, we are somewhat tolerant of small stutters and longish pauses. However, when watching recorded video clips, we expect television-quality narration. It turns out that achieving this kind of narration is very hard. Add the fact that English is my third language, and you get a huge time drain. To be able to record a two-minute promo video, I had to write the entire script, rehearse it a dozen times, and record it in front of a teleprompter. The filming session alone took around half an hour, as I had to repeat almost every line, time after time.
Preparing slide decks for the lectures wasn’t an easy task either. Despite the fact that I had pretty good slide decks, I realized that they were good for an in-class lecture, where I can point to the screen, go back and forth within a presentation, open external URLs, etc. Once I had my slide decks ready, I faced the narration problem once again. So, I had to write the entire lesson’s script, edit it, rehearse for several days, and shoot. At this point, I became frustrated. I might have been more motivated had my first video received some real traffic. However, with 18 (that’s eighteen) views, most of which lasted no more than a minute or two, I hardly felt like a YouTube superstar. I know that it’s impossible to get real traction in such a short period without massive promotion. However, after I completed shooting the second lecture, I realized that I would not be able to do it much longer. Not without quitting my day job. So, I decided to quit.
Since I already have pretty good texts for the first two lectures, I might be able to convert them to posts in this blog. I also have material for some before-and-after videos that I planned to have as a part of this course. I will convert them to posts, too, similar to this post on the data.blog.
Was it worth it?
It certainly was! During the preparations, I learned a lot. I learned new things about data visualization. I took a glimpse into the world of video production. I had a chance to restructure several of my presentations.
Featured image for this post by Nicolas Nova under the CC-by license.
If you create charts using your tool’s default settings and your intuition, chances are you’re doing it wrong.
Let me present an online course that dives into the theory of data visualization and its practical aspects. Every lecture is accompanied by before & after case studies and learner assignments. The course is tool-neutral. It doesn’t matter if you use Python, R, Excel, or pen and paper.
The first lecture will be published on July 7th. Future lectures will follow every two weeks. Meanwhile, you may visit the course page and watch the intro video. Follow this blog so that you don’t miss new lectures!
Please spread the word! Reblog this post, share it on Twitter (@gorelik_boris), Facebook, LinkedIn or any other network. Tell your colleagues and friends about this course. The more learners take this course, the happier I will be.
A couple of days ago, I read the excellent post by Bob Rudis about data ethics and the importance of keeping users’ data safe. In this post, Bob recited the mantra I have heard for the past several years that “data is the new gold.” Comparing something to gold implies that it is scarce, unchangeable and has zero utility value. Data is none of these: it’s ubiquitous, ever-changing and has some utility value of its own.
I think that oil (petroleum) is a better analogy for data. Much like oil, data has some utility value by itself but is most valuable when properly distilled, processed and transformed.
Regardless of the analogy, I highly recommend reading Bob Rudis’ post.
I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap. The project goal is noble: crowdsource and make a repository of open speech data for…
Being highly professional, many data scientists strive toward the best results possible from a practical perspective. However, let’s face it, in many cases, nobody cares about the neat and elegant models you’ve built. In these cases, fast deployment is pivotal for the adoption of your work — especially if you’re the only one who’s aware of the problem you’re trying to solve.
This is exactly the situation in which I recently found myself. I had the opportunity to touch an unutilized source of complex data, but I knew that I only had a limited time to demonstrate the utility of this data source. While working, I realized it’s not enough that people KNOW about the solution, I had to make sure that people would NEED it. That is why I sacrificed modeling accuracy to create the simplest solution possible. I also had to create a RESTful API server, a visualization…
March 2019: Two years after the completion of this post I wrote a follow-up. Read it here.
January 2020: Three years after the completion of this post, I realized that I had written a whole bunch of career advice. Make sure you check this link that collects everything that I have to say about becoming a data scientist.
No, this account wasn’t hacked. I really think that studying data science to advance your career is wasting your time. Briefly, my thesis is as follows:
Data science is a term coined to bridge between problems and experts.
The current shortage of data scientists will go away, as more and more general purpose tools are developed.
When this happens, you’d better be an expert in the underlying domain, or in the research methods. The many programs that exist today are too shallow to provide any of these.
I am a pharmacist. I am interested in becoming a data scientist. My interests are pharmacoeconomics and other areas of health economics. What do I need to study to become a data scientist?
To answer this question, I described how I gradually transformed from a pharmacist to a data scientist by continuous adaptation to the new challenges of my professional career. In the end, I invited anyone to ask personal questions via e-mail (it’s firstname.lastname@example.org). Two days ago, I received a follow-up question:
I would like to know how to learn data science. Would you suggest a master’s degree in analytics? Or is there another way to add “data scientist” label on my resume?
Here’s my answer that will explain why, in my opinion, studying data science won’t give you job security.
Data scientists are real. Data science isn’t.
I think that while “data scientists” are real, “data science” isn’t. We, the data scientists, analyze data using the scientific methods we know and using the tools we mastered. The term “data scientist” was coined about five years ago for the job market. It was meant to help bring the expertise and the positions together. How else would you describe a person who knows scientific analysis and machine learning, writes computer code, and isn’t too abstract a thinker to understand the business needs of a company? Before “data scientist,” there was a less catchy “dataist” (http://www.dataists.com/). However, “data scientist” sounded better. It is only after the “data scientist” became a reality that people started searching for “data science.” In the future, data science may become a scientific field, similar to statistics. Currently, though, it is not mature enough. Right now, data science is an attempt to merge different disciplines to answer practical questions. Sometimes, this attempt is successful, which makes my life and the lives of many of my colleagues so exciting.
One standard feature of most, if not all, data science tasks is the requirement to understand the underlying domain. A data scientist in a cyber security team needs to have an understanding of data security, a bioinformatician needs to understand the biological processes, and a data scientist in a financial institution needs to know how money works.
That is why, career-wise, I think that the best strategy is to study an applied field that requires data-intense solutions. By doing so, you will learn how to use the various data analysis techniques. More importantly, you will also learn how to conduct complicated research, and how the analysis and the underlying domain interact. Then, one of two things will happen. You will either specialize in your domain and become an expert, or you will switch between several domains and learn to build bridges between the domains and the tools. Both paths are valuable. I took the second path, and it looks like most of today’s data scientists took that route too. However, sometimes, I envy the specialization I could have gained had I not left computational chemistry about ten years ago.
Who can use the “data scientist” title?
I started presenting myself as a “data scientist and algorithm developer” not because I passed some licensing exams or had a diploma. I did so because I was developing algorithms to answer data-intense questions. Saying “I’m a data scientist” is like saying “I’m an expert,” or “I’m an analyst,” or “I’m a manager.” If you feel comfortable enough calling yourself so, and if you can defend this title before your peers, do so. Out of the six data scientists in my current team, we have a pharmacist (me), a physicist, an electrical engineer, a CS major, and two mathematicians. We all have advanced degrees (M.A. or Ph.D.), but none of us had any formal “data science” training. I think that the many existing data science courses and programs are only good for people with deep domain knowledge who need to learn the data tools. Managers can benefit from these courses too. However, by taking such a program alone, you will lack the experience in scientific methodology, which is central to any data research project. Such a program will not provide you with the computer science knowledge and expertise to make you a good data engineer. You might end up a mediocre Python or R programmer who can fiddle with the parameters of various machine learning libraries, one of the many. Sometimes it’s good enough. Frequently, it’s not.
Lessons from the past
When I started my Ph.D. (in 2001), bioinformatics was HUGE. Many companies had bioinformatics departments that consisted of dozens, sometimes hundreds, of people. Every university in Israel (where I live) had a bioinformatics program. I knew at least five bioinformatics startups in my geographic area. Where is it now? What do these bioinformaticians do? I don’t know any bioinformatician who kept their job description. Most of those I know moved into data science; some became managers. Others work as governmental clerks.
Existing tools like Tableau have already sweated much of the complexity out of the once-very-hard task of data visualization, said Raghuram. And there are more higher-level tools on the way … that will improve workflow and automate how data interpretations are presented. “That’s the sort of automation that eliminates the need for data scientists to a large degree,” … And as the technology solves more of these problems, there will also be a lot more human job candidates from the 100 graduate programs worldwide dedicated to churning out data scientists. Supply, meet demand. And bye-bye perks.
My point is, you have to be versatile and expert. The best way to become one isn’t to take a crash course but to solve hard problems, preferably under supervision. Usually, you do so by obtaining an advanced degree. By completing an advanced degree, you learn, you learn to learn, and you prove to yourself and your potential employers that you’re capable of bridging the knowledge gaps that will always be there. That is why I advocate obtaining a degree in an existing field, keeping data science as a tool, not a goal.
I might be wrong.
Giving advice is easy. Living the life is not. The path I’m advocating for worked for me. I might be completely wrong here.
I may be completely wrong about data science not being a mature scientific field. For example, deep learning may be the defining concept of data science as a scientific field on its own.
Credits: The crowd image is by Flickr user Amy West. Hilary Mason's photo is from her site https://hilarymason.com/about/
On June 12th, I’ll be talking about anomaly detection and future forecasting when “good enough” is good enough. This lecture is a part of PyCon Israel, which takes place between June 11 and 14 at Bar Ilan University. The conference agenda is very impressive. If “python” or “data” are parts of your professional life, come to this conference!
This week we’re bringing you new reading (and watching!) on neural networks, artificial intelligence, and…poetry. (Yes, poetry!) Check out our recommendations and share your perspectives on them in the comments.
Are you a stereotypical millennial who’s curious about many things and loves your information fast, fun, and to the point? Are you also interested in learning about machine learning, neural networks, and other cool stuff? You aren’t, but you know such a person? Make sure to check out Siraj Raval’s YouTube channel. Siraj is a Pythonista, a machine learning geek, a rapper, and, apparently, a YouTube star. He talks about self-driving cars, stock exchange prediction, image classification, and other cool stuff in terms that do not require deep prior knowledge.
“The dark side of data science” discusses the problems around relying too heavily on learning algorithms. The problems described in this post lie on the…
This week Boris shares a piece on the persuasive power of data visualization. Read something provocative in the field of data science? Be sure to share your links in the comments.
In “The Persuasive Power of Data Visualization,” a group of New York University researchers experimentally assesses the claim that data visualization is indeed an effective tool for conveying a message.
The 2014 study claims that despite the fact that “…data visualization has been used extensively to inform users…little research has been done to examine the effects of data visualization in influencing users or in making a message more persuasive.”
To make their point, the researchers presented a series of questions to 150 Amazon Mechanical Turk users. (Such a setup isn’t flawless, of course, but is a common practice in many perception studies.)
Indeed, graphical representation was effective, at least for…
With over 545 employees spread over more than 50 countries, Automattic is one of the largest distributed companies in the world.
Being distributed means that we, as Automatticians, work on our common goal to democratize publishing from wherever we wish. It also means that we heavily rely on online communication.
Besides providing flexibility, working in a distributed company brings challenges when you meet in person. Does X from team I/O appreciate informal humor? What is the hobby of Y, a member of the Happiness Engineering team? Does Z, our team’s HR person, smile a lot or only when taking a gravatar picture? The answers to these questions are trivial to get in a “traditional” company but not in a distributed environment.
However, this information is critical. It helps us relate to one another and form strong relationships that nourish creativity and cooperation. If the challenge doesn’t seem hard enough to…
Adding legends to a graph is easy. With matplotlib, for example, you simply call plt.legend() and voilà, you have your legends. The fact that any major or minor visualization platform makes it super easy to add a legend doesn’t mean that it should be added. At least, not in graphs that are supposed to be shared with the public.
The chart provides fascinating information. However, to “decipher” it, the viewer needs to constantly switch between the chart and the legend to the right. Moreover, having to encode eight different categories resulted in colors that are hard to distinguish. And if you happen to be colorblind, your chances of getting the colors right are significantly lower.
What is the solution to this problem? Let’s reduce the distance between the labels and the data by putting the labels and the data together.
Notice the multiple advantages of the “after” version. First, the viewer doesn’t need to jump back-and-forth to decide which segment represents which data series. Secondly, by moving the legends inside the graph, we freed up valuable real estate area. But that’s not all. The new version is readable by the colorblind. Plus, the slightly bigger letters make the reading easier for the visually impaired. It is also readable and understandable when printed out using a black and white printer.
“Wait a minute,” you might say, “there’s not enough space for all the labels! We’ve lost some valuable information. After all,” you might say, “we now only have four labels, not eight”. Here’s the thing. I think that losing four categories is an advantage. By imposing restrictions, we are forced to decide what it is that we want to say, what is important and what is not. By forcing ourselves to only label larger chunks, we are forced to ask questions. Is the distinction between “Moustache with Muttonchops” and “Moustache with Sideburns” THAT important? If it is, make a graph about Muttonchops and Sideburns. If it’s not, combine them into a single category. Even better, combine them with “Moustache”.
Having the ability to add a legend with any number of categories using only one line of code is super convenient and useful, especially during data exploration. However, when shared with the public, graphs need to contain as few legends as practically possible. Remove the legends, place the labels close to the data. If doing so results in unreadable overlapping labels, refine the graph, rethink your message, combine categories. This may take time and cause frustration, but the result might surprise you. If none of these is possible, put the legend back. At least you tried.
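To make the advice concrete, here is a minimal matplotlib sketch of labels placed next to the data instead of in a legend. It is not the code behind the charts in this post: the categories and numbers are made up for illustration, and a plain line chart stands in for the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

years = [1900, 1920, 1940, 1960]
# Hypothetical shares (percent), invented for this illustration only
series = {
    "Clean shaven": [20, 45, 70, 85],
    "Moustache": [45, 40, 25, 12],
    "Beard": [35, 15, 5, 3],
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, values in series.items():
    (line,) = ax.plot(years, values)
    # Put the label right after the last data point, in the line's own color,
    # so the viewer never has to look away from the data.
    ax.text(years[-1] + 1, values[-1], name,
            color=line.get_color(), va="center")

# Leave room on the right for the labels and remove the unneeded frame lines.
ax.set_xlim(years[0], years[-1] + 20)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# Note: plt.legend() is never called.
```

Because each label sits next to its line, the chart stays readable in black and white. With more than a handful of series, the labels start to overlap, which is exactly the signal to combine categories.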
Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.
The goal of data visualization is to transform numbers into insights. However, default data visualization output often disappoints. Sometimes, the graph shows irrelevant data or misses important aspects; sometimes, the graph lacks context; sometimes, it’s difficult to read. Often, data practitioners “feel” that something isn’t right with the graph, but cannot pinpoint the problem.
In this post, I’ll share the process of visualizing a complex issue using a simple plot. Despite the fact that the final plot looks elementary and straightforward, it took me several hours and trial-and-error attempts to achieve the result. By sharing this process, I hope to accomplish two goals: to offer my perspectives and approaches to data visualization and to learn from other options you suggest. You’ll find the code and the data used in this post here.
Plotting power distribution in the Knesset
This post is devoted to a graph I created to explore…
Tishrei, the seventh month of the Hebrew calendar, starts with Rosh-HaShana, the Hebrew New Year*. It is a 30-day month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), first and last days of Sukkot (Feast of Tabernacles) **. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 resting days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the days between the first and the last Sukkot days are mostly considered half working days. Also, the children are at home since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.
I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 1993 and 2020 CE, and this is what we get:
Overall, this period contains between 15 and 17 non-working days in a single month (31 days, mind you). This is what the working/not-working time during this month looks like:
Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to a constantly interrupted work day, but at a different scale.
So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.
(*) New Year starts in the seventh month? I know this is confusing. That’s because we number Nissan, the month of the Exodus from Egypt, as the first month.
(**) If you are an observant Jew, you should add the Fast of Gedalia to this list, but we will omit it from this discussion.
Two weeks ago, I gave an interview to Matthew Kaboomis Loomis from http://www.buildyourownblog.net. This was my first time, and I was pretty nervous. During the interview, Matthew and I talked about the recent findings that I have published in my previous post. Surprisingly, I really enjoyed the interview.
Click on the image below to see the interview on Matthew’s blog.
Socially active bloggers and bloggers who write in teams tend to keep blogging for longer.
Writing a blog is a hard and demanding task. It requires creativity, dedication and persistence. Data shows that a large percentage of bloggers stop posting after a couple of months, and most of them don’t survive for more than a year. What is the force that drives the successful bloggers? What factors distinguish the persistent bloggers from the “quitters”? Is there anything a person can do to increase his or her chances to keep blogging and not to quit?
The research that I will present here was performed in collaboration with Lior Zalmanson, a post-doc researcher at the Stern School of Business, New York University. In this study, we asked ourselves whether people can increase their chance to keep blogging by joining forces.
Whether people work better in groups or as individuals is an open and long-debated question in social psychology. On one hand, there is a phenomenon of “Social Loafing” — a phenomenon that was first described by a French agricultural engineer Maximilien Ringelmann in 1913. According to Ringelmann’s findings, having group members work together on a task results in significantly less effort than when individual members are acting alone [Wikipedia, Original paper].
On the other hand, Otto Köhler, a German psychologist, found in 1926 that the weaker group members strive to keep up with the accomplishments of the other group members, which results in an overall performance improvement [Encyclopaedia Britannica]. The effect of “Social Compensation” was suggested by Karau and Williams in 1991. According to their observations, a worker will work harder in groups, compensating for those who work less.
The connected world of WordPress.com bloggers
Which of the two competing theories is more applicable to the world of bloggers? To shed some light on this question, we studied the blogging patterns of WordPress.com users. WordPress.com is a platform that hosts more than 110,000,000 sites that belong to more than 102,000,000 registered users. To better understand the implications of social interactions on blogging activities, we analyzed the links between WP.com users and blogs.
This analysis results in a mathematical structure called ‘graph’ that contains different types of interactions among various kinds of entities.
Let’s consider a simple example. Alice wrote a post on her blog, which we will call B1. Publishing the post created a relationship between Alice and her blog. We will call this connection “IS_CONTRIBUTOR”. At some point, Bob joins Alice and writes a post to the same blog. Now, both Alice and Bob are contributors to that blog.
As time passes, Bob continues submitting content to the blog and Alice doesn’t. To reflect this difference between the two, we define the “weight” of a link: the more content a user contributes to the blog, the higher the weight. If a user, Alice in our case, doesn’t write new content to the blog, her connection to the blog gradually fades until, at some point, we delete the link. We will then no longer consider Alice a contributor to B1.
At some point, Charlie reads Alice’s blog post and presses the “like” button. By clicking this button, Charlie creates a connection between himself and Alice. We will call this relationship “LIKES_AUTHOR”. We consider bringing a new audience to a blog as a contribution to that blog. Thus, when Charlie “likes” Alice’s post, he also increases the weight of the “IS_CONTRIBUTOR” link between Alice and the blog.
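The weighting idea can be sketched in a few lines of Python. The post does not say how exactly a link's weight grows or fades, so everything specific here is an assumption made for illustration: the monthly exponential decay, the DECAY and THRESHOLD values, and the hypothetical update_weights helper.

```python
DECAY = 0.5      # assumed: a link keeps half of its weight each month
THRESHOLD = 0.1  # assumed: links lighter than this are deleted

def update_weights(weights, activity):
    """Advance IS_CONTRIBUTOR link weights by one month.

    weights:  {(author, blog): weight} from the previous month
    activity: {(author, blog): contributions this month}, where a
              contribution is a new post or a "like" received on a post
    """
    updated = {}
    for link in set(weights) | set(activity):
        weight = weights.get(link, 0.0) * DECAY + activity.get(link, 0)
        if weight >= THRESHOLD:  # faded links disappear from the graph
            updated[link] = weight
    return updated

# Alice and Bob both post in month 1; afterwards only Bob keeps posting.
months = [
    {("Alice", "B1"): 1, ("Bob", "B1"): 1},
    {("Bob", "B1"): 1},
    {("Bob", "B1"): 1},
    {("Bob", "B1"): 1},
    {("Bob", "B1"): 1},
]
weights = {}
for activity in months:
    weights = update_weights(weights, activity)
```

With these assumed parameters, Alice's link to B1 halves every month and is deleted in the fifth month, while Bob's link keeps growing. A "like" from Charlie would simply be counted as one more contribution in Alice's activity for that month.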
For the sake of our discussion, let’s assume that Alice writes posts in another blog, we will call it B2. It turns out that Daphne and Eve are also authors in B2. Charlie, whom we already met, is also writing a blog, all by himself.
We want to know how collaboration affects authors’ persistence. In other words: does the number of collaborators an author has affect the probability that the author will keep blogging for a longer time? In this toy example, Alice has three collaborators (Bob, Daphne and Eve), Daphne and Eve each have two collaborators, Bob has only one collaborator (Alice), and Charlie has no collaborators at all. In order not to upset writers who write alone, we will consider each person a partner of him- or herself. Thus, for the purpose of our analysis, Alice has four collaborators (including herself), Daphne and Eve have three, and Bob has two collaborators.
[Table: # of collaborators for each author in the toy example]
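The counting rule for the toy example can be reproduced with a short Python sketch. The list of links mirrors the example above; the name B3 for Charlie's solo blog and the count_collaborators helper are made up here, and this is not the code used in the actual study.

```python
from collections import defaultdict

# IS_CONTRIBUTOR links of the toy example: (author, blog)
is_contributor = [
    ("Alice", "B1"), ("Bob", "B1"),
    ("Alice", "B2"), ("Daphne", "B2"), ("Eve", "B2"),
    ("Charlie", "B3"),  # "B3" is a hypothetical name for Charlie's solo blog
]

def count_collaborators(links):
    """Count, for every author, the distinct people who write in the
    same blogs. Authors count as their own partners."""
    authors_by_blog = defaultdict(set)
    for author, blog in links:
        authors_by_blog[blog].add(author)

    partners = defaultdict(set)
    for authors in authors_by_blog.values():
        for author in authors:
            partners[author] |= authors  # includes the author
    return {author: len(people) for author, people in partners.items()}

collaborators = count_collaborators(is_contributor)
```

Using sets means that someone who shares two blogs with Alice is still counted once, which matches the idea of counting people rather than links.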
What we see here is a small example. In reality, WordPress.com users form a large complex network of people and blogs. Since the connections between the nodes in this network can appear and disappear, this network is in constant change. One of the interesting things that we can do with this kind of dynamic system is to discover user communities. I have already presented one such analysis in the past (see this presentation) and will certainly show more of it in the future. Meanwhile, note that blogs don’t exist in a vacuum. Most of the time, people who write don’t write for themselves; they seek an audience. Writing inside a large platform of many interconnected communities brings the author closer to such an audience and promotes discussion and exchange of ideas.
Collecting the data
There are many types of communities in the online world. There are also many types of collaboration between people. Let us now concentrate on a very specific kind of community — a community of writers. In this analysis, we treat a blog or a website as a gathering point and all the writers in that website as the community members. We can tackle the question raised at the beginning of this post: what makes some bloggers keep blogging?
To answer this question, we look at people who opened a WordPress.com account during the thirteen months between Jan 2013 and Feb 2014. WordPress.com is home to large professional (VIP) customers such as NBC Sports, TED, CNN, Time and others (https://vip.wordpress.com/clients/). It would be unfair to include these professional writers in our analysis. Unfortunately, some people use WordPress.com for spamming, fraud and other non-legit activities; we have removed those people from the analysis. Many people open a WordPress.com account, write a test post and go away. We did not want to include such users in this study, so we only included those people who gave or received at least one “like”, as a proxy for a minimal level of quality and commitment. Last but not least, we excluded the 400+ Automattic employees and contractors from the analysis. At Automattic, we use WordPress.com for internal communication purposes and are a clear outlier in any analysis.
When we look at monthly snapshots of the dynamic networks mentioned above, we collect various descriptive statistics about the users who registered two to three months before the snapshot. We believe that this period is long enough for novice users to get used to the platform and to gain some popularity and social ties. Next, we look at the monthly snapshot made one year later and check whether a given user appears in the network as a contributor to at least one blog. We consider the users who appear in this graph as survivors. For the purpose of this analysis, we completely ignore what happens during the entire year between the two snapshots.
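The two-snapshot logic can be sketched as follows. The user names, dates and helper functions are all hypothetical, and the real pipeline works on network snapshots rather than plain dictionaries. The sketch only shows the idea: pick the users who registered two to three months before a snapshot, then check who is still a contributor one year later.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def select_cohort(registrations, snapshot_month):
    """Users who registered two to three months before the snapshot."""
    return {
        user
        for user, registered in registrations.items()
        if 2 <= months_between(registered, snapshot_month) <= 3
    }

def find_survivors(cohort, contributors_one_year_later):
    """Cohort members who still contribute to at least one blog."""
    return cohort & contributors_one_year_later

# Hypothetical data, for illustration only
registrations = {
    "alice": date(2013, 1, 15),
    "bob": date(2013, 2, 3),
    "charlie": date(2013, 3, 20),  # registered too recently for this snapshot
}
cohort = select_cohort(registrations, snapshot_month=date(2013, 4, 1))
survivors = find_survivors(cohort, contributors_one_year_later={"bob", "charlie"})
survival_rate = len(survivors) / len(cohort)
```

Note that whatever happens between the two snapshots plays no role: a user who stopped for ten months and came back still counts as a survivor.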
In the end, we need to analyze the connection between two sets of information. On one hand we have the descriptive statistics collected at the data collection point. On the other side of the equation, we have the survival information. Let’s see the results.
Silver bullet of blogging success?
Approximately 570,000 authors met the study inclusion criteria. We don’t analyze these authors alone but in the context of a network that contains about 5,000,000 users. What factors promote author survival? In the next paragraphs, I will show several similar graphs. In these graphs, the Y-axis shows the probability to continue blogging as a function of the variable presented on the X-axis. On that axis, below the number that represents the variable value, you will find the number of people for whom this value is true.
The figure above shows how the survival probability depends on the number of likes an author received two to three months after registration. We can see that there were 172 thousand users who did not get any “like” at the data collection point. These authors had a 2% probability to stay active at the end of the study. Approximately 213 thousand users with one “like” had a 4% probability to keep blogging, a two-fold increase. Overall, we see that the more “likes” a person receives, the higher the survival probability. Users who received more than 32 likes (there were only 800 of them) are 43% likely to keep blogging (the shaded area represents the 95% confidence interval, which shows how confident we are in the survival estimate).
Can this graph help an aspiring blogger? Not directly. It is not surprising that an author who is capable of producing high-quality content that is also popular with a broad audience is likely to continue blogging in the future. Plus, authors can’t directly affect the number of likes they receive. Unlike receiving likes, pressing the “like” button on others’ content is under a person’s direct control. The following graph shows survival probability as a function of how many “likes” a user gave to others.
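To make the shaded confidence band more concrete, here is a minimal sketch of how a survival probability and its 95% interval could be computed for one group of authors. It assumes a Wilson score interval for a binomial proportion; the post does not say which interval method was actually used, and the function name `wilson_interval` and the counts below are illustrative only.

```python
import math


def wilson_interval(survivors, total, z=1.96):
    """Return (estimate, lower, upper) for a binomial proportion.

    Uses the Wilson score interval; z=1.96 corresponds to roughly 95%.
    This is an assumed method, not necessarily the one used in the study.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = survivors / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return p, centre - half, centre + half


# Illustrative counts: a group of 800 authors of whom 344 (43%) are still active.
estimate, low, high = wilson_interval(344, 800)
print(f"survival = {estimate:.2f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Small groups produce wide intervals, which is why the band widens at the right-hand side of such graphs, where only a few hundred authors remain.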
Generally speaking, the more socially active the authors are, the more likely they are to keep blogging. It is interesting to note that there is such a thing as too much love: authors who pressed the “like” button more than 32 times during their first two to three months are less likely to survive, compared to the immediately preceding group.
At first glance, one might take the two graphs above as a magic recipe for blogging success. That would be a mistake. It is true that we have found a correlation between the number of incoming and outgoing likes and the survival probability. However, this connection does not tell us which side of the association is the cause and which one is the result. This uncertainty, often called the correlation-causation question, is very hard to resolve. Still, based on my general knowledge of human psychology, I claim that social interactions are partially responsible for the high correlation that we see in the data.
A problem shared is a problem halved
Let’s get back to our original question: do authors who collaborate with others have a higher probability to keep on blogging? In terms of our example network, does Alice, who has four collaborators, have a greater chance to survive than Charlie, who writes by himself? The answer is yes. According to the data we have, authors who write by themselves have only a 6% probability to keep blogging until the end of the study. Authors who write in pairs have a 9% probability to stay, a 50% increase. Interestingly, adding further collaborators has almost no effect, until we see teams of more than 16 authors.
Does fairness matter?
This discussion started with a description of two competing theories that try to predict the outcome of a joint effort. Both theories talk about “weaker” and “stronger” participants. It turns out that we can measure the extent to which a person contributes to a blog and the extent to which a blog’s authors provide equal contributions. Let’s get back to our example network. Suppose that, at some point, we notice that Alice, Bob and Charlie collaborated in submitting content to a blog (B1). Alice wrote most of the posts and attracted most of the “likes”; Bob wrote from time to time; and Charlie contributed only one post, a long time ago. If we measure link strength between each author and this particular blog, we see that Alice’s contribution is 10, Bob’s contribution is 0.5 and Charlie’s contribution is 0.01. On the other hand, Charlie, Daphne and Eve occasionally write a joint blog (B2) to which they contribute equally, with tie strength values of 0.1 each.
Will Alice feel abused by Charlie and Bob and cease her collaboration with them? Is it possible that Charlie will feel insecure about his weak cooperation with Alice and Bob, and will only write to B2? We can record how unbalanced a user’s input is in a given blog. We will measure this type of inequality such that authors who contribute less than the average will receive negative values; authors who provide the average contribution will have inequality value of zero and authors who perform most of the job will have a positive inequality score. We compute this score for each person-blog connection, which means that people who contribute to several blogs will have several inequality scores.
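As an illustration, the sketch below computes one possible score with the properties just described: negative below the blog's average, zero at the average, positive above it. The post does not give the exact formula, so the relative deviation from the blog's mean used here is an assumption, and the function name `inequality_scores` is hypothetical.

```python
def inequality_scores(contributions):
    """Map each author to (contribution - blog mean) / blog mean.

    Assumed formula for illustration only; the study's actual
    inequality score is not specified in the text.
    """
    if not contributions:
        raise ValueError("contributions must not be empty")
    mean = sum(contributions.values()) / len(contributions)
    if mean <= 0:
        raise ValueError("mean contribution must be positive")
    return {author: (value - mean) / mean
            for author, value in contributions.items()}


# Tie strengths taken from the example network in the text.
b1 = {"Alice": 10, "Bob": 0.5, "Charlie": 0.01}
b2 = {"Charlie": 0.1, "Daphne": 0.1, "Eve": 0.1}

for name, blog in (("B1", b1), ("B2", b2)):
    print(name, inequality_scores(blog))
```

Note that Charlie gets two separate scores, one for B1 and one for B2, because the score is computed per person-blog connection.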
What is the probability that a given author will continue contributing to a particular blog, given how equal (or unequal) that person’s contribution to this blog is? Note that a user who contributes to several blogs may stop writing to one blog but continue adding content to another one. In our latest example, Charlie may decide that he doesn’t want to write to B1 anymore and will contribute only to B2.
Generally speaking, there may be four types of connection between the inequality score and the persistence probability as schematically depicted in the figure below:
One possible outcome is that the more a person contributes to a blog, the higher the chance that he or she will continue writing to that blog (case A in the figure above). Another possibility is that the opposite is true: the more a person contributes to the blog, the smaller the chance to keep writing (for example, due to a sense of unfairness; case B). Another possibility is that people with an average contribution have a higher chance to drop off, leaving only the super-engaged and the occasional contributors (case C). Finally, the opposite possibility exists, where average-contributing authors have the highest chance to persist (case D).
According to our data, option A seems to describe reality best. Contrary to my initial belief, people don’t mind carrying the load. To some extent, authors who contribute more than the average to the joint effort are more likely to keep writing.
There is another level at which to analyze contribution inequality. If you go back to our latest example, you’ll notice that the contributions to B2 are very balanced, while the contributions to B1 are not. To quantify such a balance we use the Gini index. The Gini index is in extensive use in economics. It was developed to measure the extent to which the distribution of a limited resource among individuals deviates from a perfectly equal distribution. The Gini index ranges from zero to one. The value of zero means total equality, and one means the strongest inequality possible. In the example above, blog B1 has a Gini index of 0.44 and B2 has an index of 0.
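For readers who want to experiment, here is a minimal sketch of the textbook Gini index, computed as the mean absolute difference between all pairs of values divided by twice the mean. The post does not say which variant of the index, or which preprocessing of the tie strengths, was used in the study, so this sketch is not guaranteed to reproduce the values quoted above. The function name `gini` is my own.

```python
def gini(values):
    """Gini index via the mean absolute difference formulation.

    G = sum_i sum_j |x_i - x_j| / (2 * n * sum(x)).
    Returns 0 for perfectly equal values; for n values the maximum
    of this uncorrected variant is (n - 1) / n.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    abs_diff_sum = sum(abs(x - y) for x in values for y in values)
    return abs_diff_sum / (2 * n * total)


# Perfectly balanced contributions, as in blog B2 from the text.
print(gini([0.1, 0.1, 0.1]))

# One author does all the work (illustrative values).
print(gini([0, 0, 1]))
```

The double loop is quadratic in the number of authors, which is fine for blogs with a handful of contributors; a sorted-values formulation would be a better fit for very large groups.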
Having computed the Gini index for all the blogs in our study, we can analyze the connection between the Gini index and the probability of a blog to survive one year after the analysis. Similarly to the previous case, there are four possible relationships between the measured value (Gini index) and the probability:
I must admit that before the analysis, I expected to see a curve similar to case B. I was very surprised (and super excited) to see that sites with extreme input inequality have a much higher survival probability than blogs with roughly equal contributions.
Let me go back to the questions I asked at the beginning of this post.
Q: What is the force that drives the successful bloggers?
Q: What factors distinguish the persistent bloggers from the “quitters”?
I have shown empirical evidence that social interaction with other authors is strongly correlated with blogging success. Does social interaction bring blogging success, or are highly motivated authors also active in interpersonal activity? I don’t know for sure, but I assume that the reality is a combination of both. Thus, when choosing a blogging platform, make sure you can interact with your readers and with bloggers with similar interests. Make sure to connect with them. You can only gain from these connections.
Q: Is there anything a person can do to increase his or her chances to keep blogging and not to quit?
Team up! The evidence is clear: people who write in groups are more likely to keep writing. Every project needs a leader (recall the inequality graphs). If you are already motivated, don’t be afraid to lead. On the other hand, don’t be scared to be in the shadow of a dominant partner. Unbalanced contribution to a joint project is not about unfairness but is about leadership. Remember that 1% of fame is better than 100% of anonymity.
The information in this post was first presented at WordCamp Israel that took place in Jerusalem on March 28, 2016. This study was performed in tight collaboration with Lior Zalmanson, a post-doc researcher in the Stern School of Business, New York University.
We are surrounded by discrete events: posts and comments, purchases from the online store, page visits are all examples of discrete events. Some of these events happen periodically; others happen sporadically. Some happen once in a while; others are generated every microsecond. There are many ways to visualise such streams, the most naive one being a one-dimensional plot in which events are placed on a time axis, as demonstrated in the figure below for three simple cases.
But what happens if there are too many such events? How do we gain a global overview? Recently, I have learned about “Time maps”, a creative way to perform such a visualization, which I will describe below using examples from WordPress.com. I will also extend this approach so that cases with drastically different time scales can be brought to the same domain, for easier comparison and pattern discovery.
Instead of plotting the absolute time values, or time differences, Max Watson from District Data Labs suggests representing each data point in two dimensions, in terms of the time that passed before and after the event. Thus, all the data points from case A in the figure above will be represented by the pair (1, 1), as all of them occur at regular intervals of 1 second. Most of the dots in case B from the same figure will be represented by the same pair, (1, 1). However, the pause between the events at time marks 3 and 6 is 3 seconds and not 1. Thus, the third point will be represented by (1, 3) and the fourth one by (3, 1). Following are the time maps for the three toy cases from the figure above:
Note that all the dots in case A overlap. Now we can identify several interesting regions in these time maps. Events that happen at equal intervals lie on the diagonal line and overlap. Events that occur at approximately equal intervals will be distributed along the diagonal. The upper-right portion of the map represents events that happen at a slow pace, while the lower-left quadrant of the graph represents fast events. Long interruptions followed by recovery manifest as large deviations from the diagonal: the interruption itself (high time after the event) is shown as a dot above the diagonal, and the event after the interruption (high time before the event) appears as a dot to the right of the diagonal.
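The construction of a time map can be sketched in a few lines. The snippet below assumes that the X coordinate is the time since the previous event and the Y coordinate is the time until the next event, and it simply skips the first and the last event, which lack one of the two neighbours. The function name `time_map` and the toy timestamps are illustrative.

```python
def time_map(timestamps):
    """Return (time_before, time_after) pairs for every interior event.

    The first and last events are skipped because they have no
    preceding or following event, respectively.
    """
    ts = sorted(timestamps)
    return [(ts[i] - ts[i - 1], ts[i + 1] - ts[i])
            for i in range(1, len(ts) - 1)]


# Toy stream with one 3-second pause between the events at 3 and 6 seconds.
events = [1, 2, 3, 6, 7, 8]
for point in time_map(events):
    print(point)
```

Plotting these pairs as a scatter plot, usually with logarithmic axes for real data, gives the time map itself.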
Let us examine some real-life events. I’ll start with publishing posts or pages on some of the WordPress.com blogs.
Following is a time map of a particularly busy site with more than 7,700 posts. On the right, you may see the time map of posts and pages, in which each point is colored according to the time of day (GMT) at which the post was published. Note the logarithmic scale of the plot, which shows events at time intervals spanning several orders of magnitude, from seconds to weeks. As you may see, the vast majority of publishing events occur at intervals of several minutes to one hour. The lines of purple dots represent breaks of approximately eight hours in publishing new posts, something that is consistent with a medium-sized editorial team. To the right, I present a landscape representation of the same time map. The dark blue bulb in the graph on the left shows that the vast majority of the events occur at regular intervals between several minutes and one hour.
The time map of another busy site, with its 6,511 posting events, shows a slightly different publishing pattern.
In this news site, most of the events occurred at one-minute intervals. The typical service break in this site is either one or two days long, as represented by the two distinct strips.
Using the time map approach, it is relatively easy to identify bots. For example, the owners of the following site have uploaded more than 1,000 posts, most of which were published at one-minute intervals, a pattern that is unlikely to be produced by a human.
What about smaller blogs? This Norwegian blogger published fewer than 150 posts, with a typical time interval between a day and a week.
I have analyzed time maps of different administrative tasks performed by several random WordPress.com users.
This user, for example, is a moderately active blogger with a single blog. Their time map is characteristic of sporadic events with intervals of several seconds to one day. We may see that this user is very disciplined and does not leave the blog for more than a day.
On the other hand, another user, who has two blogs, presents a completely different time map, with thousands of events in the range of microseconds to milliseconds. Notice that the two types of events (the super-fast ones and the “normally paced” ones) are well separated and that the super-fast events occur at very specific hours. Does this mean that this user’s computer is affected by a virus? Does this user run an aggressive script to monitor their site? This is a topic for another post.
Bringing the maps to the common scale
You may have noticed the vast differences in time scales that we saw in the maps above. Sometimes, we are interested in a map’s shape, so that we can compare similar patterns that occur at different time scales. To achieve this, I first sort the time difference values in each case (user, blog, log record, etc.). Next, I transform these values such that the median time difference (the one that is larger than half of the observed points) gets the value of 0. Time differences larger than the median receive positive values and time differences smaller than the median receive negative values. In technical terms: I first compute the empirical cumulative distribution of the time differences, and then apply the logit transform.
The following figure shows the normalised maps for the four blogs I have mentioned at the beginning of this document. We may see that the elliptical shape of the dot cloud represents a series of events that occur somewhat regularly at a relatively low frequency. Round-shaped graphs represent cases in which the typical variability in inter-event periods is comparable to that period. Bot-induced time maps are still easily detected in this representation.
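The normalisation described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used for the figures: it assumes the empirical cumulative distribution is estimated as rank / (n + 1), a common choice that keeps the values strictly between 0 and 1 so that the logit stays finite. The function name `normalise_intervals` is hypothetical.

```python
import math
from bisect import bisect_right


def normalise_intervals(intervals):
    """Map time differences to logit(ECDF), preserving input order.

    The ECDF value of x is (number of observations <= x) / (n + 1),
    an assumed estimator that avoids probabilities of exactly 0 or 1.
    The median of an odd-sized sample with distinct values maps to 0.
    """
    ordered = sorted(intervals)
    n = len(ordered)
    result = []
    for x in intervals:
        p = bisect_right(ordered, x) / (n + 1)
        result.append(math.log(p / (1 - p)))
    return result


# Two streams with the same shape but very different time scales
# end up with identical normalised values.
print(normalise_intervals([1, 10, 100]))
print(normalise_intervals([1000, 10000, 100000]))
```

Because the transform depends only on ranks, the absolute time scale is discarded, which is the purpose of the normalisation and also why the raw map remains useful alongside it.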
Let us now examine the normalised time maps of the two users mentioned above:
One interesting thing to note is that the lower-left region in the second user’s graph is now significantly enlarged because it contains a much larger number of data points. As with the blog time maps, the approximately round shape of the first user’s time map indicates that the variability of the between-event interval is comparable in size to the interval itself. Note, though, that when observing the first user’s raw map, it was clear that this user’s events mostly occur no faster than once a second, and that the maximal interval is never larger than one day. Such information, which may be valuable in many cases, is absent from the normalised graph.
Time maps, as developed by Max Watson from District Data Labs, provide valuable insight in the analysis of event streams. It is evident that these maps reveal interesting activity patterns. Analysis of such patterns may be used in spam and fraud detection.
The proposed normalisation of time maps provides a tool for comparing the shapes of drastically different maps. The following figure, for example, presents time maps of Premium and Business purchases made in Japanese yen between June 1st and Dec 20th, 2015:
I selected the yen as the least frequently used currency on WordPress.com. Had I analysed the same data for USD, we would have seen a completely different raw map, but a comparable normalised one: