• Tips on making remote presentations


    February 21, 2018

    Today, I made a presentation to the faculty of the Chisinau
    Institute of Mathematics and Computer Science. The audience gathered in a conference room in Chisinau, and I was in my home office in Israel.

    Me presenting in front of the computer

    Here is a list of useful tips for this kind of presentation.

    * When presenting, it is very important to see your audience. Thus, use two monitors: one for screen sharing, and the other to see the audience.
    * Put the (Skype) window that shows your audience under the camera. This way you'll look most natural on the other side of the teleconference.
    * Starting a presentation in PowerPoint or Keynote “kidnaps” all the displays. You will not be able to see the audience when that happens. Instead, I export the presentation to a PDF file and use Acrobat Reader in full-screen mode. The up- and down-buttons on my presentation remote control work with the Reader; the “make screen black” button doesn’t.
    * I open a “light table” view of my presentation and put it next to the audience screen. It’s not as useful as seeing the presenter’s notes in a “real” presentation program, but it is good enough.
    * Stand up! Usually, we stand up when we present in front of a live audience. For some reason, when presenting remotely, people tend to sit. A sitting person is less dynamic and looks less engaging. I have a standing desk, which allows me to stand up and raise the camera to my face level. If you can’t raise the camera, stay seated. You don’t want your audience staring at your groin.
    

    Auditorium in Chisinau showing me on their screen

    February 21, 2018 - 2 minute read -
    chisinau kishinev moldova presentation presenting remote skype blog Data Visualization
  • The best productivity system I know


    February 20, 2018

    I am an awful procrastinator. I realized that many years ago. Once I did, I started searching for productivity tips and systems. Of course, most of these searches are just another form of procrastination. After all, it’s much more fun to read about productivity than to write that boring report. In 2012, I discovered a TiddlyWiki that implements AutoFocus, a system developed by Mark Forster (AutoFocus instructions: link; TiddlyWiki page: link).

    I loved the simplicity of that system and used it for a while. I also started following Mark Forster’s blog. Pretty soon after that, Mark published another, even simpler version of that system, which he called “The Final Version.” I liked it even better and readily adopted it. For many reasons, I moved from TiddlyWiki to Trello and made several personal adjustments to the system.

    At some point, I read “59 Seconds”, in which the psychologist Richard Wiseman summarizes many psychological studies in the fields of happiness, productivity, decision making, etc. From that book, I learned about the power of writing things down. It turns out that when you write things down, your brain gets a better chance to analyze your thoughts and to make better decisions. I also learned from other sources about the importance of disconnecting from the Internet several times a day. So, in November 2016, I made the transition from an electronic productivity system to an old-school notebook. In the beginning, I planned to keep that notebook as a month-long experiment, but I loved it very much. Since then, I have always had my analog productivity system and introspection device with me. Today, I started my sixth notebook. I love my system so much that I actually consider writing a book about it.

    Blank notebook page with #1 in the page corner

    The first page of my new notebook. The notebook is right-to-left since I write in Hebrew.

    February 20, 2018 - 2 minute read -
    procrastination productivity psychology blog Productivity & Procrastination
  • Once again on becoming a data scientist


    February 19, 2018

    My stance on learning data science is known: I think that learning “data science” as a career move is a mistake. You may read this long rant of mine to learn why I think so. This doesn’t mean that I think that studying data science, in general, is a waste of time.

    Let me explain this confusion. Take this blogger, for example: https://thegirlyscientist.com/. As of this writing, “thegirlyscientist” has only two posts: “Is my finance degree useless?” and “How in the world do I learn data science?”. This person (whom I don’t know) seems to be a perfect example of someone who may learn data science tools to solve problems in their professional domain. This is exactly how my own professional career evolved, and I consider myself very lucky in that respect. I strongly believe that successful data scientists outside academia should evolve either from domain knowledge to data skills or from statistical/CS knowledge to domain-specific skills. Learning “data science” as a collection of short courses, without deep knowledge of some domain, is, in my opinion, a waste of time. I constantly doubt myself in this respect, but I haven’t seen enough evidence to change my mind. If you think I’m missing a point, please correct me.

    February 19, 2018 - 1 minute read -
    career data science machine learning blog Career advice
  • The case of meaningless comparison


    February 18, 2018

    Exposé, an Australia-based data analytics company, published a use case in which they analyze the benefits of a custom-made machine learning solution. The only piece of data in their report [PDF] was a graph that shows the observed and the predicted values.

    Screenshot that shows two time series curves: one for the observed and one for the predicted values

    Graphs like this one provide an easy-to-digest overview of the data but are meaningless with respect to our ability to judge model accuracy. When predicting values of time series, it is customary to use all the available data to predict the next step. In cases like that, “predicting” the next value to be equal to the last available one will result in an impressive correlation. Below, for example, is my “prediction” of Apple stock price. In my model, I “predict” tomorrow’s prices to be equal to today’s closing price plus random noise.

    Two curves representing two time series - Apple stock price and the same data shifted by one day

    Look how impressive my prediction is!
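    The naive “prediction” is easy to reproduce. Below is a minimal sketch of the trick; note that it uses a simulated random walk as a stand-in for the real Apple quotes, and the noise scale is my arbitrary choice:

```python
import numpy as np

# A random walk standing in for a stock price series (synthetic, not real quotes).
rng = np.random.default_rng(42)
observed = 100 + np.cumsum(rng.normal(0, 1, 500))

# "Predict" each value as the previous one plus a little random noise.
predicted = np.roll(observed, 1) + rng.normal(0, 0.1, 500)
observed, predicted = observed[1:], predicted[1:]  # drop the undefined first step

# The naive forecast tracks the observations almost perfectly.
print(np.corrcoef(observed, predicted)[0, 1])  # close to 1
```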

    I’m not saying that Exposé constructed a nonsense model. I have no idea what their model is. I do say, however, that their communication is meaningless. In many time series, such as consumption dynamics, stock prices, etc., each value is a function of the previous ones. Thus, the “null hypothesis” of each modeling attempt should be that of a random walk, which means that we should compare not the actual values but rather the changes. And if we do that, we will see the real nature of the model. Below is such a graph for my pseudo-model (zoomed in to the last 20 points).

    A graph comparing the step-to-step changes (diffs) of the observed and the “predicted” series

    Suddenly, my bluff is evident.
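    The diff trick can be sketched in code as well. Continuing with the same synthetic stand-in data (not the real stock prices), comparing the changes instead of the levels makes the bluff obvious:

```python
import numpy as np

# Same synthetic setup: a random walk and its naive one-step "forecast".
rng = np.random.default_rng(42)
observed = 100 + np.cumsum(rng.normal(0, 1, 500))
predicted = np.roll(observed, 1) + rng.normal(0, 0.1, 500)
observed, predicted = observed[1:], predicted[1:]

# Compare the step-to-step changes rather than the raw levels.
obs_diff = np.diff(observed)
pred_diff = np.diff(predicted)

print(np.corrcoef(observed, predicted)[0, 1])  # levels: close to 1
print(np.corrcoef(obs_diff, pred_diff)[0, 1])  # changes: close to 0
```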

    To sum up, a direct comparison of observed and predicted time series can only be used as a starting point for a more detailed analysis. Without such an analysis, this comparison is nothing but a meaningless illustration.

    February 18, 2018 - 2 minute read -
    data visualisation Data Visualization dataviz time-series blog
  • I should read more about procrastination. Maybe tomorrow.


    February 17, 2018

    You’ve been there: you need to complete a project, submit a report, or document your code. You know how important all these tasks are, but you can’t find the power to do them. Instead, you’re researching those nice pictures the Opportunity rover sent to Earth, typing random letters into Google to see where they will lead you, tidying up your desk, or making another cup of coffee. You are procrastinating.

    Because I procrastinate a lot, and because I have several important tasks to complete, I decided to read more about the psychological background of procrastination. I went to Google Scholar and typed “procrastination.” One of the first results was a paper with a promising title: “The Nature of Procrastination: A Meta-Analytic and Theoretical Review of Quintessential Self-Regulatory Failure” by Piers Steel. Why was I intrigued by this paper? First of all, it’s a meta-analysis, meaning that it reviews many previous quantitative studies. Secondly, it promises a theoretical review, which is also a good thing. So, I decided to read it. I started with the abstract, and here’s what I saw:
    Strong and consistent predictors of procrastination were task aversiveness, task delay, self-efficacy, and impulsiveness, as well as conscientiousness and its facets of self-control, distractibility, organization, and achievement motivation.

    Hmmm, isn’t this the very definition of procrastination? Isn’t this sentence similar to “A strong predictor of obesity is a high ratio of a person’s weight to their height”? Now, I’m really intrigued. I am sure that reading this paper will shed some light not only on procrastination itself but also on that self-assuring sentence. I definitely need to read this paper. Maybe tomorrow.

    PS. After writing this post, I discovered that the paper’s author, Piers Steel, has a blog dedicated to “procrastination and science”: https://procrastinus.com/. I will read that blog too. But not today.

    February 17, 2018 - 2 minute read -
    paper procrastination productivity blog Productivity & Procrastination
  • Lie factor in ad graphs


    February 16, 2018

    What do you do when you have spare time? I tend to throw graphs from ads into a graph digitizer to compute the “lie factor”. Take the following graph, for example. It appeared in an online ad campaign a couple of years ago. In this campaign, one of the four Israeli health care providers bragged about the short waiting times of its phone customer support. According to Meuhedet (the health care provider that ran the campaign), their customers had to wait for one minute and one second, compared to 1:03, 1:35, and 2:39 in the cases of the competitors. Look how dramatic the difference is:

    The ad: a bar chart comparing the waiting times of the four health care providers

    The problem?

    If the orange bar represents 61 seconds, then the dark blue one stands for 123 seconds, almost twice as much as the actual number. Likewise, the green bar corresponds to 4:20 minutes, and the light-blue one to approximately seven minutes, not 1:35 and 2:39, as the numbers say.

    The same bar chart, annotated with the durations the bars actually represent

    I can’t figure out what guided the Meuhedet creative team in selecting the bar heights. What I do know is that they lied. And this lie can be quantified.
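    The quantification is Tufte’s “lie factor”: the size of the effect shown in the graphic divided by the size of the effect in the data. A sketch with the numbers above (the “shown” durations are my estimates of what the bars encode, as read off the chart):

```python
# Tufte's lie factor: (effect shown in the graphic) / (effect present in the data).
# Waiting times in seconds: the ad's stated numbers vs. what the bars appear to encode.
actual = {'Meuhedet': 61, 'competitor A': 63, 'competitor B': 95, 'competitor C': 159}
shown = {'Meuhedet': 61, 'competitor A': 123, 'competitor B': 260, 'competitor C': 420}

base = 'Meuhedet'
for name in actual:
    if name == base:
        continue
    effect_in_data = (actual[name] - actual[base]) / actual[base]
    effect_in_graph = (shown[name] - shown[base]) / shown[base]
    print(f'{name}: lie factor = {effect_in_graph / effect_in_data:.1f}')
```

For the first competitor, the graph shows a difference of 62 seconds where the data has only 2: a lie factor of about 31.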

    February 16, 2018 - 1 minute read -
    data visualisation Data Visualization dataviz lie lie-factor blog
  • Never read reviews before reading a book (except for this one). On "Surely You're Joking, Mr. Feynman!"

    February 15, 2018

    Several people suggested that I read “Surely You’re Joking, Mr. Feynman!”. That is why, when I got my new Kindle, “Surely You’re Joking, Mr. Feynman!” was the first book I bought.
    Richard Feynman was a trained theoretical physicist who co-won the Nobel Prize. From reading the book, I discovered that Feynman was also a drummer, a painter, an expert on Native American mathematics, a safecracker, a samba player, and an educator. The more I read this book, the more astonished I was by Feynman’s personality and his story.

    When I was halfway through the book, I decided to read the Amazon reviews. When reading reviews, I tend to look at the one- and two-star ones, to seed my critical thinking. I wish I hadn’t done that. The reviewers were talking about what an arrogant, self-bragging man Feynman was, and how terrible it must have been to work with him. I almost stopped reading the book after being exposed to those reviews.

    Admittedly, Richard Feynman never missed an opportunity to brag about himself and to emphasize how many achievements he made without meaning to do so, almost by accident. Every once in a while, he mentioned the many people who were much better than him in the particular field he managed to conquer. I call this pattern self-bragging modesty, and it is a pattern typical of many successful people. Nevertheless, given all his achievements, I think that Feynman deserves the right to brag. Being proud of your accomplishments isn’t arrogance; it is a natural thing to do. “Surely You’re Joking, Mr. Feynman!” is fun to read, very informative, and inspirational. I think that everyone who calls themselves a scientist, or considers becoming one, should read this book.

    P.S. After completing the book, I took some time to watch several of Feynman’s lectures on YouTube. It turned out that besides being a good physicist, Feynman was also a great teacher.

    February 15, 2018 - 2 minute read -
    book review feynman physics blog
  • Is Data Science a Science?


    February 14, 2018

    Is Data Science a Science? I think that there is no data scientist who doesn’t ask themselves this question once in a while. I recalled this question today when I watched “Theory, Prediction, Observation”, a fascinating lecture given by Richard Feynman in 1964. For those who don’t know, Richard Feynman was a physicist who won the Nobel Prize, and who is considered one of the greatest explainers. In that particular lecture, Prof. Feynman talked about science as a sequence of Guess ⟶ Compute Consequences ⟶ Compare to Experiment.

    Richard Feynman in front of a blackboard that says: Guess ⟶ Compute Consequences ⟶ Compare to Experiment

    This is exactly what we do when we build models: we first guess what the model should be, then compute the consequences (i.e., fit the parameters), and finally evaluate our models against observations.

    My favorite quote from that lecture is

    … and therefore, experiment produces troubles, every once in a while …

    I strongly recommend watching this lecture. It’s one hour long, so if you don’t have time, you may listen to it while commuting. Feynman is so clear, you can get most of the information by ear only.

    https://www.youtube.com/watch?v=OX1EK5IBSdw

    February 14, 2018 - 1 minute read -
    data science feynman philosophy philosophy-of-science science blog
  • Why deeply caring about the analysis isn't always a good thing?


    February 13, 2018

    Does Caring About the Analysis Matter?

    The simplystatistics.org blog had an interesting discussion about a podcast Roger Peng from simplystatistics.org recorded on A/B testing at Etsy. One of the conclusions Roger Peng reached is as follows:
    “Whether caring matters for data analysis also has implications for how to build a data analytic team. If you need your data analyst to be 100% committed to a product and to be fully invested, it’s difficult to achieve that with contractors or consultants, who are typically [not deeply invested].”

    A hypothetical graph that shows that $$ potential is lower as

    Yes, deeply caring is very important. That is why I share Roger Peng’s skepticism about external contractors. On the other hand, too deep an involvement is also a bad idea: it creates a bias. Such a bias, conscious or subconscious, reduces critical thinking and increases the chances of false findings. If you don’t believe me, recall the last time you debugged a model after it produced satisfactory results. I bet you can’t. The reason is that we all tend to work hard, looking for errors and problems, until we get the results we expect, but rarely for long after that.

    There are more mechanisms that may cause false findings. For a good review, I suggest reading Why Most Published Research Findings Are False by John P. A. Ioannidis.
    Image source: Data Analysis and Engagement - Does Caring About the Analysis Matter? — Simply Statistics

    February 13, 2018 - 2 minute read -
    best-practice debugging overfitting statistics blog
  • Does chart junk really damage the readability of your graph?


    February 12, 2018


    Data-ink ratio is considered to be THE guiding principle in data visualization. Coined by Edward Tufte, data-ink is “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, the ratio of the data-ink out of all the “ink” in a graph should be as high as possible, preferably, 100%.
    Everyone who considers themselves serious about data visualization knows (either formally or intuitively) about the importance of keeping the data-ink ratio high, the merits of a high signal-to-noise ratio, and the need to keep “chart junk” out. Everybody knows it. But are there any empirical studies that corroborate this “knowledge”? One such study was published in 1988 by James D. Kelly in a report titled “The Data-Ink Ratio and Accuracy of Information Derived from Newspaper Graphs: An Experimental Test of the Theory.”

    In the study presented by J.D. Kelly, the researchers showed a series of newspaper graphs to a group of volunteers. The participants had to look at the graphs and answer questions. A different group of participants was exposed to similar graphs that had undergone a rigorous removal of all possible “chart junk.” One such example is shown below.

    Two bar charts based on identical data. One - with "creative" illustrations. The other one only presents the data.

    Unexpectedly enough, there was no difference between the error rates the two groups achieved. “Statistical analysis of results showed that control groups and treatment groups made a nearly identical number of errors. Further examination of the results indicated that no single graph produced a significant difference between the control and treatment conditions.”

    I don’t remember how this report got into my “to read” folder. I am amazed I had never heard about it. So, what is my takeaway from this study? It doesn’t mean we need to abandon the data-ink ratio altogether. It does not provide an excuse to add “chart junk” to your charts “just because.” It does, however, show that maximizing the data-ink ratio shouldn’t be followed zealously as a religious rule. The maximum data-ink ratio isn’t a goal, but rather a tool. Like any tool, it has its limitations. Edward Tufte said, “Above all, show data.” My advice is: “Show data, enough data, and mostly data.” Your goal is to convey a message; if some decoration (a.k.a. chart junk) makes your message more easily digestible, so be it.

    February 12, 2018 - 2 minute read -
    chart-junk data-ink-ratio data visualisation Data Visualization dataviz research blog
  • On statistics and democracy, or why exposing a fraud may mean nothing


    February 11, 2018

    The “stat” in the word “statistics” means “state”, as in “government/sovereignty”. Statistics was born as a state effort to use data to rule a country. Even today, every country I know of has its own statistics authority. For many years, many governments have been hiding the true statistics from the public, under the assumption that knowledge means power. I was reminded of this after reading “Mathematicians, rock the vote!”, a post by Charles Earl (my teammate), in which he encourages mathematicians to fight gerrymandering. Gerrymandering is a dubious practice in the American voting system, whereby a regulatory body forms voting districts in such a way that the party that appointed that body has the highest chance to win. Citing Charles:

    It is really heartening that discrete geometry and other branches of advanced mathematics can be used to preserve democracy

    I can’t share Charles’s optimism. In the past, statistics has been successfully used several times to expose election frauds in Russia (see, for example, these two links [one] [two], but there are many more). People went to the streets, waving posters such as “We don’t believe Churov [a Russian politician], we believe Gauss.”

    Demonstration in Russia. Poster: "We don't believe Churov. We believe Gauss"

    “We don’t believe Churov. We believe Gauss”. Taken from Anatoly Karlin’s site http://akarlin.com/2011/12/measuring-churovs-beard/

    Why, then, am I not optimistic? After all, even the great Terminator, one of my favorite Americans, Arnold Schwarzenegger, fights gerrymandering.

    Arnold Schwarzenegger speaking about the gerrymandering problem (video screenshot)

    The problem is not that the Americans don’t know how to eliminate gerrymandering. The information is there; the solution is known [ref, as an example]. In theory, it is a very easy problem. In practice, however, power, even more than drugs and sex, is addictive. People don’t tend to give up their power easily. What happened in Russia after an election fraud was exposed using statistics? Another election fraud. And then yet another. What will happen in the US? I’m afraid that nothing will change there either.

    February 11, 2018 - 2 minute read -
    gerrymandering politics russia statistics usa blog
  • What is the best way to handle command line arguments in Python?


    February 10, 2018

    The best way to handle command line arguments in Python is defopt (github.com/evanunderscore/defopt: “Effortless argument parser”). It works like magic. You write a function, add a proper docstring using any standard format (I use numpy doc), and watch the magic:

    [code language="python"]
    import defopt

    def main(greeting, *, count=1):
        """Display a friendly greeting.

        :param str greeting: Greeting to display
        :param int count: Number of times to display the greeting
        """
        for _ in range(count):
            print(greeting)

    if __name__ == '__main__':
        defopt.run(main)
    [/code]

    You get:

    * help string generation
    * data type conversion
    * default arguments
    * zero boilerplate code
    

    Magic!

    Illustration: the famous XKCD

    February 10, 2018 - 1 minute read -
    cli python blog
  • Measuring the wall time in python programs


    February 9, 2018

    Measuring the wall time of various pieces of code is a very useful technique for debugging, profiling, and computation babysitting. The first time I saw code that performed time measurement was many years ago, when a university professor used Matlab’s tic-toc pair. Since then, whenever I learn a new language, the first “serious” code that I write is a tic-toc mechanism. This is my Python Tictoc class: [Github gist].
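    The gist itself is linked above; here is only a rough sketch of what such a Tictoc class may look like (the names and API are my own, not necessarily those of the gist):

```python
import time

class Tictoc:
    """A minimal Matlab-style tic-toc stopwatch."""

    def tic(self):
        # Remember the current time.
        self._start = time.perf_counter()
        return self

    def toc(self, message='Elapsed'):
        # Report the wall time passed since the last tic().
        elapsed = time.perf_counter() - self._start
        print(f'{message}: {elapsed:.3f} sec')
        return elapsed

# Usage:
timer = Tictoc().tic()
time.sleep(0.1)  # the code being measured
timer.toc()
```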

    February 9, 2018 - 1 minute read -
    gist python stopwatch timing blog
  • Why bar charts should always start at zero?


    February 8, 2018

    In the data visualization world, not starting a bar chart at zero is a “BIG NO”. Some people protest. “How can anyone tell me how to start my bar chart? The paper/screen can handle anything! If I want to start a bar chart at 10, nobody can stop me!”

    Data visualization is a language. Like any language, data visualization has its set of rules, grammar if you wish. Like in any other language, you are free to break any rule, but if you do so, don’t be surprised if someone underestimates you. I’m not a native English speaker. I certainly break many English grammar rules when I write or speak. However, I never argue if someone knowledgeable corrects me. If you agree that one should try respecting grammar rules of a spoken language, you have to agree to respect the grammar of any other language, including data visualization.

    Nathan Yau from flowingdata.com has a very informative post

    Screenshot of flowingdata.com post "Bar Chart Baselines Start at Zero"

    that explores this exact point. Read it.

    Another related discussion is called “When to use the start-at-zero rule” and is also worth reading.

    Also, remember that the zero point has to be a meaningful one. That is why you cannot use a bar chart to depict the weather (temperatures): unless you operate in Kelvin, the zero temperature is meaningless and changes according to the arbitrary choice of the temperature scale.

    Yet another thing to remember is that

    It’s true that every rule has its exception. It’s just that with this particular rule, I haven’t seen a worthwhile reason to bend it yet.

    (citing Nathan Yau)

    February 8, 2018 - 2 minute read -
    bar plot data visualisation Data Visualization dataviz blog
  • Gender salary gap in the Israeli high-tech — now the code


    February 7, 2018

    Several people have asked me about the technology I used to create the graphs in my recent post about the gender salary gap in the Israeli high-tech industry. As with 99% of the graphs I create, I used matplotlib. I have uploaded the notebook that I used for that post to Github. Here’s the link. The published version uses seaborn style settings; the original one uses a slightly customized style.

    February 7, 2018 - 1 minute read -
    code data visualisation Data Visualization dataviz jupyter matplotlib python seaborn blog
  • The Monty Hall Problem simulator


    February 6, 2018

    A couple of days ago, I told my oldest daughter about the Monty Hall problem, the famous probability puzzle with a counter-intuitive solution. My daughter didn’t believe me. Even when I told her all about the probabilities, the added information, and the other stuff, she still couldn’t “feel” it. I looked for an online simulator and couldn’t find anything that I liked. So, I decided to create a simulation Jupyter notebook.

    Illustration: Screenshot of a Jupyter notebook that shows the output of one round of Monty Hall simulation

    I’ve uploaded the notebook to GitHub, in case someone else wants to play with it [link].
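    For those who prefer reading code to opening a notebook, the core of such a simulation boils down to a few lines (a sketch of the idea, not the notebook itself):

```python
import random

def monty_hall(switch, n_rounds=10_000):
    """Play n_rounds of the Monty Hall game; return the fraction of wins."""
    wins = 0
    for _ in range(n_rounds):
        doors = [0, 1, 2]
        prize = random.choice(doors)
        choice = random.choice(doors)
        # The host opens a door that hides no prize and wasn't chosen.
        opened = random.choice([d for d in doors if d != prize and d != choice])
        if switch:
            # Switch to the only remaining closed door.
            choice = next(d for d in doors if d != choice and d != opened)
        wins += (choice == prize)
    return wins / n_rounds

print(f'stay:   {monty_hall(switch=False):.2f}')  # about 0.33
print(f'switch: {monty_hall(switch=True):.2f}')   # about 0.67
```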

    February 6, 2018 - 1 minute read -
    gambling jupyter jupyter-notebook monty-hall-problem probability statistics blog
  • In defense of double-scale and double Y axes


    February 5, 2018

    If you had a chance to talk to me about data visualization, you know that I dislike the use of a double Y-axis for anything except presenting different units of the same measurement (for example, inches and meters). Of course, I’m far from being a special case. Banning double axes is a standard stance among people in the field of data visualization education. Nevertheless, double-scale axes (mostly Y-axes) are commonly used in both popular and technical publications. One of my data visualization students at the Azrieli College of Engineering of Jerusalem told me that he continually uses double Y scales when he designs dashboards that are displayed on the tiny screen of a piece of sophisticated hardware. He claimed that it was impossible to split the data into separate graphs, due to space constraints, and that the engineers who consume those charts are professional enough to overcome the shortcomings of the double scales. I couldn’t find any counter-argument.

    When I tried to clarify my position on that student’s problem, I found an interesting article by the Financial Times commentator John Authers, called “Lies, Damned Lies and Statistics.” In this article, John Authers reviews the many problems a double scale can create. He also shows different alternatives (such as normalization). However, at the end of that article, he also provides strong and valid arguments in favor of a moderate use of double scales. John Authers notices the strange dynamics of two metrics:

    A chart with two Y axes: one for the EURJPY exchange rate and the other for the SPX index

    Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

    It is extraordinary that two measures with almost nothing in common with each other should move this closely together. In early 2007 I noticed how they were moving together, and ended up writing an entire book attempting to explain how this happened.

    It is relatively easy to modify chart scales so that “two measures with almost nothing in common […] move […] closely together”. However, it is hard to argue with the fact that it was the double-scale chart that triggered that spark in the commentator’s head. He acknowledges that normalizing (or rebasing, as he puts it) would have resulted in a similar picture.

    A graph that depicts the dynamics of the two metrics, brought to the same scale

    Screenshot from the article https://t.co/UYVqZpSzdS (Financial Times)

    But

    However, there is one problem with rebasing, which is that it does not take account of the fact that a stock market is naturally more variable than a foreign exchange market. Eye-balling this chart quickly, the main message might be that the S&P was falling faster than the euro against the yen. The more important point was that the two were as correlated as ever. Both stock markets and foreign exchange carry trades were suffering grievous losses, and they were moving together — even if the S&P was moving a little faster.
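    Rebasing is simple to sketch: divide each series by its first value and multiply by 100, so both fit on a single axis. The data below is synthetic (random walks standing in for the stock index and the exchange rate), and the volatilities are my arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: a "stock index" with larger daily moves than the "FX rate".
spx = 2700 * np.cumprod(1 + rng.normal(0, 0.010, 250))
eurjpy = 130 * np.cumprod(1 + rng.normal(0, 0.003, 250))

# Rebase both series to 100 at the first observation.
rebased_spx = 100 * spx / spx[0]
rebased_eurjpy = 100 * eurjpy / eurjpy[0]

# Both now start at 100 and share one Y axis, yet, as the commentator notes,
# the naturally higher volatility of the stock index remains visible.
print(round(rebased_spx[0], 6), round(rebased_eurjpy[0], 6))
```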

    I am not a financial expert, so I can’t find an easy alternative that would provide the insight John Authers is looking for while satisfying my purist desire to refrain from using double Y axes. The real question, however, is whether such an alternative is something one should be looking for. In many fields, double scales are the accepted language. Thanks to that standard language, many domain experts are able to exchange ideas and discover novel insights. Reinventing the language might bring more harm than good. Thus, my current recommendations regarding double scales are:

    Avoid double scales when possible, unless it’s a commonly accepted practice. In that case, be persistent and don’t lie.

    February 5, 2018 - 3 minute read -
    best-practice data visualisation Data Visualization dataviz double-scale opinion blog
  • What is the best way to collect feedback after a lecture or a presentation?


    February 4, 2018

    I consider teaching and presenting an integral part of my job as a data scientist. One way to become better at teaching is to collect feedback from the learners. I have tried different ways of collecting feedback: passing around a questionnaire, Polldaddy surveys, Google forms, or simply asking (no, begging) the learners to send me an e-mail with their feedback. Nothing really worked. The response rate was pretty low. Moreover, most of the feedback was a useless set of responses such as “it was OK”, “thank you for your time”, “really enjoyed it”. You can’t translate this kind of feedback into any action.

    Recently, I figured out how to collect the feedback correctly. My recipe consists of three simple ingredients.

    Collecting feedback. The recipe.

    working time: 5 minutes

    Ingredients

    * Open-ended mandatory questions: 1 or 2
    * Post-it notes: 1-2 per learner
    * Preventive amnesty: to taste
    

    Procedure

    Our goal is to collect constructive feedback. We want to improve and, thus, are mainly interested in the aspects that didn’t work well. In other words, we want the learners to provide constructive criticism. Sometimes we may also learn from things that worked well. You should decide whether you have enough time to ask for positive feedback. If your time is limited, skip it. Criticism is more valuable than praise.

    Pass post-it notes to your learners.

    Next, start with preventive amnesty, followed by mandatory questions, followed by another portion of preventive amnesty. This is what I say to my learners.

    [Preventive amnesty] Criticising isn’t easy. We all tend to see criticism as an attack and to react accordingly. Nobody likes to be attacked, and nobody likes to attack. I know that you mean well. I know that you don’t want to attack me. However, I need to improve.

    [Mandatory question] Please write at least two things you would improve about this lecture/class. You cannot pass on this task. You are not allowed to say “everything is OK”. You will not leave this room unless you hand me a post-it note with the two things you liked the least about this class/lecture.

    [Preventive amnesty] I promise, I know that you mean well. You are not attacking me; you are giving me a chance to improve.

    That’s it.

    When I teach using the Data Carpentry methods, each of my learners already has two post-it notes that they use to signal whether they are done with an assignment (green) or are stuck with it (red). In these cases, I ask them to use these notes to fill in their responses – one post-it note for the positive feedback, and another one for the criticism. It always works like a charm.

    A pile of green and red post-it notes with feedback on them

    February 4, 2018 - 2 minute read -
    data science feedback presentation presenting teaching blog
  • Data is the new

    Data is the new

    February 3, 2018

    I stumbled upon a rant titled Data is not the new oil — Tech Insights

    You’ve heard it many times and so have I: “Data is the new oil” Well it isn’t. At least not yet. I don’t care how I get oil for my car or heating. I simply decide what to cook and where to drive when I want. I’m unconcerned which mechanism is used to refine oil […]

    Funny, in my own rant, “data is not the new gold”, I claimed that “oil” was a better analogy for data than gold. Obviously, any “X is the new Y” sentence is problematic, but it’s still funny how much we like them.

    February 3, 2018 - 1 minute read -
    data science rant blog
  • Yes, your friends are more successful than you are. On "The Majority Illusion in Social Networks"

    Yes, your friends are more successful than you are. On "The Majority Illusion in Social Networks"

    February 2, 2018

    Recently, I re-read “The Majority Illusion in Social Networks” (by Lerman, Yan and Wu).

    The starting point of this paper is the friendship paradox – a situation in which a node in a network has fewer friends than its friends have, on average. The authors expand this paradox to what they call “the majority illusion” – a situation in which a node may observe that the majority of its friends have a particular property, despite that property being rare in the entire network.

    An illustration of the “majority illusion” paradox. The two networks are identical, except for which three nodes are colored. These are the “active” nodes and the rest are “inactive.” In the network on the left, all “inactive” nodes observe that at least half of their neighbors are “active,” while in the network on the right, no “inactive” node makes this observation.

    Besides pointing out the existence of the majority illusion phenomenon, the authors used synthetic networks to characterize the situations in which it is most prevalent.

    Quoting the authors:

    the paradox is stronger in networks in which the better-connected nodes are active, and also in networks with a heterogeneous degree distribution. […] The paradox is strongest in networks where low degree nodes have the tendency to connect to high degree nodes. […] Activating the high degree nodes in such networks biases the local observations of many nodes, which in turn impacts collective phenomena

    The conditions listed in the quote above describe many known social networks. The last sentence in that quote is of special interest. It explains the contagious nature of many actions, from sharing a meme to buying a new car.
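
    The paradox is easy to reproduce in a toy network. The sketch below (my own illustration, not the authors’ code) builds a small star network in which only the well-connected hub is “active”: the property is rare in the network as a whole, yet every inactive node sees it in 100% of its neighbors.

```python
# A tiny star network: the hub (node 0) is connected to every other node.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Only the hub is "active" -- a rare property (20% of the network).
active = {0}

# Fraction of active neighbors, as observed by each inactive node.
observed = {
    node: sum(n in active for n in neighbors[node]) / len(neighbors[node])
    for node in neighbors if node not in active
}
print(observed)  # every inactive node sees 100% active neighbors
```

    This is exactly the configuration the authors describe: a high-degree node that is active biases the local observations of many low-degree nodes at once.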

    February 2, 2018 - 2 minute read -
    life social-network-analysis blog
  • Analysis of A Beautiful Storm: Internal Communication at Automattic

    Analysis of A Beautiful Storm: Internal Communication at Automattic

    February 2, 2018

    My teammate’s post on data.blog

    February 2, 2018 - 1 minute read -
    blog
  • Gender salary gap in the Israeli high-tech

    Gender salary gap in the Israeli high-tech

    February 1, 2018

    A large and popular Israeli Facebook group, “The High-Tech Troubles,” has recently surveyed its participants. The responders provided personal, demographic, and professional information. The group owners have published the aggregated results of that survey. In this post, I analyze a particular aspect of these findings, namely, how the responders’ gender and experience affect their salary. It is worth noting that this survey is by no means a representative one. Its most noticeable, but not its only, problem is participation bias. Another problem is the fact that the result tables do not contain any information about the number of responders in each group. Without this information, it is impossible to compute confidence intervals for any findings. Despite these problems, the results are interesting and worth noting.

    The data that I used in my analysis is available in this spreadsheet. The survey organizers promise that they excluded groups and categories with too few answers, and we have to trust them on that. The results are divided into twenty professional categories such as ‘Account Management,’ ‘Data Science,’ ‘Support,’ and ‘CXO’ (which stands for an executive role). The salary groups are organized in exponential bins according to the years of experience: 0–1, 1–2, 2–4, 4–7, and more than seven years. Some of the cell values are missing; I assume that these are the categories with too few responders. I took a look at the gap between the salary reported by women and the compensation reported by men.

    Let’s take a look at the most complete set of data – the groups of people with 1-2 years of experience. As we may see from the figure below, in thirteen out of twenty groups (65%), women get less money than men.
    Gender compensation gap, 1-2 years of experience. Women earn less in 13 of 20 categories

    Among the workers with 1–2 years of experience, the most discriminatory fields are executive roles and security research. It is interesting to note the difference between two closely related fields: Data Science and BI/Data Analysis. The former is considered the more lucrative position. On average, male data scientists get 11% more than their female colleagues, while male data analysts get 13% less than their female counterparts. I wonder how this difference relates to my (very limited) observation that most of the people who call themselves BI experts are women, while most of the data scientists whom I know are men.
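
    For concreteness, this is the computation behind the percentages above: the gap is the men’s reported salary relative to the women’s, per category. The salary figures below are made up to mirror the two quoted percentages; the survey’s actual numbers live in the linked spreadsheet.

```python
# Hypothetical salaries (thousands of ILS), chosen only to reproduce
# the two percentages quoted above -- not the survey's real figures.
salaries = {
    "Data Science": {"men": 30.0, "women": 27.0},
    "BI/Data Analyst": {"men": 20.0, "women": 23.0},
}

# Gap as the men's premium (or penalty) relative to the women's salary.
gaps = {
    category: 100.0 * (pay["men"] - pay["women"]) / pay["women"]
    for category, pay in salaries.items()
}
for category, gap in gaps.items():
    print(f"{category}: men report {gap:+.0f}% relative to women")
```
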

    As we have seen, there is not much gender equality for young professionals. What happens when people gain experience? How does the gender compensation gap look after more than seven years of professional life? The situation is even worse. In fourteen out of sixteen available fields, women get less money than men. The only field in which it pays to be a woman is the executive roles, where women get 19% more than men.

    Gender compensation gap, more than 7 years of experience. Women earn less in 14 of 16 categories

    To complete the picture, let’s look at the gap dynamics over the years in all the occupation fields in that report.

    Gender gap dynamics. 20 professional fields over different experience bins

    What do we learn from these findings?

    These findings are real. We cannot use the non-representativity of these data and the lack of confidence intervals to dismiss them. I don’t expect the participants to lie, nor do I expect the participation patterns to introduce a bias strong enough to reverse the results. It is true that I can’t obtain confidence intervals for these results. However, the fact that the vast majority of the groups lie on one side of the equality line suggests the overall validity of the gender gap notion. How can we fix this gap? I frankly don’t know. As a father of three daughters (9, 12, and 14 years old), I talk to them about this gap. I make sure they are aware of this problem so that, when it’s their turn to negotiate compensation, they know about the systematic bias. I hope that this knowledge will give them the right tools to fight for justice.

    February 1, 2018 - 3 minute read -
    Data Visualization gender gender-inequality Israel salary work blog
  • Don't take career advice from people who mistreat graphs this badly

    Don't take career advice from people who mistreat graphs this badly

    January 4, 2018

    Recently, I stumbled upon a report called “Understanding Today’s Chief Data Scientist” published by an HR company called Heidrick & Struggles. This document tries to draw a profile of the modern chief data scientist in today’s Big Data Era. This document contains the ugliest pieces of data visualization I have ever seen. I can’t think of a more insulting graphical treatment of data. Publishing graphs like these in a document that tries to discuss careers in data science is like writing a profile of a papal candidate and accompanying it with pornographic pictures.

    Before explaining my harsh attitude, let’s first ask an important question.

    What is the purpose of graphs in a report?

    There are only two valid reasons to include graphs in a report. The first reason is to provide a meaningful glimpse into the document. Before a person decides whether he or she wants to read a long document, they want to know what it is about, what methods were used, and what the results are. The best way to engage the potential reader is to provide them with a set of relevant graphs (a good abstract or introduction paragraph helps too). The second reason to include graphs in a document is to provide details that cannot be effectively communicated by text-only means.

    That’s it! Only two reasons. Sometimes, we might add an illustration or two to decorate a long piece of text. Adding illustrations might be a valid decision, provided that they do not compete with the data and it is obvious to any reader that an illustration is an illustration.

    Let the horror begin!

    The first graph in the H&S report struck me with its absurdity.

    Example of a bad chart. I have no idea what it means

    At first glance, it looks like an overly artistic doughnut chart. Then, you try to understand what you are looking at. “OK,” you say to yourself, “there were 100 employees who belonged to five categories. But what are those categories? Can someone tell me? Please?” Maybe the report references this figure with more explanations? Nope. Nothing. This is just a doughnut chart without a caption or a title. Without a meaning.

    I continued reading.

    Two more bad charts. The graphs are meaningless!

    OK, so the H&S geniuses decided to hide the origin of their bar charts. Had they been students in a dataviz course I teach, I would have given them a zero. Ooookeeyy, it’s not a college assignment; as long as we can reconstruct the meaning from the numbers and the labels, we are good, right? I tried to do just that and failed. I tried to use the numbers in the text to help me fill in the missing information and failed. All in all, these two graphs are meaningless graphical junk, exactly like the first one.

    The fourth graph gave me some hope.

    Not an ideal pie chart but at least we can understand it

    Sure, this graph will not get the “best dataviz” award, but at least I understand what I’m looking at. My hope was premature. The next graph was as nonsensical as the first three.

    Screenshot with an example of another nonsense graph

    Finally, the report authors decided that it wasn’t enough to draw smart-looking color segments enclosed in a circle. They decided to add some cool-looking lines. The authors remained faithful to their decision not to let any meaning into their graphical aids.

    Screenshot with an example of a nonsense chart

    Can’t we treat these graphs as illustrations?

    Before co-founding the life-changing StackOverflow, Joel Spolsky was, among other things, an avid blogger. His blog, JoelOnSoftware, was the first blog I started following. Joel writes mostly about the programming business. In order not to intimidate the readers with endless text blocks, Joel tends to break the text with illustrations. In many posts, Joel uses pictures of a cute Husky as an illustration. Since JoelOnSoftware isn’t a cynology blog, nobody gets confused by the sudden appearance of a Husky. Which is exactly what an illustration is: a graphical relief that doesn’t disturb. But what would happen if Joel decided to include a meaningless class diagram? Sure, a class diagram may impress the readers. The readers will also want to understand it and its connection to the text. Once they fail, they will feel angry, and rightfully so.

    Two screenshots of Joel's blog. One with a Husky, another one with a meaningless diagram

    The bottom line

    The bottom line is that people have to respect the rules of the domain they are writing about. If they don’t, their opinion cannot be trusted. That is why you should not take any advice related to data (or science) from H&S. Don’t get me wrong: it’s OK not to know the “grammar” of every possible business domain. I, for example, know nothing about photography or dancing; my English is far from perfect. That is why I don’t write about photography, dancing, or creative writing. I write about data science and visualization. It doesn’t mean I know everything about these fields. However, I did study a lot before I decided I could write something without ridiculing myself. So should everyone.

    January 4, 2018 - 4 minute read -
    best-practice career critique data science Data Visualization dataviz blog Career advice
  • AI and the War on Poverty, by Charles Earl

    AI and the War on Poverty, by Charles Earl

    January 2, 2018

    It’s such a joy to work with smart and interesting people. My teammate, Charles Earl, wrote a post about machine learning and poverty. It’s not short, but it’s worth reading.

    A.I. and Big Data Could Power a New War on Poverty is the title of an op-ed in today’s New York Times by Elisabeth Mason. I fear that AI and Big Data is more likely to fuel a new War on the Poor unless a radical rethinking occurs. In fact this algorithmic War on the Poor […]

    via AI and the War on Poverty — Charlescearl’s Weblog

    January 2, 2018 - 1 minute read -
    ai artificial-intelligence machine learning blog
  • Two heads are better than one, or how not to abandon your blog

    Two heads are better than one, or how not to abandon your blog

    December 31, 2017

    The recording of my presentation at WordCamp Moscow (August 2017) is finally available online: Two Heads are Better Than One – on blogging persistence (Russian)

    https://videopress.com/v/QEUQ1aKw

    December 31, 2017 - 1 minute read -
    blogging persistence presentation research russian video blog