• Do New Year's Resolutions Work? Data Suggests They Do!


    December 26, 2017

    My latest post on https://data.blog. I enjoyed preparing it, and I like its results very much. Happy New Year, everyone.

    December 26, 2017 - 1 minute read -
    blog
  • The Keys to Effective Data Science Projects — Operationalize


    December 20, 2017

    Recently, I’ve stumbled upon an interesting series of posts about effective management of data science projects. One of the posts in the series says:

    “Operationalization” – a term only a marketer could love. It really just means “people using your solution”.

    The main claim of that post is that, at some point, bringing actual users to your data science project may be more important than improving the model. This is exactly what I meant in my “when good enough is good enough” post (also available on YouTube).

    December 20, 2017 - 1 minute read -
    blog data science project-management Career advice
  • We're Reading About Artificially Intelligent Harry Potter Fan Fiction, Verifying Online Identities, and More


    December 19, 2017
    December 19, 2017 - 1 minute read -
    blog
  • Buzzword shift


    December 18, 2017

    Many years ago, I tried to build something that today would have been called “Google Trends for Pubmed”. One thing I found during that process was how the emergence of HIV-related research reduced the number of cancer studies, and how, several years later, the HIV research boom settled down and let cancer research regain its place.

    I recalled that project of mine when I took a look at the Google Trends data for the once-popular buzz-phrases “data mining” and “pattern recognition”. Sic transit gloria mundi.

    Screenshot of Google Trends data for (in decreasing order): "Machine Learning", "Data Science", "Data Mining", "Pattern Recognition"

    It’s not surprising that “Data Science” was the least popular term in 2004. As I already mentioned, “Data Science” is a relatively new term. What does surprise me is the fact that in the past, “Machine Learning” was so much less popular than “Data Mining”. Even more surprising is the fact that Google Trends ranks “Machine Learning” almost twice as high as “Data Science”. I was expecting to see the opposite.

    “Pattern Recognition”, which in 2004 was as (un)popular as “Machine Learning”, has become even less popular today. Does that mean that nobody is searching for patterns anymore? Not at all. The 2004 pattern recognition experts are now senior data scientists or, if they work in academia, machine learning professors.

    PS: does anybody know the reason behind the apparent seasonality in “Data Mining” trends?

    December 18, 2017 - 1 minute read -
    data-mining data science machine learning pattern-recognition trend blog
  • On alert fatigue 


    December 17, 2017

    I developed an anomaly detection system for Automattic’s internal dashboards. When presenting this system (“When good enough is just good enough”), I used to say that in our particular case, the cost of false alerts was almost zero. I explained this claim by the fact that no automatic decisions were made based on the alerts, and that the only subscribers to the alert messages were a limited group of my colleagues. Automattic’s CFO, Stu West, who was the biggest stakeholder in this project, asked me to stop making the “zero cost” claim. When the CFO of the company you work for asks you to do something, you comply. So, I stopped saying “zero cost”, but I still listed the error costs as a problem I could safely ignore for the time being. I didn’t fully believe Stu, which is evident from the speaker notes of my presentation deck:

    Screenshot of the presentation speaker notes.

    My speaker notes. Note how “error costs” was the first problem I dismissed.

    I recalled Stu’s request to stop talking about the “zero cost” of false alerts today, when I noticed more than 10 unread messages in the Slack channel that receives my anomaly alerts. The oldest unread message was two weeks old. The only reason this could happen is that I stopped caring about the alerts because there were too many of them. I witnessed the classical case of “alert fatigue”, described many centuries ago in “The Boy Who Cried Wolf”.

    The lesson of this story is that there is no such thing as a zero-cost false alarm. Lack of specificity is a serious problem.

    Screenshot: me texting Stu that he was right

    Feature image by Ray Hennessy

    December 17, 2017 - 2 minute read -
    a2f2 alert anomaly-detection data science fatigue machine learning blog
  • What's the most important thing about communicating uncertainty?


    December 14, 2017

    Sigrid Keydana, in her post “Plus/minus what? Let’s talk about uncertainty (talk)” on the Recurrent Null blog, said:

    What’s the most important thing about communicating uncertainty? You’re doing it

    Really?

    Here, for example, is a graph from a blog post:

    Thousands of random-looking points. From https://myscholarlygoop.wordpress.com/2017/11/20/the-all-encompassing-figure/

    The graph clearly “communicates” the uncertainty but does it really convey it? Would you consider the lines and their corresponding confidence intervals very uncertain had you not seen the points?

    What if I tell you that there’s a 30% Chance of Rain Tomorrow? Will you know what it means? Will a person who doesn’t operate on numbers know what it means? The answer to both these questions is “no”, as was shown by Gigerenzer and his collaborators in a 2005 paper.

    Screenshot: many images for the 2016 US elections

    Communicating uncertainty is not a new problem. Until recently, the biggest “clients” of uncertainty communication research were the weather forecasters. However, the recent “data era” introduced uncertainty to every aspect of our personal and professional lives. From credit risk to insurance premiums, from user classification to content recommendation, uncertainty is everywhere. Simply “doing” uncertainty communication, as Sigrid Keydana of the Recurrent Null blog suggested, isn’t enough. The huge public surprise caused by the 2016 US presidential election is the best evidence for that. Proper uncertainty communication is a complex topic, and a good starting point to it is the paper Visualizing Uncertainty About the Future by David Spiegelhalter.

    December 14, 2017 - 2 minute read -
    best-practice data science Data Visualization dataviz uncertainty blog
  • Doing the Math on Key Words and Top Level Domains


    December 12, 2017

    My post on data.blog

    December 12, 2017 - 1 minute read -
    blog
  • The Y-axis doesn't have to be on the left


    December 10, 2017

    Line charts are great at conveying the evolution of a variable over time. This is a typical chart. It has three key components: the X-axis that represents the time, the Y-axis that represents the tracked value, and the line itself.

    A typical line chart. The Y-axis is on the left

    Usually, you will see the Y-axis at the left part of the graph. Unless you design for a right-to-left language environment, placing the Y-axis on the left makes perfect sense. However, a left-side Y-axis isn’t a hard rule.

    In many cases, more importance is given to the most recent data point. For example, it might be interesting to know a stock’s price dynamics, but today’s price is what determines how much money I can get by selling my stock portfolio.

    What happens if we move the axis to the right?

    A slightly improved version. The Y-axis is on the right, adjacent to the most recent data point

    Now, today’s price of XYZ stock is visible more clearly. Let’s make the most important values explicitly clear:

    The final version. The Y-axis is on the right, adjacent to the most recent data point. The axis ticks correspond to actual data points

    There are two ways to obtain a right-sided Y-axis in matplotlib. The first way uses a combination of

    [code language="python"]
    ax.yaxis.tick_right()
    ax.yaxis.set_label_position("right")
    [/code]

    The second one creates a “twin X” axis and makes sure the first axis is invisible. It might seem that the first option is easier. However, when combined with seaborn’s despine function, strange things happen. Thus, I use the second option. Following is the code that I used to create the last version of the graph.

    [code language="python"]
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    np.random.seed(123)
    days = np.arange(1, 31)
    price = (np.random.randn(len(days)) * 0.1).cumsum() + 10

    fig = plt.figure(figsize=(10, 5))
    ax = fig.gca()
    ax.set_yticks([])  # Make 1st axis ticks disappear
    ax2 = ax.twinx()  # Create a secondary axis
    ax2.plot(days, price, '-', lw=3)
    ax2.set_xlim(1, max(days))
    sns.despine(ax=ax, left=True)  # Remove 1st axis spines
    sns.despine(ax=ax2, left=True, right=False)
    tks = [min(price), max(price), price[-1]]
    ax2.set_yticks(tks)
    ax2.set_yticklabels([f'min:\n{tks[0]:.1f}', f'max:\n{tks[1]:.1f}', f'{tks[-1]:.1f}'])
    ax2.set_ylabel('price [$]', rotation=0, y=1.1, fontsize='x-large')
    ixmin = np.argmin(price)
    ixmax = np.argmax(price)
    ax2.set_xticks([1, days[ixmin], days[ixmax], max(days)])
    ax2.set_xticklabels(['Oct, 1', f'Oct, {days[ixmin]}', f'Oct, {days[ixmax]}', f'Oct, {max(days)}'])
    ylm = ax2.get_ylim()
    bottom = ylm[0]
    for ix in [ixmin, ixmax]:
        y = price[ix]
        x = days[ix]
        ax2.plot([x, x], [bottom, y], '-', color='gray', lw=0.8)
        ax2.plot([x, max(days)], [y, y], '-', color='gray', lw=0.8)
    ax2.set_ylim(ylm)
    [/code]

    Next time you create a “something” vs. time graph, ask yourself whether the last available point has a special meaning to the viewer. If it does, consider moving the Y-axis to the right side of your graph and see whether it becomes more readable.

    This post was triggered by a nice write-up, Plotting a Course: Line Charts, by a new blogger named David (he didn’t mention his last name) from https://thenumberist.wordpress.com/

    December 10, 2017 - 3 minute read -
    Data Visualization dataviz matplotlib python blog
  • Epitaphs in the Graveyard of Mathematics


    December 2, 2017

    The excellent Ben Orlin wrote a hilarious post with fictitious tombstones of famous mathematicians. Here is just one example:

    I decided to jump on that bandwagon

    Not the real Paul Erdős tombstone.

    December 2, 2017 - 1 minute read -
    erdos fun tombstone blog
  • The fastest way to get first N items in each group of a Pandas DataFrame


    November 27, 2017

    In my work, the speed of code writing and reading is usually more important than the speed of its execution. Right now, I’m facing the challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One such time-consuming step involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking the first N elements of each group. The tables in this step are pretty small, not more than one hundred elements each. But since I have to perform this step many times, the running time accumulates to a substantial fraction of the total.

    Let’s first construct a toy example

    [code lang="python"]
    import numpy as np
    import pandas as pd

    N = 100
    K = 3  # number of rows to keep in each group
    x = np.random.randint(1, 5, N)
    y = np.random.rand(N)
    d = pd.DataFrame(dict(x=x, y=y))
    [/code]

    I’ll use the %%timeit cell magic, which runs a Jupyter cell many times and measures the time it takes to run the code.

    [code lang="python"]
    %%timeit
    d.groupby('x').apply(lambda t: t.head(K)).reset_index(drop=True)
    [/code]

    This is the output:

    3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    I suspected that head() was not the most efficient way to take the first rows. I tried .iloc

    [code lang="python"]
    %%timeit
    d.groupby('x').apply(lambda t: t.iloc[0:K]).reset_index(drop=True)
    [/code]

    2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    A 10% improvement. Not bad, but not excellent either. Then I realized that Pandas groupby objects have their own head function

    [code lang="python"]
    %%timeit
    d.groupby('x').head(K).reset_index(drop=True)
    [/code]

    674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    674 microseconds instead of 3.2 milliseconds. The improvement is by almost a factor of five!

    It’s not enough to have the right tool; it’s important to be aware of it and to use it right. I wonder whether there is an even faster way to do this job.
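    One variant worth timing (my own sketch, not from the measurements above): since the real task also sorts each group by a score column, sorting the whole frame once and then calling the fast groupby head combines both steps:

    ```python
    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N, K = 100, 3
    d = pd.DataFrame(dict(x=np.random.randint(1, 5, N), y=np.random.rand(N)))

    # One global sort by the score column, then the fast groupby head:
    # within each group the rows keep the sorted order, so head(K)
    # returns each group's K highest scores.
    top = d.sort_values('y', ascending=False).groupby('x').head(K)
    ```

    Whether this beats sorting inside apply is, of course, something to verify with %%timeit on your own data.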

    November 27, 2017 - 2 minute read -
    data science optimization pandas python blog
  • How to make a graph less readable? Rotate the text labels


    November 23, 2017

    This is my “because you can” rant.

    Here, you can see a typical situation. You have some sales data that you want to represent using a bar plot.

    01_default

    Immediately, you notice a problem: the names on the X-axis are not readable. One way to make the labels readable is to enlarge the graph.

    02_large_image

    Making larger graphs isn’t always possible. So, the next default solution is to rotate the text labels.

    03_rotated

    However, there is a problem: rotated text is read more slowly than standard horizontal text. Don’t believe me? This is not an opinion but rather a result of empirical studies [ref], [ref]. Sometimes, rotated text is unavoidable. Most of the time, it is not.

    So, how do we make sure all the labels are readable without rotating them? One option is to move them up and down so that they don’t hinder each other. This is easily achieved with Python’s matplotlib

    [code language="python"]
    plt.bar(range(len(people)), sales)
    plt.title('October sales')
    plt.ylabel('$US', rotation=0, ha='right')
    ticks_and_labels = plt.xticks(range(len(people)), people, rotation=0)
    for i, label in enumerate(ticks_and_labels[1]):
        label.set_y(label.get_position()[1] - (i % 2) * 0.05)
    [/code]

    (Note that I also rotated the Y-axis label, for even more readability.)

    05_alternate_labels

    Another approach, one that works with even longer labels and requires fewer lines of code, is to rotate the bars, not the labels.

    07_horizontal_plot

    … and if you don’t have a compelling reason for the data order, you might also consider sorting the bars. Doing so will not only make the graph prettier, it will also make it easier to compare similar values. Use the graph above to tell whether Teresa Jackson’s sales were higher or lower than Marie Richardson’s. Now do the same comparison using the graph below.

    08_horizontal_plot_sorted
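    The horizontal, sorted variant takes only a few lines of matplotlib. A minimal sketch (the names and sales numbers here are made up for illustration):

    ```python
    import matplotlib
    matplotlib.use('Agg')  # render off-screen; not needed in a notebook
    import matplotlib.pyplot as plt

    # Hypothetical sales data for illustration
    people = ['Teresa Jackson', 'Marie Richardson', 'Steven Hall', 'Laura King']
    sales = [31200, 30900, 45800, 22400]

    # Sort once, then plot horizontal bars: the labels stay horizontal
    order = sorted(range(len(people)), key=lambda i: sales[i])
    plt.barh(range(len(people)), [sales[i] for i in order])
    plt.yticks(range(len(people)), [people[i] for i in order])
    plt.xlabel('$US')
    plt.title('October sales')
    ```

    With barh, long names never collide: each label sits on its own row, reading left to right.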

    To sum up: the fact that you can does not mean you should. Sometimes, rotating text labels is the easiest solution, but the additional effort needed to decipher the graph is the price your audience pays for your laziness. They might as well skip your graphs, and then your message won’t stick.

    This was my because you can rant.

    Featured image by Flickr user gullevek

    November 23, 2017 - 2 minute read -
    because you can best-practice Data Visualization dataviz blog
  • On machine learning, job security, professional pride, and network trolling


    November 21, 2017

    If you are a data scientist, I am sure you have wondered whether deep neural networks will replace you at your job one day. Every time I read reports of researchers who managed to trick neural networks, I wonder whether the researchers were thinking about their job security or their professional pride while performing the experiments. I think that the first example of such a report is a 2014 paper by Christian Szegedy and his colleagues called “Intriguing properties of neural networks”. The main goal of this paper, so it seems, was to peek into the black box of neural networks. In one of the experiments, the authors designed minor, invisible perturbations of the original images. These perturbations diminished the classification accuracy of a trained model.

    Screen Shot 2017-11-21 at 16.50.05.png

    In the recent post “5 Ways to Troll Your Neural Network” Ben Orlin describes five different ways to “troll a network”.

    Image credit: Figure 5 from “Intriguing properties of neural networks”.

    November 21, 2017 - 1 minute read -
    data science deep-learning job-security machine learning neural-networks blog
  • Interactive Network Visualization in Python with NetworkX and PyQt5 Tutorial


    November 20, 2017

    Unfortunately, there is no widely accepted, ready-to-use, standard way to interactively visualize networks in Python. The following post shows yet another attempt to build an ad-hoc app.

    November 20, 2017 - 1 minute read -
    blog
  • Which of these two pictures should I choose for my gravatar?


    November 16, 2017

    Which of these two pictures should I choose for my gravatar?

    Screen Shot 2017-11-10 at 20.49.24.png

    Both were taken by Luca Sartoni

    November 16, 2017 - 1 minute read -
    gravatar photo question blog
  • We're Reading About Simplifying Without Distortion and Adversarial Image Classification


    November 15, 2017

    Weekly reading list from the data.blog team

    November 15, 2017 - 1 minute read -
    blog
  • Another set of ruthless critique pieces


    November 15, 2017

    You know that I like reading ruthless critiques of others’ work – I like telling myself that by doing so I learn good practices (in reality, I suspect I’m just a case of what we call in Hebrew שמחה לאיד – joy at someone else’s failure).

    Anyhow, I’d like to share a set of posts by Lior Pachter in which he calls bullshit on several reputable people and concepts. Calling bullshit is easy. Doing so with solid arguments is not. Lior Pachter worked hard to justify his opinions.

    * The network nonsense of Albert-László Barabási. Albert-László Barabási is a renowned network scientist. There’s a network model named after him. Some people claim that prof. Barabási will receive the Nobel prize one day.
    * The network nonsense of Manolis Kellis. Published one day after “The network nonsense of Albert-László Barabási”, this post critiques another renowned scientist. Again, with a lot of solid-sounding arguments.
    * When average is not enough: part II. (“Where is part I?”, you may ask. Read the post to discover.)

    Unfortunately, I don’t publish academic papers. But if I ever do, I will definitely want prof. Pachter to read them and let the world know what he thinks, for good and for bad.

    Speaking of calling bullshit: believe it or not, the University of Washington has a course with this exact title. The course is available online at http://callingbullshit.org/ and is worth watching. I watched all the course’s videos during my last flight from Canada to Israel. The featured image of this post is a screenshot of the course’s homepage.

    November 15, 2017 - 2 minute read -
    barabasi critique patcher research social-network-analysis blog
  • Good information + bad visualization = BAD


    November 14, 2017

    I went through my Machine Learning tag feed and stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey actual information. The problem is that it does so badly, in so many ways. It is very hard to appreciate the information in a post that shows charts like that, especially when the post talks about data science, a field that relies so much on data visualization.

    via Math required for machine learning — Youth Innovation

    I would write a post about good practices in pie charts, but Robert Kosara of https://eagereyes.org does this so well that I don’t think I need to reinvent the wheel. Pie charts can be very powerful in conveying information; make sure you use the tool well. I strongly suggest reading everything Robert Kosara has to say on this topic.

    November 14, 2017 - 1 minute read -
    bad-practice best-practice data science Data Visualization dataviz machine learning pie-chart blog
  • What are the best practices in planning & interpreting A/B tests?


    November 13, 2017

    Compiled by my teammate Yanir Seroussi, the following is a reading list on A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s. The reviews are mine. Collective intelligence in action :-)

    * [If you don’t pay attention, data can drive you off a cliff](https://yanirseroussi.com/2016/08/21/seven-ways-to-be-data-driven-off-a-cliff/)   In this post, Yanir lists seven common mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir's points are trivial doesn't make them less correct. Awareness doesn't exist without knowledge. Unfortunately, knowledge doesn't assure awareness. Which is why reading trivial truths is a good thing to do from time to time.
    * [How to identify your marketing lies and start telling the truth](https://www.linkedin.com/pulse/how-identify-your-marketing-lies-start-telling-truth-tiberio-caetano)   This post was written by Tiberio Caetano, a data science professor at the University of Sydney. If I had to summarize this post with a single phrase, it would be "confounding factors". A confounding variable is a variable hidden from your eye that influences a measured effect. One example of a confounding variable: you start an ad campaign for ice cream, your sales go up, and you conclude that the ad campaign was effective. What you forgot was that the ad campaign started at the beginning of the summer, when people start buying more ice cream anyhow.   See [this link](https://onlinecourses.science.psu.edu/stat507/node/34) for a detailed textbook-quality review of confounding variables.
    * [Seven rules of thumb for web site experimenters](http://www.exp-platform.com/Documents/2014%20experimentersRulesOfThumb.pdf)   I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
    * [A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments](http://exp-platform.com/Documents/2017-08%20KDDMetricInterpretationPitfalls.pdf)   Another academic paper by Microsoft researchers. This one lists a lot of "dont's". Like in the previous link, every advice the authors give is based on established theory and backed up by real data.
    
    November 13, 2017 - 2 minute read -
    a-b-testing advice best-practice data science statistics stats blog
  • How to make a racist AI without really trying (a reblog)


    November 10, 2017

    Perhaps you heard about Tay, Microsoft’s experimental Twitter chat-bot, and how within a day it became so offensive that Microsoft had to shut it down and never speak of it again. And you assumed that you would never make such a thing, because you’re not doing anything weird like letting random jerks on Twitter re-train […]

    via How to make a racist AI without really trying — ConceptNet blog

    November 10, 2017 - 1 minute read -
    blog
  • Please leave a comment on this post 


    November 9, 2017

    Please leave a comment on this post. It doesn’t matter what you want to write. It can be short or long. Any comment. I need to know that humans read this blog. If you feel really generous, tell me how you found this blog and what you think of it.

    November 9, 2017 - 1 minute read -
    перекличка blog
  • Data Science or Data Hype?


    November 8, 2017

    In his blog post Big Data Or Big Hype?, Rick Ciesla asks whether the “Big Data” phenomenon is “a real thing” or just hype. I must admit that, until recently, I was sure that the term “Data Science” was hype too – an overbroad term for various engineering and scientific activities. As time passes, I become more and more confident that Data Science is maturing into a separate profession. I haven’t yet decided whether the word “science” is fully appropriate in this case.

    We have certainly heard a lot about Big Data in recent years, especially with regards to data science and machine learning. Just how large of a data set constitutes Big Data? What amount of data science and machine learning work involves truly stratospheric volumes of bits and bytes? There’s a survey for that, courtesy of […]

    via Big Data Or Big Hype? — VenaData

    November 8, 2017 - 1 minute read -
    big-data data science opinion blog
  • Do you REALLY need the colors?


    November 7, 2017

    Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Look at this example from the seaborn documentation site

    >>> import seaborn as sns
    >>> sns.set_style("whitegrid")
    >>> tips = sns.load_dataset("tips")
    >>> ax = sns.barplot(x="day", y="total_bill", data=tips)
    

    Barplot example with colored bars

    This example shows the default barplot. Can you see how easy it is to add colors to the different columns? But WHY? What do those colors represent? It looks like the only information encoded by the color is the bar category. We already have this information in the form of the bar location. Having this colorful image adds nothing but a distraction. It is sad that this is the default behavior that the seaborn developers decided to adopt.

    Look at the same example, without the colors

    >>> ax = sns.barplot(x="day", y="total_bill", color='gray', data=tips)
    

    Barplot example with gray bars

    Isn’t it much better? The sad thing is that the better version requires memorizing additional arguments and more typing.

    This was my because you can rant.

    November 7, 2017 - 1 minute read -
    bar plot because you can before-after colors Data Visualization dataviz blog
  • Numpy vs. Pandas: functions that look the same, share the same code but behave differently


    November 6, 2017

    I can’t imagine what my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason is that, under the hood, pandas uses numpy. This similarity is very convenient, as it allows passing numpy arrays to many pandas functions and vice versa. However, sometimes it stabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.

    Let’s create a numpy vector with a single element in it:

    >>> import numpy as np
    >>> v = np.array([3.14])
    

    Now, let’s compute the standard deviation of this vector. According to the [definition](https://en.wikipedia.org/wiki/Standard_deviation), we expect it to be equal to zero.

    >>> np.std(v)
    0.0
    

    So far so good. No surprises.

    Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation should be now?

    >>> import pandas as pd
    >>> s = pd.Series(v)
    >>> s.std()
    nan
    

    What? Not a number? What the hell? It’s not an empty vector! I didn’t ask to perform the corrected sample standard deviation. Wait a second…

    >>> s.std(ddof=0)
    0.0
    

    Now I start getting it. Compare this

    >>> print(np.std.__doc__)
    Compute the standard deviation along the specified axis.
    ....
    ddof : int, optional
    Means Delta Degrees of Freedom. The divisor used in calculations
    is ``N - ddof``, where ``N`` represents the number of elements.
    **By default `ddof` is zero**.
    

    … to this

    >>> print(pd.Series.std.__doc__)
    
    Return **sample** standard deviation over requested axis.
    
    **Normalized by N-1 by default**. This can be changed using the ddof argument
    ....
    **ddof : int, default 1**
    degrees of freedom
    

    Formally, the pandas developers did nothing wrong. They decided that it makes sense to default to the sample (N-1-normalized) standard deviation when working with data tables, unlike numpy, which is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know about it even after working with both libraries for so long.

    To sum up:

    >>> s.std()
    nan
    >>> v.std()
    0.0
    >>> s == v
    0    True
    dtype: bool

    Beware.
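    A defensive habit that avoids this trap (my own suggestion, not from the post above): when numpy arrays and pandas Series may be mixed, pass ddof explicitly, so the normalization never depends on a library default:

    ```python
    import numpy as np
    import pandas as pd

    v = np.array([3.14])
    s = pd.Series(v)

    # With an explicit ddof, both libraries compute the same quantity
    population_std = np.std(v, ddof=0)  # population standard deviation: 0.0
    series_std = s.std(ddof=0)          # 0.0, instead of the default NaN

    assert population_std == series_std == 0.0
    ```

    The single extra argument documents which standard deviation you mean and makes the code robust to each library’s choice of default.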

    November 6, 2017 - 2 minute read -
    bug coding numpy pandas programming blog
  • When scatterplots are better than bar charts, and why?


    November 5, 2017

    From time to time, you might hear that graphical method A is better at representing problem X than method B, while in the case of problem Z, method B is much better than A, though C is also a possibility. Did you ever ask yourself (or the people who tell you that), “Says WHO?”

    Guidelines like these come from theoretical and empirical studies. One such example is the 1985 paper “Graphical perception and graphical methods for analyzing scientific data” by Cleveland and McGill. I got the link to this paper from Varun Raj of https://varunrajweb.wordpress.com/.

    It looks like a very interesting and relevant paper, despite the fact that it was published 32 years ago. I will certainly read it. Following is the reading list that I compiled for my data visualization students more than two years ago. Unfortunately, they didn’t want to read any of these papers. Maybe some of the readers of this blog will …

    November 5, 2017 - 2 minute read -
    Data Visualization dataviz reading-list research blog
  • Because you can — a new series of data visualization rants


    November 1, 2017

    Here’s an old joke:

    Q: Why do dogs lick their balls?
    A: Because they can.

    Canine behavior aside, the fact that you can do something doesn’t mean that you should do it. I already wrote about one such example, when I compared chart legends to muttonchops.

    Citing myself:

    Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.

    When reviewing my notes, I realized that I have more bad data visualization examples that share a common problem: effortless addition of elements or features.

    Stay tuned and check the because-you-can tag.

    Featured image by Unsplash user Nicolas Tessari

    November 1, 2017 - 1 minute read -
    because you can Data Visualization dataviz blog