The fastest way to get first N items in each group of a Pandas DataFrame

In my work, the speed of code writing and reading is usually more important than the speed of its execution. Right now, I’m facing a challenge of optimizing the running time of a fairly complex data science project. After a lot of profiling, I identified the major time consumers. One of such time-consuming steps involved grouping a Pandas DataFrame by a key, sorting each group by a score column, and taking first N elements in each group. The tables in this step are pretty small not more than one hundred elements. But since I have to perform this step many times, the running time accumulates to a substantial fraction.

Let’s first construct a toy example

N = 100
x = np.random.randint(1, 5, N).astype(int)
y = np.random.rand(N)
d = pd.DataFrame(dict(x=x, y=y))

I’ll use %%timeit cell magic which runs a Jupyter cell many times, and measures the time it takes to run the code.


%%timeit
d.groupby(
 'x'
 ).apply(
 lambda t: t.head(K)
 ).reset_index(drop=True)

This is the output:

3.19 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

 

I suspected that head() was not the most efficient way to take the first lines. I tried .iloc


%%timeit
d.groupby(
 'x'
 ).apply(
 lambda t: t.iloc[0:K]
 ).reset_index(drop=True)

2.92 ms ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A 10% improvement. Not bad but not excellent either. Then I realized that Pandas groupby object have their own head function


%%timeit
d.groupby(
 'x'
 ).head(
 K
 ).reset_index(drop=True)

674 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

647 microseconds instead of 3.2 milliseconds. The improvement is by almost a factor of five!

It’s not enough to have the right tool, it’s important to be aware of it, and to use it right. I wonder whether there is even faster way to obtain this job.

How to make a graph less readable? Rotate the text labels

This is my “because you can” rant.

Here, you can see a typical situation. You have some sales data that you want to represent using a bar plot.

01_default

Immediately, you notice a problem: the names on the X axis are not readable. One way to make the labels readable is to enlarge the graph.02_large_image

Making larger graphs isn’t always possible. So, the next default solution is to rotate the text labels.

03_rotated

However, there is a problem. Rotated text is read more slowly than standard horizontal text. Don’t believe me? This is not an opinion but rather a result of empirical studies [ref], [ref]. Sometimes, rotated text is unavoidable. Most of the time, it is not.

So, how do we make sure all the labels are readable without rotating them? One option is to move them up and down so that they don’t hinder each other. It is easily obtained with Python’s matplotlib

plt.bar(range(len(people)), sales)
plt.title('October sales')
plt.ylabel('$US', rotation=0, ha='right')
ticks_and_labels = plt.xticks(range(len(people)), people, rotation=0)
for i, label in enumerate(ticks_and_labels[1]):
    label.set_y(label.get_position()[1] - (i % 2) * 0.05)

(note, that I also rotated the Y axis label, for even more readability)

05_alternate_labels

Another approach that will work with even longer labels and that requires fewer code lines it to rotate the bars, not the labels.

07_horizontal_plot

… and if you don’t have a compelling reason for the data order, you might also consider sorting the bars. Doing so will not only make it prettier, it will also make it easier to compare between similar values. Use the graph above to tell whether Teresa Jackson’s sales were higher or lower than those of Marie Richardson’s. Now do the same comparison using the graph below.

08_horizontal_plot_sorted

To sum up: the fact you can does not mean you should. Sometimes, rotating text labels is the easiest solution. The additional effort needed to decipher the graph is the price your audience pays for your laziness. They might as well skip your graphs your message won’t stick.

This was my because you can rant.

Featured image by Flickr user gullevek

On machine learning, job security, professional pride, and network trolling

If you are a data scientist, I am sure you wondered whether deep neural networks will replace you at your job one day. Every time I read about reports of researchers who managed to trick neural networks, I wonder whether the researchers were thinking about their job security, or their professional pride while performing the experiments. I think that the first example of such a report is a 2014 paper by Christian Szegedy and his colleagues called “Intriguing properties of neural networks“. The main goal of this paper, so it seems, was to peek into the black box of neural networks. In one of the experiments, the authors designed minor, invisible perturbation of the original images. These perturbations diminished the classification accuracy of a trained model.

Screen Shot 2017-11-21 at 16.50.05.png

In the recent post “5 Ways to Troll Your Neural Network” Ben Orlin describes five different ways to “troll a network”.

Image credit: Figure 5 from “Intriguing properties of neural networks“.

Another set of ruthless critique pieces

You know that I like reading a ruthless critique of others’ work — I like telling myself that by doing so I learn good practices (in reality, I suspect I’m just a case what we call in Hebrew שמחה לאיד — the joy of some else’s failure).

Anyhow, I’d like to share a set of posts by Lior Patcher in which he calls bullshit on several reputable people and concepts. Calling bullshit is easy. Doing so with arguments is not so. Lior Patcher worked hard to justify his opinion.

 

Unfortunately, I don’t publish academic papers. But if I do, I will definitely want prof. Patcher read it, and let the world know what he thinks about it. For good and for bad.

Speaking of calling bullshit. Believe it or not, University of Washington has a course with this exact title. The course is available online http://callingbullshit.org/ and is worth watching. I watched all the course’s videos during my last flight from Canada to Israel. The featured image of this post is a screenshot of this course’s homepage.

 

 

 

Good information + bad visualization = BAD

I went through my Machine Learning tag feed. Suddenly, I stumbled upon a pie chart that looked so terrible, I was sure the post would be about bad practices in data visualization. I was wrong. The chart was there to convey some information. The problem is that it is bad in so many ways. It is very hard to appreciate the information in a post that shows charts like that. Especially when the post talks about data science that relies so much on data visualization.

via Math required for machine learning — Youth Innovation

I would write a post about good practices in pie charts, but Robert Kosara, of https://eagereyes.org does this so well, I don’t really think I need to reinvent the weel. Pie charts are very powerful in conveying information. Make sure you use this tool well. I strongly suggest reading everything Robert Kosara has to say on this topic.

 

 

What are the best practices in planning & interpreting A/B tests?

Compiled by my teammate Yanir Serourssi, the following is a reading list an A/B tests that you should read even if you don’t plan to perform an A/B test anytime soon. The list is Yanir’s. The reviews are mine. Collective intelligence in action 🙂

  • If you don’t pay attention, data can drive you off a cliff
    In this post, Yanir lists seven common mistakes that are common to any data-based analysis. At some point, you might think that this is a list of trivial truths. Maybe it is. The fact that Yanir’s points are trivial doesn’t make them less correct. Awareness doesn’t exist without knowledge. Unfortunately, knowledge doesn’t assure awareness. Which is why reading trivial truths is a good thing to do from time to time.
  • How to identify your marketing lies and start telling the truth
    This post was written by Tiberio Caetano, a data science professor at the University of Sidney. If I had to summarize this post with a single phrase, that would be “confounding factors”. A confounding variable is a variable hidden from your eye that influences a measured effect. One example of a confounding variable is when you start an ad campaign for ice cream, your sales go up, and you conclude that the ad campaign was effective. What you forgot was that the ad campaign started at the beginning of the summer, when people start buying more ice cream anyhow.
    See this link for a detailed textbook-quality review of confounding variables.
  • Seven rules of thumb for web site experimenters
    I read this review back in 2014, shortly after it was published by, among others, researchers from Microsoft and LinkedIn. Judging by the title, one would expect yet another list of trivial truths in a self-promoting product blog. This is not the case here. In this paper, you will find several real-life case studies, many references to marketing studies, and no advertising of shady products or schemes.
  • A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments
    Another academic paper by Microsoft researchers. This one lists a lot of “dont’s”. Like in the previous link, every advice the authors give is based on established theory and backed up by real data.

Data Science or Data Hype?

In his blog post Big Data Or Big Hype? , Rick Ciesla is asking a question whether the “Big Data” phenomenon is “a real thing” or just a hype? I must admit that, until recently, I was sure that the term “Data Science” was a hype too — an overbroad term to describe various engineering and scientific activities. As time passes by, I become more and more confident that Data Science matures into a separate profession. I haven’t’ yet decided whether the word “science” is fully appropriate in this case is.

We have certainly heard a lot about Big Data in recent years, especially with regards to data science and machine learning. Just how large of a data set constitutes Big Data? What amount of data science and machine learning work involves truly stratospheric volumes of bits and bytes? There’s a survey for that, courtesy of […]

via Big Data Or Big Hype? — VenaData

Do you REALLY need the colors?

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Look at this example from the seaborn documentation site

>>> import seaborn as sns
>>> sns.set_style("whitegrid")
>>> tips = sns.load_dataset("tips")
>>> ax = sns.barplot(x="day", y="total_bill", data=tips)

Barplot example with colored bars

This example shows the default barplot and is the first barplot. Can you see how easy it is to add colors to the different columns? But WHY? What do those colors represent? It looks like the only information that is encoded by the color is the bar category. We already have this information in the form of bar location. Having this colorful image adds nothing but a distraction. It is sad that this is the default behavior that seaborn developers decided to adopt.

Look at the same example, without the colors

>>> ax = sns.barplot(x="day", y="total_bill", color='gray', data=tips)

Barplot example with gray bars

Isn’t it much better? The sad thing is that a better version requires memorizing additional arguments and more typing.

This was my because you can rant.

 

Numpy vs. Pandas: functions that look the same, share the same code but behave differently

I can’t imagine how my professional life would have looked like without pandas, THE data analysis library for Python. Pandas shares much of its functionality and syntax with numpy, a fundamental package for scientific computing with Python. The reason for that is that, under the hood, pandas uses numpy. This similarity is very convenient as it allows passing numpy arrays to many pandas functions and vice versa. However, sometimes it sabs you in the back. Here is a nice example that I discovered after hours (OK, minutes) of debugging.

Let’s create a numpy vector with a single element in it:

>>> import numpy as np

>>> v = np.array([3.14]) 

Now, let's compute the standard deviaiton of this vector. According to the definition, we expect it to be equal zero.
>>> np.std(v)
0.0

So far so good. No surprises.

Now, let’s make a pandas Series out of our vector. A Series is basically a vector in which the elements can be indexed by arbitrary labels. What do you expect the standard deviation should be now?

>>> import pandas as pd
>>> s = pd.Series(v)
>>> s.std()
nan

What? Not a number? What the hell? It’s not an empty vector! I didn’t ask to perform the corrected sample standard deviation. Wait a second…

>> s.std(ddof=0)
 0.0

Now I start getting it. Compare this

>>> print(np.std.__doc__)
Compute the standard deviation along the specified axis.
....
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations
is ``N - ddof``, where ``N`` represents the number of elements.
By default `ddof` is zero.

… to this

>>> print(pd.Series.std.__doc__)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument
....
ddof : int, default 1
degrees of freedom

Formally, the pandas developers did nothing wrong. They decided that it makes sense to default for normalized standard deviation when working with data tables, unlike numpy that is supposedly meant to deal with arbitrary matrices of numbers. They made a decision, they wrote it at least three times in the documentation, and yet… I didn’t know that even after working with both the libraries for so long.

To sum up:

> s.std()
nan
>> v.std()
0.0
>> s == v
0 True
dtype: bool

 

Beware.

 

 

 

When scatterplots are better than bar charts, and why?

From time to time, you might hear that graphical method A is better at representing problem X than method B. While in case of problem Z, the method B is much better than A, but C is also a possibility. Did you ever ask yourselves (or the people who tell you that) “Says WHO?”

The guidelines like these come from theoretical and empirical studies. One such an example is a 1985 paper “Graphical perception and graphical methods for analyzing scientific data.” by Cleveland and McGill. I got the link to this paper from Varun Raj of https://varunrajweb.wordpress.com/.

It looks like a very interesting and relevant paper, despite the fact that it has been it was published 22 years go. I will certainly read it. Following is the reading list that I compiled for my data visualization students more than two years ago. Unfortunately, they didn’t want to read any of these papers. Maybe some of the readers of this blog will …

 

Because you can — a new series of data visualization rants

Here’s an old joke:

Q: Why do dogs lick their balls?
A: Because they can.

Canine behavior aside, the fact that you can do something doesn’t mean that you should to it. I already wrote about one such example, when I compared between chart legends to muttonchops.

Citing myself:

Chart legends are like Muttonchops — the fact that you can have them doesn’t mean you should.

Stay tuned and check the because-you-can tag.

Featured image by Unsplash user Nicolas Tessari

Although it is easy to lie with statistics, it is easier to lie without

I really recommend reading this (longish) post by Tom Breur called “Data Dredging” (and following his blog. The post is dedicated to overfitting — the most scaring problem in machine learning. Overfitting is easy to do and is hard to avoid. It is a serious problem when working with “small data” but is also a problem in the big data era. Read “Data Dredging” for an overview of the problem and its possible cures.

Quoting Tom Breur:

Reality for many data scientist is that the data at hand, in particular some minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there might be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!

Happy dredging indeed.

 

Gartner: More than 40% of data science tasks will be automated by 2020. So what?

Recently, I gave a data science career advice, in which I suggested the perspective data scientists not to study data science as a career move. Two of my main arguments were (and still are):

  • The current shortage of data scientists will go away, as more and more general purpose tools are developed.
  • When this happens, you’d better be an expert in the underlying domain, or in the research methods. The many programs that exist today are too shallow to provide any of these.

Recently, the research company Gartner published a press release in which they claim that “More than 40 percent of data science tasks will be automated by 2020, resulting in increased productivity and broader usage of data and analytics by citizen data scientists, according to Gartner, Inc.” Gartner’s main argument is similar to mine: the emergence of ready-to-use tools, algorithm-as-a-service platforms and the such will reduce the amount of the tedious work many data scientists perform for the majority of their workday: data processing, cleaning, and transformation. There are also more and more prediction-as-a-service platforms that provide black boxes that can perform predictive tasks with ever increasing complexity. Once good plug-and-play tools are available, more and more domain owners, who are not necessary data scientists, will be able to use them to obtain reasonably good results. Without the need to employ a dedicated data scientist.

Data scientists won’t disappear as an occupation. They will be more specialized.

I’m not saying that data scientists will disappear in the way coachmen disappeared from the labor market. My claim is that data scientists will cease to be perceived as a panacea by the typical CEO/CTO/CFO. Many tasks that are now performed by the data scientists will shift to business developers, programmers, accountants and other domain owners who will learn another skill — operating with numbers using ready to use tools. An accountant can use Excel to balance a budget, identify business strengths, and visualize trends. There is no reason he or she cannot use a reasonably simple black box to forecast sales, identify anomalies, or predict churn.

So, what is the future of data science occupation? Will the emergence of out-of-box data science tools make data scientists obsolete? The answer depends on the data scientists, and how sustainable his or her toolbox is. In the past, bookkeeping used to rely on manual computations. Has the emergence of calculators, and later, spreadsheet programs, result in the extinction of bookkeepers as a profession? No, but most of them are now busy with tasks that require more expertise than just adding the numbers.

The similar thing will happen, IMHO, with data scientists. Some of us will develop a specialization in a business domain — gain a better understanding of some aspect of a company activity. Others will specialize in algorithm optimization and development and will join the companies for which algorithm development is the core business. Others will have to look for another career. What will be the destiny of a particular person depends mostly on their ability to adapt. Basic science, solid math foundation, and good research methodology are the key factors the determine one’s career sustainability. The many “learn data science in 3 weeks” courses might be the right step towards a career in data science. A right, small step in a very long journey.

Featured image: Alex Knight on Unsplash

Untitled

I teach data visualization to in Azrieli College of Engineering in Jerusalem. Yesterday, during my first lesson, I was talking about the different ways a chart design selection can lead to different conclusions, despite not affecting the actual data. One of the students hypothesized that the preception of a figure can change as a function of other graphs shown together. Which was exactly tested in a research I recently mentioned here. I felt very proud of that student, despite only meeting them one hour before that.

Who doesn’t like some merciless critique of others’ work?

Stephen Few is the author of (among others) “Show Me The Numbers“. Besides writing about what should be done, in the field of data visualization, Dr. Few also writes a lot about what should not be done. He does that in a sharp, merciless way which makes it very interesting reading (although, sometimes Dr. Few can be too harsh). This time, it was the turn of the Tableau blog team to be at the center of Stephen Few’s attention, and not for the good reason.

If Tableau wishes to call this research, then I must qualify it as bad research. It produced no reliable or useful findings. Rather than a research study, it would be more appropriate to call this “someone having fun with an eye tracker.”

Reading merciless critique by knowledgeable experts is an excellent way to develop that “inner voice” that questions all your decisions and makes sure you don’t make too many mistakes. Despite the fear to be fried, I really that some day I’ll be able to know what Stephen Few things of my work.

http://www.perceptualedge.com/blog/?p=2718

 

Disclaimer: Stephen Few was very generous to allow me using the illustrations from his book in my teaching.

Featured image is Public domain image by Alan Levine from here

Why is it (almost) impossible to set deadlines for data science projects?

In many cases, attempts to set a deadline to a data science project result in a complete fiasco. Why is that? Why, in many software projects, managers can have a reasonable time estimate for the completion but in most data science projects they can’t? The key points to answer this question are complexity and, to a greater extent, missing information. By “complexity” I don’t (only) mean the computational complexity. By “missing information” I don’t mean dirty data. Let us take a look at these two factors, one by one.

Complexity

Illustration: famous xkcd comic. Two programmers play during the compilation time
Think of this. Why most properly built bridges remain functional for decades and sometimes for centuries, while the rule in every non-trivial program is that “there is always another bug?”. I read this analogy in Joel Spolsky’s post written in 2001. The answer Joel provides is:

Once you’ve written a subroutine, you can call it as often as you want. This means that almost everything we do as software developers is something that has never been done before. This is very different than what construction workers do.

There was a substantial progress in the computer engineering theory since 2001 when Joel wrote its post. We have better static analysis tools, better coverage tools, and better standard practices. Nevertheless, bug-free software only exists in Programming 101 books.

What about data science projects? Aren’t they essentially a sort of software project? Yes, they are, and as such, the above quote is relevant for them too. However, we can add another statement:

Once you’ve collected data, you can process it as often as you want. This means that almost everything we do as data scientists is something that has never been done before.

You see, to account for project uncertainty, we need to multiply the number of uncertainty factors of a software project by the number of uncertainty factors associated with the data itself. The bottom line is an exponential complexity growth.

Missing information

Now, let’s talk about another, even bigger problem, the missing information. I’m not talking about “dirty data” — a situation where some values in the dataset are missing, input errors, or fields that change their meaning over the time. These are severe problems but not as tough as the problem I’m about to talk about in this post.

When a software engineer writes a plotting program, they know when it doesn’t work: the image is either created or not. And if the image isn’t created, the programmer knows that something wrong and has to be fixed. When a programmer writes a compression program, they know when they made a mistake: if the program does not compress a file, or if the result isn’t readable. The programmer knows that there must be a fixable bug in his or her code.

What about a data science project? Let’s say you’re starting an advertisement targetting project. The project manager gives you the information source and the performance metric. A successful model has to have a performance of 80 or more (the nature of the performance score isn’t important here). You start working. You clean your data, normalize it, build a nice decision tree, and get a score of 60, which is way too low. You explore your data, discover problems in it, retrain the tree and get 63. You talk to the team that collects the data, find more problems, build a random forest, train it and get a score of 66. You buy some computation time, create a deep learning network on AWS, train it for a week, and get 66 again.

Illustration: a blindfolded man wandering around

What do you do now? Is it possible that somewhere in your code there is a bug? It certainly is. Is it possible that you can improve the performance by deploying a better model? Probably. However, it is also possible that the data does not contain enough information. The problem, of course, is that you don’t know that. In practice, you hit your head against the wall until you get the results, or give up, or fired. And this is THE most significant problem with data science (and any research) project: your problem is a black box. You only know what you know, but you have no idea what you don’t. A research project is like exploring a forest with your eyes shut: when you hit a tree, you don’t know whether this is the last tree in the forest and you’re out, or you’re in the middle of a tropical jungle.

I hope that the theoretical data science research will narrow this gap. Meanwhile, the project managers will have to live with this great degree of uncertainty.

 

PS. As in any opinion post, I may be mistaken. If you think I am, please let me know in the comments section below.

The xckd image: https://xkcd.com/303/ under CC-nc. The wandering man image: Illustration: a blindfolded man wandering around. By Flickr user Molly under CC-by-nc-nd

What is the best thing that can happen to your career?

Today, I’ve read a tweet by Sinan Aral (@sinanaral) from the MIT:

 

I’ve just realized that Ikigai is what happened to my career as a data scientist. There was no point in my professional life where I felt boredom or lack of motivation. Some people think that I’m good at what I’m doing. If they are right (which I hope they are), It is due to my love for what I have been doing since 2001. I am so thankful for being able to do things that I love, I care about, and am good at. Not only that, I’m being paid for that! The chart shared by Sinan Aral in his tweet should be guiding anyone in their career choices.

 

Featured image is taken from this article. Original image credit: Toronto Star Graphic 

Can the order in which graphs are shown change people’s conclusions?

When I teach data visualization, I love showing my students how simple changes in the way one visualizes his or her data may drive the potential audience to different conclusions. When done correctly, such changes can help the presenters making their point. They also can be used to mislead the audience. I keep reminding the students that it is up to them to keep their visualizations honest and fair.  In his recent post, Robert Kosara, the owner of https://eagereyes.org/, mentioned another possible way that may change the perceived conclusion. This time, not by changing a graph but by changing the order of graphs exposed to a person. Citing Robert Kosara:

Priming is when what you see first influences how you perceive what comes next. In a series of studies, [André Calero Valdez, Martina Ziefle, and Michael Sedlmair] showed that these effects also exist in the particular case of scatterplots that show separable or non-separable clusters. Seeing one kind of plot first changes the likelihood of you judging a subsequent plot as the same or another type.

via IEEE VIS 2017: Perception, Evaluation, Vision Science — eagereyes

As any tool, priming can be used for good or bad causes. Priming abuse can be a deliberate exposure to non-relevant information in order to manipulate the audience. A good way to use priming is to educate the listeners of its effect, and repeatedly exposing them to alternate contexts. Alternatively, reminding the audience of the “before” graph, before showing them the similar “after” situation will also create a plausible effect of context setting.

P.S. The paper mentioned by Kosara is noticeable not only by its results (they are not as astonishing as I expected from the featured image) but also by how the authors report their research, including the failures.

 

Featured image is Figure 1 from Calero Valdez et al. Priming and Anchoring Effects in Visualization

Advice for aspiring data scientists and other FAQs — Yanir Seroussi

It seems that career in data science is the hottest topic many data scientists are asked about. To help an aspiring data scientist, I’m reposting here a FAQ by my teammate Yanir Seroussi

Aspiring data scientists and other visitors to this site often repeat the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time). How do I become a data scientist? It depends on your situation. Before we get into it, have you thought about why you want […]

via Advice for aspiring data scientists and other FAQs — Yanir Seroussi

How to be a better teacher?

If you know me in person or follow my blog, you know that I have a keen interest in teaching. Indeed, besides being a full-time data scientist at Automattic, I teach data visualization anywhere I can. Since I started teaching, I became much better in communication, which is one of the required skills of a good data scientist.
In my constant strive for improving what I do, I joined the Data Carpentry instructor training. Recently, I got my certification as a data carpentry instructor.

Certificate of achievement. Data Carpentry instructor

Software Carpentry (and it’s sibling project Data Carpentry) aims to teach researchers the computing skills they need to get more done in less time and with less pain. “Carpentry” instructors are volunteers who receive a pretty extensive training and who are committed to evidence-based teaching techniques. The instructor training had a powerful impact on how I approach teaching. If teaching is something that you do or plan to do, invest three hours of your life watching this video in which Greg Wilson, “Carpentries” founder, talks about evidence-based teaching and his “Carpentries” project.

I also recommend reading these papers, which provide a brief overview of some evidence-based results in teaching:

What you need to know to start a career as a data scientist

It’s hard to overestimate how I adore StackOverflow. One of the recent blog posts on StackOverflow.blog is “What you need to know to start a career as a data scientist” by Julia Silge. Here are my reservations about that post:

1. It’s not that simple (part 1)

You might have seen my post “Don’t study data science as a career move; you’ll waste your time!“. Becoming a good data scientist is much more than making a decision and “studying it”.

2. Universal truths mean nothing

The first section in the original post is called “You’ll learn new things”. This is a universal truth. If you don’t “learn new things” every day, your professional career is stalling. Taken from the word of classification models, telling a universal truth has a very high sensitivity but very low specificity. In other words, it’s a useless waste of ink.

3. Not for developers only

The first section starts as follows: “When transitioning from a role as a developer to a position focused on data, …”. Most of the data scientists I know were never developers. I, for example, started as a pharmacist, computational chemist, and bioinformatician. I know several physicists, a historian and a math teacher who are now successful data scientists.

4. SQL skills are overrated

Another quote from the post: “Strong SQL skills are table stakes for data scientists and data engineers”. The thing is that in many cases, we use SQL mostly to retrieve data. Most of the “data scienc-y” work requires analytical tools and the flexibility that are not available in most of the SQL environments. Good familiarity with industry-standard tools and libraries are more important than knowing SQL. Statistics is way more important than knowing SQL. Julia Silge did indeed mention the tools (numpy/R) but didn’t emphasize them enough.

5. Communication importance is hard to overestimate

Again, quoting the post:

The ability to communicate effectively with people from diverse backgrounds is important.

Yes, Yes, and one thousand times Yes. Effective communication is a non-trivial task that is often overlooked by many professionals. Some people are born natural communicators. Some, like me, are not. If there’s one book that you can afford buying to improve your communication skills, I recommend buying “Trees, maps and theorems” by Jean-luc Doumont. This is a small, very expensive book that changed the way I communicate in my professional life.

6. It’s not that simple (part 2)

After giving some very general tips, Julia proceeds to suggest her readers checking out the data science jobs at StackOverflow Jobs site. The impression that’s made is that becoming a data scientist is a relatively simple task. It is not. At the bare minimum, I would mention several educational options that are designed for people trying to become data scientists. One such an option is Thinkful (I’m a mentor at Thinkful). Udacity and Coursera both have data science programs too. The point is that to become a data scientist, you have to study a lot. You might notice a potential contradiction between point 1 above and this paragraph. A short explanation is that becoming a data scientist takes a lot of time and effort. The post “Teach Yourself Programming in Ten Years” which was written in 2001 about programming is relevant in 2017 about data science.

Featured image is based on a photo by Jase Ess on Unsplash

Graffiti from Chișinău, Moldova

I’ve stumbled upon a nice post by Jackie Hadel where she shared some graffiti pictures from – Chișinău, the town I was born at. I left Chișinău in 1990 and first visited it in this March. I also took several graffiti pictures which I will share here. Chișinău is also known by its Russian name Kishinev.

Graffiti in Chisinau. Kishinevers, put your all efforts to rebuild your native city

This is a partially restored post-WWII writing that says “Kishinevers, give your all efforts to rebuild [your] native town”. Kishinev was ruined almost completely during the World War II. Right now, after the USSR collapse more than 25 years ago, the city still looks as if it needs to be restored.

Graffiti in Chisinau. Pythagorean theorem.

Being a data scientist, I liked this graffiti for the maths. It’s the Pythagorean theorem, in case you missed it.

Swastika on a tombstone in Chisinau

Swastika on a tombstone in the old Jewish cemetery. One of the saddest places I visited in this city.

Graffiti in Chisinau. Building-size graffity.

A mega-graffiti?

Graffiti in Chisinau. Writing that says "I love Moldova" (in Romanian)

“I love Moldova”. I love it too.

See the original post that prompted me to share these pictures: CHISINAU, MOLDOVA GRAFFITI: LEFT IN RUIN, YOU MAKE ME HAPPY — TOKIDOKI (NOMAD)

15july17 Chisinau, Moldova 🇲🇩

 

 

Come to PyData at the Bar Ilan University to hear me talking about anomaly detection

On June 12th, I’ll be talking about anomaly detection and future forecasting when “good enough” is good enough. This lecture is a part of PyCon Israel that takes place between June 11 and 14 in the Bar Ilan University. The conference agenda is very impressive. If “python” or “data” are parts of your professional life, come to this conference!