# Data science interviews: lessons learned

In this post I share how I spent my time during my data scientist job search. I then use this information to point out things I would have done differently and things I'm glad I did. The content of this blogpost was first presented at R-Ladies London on 2019-02-21.

# NBA free throws analysis

This is why, despite knowing next to nothing about basketball, I still knew who LeBron James was. In this post I analysed 6,798,209 data points on NBA free throws from 2006 to 2018. I focused on the Players variable to rank NBA players by their total number of shots, both across all ten seasons and season by season. Scoring varies widely among the top ten players: some play less but score more, while others attempt far more shots but stay consistent. The better strategy, judging by the ranking, is to attempt more often and stay consistent throughout the season.

In this blog post, I explain the relationship between the education index and R downloads per 1,000 capita across the globe. To do this, I identified an education-based metric for all countries. I then regressed R downloads per 1,000 capita onto the education index. I found that the education index (values from 2016) explains about 64.6% of the variation in R downloads per 1,000 capita. There were several outliers, including Turkmenistan, Tajikistan, Libya, Burkina Faso, and Chad.
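For readers who want to see what "explains about 64.6% of the variation" means in practice, here is a minimal sketch of a simple linear regression and its R² value. It is written in plain Python rather than the R used in the original post, and the function name `simple_linear_regression` and the toy numbers are invented for illustration; they are not the post's data or code.

```python
def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared), where r_squared is the share
    of the variation in y that the fitted line accounts for.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain some variation")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares: what the line fails to explain.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical values, made up for this example only.
toy_education_index = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_downloads_per_1000 = [0.1, 0.3, 0.2, 0.8, 1.1, 1.5]

slope, intercept, r_squared = simple_linear_regression(
    toy_education_index, toy_downloads_per_1000
)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

An R² of 0.646 would mean the fitted line accounts for roughly two thirds of the spread in downloads per 1,000 capita, with the rest left to other factors and to outliers.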

In this blog post I analysed a dataset containing R software downloads spanning from October 2017 to 2018. Unsurprisingly, I found that the most populated countries have the most total downloads. Using the tidyverse, lubridate, and tmap packages, I found out which countries download the software the most. Since countries with large populations will have more total downloads, I decided to inspect the number of downloads per 1,000 capita. This extra step revealed that small, developed countries such as Hong Kong, Switzerland, Iceland, Singapore, Liechtenstein, the Netherlands and Denmark have the most downloads per 1,000 capita. The exceptions were the US and Australia, with at least one download per 1,000 capita despite being larger countries. Lastly, I looked into which months had the most R downloads by sub-region and found that, almost everywhere, summer isn't a very popular season for R downloads.
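The per-capita step is simple arithmetic, but it changes the ranking completely. Here is a small sketch in plain Python (the original analysis was done in R); the helper `downloads_per_1000` and the two made-up countries are assumptions for illustration, not figures from the dataset.

```python
def downloads_per_1000(downloads, population):
    """Return the number of downloads per 1,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return downloads * 1000 / population


# Invented figures for two imaginary countries (not from the post's data).
toy_countries = {
    "Largeland": {"downloads": 500_000, "population": 1_000_000_000},
    "Smallville": {"downloads": 40_000, "population": 8_000_000},
}

rates = {
    name: downloads_per_1000(c["downloads"], c["population"])
    for name, c in toy_countries.items()
}

# Rank by total downloads, then by downloads per 1,000 capita.
ranked_by_total = sorted(
    toy_countries, key=lambda name: toy_countries[name]["downloads"], reverse=True
)
ranked_by_rate = sorted(rates, key=rates.get, reverse=True)

print(ranked_by_total)  # Largeland first: more downloads in absolute terms
print(ranked_by_rate)   # Smallville first: 5.0 vs 0.5 per 1,000 capita
```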

# Clustering: cereals, supermarket shelves, and sugar

In this post I had a lot of fun using hierarchical clustering to label breakfast cereals according to their nutritional profile. The main differences between clusters led me to label them as unhealthy and healthy. There were three interesting findings. First, unhealthy cereals were labelled so because they had, on average, 19% more calories, 30% less protein, 99% more sodium, 62% less fiber, twice as much sugar, 41% less potassium, and 89% more vitamins (red herring for parents amirite?). Second, I show how cereal names are misleading: some cereals whose names contain healthy-sounding words such as bran have more sugar and salt than Fruity Pebbles! Third, the middle shelf in the particular supermarket where the data was collected tended to be less cluttered and stocked three times as many unhealthy cereals, which were mostly geared towards children.
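To give a feel for how hierarchical clustering groups items, here is a bare-bones agglomerative version in plain Python (the post itself used R). Complete linkage, Euclidean distance, the function names, and the six toy cereals are all assumptions for this sketch; the post's actual linkage method and data may differ, and a real analysis would normally standardise the nutritional variables first.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def complete_linkage(points, cluster_a, cluster_b):
    """Distance between two clusters: their two farthest members."""
    return max(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: complete_linkage(
                points, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        # a < b always holds here, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical cereals as (sugar, fiber) pairs, invented for illustration.
toy_cereals = [(12, 1), (13, 0), (11, 1), (3, 5), (2, 6), (4, 4)]

for cluster in agglomerative_clusters(toy_cereals, n_clusters=2):
    print(sorted(cluster), [toy_cereals[i] for i in sorted(cluster)])
```

With these toy values the high-sugar, low-fiber cereals end up in one cluster and the low-sugar, high-fiber ones in the other, which is the kind of split that invites "unhealthy" and "healthy" labels.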