Everyday data analysis: moods during my first month as a data science consultant

In this post I analyse my moods during my first month as a data scientist consulting in London.

Data science interviews: lessons learned

In this post I share how I spent my time during my data scientist job search. I then use this information to point out things I would have done differently and things I'm glad I did. The content of this blog post was first presented at R-Ladies London on 2019-02-21.

NBA free throws analysis

This is why, despite knowing little to nothing about basketball, I still knew who LeBron James was. In this post I analysed 6,798,209 data points on NBA free throws from 2006 to 2018. I focused on the Players variable to rank NBA players by total number of shots, both across all ten seasons and within individual seasons. Scoring varies considerably among the top ten players: some play less but score more, while others attempt far more shots and stay consistent. Judging by the ranking, the better strategy is to attempt more often and stay consistent throughout the season.

Log-linear regression: who tends to download R software

In this blog post, I explore the relationship between the education index and R downloads per 1,000 capita across the globe. To do this, I identified an education-based metric for all countries. I then regressed R downloads per 1,000 capita on the education index. I found that the education index (values from 2016) explains about 64.6% of the variation in R downloads per 1,000 capita. There were several outliers, including Turkmenistan, Tajikistan, Libya, Burkina Faso, and Chad.
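
For readers who want to see the idea in code, here is a minimal sketch of a log-linear fit. It assumes that "log-linear" means the natural log of the response (downloads per 1,000 capita) is regressed on the predictor (education index) by ordinary least squares. The original analysis was done in R; this is a language-neutral illustration in plain Python, and the function name `fit_log_linear` and all numbers below are made up for the example, not taken from the post.

```python
import math


def fit_log_linear(x, y):
    """Fit log(y) = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared). Every y must be positive.
    """
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ly) for xi, li in zip(x, log_y))
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_x
    # R squared: share of the variation in log(y) explained by x
    ss_tot = sum((li - mean_ly) ** 2 for li in log_y)
    ss_res = sum(
        (li - (intercept + slope * xi)) ** 2 for xi, li in zip(x, log_y)
    )
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical values, invented purely for illustration
education_index = [0.35, 0.50, 0.65, 0.80, 0.90]
downloads_per_1000 = [0.02, 0.08, 0.30, 0.90, 2.10]

intercept, slope, r_squared = fit_log_linear(education_index, downloads_per_1000)
print(f"slope={slope:.2f}, R^2={r_squared:.3f}")
```

Under this assumption, a figure such as 64.6% corresponds to `r_squared`: the share of variation in the (logged) response that the education index accounts for.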

Maps: R-Downloads around the world

In this blog post I analysed a dataset containing R software downloads spanning from October 2017 to 2018. Using the `tidyverse`, `lubridate`, and `tmap` packages, I found out which countries download the software the most. Unsurprisingly, the most populated countries have the most total downloads. Since large countries with large populations will naturally have more total downloads, I decided to inspect the number of downloads per 1,000 capita. This extra step revealed that small, developed countries such as Hong Kong, Switzerland, Iceland, Singapore, Liechtenstein, the Netherlands, and Denmark have the most downloads per 1,000 capita. The US and Australia were exceptions, with at least one download per 1,000 capita despite being larger countries. Lastly, I looked into which months had the most R downloads by sub-region and found that, almost everywhere, summer isn't a very popular season for R downloads.
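
The per-capita step can be sketched in a few lines. This is only an illustration in plain Python (the post itself uses R), and the country names and figures are hypothetical, chosen to show how a ranking by total downloads can flip once population is taken into account.

```python
def downloads_per_1000(total_downloads, population):
    """Convert a raw download count into downloads per 1,000 inhabitants."""
    return 1000 * total_downloads / population


# Hypothetical countries: (total downloads, population)
countries = {
    "Largeland": (500_000, 1_000_000_000),
    "Smallville": (20_000, 5_000_000),
}

rates = {
    name: downloads_per_1000(downloads, population)
    for name, (downloads, population) in countries.items()
}

# Ranking by raw totals favours the populous country...
ranked_by_total = sorted(countries, key=lambda c: countries[c][0], reverse=True)
# ...while ranking by rate favours the small one.
ranked_by_rate = sorted(rates, key=rates.get, reverse=True)
```

Here the made-up "Largeland" has 25 times as many downloads in total, yet "Smallville" has the higher rate (4.0 versus 0.5 per 1,000).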

Clustering: cereals, supermarket shelves, and sugar

In this post I had a lot of fun using hierarchical clustering to label breakfast cereals according to their nutritional profile. The main differences between clusters led me to label them as unhealthy and healthy. There were three interesting findings. First, unhealthy cereals were labelled so because they had, on average, 19% more calories, 30% less protein, 99% more sodium, 62% less fiber, twice as much sugar, 41% less potassium, and 89% more vitamins (red herring for parents amirite?). Second, I show how cereal names are misleading: some cereals whose names contain healthy-sounding words such as bran have more sugar and salt than Fruity Pebbles! Third, the middle shelf in the particular supermarket where the data was collected tended to be less cluttered and stocked three times as many unhealthy cereals, which were mostly geared towards children.

Regression trees: predicting property prices

In this post I explore a property dataset from Ames, Iowa. The data describes a set of features for houses and includes sale price. My goal was to understand which features are linked with sale price for this specific dataset using regression trees. To do this, I first prepared the data by dealing with missing values and creating other variables to better interpret the results. After preparing the data, I used regression trees to answer the question. One of the benefits of regression trees is that the output can be illustrated and easily interpreted. I found that the overall quality variable is most closely linked to sale price. Other features such as living area and basement size are also important. I also found that the neighborhoods NorthRidge Heights, Northridge, and Stone Brook have the most expensive houses.
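
To give a feel for why regression trees are easy to interpret, here is a minimal sketch of the core step: picking the single threshold on one feature that best separates the target values. It assumes the common criterion of minimising the summed squared error on both sides of the split. It is written in plain Python rather than the R used in the post, the helper names `sse` and `best_split` are hypothetical, and the quality scores and prices are invented, not drawn from the Ames data.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return (threshold, cost) for the split minimising left + right SSE.

    Candidate thresholds are midpoints between consecutive distinct
    feature values. Returns None if all feature values are identical.
    """
    pairs = sorted(zip(feature, target))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical data: overall quality score and sale price (in thousands)
quality = [3, 4, 5, 7, 8, 9]
price = [100, 110, 120, 300, 310, 320]

threshold, cost = best_split(quality, price)
```

A full tree repeats this search over every feature and then recurses into each side of the chosen split, which is why the result reads as a sequence of simple "is quality above this value?" questions.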

Exploring Kiva loans

In this post I analyse 671,205 Kiva loans from around the world. Most loans are requested by single females, the EU requests the fewest loans, weekly repayment is an unpopular way of paying back loans, and entertainment, wholesale, manufacturing, and construction account for less than 2.2% of sectors. The main uses for Kiva loans are agriculture, retail, and food, with some variation amongst regions. Half of the loans are 4.22 USD or less and are funded by 12 or fewer lenders. The median time between posting a loan on the Kiva platform and disbursing it to the borrower is 16.89 days. I mainly use the tidyverse, stringr, and quantmod packages.
