To see the code used in this post, visit my GitHub repository for this site.
- Objectives: To find what kinds of clusters there are in the cereals dataset available on Kaggle.
- Challenge: To implement a clustering algorithm for the first time.
- Data points: 1232
- Language: R
Can I cluster the cereals dataset? If so, how many clusters are there? How can I make sense of those clusters?
3 Dataset description
The dataset is available on Kaggle. It contains nutrition data on 80 cereal products. I chose this dataset because the data is unlabelled, so I could label it through clustering. It also has plenty of numeric variables, and I wanted to focus on numeric variables using Euclidean distances. Not all variables in the dataset are numeric: the cereal name, the manufacturer, and the type of cereal (hot or cold) are, of course, strings. There is also a numeric variable, Shelf, corresponding to the supermarket display shelf, which doesn’t make much sense to treat as a numeric variable, so I removed it from the analysis. After keeping only the numeric variables, I was left with 12 variables.
I first scaled the data because I calculate the Euclidean distance between each pair of cereals and the variables are on different scales. Once the data are scaled, each variable has a mean of zero and a standard deviation of one.
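As a minimal sketch of this scaling step (not the code from the repository), the snippet below uses base R's `scale()` and `dist()`. The data frame `cereals_num` and its values are made up for illustration; only the column names echo the kind of variables in the dataset.

```r
# Toy stand-in for the numeric cereal columns; the values are made up.
cereals_num <- data.frame(
  calories = c(70, 120, 110, 50, 110, 100),
  sodium   = c(130, 15, 200, 140, 180, 0),
  sugars   = c(6, 8, 14, 0, 12, 3)
)

# scale() subtracts each column's mean and divides by its standard deviation.
cereals_scaled <- scale(cereals_num)

# After scaling, every column has mean 0 and standard deviation 1
# (up to floating-point error).
round(colMeans(cereals_scaled), 10)
apply(cereals_scaled, 2, sd)

# Euclidean distance between every pair of rows (cereals).
cereal_dist <- dist(cereals_scaled, method = "euclidean")
```

Without scaling, a variable measured in large units (such as sodium in milligrams) would dominate the distance simply because its numbers are bigger.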
4 Hierarchical clustering
I first implemented hierarchical clustering with three different linkage methods. Generally, the hierarchical clustering algorithm starts with every observation in its own cluster and first links the two observations that are closest together. At each later step it merges the two closest clusters, where the distance between two clusters is computed from the distances between their members. How that distance is computed depends on the type of linkage, as I show in the next tree diagrams (or dendrograms).
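A rough sketch of how such a comparison can be set up with base R's `hclust()` follows. The post only names complete linkage, so the choice of single and average as the other two methods is an assumption, and the matrix `toy` is random illustrative data rather than the cereals.

```r
# Random toy data standing in for the scaled cereal variables (assumption).
set.seed(1)
toy <- scale(matrix(
  rnorm(40), nrow = 10,
  dimnames = list(paste0("cereal_", 1:10),
                  c("calories", "sodium", "sugars", "fiber"))
))

# Pairwise Euclidean distances between observations.
d <- dist(toy)

# Fit one tree per linkage method; "single" and "average" are assumed here.
linkages <- c("single", "average", "complete")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# Draw the three dendrograms side by side.
old_par <- par(mfrow = c(1, 3))
for (m in linkages) {
  plot(trees[[m]], main = paste(m, "linkage"), xlab = "", sub = "")
}
par(old_par)
```

All three trees start by joining the same closest pair of observations. They differ afterwards because single linkage uses the smallest distance between members of two clusters, complete linkage the largest, and average linkage the mean.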
I wanted balanced clusters, so I chose the complete linkage method, which split the observations into two groups of roughly similar size: one cluster with 30 cereals and the other with 47.
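One way to get cluster memberships and sizes from a complete-linkage tree is `cutree()`. This is a sketch on random toy data, so the cluster sizes it prints will not match the 30 and 47 reported above.

```r
# Random toy data standing in for the scaled cereal variables (assumption).
set.seed(1)
toy <- scale(matrix(rnorm(40), nrow = 10))

# Complete linkage: cluster distance is the largest pairwise member distance.
tree <- hclust(dist(toy), method = "complete")

# Cut the tree so that exactly two clusters remain.
clusters <- cutree(tree, k = 2)

# Number of observations in each cluster.
table(clusters)
```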
5.1 What are the differences between clusters?
Cluster two (47 cereals) has, on average, 19% more calories, 30% less protein, 99% more sodium, 62% less fiber, twice as much sugar, 41% less potassium, 89% more vitamins, and 36% lower ratings than cluster one. Since one of the aims of clustering is to label unlabelled data, based on the differences between the two clusters, I will label cluster one as healthy and cluster two as unhealthy. It’s interesting to see that unhealthy cereals have a worse nutritional profile than healthy cereals but are heavily enriched, with roughly 90% more vitamins.
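Percentage differences like these can be computed from per-cluster means. The sketch below assumes cluster one is the baseline and uses a made-up data frame `toy`, so its numbers are illustrative only.

```r
# Made-up values; `cluster` plays the role of the labels from the tree cut.
toy <- data.frame(
  cluster  = c(1, 1, 1, 2, 2, 2),
  calories = c(90, 100, 110, 110, 120, 130),
  sugars   = c(2, 4, 6, 7, 8, 9)
)

# Mean of every variable within each cluster.
means <- aggregate(. ~ cluster, data = toy, FUN = mean)

# Named vectors of means for cluster one and cluster two.
m1 <- unlist(means[means$cluster == 1, -1])
m2 <- unlist(means[means$cluster == 2, -1])

# Percentage difference of cluster two relative to cluster one.
pct_diff <- round(100 * (m2 - m1) / m1)
pct_diff
```

In this toy example cluster two has 20% more calories and 100% more sugar (that is, twice as much) than cluster one.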
5.2 Can we classify cereals into either cluster based on their names?
Let’s look at the clusters and see which specific cereals belong to each cluster. Figure 5.1 shows cereals plotted against sugar and sodium content per serving. Notice that a cereal’s name doesn’t necessarily indicate which cluster it belongs to. For example, Total Raisin Bran has a high sugar and sodium content, higher than a perhaps intuitively unhealthy cereal like Lucky Charms. Another example is the word Bran, which years of marketing have led us to associate with healthy cereals. Not always the case! Some Bran cereals like 100% Natural Bran, Bran Chex, and Raisin Nut Bran do have a relatively low sugar and sodium content. Other Bran products like Raisin Bran, Fruitful Bran, and Post Nat Raisin Bran have a higher sugar content than Fruity Pebbles!
5.3 Which shelves have the most unhealthy cereals?
From my brief food marketing experience, I remember learning that brands pay a premium to have their product on certain shelves because they are more likely to be seen by shoppers. Figure 5.2 illustrates types of cereals according to shelf (y-axis) and sugar (x-axis). The top shelf (3) has an equal number of unhealthy and healthy cereals. It’s also pretty cluttered, with 36 cereals in total. My guess is that brands don’t pay extra for their cereals to be stocked on this shelf. The bottom shelf (1) has six more unhealthy cereals than healthy ones. Notice that cereals on this shelf are to some extent “basics”, whether they’re unhealthy or not: oatmeal, Frosted Flakes, Rice Krispies, Corn Flakes. The middle shelf appears to be the premium one, with three times more unhealthy than healthy cereals. For the most part, middle-shelf unhealthy cereals are also the ones geared towards children: Cap’n Crunch, Lucky Charms, Fruity Pebbles, Cocoa Puffs.
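Counts like these come from cross-tabulating shelf against cluster label. Here is a minimal sketch with base R's `table()`; the data frame `toy` is made up, so its counts are not the real shelf counts.

```r
# Made-up shelf numbers and labels, for illustration only.
toy <- data.frame(
  shelf = c(1, 1, 2, 2, 2, 2, 3, 3),
  label = c("healthy", "unhealthy", "unhealthy", "unhealthy",
            "unhealthy", "healthy", "healthy", "unhealthy")
)

# Rows are shelves, columns are cluster labels.
counts <- table(shelf = toy$shelf, label = toy$label)
counts

# Ratio of unhealthy to healthy cereals on each shelf.
counts[, "unhealthy"] / counts[, "healthy"]
```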
In this post I had a lot of fun using hierarchical clustering to label breakfast cereals according to their nutritional profile. The main differences between clusters led me to label them as unhealthy and healthy. There were three interesting findings. First, unhealthy cereals were labelled so because they had, on average, 19% more calories, 30% less protein, 99% more sodium, 62% less fiber, twice as much sugar, 41% less potassium, and 89% more vitamins (red herring for parents, amirite?). Second, I showed that cereal names can be misleading. Cereal names containing “healthy” words such as “bran” sometimes have more sugar and salt than Fruity Pebbles! Third, the middle shelf in the particular supermarket where the data was collected tended to be less cluttered and stocked three times as many unhealthy cereals as healthy ones, mostly geared towards children.