Clustering: cereals, supermarket shelves, and sugar

1 Summary

To see the code used in this post, visit my GitHub repository for this site.

  • Objectives: To find out what kinds of clusters there are in the cereals dataset available on Kaggle.
  • Challenge: To implement a clustering algorithm for the first time.
  • Data points: 1232
  • Language: R

2 Question

Can I cluster the cereals dataset? If so, how many clusters are there? How can I make sense of those clusters?

3 Dataset description

The dataset is available on Kaggle. It contains nutrition data on 77 cereal products. I chose this dataset because the data is unlabelled, so I could label it through clustering. It also had plenty of numeric variables, and I wanted to focus on numeric variables using Euclidean distances. Not all variables in the dataset were numeric: the cereal name, the manufacturer, and the type of cereal (hot or cold) were, of course, strings. There was also a numeric variable, Shelf, corresponding to the supermarket display shelf. A shelf number is really a category rather than a quantity, so I removed it from the analysis. After keeping just the numeric variables, I was left with 12 variables.

I first scaled the data because I wanted to calculate the Euclidean distance between each pair of cereals, but the variables are on different scales. Once the data are scaled, each variable has a mean of zero and a standard deviation of one.
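As a minimal sketch of this scaling step, the snippet below uses base R's scale() and dist() on a tiny made-up data frame. The object name toy and its values are assumptions for illustration only; they are not the cereal data or the code from the repository.

```r
# Toy data standing in for a few numeric cereal columns (values are made up).
toy <- data.frame(
  calories = c(70, 110, 120, 50),
  sodium   = c(130, 200, 15, 140),
  sugars   = c(6, 8, 12, 0)
)

# scale() centres each column on its mean and divides by its standard deviation.
toy_scaled <- scale(toy)

# After scaling, column means are (numerically) zero and standard deviations are one.
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# Euclidean distances between every pair of rows, ready to pass to hclust().
d <- dist(toy_scaled, method = "euclidean")
d
```

Without scaling, a variable measured in large units (such as sodium in milligrams) would dominate the distance simply because its numbers are bigger.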

4 Hierarchical clustering

I first implemented hierarchical clustering with three different linkage methods. Generally, the hierarchical clustering algorithm starts by linking the two observations that are closest together. At each following step, it merges the two closest clusters (a single observation counts as a cluster of one), where the distance between an observation and an existing pair is calculated from the distances to each element in the pair. How that distance is calculated depends on the type of linkage, as I show in the next tree diagrams (or dendrograms).
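To make the linkage idea concrete, here is a small worked example with three made-up points on a line. Points A and B are assumed to already form a pair, and C is the observation being considered. The point names and coordinates are purely illustrative.

```r
# Three made-up points on a line: A and B already form a pair, C is a candidate.
pts <- matrix(c(0, 1, 4), ncol = 1, dimnames = list(c("A", "B", "C"), "x"))
d <- as.matrix(dist(pts))

# Distances from C to each element of the pair {A, B}.
to_pair <- d["C", c("A", "B")]

linkage <- c(
  complete = max(to_pair),   # distance to the farthest element of the pair
  average  = mean(to_pair),  # mean distance over the elements of the pair
  single   = min(to_pair)    # distance to the nearest element of the pair
)
linkage
```

Here C is 4 units from A and 3 units from B, so its distance to the pair is 4 under complete linkage, 3.5 under average linkage, and 3 under single linkage.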

Figure 4.1: This tree diagram shows the complete linkage process. Here, the distance between an observation and the current pair is the MAXIMUM of the distances to each element in the pair, and the algorithm links whichever candidates are closest by that measure. The height in the plot shows the distance at which two observations or clusters are joined. Cutting the tree with a horizontal line at a given height determines the number of clusters. By choosing two clusters, we can say that the maximum distance between clusters is equal to or less than 11.06.

Figure 4.2: This is the average linkage, with a maximum height of 7.57. Here, the distance between an observation and the current pair is the AVERAGE of the distances to each element in the pair. Although the maximum distance between observations is shorter than in complete linkage, average linkage leads to one cluster with three observations and another with 74.

Figure 4.3: Single linkage has an even smaller distance between observations, with a maximum of 4.16. Here, the distance between an observation and the current pair is the MINIMUM of the distances to each element in the pair. The result is very unbalanced: 76 observations are in one cluster and one is in another.

I wanted balanced clusters, so I chose the complete linkage method, which split the observations into two groups of roughly similar size: one cluster with 30 cereals and the other with 47.
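The comparison of linkage methods can be sketched with base R's hclust() and cutree(). This is only an illustration on simulated data: the seed, the toy matrix, and the object names are assumptions, so the cluster sizes and heights it prints will not match the cereal results above.

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Simulated stand-in for the scaled cereal data: 20 rows, 3 numeric columns.
toy <- scale(matrix(rnorm(60), nrow = 20))
d <- dist(toy)  # Euclidean distance is the default

methods <- c("complete", "average", "single")

# Fit one tree per linkage method.
trees <- lapply(setNames(methods, methods), function(m) hclust(d, method = m))

# Cut each tree into two clusters and count the members of each cluster.
sizes <- lapply(trees, function(tr) table(cutree(tr, k = 2)))

# Height of the final merge, i.e. the top of each dendrogram.
top_height <- sapply(trees, function(tr) max(tr$height))

sizes
top_height
# plot(trees$complete) would draw the dendrogram for complete linkage.
```

Comparing the entries of sizes side by side is a quick way to see which linkage method gives the most balanced split before committing to one.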

5 Results

5.1 What are the differences between clusters?

cluster  calories  protein  fat    sodium   fiber  carbo   sugars  potass   vitamins  weight  cups   rating  n
1        95.667    3.100    1.100  99.500   3.440  13.417  4.200   128.133  18.333    0.961   0.720  54.648  30
2        114.043   2.191    0.957  198.085  1.330  15.351  8.660   75.617   34.574    1.073   0.885  35.018  47

Cluster two (47 cereals) has, on average, 19% more calories, 29% less protein, 99% more sodium, 61% less fiber, twice as much sugar, 41% less potassium, 89% more vitamins, and 36% lower ratings than cluster one. Since one of the aims of clustering is to label unlabelled data, I will label cluster one as healthy and cluster two as unhealthy, based on these differences. It’s interesting to see that unhealthy cereals have a worse nutritional profile than healthy cereals but are heavily enriched, with 89% more vitamins.
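The percentage differences can be recomputed from the cluster means in the table. The sketch below copies the rounded means by hand, so the vector names are assumptions and the results may differ by a point from figures calculated on the raw data.

```r
# Cluster means copied from the table in section 5.1 (rounded values).
healthy   <- c(calories = 95.667, protein = 3.100, sodium = 99.500, fiber = 3.440,
               sugars = 4.200, potass = 128.133, vitamins = 18.333, rating = 54.648)
unhealthy <- c(calories = 114.043, protein = 2.191, sodium = 198.085, fiber = 1.330,
               sugars = 8.660, potass = 75.617, vitamins = 34.574, rating = 35.018)

# Percentage change of cluster two relative to cluster one, rounded to whole percent.
pct_diff <- round(100 * (unhealthy - healthy) / healthy)
pct_diff
```

Positive values mean cluster two has more of that variable than cluster one, and negative values mean it has less. A value of roughly 100 for sugars corresponds to "twice as much".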

5.2 Can we classify cereals into either cluster based on their names?

Let’s look at the clusters and see which specific cereals belong to each one. Figure 5.1 shows cereals plotted against sugar and sodium content per serving. Notice how a cereal’s name doesn’t necessarily indicate which cluster it belongs to. For example, Total Raisin Bran has a high sugar and sodium content, higher than a perhaps intuitively unhealthy cereal like Lucky Charms. Another example is the word Bran, which years of marketing have led us to associate with healthy cereals. Not always the case! Some Bran cereals, like 100% Natural Bran, Bran Chex, and Raisin Nut Bran, do have a relatively low sugar and sodium content. Other Bran products, like Raisin Bran, Fruitful Bran, and Post Nat Raisin Bran, have a higher sugar content than Fruity Pebbles!

Figure 5.1: Across all numerical variables, sugar and sodium showed the biggest contrast between the two clusters. This plot shows both clusters, healthy (blue) and unhealthy (red), plotted against the sugar and sodium content per serving. Unhealthy cereals tend to have a higher sugar and sodium content (upper right side), while healthy cereals tend to have less of these ingredients (lower left side). There are exceptions, of course: All-Bran, for example, is labelled as a healthy cereal but has much more sodium than several unhealthy cereals.

5.3 Which shelves have the most unhealthy cereals?

From my brief food marketing experience, I remember learning that brands pay a premium to have their product on certain shelves because they are more likely to be seen by shoppers. Figure 5.2 illustrates types of cereals according to shelf (y-axis) and sugar (x-axis). The top shelf (3) has an equal number of unhealthy and healthy cereals. It’s also pretty cluttered, with 36 cereals in total. My guess is that brands don’t pay extra for their cereals to be stocked on this shelf. The bottom shelf (1) has six more unhealthy cereals than healthy ones (13 versus 7). Notice that cereals on this shelf are to some extent “basics”, whether they’re unhealthy or not: oatmeal, Frosted Flakes, Rice Krispies, Corn Flakes. The middle shelf (2) appears to be the premium one, with about three times as many unhealthy as healthy cereals (16 versus 5). For the most part, middle-shelf unhealthy cereals are also the ones geared towards children: Cap’n Crunch, Lucky Charms, Fruity Pebbles, Cocoa Puffs.

Figure 5.2: Supermarket shelves and sugar content. The middle shelf has about three times as many unhealthy cereals as healthy ones. The top shelf has equal numbers of both types of cereal, spread across a broad range of sugar content.

Shelf  Healthy  Unhealthy  Total
3      18       18         36
2      5        16         21
1      7        13         20
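The shelf comparison comes down to simple arithmetic on these counts. The sketch below types the counts in by hand; the data frame and column names are assumptions for illustration. With the raw data, a cross-tabulation of shelf against cluster label would produce the same counts.

```r
# Counts copied from the shelf table above.
shelf_counts <- data.frame(
  shelf     = c(3, 2, 1),
  healthy   = c(18, 5, 7),
  unhealthy = c(18, 16, 13)
)

shelf_counts$total <- shelf_counts$healthy + shelf_counts$unhealthy

# Unhealthy cereals per healthy cereal on each shelf.
shelf_counts$ratio <- shelf_counts$unhealthy / shelf_counts$healthy

# Share of each shelf taken up by unhealthy cereals, in whole percent.
shelf_counts$pct_unhealthy <- round(100 * shelf_counts$unhealthy / shelf_counts$total)

shelf_counts
```

The middle shelf has a ratio of 3.2 unhealthy cereals per healthy one, compared with 1 on the top shelf and about 1.9 on the bottom shelf.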

6 Conclusion

In this post I had a lot of fun using hierarchical clustering to label breakfast cereals according to their nutritional profile. The main differences between clusters led me to label them as unhealthy and healthy. There were three interesting findings. First, unhealthy cereals were labelled so because they had, on average, 19% more calories, 29% less protein, 99% more sodium, 61% less fiber, twice as much sugar, 41% less potassium, and 89% more vitamins (red herring for parents amirite?). Second, I showed how cereal names can be misleading. Cereals with “healthy” words such as “bran” in their names sometimes have more sugar and salt than Fruity Pebbles! Third, the middle shelf in the particular supermarket where the data was collected tended to be less cluttered and stocked three times as many unhealthy cereals as healthy ones, mostly geared towards children.
