Regression trees: predicting property prices

1 Summary

To see all the code used in this post, visit my GitHub repository for this site.

  • Objectives: To predict property prices using decision trees, specifically the classification and regression trees (CART) algorithm.
  • Challenge: First time applying decision trees.
  • Data points: 118260
  • Language: R

2 Question

You’re interested in buying or building a house in Ames, Iowa [1]. You want to understand which characteristics are most closely linked to sale price since that can inform your purchasing decision. Perhaps you might even be able to find a less expensive house and modify certain features to increase the sale price.

3 Dataset description

The dataset House Prices: Advanced Regression Techniques was put together by Dean De Cock. It has 79 explanatory variables describing 1,460 homes in Ames, Iowa. Counting the Id and SalePrice columns as well, there are 81 columns: 38 numeric and 43 character. The codebook for all the variables can be found here. As I go along, I’ll explain the most relevant ones.

3.1 Missing values

Naturally, the dataset contains missing values. They need to be dealt with because regression (and many other models) often requires complete observations. How to deal with missing data depends on why the data are missing. This article explains four reasons why data could be missing. When the data are missing completely at random (MCAR), observations with missing values can be removed without introducing bias into the model; when they are only missing at random (MAR), removal is safe only if the model accounts for the variables that drive the missingness. Sometimes, however, we don’t want to lose observations, whether the dataset is small or large, and in that case we can impute data. Imputing means replacing missing values with educated guesses. This article summarises how to impute data depending on why it is missing. If the data are not missing at random, then the missingness mechanism has to be modelled.

About 19 variables have missing values. Based on the codebook, so many houses have pool quality (PoolQC) missing because NA means there is no pool. Since this variable is ordinal, I can recode it as numeric, with 0 meaning the property has no pool. Miscellaneous features (MiscFeature), Alley, Fence, and fireplace quality (FireplaceQu) are missing for similar reasons. We don’t know why LotFrontage is missing, but we will impute the median for properties in the same neighborhood. I learned a lot about imputation and missing values from Erik Bruin’s kernel on Kaggle.
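The two imputation steps described above can be sketched on a toy data frame. This is only an illustration: the rows are made up, and the numeric mapping for the quality levels (`quality_levels`) is an assumption, not necessarily the recoding used in the original analysis.

```r
# Toy data: a made-up stand-in for the real homes dataset
homes_toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "StoneBr", "StoneBr"),
  LotFrontage  = c(60, NA, 80, 40, NA),
  PoolQC       = c(NA, "Gd", NA, "Ex", NA),
  stringsAsFactors = FALSE
)

# Hypothetical ordinal mapping; 0 means "no pool"
quality_levels <- c(None = 0, Fa = 1, TA = 2, Gd = 3, Ex = 4)

# Step 1: NA in PoolQC means "no pool", so recode it and convert to numeric
homes_toy$PoolQC[is.na(homes_toy$PoolQC)] <- "None"
homes_toy$PoolQC <- unname(quality_levels[homes_toy$PoolQC])

# Step 2: fill missing LotFrontage with the median of the same neighborhood
neighborhood_median <- ave(
  homes_toy$LotFrontage, homes_toy$Neighborhood,
  FUN = function(v) median(v, na.rm = TRUE)
)
is_missing <- is.na(homes_toy$LotFrontage)
homes_toy$LotFrontage[is_missing] <- neighborhood_median[is_missing]

print(homes_toy)
```

Using the neighborhood median rather than the overall median keeps the imputed value closer to what similar nearby lots look like.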

3.2 Correlation

Correlation, \(Cor(X,Y)\), measures the strength of the linear relationship between two variables \(X\) and \(Y\). The correlation between SalePrice and another variable, say OverallQual, equals the covariance of the two variables after each has been standardised separately.

cov(scale(homes$SalePrice), scale(homes$OverallQual))

[1,] 0.791

Since covariance is expressed in units of OverallQual * SalePrice, calculating the correlation is more helpful because it is unit free. If we fitted a model with a single predictor of SalePrice, say kitchen quality (KitchenQual), and standardised both variables, the regression slope would equal the correlation between the two variables.

norm_fit <- lm(scale(SalePrice) ~ scale(KitchenQual), data = homes)
round(coefficients(norm_fit), digits = 2)
   (Intercept) scale(KitchenQual) 
          0.00               0.66 
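Both identities used above (correlation equals the covariance of standardised variables, and equals the slope of a regression on standardised variables) can be checked on simulated data. The vectors `x` and `y` below are made up for the check and have nothing to do with the homes dataset.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Standardise each variable separately; as.numeric() drops the matrix
# attributes that scale() attaches
x_std <- as.numeric(scale(x))
y_std <- as.numeric(scale(y))

r          <- cor(x, y)
cov_scaled <- cov(x_std, y_std)
slope      <- unname(coef(lm(y_std ~ x_std))[2])

# All three numbers agree up to floating-point error
print(c(correlation = r, cov_of_scaled = cov_scaled, slope = slope))
```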

Here is the correlation matrix for variables that have a relationship stronger than 0.5 with SalePrice. We can also see the two correlations (SalePrice & OverallQual; SalePrice & KitchenQual) mentioned above.

Figure 3.1: Variables are arranged in descending order according to the strength of the relationship with SalePrice.

There are 17 variables that have a correlation stronger than 0.5 with SalePrice. When predictors are highly correlated with each other, it’s better to remove some of them, since they add little extra information and can lead to multicollinearity. The correlation plot highlights some obvious pairs related to each other:

  • GarageArea and GarageCars: makes sense, a bigger garage can hold more cars.
  • X1stFlrSF and TotalBsmtSF: the total area of the first floor and of the basement. This also seems reasonable, since basements sit directly beneath the first floor and tend to have a similar area.
  • TotRmsAbvGrd and GrLivArea: the total number of rooms and area above ground, again ok, more rooms would be linked to a bigger living area.
  • YearsSinceBuilt and YearsSinceGarageBuilt: since garages are usually built at the same time as the house.
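One way to list such pairs programmatically, rather than reading them off the plot, is sketched below. The data are simulated, the column names are borrowed only for readability, and the 0.8 cut-off is an assumption chosen for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 200
garage_cars <- sample(0:3, n, replace = TRUE)

# Simulated predictors: GarageArea depends on GarageCars, LotArea does not
toy <- data.frame(
  GarageCars = garage_cars,
  GarageArea = 250 * garage_cars + rnorm(n, sd = 30),
  LotArea    = rnorm(n, mean = 10000, sd = 2000)
)

cor_matrix <- cor(toy)

# Blank the diagonal and lower triangle so each pair is reported once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the cut-off
high <- which(abs(cor_matrix) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
print(high_pairs)
```

From each reported pair, one variable could then be dropped before modelling.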

4 Regression trees - why use them?

The tool I’m using to answer the question is regression trees. They are also known as classification and regression trees (CART) or the recursive partitioning (RPART) algorithm. I’m choosing this algorithm because a) I’ve never applied it to a dataset and b) regression trees are interpretable and allow for easy-to-follow plots that might come in handy. I will be using the rpart package in R. rpart builds a model in two stages:

First stage:

The variable that best [2] splits the data into two groups is identified. The data are then separated into two groups and the whole process is repeated recursively until the sub-groups reach a minimum size, or until no further improvements can be made. After a split, the observations within each group can be more or less homogeneous. This homogeneity is also called purity and it can be measured. The impurity measure of a node specifies how mixed the resulting subset is.
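To make the idea of impurity concrete, here is a minimal sketch of how a single split could be chosen for a numeric outcome: impurity is taken to be the sum of squared deviations from the node mean, and the best cut is the one that reduces it the most. The helpers `sse()` and `best_split()` and the toy vectors are hypothetical, written for illustration; this is not rpart’s internal code.

```r
# Impurity of a node for a numeric outcome: sum of squared deviations
# from the node mean
sse <- function(y) sum((y - mean(y))^2)

# Try every midpoint between consecutive distinct values of x and keep the
# cut that leaves the least impurity in the two child nodes
best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  after <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))
  best  <- which.min(after)
  list(cut = cuts[best], reduction = sse(y) - after[best])
}

# Made-up quality ratings and prices (in thousands)
quality <- c(4, 5, 5, 6, 7, 8, 9, 9)
price   <- c(110, 120, 130, 150, 200, 290, 310, 330)

first_split <- best_split(quality, price)
print(first_split)
```

A full tree repeats this search over every candidate variable and then again inside each resulting group.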

Second stage:

The tree is trimmed back, or pruned, using cross-validation. We identify the tree with the lowest cross-validated error, or the smallest tree whose error is within one standard error of that minimum. In this case, the tree with seven splits and eight nodes is has the

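# Load the package that provides rpart()
library(rpart)

# Train the model: method = "anova" fits a regression tree on all predictors
homes_model <- rpart(formula = SalePrice ~ ., data = homes_train, method = "anova")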
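For readers who want to run the fitting step end to end, here is a small sketch on R’s built-in `mtcars` data, used purely as a stand-in for the homes data. The 24-row training size, the seed, and the object names are arbitrary assumptions; the original post’s train/test split is not shown.

```r
library(rpart)

set.seed(123)  # arbitrary seed so the split is reproducible

# Hold out 8 of the 32 cars as a test set
train_rows <- sample(nrow(mtcars), size = 24)
cars_train <- mtcars[train_rows, ]
cars_test  <- mtcars[-train_rows, ]

# Same call shape as in the post: numeric outcome, all other columns as predictors
cars_model <- rpart(formula = mpg ~ ., data = cars_train, method = "anova")

# Predictions on unseen rows, one per test observation
cars_pred <- predict(cars_model, newdata = cars_test)
print(head(cars_pred))
```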

4.1 Variable importance

Using the rpart function, we are able to rank which variables are most predictive of SalePrice. The following plot ranks these variables in descending order.

Figure 4.1: Overall quality (OverallQual) is the most predictive variable - it was also most correlated with SalePrice. It’s followed by basement size (TotalBsmtSF) and neighborhood type.
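A fitted rpart object stores its importance scores in the `variable.importance` component, so a ranking like the one in the figure can be extracted roughly as follows. The sketch again uses `mtcars` as a stand-in, and `cars_model` and `importance_df` are illustrative names, not objects from the original analysis.

```r
library(rpart)

# Stand-in model; the post's homes_model would be used the same way
cars_model <- rpart(formula = mpg ~ ., data = mtcars, method = "anova")

# Named numeric vector of importance scores, sorted from most to least important
importance <- sort(cars_model$variable.importance, decreasing = TRUE)

importance_df <- data.frame(
  variable   = names(importance),
  importance = unname(importance)
)
print(importance_df)
```

The resulting data frame could then be passed to a bar chart to reproduce a plot of this kind.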

Table 4.1: Understanding the complexity parameter table: CP stands for complexity parameter, which can be thought of as the minimum improvement a split must add to the tree to be kept; it equals the decrease in rel.error per additional split. nsplit is the number of splits in the tree. rel.error is 1-RSquared, where RSquared has the same meaning as in linear regression, so rel.error is the share of variability in the data that the model leaves unexplained. xerror is the relative sum-of-squared errors in tenfold cross-validation, and xstd is the standard error of xerror across the ten validation samples. We are going to choose the smallest tree (by splits) whose error is no more than one standard error above the error of the best model. The smallest error is 0.3242; adding one standard error gives 0.3242+0.0373=0.3615. The smallest tree with an xerror below 0.3615 is the one with 4 splits, with an xerror of 0.3604.
    CP  nsplit  rel.error  xerror    xstd
0.4458       0     1.0000  1.0034  0.0965
0.1119       1     0.5542  0.5579  0.0545
0.0779       2     0.4424  0.4670  0.0542
0.0394       3     0.3645  0.4001  0.0361
0.0268       4     0.3251  0.3604  0.0357
0.0212       6     0.2715  0.3619  0.0410
0.0153       7     0.2503  0.3455  0.0378
0.0147       8     0.2351  0.3368  0.0372
0.0106       9     0.2204  0.3260  0.0373
0.0100      10     0.2098  0.3242  0.0373

Figure 4.2: This is the optimised model according to the criterion of choosing the smallest model within one standard error of the smallest xerror.
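The one-standard-error rule can be reproduced directly from the numbers in Table 4.1. The sketch below retypes the table into a data frame by hand (`cp_table` is an illustrative name); the selection logic is a straightforward reading of the rule, not code from the original analysis.

```r
# Values copied from Table 4.1
cp_table <- data.frame(
  CP     = c(0.4458, 0.1119, 0.0779, 0.0394, 0.0268,
             0.0212, 0.0153, 0.0147, 0.0106, 0.0100),
  nsplit = c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10),
  xerror = c(1.0034, 0.5579, 0.4670, 0.4001, 0.3604,
             0.3619, 0.3455, 0.3368, 0.3260, 0.3242),
  xstd   = c(0.0965, 0.0545, 0.0542, 0.0361, 0.0357,
             0.0410, 0.0378, 0.0372, 0.0373, 0.0373)
)

# Best tree by cross-validated error, and the one-standard-error threshold
best      <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Smallest tree whose cross-validated error does not exceed the threshold
candidates <- cp_table[cp_table$xerror <= threshold, ]
chosen     <- candidates[which.min(candidates$nsplit), ]
print(chosen)
```

With a fitted model, the same table is available as `homes_model$cptable`, and the selected CP value could be passed to `prune(homes_model, cp = ...)` to obtain the smaller tree.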

Now I will compute two measures of error (RMSE and MAE) for both the baseline and optimised models using the test data. I will choose the model with the smallest RMSE and MAE on this unseen data.

#Compute RMSE baseline model
rmse(actual=homes_test$SalePrice, #Actual values
     predicted = pred_base ) #Predicted values

[1] 44.35

#Compute MAE baseline model
mae(actual=homes_test$SalePrice, #Actual values
     predicted = pred_base ) #Predicted values

[1] 29.38

#Compute RMSE optimised model
rmse(actual=homes_test$SalePrice, #Actual values
     predicted = pred_opt) #Predicted values

[1] 49.59

#Compute MAE optimised model
mae(actual=homes_test$SalePrice, #Actual values
    predicted = pred_opt) #Predicted values

[1] 35.49

It seems the baseline model with 10 splits had a lower test RMSE and MAE than the optimised model.
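For reference, both error measures are simple to compute by hand. The helpers below are illustrative re-implementations written for this sketch (named `rmse_by_hand` and `mae_by_hand` to avoid clashing with the functions called above), and the numbers are made up.

```r
# Root mean squared error: penalises large misses more heavily
rmse_by_hand <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute error: the average size of a miss
mae_by_hand <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Made-up sale prices and predictions, in thousands
actual    <- c(200, 150, 300)
predicted <- c(210, 140, 330)

print(rmse_by_hand(actual, predicted))
print(mae_by_hand(actual, predicted))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, which matches the pattern in the results above.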

5 Results

I chose the model with 10 splits and 11 nodes because it had the lowest error metrics on unseen data. I was surprised because I had expected the smaller model to perform better.

Figure 5.1: The tree with 11 nodes and 10 splits had the lowest error metrics.

The most influential variable for SalePrice was OverallQual. This variable “rates the overall material and finish of the house”; values equal to or above 8 correspond to “very good”, “excellent”, and “very excellent”. On one hand, for houses with a rating of 8 or above in OverallQual, the next most decisive variable is TotalBsmtSF, which is the “total square feet of basement area”. If it’s above 1850, the house is classified depending on Neighborhood_type. Fancy houses are in neighborhoods: NridgHt, NoRidge, StoneBr. All other neighborhoods are Not_fancy. Houses with basements below 1850 square feet are further classified depending on LotArea.

On the other hand, houses with a rating below 8 for OverallQual are split again on the same variable, this time at a rating of 7 or above, which is “good”. Either way, houses are then classified by GrLivArea, the “above grade living area in square feet”, and smaller houses are split further by basement size (TotalBsmtSF) or by MS_type, which describes the type of dwelling. Modern dwellings include split foyer and 1-story built in 1946 or newer. Less modern dwellings are, in some cases, 1945 or older.

In response to the question of what makes a house more valuable than others, the simple response is quality and area. Many variables in the tree are related to area: basement area, garage area, and total living area. Even the size of the lot plays an important role, so size is essential. But more important than size, in this case, is the overall quality of the house. It seems that having a property in good shape pays off. Would it be worth finding a house in a somewhat fancy neighborhood and improving its finish and materials to increase quality? Who knows? That is a causal question. There is a strong relationship between the variables I describe, but correlation does not imply causation. As a final note, it’s important to add that because the dataset is from Ames, Iowa, the results of this analysis are limited to that area.

6 Conclusion

In this post I explored a property dataset from Ames, Iowa. The data described a set of features for houses and included sale price. My goal was to understand what features are linked with sale price for this specific dataset using regression trees. To do this, I first prepared the data by dealing with missing values and created other variables to better interpret the results. After preparing the data, I used regression trees to answer the question. One of the benefits of regression trees is that the output can be illustrated and easily interpreted. I found that overall quality is the variable most closely linked to sale price. Other features such as living area and basement size are also important. I also found that the neighborhoods Northridge Heights, Northridge, and Stone Brook have the most expensive houses.

  1. That’s where the dataset is from.

  2. The best tree is the smallest tree with lowest cross-validated error.
