To see all the code used in this post, visit my GitHub repository for this site.
- Objectives: To predict property prices using decision trees, specifically the classification and regression trees (CART) algorithm.
- Challenge: First time applying decision trees.
- Data points: 118260
- Language: R
You’re interested in buying or building a house in Ames, Iowa. You want to understand which characteristics are most closely linked to sale price since that can inform your purchasing decision. Perhaps you might even be able to find a less expensive house and modify certain features to increase the sale price.
3 Dataset description
The dataset House Prices: Advanced Regression Techniques was put together by Dean De Cock. It has 79 explanatory variables describing 1,460 homes in Ames, Iowa with 38 numeric and 43 character variables. The codebook for all the variables can be found here. As I go along, I’ll explain the most relevant ones.
3.1 Missing values
Naturally, the dataset contains missing values. They need to be dealt with because regression (and other models) often require complete observations. How to deal with missing data depends on why the data are missing. This article explains four reasons why data could be missing. When the data are missing completely at random (MCAR), observations with missing values can be removed without introducing bias into the model. When they are missing at random (MAR), meaning the missingness depends only on observed variables, removing observations can still introduce bias unless the model accounts for those variables. If we don’t want to lose observations, whether the dataset is small or large, we can impute data instead. Imputing means replacing missing values by making some educated guesses. This article summarises how to impute data depending on why it is missing. If the data are not missing at random, then the missingness mechanism has to be modelled.
About 19 variables have missing values. Based on the codebook, so many houses have pool quality (PoolQC) missing because NA means there is no pool. Since this variable is ordinal, I can revalue it to make it numerical, and 0 will mean the property has no pool. Other features, such as Fence and fireplace quality (FireplaceQu), are missing for similar reasons. We don’t know why LotFrontage is missing, but we will impute the median for properties in the same neighborhood. I learned a lot about imputation and missing values from Erik Bruin’s kernel on Kaggle.
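The two fixes described above can be sketched in base R. This is only an illustration on made-up rows, not the code from the repository: the data frame `homes_toy`, its values, and the 1 to 4 numeric scale for the quality codes are assumptions chosen for the example.

```r
# Made-up toy rows standing in for the Ames data (hypothetical values)
homes_toy <- data.frame(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "OldTown", "OldTown"),
  LotFrontage  = c(60, NA, 80, 50, NA),
  PoolQC       = c(NA, "Gd", NA, "Ex", NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
colSums(is.na(homes_toy))

# NA in PoolQC means "no pool", so recode the ordinal scale with 0 for no pool
# (the 1-4 scale is an assumption made for this sketch)
pool_scale <- c(Fa = 1, TA = 2, Gd = 3, Ex = 4)
homes_toy$PoolQC <- ifelse(is.na(homes_toy$PoolQC), 0,
                           unname(pool_scale[homes_toy$PoolQC]))

# Replace missing LotFrontage with the median of the same neighborhood
homes_toy$LotFrontage <- ave(
  homes_toy$LotFrontage, homes_toy$Neighborhood,
  FUN = function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
)

homes_toy
```

Using `ave()` keeps the rows in their original order, so the imputed column can be assigned straight back to the data frame.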
Correlation, \(Cor(X,Y)\), measures the strength of the linear relationship between two variables \(X\) and \(Y\). The correlation between SalePrice and another variable, let’s say OverallQual, is the covariance between the two variables after each has been normalised separately. Since covariance carries the units of both variables, calculating the correlation is more helpful as it is unit free. If we created a model with only one variable as the predictor of SalePrice, let’s say kitchen quality (KitchenQual), and normalised the data, the regression slope would be the correlation between the two variables.
norm_fit <- lm(scale(SalePrice) ~ scale(KitchenQual), data = homes)
round(coefficients(norm_fit), digits = 2)

       (Intercept) scale(KitchenQual)
              0.00               0.66
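The same identity can be checked on any dataset. Here is a small sketch using `mtcars`, which ships with R, where `mpg` and `wt` are stand-ins for SalePrice and KitchenQual (an assumption made so the example runs without the Ames data).

```r
# mtcars ships with R; mpg and wt stand in for SalePrice and KitchenQual
toy_fit <- lm(scale(mpg) ~ scale(wt), data = mtcars)
slope <- unname(coefficients(toy_fit)[2])

# Correlation computed directly, and as the covariance of the standardised columns
r_direct <- cor(mtcars$mpg, mtcars$wt)
r_scaled <- cov(scale(mtcars$mpg), scale(mtcars$wt))[1, 1]

# All three numbers should agree
round(c(slope = slope, r_direct = r_direct, r_scaled = r_scaled), 2)
```

The slope, the direct correlation, and the covariance of the standardised columns are three routes to the same number, which is why the standardised slope can be read as a correlation.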
Here is the correlation matrix for variables that have a relationship stronger than 0.5 with SalePrice. We can also see the two correlations (OverallQual and KitchenQual) mentioned above.
There are 17 variables that have a correlation stronger than 0.5 with SalePrice. When variables are highly correlated with each other, it’s better to remove some of them, as they don’t necessarily add information and could lead to multicollinearity. The correlation plot highlights some obvious pairs related to each other:
- Garage area and GarageCars: makes sense, a bigger garage can hold more cars.
- First-floor area and TotalBsmtSF: this also seems reasonable, since basements sit underneath the first floor and would tend to have a similar area.
- Number of rooms above ground and GrLivArea: again ok, more rooms would be linked to a bigger living area.
- House age and YearsSinceGarageBuilt: garages are usually built at the same time as the house.
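One way to find such pairs without reading them off a plot is to filter the correlation matrix. The sketch below uses `mtcars` with `mpg` standing in for SalePrice, and the 0.8 cut-off for "highly correlated" predictors is an arbitrary choice for illustration, not a value from this analysis.

```r
# mtcars ships with R; mpg stands in for SalePrice
cor_mat <- cor(mtcars)

# Predictors whose absolute correlation with the target exceeds 0.5
target <- "mpg"
cor_with_target <- cor_mat[, target]
strong <- setdiff(names(cor_with_target)[abs(cor_with_target) > 0.5], target)

# Among those, list predictor pairs that are highly correlated with each other
# (0.8 is an arbitrary cut-off chosen for this sketch)
sub_mat <- cor_mat[strong, strong]
idx <- which(abs(sub_mat) > 0.8 & upper.tri(sub_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(sub_mat)[idx[, "row"]],
  var2 = colnames(sub_mat)[idx[, "col"]],
  cor  = round(sub_mat[idx], 2)
)
pairs_df
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal.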
4 Regression trees - why use them?
The tool I’m using to answer the question is regression trees. They are also known as classification and regression trees (CART) or recursive partitioning (RPART). I’m choosing this algorithm because a) I’ve never applied it to a dataset and b) regression trees are interpretable and allow for easy-to-follow plots that might come in handy. I will be using the rpart package in R. rpart builds a model in two stages:
The variable which can best split the data into two groups is identified. The data are then separated into two groups, and the whole process is repeated recursively until the sub-groups reach a minimum size or until no further improvements can be made. After each split, the observations within a group can be more or less homogeneous. This homogeneity is also called purity, and it can be measured: the impurity measure of a node specifies how mixed the resulting subset is.
The tree is trimmed back, or pruned, using cross-validation. We identify either the tree with the lowest cross-validated error or the smallest tree whose error is within one standard error of that lowest cross-validated error. In this case, that is the tree with seven splits and eight nodes.
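The first stage, the search for the best split, can be illustrated for a single numeric predictor. This is a simplified sketch and not rpart's actual implementation: it measures impurity as the sum of squared deviations from the node mean (the idea behind the "anova" method) and ignores minimum node sizes. The function names and the toy data are made up for the example.

```r
# Sum of squared deviations from the mean: the impurity of a regression node
sse <- function(y) sum((y - mean(y))^2)

# Try every midpoint between consecutive distinct x values and keep the cut
# that gives the largest drop in impurity
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  gain <- sapply(cuts, function(cut) {
    sse(y) - (sse(y[x < cut]) + sse(y[x >= cut]))
  })
  list(cut = cuts[which.max(gain)], gain = max(gain))
}

# Made-up toy data: price jumps once area passes 100
area  <- c(50, 60, 70, 80, 110, 120, 130, 140)
price <- c(100, 105, 95, 100, 200, 210, 190, 200)

split <- best_split(area, price)
split
```

A full tree repeats this search over every predictor at every node and then recurses into the two resulting groups.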
# Train the model
library(rpart)
homes_model <- rpart(formula = SalePrice ~ ., data = homes_train, method = "anova")
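The pruning stage can be sketched with the complexity table that rpart stores on the fitted model. The example below uses `mtcars` with `mpg` standing in for SalePrice, and the `cp` and `minsplit` values are assumptions chosen only to let the toy tree overgrow. It shows both selection rules mentioned above.

```r
library(rpart)

set.seed(123)  # cross-validation folds are random, so fix the seed

# mtcars ships with R; mpg stands in for SalePrice.
# A small cp and minsplit let the tree overgrow so there is something to prune.
full_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                   control = rpart.control(cp = 0.001, minsplit = 5))

cp_table <- full_tree$cptable  # columns: CP, nsplit, rel error, xerror, xstd

# Option 1: the tree with the lowest cross-validated error
i_min <- which.min(cp_table[, "xerror"])

# Option 2 (one-standard-error rule): the smallest tree whose error is within
# one standard error of that minimum
threshold <- cp_table[i_min, "xerror"] + cp_table[i_min, "xstd"]
i_1se <- min(which(cp_table[, "xerror"] <= threshold))

tree_min <- prune(full_tree, cp = cp_table[i_min, "CP"])
tree_1se <- prune(full_tree, cp = cp_table[i_1se, "CP"])

# Number of leaves (terminal nodes) in each tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(full_tree),
  min_xerror = n_leaves(tree_min),
  one_se = n_leaves(tree_1se))
```

The one-standard-error rule trades a little cross-validated accuracy for a simpler tree, so it never picks a larger tree than the minimum-error rule.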
4.1 Variable importance
Using the rpart function, we are able to rank which variables are most predictive of SalePrice. The following plot ranks these variables in descending order.
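A ranking like this can be read from the `variable.importance` element of a fitted rpart object. The sketch below fits a toy tree on `mtcars` (with `mpg` standing in for SalePrice) and is not the plotting code used for this post. Rescaling the scores to percentages is a presentation choice made for the example.

```r
library(rpart)

# mtcars ships with R; mpg stands in for SalePrice
toy_model <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Named numeric vector: one score per variable used in primary or surrogate splits
importance <- sort(toy_model$variable.importance, decreasing = TRUE)

# Rescale to percentages so the scores are easier to compare
importance_pct <- round(100 * importance / sum(importance), 1)
importance_pct

# Horizontal bar chart, most important variable at the top
barplot(rev(importance_pct), horiz = TRUE, las = 1,
        xlab = "Relative importance (%)")
```

Because surrogate splits also contribute to the score, a variable can rank highly without appearing in the plotted tree.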
Now I will compute two measures of error (RMSE and MAE) on both the baseline and optimised models using the test data. I will choose the model with the smallest RMSE and MAE on this unseen data.
# Compute RMSE baseline model
rmse(actual = homes_test$SalePrice,  # Actual values
     predicted = pred_base)          # Predicted values

# Compute MAE baseline model
mae(actual = homes_test$SalePrice,   # Actual values
    predicted = pred_base)           # Predicted values

# Compute RMSE optimised model
rmse(actual = homes_test$SalePrice,  # Actual values
     predicted = pred_opt)           # Predicted values

# Compute MAE optimised model
mae(actual = homes_test$SalePrice,   # Actual values
    predicted = pred_opt)            # Predicted values
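For reference, both metrics are simple to write by hand. The helpers below are hypothetical stand-ins (named `rmse_manual` and `mae_manual` to avoid clashing with the functions called above), and the numbers are made up for illustration.

```r
# Root mean squared error: penalises large errors more heavily
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: the average size of the error, in the units of the target
mae_manual <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up values for illustration
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

rmse_manual(actual, predicted)
mae_manual(actual, predicted)
```

Both are expressed in the units of the target, so for house prices they read as dollars. RMSE is never smaller than MAE, and a large gap between the two points to a few big misses.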
It seems the baseline model with 10 splits resulted in a lower test RMSE and MAE than the optimised model.
I chose the model with 10 splits and 11 nodes because it had the lowest error on unseen data. I was surprised because I had expected the smaller model to perform better.
The most influential variable for SalePrice is OverallQual. This variable “rates the overall material and finish of the house”; values equal to or above 8 correspond to “very good”, “excellent”, and “very excellent”. On one hand, for houses with a rating of 8 or above in OverallQual, the next most decisive variable is TotalBsmtSF, which is the “total square feet of basement area”. If it’s above 1850, the house is classified depending on its neighborhood group: Fancy houses are in neighborhoods NridgHt, NoRidge, and StoneBr; all other neighborhoods are Not_fancy. Houses below 1850 square feet are classified further down the tree.
On the other hand, for houses with a rating below 8 for OverallQual, the same variable decides again, separating houses rated 7 (“good”) or above from the rest. Either way, houses are classified again by GrLivArea, the “above grade living area in square feet”, where smaller houses depend on basement size (TotalBsmtSF) or on MS_type, which describes the type of dwelling. Modern dwellings include split foyer and 1-story built in 1946 or newer. Less modern dwellings are, in some cases, built in 1945 or older.
In response to the question of what makes a house more valuable than others, the simple answer is quality and area. Many variables in the tree are related to area: basement area, garage area, and total living area. Even the size of the lot plays an important role, so size is essential. But more important than size, in this case, is the overall quality of the house. It seems that having a property in good shape pays off. Would it be worth finding a house in a somewhat fancy neighborhood and improving its finish and materials to increase quality? Who knows? That is a causal question. There is a strong relationship between the variables I describe, but correlation does not imply causation. As a final note, it’s important to add that because the dataset is from Ames, Iowa, the results of this analysis are limited to that area.
In this post I explored a property dataset from Ames, Iowa. The data described a set of features for houses and included sale price. My goal was to understand what features are linked with sale price for this specific dataset using regression trees. To do this, I first prepared the data by dealing with missing values and created other variables to better interpret the results. After preparing the data, I used regression trees to answer the question. One of the benefits of regression trees is that the output can be illustrated and easily interpreted. I found that overall quality is the variable most closely linked to sale price. Other features such as living area and basement size are also important. I also found that the neighborhoods Northridge Heights, Northridge, and Stone Brook have the most expensive houses.