Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify the subset of variables, from a pool of 79 categorical and quantitative predictors, that best predicted a quantitative response variable: house sale price in Ames, Iowa. Selecting a model with fewer predictors was intended to improve its interpretability. Categorical predictors included house type, neighbourhood, and zoning type. Quantitative predictors included lot area, above-ground living area, first-floor area, and second-floor area. Scales were used to measure the number of bathrooms and the number of bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
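The standardization step can be sketched as follows. This is a minimal illustration with NumPy, not the code used for the analysis; the small matrix `X` and its values are made up, with columns standing in for lot area, above-ground living area, and number of bathrooms.

```python
import numpy as np

# Hypothetical predictor matrix (values invented for illustration):
# columns stand in for lot area, above-ground living area, bathrooms.
X = np.array([
    [8450.0, 1710.0, 2.0],
    [9600.0, 1262.0, 1.0],
    [11250.0, 1786.0, 3.0],
    [9550.0, 1717.0, 1.0],
    [14260.0, 2198.0, 2.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population standard deviation (ddof=0)
X_std = (X - col_mean) / col_std

print(X_std.mean(axis=0))  # each entry is approximately 0
print(X_std.std(axis=0))   # each entry is approximately 1
```

Putting every predictor on the same scale matters for the lasso because the penalty acts on coefficient size, and coefficient size depends on the units of each predictor.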

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with 10-fold cross-validation (k=10) was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validated mean squared error at each step was used to identify the best subset of predictor variables.
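A rough sketch of this procedure is shown below. The post does not say which software was used, so the use of scikit-learn (`train_test_split`, `LassoLarsCV`) is an assumption. The data are synthetic stand-ins rather than the Ames data: 1460 rows, 10 hypothetical predictors drawn as standard normal, of which only the first three affect the response. The random seeds are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardized predictors (not the Ames data).
n_obs, n_pred = 1460, 10
X = rng.standard_normal((n_obs, n_pred))

# In this toy setup only the first three predictors influence the response.
true_coef = np.zeros(n_pred)
true_coef[:3] = [3.0, 2.0, 1.5]
y = X @ true_coef + rng.standard_normal(n_obs)

# 70% training / 30% test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso estimated by least angle regression with 10-fold cross-validation.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Mean squared error averaged over the 10 folds at each candidate penalty.
mean_cv_mse = model.mse_path_.mean(axis=-1)

# Predictors with nonzero coefficients are the ones the lasso retained.
retained = np.flatnonzero(model.coef_)

print("chosen penalty (alpha):", model.alpha_)
print("retained predictor columns:", retained)
print("training R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))
```

Here `mean_cv_mse` plays the role of the cross-validated mean squared error tracked at each step, and the penalty with the lowest value determines which coefficients remain nonzero.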

Figure 1. Change in the validation mean square error at each step:

[Images: regression coefficient progression; mean squared error]

Of the 33 predictor variables, 13 were retained in the selected model. During the estimation process, overall quality, above-ground floor space, and garage cars emerged as the three main predictors. These 13 variables accounted for just over 77% of the variance in sale price in the training set and 81% in the test set.
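The two quantities reported here, the number of retained predictors and the share of variance explained, can be computed as in the sketch below. The helper names `r_squared` and `retained_predictors`, the predictor names, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow. R-squared is taken as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def retained_predictors(names, coefs):
    """Names of predictors whose lasso coefficient is not zero."""
    return [name for name, coef in zip(names, coefs) if coef != 0.0]

# Made-up coefficients: the lasso shrinks some of them exactly to zero.
names = ["overall_quality", "above_ground_area", "garage_cars", "lot_area"]
coefs = np.array([0.45, 0.30, 0.12, 0.0])
kept = retained_predictors(names, coefs)

# Made-up observed and predicted responses.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
r2 = r_squared(y_true, y_pred)

print(kept)
print(round(r2, 2))
```

Applying the same R-squared calculation separately to the training and test observations gives the two percentages compared in the text.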
