Wesleyan’s Regression Modeling in Practice – Week 2

Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Although the dataset has 80 variables and 1460 observations, my hypothesis is that this single variable, ground floor living area, has a linear relationship with house sale price.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable difference between the mean and median – almost 17000, or just under 10% of our mean.
So we can center the variables as follows:
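
Centering just means subtracting a variable's mean from every value, so that the centered variable averages to zero. Here is a minimal sketch of the idea in plain Python. The `center` helper and the living-area numbers are made up for illustration and are not taken from the Kaggle data.

```python
from statistics import mean

def center(values):
    """Subtract the sample mean so the centered values average to zero."""
    m = mean(values)
    return [v - m for v in values]

# Made-up living areas in square feet, for illustration only (not Kaggle data)
living_area = [850.0, 1200.0, 1500.0, 1710.0, 2400.0]
living_area_c = center(living_area)

print(living_area_c)
print(sum(living_area_c))  # the centered values sum to (approximately) zero
```

Centering shifts the values without changing their spread, so it moves the intercept of a regression but leaves the slope alone.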

[Figures: sale price histogram (Python); sale price vs. ground floor living area]

Looking at the graphs and summary statistics, my hypothesis seems to be supported better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space. The alternative hypothesis (H1) was that there is a statistically significant relationship. There are 79 explanatory variables and I selected only one to explain the response variable, yet both my R-squared and adjusted R-squared are .502, so a little over 50% of the variance in sale price is explained by just one explanatory variable.
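
It can look odd that R-squared and adjusted R-squared agree to three decimal places. The sketch below applies the standard adjusted R-squared formula to the figures from this post. It assumes all 1460 observations were used in the fit, and the function name `adjusted_r_squared` is my own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Figures from this post: R-squared of .502, 1460 observations, 1 predictor.
# Assumption: all 1460 rows were used in the fit.
adj = adjusted_r_squared(0.502, 1460, 1)
print(round(adj, 3))  # 0.502
```

With one predictor and well over a thousand observations the penalty term is tiny, which is why the two values match. The gap should widen once more explanatory variables are added.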

My model p-value of 4.52e-223 is far below .05, so the linear relationship between sale price and ground floor living area is statistically significant. I can therefore reject my null hypothesis in favour of the alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space coefficient (p-value = 2e-16) are significant, with both p-values 0.000 to 3 decimal places. Both t values are greater than zero, and the positive t value for ground floor living space means it is a positive linear relationship.
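
To show where the t value for the slope comes from, here is a minimal sketch of simple least squares in plain Python. The data is made up for illustration (not the Kaggle data) and `slope_t_statistic` is a hypothetical helper. The p-value itself would come from a t distribution with n - 2 degrees of freedom, which this sketch does not compute.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """Return (slope, t value) for the simple regression y = a + b * x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = ss_res / (n - 2)      # n - 2 degrees of freedom
    std_err = sqrt(residual_variance / sxx)   # standard error of the slope
    return slope, slope / std_err

# Made-up living areas (sq ft) and sale prices, for illustration only
area = [850.0, 1200.0, 1500.0, 1710.0, 2400.0]
price = [95000.0, 140000.0, 160000.0, 185000.0, 250000.0]
slope, t_value = slope_t_statistic(area, price)
print(slope, t_value)
```

The standard error is always positive, so the t value carries the same sign as the slope. That is why a positive t value on the living area term points to a positive relationship.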

From the graph, the sale price data appears to be skewed. The mean is -1124 from zero (where we’d like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).
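
One common way to eyeball normality is a normal Q-Q plot, which pairs the sorted, standardized values with the quantiles a standard normal distribution would give. Below is a minimal sketch of how those pairs can be computed using only the standard library. The residuals are made up for illustration, `normal_qq_points` is a hypothetical helper, and the (i + 0.5) / n plotting positions are one convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, standardized sample value) pairs."""
    n = len(values)
    m, s = mean(values), stdev(values)
    standardized = sorted((v - m) / s for v in values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))

# Made-up residuals, for illustration only
residuals = [-12000.0, -3500.0, -800.0, 1500.0, 4200.0, 10600.0]
points = normal_qq_points(residuals)
for theoretical_q, sample_q in points:
    print(round(theoretical_q, 3), round(sample_q, 3))
```

If the residuals are roughly normal, the pairs fall close to a straight line. To check for a log-normal shape, the same function could be applied to the logarithm of a strictly positive variable such as sale price.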

Note the linear regression can also be done in R as follows:

[Figures: sale price histogram; normal Q-Q plot of sale price]

To improve the performance of my model, I now need to look at adding multiple explanatory variables, which I will do in next week’s blog post.
