Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area (`GrLivArea`), my primary explanatory variable. Although the dataset has 80 variables and 1460 observations, my hypothesis is that sale price and ground floor living area on their own are linearly related.

The data set, sample, procedure, and methods were detailed in week 1’s post.

```python
import numpy as np
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# call in data set
data = pandas.read_csv('homes_train.csv')

print(data['SalePrice'].describe())
```

```
count     1460.00
mean    180921.20
std      79442.50
min      34900.00
25%     129975.00
50%     163000.00
75%     214000.00
max     755000.00
Name: SalePrice, dtype: float64
```

There is quite a sizable difference between the mean and the median: almost 18,000, or just under 10% of our mean.
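As a quick check of that gap, the arithmetic can be reproduced from the summary values printed above. This is only a small sketch: the variable names are my own, and the numbers are copied from the `describe()` output rather than recomputed from the file.

```python
# Values copied from the describe() output above
mean_price = 180921.20
median_price = 163000.00  # the 50% row

gap = mean_price - median_price
gap_pct_of_mean = 100 * gap / mean_price

print(f"mean - median = {gap:.2f}")
print(f"as a share of the mean = {gap_pct_of_mean:.1f}%")
```

A mean sitting above the median like this is what a right-skewed distribution typically produces.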

So we can center the variables as follows:

```python
# Center both variables (subtract the mean, leave the scale alone).
# The flags must be real booleans: the string 'False' is truthy.
data['GrLivArea'] = preprocessing.scale(data['GrLivArea'], with_mean=True, with_std=False)
data['SalePrice'] = preprocessing.scale(data['SalePrice'], with_mean=True, with_std=False)

print(data['GrLivArea'].mean())
print(data['SalePrice'].mean())

# convert variables to numeric format using the to_numeric function
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')

# view the centering
data['SalePrice'].hist()

# BASIC LINEAR REGRESSION
# explanatory variable on the x-axis, response on the y-axis
plt.figure()
scat1 = seaborn.regplot(x="GrLivArea", y="SalePrice", scatter=True, data=data)
plt.xlabel('Ground Living Area')
plt.ylabel('Sale Price')
plt.title('Scatterplot for the Association Between Sale Price and Ground Living Area')
plt.show()
```
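Two things about centering are worth seeing on a small example. First, a flag written as the string `'False'` is truthy in Python, so it does not behave like the boolean `False` (I believe newer scikit-learn releases reject non-boolean flags outright, but I have not checked every version). Second, centering moves the intercept to roughly zero and leaves the slope alone. The sketch below uses made-up numbers and a hand-written least-squares helper (`simple_ols`, my own name) rather than the Kaggle data or statsmodels.

```python
# Non-empty strings are truthy in Python, so 'False' behaves like True
# when it is used as a flag.
print(bool('False'))  # True

# Toy data (made up for illustration, not taken from the Kaggle file)
area = [800.0, 1200.0, 1500.0, 2000.0, 2500.0]
price = [100000.0, 150000.0, 170000.0, 240000.0, 280000.0]


def mean(values):
    return sum(values) / len(values)


def center(values):
    """Subtract the mean so the centered values average to zero."""
    m = mean(values)
    return [v - m for v in values]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


raw_intercept, raw_slope = simple_ols(area, price)
cen_intercept, cen_slope = simple_ols(center(area), center(price))

print(f"raw:      intercept={raw_intercept:.2f}, slope={raw_slope:.4f}")
print(f"centered: intercept={cen_intercept:.2f}, slope={cen_slope:.4f}")
```

The two slopes agree, and only the intercept changes, which is why centering is mostly about making the intercept easier to interpret.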

```python
print("OLS regression model for the association between sale price and ground living area")
reg1 = smf.ols('SalePrice ~ GrLivArea', data=data).fit()
print(reg1.summary())
```

```
OLS regression model for the association between sale price and ground living area
                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.502
Model:                            OLS   Adj. R-squared:                  0.502
Method:                 Least Squares   F-statistic:                     1471.
Date:                Mon, 03 Oct 2016   Prob (F-statistic):          4.52e-223
Time:                        00:13:00   Log-Likelihood:                -18035.
No. Observations:                1460   AIC:                         3.607e+04
Df Residuals:                    1458   BIC:                         3.608e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1.857e+04   4480.755      4.144      0.000      9779.612  2.74e+04
GrLivArea    107.1304      2.794     38.348      0.000       101.650   112.610
==============================================================================
Omnibus:                      261.166   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3432.287
Skew:                           0.410   Prob(JB):                         0.00
Kurtosis:                      10.467   Cond. No.                     4.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```

Looking at the graphs and summary statistics, the data support my hypothesis better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space; the alternative hypothesis (H1) was that there is a statistically significant relationship. There are 79 explanatory variables and I selected only one of them, yet both my R-squared and adjusted R-squared are .502, so about 50% of the variance in sale price is explained by a single explanatory variable.
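To make the R-squared figure a bit more concrete, here is a small sketch that ties it to two other quantities. With a single explanatory variable, R-squared is the square of the Pearson correlation, and it can also be recovered from the F-statistic and the degrees of freedom. The numbers are copied from the summary above; the variable names are my own.

```python
import math

# Values copied from the OLS summary above
r_squared = 0.502
f_stat = 1471.0
df_model = 1
df_resid = 1458

# With one explanatory variable, R-squared is the squared Pearson
# correlation, so |r| is its square root (the sign follows the slope).
abs_corr = math.sqrt(r_squared)

# R-squared recovered from the F-statistic:
# R^2 = F * df_model / (F * df_model + df_resid)
r_squared_from_f = f_stat * df_model / (f_stat * df_model + df_resid)

print(f"|correlation| = {abs_corr:.3f}")
print(f"R-squared from F = {r_squared_from_f:.3f}")
```

So a correlation of roughly 0.71 between living area and sale price is what sits behind the 50% figure.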

The p-value for the F-statistic, 4.52e-223, is far below .05, so the linear relationship between sale price and ground floor living area is statistically significant. I can therefore reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space coefficient (p-value < 2e-16) are individually significant, with both p-values 0.000 to 3 decimal places. The coefficient on ground floor living space is positive (107.13), so it is a positive linear relationship. Note that the intercept of 1.857e+04 in the summary is what a fit on the uncentered variables gives (it matches the R output below); with both variables centered, the intercept would be approximately zero while the slope would be unchanged.
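The t-values and confidence intervals in the coefficient table can be rebuilt from the estimates and standard errors, which helps show where the significance comes from. This is a rough sketch: the intercept estimate 18569.026 is taken from the R summary further down (the Python table rounds it to 1.857e+04), and I use a normal critical value in place of the exact t critical value, which is an approximation.

```python
from statistics import NormalDist

# Estimates and standard errors copied from the regression output
# (the R summary gives the intercept to more digits: 18569.026)
coef = {"Intercept": 18569.026, "GrLivArea": 107.1304}
std_err = {"Intercept": 4480.755, "GrLivArea": 2.794}

# Normal approximation to the t critical value (assumption: with 1458
# residual degrees of freedom the two are nearly identical)
z = NormalDist().inv_cdf(0.975)

results = {}
for name in coef:
    t_value = coef[name] / std_err[name]        # t = estimate / std error
    lower = coef[name] - z * std_err[name]      # lower 95% bound
    upper = coef[name] + z * std_err[name]      # upper 95% bound
    results[name] = (t_value, lower, upper)
    print(f"{name}: t = {t_value:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")

# With a single predictor, the F-statistic equals the slope's t-value squared
slope_t_squared = results["GrLivArea"][0] ** 2
print(f"slope t squared = {slope_t_squared:.1f}")
```

The results land close to the table values, with small differences due to the rounded inputs and the normal approximation, and the squared slope t-value comes out near the reported F-statistic of 1471.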

From the graph, the sale price data appears to be skewed. In the R output below, the median residual is -1124 rather than zero (where we'd like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).
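One normality diagnostic is already sitting in the OLS summary: the Jarque-Bera statistic, which is built from the skew and kurtosis of the residuals. As a first step towards that residual check, this sketch recomputes it from the rounded values in the summary, so it only matches the reported 3432.287 approximately. The variable names are my own.

```python
# Residual diagnostics copied from the OLS summary above
n_obs = 1460
skew = 0.410
kurtosis = 10.467  # a normal distribution has kurtosis 3

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
excess_kurtosis = kurtosis - 3
jb = n_obs / 6 * (skew ** 2 + excess_kurtosis ** 2 / 4)

print(f"Jarque-Bera from rounded inputs: {jb:.1f}")
```

Almost all of the statistic comes from the kurtosis term, which suggests heavy tails in the residuals matter more here than asymmetry.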

Note the linear regression can also be done in R as follows:

```r
house = read.csv('train.csv')

house_model = lm(house$SalePrice ~ house$GrLivArea, house)
summary(house_model)

plot(house$GrLivArea, house$SalePrice)
hist(house$SalePrice)

shapiro.test(house$SalePrice)

## Plot using a qqplot
qqnorm(house$SalePrice)
qqline(house$SalePrice, col = 2)
```

```
Call:
lm(formula = house$SalePrice ~ house$GrLivArea, data = house)

Residuals:
    Min      1Q  Median      3Q     Max
-462999  -29800   -1124   21957  339832

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
house$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021,	Adjusted R-squared:  0.5018
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
```
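The fitted coefficients can be read as a simple prediction rule. The sketch below plugs a few living areas into the fitted line; `predict_sale_price` is a helper name I made up, and I am assuming `GrLivArea` is measured in square feet and `SalePrice` in dollars.

```python
# Coefficients copied from the R summary above
INTERCEPT = 18569.026
SLOPE = 107.130  # change in sale price per extra unit of living area


def predict_sale_price(gr_liv_area):
    """Point prediction from the fitted line: intercept + slope * area."""
    return INTERCEPT + SLOPE * gr_liv_area


for area in (1000, 1500, 2000):
    print(f"{area} sq ft -> {predict_sale_price(area):,.0f}")
```

These are point predictions only. With a residual standard error of 56070, any individual house can sit a long way from the line.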

To improve the performance of my model, I now need to look at multiple explanatory variables, which I will do in next week’s blog post.