Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and ground floor living area.
The data set, sample, procedure, and methods were detailed in week 1’s post.

import numpy
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# call in data set
data = pandas.read_csv('homes_train.csv')
print(data['SalePrice'].describe())

count      1460.00
mean     180921.20
std       79442.50
min       34900.00
25%      129975.00
50%      163000.00
75%      214000.00
max      755000.00
Name: SalePrice, dtype: float64
There is quite a sizeable difference between the mean and median – nearly 18,000, or just under 10% of our mean – which suggests the sale price distribution is right-skewed.
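As a quick check, that mean–median gap can be computed directly. This is just a sketch on a small made-up series standing in for the real SalePrice column, in case the Kaggle CSV isn't to hand:

```python
import pandas as pd

# Hypothetical, right-skewed prices standing in for the real SalePrice column
prices = pd.Series([100000, 120000, 130000, 150000, 160000, 700000])

# A mean well above the median is a classic sign of right skew:
# one expensive house drags the mean up but barely moves the median
gap = prices.mean() - prices.median()
print(gap)
```

On the real data the same two lines against `data['SalePrice']` give the roughly 18,000 gap seen in the `describe()` output above.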
So we can center the variables as follows:

# with_mean/with_std must be booleans, not the strings 'True'/'False'
# (any non-empty string is truthy, so 'False' would still standardise)
data['GrLivArea'] = preprocessing.scale(data['GrLivArea'], with_mean=True, with_std=False)
data['SalePrice'] = preprocessing.scale(data['SalePrice'], with_mean=True, with_std=False)
print(data['GrLivArea'].mean())
print(data['SalePrice'].mean())

# ensure the variables are in numeric format
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')

# view the centering
data['SalePrice'].hist()

# BASIC LINEAR REGRESSION
scat1 = seaborn.regplot(x="SalePrice", y="GrLivArea", scatter=True, data=data)
plt.xlabel('Sale Price')
plt.ylabel('Ground Living Area')
plt.title('Scatterplot for the Association Between Sale Price and Ground Living Area')
print(scat1)
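A minimal sketch of what `preprocessing.scale` does with these flags, on toy numbers rather than the real columns: centring subtracts the mean and leaves the spread untouched. Note the flags must be Python booleans – passing the strings `'True'`/`'False'` would both be truthy, so the data would be standardised as well as centred.

```python
import numpy as np
from sklearn import preprocessing

# Toy values standing in for GrLivArea; flags are booleans, not strings
x = np.array([1000.0, 1500.0, 2000.0, 4500.0])
centered = preprocessing.scale(x, with_mean=True, with_std=False)

print(centered.mean())            # (near) zero after centring
print(centered.std(), x.std())    # standard deviation is unchanged
```

With `with_std=True` (the default) the output would instead have unit variance, which would change the scale of the regression coefficients.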

print("OLS regression model for the association between sale price and ground living area")
reg1 = smf.ols('SalePrice ~ GrLivArea', data=data).fit()
print(reg1.summary())

OLS regression model for the association between sale price and ground living area
                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.502
Model:                            OLS   Adj. R-squared:                  0.502
Method:                 Least Squares   F-statistic:                     1471.
Date:                Mon, 03 Oct 2016   Prob (F-statistic):          4.52e-223
Time:                        00:13:00   Log-Likelihood:                -18035.
No. Observations:                1460   AIC:                         3.607e+04
Df Residuals:                    1458   BIC:                         3.608e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1.857e+04   4480.755      4.144      0.000     9779.612  2.74e+04
GrLivArea    107.1304      2.794     38.348      0.000      101.650   112.610
==============================================================================
Omnibus:                      261.166   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3432.287
Skew:                          -0.410   Prob(JB):                         0.00
Kurtosis:                      10.467   Cond. No.                     4.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Looking at the graphs and summary statistics, my hypothesis holds up better than I expected. Recall the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living area, and the alternative hypothesis (H1) was that there is a statistically significant linear relationship. Even though there are 79 explanatory variables available and I selected only one, both my R-squared and adjusted R-squared come out at 0.502 – a little over 50% of the variance in sale price is explained by a single explanatory variable.
My p-value of 4.52e-223 is far less than 0.05, so the model provides significant evidence of a linear relationship between sale price and ground floor living area; I can reject my null hypothesis and accept the alternative hypothesis. Both the intercept (p-value = 3.61e-05) and the ground floor living area coefficient (p-value < 2e-16) contribute to the significance – both p-values are 0.000 to three decimal places, and both t values are greater than zero, so the linear relationship is positive.
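That reject/fail-to-reject decision can also be read off programmatically from the fitted results object rather than from the printed summary. This sketch refits on synthetic data with a built-in linear relationship – the column names mirror the real ones, but the numbers are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the housing data: price = 20000 + 100 * area + noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, 200)
price = 20000 + 100 * area + rng.normal(0, 20000, 200)
df = pd.DataFrame({'SalePrice': price, 'GrLivArea': area})

fit = smf.ols('SalePrice ~ GrLivArea', data=df).fit()

# Reject H0 at the 5% level when the slope's p-value is below 0.05
print(fit.pvalues['GrLivArea'] < 0.05)
print(fit.params['GrLivArea'])  # slope estimate, close to the true 100
```

The same attributes (`pvalues`, `params`, plus `conf_int()` and `rsquared`) work on `reg1` fitted to the real data above.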
From the graph the sale price data appears to be skewed – the median of the residuals sits at -1124 rather than at zero (where we'd like it to be) – which is why the data was centered.
I realise I still need to examine the residuals and test for normality (normal or lognormal distribution).
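A sketch of that normality check in Python (R's `shapiro.test` has `scipy.stats.shapiro` as its counterpart). The data here is synthetic and log-normal by construction, standing in for skewed, price-like values:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, price-like data (log-normal by construction)
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=500)

# Shapiro-Wilk: a small p-value means we reject normality
w_raw, p_raw = stats.shapiro(prices)
w_log, p_log = stats.shapiro(np.log(prices))

print(p_raw)  # tiny: the raw skewed values are not normal
print(p_log)  # much larger: the log transform brings them towards normality
```

Run against `reg1.resid` (the model residuals), the same test indicates whether the normality assumption behind the OLS p-values is plausible.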
Note the linear regression can also be done in R as follows:

house = read.csv('train.csv')
house_model = lm(house$SalePrice ~ house$GrLivArea, house)
summary(house_model)
plot(house$GrLivArea, house$SalePrice)
hist(house$SalePrice)
shapiro.test(house$SalePrice)
## Plot using a qqplot
qqnorm(house$SalePrice)
qqline(house$SalePrice, col = 2)

Call:
lm(formula = house$SalePrice ~ house$GrLivArea, data = house)

Residuals:
    Min      1Q  Median      3Q     Max
-462999  -29800   -1124   21957  339832

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
house$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
To improve the performance of my model I now need to look at bringing in multiple explanatory variables, which I will do in next week's blog post.