Wesleyan’s Regression Modeling in Practice – Week 2

Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area, my primary explanatory variable. The data set has 80 variables and 1460 observations, but my hypothesis involves only these two: that there is a linear relationship between house sale price and ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable difference between the mean and median: almost 17000, or just under 10% of our mean.
So we can center the variables as follows:
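
The original code for this step is not shown here, so what follows is only a minimal sketch of mean-centering in plain Python. The prices are made-up illustration values, not figures from the Ames data, and the variable names are my own.

```python
from statistics import mean, median

# Made-up sale prices for illustration only (not values from the Ames data).
sale_price = [120000, 135000, 150000, 180000, 415000]

# A mean well above the median hints at a right-skewed variable.
gap = mean(sale_price) - median(sale_price)

# Centering subtracts the mean, so the centered variable averages zero.
price_mean = mean(sale_price)
sale_price_centered = [p - price_mean for p in sale_price]

print(gap)
print(mean(sale_price_centered))
```

Centering only shifts the values so that their mean is zero. It leaves the shape of the distribution, including any skew, unchanged.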


Looking at the graphs and summary statistics, my hypothesis is supported better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space. The alternative hypothesis (H1) was that there is a statistically significant relationship. There are 79 explanatory variables and I selected only one to explain the response variable, yet both my R-squared and adjusted R-squared are at .502, so a little over 50% of the variance in sale price is explained by just one explanatory variable.
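
To make the R-squared figures concrete, here is a rough sketch of how R-squared and adjusted R-squared are computed for a one-predictor model. The function names and the toy data are assumptions for illustration; this is not the code that produced the .502 above.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(x, y):
    """Return (R-squared, adjusted R-squared) for a one-predictor model."""
    intercept, slope = fit_line(x, y)
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    n, k = len(y), 1  # k = number of explanatory variables
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

# Made-up living areas (sq ft) and prices, for illustration only.
area = [900, 1200, 1500, 1800, 2400]
price = [110000, 160000, 150000, 210000, 260000]
r2, adj_r2 = r_squared(area, price)
print(round(r2, 3), round(adj_r2, 3))
```

With one explanatory variable and 1460 observations, the adjustment factor (n - 1) / (n - k - 1) is 1459 / 1458, which is why R-squared and adjusted R-squared agree to three decimal places.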

My p-value of 4.52e-223 is far below .05, so the linear relationship between sale price and ground floor living area is statistically significant. I can therefore reject my null hypothesis in favour of the alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space (p-value = 2e-16) are significant, with both p-values 0.000 to 3 decimal places. The t value for ground floor living space is greater than zero, which means its coefficient is positive, so it is a positive linear relationship.
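
As a rough sketch of where the t value comes from: it is the estimated slope divided by its standard error, so it always has the same sign as the slope. The function name and the toy data below are assumptions for illustration. The p-value itself comes from comparing t against a t distribution with n - 2 degrees of freedom, which is left to a statistics package here.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y):
    """Return (slope, standard error, t) for simple least-squares regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope, using n - 2 residual degrees of freedom.
    se = sqrt(ss_res / (n - 2) / sxx)
    return slope, se, slope / se

# Made-up living areas (sq ft) and prices, for illustration only.
area = [900, 1200, 1500, 1800, 2400]
price = [110000, 160000, 150000, 210000, 260000]
slope, se, t = slope_t_statistic(area, price)
print(slope > 0, t > 0)
```

A large positive t says the slope is both positive and large relative to its uncertainty. The direction of the relationship is read from the sign of the slope.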

From the graph, the sale price data appears to be skewed: the mean is -1124 from zero (where we’d like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).
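
One quick, informal way to start on this is to compare the skewness of a variable before and after a log transform. The sketch below is only an assumption about how that check could look: the `skewness` helper is my own, the prices are made up, and the same function could be applied to model residuals. It is not a formal normality test.

```python
from math import log
from statistics import mean

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed prices, for illustration only.
prices = [120000, 135000, 150000, 180000, 415000]

raw_skew = skewness(prices)
log_skew = skewness([log(p) for p in prices])

# Skewness near zero is consistent with a symmetric distribution;
# a smaller value after taking logs suggests the log scale fits better.
print(raw_skew > log_skew)
```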

Note the linear regression can also be done in R as follows:


To improve the performance of my model, I now need to look at multiple explanatory variables, which will be done in next week’s blog post.

Wesleyan’s Regression Modeling in Practice – Week 1

For Wesleyan’s Regression Modeling in Practice week 1 assignment I am required to write up the sample, procedure, and measures sections of a classical research paper. I’ve been trying to decide recently whether to move house or not: stay in the current house, sell the current house, move to another house, stay in the same area, move areas. So many decisions, so much choice, so I want to do some regression modeling to help me with this decision. On kaggle.com I found an interesting problem and decided to write it up as my research data set for this assignment: House Prices: Advanced Regression Techniques.


The sample is taken from data the Ames Assessor’s Office uses in computing assessed values for individual residential properties sold in Ames, Iowa from 2006 to 2010. Participants (N=2930) represented individual residential property sales in the Ames area.
The data analytic sample for this study included participants who had sold an individual residential property. Also, if a home was sold multiple times in the 5 year period, only the most recent sale was included (N=1,320).


Data were collected by trained Ames Assessor’s Office Representatives during 2006–2010 through computer-assisted personal interviews (CAPI). At the selling time of the house one party involved in the sale of the property would be contacted and the required variables were submitted by way of questions in an interview in respondents’ homes following informed consent procedures.


The house sale price was assessed using 79 variables, including the type of dwelling involved in the sale (16 different types of dwellings were found) and the zoning of the house, with its 8 types of zones. 20 continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set. Area measurements on the basement, main living area, and even porches are broken down into individual categories based on quality and type. 14 discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home. Additionally, the garage capacity and construction/remodeling dates are also recorded. There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes, with the smallest being STREET (gravel or paved) and the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions, while the ordinal variables typically rate various items within the property.
Dependent Variable: Sale Price (the price the house sold for).
Independent Variable:


Kaggle’s House Prices: Advanced Regression Techniques
Ames Assessor’s Original Publication
Data Documentation