Running an Analysis of Variance

Carrying on from the Hypothesis developed in Developing a Research Question I am trying to ascertain if there is a statistically significant relationship between the location and the sales price of a house in Ames Iowa. I have chosen to explore this in python. The tools used are pandas, numpy, and statsmodels.

Load in the data set and ensure the interested in variables are converted to numbers or categories where necessary. I decide to use ANOVA (Analysis of Variance) to test and TukeyHSD (Tukey Honest Significant Difference) for post-hoc testing my data set and my hypothesis.

This tells us that there are 25 neighbourhoods in the dataset.

We can create our ANOVA model with the smf.ols function and we will tilda SalePrice (dependent variable) with Neighborhood (independent variable) to build our model. We can then get the model fit using the fit function on the model and use the summary function to get our F-statistic and associated p value which we hope will be less than 0.05 so that we can reject our null hypothesis that there is no significant association between neighbourhood and sale price, therefore we can accept our alternate hypothesis that there is a significant relationship.

We get the output below which tells us that for 1460 observations with an F-statistic of 71.78 the p-value is 1.56e-225 meaning that the chance of this happening by chance is very very very small – 224 zero after the decimal point followed by 156, so we can safely reject the null hypothesis and accept the alternative hypothesis. Our adjusted R-squared is also .538 so our model is giving up a nearly 54% value for accuracy in including more than half of our training samples correctly. So our alternative hypothesis is that there IS a significant relationship between sale price and location (neighbourhood).

We know there is a significant relationship between neighbourhood and sale price but we don’t know which neighbourhood – remember we have 25 of these that can be different from eachother. So we must do some post-hoc testing. I will use the tukey hsd for this investigation

We can check the reject column below to see if we should reject any variations between neighbourhoods – but with 25 neighbourhoods, there are 25*24/2  = 300 relationships to check so there is a lot of output. Note we can output a box-plot to help visualise this too – see below the data for this output.

To visualise this we can use the pandas boxplot function although we probably have to tidy up the indices on the neighborhood (x) axis: