Running an Analysis of Variance

Carrying on from the Hypothesis developed in Developing a Research Question I am trying to ascertain if there is a statistically significant relationship between the location and the sales price of a house in Ames Iowa. I have chosen to explore this in python. The tools used are pandas, numpy, and statsmodels.

Load in the data set and ensure the interested in variables are converted to numbers or categories where necessary. I decide to use ANOVA (Analysis of Variance) to test and TukeyHSD (Tukey Honest Significant Difference) for post-hoc testing my data set and my hypothesis.

This tells us that there are 25 neighbourhoods in the dataset.

We can create our ANOVA model with the smf.ols function and we will tilda SalePrice (dependent variable) with Neighborhood (independent variable) to build our model. We can then get the model fit using the fit function on the model and use the summary function to get our F-statistic and associated p value which we hope will be less than 0.05 so that we can reject our null hypothesis that there is no significant association between neighbourhood and sale price, therefore we can accept our alternate hypothesis that there is a significant relationship.

We get the output below which tells us that for 1460 observations with an F-statistic of 71.78 the p-value is 1.56e-225 meaning that the chance of this happening by chance is very very very small – 224 zero after the decimal point followed by 156, so we can safely reject the null hypothesis and accept the alternative hypothesis. Our adjusted R-squared is also .538 so our model is giving up a nearly 54% value for accuracy in including more than half of our training samples correctly. So our alternative hypothesis is that there IS a significant relationship between sale price and location (neighbourhood).

We know there is a significant relationship between neighbourhood and sale price but we don’t know which neighbourhood – remember we have 25 of these that can be different from eachother. So we must do some post-hoc testing. I will use the tukey hsd for this investigation

We can check the reject column below to see if we should reject any variations between neighbourhoods – but with 25 neighbourhoods, there are 25*24/2  = 300 relationships to check so there is a lot of output. Note we can output a box-plot to help visualise this too – see below the data for this output.

To visualise this we can use the pandas boxplot function although we probably have to tidy up the indices on the neighborhood (x) axis:

box_plot

Wesleyan’s Regression Modeling in Practice – Week 2

Continuing on with the Kaggle data set from House Prices: Advanced Regression Techniques I plan to make a very simple linear regression model to see if house sale price (response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and the ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable differece between the mean and median – almost 17000, or just under 10% of our mean.
So we can center the variables as follows:

sale_price_histogram_pythonsale_price_ground_living_area

Looking at the graphs and summary statistics my hypothesis seems to be explained better than I expected. Remember the null hypothesis (H0) was that there was no linear relationship between house sale price and ground floor living space. The alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable and yet both my R-squared and adjusted R-squared are at .502 (so a little over 50% of my dataset is explained with just one explanatory variable).

My p-value of 4.52e-223 is a lot less than .05 so there is significance that the model explains a linear regression between sale price and ground floor living area so I can reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. However both the intercept (p-value = 3.61e-05) and the ground floor living space (p-value = 2e-16) appear to be contributing to the significance – with both p-values 0.000 to 3 decimal places and both t values being greater than zero so it is a positive linear relationship.

From the graph the dataset appears to be skewed on the sale price data – the mean is -1124 from zero (where we’d like it to be) so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).

Note the linear regression can also be done in R as follows:

sale_price_histogramqqnorm_sale_price

To improve the performance of my model I now need to look at treating multiple explanatory variables which will be done in next week’s blog post.

Lab 4 – Python for Data Analytics

For any data scientist Python is a must, but Python alone will not go very far on its own. Pandas is the data analytics library that allows Python to deliver the functionality which comes out of the box in R.

Setting up Python & Pandas is now made very easy with Anaconda, and the running of Python can be made very intuitive with Jupyter Notebook.

Steps:

  • Download Python & Install. – python 2.7 was used here
  • Download Anaconda & Install
  • Open Command Prompt after installation
  • set PATH=%PATH%;c:\Python27;
  • conda –version
  • conda install pandas
  • conda install ipython
  • conda install pip
  • jupyter notebook

For more details, please see the full tutorial to install Pandas here.

A simple, intuitive, and powerful introduction to Pandas can be found here.

The graphics matplotlib library is discussed here.

Statistical analysis made easy in Python with SciPy and Pandas DataFrames.

5 Questions which can teach you Multiple Regressions (with R and Python).

Data files useful to run analysis on:

Iris Data

Parasite Data

Lab 4 – Amazon EMR

Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

EC2 (Elastic Compute Cloud) and S3 (Simple Secure Storage) will also be employed in this lab.

Elastic Map Reduce:

https://aws.amazon.com/elasticmapreduce

Getting Started Tutorial:

http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs.html