Johns Hopkins R Programming Course introduces SWIRL

When reviewing Coursera’s excellent Johns Hopkins R Programming course this evening delivered by Roger D. Peng it introduced me to swirl.

What I hear you ask is swirl? Well it is an interactive R tutorial which can be run from R or R Studio.

Previously I’ve been getting my students to run the Try R interactive tutorial from codeschool but I think from now on it will be swirl.

To enter into the swirl interactive R tutorial – open up R or R Studio and type the following and enjoy the hours spent practicing R:

I must say I am really impressed with this course and it only costs $43 per month – I managed to get through the first 3 weeks of lectures, quizzes, and practicals this evening. This is the second of nine courses in this specialisation. I am so impressed that I bought Professor Peng’s books, course notes, videos, datasets. If there were t-shirts, I would have bought one too.

Machine Learning for Data Analysis Course Passed

Today I passed Wesleyan University’s Machine Learning for Data Analysis Course on Coursera. This course was a great Python & SAS course and part 4 of their Data Analysis and Interpretation Specialisation. So only the Capstone project left for me to do. The lecturers Lisa Dierker and Jen Rose know their stuff and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of Iris’s based on their similarity of 4 variables that represented petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative variables. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the three cluster solutions.


The elbow curve was pretty conclusive, suggesting that there was a natural 3 cluster solutions that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2 shown below) indicated that the observations in clusters 1 and 2 the values were densely packed with relatively low within cluster variance, although they overlapped a little with the other clusters. Clusters 1 and 2 were generally distinct but were close to each other. Observations in cluster 0 were spread out more than the other clusters with no overlap to the other clusters (the Euclidean distance being quite large between this cluster and the other two), showing high within cluster variance. The results of this plot suggest that the best cluster solution would have 3 clusters.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.


We can see, the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principle component was based on petal length and petal width, and secondly sepal length and sepal width.

Machine Learning Foundations Course Passed

Yes today I passed my Machine Learning Certificate Course in Machine Learning Foundations a Case Study approach from the University of Washington on Coursera. This course was a great introduction to Graphlab and really fun to do the modules from all 6 weeks. Graphlab allowed me to do regression analysis, classification analysis, sentiment analysis and machine learning with easy to use apis. The lecturers Carlos Guestrin and Emily Fox were fantastically enthusiastic making the course really enjoyable to do. I look forward to rolling this knowledge into my lectures in DBS over the coming months. Hopefully I have the time to complete the Specialization and Capstone project on Coursera too in the coming months.

Running an Analysis of Variance

Carrying on from the Hypothesis developed in Developing a Research Question I am trying to ascertain if there is a statistically significant relationship between the location and the sales price of a house in Ames Iowa. I have chosen to explore this in python. The tools used are pandas, numpy, and statsmodels.

Load in the data set and ensure the interested in variables are converted to numbers or categories where necessary. I decide to use ANOVA (Analysis of Variance) to test and TukeyHSD (Tukey Honest Significant Difference) for post-hoc testing my data set and my hypothesis.

This tells us that there are 25 neighbourhoods in the dataset.

We can create our ANOVA model with the smf.ols function and we will tilda SalePrice (dependent variable) with Neighborhood (independent variable) to build our model. We can then get the model fit using the fit function on the model and use the summary function to get our F-statistic and associated p value which we hope will be less than 0.05 so that we can reject our null hypothesis that there is no significant association between neighbourhood and sale price, therefore we can accept our alternate hypothesis that there is a significant relationship.

We get the output below which tells us that for 1460 observations with an F-statistic of 71.78 the p-value is 1.56e-225 meaning that the chance of this happening by chance is very very very small – 224 zero after the decimal point followed by 156, so we can safely reject the null hypothesis and accept the alternative hypothesis. Our adjusted R-squared is also .538 so our model is giving up a nearly 54% value for accuracy in including more than half of our training samples correctly. So our alternative hypothesis is that there IS a significant relationship between sale price and location (neighbourhood).

We know there is a significant relationship between neighbourhood and sale price but we don’t know which neighbourhood – remember we have 25 of these that can be different from eachother. So we must do some post-hoc testing. I will use the tukey hsd for this investigation

We can check the reject column below to see if we should reject any variations between neighbourhoods – but with 25 neighbourhoods, there are 25*24/2  = 300 relationships to check so there is a lot of output. Note we can output a box-plot to help visualise this too – see below the data for this output.

To visualise this we can use the pandas boxplot function although we probably have to tidy up the indices on the neighborhood (x) axis:


Developing a Research Question

While trying to buy a house in Dublin I realised I had no way of knowing if I was paying a fair price for a house, if I was getting it for a great price, or if I was over-paying for the house. The data scientist in me would like to develop an algorithm, a hypothesis, a research question, so that my decisions are based on sound science and not on gut instinct. So for the last couple of weeks I have been developing algorithms to determine this fair price value. So my research questions is:
Is house sales price associated with socio-economic location?

I stumbled upon similar research by Dean DeCock from 2009 in his research determining the house price for Ames Iowa. So that is the data set that I will use. See the Kaggle page House Prices Advanced Regression Techniques to get the data.

I would like to study the association between the neighborhood (location) and the house price, to determine does location influence the sale price and is the difference in means between different locations significant.

This dataset has 79 independent variables with sale price being the dependent variable. Initially I am only focusing on one independent variable – the neighborhood, so I can reduce the dataset variables down to two, to simplify the computation my analysis of variance needs to perform.

Now that I have determined I am going to study location, I decide that I might further want to look at the bands of house size, not just the house size (square footage), but if I can turn those into categories of square footage, less than 1000, between 1000 and 1250 square feet, 1250 to 1500, > 1500 to see if there is a variance in the mean among these categories.

I can now take the above ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage for first floor, second floor, basement etc…

I then search google scholar, kaggle, dbs library for previous study in these areas, finding: a paper from 2001 discussing previous research in Dublin, however it was done in 2001 when a bubble was about to begin, and a big property crash in 2008 that was not conceived.
Secondly Dean De Cock’s research on house prices in Iowa

Based on my literature review I believe that there might be a statistically significant association between house location (neighborhood) and sales price. Secondary I believe there will be a statistically significant association between size bands (square footage band) and sales price. I further believe that might be an interaction effect between location & square footage bands and sales price which I would like to investigate too.

So I have developed three null hypotheses:
* There is NO association between location and sales price
* There is NO association between bands of square footage and sales price
* There is NO interaction effect in association between location, bands of square footage and sales price.

Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify a subset of variables from a pool of 79 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring Ames Iowa house sale price. Categorical predictors included house type, neighbourhood, and zoning type to improve interpretability of the selected model with fewer predictors. Quantitative predictor variables include lot area, above ground living area, first floor area, second floor area. Scale were used for measuring number of bathrooms, number of bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step:


Of the 33 predictor variables, 13 were retained in the selected model. During the estimation process, overall quality, above ground floor space, and garage cars being the main 3 variables. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better at 81% on the test set of data.

Wesleyan’s Regression Modeling in Practice – Week 2

Continuing on with the Kaggle data set from House Prices: Advanced Regression Techniques I plan to make a very simple linear regression model to see if house sale price (response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and the ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable differece between the mean and median – almost 17000, or just under 10% of our mean.
So we can center the variables as follows:


Looking at the graphs and summary statistics my hypothesis seems to be explained better than I expected. Remember the null hypothesis (H0) was that there was no linear relationship between house sale price and ground floor living space. The alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable and yet both my R-squared and adjusted R-squared are at .502 (so a little over 50% of my dataset is explained with just one explanatory variable).

My p-value of 4.52e-223 is a lot less than .05 so there is significance that the model explains a linear regression between sale price and ground floor living area so I can reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. However both the intercept (p-value = 3.61e-05) and the ground floor living space (p-value = 2e-16) appear to be contributing to the significance – with both p-values 0.000 to 3 decimal places and both t values being greater than zero so it is a positive linear relationship.

From the graph the dataset appears to be skewed on the sale price data – the mean is -1124 from zero (where we’d like it to be) so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).

Note the linear regression can also be done in R as follows:


To improve the performance of my model I now need to look at treating multiple explanatory variables which will be done in next week’s blog post.

Wesleyan’s Machine Learning for Data Analysis Week 2


Week 2’s assignment for this machine learning for data analytics course delivered by Wesleyan University, Hartford Connecticut Area in conjunction with Coursera was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher’s Iris data set comprising of 3 different types of irises’ (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width.

Using Spyder IDE via Anaconda Navigator and then began to import the necessary python libraries:

Now load our Iris dataset of 150 rows of 5 variables:

Now we begin our modelling and prediction. We define our predictors and target as follows:

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the random forest classifier class to do this.

Finally we make our predictions on our test data set and verify the accuracy.

Next we figure out the relative importance of each of the attributes:
# fit an Extra Trees model to the data

Finally displaying the performance of the random forest was achieved with the following:

And the plot success was output:


Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary or categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating the type of Iris based on petal width, petal length, sepal width, sepal length.

The explanatory variables with the highest relative importance scores were petal width (42.8%), petal length (40.9%), sepal length (9.6%), and finally sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees rather than a single tree, adding little to the overall accuracy of the model, and suggesting that interpretation of a single decision tree may be appropriate.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

Wesleyan’s Regression Modeling in Practice – Week 1

For Wesleyan’s Regression Modeling in Practice week 1 assignment I am required to write up the sample, the procedure, and the measures section of a classical research paper. I’ve been trying to decide recently whether to move house or not, stay in the current house, sell the current house, move to another house, stay in the same area, move areas. So many decisions, so much choice so I want to do some regression modeling to help me with this decision. From I found an interesting problem and decided to write this up as my research data set for this assignment – House Prices: Advanced Regression Techniques.


The sample is taken from the Ames Assessor’s Office computing assessed value for individual residential properties sold in Ames, Iowa from 2006 to 2010. Participants (N=2930) represented individual residential property sales in the Ames area.
The data analytic sample for this study included participants who had sold an individual residential property. Also if a home was sold multiple times in the 5 year period only the most recent property sale was included. (N=1,320).


Data were collected by trained Ames Assessor’s Office Representatives during 2006–2010 through computer-assisted personal interviews (CAPI). At the selling time of the house one party involved in the sale of the property would be contacted and the required variables were submitted by way of questions in an interview in respondents’ homes following informed consent procedures.


The house sale price was assessed using 79 variables based on the type of dwelling involved in the sale (16 different types of dwellings were found). The zoning of the house with its 8 types of zones. 20 continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set. Area measurements on the basement, main living area, and even porches are broken down into individual categories based on quality and type. 14 discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home. Additionally, the garage capacity and construction/remodeling dates are also recorded. There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes with the smallest being STREET (gravel or paved) and
the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions while the ordinal variables typically rate various items within the property.
Dependant Variable: Sale Price – the price the house sold for.
Independant Variable:


Kaggle’s House Prices: Advanced Regression Techniques
Ames Assessor’s Original Publication
Data Documentation