YouTube Video of Lecture: https://www.youtube.com/watch?v=VXyRfgnzL2o&list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Homework Chapter 3: https://docs.google.com/forms/d/e/1FAIpQLSddZI25daO1htr5iT5idAfVTJALs23CWNh_4aDAmcEW-oZQHw/viewform

YouTube Video of Lecture: https://www.youtube.com/watch?v=IXXHH6ztsSA&list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Homework Chapter 2: https://elearning.dbs.ie/mod/url/view.php?id=433560

YouTube Video of Lecture: https://www.youtube.com/watch?v=G721cooZXgs&list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ

Install Python: http://www.pythonlearn.com/install.php

Homework Chapter 1: https://elearning.dbs.ie/mod/url/view.php?id=433552

Based on the old adage about judging a book by its cover: is it possible to judge a movie by its online presence?

This hackathon idea is hosted on Kaggle and made available by Chuan Sun to allow budding data scientists to test out their ideas. See https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset for more information.

This morning the DBS Hackathon group will investigate this dataset.

People have been taking part in the DBS Hackathons for over a year now and, if nothing else, they appear to be popular. I have also fully embraced swirl (interactive R learning) and built it into my students’ continuous assessments. As a bonus, once the data is collected it provides material for some analytics. I promised to make the data available in an anonymised fashion, so below are 514 submissions made by students over the last month or so.

The dataset provided shows the course completed, an anonymised unique id for the student, an anonymised email address, the date and time of completion, and whether the student was male or female.
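As a quick illustration, submissions per course could be tallied from such a file with a few lines of Python. The column names and sample rows below are made up to match the description above, not the real anonymised export:

```python
import csv
import io
from collections import Counter

# Hypothetical rows matching the columns described above:
# course, anonymised student id, anonymised email, completion timestamp, gender.
sample = """course,student_id,email,completed_at,gender
R_ProgrammingDR,s001,a1@anon.ie,2016-09-01 10:00,F
DBS_Hackathons,s002,b2@anon.ie,2016-09-02 11:30,M
R_ProgrammingDR,s003,c3@anon.ie,2016-09-03 09:15,F
"""

def submissions_per_course(csv_text):
    """Count completed submissions per course."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["course"] for row in reader)

counts = submissions_per_course(sample)
print(counts["R_ProgrammingDR"])  # 2
```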



Swirl really does make the DBS Hackathons and Lectures Practical.

Go to RStudio and run the following 7 lines of R code.

if (!require("swirl")) install.packages("swirl")
library(swirl)
install_course_github('darrenredmond', 'R_ProgrammingDR')
install_course_github('darrenredmond', 'DBS_Data_Analysis')
install_course_github('darrenredmond', 'DBS_Hackathons')
install_course_github('darrenredmond', 'Data_Mining_DR')
swirl()

Welcome to the world of interactive learning via R for Data Analytics, Data Mining, and R Programming.

There are currently 27 tutorials in the 4 modules that you can claim credit against my course in Dublin Business School – or you can just do them for fun.

Each module is interactive, practical based, and will teach you the subject with some great examples. To date 220 tutorials have been completed and credit claimed for them.

Enjoy!

While reviewing Coursera’s excellent Johns Hopkins R Programming course (delivered by Roger D. Peng) this evening, I was introduced to swirl.

What, I hear you ask, is swirl? It is an interactive R tutorial which can be run from R or RStudio.

Previously I’ve been getting my students to run the Try R interactive tutorial from Code School, but I think from now on it will be swirl.

To enter the swirl interactive R tutorial, open up R or RStudio, type the following, and enjoy the hours spent practicing R:

if (!require("swirl")) install.packages("swirl")
packageVersion("swirl")
library(swirl)
install_from_swirl("R Programming")
swirl()

I must say I am really impressed with this course, and it only costs $43 per month – I managed to get through the first 3 weeks of lectures, quizzes, and practicals this evening. This is the second of nine courses in the specialisation. I am so impressed that I bought Professor Peng’s books, course notes, videos, and datasets. If there were t-shirts, I would have bought one too.

Fivethirtyeight.com have released to Kaggle a dataset called Uber Pickups in New York City, which they obtained through a series of freedom of information requests to the New York Taxi Commission (NYTC). They want us Kagglers to investigate the data, and one Kaggler, Rob Harrand, came up with a kernel called Uber-Duper animation. During our DBS Analytics meetup yesterday every single pun on Uber was used, and we looked at and used a few of the kernels, with some ideas on how we could improve on them.

The animation can be generated with the following R code.

# really nice way to install packages if they are not installed, using require.
if (!require('ggplot2')) install.packages('ggplot2')
if (!require('readr')) install.packages('readr')
if (!require('animation')) install.packages('animation')
# Data visualization
library(ggplot2)
library(readr)
# Animation
library(animation)
# set the working directory - change to your directory.
setwd("~/dev/r/uber")
# Input data files are available in the "input/" directory.
uber = read.csv('input/uber-raw-data-sep14.csv', stringsAsFactors = F)
# create a new column for date.
uber$Date = sapply(strsplit(uber$Date.Time, split = " "), function(x) x[[1]][1])
uber$Date = as.Date(uber$Date, format = "%m/%d/%Y")
uber = uber[order(uber$Date),]
# used to size the initial canvas
min_long = min(uber$Lon)
max_long = max(uber$Lon)
min_lat = min(uber$Lat)
max_lat = max(uber$Lat)
l = length(uber$Date.Time)
i = 1
# create an animated gif - taking the data 25000 rows at a time
# for each frame of the gif - and include the day in each frame.
saveGIF(while (i <= l) {
  print(m <- ggplot(data = uber[1:i,], aes(Lon, Lat)) +
    geom_point(size = 0.06, color = "white", alpha = 0.2) +
    scale_x_continuous(limits = c(min_long, max_long)) +
    scale_y_continuous(limits = c(min_lat, max_lat)) +
    theme(panel.background = element_rect(fill = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank()) +
    annotate("text", x = -73.0, y = 41.2, label = uber$Date[i],
             colour = 'white', size = 8))
  i = i + 25000
}, movie.name = "uber.gif", interval = 0.1, convert = "convert",
   ani.width = 800, ani.height = 800)

Note that ImageMagick must be installed on your computer for this to work.

To install it on a Mac:

brew install imagemagick

On Windows, download ImageMagick and make sure it is added to the path.

As a programmer I immediately realise that this can be improved. The animated gif only covers one month of data from 2014, but there are six months of data, so loading in the 6 datasets and binding them into one will let me create an animation spanning 6 months – and I could change the colours as the months change. I can also give the x and y axes proper names for Longitude and Latitude. I notice that the loop is stepping through the data 25000 points at a time, so out of 1 million data points it generates 40 images and merges them into 1 animated gif. When all six datasets are merged there are in excess of 4.5 million observations – that would be 180 images merged into 1 gif file – and since the 40-image gif was almost 2mb, this could be a 9mb file. Perhaps I can step by 250000 instead. So I parameterise this offset and convert the saveGIF call into a “generate uber plot” function, with parameterised colours for the months and an offset to change the number of frames in the animation – thus creating the following animation.

Twice in the last week I have been at conferences or awards ceremonies and had to listen to speakers state that ‘Data is the new oil’, then stand back and expect the audience to look at them in awe as if they were Einstein putting forth the Theory of Relativity, or Archimedes shouting ‘Eureka’ (I realise the latter may not have happened the way the myth tells it).

Please – this is nothing new. It is 10 years since Clive Humby of Tesco Clubcard fame wrote a paper describing this term in Data is the New Oil (DITNO). It is 16 years since Gartner’s Doug Laney developed his theory on the 3 V’s of Big Data in a paper called **3-D Data Management: Controlling Data Volume, Velocity and Variety**. Six years ago David McCandless of Information is Beautiful fame, in a TED talk, at least had the decency to refer to DITNO and expand on the theory with his thoughts on Data is the New Soil.

So please: DITNO is 10 years old, and although it *is* more important today than it ever was, stop putting this theory forward as if it were groundbreaking. Move on – develop your own theory. Don’t just agree with the 3 V’s; do some research and, like David above, create your own DITNS, or, as one of my students Svetlana did, critique the 14 V’s of Big Data in your own excellent research and paper.

Enlighten me! I will have over 100 students this year telling me what Big Data is and isn’t – so make the paper interesting; make me think ‘this person really gets it’. I will not name and shame the two speakers who used the clichés in their talks. Perhaps it was news to the other delegates and it was just me, bored by the staleness of their talks. I hope my students don’t think that about my lectures – time to freshen up my material.

I should be in bed, I know – I have a lot of work tomorrow – and I am not one to contradict the great Nate Silver and fivethirtyeight.com, but has Trump just won this US presidential election?

It is 3:10am (10:10 EST) and as I look at the numbers, does Trump have enough electoral votes? The TV has him at 150 – CBS News seems to have prematurely given Virginia’s 13 votes to Clinton – so it looks like 150 to 109 (although CBS have 122).

So Trump looks to have nailed Georgia (16), North Carolina (15), Michigan (16), and Ohio (18). That is 65, so he would only need 55 more. He is leading, though very narrowly, in Wisconsin (10), Arizona (11), and Florida (29). Winning those would leave him needing just 5 votes, and he is expected to win Alaska, Idaho, and Utah comfortably, giving him another 13.

Does that have him winning by 18? Getting Trump to 278 and Hillary to 260.
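The arithmetic above can be sanity-checked in a few lines of Python (state totals as reported in this post; Alaska, Idaho, and Utah carry 3, 4, and 6 electoral votes respectively):

```python
# Electoral-vote arithmetic from the projections above.
trump_on_tv = 150
called = {"Georgia": 16, "North Carolina": 15, "Michigan": 16, "Ohio": 18}
leading = {"Wisconsin": 10, "Arizona": 11, "Florida": 29}
expected = {"Alaska": 3, "Idaho": 4, "Utah": 6}

trump = trump_on_tv + sum(called.values()) + sum(leading.values()) + sum(expected.values())
clinton = 538 - trump  # 538 electoral votes in total
print(trump, clinton, trump - clinton)  # 278 260 18
```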

At what point does he not even need Florida?

I am not even including in these projections the 80 electoral votes of Nevada and the West Coast states of California, Oregon, and Washington.

The p-value helps us determine the significance of our results. A hypothesis test is used to test the validity of a claim being made about the overall population. The claim being tested is the null hypothesis; the claim we accept if the null hypothesis is concluded to be false is the alternative hypothesis.

Hypothesis tests use the p-value to weigh the strength of the evidence (what the data tells us about the overall population).

* A small *p*-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis and accept the alternative hypothesis – a statistically significant outcome.

* A large *p*-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis and reject the alternative hypothesis – no statistical significance.

* *p*-values very close to the cutoff (0.05) are considered marginal (could go either way) – further sampling should be performed where possible.

Where possible, another 19 samples could be tested in the hope that the p-value becomes clearly smaller than 0.05, or clearly bigger – at least then we could say that in 19 out of 20 cases (95% certainty) we reject, or fail to reject, the null hypothesis.

Always report the *p*-value so your audience can draw their own conclusions.

If you insist on an alpha different from 0.05 – say 0.01, i.e. 99% certainty – then you are enforcing a different cut-off value, and the rules above for 0.05 apply at 0.01 instead.

So you are making it harder to find a statistically significant result – in other words, you are saying that you need further proof before you will reject the null hypothesis and accept an alternative hypothesis.

Alternatively, if you set the alpha to 0.1 you make it easier to find a statistically significant result, and you may over-optimistically reject the null hypothesis because the p-value might be, say, 0.08.

Changing the alpha up or down like this makes it harder or easier to commit Type 1 errors (rejecting the null hypothesis when you should have failed to reject it – a false positive) and Type 2 errors (failing to reject the null hypothesis when you should have rejected it – a false negative).
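The alpha/Type 1 trade-off can be seen concretely with a small pure-Python simulation – a sketch, using a one-sided sign test on fair-coin data so that the null hypothesis is true by construction. The fraction of experiments that come out wrongly ‘significant’ tracks the chosen alpha: loosening alpha produces more false positives, tightening it produces fewer.

```python
import math
import random

def sign_test_p(successes, n):
    """One-sided binomial (sign-test) p-value: P(X >= successes) when p = 0.5."""
    return sum(math.comb(n, k) for k in range(successes, n + 1)) / 2 ** n

def type_i_rate(alpha, trials=2000, n=30, seed=42):
    """Fraction of experiments wrongly declared significant when the null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < 0.5 for _ in range(n))  # fair coin: null is true
        if sign_test_p(successes, n) <= alpha:
            hits += 1
    return hits / trials

# Looser alpha -> more false positives; stricter alpha -> fewer.
print(type_i_rate(0.10), type_i_rate(0.05), type_i_rate(0.01))
```

Because the sign test is discrete, the realised false-positive rates sit at or just below each alpha, but the ordering is clear.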

A simple example of why, even with 100% certainty among witnesses, we might still fail to reject the null hypothesis is a guy called Thomas, when he hears that Jesus has risen from the dead and appeared to the other 10 apostles. Out of the population of all possible apostles not named Thomas (Judas was dead by this stage) who could have seen a person rise from the dead, Thomas still doubted. Why? Because his null hypothesis was that people simply did not resurrect themselves from the dead (it had never happened before – and I don’t think it has happened since either), and unless he saw it with his own eyes he would never reject his null hypothesis; no amount of talk from the other 10 would make it statistically significant. Once Jesus appeared to him, he was able to reject his null hypothesis and accept the alternative hypothesis – that this was a statistically significant event and that Jesus had in fact risen from the dead.

Another way to look at it: 10 of the 11 apostles witnessed the event, giving 91% who saw and 9% (Thomas) who didn’t – a p-value of 0.09 – so at an alpha of 0.05, all 11 would have to see it for Thomas to believe: 11/11.

A less tongue-in-cheek example, adapted from the web, might look like this. The Apache Pizza pizza place claims their delivery times are 30 minutes or less on average, but you think they’re more than that. You conduct a hypothesis test because you believe the null hypothesis, H_{o}, that the mean delivery time is at most 30 minutes, is incorrect. Your alternative hypothesis (H_{a}) is that the mean time is greater than 30 minutes. You randomly sample some delivery times and run the data through the hypothesis test, and your p-value turns out to be 0.001, which is much less than 0.05. In real terms, there is a probability of 0.001 that you will mistakenly reject the pizza place’s claim that their delivery time is less than or equal to 30 minutes. Since we are typically willing to reject the null hypothesis when this probability is less than 0.05, you conclude that the pizza place is wrong: their delivery times are in fact more than 30 minutes on average, and you want to know what they’re gonna do about it! (Of course, you could be wrong, by having sampled an unusually high number of late pizza deliveries just by chance.)
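A rough sketch of such a one-sided test in Python: the delivery times below are made up for illustration, and instead of computing an exact p-value we compare the t-statistic against the hard-coded one-sided 5% critical value for 9 degrees of freedom (1.833, from standard t-tables).

```python
import math

# Hypothetical sample of 10 delivery times (minutes) -- invented for illustration.
times = [31.2, 34.5, 29.8, 36.1, 33.0, 35.4, 30.9, 37.2, 32.6, 34.8]
mu0 = 30.0  # the pizza place's claimed mean

n = len(times)
mean = sum(times) / n
# sample standard deviation (n - 1 in the denominator)
sd = math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))
t_stat = (mean - mu0) / (sd / math.sqrt(n))

# One-sided critical value t(0.95, df=9) = 1.833 from standard tables.
reject = t_stat > 1.833
print(round(t_stat, 2), reject)
```

Here the t-statistic is well past the critical value, so we would reject the claim that deliveries average 30 minutes or less.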

Being an Irish child born hours after Halloween (All Hallows’ Eve) means being born on All Hallows’ Day – or, as we call it now that we are no longer pagan in Ireland, All Saints’ Day. Oíche Shamhna, as we say in Gaelic, and it is the Gaelic word Samhain (November) which gives English the word samhainophobia – the morbid fear of Halloween. Topically for this time of year, Kaggle have created a lovely problem to help sufferers of samhainophobia spot a ghost, a ghoul, or a goblin.

Perfect for budding new R students to practice some data analytics in R.

Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. It was a great Python & SAS course and part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.


A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative, and all were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the five cluster solutions.

The elbow curve was pretty conclusive, suggesting a natural 3-cluster solution that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2 below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although the two clusters overlapped a little: they were generally distinct but close to each other. Observations in cluster 0 were more spread out than the other clusters, showing high within-cluster variance, but with no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best solution has 3 clusters.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

We can see that the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was driven mainly by petal length and petal width, and secondarily by sepal length and sepal width.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

# the raw file has no header row, so name the columns ourselves
iris_data = pd.read_csv("iris_data.csv", header=None,
                        names=['sepal_length', 'sepal_width',
                               'petal_length', 'petal_width', 'species'])
iris_data_clean = iris_data.dropna()
iris_data_clean.describe()
iris_cluster = iris_data_clean[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']].copy()
# standardize each clustering variable to mean 0, sd 1
iris_cluster['sepal_length'] = preprocessing.scale(iris_cluster['sepal_length'].astype('float64'))
iris_cluster['sepal_width'] = preprocessing.scale(iris_cluster['sepal_width'].astype('float64'))
iris_cluster['petal_length'] = preprocessing.scale(iris_cluster['petal_length'].astype('float64'))
iris_cluster['petal_width'] = preprocessing.scale(iris_cluster['petal_width'].astype('float64'))
iris_train, iris_test = train_test_split(iris_cluster, test_size=.3, random_state=123)

# elbow curve: average distance to the nearest centroid for k = 1..5
clusters = range(1, 6)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(iris_train)
    iris_assign = model.predict(iris_train)
    meandist.append(sum(np.min(cdist(iris_train, model.cluster_centers_, 'euclidean'),
                               axis=1)) / iris_train.shape[0])
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret the 3-cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(iris_train)
iris_assign = model3.predict(iris_train)
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(iris_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

Yes – today I passed my Machine Learning certificate course, Machine Learning Foundations: A Case Study Approach, from the University of Washington on Coursera. This course was a great introduction to GraphLab, and the modules across all 6 weeks were really fun to do. GraphLab allowed me to do regression analysis, classification analysis, sentiment analysis, and machine learning with easy-to-use APIs. The lecturers, Carlos Guestrin and Emily Fox, were fantastically enthusiastic, making the course really enjoyable. I look forward to rolling this knowledge into my lectures in DBS over the coming months. Hopefully I will have the time to complete the Specialization and Capstone project on Coursera too.


Today in the hackathon you can practice and learn some Excel, R, Python, and Fusion Tables to perform data manipulation, data analysis, and graphics.

In R, to set your working directory, use the function setwd(); in Python, use the os.chdir function to achieve the same.
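For example, in Python (using a throwaway temporary directory so the sketch is self-contained):

```python
import os
import tempfile

# os.chdir is Python's equivalent of R's setwd(); os.getcwd() mirrors getwd().
target = tempfile.mkdtemp()  # a throwaway directory just for this demo
os.chdir(target)
print(os.getcwd())
```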

Part A

Attempt Some R Questions to practice using R.

Next we can practice reading data sets.

Attached are two files for US baby names in 1900 and 2000.

In the files you’ll see that each year is a comma-separated file with 3 columns – name, sex, and number of births.
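A minimal Python sketch of reading such a name/sex/births file – the filename and rows here are invented stand-ins for the real attachments, which may be named differently:

```python
import csv

# Write a tiny stand-in for one of the year files;
# columns are name, sex, number of births (no header row).
with open("yob2000_sample.csv", "w", newline="") as f:
    f.write("Emily,F,25953\nJacob,M,34465\nHannah,F,23073\n")

def top_names(path, k=2):
    """Return the k most common names in a name,sex,births file."""
    with open(path, newline="") as f:
        rows = [(name, sex, int(births)) for name, sex, births in csv.reader(f)]
    return [name for name, _, _ in sorted(rows, key=lambda r: -r[2])[:k]]

print(top_names("yob2000_sample.csv"))  # ['Jacob', 'Emily']
```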

Part B

Amazon best sellers 2014

Froud ships 1907

Carrying on from the hypothesis developed in Developing a Research Question, I am trying to ascertain whether there is a statistically significant relationship between location and the sale price of a house in Ames, Iowa. I have chosen to explore this in Python. The tools used are pandas, numpy, and statsmodels.

Load in the dataset and ensure the variables of interest are converted to numbers or categories where necessary. I decided to use ANOVA (Analysis of Variance) for the test and Tukey HSD (Tukey Honest Significant Difference) for post-hoc testing of my dataset and my hypothesis.

import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('ames_house_price.csv', low_memory=False)
# setting the variables we will be working with to numeric
# (convert_objects is deprecated in recent pandas; to_numeric does the same job)
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
ct1 = data.groupby('Neighborhood').size()
print(ct1)

This tells us that there are 25 neighbourhoods in the dataset.

We can create our ANOVA model with the smf.ols function, using the formula SalePrice ~ C(Neighborhood) to model SalePrice (the dependent variable) against Neighborhood (the independent variable). We then fit the model with the fit function and use the summary function to get our F-statistic and associated p-value, which we hope will be less than 0.05 so that we can reject our null hypothesis that there is no significant association between neighbourhood and sale price, and accept our alternative hypothesis that there is a significant relationship.

# using the ols function to calculate the F-statistic and associated p-value
model1 = smf.ols(formula='SalePrice ~ C(Neighborhood)', data=data)
results1 = model1.fit()
print(results1.summary())

The output below tells us that for 1460 observations, with an F-statistic of 71.78, the p-value is 1.56e-225, meaning the chance of this result arising by chance is vanishingly small (224 zeros after the decimal point before the 156), so we can safely reject the null hypothesis and accept the alternative hypothesis. Our adjusted R-squared is 0.538, so neighbourhood alone explains nearly 54% of the variance in sale price. Our alternative hypothesis, then, is that there **IS** a significant relationship between sale price and location (neighbourhood).

OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.546
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     71.78
Date:                Mon, 10 Oct 2016   Prob (F-statistic):          1.56e-225
Time:                        09:16:00   Log-Likelihood:                -17968.
No. Observations:                1460   AIC:                         3.599e+04
Df Residuals:                    1435   BIC:                         3.612e+04
Df Model:                          24
Covariance Type:            nonrobust
==============================================================================================
                                  coef    std err        t    P>|t|    [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Intercept                    1.949e+05   1.31e+04   14.879    0.000    1.69e+05   2.21e+05
C(Neighborhood)[T.Blueste]  -5.737e+04   4.04e+04   -1.421    0.155   -1.37e+05   2.18e+04
C(Neighborhood)[T.BrDale]   -9.038e+04   1.88e+04   -4.805    0.000   -1.27e+05  -5.35e+04
C(Neighborhood)[T.BrkSide]  -7.004e+04   1.49e+04   -4.703    0.000   -9.93e+04  -4.08e+04
C(Neighborhood)[T.ClearCr]   1.769e+04   1.66e+04    1.066    0.287   -1.49e+04   5.03e+04
C(Neighborhood)[T.CollgCr]   3094.8910   1.38e+04    0.224    0.823    -2.4e+04   3.02e+04
C(Neighborhood)[T.Crawfor]   1.575e+04   1.51e+04    1.042    0.298   -1.39e+04   4.54e+04
C(Neighborhood)[T.Edwards]  -6.665e+04   1.42e+04   -4.705    0.000   -9.44e+04  -3.89e+04
C(Neighborhood)[T.Gilbert]  -2016.3760   1.44e+04   -0.140    0.889   -3.03e+04   2.63e+04
C(Neighborhood)[T.IDOTRR]   -9.475e+04   1.58e+04   -5.988    0.000   -1.26e+05  -6.37e+04
C(Neighborhood)[T.MeadowV]  -9.629e+04   1.85e+04   -5.199    0.000   -1.33e+05     -6e+04
C(Neighborhood)[T.Mitchel]   -3.86e+04   1.52e+04   -2.540    0.011   -6.84e+04  -8784.735
C(Neighborhood)[T.NAmes]    -4.902e+04   1.36e+04   -3.609    0.000   -7.57e+04  -2.24e+04
C(Neighborhood)[T.NPkVill]  -5.218e+04   2.23e+04   -2.344    0.019   -9.58e+04  -8510.657
C(Neighborhood)[T.NWAmes]   -5820.8139   1.45e+04   -0.400    0.689   -3.43e+04   2.27e+04
C(Neighborhood)[T.NoRidge]   1.404e+05   1.56e+04    9.015    0.000     1.1e+05   1.71e+05
C(Neighborhood)[T.NridgHt]   1.214e+05   1.45e+04    8.390    0.000     9.3e+04    1.5e+05
C(Neighborhood)[T.OldTown]  -6.665e+04    1.4e+04   -4.744    0.000   -9.42e+04  -3.91e+04
C(Neighborhood)[T.SWISU]    -5.228e+04    1.7e+04   -3.080    0.002   -8.56e+04   -1.9e+04
C(Neighborhood)[T.Sawyer]   -5.808e+04   1.45e+04   -3.999    0.000   -8.66e+04  -2.96e+04
C(Neighborhood)[T.SawyerW]  -8315.0857   1.49e+04   -0.559    0.576   -3.75e+04   2.08e+04
C(Neighborhood)[T.Somerst]   3.051e+04   1.43e+04    2.129    0.033    2393.494   5.86e+04
C(Neighborhood)[T.StoneBr]   1.156e+05    1.7e+04    6.812    0.000    8.23e+04   1.49e+05
C(Neighborhood)[T.Timber]    4.738e+04   1.58e+04    3.007    0.003    1.65e+04   7.83e+04
C(Neighborhood)[T.Veenker]    4.39e+04   2.09e+04    2.101    0.036    2913.679   8.49e+04
==============================================================================
Omnibus:                      618.883   Durbin-Watson:                   1.956
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5526.438
Skew:                           1.737   Prob(JB):                         0.00
Kurtosis:                      11.875   Cond. No.                         48.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We know there is a significant relationship between neighbourhood and sale price, but we don’t know which neighbourhoods differ – remember we have 25 of them that can differ from each other. So we must do some post-hoc testing; I will use Tukey HSD for this investigation.

data_sub = data[['SalePrice', 'Neighborhood']].dropna()
print('means for sale price by neighbourhood')
m1 = data_sub.groupby('Neighborhood').mean()
print(m1)
print('standard deviations for sale price by neighbourhood')
sd1 = data_sub.groupby('Neighborhood').std()
print(sd1)
mc1 = multi.MultiComparison(data['SalePrice'], data['Neighborhood'])
res1 = mc1.tukeyhsd()
print(res1.summary())

We can check the reject column below to see which pairs of neighbourhoods differ significantly – but with 25 neighbourhoods there are 25*24/2 = 300 relationships to check, so there is a lot of output. Note that we can also output a box-plot to help visualise this – see below the data for this output.

means for sale price by neighbourhood
             SalePrice
Neighborhood
Blmngtn         194870
Blueste         137500
BrDale          104493
BrkSide         124834
ClearCr         212565
CollgCr         197965
Crawfor         210624
Edwards         128219
Gilbert         192854
IDOTRR          100123
MeadowV          98576
Mitchel         156270
NAmes           145847
NPkVill         142694
NWAmes          189050
NoRidge         335295
NridgHt         316270
OldTown         128225
SWISU           142591
Sawyer          136793
SawyerW         186555
Somerst         225379
StoneBr         310499
Timber          242247
Veenker         238772

standard deviations for sale price by neighbourhood
             SalePrice
Neighborhood
Blmngtn       30393.23
Blueste       19091.88
BrDale        14330.18
BrkSide       40348.69
ClearCr       50231.54
CollgCr       51403.67
Crawfor       68866.40
Edwards       43208.62
Gilbert       35986.78
IDOTRR        33376.71
MeadowV       23491.05
Mitchel       36486.63
NAmes         33075.35
NPkVill        9377.31
NWAmes        37172.22
NoRidge      121412.66
NridgHt       96392.54
OldTown       52650.58
SWISU         32622.92
Sawyer        22345.13
SawyerW       55652.00
Somerst       56177.56
StoneBr      112969.68
Timber        64845.65
Veenker       72369.32

Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================================
group1  group2   meandiff      lower         upper       reject
-------------------------------------------------------------
Blmngtn Blueste -57370.8824 -205327.1494 90585.3847 False
Blmngtn BrDale -90377.1324 -159316.6978 -21437.5669 True
Blmngtn BrkSide -70036.8306 -124623.7013 -15449.96 True
Blmngtn ClearCr 17694.5462 -43160.8093 78549.9018 False
Blmngtn CollgCr 3094.891 -47555.6594 53745.4414 False
Blmngtn Crawfor 15753.8431 -39675.653 71183.3393 False
Blmngtn Edwards -66651.1824 -118574.7463 -14727.6184 True
Blmngtn Gilbert -2016.376 -54933.1834 50900.4313 False
Blmngtn IDOTRR -94747.0986 -152739.031 -36755.1661 True
Blmngtn MeadowV -96294.4118 -164181.4029 -28407.4206 True
Blmngtn Mitchel -38600.7599 -94312.3419 17110.8221 False
Blmngtn NAmes -49023.8024 -98807.5959 759.9912 False
Blmngtn NPkVill -52176.4379 -133766.4471 29413.5713 False
Blmngtn NWAmes -5820.8139 -59121.3267 47479.699 False
Blmngtn NoRidge 140424.4347 83330.0192 197518.8502 True
Blmngtn NridgHt 121399.741 68361.3761 174438.1059 True
Blmngtn OldTown -66645.5815 -118133.3438 -15157.8192 True
Blmngtn SWISU -52279.5224 -114498.9775 9939.9328 False
Blmngtn Sawyer -58077.7472 -111310.1904 -4845.304 True
Blmngtn SawyerW -8315.0857 -62796.9994 46166.8279 False
Blmngtn Somerst 30508.9549 -22025.1032 83043.0129 False
Blmngtn StoneBr 115628.1176 53408.6625 177847.5728 True
Blmngtn Timber 47376.565 -10374.6478 105127.7779 False
Blmngtn Veenker 43901.8449 -32685.0101 120488.7 False
Blueste BrDale -33006.25 -181448.4174 115435.9174 False
Blueste BrkSide -12665.9483 -155011.0916 129679.1951 False
Blueste ClearCr 75065.4286 -69799.2935 219930.1506 False
Blueste CollgCr 60465.7733 -80416.7722 201348.3189 False
Blueste Crawfor 73124.7255 -69545.6724 215795.1234 False
Blueste Edwards -9280.3 -150625.5153 132064.9153 False
Blueste Gilbert 55354.5063 -86358.5908 197067.6034 False
Blueste IDOTRR -37376.2162 -181061.5586 106309.1262 False
Blueste MeadowV -38923.5294 -186879.7965 109032.7377 False
Blueste Mitchel 18770.1224 -124010.1064 161550.3513 False
Blueste NAmes 8347.08 -132226.1731 148920.3331 False
Blueste NPkVill 5194.4444 -149528.9959 159917.8848 False
Blueste NWAmes 51550.0685 -90306.7539 193406.8909 False
Blueste NoRidge 197795.3171 54469.8634 341120.7708 True
Blueste NridgHt 178770.6234 37012.0908 320529.1559 True
Blueste OldTown -9274.6991 -150460.4033 131911.005 False
Blueste SWISU 5091.36 -140351.6666 150534.3866 False
Blueste Sawyer -706.8649 -142538.1252 141124.3954 False
Blueste SawyerW 49055.7966 -93249.1306 191360.7238 False
Blueste Somerst 87879.8372 -53690.7835 229450.4579 False
Blueste StoneBr 172999.0 27555.9734 318442.0266 True
Blueste Timber 104747.4474 -38840.9086 248335.8034 False
Blueste Veenker 101272.7273 -50871.8085 253417.263 False
BrDale BrkSide 20340.3017 -35550.1855 76230.7889 False
BrDale ClearCr 108071.6786 46044.3103 170099.0468 True
BrDale CollgCr 93472.0233 41419.1813 145524.8654 True
BrDale Crawfor 106130.9755 49417.228 162844.723 True
BrDale Edwards 23725.95 -29566.4191 77018.3191 False
BrDale Gilbert 88360.7563 34100.1941 142621.3185 True
BrDale IDOTRR -4369.9662 -63590.6074 54850.6749 False
BrDale MeadowV -5917.2794 -74856.8449 63022.286 False
BrDale Mitchel 51776.3724 -5213.1044 108765.8493 False
BrDale NAmes 41353.33 -9856.4953 92563.1553 False
BrDale NPkVill 38200.6944 -44267.1764 120668.5652 False
BrDale NWAmes 84556.3185 29921.4873 139191.1497 True
BrDale NoRidge 230801.5671 172459.5377 289143.5965 True
BrDale NridgHt 211776.8734 157397.7573 266155.9895 True
BrDale OldTown 23731.5509 -29136.3011 76599.4029 False
BrDale SWISU 38097.61 -25268.6327 101463.8527 False
BrDale Sawyer 32299.3851 -22269.0409 86867.8112 False
BrDale SawyerW 82062.0466 26274.0638 137850.0294 True
BrDale Somerst 120886.0872 66998.7291 174773.4453 True
BrDale StoneBr 206005.25 142639.0073 269371.4927 True
BrDale Timber 137753.6974 78768.7612 196738.6335 True
BrDale Veenker 134278.9773 56757.5836 211800.3709 True
BrkSide ClearCr 87731.3768 42185.1676 133277.5861 True
BrkSide CollgCr 73131.7216 42528.4353 103735.0079 True
BrkSide Crawfor 85790.6738 47797.0964 123784.2512 True
BrkSide Edwards 3385.6483 -29281.4508 36052.7474 False
BrkSide Gilbert 68020.4546 33796.6124 102244.2968 True
BrkSide IDOTRR -24710.2679 -66353.3598 16932.824 False
BrkSide MeadowV -26257.5811 -80844.4518 28329.2895 False
BrkSide Mitchel 31436.0707 -6967.8775 69840.019 False
BrkSide NAmes 21013.0283 -8133.309 50159.3655 False
BrkSide NPkVill 17860.3927 -53048.0869 88768.8723 False
BrkSide NWAmes 64216.0168 29401.8308 99030.2027 True
BrkSide NoRidge 210461.2653 170077.4176 250845.1131 True
BrkSide NridgHt 191436.5717 157025.0761 225848.0673 True
BrkSide OldTown 3391.2492 -28578.6201 35361.1184 False
BrkSide SWISU 17757.3083 -29596.081 65110.6975 False
BrkSide Sawyer 11959.0834 -22750.7982 46668.965 False
BrkSide SawyerW 61721.7449 25124.4528 98319.0369 True
BrkSide Somerst 100545.7855 66916.7782 134174.7928 True
BrkSide StoneBr 185664.9483 138311.559 233018.3375 True
BrkSide Timber 117413.3956 76106.1873 158720.604 True
BrkSide Veenker 113938.6755 48849.2813 179028.0698 True
ClearCr CollgCr -14599.6552 -55345.3174 26146.0069 False
ClearCr Crawfor -1940.7031 -48493.4664 44612.0603 False
ClearCr Edwards -84345.7286 -126663.4225 -42028.0347 True
ClearCr Gilbert -19710.9222 -63241.5922 23819.7477 False
ClearCr IDOTRR -112441.6448 -162017.7979 -62865.4917 True
ClearCr MeadowV -113988.958 -174844.3135 -53133.6024 True
ClearCr Mitchel -56295.3061 -103183.5892 -9407.023 True
ClearCr NAmes -66718.3486 -106381.3896 -27055.3075 True
ClearCr NPkVill -69870.9841 -145710.6857 5968.7174 False
ClearCr NWAmes -23515.3601 -67511.6712 20480.9511 False
ClearCr NoRidge 122729.8885 74206.6672 171253.1098 True
ClearCr NridgHt 103705.1948 60026.8377 147383.5519 True
ClearCr OldTown -84340.1277 -126121.9466 -42558.3088 True
ClearCr SWISU -69974.0686 -124434.9842 -15513.153 True
ClearCr Sawyer -75772.2934 -119686.1151 -31858.4718 True
ClearCr SawyerW -26009.632 -71429.9978 19410.7339 False
ClearCr Somerst 12814.4086 -30250.1706 55878.9878 False
ClearCr StoneBr 97933.5714 43472.6558 152394.487 True
ClearCr Timber 29682.0188 -19612.335 78976.3725 False
ClearCr Veenker 26207.2987 -44221.9359 96636.5333 False
CollgCr Crawfor 12658.9522 -19423.1882 44741.0925 False
CollgCr Edwards -69746.0733 -95297.8086 -44194.3381 True
CollgCr Gilbert -5111.267 -32625.3213 22402.7873 False
CollgCr IDOTRR -97841.9895 -134172.4026 -61511.5765 True
CollgCr MeadowV -99389.3027 -150039.8531 -48738.7524 True
CollgCr Mitchel -41695.6509 -74262.7362 -9128.5655 True
CollgCr NAmes -52118.6933 -72981.5978 -31255.7889 True
CollgCr NPkVill -55271.3289 -123196.0246 12653.3668 False
CollgCr NWAmes -8915.7048 -37160.6929 19329.2832 False
CollgCr NoRidge 137329.5437 102449.6503 172209.4372 True
CollgCr NridgHt 118304.85 90557.727 146051.9731 True
CollgCr OldTown -69740.4724 -94394.5664 -45086.3785 True
CollgCr SWISU -55374.4133 -98130.6443 -12618.1824 True
CollgCr Sawyer -61172.6382 -89288.9625 -33056.3139 True
CollgCr SawyerW -11409.9767 -41825.6568 19005.7033 False
CollgCr Somerst 27414.0639 643.5215 54184.6062 True
CollgCr StoneBr 112533.2267 69776.9957 155289.4576 True
CollgCr Timber 44281.674 8336.7541 80226.594 True
CollgCr Veenker 40806.9539 -21018.4538 102632.3617 False
Crawfor Edwards -82405.0255 -116461.4781 -48348.5729 True
Crawfor Gilbert -17770.2192 -53322.6308 17782.1925 False
Crawfor IDOTRR -110500.9417 -153242.6041 -67759.2793 True
Crawfor MeadowV -112048.2549 -167477.7511 -56618.7587 True
Crawfor Mitchel -54354.603 -93947.1003 -14762.1058 True
Crawfor NAmes -64777.6455 -95473.1105 -34082.1804 True
Crawfor NPkVill -67930.281 -139489.4529 3628.8908 False
Crawfor NWAmes -21574.657 -57695.7055 14546.3915 False
Crawfor NoRidge 124670.5916 83154.8384 166186.3447 True
Crawfor NridgHt 105645.8979 69912.8092 141378.9866 True
Crawfor OldTown -82399.4246 -115787.6732 -49011.1761 True
Crawfor SWISU -68033.3655 -116355.68 -19711.051 True
Crawfor Sawyer -73831.5904 -109852.119 -37811.0617 True
Crawfor SawyerW -24068.9289 -61911.5555 13773.6977 False
Crawfor Somerst 14755.1117 -20225.0645 49735.288 False
Crawfor StoneBr 99874.2745 51551.96 148196.589 True
Crawfor Timber 31622.7219 -10791.7575 74037.2013 False
Crawfor Veenker 28148.0018 -37649.6565 93945.6601 False
Edwards Gilbert 64634.8063 34842.166 94427.4466 True
Edwards IDOTRR -28095.9162 -66181.0465 9989.214 False
Edwards MeadowV -29643.2294 -81566.7933 22280.3345 False
Edwards Mitchel 28050.4224 -6463.2456 62564.0905 False
Edwards NAmes 17627.38 -6159.9909 41414.7509 False
Edwards NPkVill 14474.7444 -54404.4434 83353.9323 False
Edwards NWAmes 60830.3685 30361.4075 91299.3295 True
Edwards NoRidge 207075.6171 170371.5955 243779.6387 True
Edwards NridgHt 188050.9234 158042.9066 218058.9402 True
Edwards OldTown 5.6009 -27167.9632 27179.1649 False
Edwards SWISU 14371.66 -29885.2436 58628.5636 False
Edwards Sawyer 8573.4351 -21776.2918 38923.1621 False
Edwards
SawyerW 58336.0966 25844.685 90827.5082 True Edwards Somerst 97160.1372 68052.7469 126267.5276 True Edwards StoneBr 182279.3 138022.3964 226536.2036 True Edwards Timber 114027.7474 76310.1719 151745.3229 True Edwards Veenker 110553.0273 47680.4635 173425.5911 True Gilbert IDOTRR -92730.7225 -132159.2548 -53302.1903 True Gilbert MeadowV -94278.0357 -147194.8431 -41361.2284 True Gilbert Mitchel -36584.3839 -72575.0117 -593.756 True Gilbert NAmes -47007.4263 -72891.2249 -21123.6278 True Gilbert NPkVill -50160.0619 -119791.0502 19470.9264 False Gilbert NWAmes -3804.4378 -35936.814 28327.9383 False Gilbert NoRidge 142440.8107 104344.6533 180536.9682 True Gilbert NridgHt 123416.117 91720.4851 155111.7489 True Gilbert OldTown -64629.2054 -93655.6519 -35602.759 True Gilbert SWISU -50263.1463 -95681.2653 -4845.0274 True Gilbert Sawyer -56061.3712 -88080.7081 -24042.0343 True Gilbert SawyerW -6298.7097 -40354.8962 27757.4768 False Gilbert Somerst 32525.3309 1681.0092 63369.6526 True Gilbert StoneBr 117644.4937 72226.3747 163062.6126 True Gilbert Timber 49392.941 10319.3245 88466.5576 True Gilbert Veenker 45918.2209 -17777.0794 109613.5213 False IDOTRR MeadowV -1547.3132 -59539.2456 56444.6192 False IDOTRR Mitchel 56146.3387 13039.4828 99253.1945 True IDOTRR NAmes 45723.2962 10611.3787 80835.2138 True IDOTRR NPkVill 42570.6607 -30991.2198 116132.5411 False IDOTRR NWAmes 88926.2847 48984.2602 128868.3092 True IDOTRR NoRidge 235171.5333 190291.7724 280051.2942 True IDOTRR NridgHt 216146.8396 176555.3151 255738.3641 True IDOTRR OldTown 28101.5171 -9387.2855 65590.3197 False IDOTRR SWISU 42467.5762 -8773.8256 93708.978 False IDOTRR Sawyer 36669.3514 -3181.7925 76520.4952 False IDOTRR SawyerW 86432.0128 44926.5967 127937.4289 True IDOTRR Somerst 125256.0534 86342.715 164169.3919 True IDOTRR StoneBr 210375.2162 159133.8144 261616.618 True IDOTRR Timber 142123.6636 96411.2666 187836.0606 True IDOTRR Veenker 138648.9435 70678.6042 206619.2828 True MeadowV Mitchel 57693.6519 1982.0699 
113405.2338 True MeadowV NAmes 47270.6094 -2513.1841 97054.4029 False MeadowV NPkVill 44117.9739 -37472.0354 125707.9831 False MeadowV NWAmes 90473.5979 37173.0851 143774.1107 True MeadowV NoRidge 236718.8465 179624.431 293813.262 True MeadowV NridgHt 217694.1528 164655.7879 270732.5177 True MeadowV OldTown 29648.8303 -21838.932 81136.5926 False MeadowV SWISU 44014.8894 -18204.5657 106234.3446 False MeadowV Sawyer 38216.6645 -15015.7786 91449.1077 False MeadowV SawyerW 87979.326 33497.4124 142461.2396 True MeadowV Somerst 126803.3666 74269.3086 179337.4247 True MeadowV StoneBr 211922.5294 149703.0743 274141.9846 True MeadowV Timber 143670.9768 85919.7639 201422.1896 True MeadowV Veenker 140196.2567 63609.4017 216783.1117 True Mitchel NAmes -10423.0424 -41625.0118 20778.9269 False Mitchel NPkVill -13575.678 -85353.5743 58202.2183 False Mitchel NWAmes 32779.946 -3772.502 69332.3941 False Mitchel NoRidge 179025.1946 137133.5597 220916.8295 True Mitchel NridgHt 160000.5009 123831.385 196169.6169 True Mitchel OldTown -28044.8216 -61899.311 5809.6679 False Mitchel SWISU -13678.7624 -62324.3932 34966.8683 False Mitchel Sawyer -19476.9873 -55930.1051 16976.1305 False Mitchel SawyerW 30285.6742 -7968.9426 68540.2909 False Mitchel Somerst 69109.7148 33684.243 104535.1865 True Mitchel StoneBr 154228.8776 105583.2468 202874.5083 True Mitchel Timber 85977.3249 43194.8591 128759.7907 True Mitchel Veenker 82502.6048 16467.1359 148538.0738 True NAmes NPkVill -3152.6356 -70433.4807 64128.2096 False NAmes NWAmes 43202.9885 16543.5212 69862.4557 True NAmes NoRidge 189448.2371 155839.3869 223057.0872 True NAmes NridgHt 170423.5434 144292.1316 196554.9551 True NAmes OldTown -17621.7791 -40442.2128 5198.6545 False NAmes SWISU -3255.72 -44981.5289 38470.0889 False NAmes Sawyer -9053.9449 -35577.0581 17469.1683 False NAmes SawyerW 40708.7166 11759.4258 69658.0074 True NAmes Somerst 79532.7572 54440.7309 104624.7835 True NAmes StoneBr 164651.92 122926.1111 206377.7289 True NAmes Timber 
96400.3674 61687.4719 131113.2628 True NAmes Veenker 92925.6473 31808.3102 154042.9843 True NPkVill NWAmes 46355.624 -23567.4101 116278.6582 False NPkVill NoRidge 192600.8726 119744.45 265457.2952 True NPkVill NridgHt 173576.1789 103852.7669 243299.591 True NPkVill OldTown -14469.1436 -83020.4068 54082.1197 False NPkVill SWISU -103.0844 -77041.6744 76835.5056 False NPkVill Sawyer -5901.3093 -75772.4696 63969.851 False NPkVill SawyerW 43861.3522 -26966.3609 114689.0653 False NPkVill Somerst 82685.3928 13344.8326 152025.9529 True NPkVill StoneBr 167804.5556 90865.9656 244743.1456 True NPkVill Timber 99553.0029 26180.7424 172925.2635 True NPkVill Veenker 96078.2828 7118.5594 185038.0063 True NWAmes NoRidge 146245.2486 107617.8829 184872.6142 True NWAmes NridgHt 127220.5549 94888.3844 159552.7254 True NWAmes OldTown -60824.7676 -90544.9756 -31104.5597 True NWAmes SWISU -46458.7085 -92323.3103 -594.1067 True NWAmes Sawyer -52256.9334 -84906.4985 -19607.3682 True NWAmes SawyerW -2494.2719 -37143.6587 32155.1149 False NWAmes Somerst 36329.7687 4831.6997 67827.8377 True NWAmes StoneBr 121448.9315 75584.3297 167313.5333 True NWAmes Timber 53197.3789 13605.6666 92789.0911 True NWAmes Veenker 49722.6588 -14291.7729 113737.0904 False NoRidge NridgHt -19024.6937 -57289.5191 19240.1317 False NoRidge OldTown -207070.0162 -243154.8936 -170985.1388 True NoRidge SWISU -192703.9571 -242927.3511 -142480.563 True NoRidge Sawyer -198502.1819 -237035.5664 -159968.7975 True NoRidge SawyerW -148739.5205 -188981.3845 -108497.6564 True NoRidge Somerst -109915.4799 -147478.1737 -72352.7861 True NoRidge StoneBr -24796.3171 -75019.7111 25427.077 False NoRidge Timber -93047.8697 -137616.1465 -48479.5929 True NoRidge Veenker -96522.5898 -163728.8029 -29316.3767 True NridgHt OldTown -188045.3225 -217292.7882 -158797.8568 True NridgHt SWISU -173679.2634 -219238.9515 -128119.5752 True NridgHt Sawyer -179477.4882 -211697.3205 -147257.656 True NridgHt SawyerW -129714.8268 -163959.5854 -95470.0681 True 
NridgHt Somerst -90890.7862 -121943.1909 -59838.3815 True NridgHt StoneBr -5771.6234 -51331.3115 39788.0648 False NridgHt Timber -74023.176 -113261.2591 -34785.0929 True NridgHt Veenker -77497.8961 -141294.22 -13701.5722 True OldTown SWISU 14366.0591 -29378.7314 58110.8496 False OldTown Sawyer 8567.8343 -21030.1235 38165.792 False OldTown SawyerW 58330.4957 26540.1669 90120.8245 True OldTown Somerst 97154.5363 68831.8714 125477.2013 True OldTown StoneBr 182273.6991 138528.9086 226018.4896 True OldTown Timber 114022.1465 76906.8036 151137.4894 True OldTown Veenker 110547.4264 48034.2881 173060.5647 True SWISU Sawyer -5798.2249 -51583.7033 39987.2536 False SWISU SawyerW 43964.4366 -3267.9245 91196.7978 False SWISU Somerst 82788.4772 37816.883 127760.0714 True SWISU StoneBr 167907.64 111926.593 223888.687 True SWISU Timber 99656.0874 48687.2772 150624.8976 True SWISU Veenker 96181.3673 24570.1713 167792.5633 True Sawyer SawyerW 49762.6615 15218.0766 84307.2464 True Sawyer Somerst 88586.7021 57203.957 119969.4472 True Sawyer StoneBr 173705.8649 127920.3864 219491.3433 True Sawyer Timber 105454.3122 65954.2867 144954.3378 True Sawyer Veenker 101979.5921 38021.8264 165937.3579 True SawyerW Somerst 38824.0406 5365.6695 72282.4117 True SawyerW StoneBr 123943.2034 76710.8422 171175.5645 True SawyerW Timber 55691.6508 14523.2415 96860.06 True SawyerW Veenker 52216.9307 -12784.467 117218.3284 False Somerst StoneBr 85119.1628 40147.5686 130090.757 True Somerst Timber 16867.6102 -21686.0702 55421.2905 False Somerst Veenker 13392.8901 -49984.7878 76770.5679 False StoneBr Timber -68251.5526 -119220.3628 -17282.7424 True StoneBr Veenker -71726.2727 -143337.4687 -115.0767 True Timber Veenker -3474.7201 -71239.795 64290.3548 False -------------------------------------------------------------
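Pairwise output in this shape comes from statsmodels' pairwise_tukeyhsd. A minimal sketch on synthetic stand-in data (the neighborhood names are real Ames labels, but the prices here are made up; with the Kaggle file you would pass data['SalePrice'] and data['Neighborhood'] from train.csv):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the Kaggle data: three neighborhoods drawn
# with clearly different mean sale prices.
rng = np.random.RandomState(0)
data_sub = pd.DataFrame({
    'Neighborhood': ['NoRidge'] * 50 + ['OldTown'] * 50 + ['NAmes'] * 50,
    'SalePrice': np.concatenate([
        rng.normal(335000, 30000, 50),   # NoRidge: expensive
        rng.normal(128000, 20000, 50),   # OldTown: cheap
        rng.normal(145000, 20000, 50),   # NAmes: mid-range
    ]),
})

# Tukey HSD compares every pair of neighborhood means, producing the
# group1 / group2 / meandiff / lower / upper / reject columns above.
tukey = pairwise_tukeyhsd(endog=data_sub['SalePrice'],
                          groups=data_sub['Neighborhood'],
                          alpha=0.05)
print(tukey.summary())
```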

To visualise this we can use the pandas boxplot function, although we probably have to tidy up the labels on the neighborhood (x) axis:

data_sub.boxplot(by='Neighborhood')
plt.xticks(rotation=90)  # tidy up the crowded x-axis labels (assumes matplotlib.pyplot imported as plt)]]>

While trying to buy a house in Dublin I realised I had no way of knowing whether I was paying a fair price, getting a great price, or over-paying. The data scientist in me would like to develop an algorithm, a hypothesis, a research question, so that my decisions are based on sound science rather than gut instinct. So for the last couple of weeks I have been developing algorithms to determine this fair price. My research question is:

**Is house sales price associated with socio-economic location?**

I stumbled upon similar research by Dean De Cock from 2009, in which he determined house prices for Ames, Iowa, so that is the data set I will use. See the Kaggle page House Prices: Advanced Regression Techniques to get the data.

I would like to study the association between neighborhood (location) and house price, to determine whether location influences the sale price and whether the difference in means between locations is significant.

This dataset has 79 independent variables, with sale price as the dependent variable. Initially I am focusing on just one independent variable – the neighborhood – so I can reduce the dataset down to two variables, simplifying the computation my analysis of variance needs to perform.

Now that I have determined I am going to study location, I might further want to look at bands of house size – not the raw square footage, but categories of square footage: less than 1,000 square feet, 1,000 to 1,250, 1,250 to 1,500, and greater than 1,500 – to see if there is a variance in the mean among these categories.
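Those bands can be created with pandas' cut function – a sketch with made-up areas; with the real data the input would be data['GrLivArea']:

```python
import pandas as pd

# Example living areas in square feet (hypothetical values; with the
# Kaggle file this would be data['GrLivArea'])
areas = pd.Series([864, 1077, 1262, 1466, 1710, 2198])

# Bands: <1000, 1000-1250, 1250-1500, >1500 square feet
bands = pd.cut(areas,
               bins=[0, 1000, 1250, 1500, float('inf')],
               labels=['<1000', '1000-1250', '1250-1500', '>1500'])
print(bands.tolist())
# → ['<1000', '1000-1250', '1250-1500', '1250-1500', '>1500', '>1500']
```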

I can now take the above ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage for first floor, second floor, basement etc…

I then searched Google Scholar, Kaggle, and the DBS library for previous studies in these areas, finding a paper from 2001 discussing earlier research in Dublin – although it was written just as a bubble was beginning, and the big property crash of 2008 was not yet conceivable. http://www.sciencedirect.com/science/article/pii/S0264999300000407

Secondly, Dean De Cock’s research on house prices in Iowa: http://ww2.amstat.org/publications/jse/v19n3/decock.pdf

Based on my literature review I believe there may be a statistically significant association between house location (neighborhood) and sale price. Secondly, I believe there will be a statistically significant association between size bands (square footage bands) and sale price. I further believe there may be an interaction effect between location and square footage bands on sale price, which I would like to investigate too.

So I have developed three null hypotheses:

* There is **NO** association between location and sales price

* There is **NO** association between bands of square footage and sales price

* There is **NO** interaction effect in association between location, bands of square footage and sales price.

A lasso regression analysis was conducted to identify a subset of variables, from a pool of 79 categorical and quantitative predictor variables, that best predicted a quantitative response variable measuring Ames, Iowa house sale price. Categorical predictors included house type, neighborhood, and zoning type, retained to improve the interpretability of a selected model with fewer predictors. Quantitative predictor variables included lot area, above-ground living area, first floor area, and second floor area. Counts were used for the number of bathrooms and the number of bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step:

Of the 33 predictor variables, 13 were retained in the selected model, with overall quality, above-ground living area, and garage capacity emerging as the three most important during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better on the test set at 81%.

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing

data = pd.read_csv("iowa_house_data.csv")

# upper-case all DataFrame column names
data.columns = [c.upper() for c in data.columns]
print(data.columns)

data_clean = data

# select predictor variables and target variable as separate data sets
predvar = data_clean[['GRLIVAREA', 'LOTAREA', 'YEARBUILT', 'FIREPLACES',
                      'OVERALLQUAL', 'OVERALLCOND', 'TOTRMSABVGRD', 'YEARREMODADD',
                      '1STFLRSF', '2NDFLRSF', 'YRSOLD', 'BSMTFINSF1', 'BSMTFINSF2',
                      'BSMTUNFSF', 'TOTALBSMTSF', 'MSSUBCLASS', 'MISCVAL', 'MOSOLD',
                      'GARAGECARS', 'GARAGEAREA', 'WOODDECKSF', 'OPENPORCHSF',
                      'ENCLOSEDPORCH', '3SSNPORCH', 'SCREENPORCH', 'POOLAREA',
                      'LOWQUALFINSF', 'BSMTFULLBATH', 'BSMTHALFBATH', 'FULLBATH',
                      'HALFBATH', 'BEDROOMABVGR', 'KITCHENABVGR']]
target = data_clean.SALEPRICE

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
for k in predvar.columns:
    predictors[k] = preprocessing.scale(predictors[k].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
var_imp = pd.DataFrame(data={'predictors': list(predictors.columns.values),
                             'coefficients': model.coef_})
var_imp['sort'] = var_imp.coefficients.abs()
print(var_imp.sort_values(by='sort', ascending=False))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

    coefficients     predictors  sort
4           0.36    OVERALLQUAL  0.36
0           0.26      GRLIVAREA  0.26
18          0.12     GARAGECARS  0.12
11          0.07     BSMTFINSF1  0.07
2           0.07      YEARBUILT  0.07
7           0.05   YEARREMODADD  0.05
8           0.05       1STFLRSF  0.05
15         -0.04     MSSUBCLASS  0.04
3           0.04     FIREPLACES  0.04
14          0.04    TOTALBSMTSF  0.04
20          0.02     WOODDECKSF  0.02
27          0.01   BSMTFULLBATH  0.01
1           0.01        LOTAREA  0.01
24          0.00    SCREENPORCH  0.00
25          0.00       POOLAREA  0.00
26          0.00   LOWQUALFINSF  0.00
31          0.00   BEDROOMABVGR  0.00
22          0.00  ENCLOSEDPORCH  0.00
28          0.00   BSMTHALFBATH  0.00
29          0.00       FULLBATH  0.00
30          0.00       HALFBATH  0.00
23          0.00      3SSNPORCH  0.00
16          0.00        MISCVAL  0.00
21          0.00    OPENPORCHSF  0.00
19          0.00     GARAGEAREA  0.00
17          0.00         MOSOLD  0.00
13          0.00      BSMTUNFSF  0.00
12          0.00     BSMTFINSF2  0.00
10          0.00         YRSOLD  0.00
9           0.00       2NDFLRSF  0.00
6           0.00   TOTRMSABVGRD  0.00
5           0.00    OVERALLCOND  0.00
32          0.00   KITCHENABVGR  0.00

training data R-square 0.777169556607
test data R-square 0.81016173881]]>

Continuing on with the Kaggle data set from House Prices: Advanced Regression Techniques I plan to make a very simple linear regression model to see if house sale price (response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and the ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

import numpy as np
import pandas as pandas
import statsmodels.api
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# call in data set
data = pandas.read_csv('homes_train.csv')
print(data['SalePrice'].describe())

count      1460.00
mean     180921.20
std       79442.50
min       34900.00
25%      129975.00
50%      163000.00
75%      214000.00
max      755000.00
Name: SalePrice, dtype: float64

There is quite a sizable difference between the mean and the median – almost 18,000, or just under 10% of our mean.

So we can center the variables as follows:

# center the variables (note: with_mean/with_std must be booleans –
# the strings 'True'/'False' are both truthy and would silently scale the std too)
data['GrLivArea'] = preprocessing.scale(data['GrLivArea'], with_mean=True, with_std=False)
data['SalePrice'] = preprocessing.scale(data['SalePrice'], with_mean=True, with_std=False)
print(data['GrLivArea'].mean())
print(data['SalePrice'].mean())

# ensure the variables are numeric
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')

# view the centering
data['SalePrice'].diff().hist()

# BASIC LINEAR REGRESSION
scat1 = seaborn.regplot(x="SalePrice", y="GrLivArea", scatter=True, data=data)
plt.xlabel('Sale Price')
plt.ylabel('Ground Living Area')
plt.title('Scatterplot for the Association Between Sale Price and Ground Living Area')
print(scat1)

print("OLS regression model for the association between sale price and ground living area")
reg1 = smf.ols('SalePrice ~ GrLivArea', data=data).fit()
print(reg1.summary())

OLS regression model for the association between sale price and ground living area

                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.502
Model:                            OLS   Adj. R-squared:                  0.502
Method:                 Least Squares   F-statistic:                     1471.
Date:                Mon, 03 Oct 2016   Prob (F-statistic):          4.52e-223
Time:                        00:13:00   Log-Likelihood:                -18035.
No. Observations:                1460   AIC:                         3.607e+04
Df Residuals:                    1458   BIC:                         3.608e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1.857e+04   4480.755      4.144      0.000    9779.612  2.74e+04
GrLivArea    107.1304      2.794     38.348      0.000     101.650   112.610
==============================================================================
Omnibus:                      261.166   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3432.287
Skew:                           0.410   Prob(JB):                         0.00
Kurtosis:                      10.467   Cond. No.                     4.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.

Looking at the graphs and summary statistics, my hypothesis held up better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space; the alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable, both my R-squared and adjusted R-squared are 0.502 – a little over 50% of the variance in sale price is explained with just one explanatory variable.

My p-value of 4.52e-223 is far less than .05, so the model shows a significant linear relationship between sale price and ground floor living area; I can therefore reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space coefficient (p-value < 2e-16) contribute to the significance – both p-values are 0.000 to three decimal places – and both t values are positive, so it is a positive linear relationship.

From the graph the sale price data appears to be skewed – the residuals are centered at -1124 rather than zero (where we’d like them to be) – which is why the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).
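A sketch of those residual checks in Python, using scipy on a synthetic stand-in for the model above (with the real data you would test reg1.resid from the fitted model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in for the house data; with the real file the model
# would be the SalePrice ~ GrLivArea fit shown above.
rng = np.random.RandomState(2)
df = pd.DataFrame({'GrLivArea': rng.uniform(500, 3000, 200)})
df['SalePrice'] = 20000 + 107 * df['GrLivArea'] + rng.normal(0, 50000, 200)

fit = smf.ols('SalePrice ~ GrLivArea', data=df).fit()
residuals = fit.resid

# Normality check on the residuals: Shapiro-Wilk test
w_stat, p_value = stats.shapiro(residuals)
print('Shapiro-Wilk p-value:', p_value)  # p > 0.05 is consistent with normality

# probplot computes the Q-Q quantile pairs (pass a matplotlib axis to draw it)
(osm, osr), (slope, intercept, r) = stats.probplot(residuals)
print('Q-Q correlation:', r)  # near 1.0 means close to a normal distribution
```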

Note the linear regression can also be done in R as follows:

house = read.csv('train.csv')
house_model = lm(house$SalePrice ~ house$GrLivArea, house)
summary(house_model)
plot(house$GrLivArea, house$SalePrice)
hist(house$SalePrice)
shapiro.test(house$SalePrice)
## Plot using a qqplot
qqnorm(house$SalePrice)
qqline(house$SalePrice, col = 2)

Call:
lm(formula = house$SalePrice ~ house$GrLivArea, data = house)

Residuals:
    Min      1Q  Median      3Q     Max
-462999  -29800   -1124   21957  339832

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
house$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021,  Adjusted R-squared:  0.5018
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16

To improve the performance of my model I now need to look at treating multiple explanatory variables which will be done in next week’s blog post.

]]>On a bulletin board yesterday a Mayo man posed the following questions. Calculate the probabilities of:

- Mayo winning the All Ireland within the next 65 years
- Dublin getting three in a row

He will be delighted to know that the probability of Mayo winning an All Ireland in the next 65 years is almost 100%, no matter which way the data is sliced.

They have won 3 / 131 so approximately 1 in 44.

They have won 3 / 15 finals they have appeared in so 1 in 5, (.2), and they have now been in 8 in a row without winning one.

They have been in 5 out of the last 15 finals = one in 3 = (.33)

Which led me onto the Dublin question:

As of today the Dubs getting 3 in a row without putting thought into it should be -> 1 in 33.

The 31 counties taking part (Kilkenny doesn’t – and they shouldn’t be allowed hurl if they don’t play football) plus London and New York make 33.

However Dublin only play in Leinster and winning that gets them to the quarter-final – so if they win Leinster then that is 1 in 8.

But they are not guaranteed to win Leinster – they have only won 9 out of the last 10 – so 90% chance of getting to the last 8 ->

So 9/10 * 1/8 = 9/80 = 0.1125
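The arithmetic above is easy to check in Python (the one-final-appearance-per-year assumption for Mayo below is mine, not the bulletin board’s):

```python
# Dublin three-in-a-row, as estimated above:
p_leinster = 9 / 10              # won 9 of the last 10 Leinster titles
p_knockout = 1 / 8               # one of eight remaining teams
p_three_in_a_row = p_leinster * p_knockout
print(p_three_in_a_row)          # → 0.1125

# Mayo winning at least one All Ireland in the next 65 years, using the
# 1-in-5 per-final rate and (hypothetically) one final appearance a year:
p_mayo = 1 - (1 - 1/5) ** 65
print(p_mayo)                    # effectively certain
```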

But this seems a bit too low to price Dublin to win next year.

From another view Dublin have won four of the last six = 4/6 = 2/3

But I s’pose this last estimate doesn’t account for the nerves of a threepeat – it is 93 years since Dublin did it. Kerry are the only team to have done it in the last 50 years, and they only did it twice in that time; it has not been done in the last 30 years. Only two teams in the last 30 years have been in a position to do it and both failed – including Kerry, who reached 6 finals in a row, won 4 of the 6, and still failed to win 3 in a row.

And at what odds would I want to place a bet in a bookmakers? Probably 1 in 4 sounds right – if they can beat any two of Kerry, Mayo, and the Ulster champions, that would win it for them.

]]>Week 2’s assignment for this machine learning for data analytics course, delivered by Wesleyan University in conjunction with Coursera, was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher’s Iris data set, comprising three species of iris (Setosa, Versicolour, and Virginica) with four explanatory variables representing sepal length, sepal width, petal length, and petal width.

Using the Spyder IDE via Anaconda Navigator, I began by importing the necessary Python libraries:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier

Now load our Iris dataset of 150 rows of 5 variables:

# Load the iris dataset
iris = pd.read_csv("iris.csv")
# or, if not on file, it could be loaded from sklearn instead:
# iris = datasets.load_iris()

Now we begin our modelling and prediction. We define our predictors and target as follows:

predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the random forest classifier class to do this.

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

Finally we make our predictions on our test data set and verify the accuracy.

predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

Out[1]: 0.94999999999999996

Next we figure out the relative importance of each of the attributes:

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
print(model.feature_importances_)

[ 0.09603246 0.06664688 0.40937484 0.42794582]
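To see which score belongs to which variable, the printed importances can be paired with the column names (a sketch using the numbers above; with the fitted model you would pass model.feature_importances_ directly):

```python
import pandas as pd

# The four importance scores printed above, paired with their columns
importances = [0.09603246, 0.06664688, 0.40937484, 0.42794582]
columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']

# Rank the variables from most to least important
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)  # PetalWidth first, then PetalLength, SepalLength, SepalWidth
```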

Finally displaying the performance of the random forest was achieved with the following:

trees = range(25)
accuracy = np.zeros(25)
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)

And the accuracy plot was output:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating the type of iris: petal width, petal length, sepal width, and sepal length.

The explanatory variables ranked by relative importance were petal width (42.8%), petal length (40.9%), sepal length (9.6%), and finally sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees rather than a single tree adding little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

]]>For Wesleyan’s Regression Modeling in Practice week 1 assignment I am required to write up the sample, procedure, and measures sections of a classical research paper. I’ve been trying to decide recently whether to move house or not: stay in the current house, sell it, move to another house, stay in the same area, or move areas. So many decisions, so much choice, so I want to do some regression modeling to help me with the decision. On kaggle.com I found an interesting problem and decided to write it up as my research data set for this assignment – House Prices: Advanced Regression Techniques.

The sample is taken from the Ames Assessor’s Office computing assessed value for individual residential properties sold in Ames, Iowa from 2006 to 2010. Participants (N=2930) represented individual residential property sales in the Ames area.

The data analytic sample for this study included participants who had sold an individual residential property. Also, if a home was sold multiple times in the 5-year period, only the most recent property sale was included (N=1,320).
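The "most recent sale only" rule is easy to sketch with a dictionary keyed on a property identifier. The records and the id field here are hypothetical, since the real data set has its own parcel identifiers:

```python
# Hypothetical records: (property_id, year_sold, sale_price).
sales = [
    ("A1", 2006, 180000),
    ("A1", 2009, 195000),  # later sale of the same property wins
    ("B2", 2007, 240000),
]

# Keep only the most recent sale per property.
latest = {}
for prop_id, year, price in sales:
    if prop_id not in latest or year > latest[prop_id][0]:
        latest[prop_id] = (year, price)

print(latest)  # {'A1': (2009, 195000), 'B2': (2007, 240000)}
```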

Data were collected by trained Ames Assessor’s Office representatives during 2006–2010 through computer-assisted personal interviews (CAPI). At the time of sale, one party involved in the sale of the property was contacted and the required variables were gathered through interview questions in the respondents’ homes, following informed consent procedures.

The house sale price was assessed using 79 variables based on the type of dwelling involved in the sale (16 different types of dwellings were found) and the general zoning of the house (8 types of zones). 20 continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set. Area measurements on the basement, main living area, and even porches are broken down into individual categories based on quality and type. 14 discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home. Additionally, the garage capacity and construction/remodeling dates are also recorded. There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes, with the smallest being STREET (gravel or paved) and the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions while the ordinal variables typically rate various items within the property.

**Dependent Variable:** Sale Price – the price the house sold for.

**Independent Variables:**

MSSubClass: Identifies the type of dwelling involved in the sale. 20 1-STORY 1946 & NEWER ALL STYLES; 30 1-STORY 1945 & OLDER; 40 1-STORY W/FINISHED ATTIC ALL AGES; 45 1-1/2 STORY - UNFINISHED ALL AGES; 50 1-1/2 STORY FINISHED ALL AGES; 60 2-STORY 1946 & NEWER; 70 2-STORY 1945 & OLDER; 75 2-1/2 STORY ALL AGES; 80 SPLIT OR MULTI-LEVEL; 85 SPLIT FOYER; 90 DUPLEX - ALL STYLES AND AGES; 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER; 150 1-1/2 STORY PUD - ALL AGES; 160 2-STORY PUD - 1946 & NEWER; 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER; 190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale. A Agriculture; C Commercial; FV Floating Village Residential; I Industrial; RH Residential High Density; RL Residential Low Density; RP Residential Low Density Park; RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property. Grvl Gravel; Pave Paved
Alley: Type of alley access to property. Grvl Gravel; Pave Paved; NA No alley access
LotShape: General shape of property. Reg Regular; IR1 Slightly irregular; IR2 Moderately Irregular; IR3 Irregular
LandContour: Flatness of the property. Lvl Near Flat/Level; Bnk Banked - Quick and significant rise from street grade to building; HLS Hillside - Significant slope from side to side; Low Depression
Utilities: Type of utilities available. AllPub All public Utilities (E,G,W,& S); NoSewr Electricity, Gas, and Water (Septic Tank); NoSeWa Electricity and Gas Only; ELO Electricity only
LotConfig: Lot configuration. Inside Inside lot; Corner Corner lot; CulDSac Cul-de-sac; FR2 Frontage on 2 sides of property; FR3 Frontage on 3 sides of property
LandSlope: Slope of property. Gtl Gentle slope; Mod Moderate Slope; Sev Severe Slope
Neighborhood: Physical locations within Ames city limits. Blmngtn Bloomington Heights; Blueste Bluestem; BrDale Briardale; BrkSide Brookside; ClearCr Clear Creek; CollgCr College Creek; Crawfor Crawford; Edwards Edwards; Gilbert Gilbert; IDOTRR Iowa DOT and Rail Road; MeadowV Meadow Village; Mitchel Mitchell; Names North Ames; NoRidge Northridge; NPkVill Northpark Villa; NridgHt Northridge Heights; NWAmes Northwest Ames; OldTown Old Town; SWISU South & West of Iowa State University; Sawyer Sawyer; SawyerW Sawyer West; Somerst Somerset; StoneBr Stone Brook; Timber Timberland; Veenker Veenker
Condition1: Proximity to various conditions. Artery Adjacent to arterial street; Feedr Adjacent to feeder street; Norm Normal; RRNn Within 200' of North-South Railroad; RRAn Adjacent to North-South Railroad; PosN Near positive off-site feature--park, greenbelt, etc.; PosA Adjacent to positive off-site feature; RRNe Within 200' of East-West Railroad; RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present). Same codes as Condition1.
BldgType: Type of dwelling. 1Fam Single-family Detached; 2FmCon Two-family Conversion - originally built as one-family dwelling; Duplx Duplex; TwnhsE Townhouse End Unit; TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling. 1Story One story; 1.5Fin One and one-half story: 2nd level finished; 1.5Unf One and one-half story: 2nd level unfinished; 2Story Two story; 2.5Fin Two and one-half story: 2nd level finished; 2.5Unf Two and one-half story: 2nd level unfinished; SFoyer Split Foyer; SLvl Split Level
OverallQual: Rates the overall material and finish of the house. 10 Very Excellent; 9 Excellent; 8 Very Good; 7 Good; 6 Above Average; 5 Average; 4 Below Average; 3 Fair; 2 Poor; 1 Very Poor
OverallCond: Rates the overall condition of the house. Same 10–1 scale as OverallQual.
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof. Flat Flat; Gable Gable; Gambrel Gambrel (Barn); Hip Hip; Mansard Mansard; Shed Shed
RoofMatl: Roof material. ClyTile Clay or Tile; CompShg Standard (Composite) Shingle; Membran Membrane; Metal Metal; Roll Roll; Tar&Grv Gravel & Tar; WdShake Wood Shakes; WdShngl Wood Shingles
Exterior1st: Exterior covering on house. AsbShng Asbestos Shingles; AsphShn Asphalt Shingles; BrkComm Brick Common; BrkFace Brick Face; CBlock Cinder Block; CemntBd Cement Board; HdBoard Hard Board; ImStucc Imitation Stucco; MetalSd Metal Siding; Other Other; Plywood Plywood; PreCast PreCast; Stone Stone; Stucco Stucco; VinylSd Vinyl Siding; Wd Sdng Wood Siding; WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material). Same codes as Exterior1st.
MasVnrType: Masonry veneer type. BrkCmn Brick Common; BrkFace Brick Face; CBlock Cinder Block; None None; Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior. Ex Excellent; Gd Good; TA Average/Typical; Fa Fair; Po Poor
ExterCond: Evaluates the present condition of the material on the exterior. Same Ex–Po scale as ExterQual.
Foundation: Type of foundation. BrkTil Brick & Tile; CBlock Cinder Block; PConc Poured Concrete; Slab Slab; Stone Stone; Wood Wood
BsmtQual: Evaluates the height of the basement. Ex Excellent (100+ inches); Gd Good (90-99 inches); TA Typical (80-89 inches); Fa Fair (70-79 inches); Po Poor (<70 inches); NA No Basement
BsmtCond: Evaluates the general condition of the basement. Ex Excellent; Gd Good; TA Typical - slight dampness allowed; Fa Fair - dampness or some cracking or settling; Po Poor - Severe cracking, settling, or wetness; NA No Basement
BsmtExposure: Refers to walkout or garden level walls. Gd Good Exposure; Av Average Exposure (split levels or foyers typically score average or above); Mn Minimum Exposure; No No Exposure; NA No Basement
BsmtFinType1: Rating of basement finished area. GLQ Good Living Quarters; ALQ Average Living Quarters; BLQ Below Average Living Quarters; Rec Average Rec Room; LwQ Low Quality; Unf Unfinished; NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types). Same codes as BsmtFinType1.
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating. Floor Floor Furnace; GasA Gas forced warm air furnace; GasW Gas hot water or steam heat; Grav Gravity furnace; OthW Hot water or steam heat other than gas; Wall Wall furnace
HeatingQC: Heating quality and condition. Ex Excellent; Gd Good; TA Average/Typical; Fa Fair; Po Poor
CentralAir: Central air conditioning. N No; Y Yes
Electrical: Electrical system. SBrkr Standard Circuit Breakers & Romex; FuseA Fuse Box over 60 AMP and all Romex wiring (Average); FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair); FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor); Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
Kitchen: Kitchens above grade
KitchenQual: Kitchen quality. Ex Excellent; Gd Good; TA Typical/Average; Fa Fair; Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted). Typ Typical Functionality; Min1 Minor Deductions 1; Min2 Minor Deductions 2; Mod Moderate Deductions; Maj1 Major Deductions 1; Maj2 Major Deductions 2; Sev Severely Damaged; Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality. Ex Excellent - Exceptional Masonry Fireplace; Gd Good - Masonry Fireplace in main level; TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement; Fa Fair - Prefabricated Fireplace in basement; Po Poor - Ben Franklin Stove; NA No Fireplace
GarageType: Garage location. 2Types More than one type of garage; Attchd Attached to home; Basment Basement Garage; BuiltIn Built-In (Garage part of house - typically has room above garage); CarPort Car Port; Detchd Detached from home; NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage. Fin Finished; RFn Rough Finished; Unf Unfinished; NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality. Ex Excellent; Gd Good; TA Typical/Average; Fa Fair; Po Poor; NA No Garage
GarageCond: Garage condition. Same codes as GarageQual.
PavedDrive: Paved driveway. Y Paved; P Partial Pavement; N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality. Ex Excellent; Gd Good; TA Average/Typical; Fa Fair; NA No Pool
Fence: Fence quality. GdPrv Good Privacy; MnPrv Minimum Privacy; GdWo Good Wood; MnWw Minimum Wood/Wire; NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories. Elev Elevator; Gar2 2nd Garage (if not described in garage section); Othr Other; Shed Shed (over 100 SF); TenC Tennis Court; NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale. WD Warranty Deed - Conventional; CWD Warranty Deed - Cash; VWD Warranty Deed - VA Loan; New Home just constructed and sold; COD Court Officer Deed/Estate; Con Contract 15% Down payment regular terms; ConLw Contract Low Down payment and low interest; ConLI Contract Low Interest; ConLD Contract Low Down; Oth Other
SaleCondition: Condition of sale. Normal Normal Sale; Abnorml Abnormal Sale - trade, foreclosure, short sale; AdjLand Adjoining Land Purchase; Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit; Family Sale between family members; Partial Home was not completed when last assessed (associated with New Homes)
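Several of the ordinal variables above (ExterQual, ExterCond, KitchenQual, HeatingQC, and the basement/garage quality fields) share an Ex/Gd/TA/Fa/Po scale. Regression models need these as numbers; one possible encoding is sketched below. The numeric values are a modelling choice of mine, not part of the data set:

```python
# One possible numeric encoding of the shared quality scale
# (assumed mapping, chosen for illustration only).
QUALITY = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NA": 0}

def encode_quality(values):
    """Map a list of quality codes to their numeric ranks."""
    return [QUALITY[v] for v in values]

print(encode_quality(["Gd", "TA", "Ex"]))  # [4, 3, 5]
```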

Kaggle’s House Prices: Advanced Regression Techniques

Ames Assessor’s Original Publication

Data Documentation

First assignment done for the University of Washington’s Machine Learning Foundations course in regression analysis. There were 9 questions to answer having done the slides and practicals for week 1. An interesting way to pass this assignment – one has until the 9th of October to get above 80%, so 8 out of 9 required. One can attempt the assignment at most 3 times in every 8-hour period. Anyway, I got all 9 questions correct on the first attempt.

Passed 9/9 points earned (100%) Quiz passed!

Q1. Which figure represents an overfitted model?

Q2. True or false: The model that best minimizes training error is the one that will perform best for the task of prediction on new data.

Q3. The following table illustrates the results of evaluating 4 models with different parameter choices on some data set. Which of the following models fits this data the best?

| Model index | Parameters (intercept, slope) | Residual sum of squares (RSS) |
| --- | --- | --- |
| 1 | (0, 1.4) | 20.51 |
| 2 | (3.1, 1.4) | 15.23 |
| 3 | (2.7, 1.9) | 13.67 |
| 4 | (0, 2.3) | 18.99 |
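Q3 is mechanical once you remember that the best least-squares fit is simply the parameter choice with the smallest residual sum of squares:

```python
# RSS values from the table above, keyed by model index.
rss = {1: 20.51, 2: 15.23, 3: 13.67, 4: 18.99}

# The best-fitting model minimises the residual sum of squares.
best = min(rss, key=rss.get)
print(best)  # 3
```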

Q4. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? *(Note: you must select all parameters estimated as 0 to get the question correct.)*

Q5. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? *(Note: you must select all parameters estimated as 0 to get the question correct.)*

Q6. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? *(Note: you must select all parameters estimated as 0 to get the question correct.)*

Q7. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? *(Note: you must select all parameters estimated as 0 to get the question correct.)*

Q8. Would you **not** expect to see this plot as a plot of training and test error curves?

Q9. True or false: One always prefers to use a model with more features since it better captures the true underlying process.

]]>Why SFrame & GraphLab Create

There are many excellent machine learning libraries in Python. One of the most popular today is scikit-learn. Similarly, there are many tools for data manipulation in Python; a popular example is Pandas. However, most of these tools do not scale to large datasets.

The SFrame package is available in open-source under a permissive BSD license. So, you will always be able to use SFrames for free. It can be installed with:

pip install -U sframe

GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. This software, however, has a paid license for commercial purposes. You can get the GraphLab Create academic license at the following link:

https://dato.com/learn/coursera/

I was able to sign up with my DBS lecturer email address, get a valid license key, and then download and install the product. It works in conjunction with Anaconda and Jupyter Notebooks.

GraphLab Create is very actively used in industry by a large number of companies. This package was created by a machine learning company called Dato, a spin-off from a popular research project called GraphLab, which Carlos Guestrin and his research group started at Carnegie Mellon University. In addition to being a professor at the University of Washington, Carlos is the CEO of Dato.

]]>Week 1’s assignment for this machine learning for data analytics course delivered by Wesleyan University, Hartford, Connecticut in conjunction with Coursera was to build a decision tree to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I decided to choose Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python as I have been programming in it for over 10 years.

Pandas, sklearn, numpy, and spyder were also used, with Anaconda being instrumental in setting everything up.

conda update conda
conda update anaconda
conda install seaborn
conda update qt pyqt
conda install spyder
pip install graphviz
pip install pydotplus
brew install graphviz

Started up Spyder IDE via Anaconda Navigator and then began to import the necessary python libraries:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

Now load our Iris dataset of 150 rows of 5 variables:

#Load the iris dataset
iris = pd.read_csv("iris.csv")
# or if not on file could call this.
#iris = datasets.load_iris()
# there should be no na - for performance probably don't have to do this
iris = iris.dropna()
iris.dtypes
iris.describe()
print("head", iris.head(), sep="\n", end="\n\n")
print("tail", iris.tail(), sep="\n", end="\n\n")
print("types", iris["Name"].unique(), sep="\n")

Leading to the output:

head
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

tail
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

types
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

Now we begin our modelling and prediction. We define our predictors and target as follows:

predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the decision tree classifier class to do this.

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

Finally we make our predictions on our test data set and verify the accuracy.

predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

Out[1]: 0.96666666666666667
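The accuracy score is just the trace of the confusion matrix divided by the number of test samples. The matrix below is a hypothetical one, chosen only to be consistent with the 58-of-60 result above:

```python
# A hypothetical 3-class confusion matrix consistent with the
# accuracy shown above (rows: true class, columns: predicted class).
matrix = [[20, 0, 0],
          [0, 19, 2],
          [0, 0, 19]]

# Accuracy = correctly classified (the diagonal) / all test samples.
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print(correct / total)  # 0.9666666666666667
```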

I’ve run the above code, separating the training and test datasets, building the model, making the predictions, and finally testing the accuracy another 14 times in a loop, and got accuracy predictions ranging from 84.3% to 100%, so a generated model might have the potential to be overfitted. However the mean of these values is 0.942 with a standard deviation of 0.04, so the values are not deviating much from the mean.

Out[2]: 0.94999999999999996
Out[3]: 1.0
Out[4]: 0.96666666666666667
Out[5]: 0.94999999999999996
Out[6]: 0.8433333333333333
Out[7]: 0.93333333333333335
Out[8]: 0.90000000000000002
Out[9]: 0.94999999999999996
Out[10]: 0.96666666666666667
Out[11]: 0.91666666666666663
Out[12]: 0.98333333333333328
Out[13]: 0.8833333333333333
Out[14]: 0.94999999999999996
Out[15]: 0.96666666666666667
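The mean and standard deviation quoted can be checked directly from the fifteen reported accuracies:

```python
import statistics

# The fifteen accuracy scores reported above (Out[1] to Out[15]).
scores = [0.96666667, 0.95, 1.0, 0.96666667, 0.95, 0.84333333,
          0.93333333, 0.9, 0.95, 0.96666667, 0.91666667, 0.98333333,
          0.88333333, 0.95, 0.96666667]

mean = statistics.mean(scores)
spread = statistics.pstdev(scores)  # population standard deviation
print(round(mean, 3), round(spread, 2))  # 0.942 0.04
```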

Finally displaying the tree was achieved with the following:

#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO  # Python 2
from io import StringIO  # Python 3
from IPython.display import Image
import pydotplus
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

And the tree was output:

The petal length (X[2]) was the first variable to separate the sample into two subgroups. Irises with petal length of less than or equal to 2.45 were a group of their own – the setosa, with all 32 in the sample identified as this group. The next variable to separate was the petal width (X[3]) on values of less than or equal to 1.75. This separates between the versicolor and virginica categories very well – only 3 of the remaining 58 not being categorised correctly (2 of the virginica, and 1 of the versicolor). The next decision is back on petal length again (X[2]) <= 5.45 on the left hand branch, resolving virginica in the end on two more decisions, the majority with petal length less than or equal to 4.95 and the remaining 2 with petal width > 1.55. Meanwhile in the right branch all but one of the versicolor is categorised based on the petal length > 4.85. The last decision to decide between 1 versicolor and 1 virginica is made on variable X[0], the sepal length: <= 6.05 being the virginica, and the last versicolor having a sepal length > 6.05.
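The decision path described above can be approximated with plain conditionals. This is a hand-rolled sketch of just the top two splits (thresholds read off the fitted tree), not the sklearn classifier itself:

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Approximate the top splits of the fitted tree described above."""
    if petal_length <= 2.45:      # first split: isolates setosa
        return "Iris-setosa"
    if petal_width <= 1.75:       # second split: mostly versicolor
        return "Iris-versicolor"
    return "Iris-virginica"       # remainder: mostly virginica

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # Iris-setosa
print(classify_iris(6.3, 2.5, 5.0, 1.9))  # Iris-virginica
```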

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

]]>The DBS Analytics Society meets every second Saturday morning from 10–1 in Castle House 2.2 to work through a problem set and have fun.

Provisional Calendar:

September 24th 2016

October 1st 2016

October 15th 2016

October 22nd 2016

November 5th 2016

November 19th 2016

December 3rd 2016

]]>A list of sample exam questions was also covered in this lecture, which reviewed the course.


]]>

Find or create a dataset* suitable for K-Means Cluster analysis and K-Nearest Neighbour predictions of roughly 200 observations.

* The dataset should be unique with respect to your class.

Examine the dataset and separate the dataset into a training set of a suitable size and a test set to see the effectiveness of your model.

Follow the tutorials for K-Means Clustering and K-Nearest Neighbour.

Submit your completed work and summary as a classical paper.
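Before reaching for a library, it is worth seeing that the K-Nearest Neighbour idea (here with k = 1) fits in a few lines of plain Python; the data points below are invented for illustration:

```python
def nearest_neighbour(train, point):
    """Classify `point` by the label of its closest training example.
    `train` is a list of ((x, y), label) pairs; distance is Euclidean."""
    def dist_sq(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(train, key=lambda ex: dist_sq(ex[0], point))[1]

# A toy training set with two clusters.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.3, 4.9), "b")]
print(nearest_neighbour(train, (4.8, 5.1)))  # b
```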

]]>Apriori is another useful algorithm to understand and be able to use. It is a Data Mining algorithm used in Association Analysis, often referred to as Basket Analysis or Shopping Basket Analysis.

An Apriori algorithm example in R:

install.packages("arules")
library("arules")
a_list <- list(
  c("CrestTP","CrestTB"), c("OralBTB"), c("BarbSC"),
  c("ColgateTP","BarbSC"), c("OldSpiceSC"), c("CrestTP","CrestTB"),
  c("AIMTP","GUMTB","OldSpiceSC"), c("ColgateTP","GUMTB"),
  c("AIMTP","OralBTB"), c("CrestTP","BarbSC"),
  c("ColgateTP","GilletteSC"), c("CrestTP","OralBTB"), c("AIMTP"),
  c("AIMTP","GUMTB","BarbSC"), c("ColgateTP","CrestTB","GilletteSC"),
  c("CrestTP","CrestTB","OldSpiceSC"), c("OralBTB"),
  c("AIMTP","OralBTB","OldSpiceSC"), c("ColgateTP","GilletteSC"),
  c("OralBTB","OldSpiceSC")
)
names(a_list) <- paste("Tr", c(1:20), sep = "")
trans <- as(a_list, "transactions")
summary(trans)

> summary(trans)
transactions as itemMatrix in sparse format with
 20 rows (elements/itemsets/transactions) and
 9 columns (items) and a density of 0.2222222

most frequent items:
   OralBTB      AIMTP  ColgateTP    CrestTP OldSpiceSC    (Other)
         6          5          5          5          5         14

element (itemset/transaction) length distribution:
sizes
 1  2  3
 5 10  5

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    2.00    2.00    2.25    3.00

includes extended item information - examples:
     labels
1     AIMTP
2    BarbSC
3 ColgateTP

includes extended transaction information - examples:
  transactionID
1           Tr1
2           Tr2
3           Tr3

inspect(trans)

> inspect(trans)
   items                           transactionID
1  {CrestTB,CrestTP}               Tr1
2  {OralBTB}                       Tr2
3  {BarbSC}                        Tr3
4  {BarbSC,ColgateTP}              Tr4
5  {OldSpiceSC}                    Tr5
6  {CrestTB,CrestTP}               Tr6
7  {AIMTP,GUMTB,OldSpiceSC}        Tr7
8  {ColgateTP,GUMTB}               Tr8
9  {AIMTP,OralBTB}                 Tr9
10 {BarbSC,CrestTP}                Tr10
11 {ColgateTP,GilletteSC}          Tr11
12 {CrestTP,OralBTB}               Tr12
13 {AIMTP}                         Tr13
14 {AIMTP,BarbSC,GUMTB}            Tr14
15 {ColgateTP,CrestTB,GilletteSC}  Tr15
16 {CrestTB,CrestTP,OldSpiceSC}    Tr16
17 {OralBTB}                       Tr17
18 {AIMTP,OldSpiceSC,OralBTB}      Tr18
19 {ColgateTP,GilletteSC}          Tr19
20 {OldSpiceSC,OralBTB}            Tr20

rules<-apriori(trans,parameter=list(supp=.02, conf=.5, target="rules"))

> rules <- apriori(trans, parameter=list(supp=.02, conf=.5, target="rules"))
Apriori

Parameter specification:
 confidence minval smax arem aval originalSupport support minlen maxlen target ext
        0.5    0.1    1 none FALSE          TRUE    0.02      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 0

Warning in apriori(trans, parameter = list(supp = 0.02, conf = 0.5, target = "rules")) :
 You chose a very low absolute support count of 0. You might run out of memory! Increase minimum support.

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 20 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [18 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

inspect(head(sort(rules,by="lift"), n=20))

> inspect(head(sort(rules, by="lift"), n=20))
   lhs                        rhs          support confidence     lift
7  {ColgateTP,CrestTB}     => {GilletteSC}    0.05  1.0000000 6.666667
9  {AIMTP,BarbSC}          => {GUMTB}         0.05  1.0000000 6.666667
15 {CrestTP,OldSpiceSC}    => {CrestTB}       0.05  1.0000000 5.000000
1  {GilletteSC}            => {ColgateTP}     0.15  1.0000000 4.000000
2  {ColgateTP}             => {GilletteSC}    0.15  0.6000000 4.000000
6  {CrestTB,GilletteSC}    => {ColgateTP}     0.05  1.0000000 4.000000
8  {BarbSC,GUMTB}          => {AIMTP}         0.05  1.0000000 4.000000
11 {GUMTB,OldSpiceSC}      => {AIMTP}         0.05  1.0000000 4.000000
14 {CrestTB,OldSpiceSC}    => {CrestTP}       0.05  1.0000000 4.000000
13 {AIMTP,OldSpiceSC}      => {GUMTB}         0.05  0.5000000 3.333333
4  {CrestTB}               => {CrestTP}       0.15  0.7500000 3.000000
5  {CrestTP}               => {CrestTB}       0.15  0.6000000 3.000000
3  {GUMTB}                 => {AIMTP}         0.10  0.6666667 2.666667
10 {AIMTP,GUMTB}           => {BarbSC}        0.05  0.5000000 2.500000
12 {AIMTP,GUMTB}           => {OldSpiceSC}    0.05  0.5000000 2.000000
16 {OldSpiceSC,OralBTB}    => {AIMTP}         0.05  0.5000000 2.000000
17 {AIMTP,OralBTB}         => {OldSpiceSC}    0.05  0.5000000 2.000000
18 {AIMTP,OldSpiceSC}      => {OralBTB}       0.05  0.5000000 1.666667
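The figures in the top rows can be verified by hand. A small pure-Python check (independent of arules) of the {GilletteSC} => {ColgateTP} rule, using the same 20 transactions defined above:

```python
# The 20 transactions from the arules example above.
transactions = [
    {"CrestTP", "CrestTB"}, {"OralBTB"}, {"BarbSC"},
    {"ColgateTP", "BarbSC"}, {"OldSpiceSC"}, {"CrestTP", "CrestTB"},
    {"AIMTP", "GUMTB", "OldSpiceSC"}, {"ColgateTP", "GUMTB"},
    {"AIMTP", "OralBTB"}, {"CrestTP", "BarbSC"},
    {"ColgateTP", "GilletteSC"}, {"CrestTP", "OralBTB"}, {"AIMTP"},
    {"AIMTP", "GUMTB", "BarbSC"}, {"ColgateTP", "CrestTB", "GilletteSC"},
    {"CrestTP", "CrestTB", "OldSpiceSC"}, {"OralBTB"},
    {"AIMTP", "OralBTB", "OldSpiceSC"}, {"ColgateTP", "GilletteSC"},
    {"OralBTB", "OldSpiceSC"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rhs = {"GilletteSC"}, {"ColgateTP"}
supp = support(lhs | rhs)       # 3 of 20 baskets -> 0.15
conf = supp / support(lhs)      # 0.15 / 0.15    -> 1.0
lift = conf / support(rhs)      # 1.0 / 0.25     -> 4.0
print(supp, conf, lift)  # 0.15 1.0 4.0
```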

rules<-apriori(trans,parameter=list(supp=.2, conf=.5, target="rules"))

> rules <- apriori(trans, parameter=list(supp=.2, conf=.5, target="rules"))
Apriori

Parameter specification:
 confidence minval smax arem aval originalSupport support minlen maxlen target ext
        0.5    0.1    1 none FALSE          TRUE     0.2      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 4

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 20 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

inspect(head(sort(rules, by="lift"), n=20))

> inspect(head(sort(rules, by="lift"), n=20))

No rules are printed this time: at a minimum support of 0.2 (an absolute count of 4 out of 20 transactions) the apriori run above wrote 0 rules, so the rule set is empty. The support threshold has to come down before any itemsets qualify.

rules<-apriori(trans,parameter=list(supp=.1, conf=.5, target="rules"))

> rules <- apriori(trans, parameter=list(supp=.1, conf=.5, target="rules"))
Apriori

Parameter specification:
 confidence minval smax arem aval originalSupport support minlen maxlen target ext
        0.5    0.1    1 none FALSE          TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9 item(s), 20 transaction(s)] done [0.00s].
sorting and recoding items ... [9 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

inspect(head(sort(rules, by="lift"), n=20))

> inspect(head(sort(rules, by="lift"), n=20))
  lhs              rhs          support confidence     lift
1 {GilletteSC} =>  {ColgateTP}     0.15  1.0000000 4.000000
2 {ColgateTP}  =>  {GilletteSC}    0.15  0.6000000 4.000000
4 {CrestTB}    =>  {CrestTP}       0.15  0.7500000 3.000000
5 {CrestTP}    =>  {CrestTB}       0.15  0.6000000 3.000000
3 {GUMTB}      =>  {AIMTP}         0.10  0.6666667 2.666667]]>

Files used in this lecture:

]]>**R** has extensive facilities for analyzing time series data. This section describes the creation of a time series, seasonal decomposition, modeling with exponential and ARIMA models, and forecasting with the **forecast** package.

See the post on statsmethods for time series here.

Or, my personal favourite, with excellent data examples ranging from successive Kings of England to Australian souvenir shops and skirt hem sizes: see A Little Book of R for Time Series.
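The simple exponential smoothing idea that underlies the forecast package's models can be sketched in a few lines of Python (illustrative only; R's ses()/HoltWinters() estimate the smoothing parameter and handle trend and seasonality):

```python
def exp_smooth(series, alpha):
    """Simple exponential smoothing: each level is a weighted blend of
    the newest observation and the previous level; the final level is
    the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# A flat series forecasts its own value regardless of alpha.
print(exp_smooth([5.0, 5.0, 5.0, 5.0], alpha=0.3))   # 5.0
print(exp_smooth([10, 12, 13, 12, 15], alpha=0.5))   # 13.5
```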

]]>R has a lot of APIs and plugin libraries, which can make it impossible to remember what everything does. The following is a list of cheat sheets for R, Python, numpy, scipy, and pandas to do regression analysis, machine learning, predictive analytics, and whatever else might be relevant and interesting:

Ref Card in R for Regression Analysis

Ricci Ref Card for Time Series Data

]]>Geoffrey Beall published this famous paper in Biometrika in 1942 to highlight ‘The Transformation of Data from Entomological Field Experiments so that the Analysis of Variance Becomes Applicable’.

]]>

For any data scientist Python is a must, but Python alone will not get you very far. Pandas is the data analytics library that gives Python the functionality which comes out of the box in R.

Setting up Python & Pandas is now made very easy with Anaconda, and the running of Python can be made very intuitive with Jupyter Notebook.

Steps:

- Download Python & install (Python 2.7 was used here)
- Download Anaconda & Install
- Open Command Prompt after installation
- set PATH=%PATH%;c:\Python27;
- conda –version
- conda install pandas
- conda install ipython
- conda install pip
- jupyter notebook
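Once the steps above complete, a quick smoke test (assuming pandas installed cleanly; the frame contents are made up) confirms the stack works:

```python
import pandas as pd

# build a tiny frame and exercise a couple of pandas basics
df = pd.DataFrame({'movie': ['A', 'B', 'C'], 'rating': [7.1, 8.4, 6.0]})
print(df.shape)             # (3, 2)
print(df['rating'].mean())  # ~7.17
print(df.describe())
```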

For more details, please see the full tutorial to install Pandas here.

A simple, intuitive, and powerful introduction to Pandas can be found here.

The graphics matplotlib library is discussed here.

Statistical analysis made easy in Python with SciPy and Pandas DataFrames.

5 Questions which can teach you Multiple Regressions (with R and Python).
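Before reaching for R or a Python library, the mechanics of a simple one-variable regression are worth seeing once in plain Python. A hedged sketch (the helper name is hypothetical, not from the linked post) of the ordinary least-squares fit:

```python
def fit_line(xs, ys):
    # ordinary least squares for y = slope * x + intercept
    n = float(len(xs))
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]    # exactly y = 2x + 1
print(fit_line(xs, ys))  # (2.0, 1.0)
```

Multiple regression generalises the same idea to several predictors, which is where statsmodels or R's lm() earn their keep.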

Data files useful to run analysis on:

]]>Files required for this lecture:

]]>
]]>

Taking the dataset from AirBnb’s New User Bookings scenario

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

Analyse the dataset to see if it is possible to create a model which describes where people are likely to travel as their first trip on airbnb.

Your model can be one of the many algorithms covered on the course, or from any model you’ve come across in a previous existence.

You are required to calculate the goodness of fit of your model to your data, outlining the significant independent variables, if any.

You are also required to verify whether the data follows a normal distribution – and whether it should be treated parametrically or non-parametrically.

Please examine the data set also with an emphasis to data quality and mention some data scrubbing techniques that could be used to make the data easier to work with going forward.

Your solution will include a program, runnable in R or Python, as well as a Word document outlining your research.

Your paper should be written as a formal paper.

You are also required to create a blog post on your <name>.dbsdataprojects.com blog with your research – including your runnable program script.

]]>
]]>

The detergent data file used to demonstrate interaction effect:

]]>

]]>

Links to data files used in this lecture:

Teaching Methods ANOVA in Excel

]]>

Please find the links to the State Data 77 and the Milk Production data file below:

]]>Below are the links to the data files used in this lecture:

]]>

R is fundamental to data science, and for anyone new to the field the overriding question is ‘where do I start?’

The following are links to R tutorial sites that I have found invaluable for teaching students new to R every year:

]]>
]]>

In 2010 Springer published the book Guide to Intelligent Data Analysis, with some excellent topics and features, such as:

- Guides the reader through the process of data analysis, following the interdependent steps of project understanding, data understanding, data preparation, modeling, and deployment and monitoring
- Equips the reader with the necessary information in order to obtain hands-on experience of the topics under discussion
- Provides a review of the basics of classical statistics that support and justify many data analysis methods, and a glossary of statistical terms
- Includes numerous examples using R and KNIME, together with appendices introducing the open source software
- Integrates illustrations and case-study-style examples to support pedagogical exposition
- Supplies further tools and information at the associated website: http://www.idaguide.net/

]]>

Earl F. Glynn published Using Colors in R in 2007 for the Stowers Institute for Medical Research, and it remains the baseline reference for all the colour work you will ever need in R data visualisations.

]]>

The following is an excellent presentation by Theodore Johnson from AT&T that he delivered at Rutgers University in 2004 – still as relevant today as it was on the 12th February of that year. The original presentation can be found for download here.

]]>

CA 1 – Anscombe’s Quartet

Write up a report and analysis on the Anscombe’s Quartet.

Describe the flaws that this dataset exposes in relying on Pearson’s correlation alone, without visualising the data.

Describe each of the 4 charts in a blog post roughly 500 words in length.

In your answer also come up with a new set of data points which validate Anscombe’s work.
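To see the flaw the assignment is probing in miniature: Pearson's r measures only linear association, so a perfectly deterministic but non-linear relationship can score zero. A sketch with made-up points (not Anscombe's actual values):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation: covariance scaled by the two standard deviations
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # ~1.0 (perfect line)
print(pearson_r(xs, [x * x for x in xs]))      # 0.0 (perfect parabola, zero r)
```

The parabola is fully determined by x, yet r is exactly zero – which is why Anscombe insisted on plotting the data.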

]]>
]]>

]]>

]]>

]]>

]]>

]]>

Calculate the Haversine Distance around the earth in order to help Santa:

```python
#!/usr/bin/env python
# Haversine formula example in Python
# Author: Wayne Dyck

import math

def distance(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371  # km

    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d
```
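A quick sanity check of the formula (the function is restated compactly so this snippet runs standalone): a quarter of the way around the equator should come out as radius × π/2 ≈ 10,007 km.

```python
import math

def distance(origin, destination):
    # haversine great-circle distance in km (Earth radius 6371 km)
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2)
    return radius * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

print(distance((0, 0), (0, 90)))    # ~10007.5 km (quarter of the equator)
print(distance((90, 0), (-90, 0)))  # ~20015.1 km (pole to pole)
```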

Python script to direct the reindeer and the sled:

```python
north_pole = (90, 0)
weight_limit = 1000
sleigh_weight = 10

import pandas as pd
import numpy as np
from haversine import distance

def weighted_trip_length(stops, weights):
    tuples = [tuple(x) for x in stops.values]
    # adding the last trip back to north pole, with just the sleigh weight
    tuples.append(north_pole)
    weights.append(sleigh_weight)

    dist = 0.0
    prev_stop = north_pole
    prev_weight = sum(weights)
    for i in range(len(tuples)):
        dist = dist + distance(tuples[i], prev_stop) * prev_weight
        prev_stop = tuples[i]
        prev_weight = prev_weight - weights[i]
    return dist

def weighted_reindeer_weariness(all_trips):
    uniq_trips = all_trips.TripId.unique()

    if any(all_trips.groupby('TripId').Weight.sum() + sleigh_weight > weight_limit):
        raise Exception("One of the sleighs over weight limit!")

    dist = 0.0
    for t in uniq_trips:
        this_trip = all_trips[all_trips.TripId == t]
        dist = dist + weighted_trip_length(this_trip[['Latitude', 'Longitude']], this_trip.Weight.tolist())
    return dist

gifts = pd.read_csv('gifts.csv')
sample_sub = pd.read_csv('sample_submission.csv')
all_trips = sample_sub.merge(gifts, on='GiftId')
print(weighted_reindeer_weariness(all_trips))
```
]]>

Amazon EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) will also be employed in this lab.

Elastic Map Reduce:

https://aws.amazon.com/elasticmapreduce

Getting Started Tutorial:

http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs.html

]]>Also check out A Beginner’s Guide to Big O Notation

]]>This is a tutorial to demonstrate mapper and reducer python scripts to convert data from one input format to a required output format.

Follow the steps from the tutorial below:

Create the mapper.py file and the reducer.py file and have all files in the same windows directory.

Then run ->

type googlebooks-eng-all-1gram-20120701-x | mapper.py | reducer.py

set PATH="C:\Program Files\R\R-3.2.2\bin\x64";%PATH%

type googlebooks-eng-all-1gram-20120701-x | mapper.py | reducer.py | rscript grapher.r

Create a graph based on the data via R:

```r
library(ggplot2)
dat <- read.csv(file="newwords.csv", header=F, sep=",")
dat
#attach(dat)
words <- dat[1]
words
count <- dat[2]
count
ggplot(data=dat, aes(x=words, y=count)) + geom_bar(stat="identity")
```

Create a graph with r – piping data from standard input:

```r
library(ggplot2)
dat <- read.csv(file="stdin", header=TRUE, sep=",")
dat
ggplot(data=dat, aes(x=Words, y=Count)) + geom_bar(stat="identity")
```
]]>

The python code to spell check:

```python
class SpellChecker(object):
    def __init__(self):
        self.words = []

    def load_file(self, file_name):
        lines = open(file_name).readlines()
        return map(lambda x: x.strip().lower(), lines)

    def load_words(self, file_name):
        self.words = self.load_file(file_name)

    def check_word(self, word):
        return word.strip('.').lower() in self.words

    # index = 0 is set here so that the function can be called for one line
    # and index defaults to 0
    def check_words(self, sentence, index=0):
        words_to_check = sentence.split(' ')
        caret_position = 0
        failed_words = []
        for word in words_to_check:
            if not self.check_word(word):
                print('Word is misspelt ' + word + ' at line : ' + str(index+1) + ' pos ' + str(caret_position+1))
                failed_words.append({'word': word, 'line': index+1, 'pos': caret_position+1})
            # update the caret position to be the length of the word plus 1 for the split character.
            caret_position = caret_position + len(word) + 1
        return failed_words

    def check_document(self, file_name):
        self.sentences = self.load_file(file_name)
        failed_words_in_sentences = []
        for index, sentence in enumerate(self.sentences):
            failed_words_in_sentences.extend(self.check_words(sentence, index))
        return failed_words_in_sentences
```

Testing the Spell Checker:

```python
import unittest
from spellcheck import SpellChecker

class TestSpellChecker(unittest.TestCase):
    def setUp(self):
        self.spellChecker = SpellChecker()
        self.spellChecker.load_words('spell.words')

    def test_spell_checker(self):
        self.assertTrue(self.spellChecker.check_word('zygotic'))
        failed_words = self.spellChecker.check_words('zygotic mistasdas elementary')
        self.assertEquals(1, len(failed_words))
        self.assertEquals('mistasdas', failed_words[0]['word'])
        self.assertEquals(1, failed_words[0]['line'])
        self.assertEquals(9, failed_words[0]['pos'])
        self.assertEquals(0, len(self.spellChecker.check_words('our first correct sentence')))
        # handle case sensitivity
        self.assertEquals(0, len(self.spellChecker.check_words('Our capital sentence')))
        # handle full stop
        self.assertEquals(0, len(self.spellChecker.check_words('Our full stop sentence.')))
        failed_words = self.spellChecker.check_words('zygotic mistasdas spelllleeeing elementary')
        self.assertEquals(2, len(failed_words))
        self.assertEquals('mistasdas', failed_words[0]['word'])
        self.assertEquals(1, failed_words[0]['line'])
        self.assertEquals(9, failed_words[0]['pos'])
        self.assertEquals('spelllleeeing', failed_words[1]['word'])
        self.assertEquals(1, failed_words[1]['line'])
        self.assertEquals(19, failed_words[1]['pos'])
        self.assertEqual(0, len(self.spellChecker.check_document('spell.words')))

if __name__ == '__main__':
    unittest.main()
```
]]>

]]>

Based on the lessons we’ve learned in Week 1 interrogate the data set to find the number of elements.

Once the number of elements is determined determine an algorithm and the data types that might be required to interrogate the dataset.

Implement the program to read the data including any testsuites / classes / functions that are required.

The dataset:

The python interrogation:

```python
# open the file - and read all of the lines.
changes_file = 'changes_python.txt'
# use strip to strip out spaces and trim the line.
#my_file = open(changes_file, 'r')
#data = my_file.readlines()
data = [line.strip() for line in open(changes_file, 'r')]
# print the number of lines read
print(len(data))

sep = 72*'-'

# create the commit class to hold each of the elements - I am hoping there
# will be 422 otherwise I have messed up.
class Commit:
    'class for commits'
    def __init__(self, revision=None, author=None, date=None,
                 comment_line_count=None, changes=None, comment=None):
        self.revision = revision
        self.author = author
        self.date = date
        self.comment_line_count = comment_line_count
        self.changes = changes
        self.comment = comment

    def get_commit_comment(self):
        return 'svn merge -r' + str(self.revision-1) + ':' + str(self.revision) + ' by ' \
            + self.author + ' with the comment ' + ','.join(self.comment) \
            + ' and the changes ' + ','.join(self.changes)

commits = []
current_commit = None
index = 0
authors = []
while True:
    try:
        # parse each of the commits and put them into a list of commits
        current_commit = Commit()
        details = data[index + 1].split('|')
        current_commit.revision = int(details[0].strip().strip('r'))
        current_commit.author = details[1].strip()
        current_commit.date = details[2].strip()
        current_commit.comment_line_count = int(details[3].strip().split(' ')[0])
        current_commit.changes = data[index+2:data.index('', index+1)]
        #print(current_commit.changes)
        index = data.index(sep, index + 1)
        current_commit.comment = data[index-current_commit.comment_line_count:index]
        commits.append(current_commit)
        if current_commit.author not in authors:
            authors.append(current_commit.author)
    except IndexError:
        break

print(len(commits))
commits.reverse()
#for index, commit in enumerate(commits):
#    print(commit.get_commit_comment())
print(authors)
```
]]>

]]>

A simple calculator and test programme:

```python
import unittest

class Calculator(object):
    def add(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x + y
        else:
            raise ValueError

    def subtract(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x - y
        else:
            raise ValueError

# test the calculator functionality
class TestCalculator(unittest.TestCase):
    def setUp(self):
        self.calc = Calculator()

    # this tests the add functionality
    # 2 + 2 = 4
    # 2 + 4 = 6
    # 2 + (-2) = 0
    def test_calculator_add_method_returns_correct_result(self):
        result = self.calc.add(2, 2)
        self.assertEqual(4, result)
        result = self.calc.add(2, 4)
        self.assertEqual(6, result)
        result = self.calc.add(2, -2)
        self.assertEqual(0, result)

    def test_calculator_subtract_method_returns_correct_result(self):
        result = self.calc.subtract(2, 2)
        self.assertEqual(0, result)
        result = self.calc.subtract(2, 4)
        self.assertEqual(-2, result)
        result = self.calc.subtract(2, -4)
        self.assertEqual(6, result)

    #def test_calculator_returns_error_message_if_both_args_not_numbers(self):
    #    self.assertRaises(ValueError, self.calc.add, 'two', 'three')
    #    self.assertRaises(ValueError, self.calc.subtract, 'two', 'three')

if __name__ == '__main__':
    unittest.main()
```

Addendum to testing calculators and factorials:

```python
import unittest
from factorial import factorial
from factorial import factorial_simple

class Conversion(object):
    def lbs_to_kilos(self, lbs):
        return lbs / 2.2

class Calculator(object):
    def add(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x + y
        else:
            raise ValueError

    def divide(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            if y == 0:
                return 'NaN'
            return x / float(y)
        else:
            raise ValueError

    def exponent(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x ** y
        else:
            raise ValueError

    def multiply(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x * y
        else:
            raise ValueError

    def subtract(self, x, y):
        number_types = (int, long, float, complex)
        if isinstance(x, number_types) and isinstance(y, number_types):
            return x - y
        else:
            raise ValueError

# test the factorial functionality
class TestFactorial(unittest.TestCase):
    # this tests the factorial functionality
    # 5 = 120
    # 6 = 720
    def test_factorial(self):
        result = factorial(5)
        self.assertEqual(120, result)
        result = factorial(6)
        self.assertEqual(720, result)

    # this tests the factorial functionality
    # 5 = 120
    # 6 = 720
    def test_factorial_simple(self):
        result = factorial_simple(5)
        self.assertEqual(120, result)
        result = factorial_simple(6)
        self.assertEqual(720, result)

# test the conversion functionality
class TestConversion(unittest.TestCase):
    def setUp(self):
        self.conversion = Conversion()

    # this tests the conversion functionality
    # 2.2 lbs = 1 kg
    # 22 lbs = 10 kgs
    def test_convert_lbs_to_kilos(self):
        result = self.conversion.lbs_to_kilos(2.2)
        self.assertEqual(1, result)
        result = self.conversion.lbs_to_kilos(22)
        self.assertEqual(10, result)

# test the calculator functionality
class TestCalculator(unittest.TestCase):
    def setUp(self):
        self.calc = Calculator()

    # this tests the add functionality
    # 2 + 2 = 4
    # 2 + 4 = 6
    # 2 + (-2) = 0
    def test_calculator_add_method_returns_correct_result(self):
        result = self.calc.add(2, 2)
        self.assertEqual(4, result)
        result = self.calc.add(2, 4)
        self.assertEqual(6, result)
        result = self.calc.add(2, -2)
        self.assertEqual(0, result)

    def test_calculator_subtract_method_returns_correct_result(self):
        result = self.calc.subtract(2, 2)
        self.assertEqual(0, result)
        result = self.calc.subtract(2, 4)
        self.assertEqual(-2, result)
        result = self.calc.subtract(2, -4)
        self.assertEqual(6, result)

    def test_calculator_returns_error_message_if_both_args_not_numbers(self):
        self.assertRaises(ValueError, self.calc.add, 'two', 'three')
        self.assertRaises(ValueError, self.calc.subtract, 'two', 'three')

    # adding multiplication to my calculator
    def test_calculator_multiply_method_returns_correct_result_for_2_params_the_same(self):
        result = self.calc.multiply(2, 2)
        self.assertEqual(4, result)
        result = self.calc.multiply(2, 4)
        self.assertEqual(8, result)
        result = self.calc.multiply(2, -2)
        self.assertEqual(-4, result)
        result = self.calc.multiply(2, 0)
        self.assertEqual(0, result)

    # adding divide to my calculator
    def test_calculator_divide_method_returns_correct_result_for_2_params_the_same(self):
        result = self.calc.divide(2, 2)
        self.assertEqual(1, result)
        result = self.calc.divide(4, 2)
        self.assertEqual(2, result)
        result = self.calc.divide(2, -2)
        self.assertEqual(-1, result)
        result = self.calc.divide(2, 4)
        self.assertEqual(0.5, result)
        result = self.calc.divide(2, 0)
        self.assertEqual('NaN', result)
        self.assertRaises(ValueError, self.calc.divide, 'two', 'three')
        self.assertRaises(ValueError, self.calc.divide, 'two', 0)
        self.assertRaises(ValueError, self.calc.divide, 2, 'three')

    # adding exponent to my calculator
    def test_calculator_exponent_method_returns_correct_result_for_2_params_the_same(self):
        result = self.calc.exponent(2, 2)
        self.assertEqual(4, result)
        result = self.calc.exponent(2, 4)
        self.assertEqual(16, result)
        result = self.calc.exponent(2, -2)
        self.assertEqual(0.25, result)
        result = self.calc.exponent(2, 0)
        self.assertEqual(1, result)

if __name__ == '__main__':
    unittest.main()
```
]]>

Writing factorial functions in python:

```python
def factorial_simple(n):
    if n == 0:
        return 1
    else:
        return n * factorial_simple(n - 1)

print(factorial_simple(5))

def factorial(n):
    return reduce(lambda x, y: x*y, [1] + range(1, n+1))

print(factorial(5))
```
]]>
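Note that the reduce version is Python 2 only: reduce is no longer a builtin, and [1] + range(...) relies on range returning a list. Under Python 3 the same idea looks like this (and math.factorial is the batteries-included option):

```python
import math
from functools import reduce  # reduce moved into functools in Python 3

def factorial_reduce(n):
    # range is lazy in Python 3, so pass reduce an initial value of 1
    # instead of prepending [1] to a list
    return reduce(lambda x, y: x * y, range(1, n + 1), 1)

print(factorial_reduce(5))  # 120
print(math.factorial(5))    # 120
```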

]]>

]]>

]]>

Look what can be done with Fusion Tables and the Irish Population Data merged with the county boundaries.

See this tutorial on mulinblog.com to create a heat map with fusion tables

]]>For your DBS Data Projects blog which is Assignment 4 you are required to write 5 blog posts during the course.

Select the topics of interest to use as the material for your blog.

This material might be some critical analysis, or thoughts regarding Management Information Systems, the industry, or any of the topics that have appeared on this course, which you would like to blog about. You should also consider including references to information on the Internet and other published information, such as journal articles, company reports, etc… that you discover in your research either online or offline.

Keep saving your posts regularly to prevent losing them.

Blog regularly!

]]>**Welcome to**

** DBS Analytics Society**

Join at the following url:

https://docs.google.com/forms/d/1oxG7f_jFdXYiDRhOFBroviZpXfTkw_igPpClFF8YOZQ/viewform?usp=send_form

**Why are we here?**

- Because We Love Data – isn’t that right? Yes
- Where to begin -> https://www.kaggle.com/
- The home of data science – create an account

- Pick any problem set and try to solve
- 2 biggest gifts
- Our brains
- Our hypothesis

- Practice, practice, practice
- Be meticulous

**What is a Hackathon?**

- https://hackathon.guide/
- Hacking – Creative Problem Solving
- Hackathon – coming together to solve problems
- Parallel track for workshops going forward
- Groups of 2-5 people – dive into the problem
- Positive Energy Only
- Welcome Newcomers
- Learn Something New
- Solve Problems That Interest You
- Imposter Syndrome – You Belong Here – We are all beginners

**Good Hackathons**

- Clearly Articulated Problem Set
- Attainable
- Easy to Onboard Newcomers
- Led By A Stakeholder
- Organized

**Process The Data – what to calculate?**

- Take input files and process
- Turn into the dataset form that we can process
- Refactor and improve
- Test

- Process with any tools available
- http://www.computerworld.com/article/2502891/business-intelligence/business-intelligence-8-cool-tools-for-data-analysis-visualization-and-presentation.html

- Yes Google Search is your friend – lecturer’s hat removed
- Search for Data Analytics Tools

**Is Programming Required**

- In order not to scare you – No
- but then a graphical transformation tool is required if you can’t program

- Realistically when I want to process Big Data – Programming is Required
- Python
- R

- Parallelization for Big Data
- Multithread
- Map Reduce
- In Memory computing / databases

**Helping Santa**

- https://www.kaggle.com/c/santas-stolen-sleigh
- Read the requirements – calculate weighted reindeer weariness
- Unlimited number of trips
- Head back to the north pole after each trip
- Haversine used to calculate distance

- Get the data
- gifts.csv
- sample_submission.csv

- Check out Kaggle solutions / forums / ideas – people help each other

**Excel**

- How many gifts?
- Weight allowed in the sleigh
- Number of trips required?
- Best Case / Worst Case?
- Sample Solution Case? 5000 Buckets with 20 gifts in each bucket
- How to solve better?

**Solving With Python**

- https://www.kaggle.com/wendykan/santas-stolen-sleigh/computing-weighted-reindeer-weariness
- Potential solutions for weighted reindeer algorithm
- https://gist.github.com/rochacbruno/2883505
- Simple Haversine in Python
- Usually python 2.7 – install or run if installed
- Python – because I can
- Data Analytics in python using pandas, NumPy, SciPy
- http://pandas-docs.github.io/pandas-docs-travis/install.html#installing-pandas-with-anaconda
- Install with Anaconda – so need to download that
- https://www.continuum.io/downloads
- https://www.continuum.io/downloads#_macosx

- Create reindeer.py

```python
north_pole = (90, 0)
weight_limit = 1000
sleigh_weight = 10

import pandas as pd
import numpy as np
from haversine import distance

def weighted_trip_length(stops, weights):
    tuples = [tuple(x) for x in stops.values]
    # adding the last trip back to north pole, with just the sleigh weight
    tuples.append(north_pole)
    weights.append(sleigh_weight)

    dist = 0.0
    prev_stop = north_pole
    prev_weight = sum(weights)
    for i in range(len(tuples)):
        dist = dist + distance(tuples[i], prev_stop) * prev_weight
        prev_stop = tuples[i]
        prev_weight = prev_weight - weights[i]
    return dist

def weighted_reindeer_weariness(all_trips):
    uniq_trips = all_trips.TripId.unique()

    if any(all_trips.groupby('TripId').Weight.sum() + sleigh_weight > weight_limit):
        raise Exception("One of the sleighs over weight limit!")

    dist = 0.0
    for t in uniq_trips:
        this_trip = all_trips[all_trips.TripId == t]
        dist = dist + weighted_trip_length(this_trip[['Latitude', 'Longitude']], this_trip.Weight.tolist())
    return dist

gifts = pd.read_csv('gifts.csv')
sample_sub = pd.read_csv('sample_submission.csv')
all_trips = sample_sub.merge(gifts, on='GiftId')
print(weighted_reindeer_weariness(all_trips))
```

**Visualise – See Map Data**

Take the gifts.csv file – apply to Fusion Tables

**reindeer.py**

**The Sample Solution Value – 144525525772**

**Can You Improve on This?**

]]>

]]>

]]>

]]>

Links to the usefulness of information systems:

http://www.nytimes.com/2010/12/28/business/media/28disney.html?smid=pl-share&_r=0

The Beauty of Data Visualisation – Information is Beautiful

]]>
]]>

]]>

]]>

]]>

]]>

]]>

]]>

Files required for lab exercises:

Formulas

Music Charts

Copying

Theatre Ticket

Grades If

Grades If Extra

V-Lookup

Useful Video Excel Tutorials:

http://www.gcflearnfree.org/excel2010/

]]>

]]>

]]>

See this tutorial on mulinblog.com to create a heat map with fusion tables