Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. This was a great Python & SAS course and part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

# Category: Wesleyan University

Wesleyan University, Hartford, Connecticut area

## Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the five cluster solutions.

The elbow curve was fairly conclusive, suggesting a natural 3-cluster solution that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2 below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although they overlapped a little with each other. Clusters 1 and 2 were generally distinct but close to each other. Observations in cluster 0 were more spread out than the other clusters, showing high within-cluster variance, but had no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution would have 3 clusters.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

We can see that the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was driven mainly by petal length and petal width, and the second by sepal length and sepal width.

```python
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.cross_validation import train_test_split
from scipy.spatial.distance import cdist

iris_data = pd.read_csv("iris_data.csv")
iris_data_clean = iris_data.dropna()
iris_data_clean.describe()

iris_cluster = iris_data_clean[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']].copy()

# standardize the clustering variables to mean=0 and sd=1
iris_cluster['sepal_length'] = preprocessing.scale(iris_cluster['sepal_length'].astype('float64'))
iris_cluster['sepal_width'] = preprocessing.scale(iris_cluster['sepal_width'].astype('float64'))
iris_cluster['petal_length'] = preprocessing.scale(iris_cluster['petal_length'].astype('float64'))
iris_cluster['petal_width'] = preprocessing.scale(iris_cluster['petal_width'].astype('float64'))

# 70/30 train/test split
iris_train, iris_test = train_test_split(iris_cluster, test_size=.3, random_state=123)

# k-means for k=1..5, recording the mean distance to the nearest centroid
clusters = range(1, 6)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(iris_train)
    iris_assign = model.predict(iris_train)
    meandist.append(sum(np.min(cdist(iris_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / iris_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(iris_train)
iris_assign = model3.predict(iris_train)

# reduce the 4 clustering variables to 2 principal components for plotting
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(iris_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
```

## Developing a Research Question

While trying to buy a house in Dublin I realised I had no way of knowing whether I was paying a fair price for a house, getting it for a great price, or over-paying. The data scientist in me would like to develop an algorithm, a hypothesis, a research question, so that my decisions are based on sound science and not on gut instinct. For the last couple of weeks I have been developing algorithms to determine this fair price value. So my research question is:

**Is house sales price associated with socio-economic location?**

I stumbled upon similar research by Dean De Cock determining house prices for Ames, Iowa, so that is the data set I will use. See the Kaggle page House Prices: Advanced Regression Techniques to get the data.

I would like to study the association between neighborhood (location) and house price: does location influence the sale price, and is the difference in means between different locations significant?

This dataset has 79 independent variables, with sale price as the dependent variable. Initially I am focusing on only one independent variable, the neighborhood, so I can reduce the dataset down to two variables to simplify the computation my analysis of variance needs to perform.

Now that I have determined I am going to study location, I decide I might further want to look at bands of house size: not just the raw square footage, but categories of square footage (less than 1000, 1000 to 1250, 1250 to 1500, and greater than 1500 square feet) to see if there is variance in the mean among these categories.
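As a sketch of how those size bands could be built, `pandas.cut` does the binning; the `areas` series here is hypothetical stand-in data for the dataset's above ground living area column:

```python
import pandas as pd

# Hypothetical above-ground living areas in square feet
areas = pd.Series([850, 1100, 1300, 1600, 2400])

# Band square footage into the four categories described above;
# pd.cut makes the right edge of each interval inclusive by default
bands = pd.cut(areas,
               bins=[0, 1000, 1250, 1500, float('inf')],
               labels=['<1000', '1000-1250', '1250-1500', '>1500'])
print(bands.tolist())
# → ['<1000', '1000-1250', '1250-1500', '>1500', '>1500']
```

The banded variable can then be used as a categorical factor in the analysis of variance.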

I can now take the above ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage for the first floor, second floor, basement, etc.

I then searched Google Scholar, Kaggle, and the DBS library for previous studies in these areas, finding a paper from 2001 discussing previous research in Dublin; however, it was written in 2001, when a bubble was about to begin and the big property crash of 2008 had not been conceived of. http://www.sciencedirect.com/science/article/pii/S0264999300000407

Secondly, Dean De Cock’s research on house prices in Iowa: http://ww2.amstat.org/publications/jse/v19n3/decock.pdf

Based on my literature review I believe that there might be a statistically significant association between house location (neighborhood) and sales price. Secondarily, I believe there will be a statistically significant association between size bands (square footage bands) and sales price. I further believe there might be an interaction effect between location and square footage bands on sales price, which I would like to investigate too.

So I have developed three null hypotheses:

* There is **NO** association between location and sales price

* There is **NO** association between bands of square footage and sales price

* There is **NO** interaction effect between location and bands of square footage in their association with sales price.

## Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify a subset of variables, from a pool of 79 categorical and quantitative predictor variables, that best predicted a quantitative response variable measuring Ames, Iowa house sale price, and to improve interpretability of the selected model with fewer predictors. Categorical predictors included house type, neighbourhood, and zoning type. Quantitative predictor variables included lot area, above ground living area, first floor area, and second floor area. Counts were used for the number of bathrooms and number of bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step:

Of the 33 predictor variables, 13 were retained in the selected model, with overall quality, above ground floor space, and garage capacity being the 3 main variables during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better on the test set at 81%.

```python
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
import os

data = pd.read_csv("iowa_house_data.csv")

# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
print(data.columns)

data_clean = data

# select predictor variables and target variable as separate data sets
predvar = data_clean[['GRLIVAREA', 'LOTAREA', 'YEARBUILT', 'FIREPLACES',
                      'OVERALLQUAL', 'OVERALLCOND', 'TOTRMSABVGRD', 'YEARREMODADD',
                      '1STFLRSF', '2NDFLRSF', 'YRSOLD', 'BSMTFINSF1', 'BSMTFINSF2',
                      'BSMTUNFSF', 'TOTALBSMTSF', 'MSSUBCLASS', 'MISCVAL', 'MOSOLD',
                      'GARAGECARS', 'GARAGEAREA', 'WOODDECKSF', 'OPENPORCHSF',
                      'ENCLOSEDPORCH', '3SSNPORCH', 'SCREENPORCH', 'POOLAREA',
                      'LOWQUALFINSF', 'BSMTFULLBATH', 'BSMTHALFBATH', 'FULLBATH',
                      'HALFBATH', 'BEDROOMABVGR', 'KITCHENABVGR']]

target = data_clean.SALEPRICE

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
for k in predvar.columns:
    predictors[k] = preprocessing.scale(predictors[k].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients, sorted by absolute size
var_imp = pd.DataFrame(data={'predictors': list(predictors.columns.values),
                             'coefficients': model.coef_})
var_imp['sort'] = var_imp.coefficients.abs()
print(var_imp.sort_values(by='sort', ascending=False))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean squared error on each cross-validation fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
```

```
    coefficients     predictors  sort
4           0.36    OVERALLQUAL  0.36
0           0.26      GRLIVAREA  0.26
18          0.12     GARAGECARS  0.12
11          0.07     BSMTFINSF1  0.07
2           0.07      YEARBUILT  0.07
7           0.05   YEARREMODADD  0.05
8           0.05       1STFLRSF  0.05
15         -0.04     MSSUBCLASS  0.04
3           0.04     FIREPLACES  0.04
14          0.04    TOTALBSMTSF  0.04
20          0.02     WOODDECKSF  0.02
27          0.01   BSMTFULLBATH  0.01
1           0.01        LOTAREA  0.01
24          0.00    SCREENPORCH  0.00
25          0.00       POOLAREA  0.00
26          0.00   LOWQUALFINSF  0.00
31          0.00   BEDROOMABVGR  0.00
22          0.00  ENCLOSEDPORCH  0.00
28          0.00   BSMTHALFBATH  0.00
29          0.00       FULLBATH  0.00
30          0.00       HALFBATH  0.00
23          0.00      3SSNPORCH  0.00
16          0.00        MISCVAL  0.00
21          0.00    OPENPORCHSF  0.00
19          0.00     GARAGEAREA  0.00
17          0.00         MOSOLD  0.00
13          0.00      BSMTUNFSF  0.00
12          0.00     BSMTFINSF2  0.00
10          0.00         YRSOLD  0.00
9           0.00       2NDFLRSF  0.00
6           0.00   TOTRMSABVGRD  0.00
5           0.00    OVERALLCOND  0.00
32          0.00   KITCHENABVGR  0.00
```

```
training data R-square 0.777169556607
test data R-square 0.81016173881
```
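The code block above doesn't show the R-square step itself; the figures were presumably computed with scikit-learn's `r2_score` on the training and test sets. A self-contained sketch on toy data (note that newer scikit-learn versions import `train_test_split` from `sklearn.model_selection` rather than `sklearn.cross_validation`):

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Toy data standing in for the standardized Ames predictors
rng = np.random.RandomState(123)
X = rng.randn(200, 5)
y = X[:, 0] * 3 + X[:, 1] * 2 + rng.randn(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=123)
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# R-square on training and test sets, as reported above
rsquared_train = r2_score(y_train, model.predict(X_train))
rsquared_test = r2_score(y_test, model.predict(X_test))
print('training data R-square', rsquared_train)
print('test data R-square', rsquared_test)
```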

## Wesleyan’s Regression Modeling in Practice – Week 2

Continuing on with the Kaggle data set from House Prices: Advanced Regression Techniques I plan to make a very simple linear regression model to see if house sale price (response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and the ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

```python
import numpy
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# call in data set
data = pandas.read_csv('homes_train.csv')
print(data['SalePrice'].describe())
```

```
count      1460.00
mean     180921.20
std       79442.50
min       34900.00
25%      129975.00
50%      163000.00
75%      214000.00
max      755000.00
Name: SalePrice, dtype: float64
```

There is quite a sizable difference between the mean and median – almost 18,000, or just under 10% of our mean.
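That gap can be read straight off the `describe()` output above:

```python
# Mean and median taken from the describe() output above
mean, median = 180921.20, 163000.00

gap = mean - median
print(gap)         # roughly 17921
print(gap / mean)  # roughly 0.099, just under 10% of the mean
```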

So we can center the variables as follows:

```python
# center (but do not rescale) the variables
data['GrLivArea'] = preprocessing.scale(data['GrLivArea'], with_mean=True, with_std=False)
data['SalePrice'] = preprocessing.scale(data['SalePrice'], with_mean=True, with_std=False)
print(data['GrLivArea'].mean())
print(data['SalePrice'].mean())

# convert variables to numeric format
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')

# view the centering
data['SalePrice'].diff().hist()

# BASIC LINEAR REGRESSION
scat1 = seaborn.regplot(x="SalePrice", y="GrLivArea", scatter=True, data=data)
plt.xlabel('Sale Price')
plt.ylabel('Ground Living Area')
plt.title('Scatterplot for the Association Between Sale Price and Ground Living Area')
print(scat1)
```

```python
print("OLS regression model for the association between sale price and ground living area")
reg1 = smf.ols('SalePrice ~ GrLivArea', data=data).fit()
print(reg1.summary())
```

```
OLS regression model for the association between sale price and ground living area
                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.502
Model:                            OLS   Adj. R-squared:                  0.502
Method:                 Least Squares   F-statistic:                     1471.
Date:                Mon, 03 Oct 2016   Prob (F-statistic):          4.52e-223
Time:                        00:13:00   Log-Likelihood:                -18035.
No. Observations:                1460   AIC:                         3.607e+04
Df Residuals:                    1458   BIC:                         3.608e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   1.857e+04   4480.755      4.144      0.000    9779.612  2.74e+04
GrLivArea    107.1304      2.794     38.348      0.000     101.650   112.610
==============================================================================
Omnibus:                      261.166   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3432.287
Skew:                           0.410   Prob(JB):                         0.00
Kurtosis:                      10.467   Cond. No.                     4.90e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 4.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```

Looking at the graphs and summary statistics, my hypothesis holds up better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space; the alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable, both my R-squared and adjusted R-squared are .502, so a little over 50% of the variance in sale price is explained with just one explanatory variable.

My p-value of 4.52e-223 is far less than .05, so the model shows a significant linear relationship between sale price and ground floor living area; I can reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space coefficient (p-value < 2e-16) appear to contribute to the significance, with both p-values 0.000 to 3 decimal places, and the positive coefficient indicates a positive linear relationship.

From the graph, the sale price data appears to be skewed: the median residual is -1124 rather than zero (where we’d like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).

Note the linear regression can also be done in R as follows:

```r
house = read.csv('train.csv')

house_model = lm(house$SalePrice ~ house$GrLivArea, house)
summary(house_model)

plot(house$GrLivArea, house$SalePrice)
hist(house$SalePrice)
shapiro.test(house$SalePrice)

## Plot using a qqplot
qqnorm(house$SalePrice)
qqline(house$SalePrice, col = 2)
```

```
Call:
lm(formula = house$SalePrice ~ house$GrLivArea, data = house)

Residuals:
    Min      1Q  Median      3Q     Max
-462999  -29800   -1124   21957  339832

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
house$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021,    Adjusted R-squared:  0.5018
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
```

To improve the performance of my model I now need to look at adding multiple explanatory variables, which will be done in next week’s blog post.

## Wesleyan’s Machine Learning for Data Analysis Week 2

Week 2’s assignment for this machine learning for data analysis course, delivered by Wesleyan University (Hartford, Connecticut area) in conjunction with Coursera, was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width.

Using the Spyder IDE via Anaconda Navigator, I began by importing the necessary Python libraries:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
```

Now load our Iris dataset of 150 rows of 5 variables:

```python
# Load the iris dataset
iris = pd.read_csv("iris.csv")

# or if not on file could call this.
#iris = datasets.load_iris()
```

Now we begin our modelling and prediction. We define our predictors and target as follows:

```python
predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name
```

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

```python
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets,
                                                              test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
```

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the random forest classifier class to do this.

```python
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
```

Finally we make our predictions on our test data set and verify the accuracy.

```python
predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
```

```
Out[1]: 0.94999999999999996
```

Next we figure out the relative importance of each of the attributes:

```python
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
print(model.feature_importances_)
```

```
[ 0.09603246  0.06664688  0.40937484  0.42794582]
```

Finally displaying the performance of the random forest was achieved with the following:

```python
trees = range(25)
accuracy = np.zeros(25)

# grow forests of 1..25 trees, recording test-set accuracy for each
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
```

And the resulting accuracy plot was output:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable, the type of iris. The following explanatory variables were included as possible contributors: petal width, petal length, sepal width, and sepal length.

The explanatory variables with the highest relative importance scores were petal width (42.8%), petal length (40.9%), sepal length (9.6%), and finally sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees rather than a single tree adding little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.
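Since a single tree may suffice, one can fit and print a shallow decision tree directly. A sketch using scikit-learn's bundled iris data and `export_text` (available in newer scikit-learn versions):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A single shallow decision tree is easy to read and interpret
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the tree's split rules as indented text
print(export_text(tree, feature_names=list(iris.feature_names)))
```

The printed rules make it clear which petal and sepal thresholds separate the three species, something a 25-tree forest cannot show directly.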

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

## Wesleyan’s Regression Modeling in Practice – Week 1

For Wesleyan’s Regression Modeling in Practice week 1 assignment I am required to write up the sample, the procedure, and the measures section of a classical research paper. I’ve been trying to decide recently whether to move house or not: stay in the current house, sell the current house, move to another house, stay in the same area, or move areas. So many decisions, so much choice, so I want to do some regression modeling to help me with this decision. On kaggle.com I found an interesting problem and decided to write it up as my research data set for this assignment – House Prices: Advanced Regression Techniques.

### Sample

The sample is taken from the Ames Assessor’s Office records used in computing assessed values for individual residential properties sold in Ames, Iowa from 2006 to 2010. Participants (N=2930) represented individual residential property sales in the Ames area.

The data analytic sample for this study included participants who had sold an individual residential property; if a home was sold multiple times in the 5-year period, only the most recent sale was included (N=1,320).
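A sketch of that filtering step in pandas, on hypothetical sales records (`PID` and `YrSold` mirror the Ames column names for parcel id and year sold, but the rows here are invented):

```python
import pandas as pd

# Hypothetical sales records: parcel id, year sold, price
sales = pd.DataFrame({
    'PID': [101, 102, 101, 103, 102],
    'YrSold': [2006, 2007, 2009, 2008, 2010],
    'SalePrice': [150000, 200000, 165000, 180000, 210000],
})

# Keep only the most recent sale of each property:
# sort by year, then keep the last (latest) row per parcel id
latest = (sales.sort_values('YrSold')
               .drop_duplicates('PID', keep='last')
               .sort_index())
print(latest)
```

Properties 101 and 102 each appear twice, so only their 2009 and 2010 sales survive the filter.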

### Procedure

Data were collected by trained Ames Assessor’s Office representatives during 2006–2010 through computer-assisted personal interviews (CAPI). At the time of sale, one party involved in the sale of the property would be contacted and the required variables were collected by way of questions in an interview in respondents’ homes, following informed consent procedures.

### Measures

The house sale price was assessed using 79 variables, starting with the type of dwelling involved in the sale (16 different types of dwellings were found) and the zoning of the house, with its 8 types of zones.

20 continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set. Area measurements on the basement, main living area, and even porches are broken down into individual categories based on quality and type.

14 discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home. Additionally, the garage capacity and construction/remodeling dates are also recorded.

There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes, with the smallest being STREET (gravel or paved) and the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions, while the ordinal variables typically rate various items within the property.

**Dependent Variable:** Sale Price – the price the house sold for.

**Independent Variables:**

```
MSSubClass: Identifies the type of dwelling involved in the sale.

    20   1-STORY 1946 & NEWER ALL STYLES
    30   1-STORY 1945 & OLDER
    40   1-STORY W/FINISHED ATTIC ALL AGES
    45   1-1/2 STORY - UNFINISHED ALL AGES
    50   1-1/2 STORY FINISHED ALL AGES
    60   2-STORY 1946 & NEWER
    70   2-STORY 1945 & OLDER
    75   2-1/2 STORY ALL AGES
    80   SPLIT OR MULTI-LEVEL
    85   SPLIT FOYER
    90   DUPLEX - ALL STYLES AND AGES
    120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
    150  1-1/2 STORY PUD - ALL AGES
    160  2-STORY PUD - 1946 & NEWER
    180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
    190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

    A    Agriculture
    C    Commercial
    FV   Floating Village Residential
    I    Industrial
    RH   Residential High Density
    RL   Residential Low Density
    RP   Residential Low Density Park
    RM   Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

    Grvl Gravel
    Pave Paved

Alley: Type of alley access to property

    Grvl Gravel
    Pave Paved
    NA   No alley access

LotShape: General shape of property

    Reg  Regular
    IR1  Slightly irregular
    IR2  Moderately Irregular
    IR3  Irregular

LandContour: Flatness of the property

    Lvl  Near Flat/Level
    Bnk  Banked - Quick and significant rise from street grade to building
    HLS  Hillside - Significant slope from side to side
    Low  Depression

Utilities: Type of utilities available

    AllPub All public Utilities (E,G,W,& S)
    NoSewr Electricity, Gas, and Water (Septic Tank)
    NoSeWa Electricity and Gas Only
    ELO    Electricity only

LotConfig: Lot configuration

    Inside  Inside lot
    Corner  Corner lot
    CulDSac Cul-de-sac
    FR2     Frontage on 2 sides of property
    FR3     Frontage on 3 sides of property

LandSlope: Slope of property

    Gtl  Gentle slope
    Mod  Moderate Slope
    Sev  Severe Slope

Neighborhood: Physical locations within Ames city limits

    Blmngtn Bloomington Heights
    Blueste Bluestem
    BrDale  Briardale
    BrkSide Brookside
    ClearCr Clear Creek
    CollgCr College Creek
    Crawfor Crawford
    Edwards Edwards
    Gilbert Gilbert
    IDOTRR  Iowa DOT and Rail Road
    MeadowV Meadow Village
    Mitchel Mitchell
    Names   North Ames
    NoRidge Northridge
    NPkVill Northpark Villa
    NridgHt Northridge Heights
    NWAmes  Northwest Ames
    OldTown Old Town
    SWISU   South & West of Iowa State University
    Sawyer  Sawyer
    SawyerW Sawyer West
    Somerst Somerset
    StoneBr Stone Brook
    Timber  Timberland
    Veenker Veenker

Condition1: Proximity to various conditions

    Artery Adjacent to arterial street
    Feedr  Adjacent to feeder street
    Norm   Normal
    RRNn   Within 200' of North-South Railroad
    RRAn   Adjacent to North-South Railroad
    PosN   Near positive off-site feature--park, greenbelt, etc.
    PosA   Adjacent to positive off-site feature
    RRNe   Within 200' of East-West Railroad
    RRAe   Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

    Artery Adjacent to arterial street
    Feedr  Adjacent to feeder street
    Norm   Normal
    RRNn   Within 200' of North-South Railroad
    RRAn   Adjacent to North-South Railroad
    PosN   Near positive off-site feature--park, greenbelt, etc.
    PosA   Adjacent to positive off-site feature
    RRNe   Within 200' of East-West Railroad
    RRAe   Adjacent to East-West Railroad

BldgType: Type of dwelling

    1Fam   Single-family Detached
    2FmCon Two-family Conversion; originally built as one-family dwelling
    Duplx  Duplex
    TwnhsE Townhouse End Unit
    TwnhsI Townhouse Inside Unit

HouseStyle: Style of dwelling

    1Story One story
    1.5Fin One and one-half story: 2nd level finished
    1.5Unf One and one-half story: 2nd level unfinished
    2Story Two story
    2.5Fin Two and one-half story: 2nd level finished
    2.5Unf Two and one-half story: 2nd level unfinished
    SFoyer Split Foyer
    SLvl   Split Level

OverallQual: Rates the overall material and finish of the house

    10   Very Excellent
    9    Excellent
    8    Very Good
    7    Good
    6    Above Average
    5    Average
    4    Below Average
    3    Fair
    2    Poor
    1    Very Poor

OverallCond: Rates the overall condition of the house

    10   Very Excellent
    9    Excellent
    8    Very Good
    7    Good
    6    Above Average
    5    Average
    4    Below Average
    3    Fair
    2    Poor
    1    Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

    Flat    Flat
    Gable   Gable
    Gambrel Gambrel (Barn)
    Hip     Hip
    Mansard Mansard
    Shed    Shed

RoofMatl: Roof material

    ClyTile Clay or Tile
    CompShg Standard (Composite) Shingle
    Membran Membrane
    Metal   Metal
    Roll    Roll
    Tar&Grv Gravel & Tar
    WdShake Wood Shakes
    WdShngl Wood Shingles

Exterior1st: Exterior covering on house

    AsbShng Asbestos Shingles
    AsphShn Asphalt Shingles
    BrkComm Brick Common
    BrkFace Brick Face
    CBlock  Cinder Block
    CemntBd Cement Board
    HdBoard Hard Board
    ImStucc Imitation Stucco
    MetalSd Metal Siding
    Other   Other
    Plywood Plywood
    PreCast PreCast
    Stone   Stone
    Stucco  Stucco
    VinylSd Vinyl Siding
    Wd Sdng Wood Siding
    WdShing Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

    AsbShng Asbestos Shingles
    AsphShn Asphalt Shingles
    BrkComm Brick Common
    BrkFace Brick Face
    CBlock  Cinder Block
    CemntBd Cement Board
    HdBoard Hard Board
    ImStucc Imitation Stucco
    MetalSd Metal Siding
    Other   Other
    Plywood Plywood
    PreCast PreCast
    Stone   Stone
    Stucco  Stucco
    VinylSd Vinyl Siding
    Wd Sdng Wood Siding
    WdShing Wood Shingles

MasVnrType: Masonry veneer type

    BrkCmn  Brick Common
    BrkFace Brick Face
    CBlock  Cinder Block
    None    None
    Stone   Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

    Ex   Excellent
    Gd   Good
    TA   Average/Typical
    Fa   Fair
    Po   Poor

ExterCond: Evaluates the present condition of the material on the exterior

    Ex   Excellent
    Gd   Good
    TA   Average/Typical
    Fa   Fair
    Po   Poor

Foundation: Type of foundation

    BrkTil Brick & Tile
    CBlock Cinder Block
    PConc  Poured Concrete
    Slab   Slab
    Stone  Stone
    Wood   Wood

BsmtQual: Evaluates the height of the basement

    Ex   Excellent (100+ inches)
    Gd   Good (90-99 inches)
    TA   Typical (80-89 inches)
    Fa   Fair (70-79 inches)
    Po   Poor (<70 inches)
    NA   No Basement

BsmtCond: Evaluates the general condition of the basement

    Ex   Excellent
    Gd   Good
    TA   Typical - slight dampness allowed
    Fa   Fair - dampness or some cracking or settling
    Po   Poor - Severe cracking, settling, or wetness
    NA   No Basement

BsmtExposure: Refers to walkout or garden level walls

    Gd   Good Exposure
    Av   Average Exposure (split levels or foyers typically score average or above)
    Mn   Minimum Exposure
    No   No Exposure
    NA   No Basement

BsmtFinType1: Rating of basement finished area

    GLQ  Good Living Quarters
    ALQ  Average Living Quarters
    BLQ  Below Average Living Quarters
    Rec  Average Rec Room
    LwQ  Low Quality
    Unf  Unfinished
    NA   No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

    GLQ  Good Living Quarters
    ALQ  Average Living Quarters
    BLQ  Below Average Living Quarters
    Rec  Average Rec Room
    LwQ  Low Quality
    Unf  Unfinished
    NA   No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

    Floor Floor Furnace
    GasA  Gas forced warm air furnace
    GasW  Gas hot water or steam heat
```
Grav Gravity furnace OthW Hot water or steam heat other than gas Wall Wall furnace HeatingQC: Heating quality and condition Ex Excellent Gd Good TA Average/Typical Fa Fair Po Poor CentralAir: Central air conditioning N No Y Yes Electrical: Electrical system SBrkr Standard Circuit Breakers & Romex FuseA Fuse Box over 60 AMP and all Romex wiring (Average) FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) Mix Mixed 1stFlrSF: First Floor square feet 2ndFlrSF: Second floor square feet LowQualFinSF: Low quality finished square feet (all floors) GrLivArea: Above grade (ground) living area square feet BsmtFullBath: Basement full bathrooms BsmtHalfBath: Basement half bathrooms FullBath: Full bathrooms above grade HalfBath: Half baths above grade Bedroom: Bedrooms above grade (does NOT include basement bedrooms) Kitchen: Kitchens above grade KitchenQual: Kitchen quality Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) Functional: Home functionality (Assume typical unless deductions are warranted) Typ Typical Functionality Min1 Minor Deductions 1 Min2 Minor Deductions 2 Mod Moderate Deductions Maj1 Major Deductions 1 Maj2 Major Deductions 2 Sev Severely Damaged Sal Salvage only Fireplaces: Number of fireplaces FireplaceQu: Fireplace quality Ex Excellent - Exceptional Masonry Fireplace Gd Good - Masonry Fireplace in main level TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement Fa Fair - Prefabricated Fireplace in basement Po Poor - Ben Franklin Stove NA No Fireplace GarageType: Garage location 2Types More than one type of garage Attchd Attached to home Basment Basement Garage BuiltIn Built-In (Garage part of house - typically has room above garage) CarPort Car Port Detchd Detached from home NA No Garage GarageYrBlt: Year garage was built GarageFinish: Interior finish of the garage Fin Finished RFn Rough 
Finished Unf Unfinished NA No Garage GarageCars: Size of garage in car capacity GarageArea: Size of garage in square feet GarageQual: Garage quality Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage GarageCond: Garage condition Ex Excellent Gd Good TA Typical/Average Fa Fair Po Poor NA No Garage PavedDrive: Paved driveway Y Paved P Partial Pavement N Dirt/Gravel WoodDeckSF: Wood deck area in square feet OpenPorchSF: Open porch area in square feet EnclosedPorch: Enclosed porch area in square feet 3SsnPorch: Three season porch area in square feet ScreenPorch: Screen porch area in square feet PoolArea: Pool area in square feet PoolQC: Pool quality Ex Excellent Gd Good TA Average/Typical Fa Fair NA No Pool Fence: Fence quality GdPrv Good Privacy MnPrv Minimum Privacy GdWo Good Wood MnWw Minimum Wood/Wire NA No Fence MiscFeature: Miscellaneous feature not covered in other categories Elev Elevator Gar2 2nd Garage (if not described in garage section) Othr Other Shed Shed (over 100 SF) TenC Tennis Court NA None MiscVal: $Value of miscellaneous feature MoSold: Month Sold (MM) YrSold: Year Sold (YYYY) SaleType: Type of sale WD Warranty Deed - Conventional CWD Warranty Deed - Cash VWD Warranty Deed - VA Loan New Home just constructed and sold COD Court Officer Deed/Estate Con Contract 15% Down payment regular terms ConLw Contract Low Down payment and low interest ConLI Contract Low Interest ConLD Contract Low Down Oth Other SaleCondition: Condition of sale Normal Normal Sale Abnorml Abnormal Sale - trade, foreclosure, short sale AdjLand Adjoining Land Purchase Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit Family Sale between family members Partial Home was not completed when last assessed (associated with New Homes) |
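Several of the quality fields in this dictionary (ExterQual, ExterCond, HeatingQC, KitchenQual, PoolQC) share the same Ex/Gd/TA/Fa/Po scale, so a single ordinal mapping can encode them all before modelling. A minimal pandas sketch, using a small hand-made DataFrame in place of Kaggle's actual train.csv:

```python
import pandas as pd

# One ordinal mapping covers every field that uses the Ex/Gd/TA/Fa/Po scale.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Tiny stand-in for the Kaggle training data.
houses = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "KitchenQual": ["TA", "TA", "Gd"],
})

# Map each categorical code to its ordinal value in a new column.
for col in ["ExterQual", "KitchenQual"]:
    houses[col + "_ord"] = houses[col].map(quality_scale)

print(houses)
```

The same pattern extends to the other coded fields, with a dictionary per scale taken from the documentation above.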

### References

Kaggle’s House Prices: Advanced Regression Techniques

Ames Assessor’s Original Publication

Data Documentation

## Wesleyan’s Machine Learning for Data Analysis Week 1

Week 1’s assignment for this Machine Learning for Data Analysis course, delivered by Wesleyan University, Hartford, Connecticut in conjunction with Coursera, was to build a decision tree to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I chose Fisher’s Iris data set, comprising 3 species of iris (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python as I have been programming in it for over 10 years.

pandas, scikit-learn, NumPy, and the Spyder IDE were also used, with Anaconda being instrumental in setting everything up.

```shell
conda update conda
conda update anaconda
conda install seaborn
conda update qt pyqt
conda install spyder
pip install graphviz
pip install pydotplus
brew install graphviz   # macOS: native Graphviz binaries for rendering the tree
```

I started up the Spyder IDE via Anaconda Navigator and then imported the necessary Python libraries:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn import datasets
# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
```

Now we load our Iris dataset of 150 rows and 5 variables:

```python
# Load the iris dataset
iris = pd.read_csv("iris.csv")
# or, if not on file, we could use the bundled copy:
#iris = datasets.load_iris()

# there should be no NAs - for performance we probably don't have to do this
iris = iris.dropna()

iris.dtypes
iris.describe()

print("head", iris.head(), sep="\n", end="\n\n")
print("tail", iris.tail(), sep="\n", end="\n\n")
print("types", iris["Name"].unique(), sep="\n")
```

Leading to the output:

```
head
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

tail
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

types
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
```

Now we begin our modelling and prediction. We define our predictors and target as follows:

```python
predictors = iris[['SepalLength','SepalWidth','PetalLength','PetalWidth']]
targets = iris.Name
```

Next we split our data into training and test datasets with a 60%/40% split:

```python
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
```

This gives a training set of 90 observations and a test set of 60.

Now it is time to build our classification model, using the DecisionTreeClassifier class.

```python
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
```

Finally we make our predictions on our test data set and verify the accuracy.

```python
predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
```

```
Out[1]: 0.96666666666666667
```
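Beyond the single accuracy figure, classification_report (imported earlier but not used above) breaks performance down into per-class precision and recall, which shows which species the tree confuses. A self-contained sketch, using scikit-learn's bundled iris data in place of the CSV above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Fit a tree on a 60/40 split, then report precision/recall per species.
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=iris.target_names))
```

Setosa is typically classified perfectly, with any errors falling between versicolor and virginica, matching the structure of the tree discussed below.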

I ran the above code (splitting the training and test datasets, building the model, making the predictions, and testing the accuracy) another 14 times in a loop and got accuracy scores ranging from 84.3% to 100%, so a generated model might have the potential to be overfitted. However, the mean of these values is 0.942 with a standard deviation of 0.04, so the scores do not deviate much from the mean.

```
Out[2]: 0.94999999999999996
Out[3]: 1.0
Out[4]: 0.96666666666666667
Out[5]: 0.94999999999999996
Out[6]: 0.8433333333333333
Out[7]: 0.93333333333333335
Out[8]: 0.90000000000000002
Out[9]: 0.94999999999999996
Out[10]: 0.96666666666666667
Out[11]: 0.91666666666666663
Out[12]: 0.98333333333333328
Out[13]: 0.8833333333333333
Out[14]: 0.94999999999999996
Out[15]: 0.96666666666666667
```
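The repeated split/fit/score runs can be written as an explicit loop. A sketch using scikit-learn's bundled iris data and fixed random seeds so it is reproducible (the exact scores will differ from the interactive runs shown above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
scores = []
for seed in range(15):
    # Each iteration reshuffles the 60/40 split, so the spread of
    # scores shows how sensitive the tree is to the particular split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print("mean %.3f, std %.3f" % (np.mean(scores), np.std(scores)))
```

A tight spread around a high mean, as reported above, suggests the model generalises reasonably despite the variability between individual splits.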

Finally, displaying the tree was achieved with the following:

```python
# Displaying the decision tree
from sklearn import tree
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
```

And the tree was output:

The petal length (X[2]) was the first variable to separate the sample into two subgroups. Irises with a petal length of less than or equal to 2.45 formed a group of their own: the setosa, with all 32 in the training sample identified as this group. The next separating variable was the petal width (X[3]), at values of less than or equal to 1.75. This separates the versicolor and virginica categories very well; only 3 of the remaining 58 were not categorised correctly (2 of the virginica and 1 of the versicolor). The next decision is on petal length again (X[2] <= 5.45): the left-hand branch resolves the virginica over two more decisions, the majority with petal length less than or equal to 4.95 and the remaining 2 with petal width > 1.55. Meanwhile, in the right branch, all but one of the versicolor are categorised based on petal length > 4.85. The last decision, separating 1 versicolor from 1 virginica, is made on sepal length (X[0]): the virginica has sepal length <= 6.05, and the last versicolor has sepal length > 6.05.
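Decoding the X[i] labels by hand can be avoided: export_graphviz accepts feature_names and class_names, so nodes are labelled directly with the variable and species names. A small sketch on scikit-learn's bundled iris data (the feature name strings here match the CSV columns used earlier):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# out_file=None returns the DOT source as a string; feature_names and
# class_names label splits like "PetalLength <= 2.45" instead of X[2].
dot = export_graphviz(
    clf, out_file=None,
    feature_names=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"],
    class_names=["setosa", "versicolor", "virginica"],
    filled=True)
print(dot[:200])
```

The resulting DOT string can be passed to pydotplus.graph_from_dot_data exactly as in the display code above.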

So our model seems to be performing very well at categorising the iris flowers based on the variables available to us.