Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. This was a great Python & SAS course and part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the weekly practicals are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

# Category: Machine Learning for Data Analysis

## Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative, and all were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the five cluster solutions.

The elbow curve was fairly conclusive, suggesting a natural 3-cluster solution. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance; the two clusters were generally distinct but close to each other, overlapping a little. Observations in cluster 0 were more spread out than in the other clusters, showing higher within-cluster variance, but with no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution has 3 clusters.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

We can see that the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was driven mainly by petal length and petal width, and secondarily by sepal length and sepal width.
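That reading of the components can be sanity-checked: the loadings of each principal component on the input variables are available from the fitted PCA object. A minimal, hypothetical sketch (using sklearn’s built-in iris data rather than the CSV used in the assignment, so it runs stand-alone):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# standardize the four measurements, as in the cluster analysis
# columns: sepal length, sepal width, petal length, petal width
X = scale(load_iris().data)

pca = PCA(2).fit(X)

# each row of components_ holds one component's loadings on the four variables
for i, row in enumerate(pca.components_):
    print("component", i + 1, np.round(row, 2))
```

Large absolute loadings on the petal columns in the first row would confirm that the first component is dominated by the petal measurements.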

```python
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from scipy.spatial.distance import cdist
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.decomposition import PCA

# the file has no header row, so supply the column names
iris_data = pd.read_csv("iris_data.csv", header=None,
                        names=['sepal_length', 'sepal_width',
                               'petal_length', 'petal_width', 'species'])
iris_data_clean = iris_data.dropna()
iris_data_clean.describe()

iris_cluster = iris_data_clean[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']].copy()

# standardize clustering variables to mean=0, sd=1
for col in iris_cluster.columns:
    iris_cluster[col] = preprocessing.scale(iris_cluster[col].astype('float64'))

iris_train, iris_test = train_test_split(iris_cluster, test_size=.3, random_state=123)

# k-means for k=1..5, recording the average distance to the nearest centroid
clusters = range(1, 6)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(iris_train)
    meandist.append(sum(np.min(cdist(iris_train, model.cluster_centers_, 'euclidean'),
                               axis=1)) / iris_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(iris_train)
iris_assign = model3.predict(iris_train)

# reduce the 4 clustering variables to 2 principal components for plotting
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(iris_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
```

## Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify, from the Ames, Iowa housing data’s pool of 79 categorical and quantitative predictor variables, the subset that best predicted a quantitative response variable measuring house sale price, improving the interpretability of the selected model by using fewer predictors. Categorical predictors in the data set include house type, neighbourhood, and zoning type; the quantitative predictors used here include lot area, above ground living area, first floor area, and second floor area, along with counts of bathrooms and bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step:

Of the 33 predictor variables, 13 were retained in the selected model, with overall quality, above ground living area, and garage cars emerging as the 3 most important during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better on the test set at 81%.

```python
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
import os

data = pd.read_csv("iowa_house_data.csv")

# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
print(data.columns)

data_clean = data

# select predictor variables and target variable as separate data sets
predvar = data_clean[['GRLIVAREA', 'LOTAREA', 'YEARBUILT', 'FIREPLACES', 'OVERALLQUAL',
                      'OVERALLCOND', 'TOTRMSABVGRD', 'YEARREMODADD', '1STFLRSF',
                      '2NDFLRSF', 'YRSOLD', 'BSMTFINSF1', 'BSMTFINSF2', 'BSMTUNFSF',
                      'TOTALBSMTSF', 'MSSUBCLASS', 'MISCVAL', 'MOSOLD', 'GARAGECARS',
                      'GARAGEAREA', 'WOODDECKSF', 'OPENPORCHSF', 'ENCLOSEDPORCH',
                      '3SSNPORCH', 'SCREENPORCH', 'POOLAREA', 'LOWQUALFINSF',
                      'BSMTFULLBATH', 'BSMTHALFBATH', 'FULLBATH', 'HALFBATH',
                      'BEDROOMABVGR', 'KITCHENABVGR']]
target = data_clean.SALEPRICE

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
for k in predvar.columns:
    predictors[k] = preprocessing.scale(predictors[k].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3,
                                                              random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients, sorted by absolute size
var_imp = pd.DataFrame(data={'predictors': list(predictors.columns.values),
                             'coefficients': model.coef_})
var_imp['sort'] = var_imp.coefficients.abs()
print(var_imp.sort_values(by='sort', ascending=False))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean squared error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
```

```
    coefficients     predictors  sort
4           0.36    OVERALLQUAL  0.36
0           0.26      GRLIVAREA  0.26
18          0.12     GARAGECARS  0.12
11          0.07     BSMTFINSF1  0.07
2           0.07      YEARBUILT  0.07
7           0.05   YEARREMODADD  0.05
8           0.05       1STFLRSF  0.05
15         -0.04     MSSUBCLASS  0.04
3           0.04     FIREPLACES  0.04
14          0.04    TOTALBSMTSF  0.04
20          0.02     WOODDECKSF  0.02
27          0.01   BSMTFULLBATH  0.01
1           0.01        LOTAREA  0.01
24          0.00    SCREENPORCH  0.00
25          0.00       POOLAREA  0.00
26          0.00   LOWQUALFINSF  0.00
31          0.00   BEDROOMABVGR  0.00
22          0.00  ENCLOSEDPORCH  0.00
28          0.00   BSMTHALFBATH  0.00
29          0.00       FULLBATH  0.00
30          0.00       HALFBATH  0.00
23          0.00      3SSNPORCH  0.00
16          0.00        MISCVAL  0.00
21          0.00    OPENPORCHSF  0.00
19          0.00     GARAGEAREA  0.00
17          0.00         MOSOLD  0.00
13          0.00      BSMTUNFSF  0.00
12          0.00     BSMTFINSF2  0.00
10          0.00         YRSOLD  0.00
9           0.00       2NDFLRSF  0.00
6           0.00   TOTRMSABVGRD  0.00
5           0.00    OVERALLCOND  0.00
32          0.00   KITCHENABVGR  0.00
```

```
training data R-square 0.777169556607
test data R-square 0.81016173881
```
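The code that produced those R-square figures and the retained-predictor count isn’t shown above. As a rough, hypothetical sketch of how they can be computed (using synthetic data from `make_regression` in place of the actual house price file, so the numbers will differ): the R-square values come from the fitted model’s `score` method, and the retained count from the nonzero coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# synthetic stand-in: 33 predictors, 13 of which actually matter
X, y = make_regression(n_samples=1460, n_features=33, n_informative=13,
                       noise=10.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
                                                    random_state=123)

model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

# count predictors the lasso kept (nonzero coefficients)
print('retained predictors:', np.sum(model.coef_ != 0))

# R-square on training and test data
print('training data R-square', model.score(X_train, y_train))
print('test data R-square', model.score(X_test, y_test))
```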

## Wesleyan’s Machine Learning for Data Analysis Week 2

Week 2’s assignment for this machine learning for data analysis course, delivered by Wesleyan University (Middletown, Connecticut) in conjunction with Coursera, was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width.

I started up the Spyder IDE via Anaconda Navigator and then began to import the necessary Python libraries:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
```

Now load our Iris dataset of 150 rows of 5 variables:

```python
# Load the iris dataset
iris = pd.read_csv("iris.csv")

# or, if not on file, could call this:
#iris = datasets.load_iris()
```

Now we begin our modelling and prediction. We define our predictors and target as follows:

```python
predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name
```

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

```python
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets,
                                                              test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
```

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the random forest classifier class to do this.

```python
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
```

Finally we make our predictions on our test data set and verify the accuracy.

```python
predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
```

```
Out[1]: 0.94999999999999996
```

Next we figure out the relative importance of each of the attributes:

```python
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
print(model.feature_importances_)
```

```
[ 0.09603246  0.06664688  0.40937484  0.42794582]
```

Finally displaying the performance of the random forest was achieved with the following:

```python
trees = range(25)
accuracy = np.zeros(25)

# grow forests of 1..25 trees and record the test accuracy of each
for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
```

And the accuracy plot was output:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable: the type of iris. The explanatory variables included as possible contributors were petal width, petal length, sepal width, and sepal length.

The explanatory variables with the highest relative importance scores were petal width (42.8%) and petal length (40.9%), followed by sepal length (9.6%) and sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees, rather than a single tree, adding little to the overall accuracy of the model and suggesting that interpretation of a single decision tree may be appropriate.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

## SFrame and Free GraphLab Create

### Why SFrame & GraphLab Create

There are many excellent machine learning libraries in Python; one of the most popular today is scikit-learn. Similarly, there are many tools for data manipulation in Python, a popular example being Pandas. However, most of these tools do not scale to large datasets.

The SFrame package is available in open-source under a permissive BSD license. So, you will always be able to use SFrames for free. It can be installed with:


```shell
pip install -U sframe
```

GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. This software, however, has a paid license for commercial purposes. You can get the GraphLab Create academic license at the following link:

https://dato.com/learn/coursera/

I was able to sign up with my DBS lecturer email address, get a valid license key, and then download and install the product. It works in conjunction with Anaconda and Jupyter Notebooks.

GraphLab Create is very actively used in industry by a large number of companies. The package was created by a machine learning company called Dato, a spin-off from a popular research project called GraphLab, which Carlos Guestrin and his research group started at Carnegie Mellon University. In addition to being a professor at the University of Washington, Carlos is the CEO of Dato.

## Wesleyan’s Machine Learning for Data Analysis Week 1

Week 1’s assignment for this machine learning for data analysis course, delivered by Wesleyan University (Middletown, Connecticut) in conjunction with Coursera, was to build a decision tree to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I decided to choose Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python, as I have been programming in it for over 10 years.

Pandas, sklearn, numpy, and spyder were also used, with Anaconda being instrumental in setting everything up.

```shell
conda update conda
conda update anaconda
conda install seaborn
conda update qt pyqt
conda install spyder
pip install graphviz
pip install pydotplus
brew install graphviz
```

I started up the Spyder IDE via Anaconda Navigator and then began to import the necessary Python libraries:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn import datasets
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
```

Now load our Iris dataset of 150 rows of 5 variables:

```python
# Load the iris dataset
iris = pd.read_csv("iris.csv")

# or, if not on file, could call this:
#iris = datasets.load_iris()

# there should be no na - for performance probably don't have to do this
iris = iris.dropna()

iris.dtypes
iris.describe()

print("head", iris.head(), sep="\n", end="\n\n")
print("tail", iris.tail(), sep="\n", end="\n\n")
print("types", iris["Name"].unique(), sep="\n")
```

Leading to the output:

```
head
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

tail
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

types
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
```

Now we begin our modelling and prediction. We define our predictors and target as follows:

```python
predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name
```

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

```python
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets,
                                                              test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
```

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the decision tree classifier class to do this.

```python
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
```

Finally we make our predictions on our test data set and verify the accuracy.

```python
predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
```

```
Out[1]: 0.96666666666666667
```

I’ve run the above code, separating the training and test datasets, building the model, making the predictions, and finally testing the accuracy, another 14 times in a loop, and got accuracy scores ranging from 84.3% to 100%, so a generated model might have the potential to be overfitted. However, the mean of these values is 0.942 with a standard deviation of 0.04, so the values are not deviating much from the mean.

```
Out[2]: 0.94999999999999996
Out[3]: 1.0
Out[4]: 0.96666666666666667
Out[5]: 0.94999999999999996
Out[6]: 0.8433333333333333
Out[7]: 0.93333333333333335
Out[8]: 0.90000000000000002
Out[9]: 0.94999999999999996
Out[10]: 0.96666666666666667
Out[11]: 0.91666666666666663
Out[12]: 0.98333333333333328
Out[13]: 0.8833333333333333
Out[14]: 0.94999999999999996
Out[15]: 0.96666666666666667
```
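The repeated-evaluation loop itself can be sketched as follows. This is a minimal, self-contained version using sklearn’s built-in iris data (rather than the CSV above), so the exact figures will differ from run to run:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
scores = []
for run in range(15):
    # a fresh random 60/40 train/test split each run
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=.4, random_state=run)
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

# spread of accuracies across the 15 runs
print("mean:", round(np.mean(scores), 3), "sd:", round(np.std(scores), 3))
```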

Finally displaying the tree was achieved with the following:

```python
# Displaying the decision tree
from sklearn import tree
from io import StringIO          # from StringIO import StringIO on Python 2
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
```

And the tree was output:

The petal length (X[2]) was the first variable to separate the sample into two subgroups. Irises with a petal length of less than or equal to 2.45 were a group of their own: the setosa, with all 32 in the sample identified as this group. The next variable to separate on was the petal width (X[3]), at values of less than or equal to 1.75. This separates the versicolor and virginica categories very well, with only 3 of the remaining 58 not being categorised correctly (2 of the virginica, and 1 of the versicolor). The next decision is back on petal length (X[2]) <= 5.45: the left-hand branch resolves the virginica in the end on two more decisions, the majority with petal length less than or equal to 4.95 and the remaining 2 with petal width > 1.55. Meanwhile, in the right branch, all but one of the versicolor are categorised based on petal length > 4.85. The last decision, between 1 versicolor and 1 virginica, is made on sepal length (X[0]): a sepal length <= 6.05 is the virginica, and the last versicolor has a sepal length > 6.05.
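A walk through the tree like the one above can also be cross-checked in plain text: sklearn (from version 0.21) provides `export_text`, which prints the same split rules without needing Graphviz. A hypothetical sketch using the built-in iris data rather than the CSV used in the assignment:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(random_state=123).fit(iris.data, iris.target)

# print the tree's split rules as indented text, with readable feature names
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The first printed split should be on petal length, matching the interpretation above.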

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.