Machine Learning for Data Analysis Course Passed

Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. This great Python and SAS course is part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. All 4 clustering variables were quantitative, and all were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.

Figure 1. Elbow curve of r-square values for the five cluster solutions.

The elbow curve was pretty conclusive, suggesting a natural 3-cluster solution that might be interpreted. The results below are for the interpretation of the 3-cluster solution.
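The clustering and elbow-curve steps described above might be sketched as follows. This is a minimal reconstruction, not the exact assignment code, and it plots the average distance of each observation to its nearest centroid (a common alternative to the r-square curve):

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

iris = datasets.load_iris()
X = scale(iris.data)  # standardize to mean 0, sd 1

# 70% training (N=105) / 30% test (N=45) split
clus_train, clus_test = train_test_split(X, test_size=0.3, random_state=123)

# k-means for k = 1..5, recording average distance to the nearest centroid
clusters = range(1, 6)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=123).fit(clus_train)
    meandist.append(
        cdist(clus_train, model.cluster_centers_, 'euclidean').min(axis=1).mean())

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster distance')
plt.title('Selecting k with the elbow method')
```

The average distance falls as k grows; the "elbow" where the improvement levels off suggests the number of clusters to interpret.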

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2 below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although they overlapped a little with each other. Clusters 1 and 2 were generally distinct but close together. Observations in cluster 0 were more spread out than the other clusters, showing high within-cluster variance, but had no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution would have 3 clusters.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

We can see that the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was driven mainly by petal length and petal width, and the second by sepal length and sepal width.
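A sketch of how that plot can be produced, using PCA as the dimension-reduction step (a stand-in for the exact canonical-variable computation used in the course):

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

iris = datasets.load_iris()
X = scale(iris.data)

# assign each observation to one of 3 k-means clusters
labels = KMeans(n_clusters=3, random_state=123).fit_predict(X)

# reduce the four standardized measurements to two components for plotting
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of canonical variables for 3 clusters')
```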

Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify a subset of variables from a pool of 79 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring Ames, Iowa house sale prices. Categorical predictors included house type, neighbourhood, and zoning type. Quantitative predictors included lot area, above-ground living area, first-floor area, and second-floor area, along with counts of bathrooms and bedrooms. To improve interpretability of the selected model with fewer predictors, all predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Figure 1. Change in the validation mean square error at each step:

Of the 33 predictor variables entered, 13 were retained in the selected model, with overall quality, above-ground floor space, and garage capacity emerging as the three main variables during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better on the test set at 81%.
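The estimation step might be sketched as below. Since the Ames training data isn't reproduced here, this uses a synthetic stand-in of the same shape (1460 observations, 33 standardized predictors); the numbers it prints will not match the figures reported above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# synthetic stand-in for the Ames observations and predictors
X, y = make_regression(n_samples=1460, n_features=33, n_informative=13,
                       noise=10.0, random_state=123)
X = scale(X)  # mean 0, sd 1, as described above

# 70% training (N=1022) / 30% test (N=438) split
pred_train, pred_test, tar_train, tar_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# least angle regression with 10-fold cross validation
model = LassoLarsCV(cv=10).fit(pred_train, tar_train)

retained = (model.coef_ != 0).sum()
print('predictors retained:', retained)
print('training R-square:', model.score(pred_train, tar_train))
print('test R-square:', model.score(pred_test, tar_test))
```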

Wesleyan’s Machine Learning for Data Analysis Week 2

Week 2’s assignment for this machine learning for data analysis course delivered by Wesleyan University, Hartford, Connecticut, in conjunction with Coursera was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width.

I started up the Spyder IDE via Anaconda Navigator and then began to import the necessary Python libraries:

Now load our Iris dataset of 150 rows of 5 variables:

Now we begin our modelling and prediction. We define our predictors and target as follows:

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

Training data set of length 90, and test data set of length 60.
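A sketch of the steps above (imports, loading the data, defining predictors and target, and splitting), assuming scikit-learn's bundled copy of the Iris data:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load the Iris data into a DataFrame: 150 rows of 5 variables
iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

# predictors are the four measurements; the target is the species
predictors = data[iris.feature_names]
targets = data['species']

# 60% / 40% train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.4)
print(len(pred_train), len(pred_test))  # 90 60
```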

Now it is time to build our classification model and we use the random forest classifier class to do this.

Finally we make our predictions on our test data set and verify the accuracy.
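Those two steps, building the forest and checking its accuracy on the test set, might look roughly like this (the 25-tree setting is an assumption based on the course material):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
pred_train, pred_test, tar_train, tar_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=123)

# build the random forest classifier on the training data
classifier = RandomForestClassifier(n_estimators=25, random_state=123)
classifier = classifier.fit(pred_train, tar_train)

# predict on the held-out test set and verify the accuracy
predictions = classifier.predict(pred_test)
print(confusion_matrix(tar_test, predictions))
print(accuracy_score(tar_test, predictions))
```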

Next we figure out the relative importance of each of the attributes by fitting an Extra Trees model to the data:
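A minimal sketch of that step with scikit-learn's ExtraTreesClassifier:

```python
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

iris = datasets.load_iris()

# fit an Extra Trees model to the data
model = ExtraTreesClassifier(n_estimators=25, random_state=123)
model.fit(iris.data, iris.target)

# display the relative importance of each attribute (scores sum to 1)
for name, score in zip(iris.feature_names, model.feature_importances_):
    print(name, round(float(score), 3))
```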

Finally displaying the performance of the random forest was achieved with the following:
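One way to produce that performance plot is to grow forests of increasing size and chart the test accuracy of each, roughly as follows (a sketch, not the original listing):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
pred_train, pred_test, tar_train, tar_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=123)

# grow forests of 1..25 trees and record the test accuracy of each
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n, random_state=123)
    classifier.fit(pred_train, tar_train)
    accuracy[idx] = accuracy_score(tar_test, classifier.predict(pred_test))

plt.plot(trees, accuracy)
plt.xlabel('Number of trees')
plt.ylabel('Test accuracy')
```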

And the plot was output:

Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating the type of iris: petal width, petal length, sepal width, and sepal length.

The explanatory variables with the highest relative importance scores were petal width (42.8%), petal length (40.9%), sepal length (9.6%), and finally sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees rather than a single tree, adding little to the overall accuracy of the model, and suggesting that interpretation of a single decision tree may be appropriate.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.

SFrame and Free GraphLab Create

Why SFrame & GraphLab Create

There are many excellent machine learning libraries in Python. One of the most popular today is scikit-learn. Similarly, there are many tools for data manipulation in Python; a popular example is Pandas. However, most of these tools do not scale to large datasets.

The SFrame package is open source under a permissive BSD license, so you will always be able to use SFrames for free. It can be installed with:
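At the time of writing, the package was available from PyPI, so the install is a single pip command (assuming a working Python environment):

```shell
pip install sframe
```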


GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. This software, however, has a paid license for commercial purposes. You can get the GraphLab Create academic license at the following link:

https://dato.com/learn/coursera/

I was able to sign up with my DBS lecturer email address, get a valid license key, and then download and install the product. It works in conjunction with Anaconda and Jupyter Notebooks.

GraphLab Create is very actively used in industry by a large number of companies. The package was created by a machine learning company called Dato, a spin-off from a popular research project called GraphLab, which Carlos Guestrin and his research group started at Carnegie Mellon University. In addition to being a professor at the University of Washington, Carlos is the CEO of Dato.

Wesleyan’s Machine Learning for Data Analysis Week 1

Week 1’s assignment for this machine learning for data analysis course delivered by Wesleyan University, Hartford, Connecticut, in conjunction with Coursera was to build a decision tree to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I decided to choose Fisher’s Iris data set, comprising 3 different types of irises (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python as I have been programming in it for over 10 years.

Pandas, sklearn, numpy, and spyder were also used, with Anaconda being instrumental in setting everything up.

I started up the Spyder IDE via Anaconda Navigator and then began to import the necessary Python libraries:

Now load our Iris dataset of 150 rows of 5 variables:

Now we begin our modelling and prediction. We define our predictors and target as follows:

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

Training data set of length 90, and test data set of length 60.
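The steps above (imports, loading the data, defining predictors and target, and splitting) can be sketched as follows, assuming scikit-learn's bundled copy of the Iris data:

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load the Iris data into a DataFrame: 150 rows of 5 variables
iris = datasets.load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

# predictors are the four measurements; the target is the species
predictors = data[iris.feature_names]
targets = data['species']

# 60% / 40% train/test split
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.4)
print(len(pred_train), len(pred_test))  # 90 60
```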

Now it is time to build our classification model and we use the decision tree classifier class to do this.

Finally we make our predictions on our test data set and verify the accuracy.
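Those two steps, fitting the tree and checking its accuracy on the test set, might look roughly like this:

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
pred_train, pred_test, tar_train, tar_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=123)

# build the decision tree classifier on the training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

# predict on the held-out test set and verify the accuracy
predictions = classifier.predict(pred_test)
print(confusion_matrix(tar_test, predictions))
print(accuracy_score(tar_test, predictions))
```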

I ran the above code, separating the training and test datasets, building the model, making the predictions, and finally testing the accuracy, another 14 times in a loop, and got accuracy scores ranging from 84.3% to 100%, so a generated model might have the potential to be overfitted. However, the mean of these values is 0.942 with a standard deviation of 0.04, so the values do not deviate much from the mean.
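The loop can be sketched as below; since the splits are random, each run will produce slightly different figures than those quoted above:

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

# repeat the split / fit / predict cycle 15 times and summarise the accuracy
scores = []
for _ in range(15):
    pred_train, pred_test, tar_train, tar_test = train_test_split(
        iris.data, iris.target, test_size=0.4)
    model = DecisionTreeClassifier().fit(pred_train, tar_train)
    scores.append(accuracy_score(tar_test, model.predict(pred_test)))

print('mean:', np.mean(scores), 'sd:', np.std(scores))
```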

Finally displaying the tree was achieved with the following:
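One way to display the tree is to export it in Graphviz DOT format with scikit-learn's `export_graphviz`; a renderer such as pydotplus (not shown here) can then turn the DOT text into an image:

```python
import io

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = datasets.load_iris()
model = DecisionTreeClassifier(random_state=123).fit(iris.data, iris.target)

# write the fitted tree out in Graphviz DOT format
dot_data = io.StringIO()
export_graphviz(model, out_file=dot_data)
print(dot_data.getvalue()[:80])
```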

And the tree was output:

The petal length (X[2]) was the first variable to separate the sample into two subgroups. Irises with a petal length of less than or equal to 2.45 formed a group of their own: the setosa, with all 32 in the sample identified as this group. The next variable to separate on was the petal width (X[3]), at values of less than or equal to 1.75. This separates the versicolor and virginica categories very well, with only 3 of the remaining 58 not categorised correctly (2 of the virginica, and 1 of the versicolor). The next decision is back on petal length again (X[2] <= 5.45): the left-hand branch resolves the virginica over two more decisions, the majority with a petal length of less than or equal to 4.95 and the remaining 2 with a petal width > 1.55. Meanwhile, in the right branch, all but one of the versicolor are categorised based on a petal length > 4.85. The last decision, between 1 versicolor and 1 virginica, is made on variable X[0], the sepal length: <= 6.05 being the virginica, and the last versicolor having a sepal length > 6.05.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.