Wesleyan’s Machine Learning for Data Analysis Week 1


Week 1’s assignment for this machine learning for data analytics course delivered by Wesleyan University, Hartford, Connecticut in conjunction with Coursera was to build a decision tree to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I decided to choose Fisher’s Iris data set comprising of 3 different types of irises’ (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python as I have been programming in it for over 10 years.

Pandas, sklearn, numpy, and spyder were also used, with Anaconda being instrumental in setting everything up.

Started up Spyder IDE via Anaconda Navigator and then began to import the necessary python libraries:

Now load our Iris dataset of 150 rows of 5 variables:

Leading to the output:

Now we begin our modelling and prediction. We define our predictors and target as follows:

Next we split our data into our training and test datasets with a 60%, 40% split respectively:

Training data set of length 90, and test data set of length 60.

Now it is time to build our classification model and we use the decision tree classifier class to do this.

Finally we make our predictions on our test data set and verify the accuracy.

I’ve run the above code, separating the training and test datasets, builiding the model, making the predictions, and finally testing the accuracy another 14 times in a loop and got accuracy predictions ranging from 84.3% to 100%, so a generated model might have the potential to be overfitted. However the mean of these values is 0.942 with a standard deviation of 0.04 so the values are not deviating much from the mean.

Finally displaying the tree was achieved with the following:

And the tree was output:


The petal length (X[2]) was the first variable to separate the sample into two subgroups. Iris’ with petal length of less than or equal to 2.45 were a group of their own – the setosa with all 32 in the sample identified as this group. The next variable to separate was the petal width (X[3]) on values of less than or equal to 1.75. This is separating between the versicolor and virginica categories very well – only 3 of the remaining 58 not being categorised correctly (2 of the virginica, and 1 of the versicolor). The next decision is back on petal length again (X[2]) <= 5.45 on the left hand branch resolving virginica in the end on two more decisions, the majority with petal length less than or equal to 4.95 and the remaining 2 with petal width > 1.55. Meanwhile in the right branch all but one of the versicolor is categorised based on the petal length > 4.85. The last decision to decide between 1 versicolor and 1 virginica is decided based on variable V[0], the sepal length <= 6.05 being the virginica, and the last versicolor having a sepal length > 6.05.

So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.