Week 1’s assignment for this machine learning for data analytics course, delivered by Wesleyan University (Middletown, Connecticut) in conjunction with Coursera, was to build a decision tree to test nonlinear relationships between a series of explanatory variables and a categorical response variable. I chose Fisher’s Iris dataset, which comprises 3 types of iris (Setosa, Versicolour, and Virginica) and 4 explanatory variables: sepal length, sepal width, petal length, and petal width. I also decided to do the assignment in Python, as I have been programming in it for over 10 years.

Pandas, scikit-learn, NumPy, and the Spyder IDE were also used, with Anaconda being instrumental in setting everything up.

```shell
conda update conda
conda update anaconda
conda install seaborn
conda update qt pyqt
conda install spyder
pip install graphviz
pip install pydotplus
brew install graphviz
```

I started up the Spyder IDE via Anaconda Navigator and then imported the necessary Python libraries:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn import datasets
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
```

Now load our Iris dataset of 150 rows and 5 variables:

```python
# Load the iris dataset
iris = pd.read_csv("iris.csv")
# or, if not on file, could call this:
#iris = datasets.load_iris()

# there should be no NAs - for performance probably don't have to do this
iris = iris.dropna()

iris.dtypes
iris.describe()

print("head", iris.head(), sep="\n", end="\n\n")
print("tail", iris.tail(), sep="\n", end="\n\n")
print("types", iris["Name"].unique(), sep="\n")
```
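The `unique()` call lists the class names; it can also be worth confirming the classes are balanced before modelling. A quick sketch using scikit-learn’s bundled copy of the dataset (so it runs without `iris.csv`; note the bundled copy uses plain species names such as `setosa` rather than `Iris-setosa`):

```python
from sklearn import datasets
import pandas as pd

# Load the bundled Iris data and rebuild the "Name" column used above
raw = datasets.load_iris()
iris = pd.DataFrame(raw.data, columns=["SepalLength", "SepalWidth",
                                       "PetalLength", "PetalWidth"])
iris["Name"] = [raw.target_names[i] for i in raw.target]

# Each species should appear exactly 50 times in the 150-row dataset
print(iris["Name"].value_counts())
```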

Leading to the output:

```
head
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

tail
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

types
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
```

Now we begin our modelling and prediction. We define our predictors and target as follows:

```python
predictors = iris[['SepalLength','SepalWidth','PetalLength','PetalWidth']]
targets = iris.Name
```

Next we split our data into training and test datasets, with a 60%/40% split respectively:

```python
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
```

This gives a training dataset of 90 rows and a test dataset of 60 rows.
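Because `train_test_split` shuffles the rows at random, each run produces a different split (and, as shown further down, a different accuracy). Passing `random_state` pins the shuffle so results are reproducible. A minimal sketch, using scikit-learn’s bundled copy of the data rather than the CSV:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

raw = datasets.load_iris()
predictors, targets = raw.data, raw.target

# random_state fixes the shuffle, so the same rows land in each split every run
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.4, random_state=42)

print(pred_train.shape, pred_test.shape)  # (90, 4) (60, 4)
```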

Now it is time to build our classification model, and we use the decision tree classifier class to do this.

```python
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
```

Finally we make our predictions on our test data set and verify the accuracy.

```python
predictions = classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
```

```
Out[1]: 0.96666666666666667
```
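The `classification_report` imported at the top (but not used above) gives per-class precision and recall alongside the raw confusion matrix. A sketch on the bundled dataset, with a fixed split and classifier assumed for reproducibility:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

raw = datasets.load_iris()
pred_train, pred_test, tar_train, tar_test = train_test_split(
    raw.data, raw.target, test_size=0.4, random_state=0)

classifier = DecisionTreeClassifier(random_state=0).fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

# Rows of the confusion matrix are true classes, columns are predictions;
# the diagonal counts correct classifications
print(confusion_matrix(tar_test, predictions))
print(classification_report(tar_test, predictions,
                            target_names=raw.target_names))
```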

I re-ran the above code (splitting the training and test datasets, building the model, making the predictions, and testing the accuracy) another 14 times in a loop and got accuracy scores ranging from 84.3% to 100%, so any single generated model has the potential to be overfitted. However, the mean of these scores is 0.942 with a standard deviation of 0.04, so the values do not deviate much from the mean.

```
Out[2]: 0.94999999999999996
Out[3]: 1.0
Out[4]: 0.96666666666666667
Out[5]: 0.94999999999999996
Out[6]: 0.8433333333333333
Out[7]: 0.93333333333333335
Out[8]: 0.90000000000000002
Out[9]: 0.94999999999999996
Out[10]: 0.96666666666666667
Out[11]: 0.91666666666666663
Out[12]: 0.98333333333333328
Out[13]: 0.8833333333333333
Out[14]: 0.94999999999999996
Out[15]: 0.96666666666666667
```
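The repeated runs described above can be scripted directly. This sketch re-splits, refits, and scores 15 times, then reports the mean and standard deviation; it uses the bundled dataset and seeded splits, so the exact numbers will differ from the Out[] values shown:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

raw = datasets.load_iris()
scores = []
for seed in range(15):
    # A fresh 60/40 split on every iteration
    pred_train, pred_test, tar_train, tar_test = train_test_split(
        raw.data, raw.target, test_size=0.4, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(pred_train, tar_train)
    scores.append(accuracy_score(tar_test, clf.predict(pred_test)))

print("mean", np.mean(scores), "std", np.std(scores))
```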

Finally displaying the tree was achieved with the following:

```python
# Displaying the decision tree
from sklearn import tree
# Python 2 used: from StringIO import StringIO
from io import StringIO
from IPython.display import Image
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
```

And the tree was output:

The petal length (X[2]) was the first variable to split the sample into two subgroups. Irises with a petal length of less than or equal to 2.45 formed a group of their own: the setosa, with all 32 in the training sample identified as this group. The next variable to split on was petal width (X[3]), at values of less than or equal to 1.75. This separates the versicolor and virginica categories very well, with only 3 of the remaining 58 not categorised correctly (2 of the virginica and 1 of the versicolor). The next decision is back on petal length again (X[2] <= 5.45): the left-hand branch resolves virginica over two more decisions, the majority with petal length less than or equal to 4.95 and the remaining 2 with petal width > 1.55. Meanwhile, in the right branch, all but one of the versicolor are categorised based on petal length > 4.85. The last decision, between 1 versicolor and 1 virginica, is made on sepal length (X[0]): a sepal length <= 6.05 being the virginica, and the last versicolor having a sepal length > 6.05.
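Recent scikit-learn versions can also print the same decision rules as plain text via `tree.export_text`, which avoids the Graphviz dependency entirely. A sketch on the bundled dataset, passing feature names so the rules read e.g. "PetalLength <= 2.45" rather than X[2] (here the tree is fit on the full dataset purely to illustrate the printed output):

```python
from sklearn import datasets, tree
from sklearn.tree import DecisionTreeClassifier

raw = datasets.load_iris()
feature_names = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

# Fit on the full dataset just to illustrate the printed rules
clf = DecisionTreeClassifier(random_state=0).fit(raw.data, raw.target)

# export_text renders the fitted tree as indented if/else rules
print(tree.export_text(clf, feature_names=feature_names))
```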

So our model seems to be performing very well at categorising the iris flowers based on the variables available to us.