A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across four variables: petal length, petal width, sepal length, and sepal width. All four clustering variables were quantitative and were standardized to have a mean of 0 and a standard deviation of 1.
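The standardization step can be sketched as follows. This is a minimal illustration, assuming scikit-learn's bundled iris data stands in for the iris_data.csv file used in the analysis; the column names are assumptions carried over from the write-up.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Bundled iris data stands in for iris_data.csv; column names are assumptions
iris = load_iris()
features = pd.DataFrame(iris.data,
                        columns=['sepal_length', 'sepal_width',
                                 'petal_length', 'petal_width'])

# Standardize each clustering variable to mean 0, standard deviation 1
standardized = features.apply(
    lambda col: preprocessing.scale(col.astype('float64')))
```

Standardizing matters here because k-means uses Euclidean distance: without it, variables measured on larger scales would dominate the cluster assignments.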

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
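One way to compute the r-square values behind the elbow curve is shown below. This is a sketch, not the exact code used for Figure 1: it assumes the bundled iris data in place of the CSV file, and it measures variance explained as 1 minus the ratio of within-cluster sum of squares (KMeans's `inertia_`) to total sum of squares.

```python
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Standardized iris features stand in for the clustering data in the text
X = preprocessing.scale(load_iris().data)
iris_train, iris_test = train_test_split(X, test_size=.3, random_state=123)

# Total sum of squares around the overall mean of the training data
total_ss = ((iris_train - iris_train.mean(axis=0)) ** 2).sum()

# R-square for each k: proportion of total variance explained by the clusters
r2 = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=123).fit(iris_train)
    r2.append(1 - model.inertia_ / total_ss)  # inertia_ = within-cluster SS
```

Plotting `r2` against k gives the elbow curve: r-square rises steeply up to the "elbow" and then flattens, and the k at the bend is the candidate number of clusters.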

Figure 1. Elbow curve of r-square values for the five cluster solutions.

The elbow curve was fairly conclusive, suggesting a natural 3-cluster solution. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four clustering variables, reduced to two principal components, by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although the two clusters overlapped slightly with each other. Clusters 1 and 2 were generally distinct but close together. Observations in cluster 0 were more spread out than those in the other clusters, showing higher within-cluster variance, but did not overlap with them (the Euclidean distance between this cluster and the other two was quite large). The results of this plot suggest that the best cluster solution has 3 clusters.

Figure 2. Plot of the first two principal components of the clustering variables by cluster.

We can see that the data belonging to the Setosa species were grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was driven primarily by petal length and petal width, and the second by sepal length and sepal width.
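The cluster-to-species correspondence described above can be checked with a cross-tabulation of cluster assignments against the known species labels. This is a sketch assuming the bundled iris data; note that k-means cluster numbering depends on the random initialization, so the exact 0/1/2 labels may differ from run to run even when the grouping itself is the same.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Standardized iris features and species names stand in for the report's data
iris = load_iris()
X = preprocessing.scale(iris.data)
species = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

# Fit the 3-cluster solution and cross-tabulate clusters against species
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X)
tab = pd.crosstab(labels, species)
print(tab)
```

In the resulting table, Setosa falls almost entirely into a single cluster, while Versicolor and Virginica each dominate one of the remaining two clusters, matching the interpretation above.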

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

# Load and clean the data (the CSV is assumed to have a header row with the
# column names used below)
iris_data = pd.read_csv("iris_data.csv")
iris_data_clean = iris_data.dropna()
iris_data_clean.describe()

# Standardize the four clustering variables to mean 0, standard deviation 1
iris_cluster = iris_data_clean[['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width']].copy()
for col in iris_cluster.columns:
    iris_cluster[col] = preprocessing.scale(iris_cluster[col].astype('float64'))

# 70/30 train/test split
iris_train, iris_test = train_test_split(iris_cluster, test_size=.3,
                                         random_state=123)

# Elbow curve: average distance to the nearest cluster center for k = 1..5
clusters = range(1, 6)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(iris_train)
    iris_assign = model.predict(iris_train)
    meandist.append(
        sum(np.min(cdist(iris_train, model.cluster_centers_, 'euclidean'),
                   axis=1)) / iris_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret the 3-cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(iris_train)
iris_assign = model3.predict(iris_train)

# Reduce the four clustering variables to two principal components for plotting
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(iris_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('Scatterplot of Principal Components for 3 Clusters')
plt.show()