Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity on four variables: petal length, petal width, sepal length, and sepal width. All four clustering variables were quantitative, and all were standardized to have a mean of 0 and a standard deviation of 1.
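The standardization step can be sketched as follows. This is a minimal illustration that assumes scikit-learn's bundled copy of the Iris data and its StandardScaler; the original analysis may have used a different tool. StandardScaler divides by the population standard deviation (ddof=0), so the check below uses the same convention.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data (150 rows, 4 measurements)
iris = load_iris()
X = iris.data

# Rescale each column to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

col_means = X_std.mean(axis=0)
col_sds = X_std.std(axis=0)  # population SD (ddof=0), matching StandardScaler
print(np.round(col_means, 6), np.round(col_sds, 6))
```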

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses was conducted on the training data specifying k = 1 to 5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
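A rough sketch of the split and the five k-means runs is given below. It assumes scikit-learn, an arbitrary random seed, and r-square defined as 1 minus the within-cluster sum of squares divided by the total sum of squares of the training data. The text does not state the exact formula, so that definition is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# 70/30 split; random_state is an arbitrary seed chosen for reproducibility
X_train, X_test = train_test_split(X_std, test_size=0.3, random_state=123)

# Total sum of squares of the training data around its own mean
total_ss = ((X_train - X_train.mean(axis=0)) ** 2).sum()

r_square = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X_train)
    # inertia_ is the within-cluster sum of squared Euclidean distances
    r_square.append(1.0 - model.inertia_ / total_ss)

for k, r2 in zip(range(1, 6), r_square):
    print(f"k={k}: r-square={r2:.3f}")
```

With one cluster the only centroid is the overall mean, so r-square starts at 0. The elbow is the value of k after which further clusters add little.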

Figure 1. Elbow curve of r-square values for the five cluster solutions.

iris_clusters_five

The elbow curve was fairly conclusive, suggesting a natural 3-cluster solution that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although the two clusters overlapped slightly. Clusters 1 and 2 were generally distinct but close to each other. Observations in cluster 0 were more spread out than those in the other clusters, showing higher within-cluster variance, and did not overlap with them (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution would have 3 clusters.
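The reduction to two principal components can be sketched as below, assuming scikit-learn and an arbitrary seed. The cluster numbers k-means assigns are arbitrary, so the labels here will not necessarily match the cluster numbers used in this write-up. The sketch prints a per-cluster spread instead of drawing the plot; the two columns of coords are what would go on the scatterplot axes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
X_train, _ = train_test_split(X_std, test_size=0.3, random_state=123)

# Fit the 3-cluster solution on the training data
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X_train)

# Project the four standardized variables onto two principal components
coords = PCA(n_components=2).fit_transform(X_train)

# Within-cluster spread in the 2-D projection (mean squared distance to centre)
for c in np.unique(labels):
    pts = coords[labels == c]
    spread = ((pts - pts.mean(axis=0)) ** 2).sum(axis=1).mean()
    print(f"cluster {c}: n={len(pts)}, spread={spread:.3f}")
```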

Figure 2. Plot of the first two principal components for the clustering variables by cluster.

scatterplot_for_3_clusters

We can see that the observations belonging to the Setosa species were grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was based on petal length and petal width, and the second on sepal length and sepal width.
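One way to check the species-to-cluster mapping and the makeup of the components is a cross-tabulation plus the PCA loadings. The sketch below assumes scikit-learn and pandas with an arbitrary seed; its cluster numbers are arbitrary and may differ from those reported above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
species = iris.target_names[iris.target]  # species name for every row

X_train, _, sp_train, _ = train_test_split(
    X_std, species, test_size=0.3, random_state=123
)
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X_train)

# Rows: true species, columns: assigned cluster
table = pd.crosstab(
    pd.Series(sp_train, name="species"), pd.Series(labels, name="cluster")
)
print(table)

# Loadings: the weight of each original variable in each component
pca = PCA(n_components=2).fit(X_train)
loadings = pd.DataFrame(
    pca.components_.T, index=iris.feature_names, columns=["PC1", "PC2"]
)
print(loadings.round(2))
```

Variables with large absolute loadings are the ones a component is mostly built from.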

Wesleyan’s Regression Modeling in Practice – Week 1

For week 1 of Wesleyan’s Regression Modeling in Practice, I am required to write up the sample, procedure, and measures sections of a classical research paper. I have recently been trying to decide whether to move house: stay in the current house or sell it, stay in the same area or move to a new one. With so many decisions and so much choice, I want to do some regression modeling to help me decide. I found an interesting problem on kaggle.com, House Prices: Advanced Regression Techniques, and decided to use it as my research data set for this assignment.

Sample

The sample is taken from data used by the Ames Assessor’s Office in computing assessed values for individual residential properties sold in Ames, Iowa from 2006 to 2010. Participants (N=2,930) represented individual residential property sales in the Ames area.
The data analytic sample for this study included participants who had sold an individual residential property. If a home was sold multiple times in the 5-year period, only the most recent sale was included (N=1,320).
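The "most recent sale only" rule can be sketched in pandas as follows. The column names (property_id, year_sold, month_sold, sale_price) and the six rows are hypothetical, made up purely for illustration; they are not taken from the Ames data.

```python
import pandas as pd

# Hypothetical illustrative rows, not real Ames records
sales = pd.DataFrame({
    "property_id": [101, 101, 102, 103, 103, 103],
    "year_sold": [2006, 2009, 2007, 2006, 2008, 2010],
    "month_sold": [5, 3, 11, 1, 7, 2],
    "sale_price": [150000, 172000, 98000, 210000, 205000, 231000],
})

# Sort sales chronologically within each property, then keep the last one
latest = (
    sales.sort_values(["property_id", "year_sold", "month_sold"])
    .drop_duplicates(subset="property_id", keep="last")
    .reset_index(drop=True)
)
print(latest)
```

Each property now appears once, with its latest sale date and price.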

Procedure

Data were collected by trained Ames Assessor’s Office representatives during 2006–2010 through computer-assisted personal interviews (CAPI). At the time of each sale, one party involved in the sale of the property was contacted, and the required variables were collected through interview questions in respondents’ homes following informed consent procedures.

Measures

The house sale price was assessed using 79 variables. These include the type of dwelling involved in the sale (16 different types of dwellings were found) and the zoning of the house (8 types of zones). Twenty continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set. Area measurements on the basement, main living area, and even porches are broken down into individual categories based on quality and type. Fourteen discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home. Additionally, the garage capacity and construction/remodeling dates are recorded. There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes, with the smallest being STREET (gravel or paved) and the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions, while the ordinal variables typically rate various items within the property.
Dependent Variable: Sale Price – the price the house sold for.
Independent Variable:

References

Kaggle’s House Prices: Advanced Regression Techniques
Ames Assessor’s Original Publication
Data Documentation