Trump for President? It’s beginning to look like it!

I should be in bed, I know, I have a lot of work tomorrow – and I am not one to contradict the great Nate Silver and fivethirtyeight.com – but has Trump just won this US presidential election?

It is 3:10am (10:10pm EST) and, as I look at the numbers, does Trump have enough electoral votes? The TV has him at 150 – CBS News seems to have prematurely given Virginia’s 13 votes to Clinton – so it looks like 150 to 109 (although CBS have 122).

So Trump looks to have nailed Georgia (16), North Carolina (15), Michigan (16), and Ohio (18). That is 65 – he would now only need 55 more. He is leading, though it is very close, in Wisconsin (10), Arizona (11), and Florida (29). That would leave him needing only 5 votes, and he is expected to win Alaska, Idaho, and Utah comfortably to give him 13 more.

Does that have him winning by 18, getting Trump to 278 and Hillary to 260?

At what point does he not even need Florida?

I am not even including the 80 electoral votes of Nevada and the West Coast states of California, Oregon, and Washington in these projections.

What is the p-value when Hypothesis Testing?

The p-value helps us determine the significance of our results. A hypothesis test is used to test the validity of a claim being made about the overall population. The claim being tested is the null hypothesis; the claim we would accept if the null hypothesis were concluded to be false is the alternative hypothesis.

Hypothesis tests use the p-value to weigh the strength of the evidence (what the data tells us about the overall population).

* A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis – and accept the alternative hypothesis – this is a statistically significant outcome.

* A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis (and do not accept the alternative hypothesis) – no statistically significant outcome.

* p-values very close to the cutoff (0.05) are considered marginal (they could go either way) – further sampling should be performed where possible.

If possible, another 19 samples could be tested in the hope that the p-value comes out clearly below 0.05 (or clearly above it) in those cases – at least then, in 19 out of 20 cases (the 95% level), we could reject or fail to reject the null hypothesis with more confidence.

Always report the p-value so your audience can draw their own conclusions.

If you insist on an alpha different from 0.05 – say 0.01, for 99% certainty – then you are enforcing a different cut-off value, and the rules above for 0.05 now apply at 0.01.

So you are making it harder to find a statistically significant result – in other words, you are saying that you need stronger evidence before you will reject the null hypothesis and accept the alternative hypothesis.

Alternatively, if you set alpha to 0.1 you make it easier to find a statistically significant result, and you may over-optimistically reject the null hypothesis when, say, the p-value is 0.08.

Changing alpha up or down like this trades off Type 1 errors (rejecting the null hypothesis when you should have failed to reject it – a false positive) against Type 2 errors (failing to reject the null hypothesis when you should have rejected it – a false negative): a lower alpha makes Type 1 errors less likely but Type 2 errors more likely, and a higher alpha does the reverse.

A simple example that springs to mind of why, even demanding near-100% certainty, we might still fail to reject the null hypothesis is a man called Thomas, who hears that Jesus has risen from the dead and appeared to the other ten apostles. Out of the population of apostles not named Thomas (Judas was dead by this stage) who could have seen a person rise from the dead, Thomas still doubted. Why? Because his null hypothesis was that people simply do not resurrect themselves from the dead (it had never happened before – and I don’t think it has happened since either), and unless he saw it with his own eyes he would never reject that null hypothesis; no amount of talk from the other ten would make it statistically significant. Once Jesus appeared to him, he was able to reject his null hypothesis and accept the alternative hypothesis: that this was a statistically significant event and that Jesus had in fact risen from the dead.

Or, another way to look at it: 10 of the 11 apostles witnessed it, giving 91% of apostles who saw and 9% (Thomas) who didn’t – a p-value of 0.09 – and at an alpha of 0.05 that meant all 11 would have to see it for Thomas to believe – therefore 11/11.

A less tongue-in-cheek example from the web: Apache Pizza claims their delivery times are 30 minutes or less on average, but you think it’s more than that. You conduct a hypothesis test because you believe the null hypothesis, H0, that the mean delivery time is at most 30 minutes, is incorrect. Your alternative hypothesis (Ha) is that the mean time is greater than 30 minutes. You randomly sample some delivery times and run the data through the hypothesis test, and your p-value turns out to be 0.001, which is much less than 0.05. In real terms, there is a probability of only 0.001 of seeing evidence this strong if the pizza place’s claim (delivery time less than or equal to 30 minutes) were actually true. Since we are typically willing to reject the null hypothesis when this probability is less than 0.05, you conclude that the pizza place is wrong; their delivery times are in fact more than 30 minutes on average, and you want to know what they’re gonna do about it! (Of course, you could be wrong, having sampled an unusually high number of late pizza deliveries just by chance.)
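Here is a minimal sketch of that pizza test in Python using scipy, with made-up delivery times purely for illustration:

```python
# A minimal sketch of the pizza-delivery hypothesis test with scipy.
# The delivery times below are hypothetical, purely for illustration.
import numpy as np
from scipy import stats

# H0: mean delivery time <= 30 minutes; Ha: mean delivery time > 30 minutes
delivery_times = np.array([31, 35, 29, 40, 33, 38, 32, 36, 34, 39])

t_stat, p_two_sided = stats.ttest_1samp(delivery_times, popmean=30)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

alpha = 0.05
print(f"t = {t_stat:.3f}, one-sided p-value = {p_one_sided:.4f}")
if p_one_sided <= alpha:
    print("Reject H0: the mean delivery time appears to exceed 30 minutes.")
else:
    print("Fail to reject H0: not enough evidence that deliveries take longer than 30 minutes.")
```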

Can you recognise a Ghost, Ghoul, or Goblin?

Being an Irish child born hours after Halloween (All Hallows’ Eve) means being born on All Hallows’ Day or, as we call it now that we are no longer pagan in Ireland, All Saints’ Day. Oiche Shamhna, as we say in Gaelic, and it is the Gaelic word Samhain (November) that gives English the word samhainophobia – the morbid fear of Halloween. Well, this is topical at this time of year, and Kaggle have created a lovely problem to solve, to help people with their samhainophobia and to help us spot a ghost, a ghoul, or a goblin.

Perfect for budding R students to practice some data analytics in R.

Machine Learning for Data Analysis Course Passed

Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. It was a great Python and SAS course and part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses was conducted on the training data specifying k=1 to 5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
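A minimal sketch of that analysis with scikit-learn is shown below; the variable names and the r-square calculation (1 minus within-cluster over total sum of squares) are my own shorthand rather than the exact original code.

```python
# A minimal sketch of the k-means elbow analysis on the standardized Iris data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Standardize the four clustering variables to mean 0 and standard deviation 1.
X = StandardScaler().fit_transform(load_iris().data)

# 70% training / 30% test split.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=123)

# r-square = variance accounted for by the clusters, for k = 1 to 5.
total_ss = ((X_train - X_train.mean(axis=0)) ** 2).sum()
clusters = range(1, 6)
r_square = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=123).fit(X_train)
    r_square.append(1 - model.inertia_ / total_ss)  # inertia_ = within-cluster SS

plt.plot(clusters, r_square, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("r-square (variance explained)")
plt.title("Elbow curve, k = 1 to 5")
plt.show()
```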

Figure 1. Elbow curve of r-square values for the five cluster solutions.

[iris_clusters_five]

The elbow curve was fairly conclusive, suggesting that there was a natural 3-cluster solution that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although they overlapped a little with each other. Clusters 1 and 2 were generally distinct but close to each other. Observations in cluster 0 were spread out more than the other clusters, showing higher within-cluster variance, but had no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution would have 3 clusters.
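A minimal sketch of how that plot can be produced, continuing from the standardized training data X_train in the sketch above:

```python
# A minimal sketch of Figure 2: the 3-cluster solution plotted on the first
# two principal components (continuing from X_train above).
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model3 = KMeans(n_clusters=3, random_state=123).fit(X_train)
pcs = PCA(n_components=2).fit_transform(X_train)

plt.scatter(pcs[:, 0], pcs[:, 1], c=model3.labels_)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Training observations by cluster (k=3)")
plt.show()
```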

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.

[scatterplot_for_3_clusters]

We can see that the data belonging to the Setosa species was grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was based mainly on petal length and petal width, with sepal length and sepal width contributing secondarily.

Machine Learning Foundations Course Passed

Yes, today I passed the Machine Learning Foundations: A Case Study Approach course from the University of Washington on Coursera. This course was a great introduction to GraphLab, and the modules from all 6 weeks were really fun to do. GraphLab allowed me to do regression analysis, classification analysis, sentiment analysis, and machine learning with easy-to-use APIs. The lecturers, Carlos Guestrin and Emily Fox, were fantastically enthusiastic, making the course really enjoyable to do. I look forward to rolling this knowledge into my lectures in DBS over the coming months. Hopefully I will also have the time to complete the Specialization and Capstone project on Coursera in the coming months.

Hackathon in Excel / R / Python

Today in the hackathon you can practice and learn some Excel, R, Python, and Fusion Tables to perform data manipulation, data analysis, and graphics.

In R, to set your working directory use the function setwd(); in Python, use os.chdir to achieve the same.
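For example (the path here is just a placeholder):

```python
# A minimal sketch of setting the working directory in Python;
# the path is a placeholder for wherever the hackathon files live.
import os

os.chdir("path/to/hackathon")  # equivalent to setwd("path/to/hackathon") in R
print(os.getcwd())             # confirm the current working directory
```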

Part A

Hackathon Quiz 23rd October 2016 R

Attempt Some R Questions to practice using R.

Next we can practice reading data sets.

Attached are two files for US baby names in 1900 and 2000.

In the files you’ll see that each year is a comma-separated file with 3 columns – name, sex, and number of births.
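A minimal sketch of reading them with pandas – the file names below are assumptions about how the attachments are saved locally:

```python
# A minimal sketch of reading the baby-names files with pandas.
# The file names are assumptions about how the attachments are saved locally.
import pandas as pd

columns = ["name", "sex", "births"]
names_1900 = pd.read_csv("babynames_1900.csv", names=columns)
names_2000 = pd.read_csv("babynames_2000.csv", names=columns)

print(names_1900.head())
print("Total births recorded in 1900:", names_1900["births"].sum())
print("Total births recorded in 2000:", names_2000["births"].sum())
```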

Part B

Hackathon Quiz 23rd October 2016 Baby Names

Attached are two files for US baby names in 1900 and 2000


* Amazon best sellers 2014
* Froud ships 1907

Running an Analysis of Variance

Carrying on from the hypothesis developed in Developing a Research Question, I am trying to ascertain whether there is a statistically significant relationship between the location and the sale price of a house in Ames, Iowa. I have chosen to explore this in Python. The tools used are pandas, numpy, and statsmodels.

Load in the data set and ensure the variables of interest are converted to numbers or categories where necessary. I decided to use ANOVA (Analysis of Variance) for the test and Tukey HSD (Tukey Honest Significant Difference) for post-hoc testing of my data set and hypothesis.
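A minimal sketch of that loading and preparation step, assuming the Kaggle training file is saved locally as train.csv:

```python
# A minimal sketch of loading the Ames data and preparing the variables of interest,
# assuming the Kaggle training file is saved as "train.csv".
import pandas as pd

data = pd.read_csv("train.csv")

# Keep the two variables of interest and give them the right types.
data = data[["Neighborhood", "SalePrice"]].copy()
data["Neighborhood"] = data["Neighborhood"].astype("category")
data["SalePrice"] = pd.to_numeric(data["SalePrice"])

print(data["Neighborhood"].nunique(), "neighbourhoods in the dataset")
```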

This tells us that there are 25 neighbourhoods in the dataset.

We can create our ANOVA model with the smf.ols function, using the formula notation to model SalePrice (the dependent variable) against Neighborhood (the independent variable). We can then get the model fit using the fit function on the model and use the summary function to get our F-statistic and its associated p-value, which we hope will be less than 0.05 so that we can reject our null hypothesis that there is no significant association between neighbourhood and sale price, and accept our alternative hypothesis that there is a significant relationship.
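A minimal sketch of that model, continuing from the data frame loaded above:

```python
# A minimal sketch of the ANOVA model with statsmodels, continuing from "data" above.
import statsmodels.formula.api as smf

model = smf.ols(formula="SalePrice ~ C(Neighborhood)", data=data)
results = model.fit()
print(results.summary())  # reports the F-statistic and its associated p-value
```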

We get the output below, which tells us that for 1460 observations, with an F-statistic of 71.78, the p-value is 1.56e-225, meaning the chance of seeing this result by chance is vanishingly small – 224 zeros after the decimal point followed by 156. So we can safely reject the null hypothesis and accept the alternative hypothesis. Our adjusted R-squared is also 0.538, so neighbourhood alone explains nearly 54% of the variance in sale price. So our alternative hypothesis stands: there IS a significant relationship between sale price and location (neighbourhood).

We know there is a significant relationship between neighbourhood and sale price, but we don’t know which neighbourhoods differ – remember we have 25 of them that can differ from each other. So we must do some post-hoc testing. I will use Tukey HSD for this investigation.

We can check the reject column below to see whether we should reject the null hypothesis for any pair of neighbourhoods – but with 25 neighbourhoods there are 25*24/2 = 300 pairwise relationships to check, so there is a lot of output. Note we can also output a box-plot to help visualise this – see below the data for this output.
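A minimal sketch of the Tukey HSD step, again continuing from the data frame above:

```python
# A minimal sketch of the Tukey HSD post-hoc test, continuing from "data" above.
import statsmodels.stats.multicomp as multi

tukey = multi.MultiComparison(data["SalePrice"], data["Neighborhood"])
print(tukey.tukeyhsd().summary())  # 300 pairwise rows; check the "reject" column
```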

To visualise this we can use the pandas boxplot function, although we probably have to tidy up the labels on the neighbourhood (x) axis:
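A minimal sketch of that box-plot:

```python
# A minimal sketch of the box-plot of sale price by neighbourhood, continuing from "data".
import matplotlib.pyplot as plt

ax = data.boxplot(column="SalePrice", by="Neighborhood", rot=90, figsize=(14, 6))
ax.set_xlabel("Neighborhood")
ax.set_ylabel("SalePrice")
plt.suptitle("")       # drop the automatic pandas grouping title
plt.tight_layout()
plt.show()
```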

[box_plot]

Developing a Research Question

While trying to buy a house in Dublin I realised I had no way of knowing whether I was paying a fair price for a house, getting it for a great price, or over-paying for it. The data scientist in me would like to develop an algorithm, a hypothesis, a research question, so that my decisions are based on sound science and not on gut instinct. So for the last couple of weeks I have been developing algorithms to determine this fair price value. My research question is:
Is house sales price associated with socio-economic location?

I stumbled upon similar research from 2009 by Dean De Cock on determining house prices for Ames, Iowa, so that is the data set I will use. See the Kaggle page House Prices: Advanced Regression Techniques to get the data.

I would like to study the association between the neighborhood (location) and the house price, to determine whether location influences the sale price and whether the difference in means between locations is significant.

This dataset has 79 independent variables, with sale price as the dependent variable. Initially I am focusing on only one independent variable – the neighborhood – so I can reduce the dataset down to two variables, to simplify the computation my analysis of variance needs to perform.

Now that I have determined I am going to study location, I decide that I might further want to look at bands of house size: not just the house size (square footage) itself, but categories of square footage – less than 1000 square feet, 1000 to 1250, 1250 to 1500, and greater than 1500 – to see if there is a variance in the mean among these categories, as sketched below.
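A minimal sketch of that banding with pandas, using the Ames above-ground living area column (GrLivArea in the Kaggle file):

```python
# A minimal sketch of banding the above-ground living area into size categories,
# assuming the Kaggle training file is saved as "train.csv".
import pandas as pd

data = pd.read_csv("train.csv")
bins = [0, 1000, 1250, 1500, float("inf")]
labels = ["<1000", "1000-1250", "1250-1500", ">1500"]
data["SizeBand"] = pd.cut(data["GrLivArea"], bins=bins, labels=labels)

print(data.groupby("SizeBand")["SalePrice"].mean())  # mean sale price per band
```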

I can now take the above ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage for first floor, second floor, basement etc…

I then searched Google Scholar, Kaggle, and the DBS library for previous studies in these areas, finding: a paper from 2001 discussing previous research in Dublin – however, it was done in 2001, when a bubble was about to begin and the big property crash of 2008 had not yet been conceived of – http://www.sciencedirect.com/science/article/pii/S0264999300000407
Secondly, Dean De Cock’s research on house prices in Iowa: http://ww2.amstat.org/publications/jse/v19n3/decock.pdf

Based on my literature review, I believe that there might be a statistically significant association between house location (neighborhood) and sale price. Secondly, I believe there will be a statistically significant association between size bands (square-footage bands) and sale price. I further believe there might be an interaction effect between location and square-footage bands on sale price, which I would like to investigate too.

So I have developed three null hypotheses:
* There is NO association between location and sales price
* There is NO association between bands of square footage and sales price
* There is NO interaction effect in association between location, bands of square footage and sales price.

Running a LASSO Regression Analysis

A lasso regression analysis was conducted to identify a subset of variables, from a pool of 79 categorical and quantitative predictor variables, that best predicted a quantitative response variable measuring Ames, Iowa house sale price, and to improve interpretability of the selected model by keeping fewer predictors. Categorical predictors included house type, neighbourhood, and zoning type. Quantitative predictor variables included lot area, above-ground living area, first-floor area, and second-floor area, while counts were used for the number of bathrooms and the number of bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
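A minimal sketch of that split and estimation with scikit-learn’s LassoLarsCV; the predictor list here is a hypothetical subset of the Kaggle columns, not necessarily the exact set used in the original analysis:

```python
# A minimal sketch of the LASSO (LARS algorithm, 10-fold cross-validation).
# The predictor subset below is illustrative, not the exact original variable list.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsCV

data = pd.read_csv("train.csv")

quantitative = ["LotArea", "GrLivArea", "1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr"]
categorical = ["BldgType", "Neighborhood", "MSZoning"]
X = pd.get_dummies(data[quantitative + categorical], columns=categorical)
y = data["SalePrice"]

# Standardize all predictors to mean 0 and standard deviation 1.
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

print((model.coef_ != 0).sum(), "predictors retained")
print("Training R-squared:", model.score(X_train, y_train))
print("Test R-squared:", model.score(X_test, y_test))
```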

Figure 1. Change in the validation mean square error at each step:

[regression_coef_prog]
[mean_squared_error]

Of the 33 predictor variables, 13 were retained in the selected model; overall quality, above-ground living space, and garage capacity (cars) were the 3 most important during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and the model performed even better, at 81%, on the test set.