Johns Hopkins R Programming Course introduces SWIRL

While reviewing Coursera’s excellent Johns Hopkins R Programming course this evening, delivered by Roger D. Peng, I was introduced to swirl.

What, I hear you ask, is swirl? It is an interactive R tutorial that can be run from R or RStudio.

Previously I have had my students run the Try R interactive tutorial from Code School, but I think from now on it will be swirl.

To start the swirl interactive R tutorial, open up R or RStudio, type the following, and enjoy the hours spent practising R:
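The standard route is to install swirl from CRAN and then launch it:

```r
install.packages("swirl")  # one-off install from CRAN
library(swirl)
swirl()                    # follow the prompts to choose and start a course
```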

I must say I am really impressed with this course, and it only costs $43 per month. I managed to get through the first 3 weeks of lectures, quizzes, and practicals this evening. This is the second of nine courses in the specialisation. I am so impressed that I bought Professor Peng’s books, course notes, videos, and datasets. If there were t-shirts, I would have bought one too.

Visualising Uber Ride Share Data in R

Fivethirtyeight.com have released a dataset to Kaggle, obtained through a series of freedom of information requests to the New York City Taxi and Limousine Commission (TLC), called Uber Pickups in New York City. They want us Kagglers to investigate the data, and one Kaggler, Rob Harrand, came up with a kernel called Uber-Duper animation. During our DBS Analytics meetup yesterday every possible pun on Uber was used, and we looked at and used a few of the kernels, with some ideas on how we could improve on them.

[Animated GIF: Uber pickups in New York City, one month of 2014]

This can be generated with the following R code.
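A minimal sketch of the idea (not Rob Harrand’s exact kernel), assuming the April 2014 file from the Kaggle dataset with its Lat and Lon columns:

```r
library(animation)  # needs ImageMagick installed, as noted below

# One month of pickups; Lat/Lon are the column names in the Kaggle files
apr <- read.csv("uber-raw-data-apr14.csv")

saveGIF({
  # add another 25,000 pickups to the plot in each frame
  for (i in seq(25000, nrow(apr), by = 25000)) {
    plot(apr$Lon[1:i], apr$Lat[1:i], pch = ".", col = "blue",
         xlab = "Longitude", ylab = "Latitude",
         main = "Uber pickups, April 2014")
  }
}, movie.name = "uber_apr14.gif", interval = 0.2)
```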

Note that ImageMagick must be installed on your computer for the GIF encoding to work.

To install it on a Mac, the usual route is Homebrew: `brew install imagemagick`. On Windows, download ImageMagick from imagemagick.org and make sure it is added to the PATH.

As a programmer I immediately realise that this can be improved. The animated GIF is being generated from only one month of 2014 data, but there are six months of data, so loading in the six datasets and binding them into one will let me create an image spanning all six months; I could even change the colours as the months change, and give the x and y axes proper names for Longitude and Latitude. I also notice that the function plots every 25,000 data points, so that out of 1 million data points it generates 40 images and merges them into one animated GIF. When all six datasets are merged there are in excess of 4.5 million observations, which would mean 180 images merged into one GIF; with 40 images the file was almost 2 MB, so this could be a 9 MB file. Perhaps I can generate a frame per 250,000 points instead. So I parameterise this offset and convert the call to the animation saveGIF code into a function called generate_uber_plot, with parameterised colours for the months and an offset to control the number of frames in the animation, creating the following animation.
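Roughly what that function looks like; the name generate_uber_plot comes from the description above, while the month_colours argument and the preprocessing comments are my own reconstruction rather than the exact code:

```r
library(animation)

# Bind the six monthly files into one data frame first, e.g.:
#   files   <- list.files(pattern = "uber-raw-data-[a-z]{3}14\\.csv")
#   pickups <- do.call(rbind, lapply(files, read.csv))
#   pickups$month <- as.integer(sub("/.*", "", pickups$Date.Time))  # 4..9
# (sort the files chronologically first if the build-up order matters)

generate_uber_plot <- function(pickups, offset = 250000,
                               month_colours = rainbow(6),
                               file = "uber_6_months.gif") {
  saveGIF({
    for (i in seq(offset, nrow(pickups), by = offset)) {
      plot(pickups$Lon[1:i], pickups$Lat[1:i], pch = ".",
           col = month_colours[pickups$month[1:i] - 3],  # months 4..9 -> colours 1..6
           xlab = "Longitude", ylab = "Latitude",
           main = "Uber pickups in NYC, April to September 2014")
    }
  }, movie.name = file, interval = 0.2)
}
```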

[Animated GIF: Uber pickups across all six months of 2014, coloured by month]

Data is the New Oil – Old News

Twice in the last week I have been at conferences or awards, listening to speakers state that ‘Data is the new oil’ and then stand back expecting people to look at them in awe, as if they were Einstein putting forth the Theory of Relativity or Archimedes shouting ‘Eureka’ (I realise the latter may not have happened the way the myth tells it).

Please, this is nothing new. It is 10 years since Clive Humby, of Tesco Clubcard fame, coined the term in Data is the New Oil (DITNO). It is 16 years since Gartner’s Doug Laney developed his 3 V’s of Big Data in a paper called 3-D Data Management: Controlling Data Volume, Velocity and Variety. Six years ago David McCandless, of Information is Beautiful fame, at least had the decency in his TED talk to refer to DITNO and expand on it with his own thought: data is the new soil.

So please: DITNO is 10 years old, and although it is more important today than it ever was, stop putting it forward as if it were ground-breaking. Move on, develop your own theory. Don’t just agree with the 3 V’s; do some research and, like David above, create your own DITNS, or do as one of my students, Svetlana, did and critique the 14 V’s of Big Data, as in her excellent research paper 3 V’s and Beyond – The Missing V’s in Big Data?. Can you find a 15th V, or an interesting concept of what big data is?

Enlighten me! I will have over 100 students this year telling me what Big Data is and isn’t, so make the paper interesting, make me think: this person ‘really gets it’. I will not name and shame the two speakers who used the clichés in their talks. Perhaps it was news to the other delegates and it was just me, bored by the staleness of their talks. I hope my students don’t think that about my lectures; time to freshen up my material.

Trump for President? It’s Beginning to Look Like It!

I should be in bed, I know; I have a lot of work tomorrow. I am not one to contradict the great Nate Silver and fivethirtyeight.com, but has Trump just won this US presidential election?

It is 3:10am (10:10pm EST) and as I look at the numbers, does Trump have enough electoral votes? The TV has him at 150; CBS News seems to have prematurely given Virginia’s 13 votes to Clinton, so it looks like 150 to 109 (although CBS have 122).

So Trump looks to have nailed Georgia (16), North Carolina (15), Michigan (16), and Ohio (18). That is 65 more, so he would now only need 55. He is leading, though it is very close, in Wisconsin (10), Arizona (11), and Florida (29). Those would leave him needing only 5 more votes, and he is expected to win Alaska, Idaho, and Utah comfortably, which would give him 13.

Does that have him winning by 18, with Trump on 278 and Hillary on 260?
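The back-of-the-envelope sum, as a quick R sanity check (the individual Alaska, Idaho and Utah allocations are my breakdown of the 13):

```r
trump_base <- 150                                # where the TV had him at ~3am
called  <- c(Georgia = 16, `North Carolina` = 15, Michigan = 16, Ohio = 18)
leaning <- c(Wisconsin = 10, Arizona = 11, Florida = 29)
safe    <- c(Alaska = 3, Idaho = 4, Utah = 6)

trump <- trump_base + sum(called) + sum(leaning) + sum(safe)
trump         # 278
538 - trump   # 260 left for Clinton, a margin of 18
```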

At what point does he not even need Florida?

I am not even including in these projections the 80 electoral votes of Nevada and the West Coast states of California, Oregon, and Washington.

What is the p-value when Hypothesis Testing?

The p-value helps us determine the significance of our results. A hypothesis test is used to test the validity of a claim made about the overall population. The claim being tested is the null hypothesis; the claim we would conclude holds if the null hypothesis is found to be false is the alternative hypothesis.

Hypothesis tests use the p-value to weigh the strength of the evidence (what the data tell us about the overall population).

* A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis in favour of the alternative hypothesis: a statistically significant outcome.

* A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis (strictly speaking you never “accept” it; you simply lack the evidence to reject it): no statistical significance.

* p-values very close to the cut-off (0.05) are considered marginal (they could go either way); further sampling should be performed where possible. A tiny helper summarising this rule is sketched below.
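Something like the following captures the decision rule (the p_decision helper is illustrative, not a standard function):

```r
p_decision <- function(p, alpha = 0.05) {
  # report the p-value alongside the decision so readers can judge for themselves
  if (p <= alpha) {
    sprintf("p = %.3f: reject the null hypothesis (statistically significant)", p)
  } else {
    sprintf("p = %.3f: fail to reject the null hypothesis (not significant)", p)
  }
}

p_decision(0.001)  # reject
p_decision(0.30)   # fail to reject
p_decision(0.049)  # technically significant, but marginal -- sample further if you can
```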

If possible, test further samples and hope that the p-value comes out clearly smaller than 0.05 (or clearly bigger) each time; with an alpha of 0.05 we accept that roughly 1 in 20 tests of a true null hypothesis will appear significant by chance, so consistent results across repeated samples give us much more confidence in rejecting, or failing to reject, the null hypothesis.

Always report the p-value so your audience can draw their own conclusions.

If you insist on an alpha other than 0.05, say 0.01 (a 99% confidence level), you are enforcing a different cut-off value, and the rules above for 0.05 apply with 0.01 instead.

You are making it harder to find a statistically significant result; in other words, you are saying that you need stronger evidence before you will reject the null hypothesis and accept the alternative.

Alternatively, if you set alpha to 0.1 you make it easier to find a statistically significant result, and you may over-optimistically reject the null hypothesis because the p-value happens to be, say, 0.08.

Changing alpha up or down like this changes your exposure to Type I errors (rejecting the null hypothesis when you should have failed to reject it: a false positive) and Type II errors (failing to reject the null hypothesis when you should have rejected it: a false negative). A smaller alpha makes Type I errors less likely but Type II errors more likely, and vice versa.

A simple example that springs to mind of why, even at what feels like 100% certainty, we might still fail to reject the null hypothesis is a man called Thomas, when he hears that Jesus has risen from the dead and appeared to the other ten apostles. Out of the population of all possible apostles not named Thomas who could have seen a person rise from the dead (Judas was dead by this stage), Thomas still doubted. Why? Because his null hypothesis was that people simply do not resurrect themselves from the dead (it had never happened before, and I don’t think it has happened since), and unless he saw it with his own eyes he would never reject his null hypothesis; no amount of talk from the other ten would make it statistically significant. Once Jesus appeared to him, he was able to reject his null hypothesis and accept the alternative hypothesis: that this was a statistically significant event and that Jesus had in fact risen from the dead.

Another way to look at it: 10 of the 11 apostles witnessed it, giving 91% of apostles who saw and 9% (Thomas) who did not, a ‘p-value’ of 0.09; at an alpha of 0.05, that meant all 11 would have to see it for Thomas to believe, therefore 11/11.

A less tongue-in-cheek example from the web: Apache Pizza claim their delivery times are 30 minutes or less on average, but you think they are longer than that. You conduct a hypothesis test because you believe the null hypothesis, H0, that the mean delivery time is at most 30 minutes, is incorrect. Your alternative hypothesis (Ha) is that the mean time is greater than 30 minutes. You randomly sample some delivery times and run the data through the hypothesis test, and your p-value turns out to be 0.001, which is much less than 0.05. In real terms, there is only a probability of 0.001 of seeing a sample like yours if the pizza place’s claim of 30 minutes or less were true. Since we are typically willing to reject the null hypothesis when this probability is less than 0.05, you conclude that the pizza place is wrong; their delivery times are in fact more than 30 minutes on average, and you want to know what they’re gonna do about it! (Of course, you could be wrong, by having sampled an unusually high number of late pizza deliveries just by chance.)
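In R that test is a one-sided, one-sample t-test; here is a quick sketch with simulated delivery times (the numbers are made up purely for illustration):

```r
set.seed(42)
delivery_times <- rnorm(30, mean = 34, sd = 6)   # 30 sampled deliveries, in minutes

# H0: mean delivery time <= 30 minutes; Ha: mean delivery time > 30 minutes
t.test(delivery_times, mu = 30, alternative = "greater")
# a sample mean well above 30 gives a small p-value, echoing the 0.001 above
```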

Can You Recognise a Ghost, Ghoul, or Goblin?

Being an Irish child born hours after Halloween (All Hallows’ Eve) means being born on All Hallows’ Day, or All Saints’ Day as we call it now that we are no longer pagan in Ireland. Oíche Shamhna, as we say in Gaelic, and it is the Gaelic word Samhain (November) which gives English the word samhainophobia, the morbid fear of Halloween. Well, this is topical at this time of year, and Kaggle have created a lovely problem to solve, to ease people’s samhainophobia and to help us spot a ghost, a ghoul, or a goblin.

Perfect for budding new R students to practice some data analytics in R.
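As a starting point, something like this gets a first model running, assuming the competition’s train.csv layout (features such as bone_length, rotting_flesh, hair_length, has_soul and color, with type as the target); check the Kaggle data page for the exact schema:

```r
library(randomForest)

train <- read.csv("train.csv")
train$type  <- as.factor(train$type)    # ghost, ghoul or goblin
train$color <- as.factor(train$color)

# a simple random forest as a baseline classifier
fit <- randomForest(type ~ bone_length + rotting_flesh + hair_length +
                      has_soul + color, data = train)
fit   # the out-of-bag error gives a first feel for how separable the creatures are
```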

Machine Learning for Data Analysis Course Passed

Today I passed Wesleyan University’s Machine Learning for Data Analysis course on Coursera. It was a great Python and SAS course, and is part 4 of their Data Analysis and Interpretation Specialisation, so only the Capstone project is left for me to do. The lecturers, Lisa Dierker and Jen Rose, know their stuff, and the practicals each week are fun to do. This month’s Programming for Big Data course in DBS will contain some of the practicals and research I did for this course.

Cluster Analysis of the Iris Dataset

A k-means cluster analysis was conducted to identify underlying subgroups of irises based on their similarity across 4 variables representing petal length, petal width, sepal length, and sepal width. The 4 clustering variables were all quantitative, and all were standardised to have a mean of 0 and a standard deviation of 1.

Data were randomly split into a training set that included 70% of the observations (N=105) and a test set that included 30% of the observations (N=45). A series of k-means cluster analyses were conducted on the training data specifying k=1-5 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the five cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
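The course work itself was done in Python, but the same analysis in R looks roughly like this (a sketch, not the assignment code):

```r
set.seed(123)

iris_std  <- scale(iris[, 1:4])                           # mean 0, sd 1
train_idx <- sample(nrow(iris_std), 0.7 * nrow(iris_std)) # 105 observations
train     <- iris_std[train_idx, ]

# variance in the clustering variables explained by the clusters (r-square) for k = 1..5
r2 <- sapply(1:5, function(k) {
  km <- kmeans(train, centers = k, nstart = 25)
  km$betweenss / km$totss
})

plot(1:5, r2, type = "b",
     xlab = "Number of clusters (k)", ylab = "Variance explained (r-square)",
     main = "Elbow curve, standardised Iris training data")
```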

Figure 1. Elbow curve of r-square values for the five cluster solutions.


The elbow curve was fairly conclusive, suggesting a natural 3-cluster solution that might be interpreted. The results below are for an interpretation of the 3-cluster solution.

A scatterplot of the four variables (reduced to 2 principal components) by cluster (Figure 2, shown below) indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, although they overlapped a little with each other; clusters 1 and 2 were generally distinct but close to each other. Observations in cluster 0 were more spread out than in the other clusters, showing higher within-cluster variance, but with no overlap with the other clusters (the Euclidean distance between this cluster and the other two being quite large). The results of this plot suggest that the best cluster solution would have 3 clusters.
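That view can be reproduced with something along these lines, again an R sketch rather than the course code, using the full standardised dataset for brevity rather than the 70% training split:

```r
set.seed(123)
iris_std <- scale(iris[, 1:4])

km3 <- kmeans(iris_std, centers = 3, nstart = 25)   # the 3-cluster solution
pcs <- prcomp(iris_std)$x[, 1:2]                    # first two principal components

plot(pcs, col = km3$cluster, pch = 19,
     xlab = "First principal component", ylab = "Second principal component",
     main = "Iris observations by k-means cluster (k = 3)")
```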

Figure 2. Plot of the first two principal components of the clustering variables, by cluster.


We can see that the data belonging to the Setosa species were grouped into cluster 0, Versicolor into cluster 2, and Virginica into cluster 1. The first principal component was based primarily on petal length and petal width, and secondarily on sepal length and sepal width.