Can you judge a movie by its online presence?

We are told not to judge a book by its cover, but is it possible to judge a movie by its online presence?

The dataset behind this hackathon idea is hosted on Kaggle and was made available by Chuan Sun to allow budding data scientists to test out their ideas. See https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset for more information.

This morning the DBS Hackathon group will investigate this dataset.

Anonymised Student Swirl Tutorial Dataset

People have been taking part in the DBS Hackathons for over a year now and, if not great fun, they at least appear to be popular. I have also fully embraced swirl (Interactive R Learning) and have built it into my students’ continuous assessments. Once the data is all collected, it provides enough information to do some analytics. I did promise to make the data available in an anonymised fashion, so below are 514 different submissions by students over the last month or so.

The dataset provided shows the course completed, an anonymised unique id for the student, an anonymised email address, the date and time completed and lastly whether the student was male or female.

Swirl makes my DBS lectures and hackathons practical

Swirl really does make the DBS hackathons and lectures practical.

Go to R Studio and run the following 7 lines of R code.

Welcome to the world of interactive learning via R for Data Analytics, Data Mining, and R Programming.

There are currently 27 tutorials in the 4 modules, for which you can claim credit on my course in Dublin Business School – or you can just do them for fun.

Each module is interactive and practical, and will teach you the subject with some great examples. To date 220 tutorials have been completed and credit claimed for them.

Enjoy!

Johns Hopkins R Programming Course introduces SWIRL

While reviewing Coursera’s excellent Johns Hopkins R Programming course this evening, delivered by Roger D. Peng, I was introduced to swirl.

What, I hear you ask, is swirl? Well, it is an interactive R tutorial which can be run from R or R Studio.

Previously I’ve been getting my students to run the Try R interactive tutorial from Code School, but I think from now on it will be swirl.

To enter into the swirl interactive R tutorial – open up R or R Studio and type the following and enjoy the hours spent practicing R:
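The listing itself is not reproduced here. As a minimal sketch, assuming a standard R installation with access to CRAN, swirl is usually installed and started along these lines. The wrapper name start_swirl is hypothetical and is only there so that the snippet does nothing in a non-interactive session:

```r
# Hypothetical wrapper (not from the original post): installs swirl from CRAN
# if it is missing, then launches the interactive tutorial.
start_swirl <- function() {
  if (!requireNamespace("swirl", quietly = TRUE)) {
    install.packages("swirl")  # needs internet access to reach CRAN
  }
  library(swirl)               # attach the package
  swirl()                      # start the interactive lesson menu
}

# Only launch when a person is at the console (R or R Studio).
if (interactive()) {
  start_swirl()
}
```

Typing the three calls directly at the console (install.packages("swirl"), library(swirl), swirl()) has the same effect.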

I must say I am really impressed with this course, and it only costs $43 per month. I managed to get through the first 3 weeks of lectures, quizzes, and practicals this evening. This is the second of nine courses in this specialisation. I am so impressed that I bought Professor Peng’s books, course notes, videos, and datasets. If there were t-shirts, I would have bought one too.

Visualising Uber Ride Share Data in R

FiveThirtyEight have released a dataset to Kaggle, called Uber Pickups in New York City, that they received as a result of a series of freedom of information requests to the NYC Taxi & Limousine Commission (TLC). They want us kagglers to investigate the data, and one kaggler, Rob Harrand, came up with a kernel called Uber-Duper animation. During our DBS Analytics meetup yesterday every single pun on uber was used, and we looked at and used a few of the kernels, with some ideas on how we could improve on them.

[Animation: uber]

This can be generated with the following R code.

Note that ImageMagick must be installed on your computer to be able to do this.

To install on a Mac:

On Windows, download ImageMagick and make sure it is added to the PATH.

As a programmer I immediately realise that this can be improved. The animated gif covers only one month of data for 2014, but there are six months of data, so loading the six datasets and binding them into one will allow me to create an animation over all six months. I could even change the colours as the months change, and I can give the x and y axes proper names: Longitude and Latitude.

I also notice that the function steps through the data 25,000 points at a time, so from 1 million data points it generates 40 images and merges them into one animated gif. When all six datasets are merged there are in excess of 4.5 million observations, which means 180 images merged into one gif file. With 40 images the file was almost 2 MB, so this could be a 9 MB file. Perhaps I can generate a frame per 250,000 points instead. So I parameterise this offset and convert the call to the animation saveGIF code into a function called generate uber plot, with parameterised colours for the months and an offset to change the number of frames in the animation, thus creating the following animation:

[Animation: uber]
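The frame arithmetic above can be checked with a short base R sketch. It is not the kernel’s code: the helper names frame_count and estimate_gif_mb are hypothetical, the row counts are the round figures quoted above, and the size estimate assumes every frame costs roughly the same number of bytes.

```r
# Hypothetical helper: number of frames when one frame is drawn
# for every `step` observations, starting from the first row.
frame_count <- function(n_obs, step) {
  length(seq(1, n_obs, by = step))
}

# Hypothetical helper: scale a known gif size linearly with the frame count.
# Assumes each frame adds about the same amount to the file, which is rough.
estimate_gif_mb <- function(frames, ref_frames = 40, ref_mb = 2) {
  ref_mb * frames / ref_frames
}

one_month  <- frame_count(1e6,   25000)   # 40 frames
six_months <- frame_count(4.5e6, 25000)   # 180 frames
coarser    <- frame_count(4.5e6, 250000)  # 18 frames

estimate_gif_mb(six_months)  # about 9 MB
estimate_gif_mb(coarser)     # about 0.9 MB
```

Passing the offset in as a parameter, rather than hard-coding 25000, is what makes it easy to trade animation smoothness against file size.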

Data is the New Oil – Old News

Twice in the last week I have been at conferences or awards and had to listen to a speaker state that ‘Data is the new oil’ and then stand back, expecting people to look at them in awe as if they were Einstein putting forth the Theory of Relativity, or Archimedes shouting ‘Eureka’ (I realise the latter may not have happened the way the myth tells it).

Please, this is nothing new. It is 10 years since Clive Humby of Tesco Clubcard fame wrote a paper describing this term in Data is the New Oil (DITNO). It is 16 years since Gartner’s Doug Laney developed his theory on the 3 V’s of Big Data in a paper called 3-D Data Management: Controlling Data Volume, Velocity and Variety. David McCandless of Information is Beautiful fame, in a TED talk 6 years ago, at least had the decency to refer to DITNO and expand on the theory with his thoughts on Data is the New Soil (DITNS).

So please, DITNO is 10 years old. It is more important today than it ever was, but please stop putting this theory forward as if it is ground-breaking. Move on and develop your own theory. Don’t just agree with the 3 V’s: do some research and, like David above, create your own DITNS, or do as one of my students, Svetlana, did when she critiqued 14 V’s of Big Data in her excellent research paper, 3 V’s and Beyond – The Missing V’s in Big Data? Can you find a 15th V or an interesting concept on what big data is?

Enlighten me! I will have over 100 students this year telling me what Big Data is and isn’t – so make the paper interesting, make me think – this person ‘really gets it’. I will not name and shame the two speakers that used the cliches in their talks. Perhaps it was news to the other delegates and it was just me bored by the staleness of their talks. I hope my students don’t think that about my lectures – time to freshen up my material.