Swirl makes my DBS lectures and hackathons practical

Swirl really does make the DBS hackathons and lectures practical.

Go to RStudio and run the following lines of R code.
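
A minimal version of the setup might look like this – note the GitHub account and course names below are placeholders, not the real DBS repository:

# Install swirl from CRAN (one-time) and load it
install.packages("swirl")
library(swirl)

# Install the course from GitHub – the account and course names here
# are placeholders for the actual DBS course repository
install_course_github("dbs-analytics", "Data_Analytics")

# Launch swirl and follow the prompts to pick a module
swirl()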

Welcome to the world of interactive learning via R for Data Analytics, Data Mining, and R Programming.

There are currently 27 tutorials across the 4 modules, which you can claim credit for against my course in Dublin Business School – or you can just do them for fun.

Each module is interactive and practically based, and will teach you the subject with some great examples. To date, 220 tutorials have been completed and credit claimed for them.

Enjoy!

Wesleyan’s Regression Modeling in Practice – Week 2

Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and the ground floor living area.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable difference between the mean and median sale price – almost 17000, or just under 10% of our mean.
So we can center the variables as follows:
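
In R, the centering might look like this minimal sketch (assuming the Kaggle train.csv file and its SalePrice and GrLivArea column names):

# Read the Kaggle training data (file and column names per the competition)
house <- read.csv("train.csv")

# Compare mean and median sale price to see the gap described above
mean(house$SalePrice)
median(house$SalePrice)

# Center sale price and living area by subtracting their means
house$SalePrice_c <- house$SalePrice - mean(house$SalePrice)
house$GrLivArea_c <- house$GrLivArea - mean(house$GrLivArea)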

[Figures: sale price histogram; sale price vs ground living area]

Looking at the graphs and summary statistics, my hypothesis holds up better than I expected. Remember, the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space; the alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable, it is striking that both my R-squared and adjusted R-squared come in at .502 – a little over 50% of the variance in sale price is explained with just one explanatory variable.

My p-value of 4.52e-223 is far less than .05, so the model shows a statistically significant linear relationship between sale price and ground floor living area; I can reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space coefficient (p-value < 2e-16) appear to be contributing to the significance – both p-values are 0.000 to 3 decimal places, and both t-values are greater than zero, so it is a positive linear relationship.

From the graph, the sale price data appears to be skewed – the mean sits at -1124 rather than at zero (where we'd like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).

Note the linear regression can also be done in R as follows:
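
A minimal sketch with lm(), again assuming the train.csv file and column names from the Kaggle competition:

# Read the data and fit sale price against ground living area
house <- read.csv("train.csv")
model <- lm(SalePrice ~ GrLivArea, data = house)
summary(model)  # coefficients, p-values, R-squared

# Histogram of sale price, and residuals checked against normality
hist(house$SalePrice)
qqnorm(resid(model))
qqline(resid(model))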

[Figures: sale price histogram; normal Q-Q plot of sale price]

To improve the performance of my model I now need to look at treating multiple explanatory variables, which will be done in next week's blog post.

Washington’s Regression Analysis – Assignment 1

First assignment done for the University of Washington's Machine Learning Foundations course in regression analysis. There were 9 questions to answer after working through the slides and practicals for week 1. The pass rules are interesting – one has until the 9th of October to score above 80%, so 8 out of 9 correct are required, and one can attempt the assignment at most 3 times in every 8-hour period. Anyway, I got the following 9 questions correct on the first attempt.

Q1. Which figure represents an overfitted model?

[Figure: fitting_samples]

Q2. True or false: The model that best minimizes training error is the one that will perform best for the task of prediction on new data.

Q3. The following table illustrates the results of evaluating 4 models with different parameter choices on some data set. Which of the following models fits this data the best?

Model index   Parameters (intercept, slope)   Residual sum of squares (RSS)
1             (0, 1.4)                        20.51
2             (3.1, 1.4)                      15.23
3             (2.7, 1.9)                      13.67
4             (0, 2.3)                        18.99
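
For reference, the RSS of a candidate line is just the sum of squared vertical distances between the data and the line – the lower, the better the fit. A quick R sketch with made-up data points:

# Hypothetical data – the assignment's actual points are not reproduced here
x <- c(0, 1, 2, 3, 4)
y <- c(1, 3, 7, 7, 11)

# Residual sum of squares for a line with a given intercept and slope
rss <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

# Score each candidate (intercept, slope) pair; the lowest RSS fits best
rss(0, 1.4)
rss(3.1, 1.4)
rss(2.7, 1.9)
rss(0, 2.3)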

Q4. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? (Note: you must select all parameters estimated as 0 to get the question correct.)

[Figure: linear_regression]

Q5. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? (Note: you must select all parameters estimated as 0 to get the question correct.)

[Figure: linear_regression2]

Q6. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? (Note: you must select all parameters estimated as 0 to get the question correct.)

[Figure: linear_regression3]

Q7. Assume we fit the following quadratic function: f(x) = w0+w1*x+w2*(x^2) to the dataset shown (blue circles). The fitted function is shown by the green curve in the picture below. Out of the 3 parameters of the fitted function (w0, w1, w2), which ones are estimated to be 0? (Note: you must select all parameters estimated as 0 to get the question correct.)

[Figure: linear_regression4]

Q8. Would you not expect to see this plot as a plot of training and test error curves?

[Figure: training and test error curves]

Q9. True or false: One always prefers to use a model with more features since it better captures the true underlying process.
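
The theme running through these questions – training error versus generalisation – is easy to demonstrate; a small R sketch on simulated data:

set.seed(42)

# Simulate a noisy quadratic relationship
x <- runif(60, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(60, sd = 2)

# Hold out a third of the points as a test set
test_idx <- sample(seq_along(x), 20)
train <- data.frame(x = x[-test_idx], y = y[-test_idx])
test  <- data.frame(x = x[test_idx],  y = y[test_idx])

# Training RSS always falls as the polynomial degree grows;
# test RSS eventually rises again once the model overfits
for (degree in 1:8) {
  fit <- lm(y ~ poly(x, degree), data = train)
  cat(degree,
      "train RSS:", round(sum(resid(fit)^2), 1),
      "test RSS:",  round(sum((test$y - predict(fit, test))^2), 1), "\n")
}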

CA3 – Cluster Analysis and Nearest Neighbour

Find or create a dataset* of roughly 200 observations suitable for K-Means cluster analysis and K-Nearest Neighbour predictions.

* The dataset should be unique with respect to your class.

Examine the dataset and separate it into a training set of a suitable size and a test set, so you can assess the effectiveness of your model.

Follow the tutorials for K-Means Clustering and K-Nearest Neighbour.
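
To illustrate the shape of the work, here is a minimal R sketch using the built-in iris data as a stand-in for your own dataset:

library(class)  # provides knn()

set.seed(1)

# K-Means on the numeric columns, asking for 3 clusters
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(clusters$cluster, iris$Species)

# Train/test split for K-Nearest Neighbour
idx  <- sample(nrow(iris), 0.7 * nrow(iris))
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])  # proportion predicted correctly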

Submit your completed work and summary as a classical paper.

CA2 – Airbnb’s Analysis of Variance

Taking the dataset from Airbnb’s New User Bookings scenario:

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings

Analyse the dataset to see if it is possible to create a model which describes where people are likely to travel on their first trip with Airbnb.

Your model can be one of the many algorithms covered on the course, or from any model you’ve come across in a previous existence.

You are required to calculate the goodness of fit of your model to your data, outlining the significant independent variables, if any.

You are also required to verify whether the data is normally distributed – and whether it should be treated parametrically or non-parametrically.
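
For instance, a quick normality check in R on one of the numeric fields might look like this – the file and column names are assumed from the Kaggle download:

# Load the users file from the Kaggle competition download (name assumed)
users <- read.csv("train_users_2.csv")
age <- na.omit(users$age)

hist(age)                 # eyeball the shape of the distribution
qqnorm(age); qqline(age)  # compare against a normal distribution
shapiro.test(sample(age, min(5000, length(age))))  # Shapiro-Wilk caps at 5000 values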

Please also examine the data set with an emphasis on data quality, and mention some data scrubbing techniques that could be used to make the data easier to work with going forward.

Your solution will include a program, runnable in R or Python, as well as a Word document outlining your research.

Your paper should be written as a formal paper.

You are also required to create a blog post on your <name>.dbsdataprojects.com blog with your research – including your runnable program script.

CA1 – Anscombe’s Quartet

Write up a report and analysis of Anscombe’s Quartet.

Describe the flaws this data set exposes in relying on Pearson’s correlation alone, without visualising the data.

Describe each of the 4 charts in a blog post roughly 500 words in length.

In your answer also come up with a new set of data points which validate Anscombe’s work.
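
As a starting point, note that R ships with the quartet as the built-in anscombe data frame, which makes it easy to reproduce the matching summary statistics:

# The built-in quartet: columns x1..x4 and y1..y4
data(anscombe)

# All four pairs share nearly identical correlations...
sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# ...and nearly identical fitted lines
coef(lm(y1 ~ x1, data = anscombe))
coef(lm(y4 ~ x4, data = anscombe))

# Yet plotting them reveals four very different patterns
par(mfrow = c(2, 2))
for (i in 1:4)
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))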