Wesleyan’s Regression Modeling in Practice – Week 2

Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see whether house sale price (the response variable) has a linear relationship with ground floor living area, my primary explanatory variable. Even though the dataset has 80 variables and 1460 observations, my hypothesis is that this single variable has a linear relationship with house sale price.

The data set, sample, procedure, and methods were detailed in week 1’s post.

There is quite a sizable difference between the mean and median – almost 17000, or just under 10% of our mean.
So we can center the variables as follows:
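
As a rough illustration of the idea (a sketch, not the exact code used for this post), centering just means subtracting a variable's mean from every value so that the centered variable has a mean of zero. The `center` helper and the toy living-area numbers below are made up for the example and are not taken from the Kaggle data.

```python
from statistics import mean


def center(values):
    """Return the values shifted so that their mean is zero."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical living areas in square feet (NOT the Kaggle data).
living_area = [856, 1262, 1786, 1717, 2198]
centered_area = center(living_area)

print(centered_area)
print(mean(centered_area))  # zero, up to floating-point rounding
```

Centering does not change the slope of a regression; it only moves the point where the explanatory variable equals zero to the sample mean, which makes the intercept easier to interpret.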


Looking at the graphs and summary statistics, my hypothesis seems to be supported better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and ground floor living space. The alternative hypothesis (H1) was that there is a statistically significant relationship. There are 79 explanatory variables and I selected only one to explain the response variable, and yet both my R-squared and adjusted R-squared are at .502, so a little over 50% of the variance in sale price is explained by just one explanatory variable.
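
To make the R-squared figure a bit more concrete, here is a small sketch of how it is computed for a one-variable model: fit the least-squares line, then compare the leftover (residual) variation with the total variation in the response. The `simple_ols` function and the toy numbers are hypothetical and only for illustration; the real analysis would use a statistics package on the full dataset.

```python
from statistics import mean


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared).
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares: variation the line does not explain.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its own mean.
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical toy data (NOT the Kaggle data): area in sq ft, price in dollars.
area = [900, 1200, 1500, 1800, 2100]
price = [120000, 150000, 210000, 190000, 260000]

slope, intercept, r_squared = simple_ols(area, price)
print(f"slope={slope:.2f} intercept={intercept:.2f} R-squared={r_squared:.3f}")
```

An R-squared of .502 therefore means the residual variation is roughly half of the total variation in sale price.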

My p-value of 4.52e-223 is far below .05, so the linear relationship between sale price and ground floor living area is statistically significant. I can therefore reject my null hypothesis in favour of my alternative hypothesis that there is a relationship between house price and ground floor living space. Both the intercept (p-value = 3.61e-05) and the ground floor living space (p-value = 2e-16) appear to be contributing to the significance, with both p-values 0.000 to 3 decimal places. Both t values are greater than zero, and since the t value for living space is positive, so is its coefficient, meaning it is a positive linear relationship.

From the graph, the sale price data appears to be skewed – the mean is -1124 from zero (where we’d like it to be), so the data was centered.

I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).

Note the linear regression can also be done in R as follows:


To improve the performance of my model I now need to look at including multiple explanatory variables, which I will do in next week’s blog post.

Will Mayo Ever Win an All-Ireland? Will Dublin win 3 in a Row?

On a bulletin board yesterday a Mayo man posed the following questions. Calculate the probabilities of:

  • Mayo winning the All Ireland within the next 65 years
  • Dublin getting three in a row

He will be delighted to know that the probability of Mayo winning an All Ireland in the next 65 years is almost 100%, no matter what way the data is sliced.

They have won 3 out of 131 All-Irelands, so approximately 1 in 44.
They have won 3 of the 15 finals they have appeared in, so 1 in 5 (.2), and they have now been in 8 in a row without winning one.
They have been in 5 out of the last 15 finals = one in 3 = (.33)
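
For anyone who wants to play with the numbers, one simple way to turn a yearly rate into a 65-year probability is to treat every year as an independent attempt with the same chance p, so the chance of at least one win in n years is 1 - (1 - p)^n. That independence is an assumption of this sketch and not something the data above proves. The combined rate (chance of reaching a final times chance of winning it) is also my own illustrative way of joining the figures, and `prob_at_least_one_win` is a made-up helper name.

```python
def prob_at_least_one_win(p_per_year, years):
    """Chance of at least one win in `years` independent attempts."""
    return 1 - (1 - p_per_year) ** years


YEARS = 65

# Yearly win rates built from the figures quoted above.
rates = {
    "3 titles in 131 attempts": 3 / 131,
    # Assumption: reach the final 5 years in 15, then win 3 finals in 15.
    "reach final (5/15) x win final (3/15)": (5 / 15) * (3 / 15),
}

for label, p in rates.items():
    print(f"{label}: {prob_at_least_one_win(p, YEARS):.3f}")
```

The answer is sensitive to which rate is plugged in, so it is worth running more than one slice of the data before quoting a figure.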

Which led me onto the Dublin question:
As of today, without putting any thought into it, the chance of the Dubs getting 3 in a row should be 1 in 33.
That is the 31 counties taking part (Kilkenny doesn’t, and they shouldn’t be allowed to hurl if they don’t play football) plus London and New York.

However Dublin only play in Leinster and winning that gets them to the quarter-final – so if they win Leinster then that is 1 in 8.
But they are not guaranteed to win Leinster – they have only won 9 out of the last 10 – so 90% chance of getting to the last 8 ->
So 9/10 * 1/8 = 9/80 = 0.1125
But this seems a bit too low to price Dublin to win next year.
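
The arithmetic so far can be checked with exact fractions. This is just a sketch of the two back-of-the-envelope estimates above; the variable names are mine, and treating all 33 teams (or all 8 quarter-finalists) as equally likely winners is the simplifying assumption behind both figures.

```python
from fractions import Fraction

# Estimate 1: every one of the 33 teams is equally likely to win.
naive = Fraction(1, 33)

# Estimate 2: win Leinster (9 of the last 10), then be one of 8
# equally likely quarter-finalists.
via_leinster = Fraction(9, 10) * Fraction(1, 8)

print(naive, float(naive))
print(via_leinster, float(via_leinster))
```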

From another view Dublin have won four of the last six = 4/6 = 2/3

But I s’pose this last calculation ignores the nerves that come with going for a threepeat – it is 93 years since Dublin did it. Kerry are the only team to have done it in the last 50 years, and they only did it twice in that time, and it has not been done in the last 30 years. Only 2 teams in the last 30 years have been in a position to do it and both failed, and this included Kerry, who got to 6 finals in a row, won 4 of the 6 and still failed to win 3 in a row.

And now, what odds would I want before placing a bet in a bookmaker’s? Probably 1 in 4 sounds right – if they can beat any two out of Kerry, Mayo, and the Ulster champions, that would win it for them.