Yes, today I passed the Machine Learning Foundations: A Case Study Approach certificate course from the University of Washington on Coursera. The course was a great introduction to GraphLab, and the modules across all six weeks were really fun to do. GraphLab let me do regression analysis, classification analysis, sentiment analysis, and machine learning generally, with easy-to-use APIs. The lecturers, Carlos Guestrin and Emily Fox, were fantastically enthusiastic, making the course really enjoyable. I look forward to rolling this knowledge into my lectures in DBS over the coming months. Hopefully I'll also have time to complete the Specialization and Capstone project on Coursera in the coming months.
Hackathon in Excel / R / Python
Today in the hackathon you can practice and learn some Excel, R, Python, and Fusion Tables to perform data manipulation, data analysis, and graphics.
In R, set your working directory with the setwd() function; in Python, os.chdir achieves the same.
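For example, a minimal sketch of the Python side (the directory path here is a placeholder, substitute your own):

import os

os.chdir('/path/to/hackathon')  # placeholder path - point this at your own folder
print(os.getcwd())              # confirm the working directory changed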
Part A
Hackathon Quiz 23rd October 2016 R
Attempt Some R Questions to practice using R.
Next we can practice reading data sets.
Attached are two files for US baby names in 1900 and 2000.
In the files you’ll see that each year is a comma-separated file with 3 columns – name, sex, and number of births.
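As a starting point, here is a pandas sketch for reading one of those files; the file name is an assumption, and the names= argument supplies the three column names in case the files have no header row (drop it if they do):

import pandas as pd

# 'yob1900.txt' is a placeholder name for the attached 1900 file
names1900 = pd.read_csv('yob1900.txt', names=['name', 'sex', 'births'])
print(names1900.head())
print(names1900['births'].sum())  # total recorded births in 1900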
Part B
Hackathon Quiz 23rd October 2016 Baby Names
Amazon best sellers 2014
Froud ships 1907
Running an Analysis of Variance
Carrying on from the hypothesis developed in Developing a Research Question, I am trying to ascertain whether there is a statistically significant relationship between location and the sale price of a house in Ames, Iowa. I have chosen to explore this in Python. The tools used are pandas, numpy, and statsmodels.
First, load in the data set and ensure the variables of interest are converted to numbers or categories where necessary. I decided to use ANOVA (Analysis of Variance) for the test and Tukey HSD (Tukey Honest Significant Difference) for post-hoc testing of my data set and my hypothesis.
import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

data = pandas.read_csv('ames_house_price.csv', low_memory=False)

# setting variables you will be working with to numeric
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')

ct1 = data.groupby('Neighborhood').size()
print(ct1)
This tells us that there are 25 neighbourhoods in the dataset.
We can create our ANOVA model with the smf.ols function, using the formula 'SalePrice ~ C(Neighborhood)' to model SalePrice (the dependent variable) against Neighborhood (the independent variable). We then fit the model with the fit function and use the summary function to get our F-statistic and associated p-value. We hope the p-value will be less than 0.05, so that we can reject the null hypothesis that there is no significant association between neighbourhood and sale price, and accept the alternate hypothesis that there is a significant relationship.
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='SalePrice ~ C(Neighborhood)', data=data)
results1 = model1.fit()
print(results1.summary())
We get the output below, which tells us that for 1460 observations the F-statistic is 71.78 and the p-value is 1.56e-225, meaning the chance of this result arising by chance is vanishingly small (224 zeros after the decimal point followed by 156), so we can safely reject the null hypothesis and accept the alternative hypothesis: there IS a significant relationship between sale price and location (neighbourhood). Our adjusted R-squared is also .538, so neighbourhood alone explains nearly 54% of the variance in sale price.
                            OLS Regression Results
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.546
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     71.78
Date:                Mon, 10 Oct 2016   Prob (F-statistic):          1.56e-225
Time:                        09:16:00   Log-Likelihood:                -17968.
No. Observations:                1460   AIC:                         3.599e+04
Df Residuals:                    1435   BIC:                         3.612e+04
Df Model:                          24
Covariance Type:            nonrobust
==============================================================================================
                                  coef    std err          t      P>|t|    [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Intercept                    1.949e+05   1.31e+04     14.879      0.000     1.69e+05  2.21e+05
C(Neighborhood)[T.Blueste]  -5.737e+04   4.04e+04     -1.421      0.155    -1.37e+05  2.18e+04
C(Neighborhood)[T.BrDale]   -9.038e+04   1.88e+04     -4.805      0.000    -1.27e+05 -5.35e+04
C(Neighborhood)[T.BrkSide]  -7.004e+04   1.49e+04     -4.703      0.000    -9.93e+04 -4.08e+04
C(Neighborhood)[T.ClearCr]   1.769e+04   1.66e+04      1.066      0.287    -1.49e+04  5.03e+04
C(Neighborhood)[T.CollgCr]   3094.8910   1.38e+04      0.224      0.823     -2.4e+04  3.02e+04
C(Neighborhood)[T.Crawfor]   1.575e+04   1.51e+04      1.042      0.298    -1.39e+04  4.54e+04
C(Neighborhood)[T.Edwards]  -6.665e+04   1.42e+04     -4.705      0.000    -9.44e+04 -3.89e+04
C(Neighborhood)[T.Gilbert]  -2016.3760   1.44e+04     -0.140      0.889    -3.03e+04  2.63e+04
C(Neighborhood)[T.IDOTRR]   -9.475e+04   1.58e+04     -5.988      0.000    -1.26e+05 -6.37e+04
C(Neighborhood)[T.MeadowV]  -9.629e+04   1.85e+04     -5.199      0.000    -1.33e+05    -6e+04
C(Neighborhood)[T.Mitchel]   -3.86e+04   1.52e+04     -2.540      0.011    -6.84e+04 -8784.735
C(Neighborhood)[T.NAmes]    -4.902e+04   1.36e+04     -3.609      0.000    -7.57e+04 -2.24e+04
C(Neighborhood)[T.NPkVill]  -5.218e+04   2.23e+04     -2.344      0.019    -9.58e+04 -8510.657
C(Neighborhood)[T.NWAmes]   -5820.8139   1.45e+04     -0.400      0.689    -3.43e+04  2.27e+04
C(Neighborhood)[T.NoRidge]   1.404e+05   1.56e+04      9.015      0.000      1.1e+05  1.71e+05
C(Neighborhood)[T.NridgHt]   1.214e+05   1.45e+04      8.390      0.000      9.3e+04   1.5e+05
C(Neighborhood)[T.OldTown]  -6.665e+04    1.4e+04     -4.744      0.000    -9.42e+04 -3.91e+04
C(Neighborhood)[T.SWISU]    -5.228e+04    1.7e+04     -3.080      0.002    -8.56e+04  -1.9e+04
C(Neighborhood)[T.Sawyer]   -5.808e+04   1.45e+04     -3.999      0.000    -8.66e+04 -2.96e+04
C(Neighborhood)[T.SawyerW]  -8315.0857   1.49e+04     -0.559      0.576    -3.75e+04  2.08e+04
C(Neighborhood)[T.Somerst]   3.051e+04   1.43e+04      2.129      0.033     2393.494  5.86e+04
C(Neighborhood)[T.StoneBr]   1.156e+05    1.7e+04      6.812      0.000     8.23e+04  1.49e+05
C(Neighborhood)[T.Timber]    4.738e+04   1.58e+04      3.007      0.003     1.65e+04  7.83e+04
C(Neighborhood)[T.Veenker]    4.39e+04   2.09e+04      2.101      0.036     2913.679  8.49e+04
==============================================================================
Omnibus:                      618.883   Durbin-Watson:                   1.956
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5526.438
Skew:                           1.737   Prob(JB):                         0.00
Kurtosis:                      11.875   Cond. No.                         48.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
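Rather than reading the headline numbers off the summary, they can also be pulled directly from the fitted results object; a small sketch using results1 from above:

# grab the F-statistic, its p-value, and adjusted R-squared programmatically
print(results1.fvalue)        # F-statistic (71.78)
print(results1.f_pvalue)      # p-value of the F-test (1.56e-225)
print(results1.rsquared_adj)  # adjusted R-squared (0.538)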
We know there is a significant relationship between neighbourhood and sale price, but we don't know which neighbourhoods differ – remember we have 25 of them, each of which can differ from the others. So we must do some post-hoc testing; I will use Tukey HSD for this investigation.
data_sub = data[['SalePrice', 'Neighborhood']].dropna()

print('means for sale price by neighbourhood')
m1 = data_sub.groupby('Neighborhood').mean()
print(m1)

print('standard deviations for sale price by neighbourhood')
sd1 = data_sub.groupby('Neighborhood').std()
print(sd1)

# post-hoc pairwise comparison of neighbourhood means
mc1 = multi.MultiComparison(data_sub['SalePrice'], data_sub['Neighborhood'])
res1 = mc1.tukeyhsd()
print(res1.summary())
We can check the reject column below to see which neighbourhood pairs differ significantly – but with 25 neighbourhoods there are 25*24/2 = 300 pairwise comparisons to check, so there is a lot of output. Note we can also output a box-plot to help visualise this – see below the data for this output.
means for sale price by neighbourhood
              SalePrice
Neighborhood
Blmngtn          194870
Blueste          137500
BrDale           104493
BrkSide          124834
ClearCr          212565
CollgCr          197965
Crawfor          210624
Edwards          128219
Gilbert          192854
IDOTRR           100123
MeadowV           98576
Mitchel          156270
NAmes            145847
NPkVill          142694
NWAmes           189050
NoRidge          335295
NridgHt          316270
OldTown          128225
SWISU            142591
Sawyer           136793
SawyerW          186555
Somerst          225379
StoneBr          310499
Timber           242247
Veenker          238772

standard deviations for sale price by neighbourhood
              SalePrice
Neighborhood
Blmngtn        30393.23
Blueste        19091.88
BrDale         14330.18
BrkSide        40348.69
ClearCr        50231.54
CollgCr        51403.67
Crawfor        68866.40
Edwards        43208.62
Gilbert        35986.78
IDOTRR         33376.71
MeadowV        23491.05
Mitchel        36486.63
NAmes          33075.35
NPkVill         9377.31
NWAmes         37172.22
NoRidge       121412.66
NridgHt        96392.54
OldTown        52650.58
SWISU          32622.92
Sawyer         22345.13
SawyerW        55652.00
Somerst        56177.56
StoneBr       112969.68
Timber         64845.65
Veenker        72369.32

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================================
 group1  group2   meandiff      lower        upper     reject
-------------------------------------------------------------
Blmngtn Blueste -57370.8824 -205327.1494 90585.3847 False
Blmngtn BrDale -90377.1324 -159316.6978 -21437.5669 True
Blmngtn BrkSide -70036.8306 -124623.7013 -15449.96 True
Blmngtn ClearCr 17694.5462 -43160.8093 78549.9018 False
Blmngtn CollgCr 3094.891 -47555.6594 53745.4414 False
Blmngtn Crawfor 15753.8431 -39675.653 71183.3393 False
Blmngtn Edwards -66651.1824 -118574.7463 -14727.6184 True
Blmngtn Gilbert -2016.376 -54933.1834 50900.4313 False
Blmngtn IDOTRR -94747.0986 -152739.031 -36755.1661 True
Blmngtn MeadowV -96294.4118 -164181.4029 -28407.4206 True
Blmngtn Mitchel -38600.7599 -94312.3419 17110.8221 False
Blmngtn NAmes -49023.8024 -98807.5959 759.9912 False
Blmngtn NPkVill -52176.4379 -133766.4471 29413.5713 False
Blmngtn NWAmes -5820.8139 -59121.3267 47479.699 False
Blmngtn NoRidge 140424.4347 83330.0192 197518.8502 True
Blmngtn NridgHt 121399.741 68361.3761 174438.1059 True
Blmngtn OldTown -66645.5815 -118133.3438 -15157.8192 True
Blmngtn SWISU -52279.5224 -114498.9775 9939.9328 False
Blmngtn Sawyer -58077.7472 -111310.1904 -4845.304 True
Blmngtn SawyerW -8315.0857 -62796.9994 46166.8279 False
Blmngtn Somerst 30508.9549 -22025.1032 83043.0129 False
Blmngtn StoneBr 115628.1176 53408.6625 177847.5728 True
Blmngtn Timber 47376.565 -10374.6478 105127.7779 False
Blmngtn Veenker 43901.8449 -32685.0101 120488.7 False
Blueste BrDale -33006.25 -181448.4174 115435.9174 False
Blueste BrkSide -12665.9483 -155011.0916 129679.1951 False
Blueste ClearCr 75065.4286 -69799.2935 219930.1506 False
Blueste CollgCr 60465.7733 -80416.7722 201348.3189 False
Blueste Crawfor 73124.7255 -69545.6724 215795.1234 False
Blueste Edwards -9280.3 -150625.5153 132064.9153 False
Blueste Gilbert 55354.5063 -86358.5908 197067.6034 False
Blueste IDOTRR -37376.2162 -181061.5586 106309.1262 False
Blueste MeadowV -38923.5294 -186879.7965 109032.7377 False
Blueste Mitchel 18770.1224 -124010.1064 161550.3513 False
Blueste NAmes 8347.08 -132226.1731 148920.3331 False
Blueste NPkVill 5194.4444 -149528.9959 159917.8848 False
Blueste NWAmes 51550.0685 -90306.7539 193406.8909 False
Blueste NoRidge 197795.3171 54469.8634 341120.7708 True
Blueste NridgHt 178770.6234 37012.0908 320529.1559 True
Blueste OldTown -9274.6991 -150460.4033 131911.005 False
Blueste SWISU 5091.36 -140351.6666 150534.3866 False
Blueste Sawyer -706.8649 -142538.1252 141124.3954 False
Blueste SawyerW 49055.7966 -93249.1306 191360.7238 False
Blueste Somerst 87879.8372 -53690.7835 229450.4579 False
Blueste StoneBr 172999.0 27555.9734 318442.0266 True
Blueste Timber 104747.4474 -38840.9086 248335.8034 False
Blueste Veenker 101272.7273 -50871.8085 253417.263 False
BrDale BrkSide 20340.3017 -35550.1855 76230.7889 False
BrDale ClearCr 108071.6786 46044.3103 170099.0468 True
BrDale CollgCr 93472.0233 41419.1813 145524.8654 True
BrDale Crawfor 106130.9755 49417.228 162844.723 True
BrDale Edwards 23725.95 -29566.4191 77018.3191 False
BrDale Gilbert 88360.7563 34100.1941 142621.3185 True
BrDale IDOTRR -4369.9662 -63590.6074 54850.6749 False
BrDale MeadowV -5917.2794 -74856.8449 63022.286 False
BrDale Mitchel 51776.3724 -5213.1044 108765.8493 False
BrDale NAmes 41353.33 -9856.4953 92563.1553 False
BrDale NPkVill 38200.6944 -44267.1764 120668.5652 False
BrDale NWAmes 84556.3185 29921.4873 139191.1497 True
BrDale NoRidge 230801.5671 172459.5377 289143.5965 True
BrDale NridgHt 211776.8734 157397.7573 266155.9895 True
BrDale OldTown 23731.5509 -29136.3011 76599.4029 False
BrDale SWISU 38097.61 -25268.6327 101463.8527 False
BrDale Sawyer 32299.3851 -22269.0409 86867.8112 False
BrDale SawyerW 82062.0466 26274.0638 137850.0294 True
BrDale Somerst 120886.0872 66998.7291 174773.4453 True
BrDale StoneBr 206005.25 142639.0073 269371.4927 True
BrDale Timber 137753.6974 78768.7612 196738.6335 True
BrDale Veenker 134278.9773 56757.5836 211800.3709 True
BrkSide ClearCr 87731.3768 42185.1676 133277.5861 True
BrkSide CollgCr 73131.7216 42528.4353 103735.0079 True
BrkSide Crawfor 85790.6738 47797.0964 123784.2512 True
BrkSide Edwards 3385.6483 -29281.4508 36052.7474 False
BrkSide Gilbert 68020.4546 33796.6124 102244.2968 True
BrkSide IDOTRR -24710.2679 -66353.3598 16932.824 False
BrkSide MeadowV -26257.5811 -80844.4518 28329.2895 False
BrkSide Mitchel 31436.0707 -6967.8775 69840.019 False
BrkSide NAmes 21013.0283 -8133.309 50159.3655 False
BrkSide NPkVill 17860.3927 -53048.0869 88768.8723 False
BrkSide NWAmes 64216.0168 29401.8308 99030.2027 True
BrkSide NoRidge 210461.2653 170077.4176 250845.1131 True
BrkSide NridgHt 191436.5717 157025.0761 225848.0673 True
BrkSide OldTown 3391.2492 -28578.6201 35361.1184 False
BrkSide SWISU 17757.3083 -29596.081 65110.6975 False
BrkSide Sawyer 11959.0834 -22750.7982 46668.965 False
BrkSide SawyerW 61721.7449 25124.4528 98319.0369 True
BrkSide Somerst 100545.7855 66916.7782 134174.7928 True
BrkSide StoneBr 185664.9483 138311.559 233018.3375 True
BrkSide Timber 117413.3956 76106.1873 158720.604 True
BrkSide Veenker 113938.6755 48849.2813 179028.0698 True
ClearCr CollgCr -14599.6552 -55345.3174 26146.0069 False
ClearCr Crawfor -1940.7031 -48493.4664 44612.0603 False
ClearCr Edwards -84345.7286 -126663.4225 -42028.0347 True
ClearCr Gilbert -19710.9222 -63241.5922 23819.7477 False
ClearCr IDOTRR -112441.6448 -162017.7979 -62865.4917 True
ClearCr MeadowV -113988.958 -174844.3135 -53133.6024 True
ClearCr Mitchel -56295.3061 -103183.5892 -9407.023 True
ClearCr NAmes -66718.3486 -106381.3896 -27055.3075 True
ClearCr NPkVill -69870.9841 -145710.6857 5968.7174 False
ClearCr NWAmes -23515.3601 -67511.6712 20480.9511 False
ClearCr NoRidge 122729.8885 74206.6672 171253.1098 True
ClearCr NridgHt 103705.1948 60026.8377 147383.5519 True
ClearCr OldTown -84340.1277 -126121.9466 -42558.3088 True
ClearCr SWISU -69974.0686 -124434.9842 -15513.153 True
ClearCr Sawyer -75772.2934 -119686.1151 -31858.4718 True
ClearCr SawyerW -26009.632 -71429.9978 19410.7339 False
ClearCr Somerst 12814.4086 -30250.1706 55878.9878 False
ClearCr StoneBr 97933.5714 43472.6558 152394.487 True
ClearCr Timber 29682.0188 -19612.335 78976.3725 False
ClearCr Veenker 26207.2987 -44221.9359 96636.5333 False
CollgCr Crawfor 12658.9522 -19423.1882 44741.0925 False
CollgCr Edwards -69746.0733 -95297.8086 -44194.3381 True
CollgCr Gilbert -5111.267 -32625.3213 22402.7873 False
CollgCr IDOTRR -97841.9895 -134172.4026 -61511.5765 True
CollgCr MeadowV -99389.3027 -150039.8531 -48738.7524 True
CollgCr Mitchel -41695.6509 -74262.7362 -9128.5655 True
CollgCr NAmes -52118.6933 -72981.5978 -31255.7889 True
CollgCr NPkVill -55271.3289 -123196.0246 12653.3668 False
CollgCr NWAmes -8915.7048 -37160.6929 19329.2832 False
CollgCr NoRidge 137329.5437 102449.6503 172209.4372 True
CollgCr NridgHt 118304.85 90557.727 146051.9731 True
CollgCr OldTown -69740.4724 -94394.5664 -45086.3785 True
CollgCr SWISU -55374.4133 -98130.6443 -12618.1824 True
CollgCr Sawyer -61172.6382 -89288.9625 -33056.3139 True
CollgCr SawyerW -11409.9767 -41825.6568 19005.7033 False
CollgCr Somerst 27414.0639 643.5215 54184.6062 True
CollgCr StoneBr 112533.2267 69776.9957 155289.4576 True
CollgCr Timber 44281.674 8336.7541 80226.594 True
CollgCr Veenker 40806.9539 -21018.4538 102632.3617 False
Crawfor Edwards -82405.0255 -116461.4781 -48348.5729 True
Crawfor Gilbert -17770.2192 -53322.6308 17782.1925 False
Crawfor IDOTRR -110500.9417 -153242.6041 -67759.2793 True
Crawfor MeadowV -112048.2549 -167477.7511 -56618.7587 True
Crawfor Mitchel -54354.603 -93947.1003 -14762.1058 True
Crawfor NAmes -64777.6455 -95473.1105 -34082.1804 True
Crawfor NPkVill -67930.281 -139489.4529 3628.8908 False
Crawfor NWAmes -21574.657 -57695.7055 14546.3915 False
Crawfor NoRidge 124670.5916 83154.8384 166186.3447 True
Crawfor NridgHt 105645.8979 69912.8092 141378.9866 True
Crawfor OldTown -82399.4246 -115787.6732 -49011.1761 True
Crawfor SWISU -68033.3655 -116355.68 -19711.051 True
Crawfor Sawyer -73831.5904 -109852.119 -37811.0617 True
Crawfor SawyerW -24068.9289 -61911.5555 13773.6977 False
Crawfor Somerst 14755.1117 -20225.0645 49735.288 False
Crawfor StoneBr 99874.2745 51551.96 148196.589 True
Crawfor Timber 31622.7219 -10791.7575 74037.2013 False
Crawfor Veenker 28148.0018 -37649.6565 93945.6601 False
Edwards Gilbert 64634.8063 34842.166 94427.4466 True
Edwards IDOTRR -28095.9162 -66181.0465 9989.214 False
Edwards MeadowV -29643.2294 -81566.7933 22280.3345 False
Edwards Mitchel 28050.4224 -6463.2456 62564.0905 False
Edwards NAmes 17627.38 -6159.9909 41414.7509 False
Edwards NPkVill 14474.7444 -54404.4434 83353.9323 False
Edwards NWAmes 60830.3685 30361.4075 91299.3295 True
Edwards NoRidge 207075.6171 170371.5955 243779.6387 True
Edwards NridgHt 188050.9234 158042.9066 218058.9402 True
Edwards OldTown 5.6009 -27167.9632 27179.1649 False
Edwards SWISU 14371.66 -29885.2436 58628.5636 False
Edwards Sawyer 8573.4351 -21776.2918 38923.1621 False
Edwards SawyerW 58336.0966 25844.685 90827.5082 True
Edwards Somerst 97160.1372 68052.7469 126267.5276 True
Edwards StoneBr 182279.3 138022.3964 226536.2036 True
Edwards Timber 114027.7474 76310.1719 151745.3229 True
Edwards Veenker 110553.0273 47680.4635 173425.5911 True
Gilbert IDOTRR -92730.7225 -132159.2548 -53302.1903 True
Gilbert MeadowV -94278.0357 -147194.8431 -41361.2284 True
Gilbert Mitchel -36584.3839 -72575.0117 -593.756 True
Gilbert NAmes -47007.4263 -72891.2249 -21123.6278 True
Gilbert NPkVill -50160.0619 -119791.0502 19470.9264 False
Gilbert NWAmes -3804.4378 -35936.814 28327.9383 False
Gilbert NoRidge 142440.8107 104344.6533 180536.9682 True
Gilbert NridgHt 123416.117 91720.4851 155111.7489 True
Gilbert OldTown -64629.2054 -93655.6519 -35602.759 True
Gilbert SWISU -50263.1463 -95681.2653 -4845.0274 True
Gilbert Sawyer -56061.3712 -88080.7081 -24042.0343 True
Gilbert SawyerW -6298.7097 -40354.8962 27757.4768 False
Gilbert Somerst 32525.3309 1681.0092 63369.6526 True
Gilbert StoneBr 117644.4937 72226.3747 163062.6126 True
Gilbert Timber 49392.941 10319.3245 88466.5576 True
Gilbert Veenker 45918.2209 -17777.0794 109613.5213 False
IDOTRR MeadowV -1547.3132 -59539.2456 56444.6192 False
IDOTRR Mitchel 56146.3387 13039.4828 99253.1945 True
IDOTRR NAmes 45723.2962 10611.3787 80835.2138 True
IDOTRR NPkVill 42570.6607 -30991.2198 116132.5411 False
IDOTRR NWAmes 88926.2847 48984.2602 128868.3092 True
IDOTRR NoRidge 235171.5333 190291.7724 280051.2942 True
IDOTRR NridgHt 216146.8396 176555.3151 255738.3641 True
IDOTRR OldTown 28101.5171 -9387.2855 65590.3197 False
IDOTRR SWISU 42467.5762 -8773.8256 93708.978 False
IDOTRR Sawyer 36669.3514 -3181.7925 76520.4952 False
IDOTRR SawyerW 86432.0128 44926.5967 127937.4289 True
IDOTRR Somerst 125256.0534 86342.715 164169.3919 True
IDOTRR StoneBr 210375.2162 159133.8144 261616.618 True
IDOTRR Timber 142123.6636 96411.2666 187836.0606 True
IDOTRR Veenker 138648.9435 70678.6042 206619.2828 True
MeadowV Mitchel 57693.6519 1982.0699 113405.2338 True
MeadowV NAmes 47270.6094 -2513.1841 97054.4029 False
MeadowV NPkVill 44117.9739 -37472.0354 125707.9831 False
MeadowV NWAmes 90473.5979 37173.0851 143774.1107 True
MeadowV NoRidge 236718.8465 179624.431 293813.262 True
MeadowV NridgHt 217694.1528 164655.7879 270732.5177 True
MeadowV OldTown 29648.8303 -21838.932 81136.5926 False
MeadowV SWISU 44014.8894 -18204.5657 106234.3446 False
MeadowV Sawyer 38216.6645 -15015.7786 91449.1077 False
MeadowV SawyerW 87979.326 33497.4124 142461.2396 True
MeadowV Somerst 126803.3666 74269.3086 179337.4247 True
MeadowV StoneBr 211922.5294 149703.0743 274141.9846 True
MeadowV Timber 143670.9768 85919.7639 201422.1896 True
MeadowV Veenker 140196.2567 63609.4017 216783.1117 True
Mitchel NAmes -10423.0424 -41625.0118 20778.9269 False
Mitchel NPkVill -13575.678 -85353.5743 58202.2183 False
Mitchel NWAmes 32779.946 -3772.502 69332.3941 False
Mitchel NoRidge 179025.1946 137133.5597 220916.8295 True
Mitchel NridgHt 160000.5009 123831.385 196169.6169 True
Mitchel OldTown -28044.8216 -61899.311 5809.6679 False
Mitchel SWISU -13678.7624 -62324.3932 34966.8683 False
Mitchel Sawyer -19476.9873 -55930.1051 16976.1305 False
Mitchel SawyerW 30285.6742 -7968.9426 68540.2909 False
Mitchel Somerst 69109.7148 33684.243 104535.1865 True
Mitchel StoneBr 154228.8776 105583.2468 202874.5083 True
Mitchel Timber 85977.3249 43194.8591 128759.7907 True
Mitchel Veenker 82502.6048 16467.1359 148538.0738 True
NAmes NPkVill -3152.6356 -70433.4807 64128.2096 False
NAmes NWAmes 43202.9885 16543.5212 69862.4557 True
NAmes NoRidge 189448.2371 155839.3869 223057.0872 True
NAmes NridgHt 170423.5434 144292.1316 196554.9551 True
NAmes OldTown -17621.7791 -40442.2128 5198.6545 False
NAmes SWISU -3255.72 -44981.5289 38470.0889 False
NAmes Sawyer -9053.9449 -35577.0581 17469.1683 False
NAmes SawyerW 40708.7166 11759.4258 69658.0074 True
NAmes Somerst 79532.7572 54440.7309 104624.7835 True
NAmes StoneBr 164651.92 122926.1111 206377.7289 True
NAmes Timber 96400.3674 61687.4719 131113.2628 True
NAmes Veenker 92925.6473 31808.3102 154042.9843 True
NPkVill NWAmes 46355.624 -23567.4101 116278.6582 False
NPkVill NoRidge 192600.8726 119744.45 265457.2952 True
NPkVill NridgHt 173576.1789 103852.7669 243299.591 True
NPkVill OldTown -14469.1436 -83020.4068 54082.1197 False
NPkVill SWISU -103.0844 -77041.6744 76835.5056 False
NPkVill Sawyer -5901.3093 -75772.4696 63969.851 False
NPkVill SawyerW 43861.3522 -26966.3609 114689.0653 False
NPkVill Somerst 82685.3928 13344.8326 152025.9529 True
NPkVill StoneBr 167804.5556 90865.9656 244743.1456 True
NPkVill Timber 99553.0029 26180.7424 172925.2635 True
NPkVill Veenker 96078.2828 7118.5594 185038.0063 True
NWAmes NoRidge 146245.2486 107617.8829 184872.6142 True
NWAmes NridgHt 127220.5549 94888.3844 159552.7254 True
NWAmes OldTown -60824.7676 -90544.9756 -31104.5597 True
NWAmes SWISU -46458.7085 -92323.3103 -594.1067 True
NWAmes Sawyer -52256.9334 -84906.4985 -19607.3682 True
NWAmes SawyerW -2494.2719 -37143.6587 32155.1149 False
NWAmes Somerst 36329.7687 4831.6997 67827.8377 True
NWAmes StoneBr 121448.9315 75584.3297 167313.5333 True
NWAmes Timber 53197.3789 13605.6666 92789.0911 True
NWAmes Veenker 49722.6588 -14291.7729 113737.0904 False
NoRidge NridgHt -19024.6937 -57289.5191 19240.1317 False
NoRidge OldTown -207070.0162 -243154.8936 -170985.1388 True
NoRidge SWISU -192703.9571 -242927.3511 -142480.563 True
NoRidge Sawyer -198502.1819 -237035.5664 -159968.7975 True
NoRidge SawyerW -148739.5205 -188981.3845 -108497.6564 True
NoRidge Somerst -109915.4799 -147478.1737 -72352.7861 True
NoRidge StoneBr -24796.3171 -75019.7111 25427.077 False
NoRidge Timber -93047.8697 -137616.1465 -48479.5929 True
NoRidge Veenker -96522.5898 -163728.8029 -29316.3767 True
NridgHt OldTown -188045.3225 -217292.7882 -158797.8568 True
NridgHt SWISU -173679.2634 -219238.9515 -128119.5752 True
NridgHt Sawyer -179477.4882 -211697.3205 -147257.656 True
NridgHt SawyerW -129714.8268 -163959.5854 -95470.0681 True
NridgHt Somerst -90890.7862 -121943.1909 -59838.3815 True
NridgHt StoneBr -5771.6234 -51331.3115 39788.0648 False
NridgHt Timber -74023.176 -113261.2591 -34785.0929 True
NridgHt Veenker -77497.8961 -141294.22 -13701.5722 True
OldTown SWISU 14366.0591 -29378.7314 58110.8496 False
OldTown Sawyer 8567.8343 -21030.1235 38165.792 False
OldTown SawyerW 58330.4957 26540.1669 90120.8245 True
OldTown Somerst 97154.5363 68831.8714 125477.2013 True
OldTown StoneBr 182273.6991 138528.9086 226018.4896 True
OldTown Timber 114022.1465 76906.8036 151137.4894 True
OldTown Veenker 110547.4264 48034.2881 173060.5647 True
SWISU Sawyer -5798.2249 -51583.7033 39987.2536 False
SWISU SawyerW 43964.4366 -3267.9245 91196.7978 False
SWISU Somerst 82788.4772 37816.883 127760.0714 True
SWISU StoneBr 167907.64 111926.593 223888.687 True
SWISU Timber 99656.0874 48687.2772 150624.8976 True
SWISU Veenker 96181.3673 24570.1713 167792.5633 True
Sawyer SawyerW 49762.6615 15218.0766 84307.2464 True
Sawyer Somerst 88586.7021 57203.957 119969.4472 True
Sawyer StoneBr 173705.8649 127920.3864 219491.3433 True
Sawyer Timber 105454.3122 65954.2867 144954.3378 True
Sawyer Veenker 101979.5921 38021.8264 165937.3579 True
SawyerW Somerst 38824.0406 5365.6695 72282.4117 True
SawyerW StoneBr 123943.2034 76710.8422 171175.5645 True
SawyerW Timber 55691.6508 14523.2415 96860.06 True
SawyerW Veenker 52216.9307 -12784.467 117218.3284 False
Somerst StoneBr 85119.1628 40147.5686 130090.757 True
Somerst Timber 16867.6102 -21686.0702 55421.2905 False
Somerst Veenker 13392.8901 -49984.7878 76770.5679 False
StoneBr Timber -68251.5526 -119220.3628 -17282.7424 True
StoneBr Veenker -71726.2727 -143337.4687 -115.0767 True
Timber Veenker -3474.7201 -71239.795 64290.3548 False
-------------------------------------------------------------
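With 300 comparisons it can help to count the significant pairs programmatically rather than scanning the reject column by eye; a small sketch, assuming the res1 object from above:

# res1.reject is a boolean array, one entry per pairwise comparison
print(numpy.sum(res1.reject))  # number of neighbourhood pairs judged significantly different
print(len(res1.reject))        # total number of comparisons (300)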
To visualise this we can use the pandas boxplot function, although we probably have to tidy up the tick labels on the neighborhood (x) axis:
data_sub.boxplot(by='Neighborhood')
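One way to tidy those labels, sketched here on the assumption that rotated tick labels are enough:

import matplotlib.pyplot as plt

data_sub.boxplot(by='Neighborhood')
plt.xticks(rotation=90)  # rotate the neighbourhood names so they don't overlap
plt.suptitle('')         # drop the automatic 'Boxplot grouped by Neighborhood' super-title
plt.title('Sale Price by Neighborhood')
plt.ylabel('Sale Price')
plt.show()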
Developing a Research Question
While trying to buy a house in Dublin I realised I had no way of knowing whether I was paying a fair price, getting a great price, or over-paying. The data scientist in me would like to develop an algorithm, a hypothesis, a research question, so that my decisions are based on sound science and not on gut instinct. So for the last couple of weeks I have been developing algorithms to determine this fair-price value. My research question is:
Is house sales price associated with socio-economic location?
I stumbled upon similar research by Dean De Cock from 2009 on determining house prices in Ames, Iowa, so that is the data set I will use. See the Kaggle page House Prices: Advanced Regression Techniques to get the data.
I would like to study the association between neighborhood (location) and house price, to determine whether location influences sale price and whether the differences in mean price between locations are significant.
This dataset has 79 independent variables, with sale price as the dependent variable. Initially I am focusing on just one independent variable, the neighborhood, so I can reduce the dataset down to two variables and simplify the computation my analysis of variance needs to perform.
Now that I have determined I am going to study location, I decide I might also want to look at bands of house size: not the raw square footage, but categories of square footage (less than 1000, 1000 to 1250, 1250 to 1500, and greater than 1500 square feet) to see if there is a variance in the mean among these categories.
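A sketch of how those bands could be built with pandas; the bin edges follow the bands described above, and the new column name is my own invention:

import pandas as pd

data = pd.read_csv('ames_house_price.csv')

# cut the above ground living area into the four proposed bands
data['GrLivAreaBand'] = pd.cut(data['GrLivArea'],
                               bins=[0, 1000, 1250, 1500, float('inf')],
                               labels=['<1000', '1000-1250', '1250-1500', '>1500'])
print(data['GrLivAreaBand'].value_counts())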
I can now take the above ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage for first floor, second floor, basement etc…
I then searched Google Scholar, Kaggle, and the DBS library for previous studies in these areas, finding a paper from 2001 discussing previous research on house prices in Dublin; however, it was written in 2001, just as a bubble was about to begin, and the big property crash of 2008 had not yet been conceived of: http://www.sciencedirect.com/science/article/pii/S0264999300000407
Secondly, Dean De Cock's research on house prices in Ames, Iowa: http://ww2.amstat.org/publications/jse/v19n3/decock.pdf
Based on my literature review I believe there might be a statistically significant association between house location (neighborhood) and sale price. Secondly, I believe there will be a statistically significant association between size bands (square-footage bands) and sale price. I further believe there might be an interaction effect between location and square-footage bands on sale price, which I would like to investigate too.
So I have developed three null hypotheses:
* There is NO association between location and sales price
* There is NO association between bands of square footage and sales price
* There is NO interaction effect between location and bands of square footage in their association with sales price.
Running a LASSO Regression Analysis
A lasso regression analysis was conducted to identify a subset of variables from a pool of 79 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring Ames, Iowa house sale price. Categorical predictors included house type, neighbourhood, and zoning type, retained to improve the interpretability of a selected model with fewer predictors. Quantitative predictor variables included lot area, above-ground living area, first-floor area, and second-floor area; counts were used for the numbers of bathrooms and bedrooms. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
The data set was randomly split into a training set that included 70% of the observations (N=1022) and a test set that included 30% of the observations (N=438). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Figure 1. Change in the validation mean square error at each step:
Of the 33 predictor variables entered, 13 were retained in the selected model; overall quality, above-ground floor space, and garage capacity emerged as the three most important during the estimation process. These 13 variables accounted for just over 77% of the variance in the training set, and performed even better on the test set, at 81%.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
import os

data = pd.read_csv("iowa_house_data.csv")

# upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
print(data.columns)

data_clean = data

# select predictor variables and target variable as separate data sets
predvar = data_clean[['GRLIVAREA', 'LOTAREA', 'YEARBUILT', 'FIREPLACES',
                      'OVERALLQUAL', 'OVERALLCOND', 'TOTRMSABVGRD', 'YEARREMODADD',
                      '1STFLRSF', '2NDFLRSF', 'YRSOLD', 'BSMTFINSF1', 'BSMTFINSF2',
                      'BSMTUNFSF', 'TOTALBSMTSF', 'MSSUBCLASS', 'MISCVAL', 'MOSOLD',
                      'GARAGECARS', 'GARAGEAREA', 'WOODDECKSF', 'OPENPORCHSF',
                      'ENCLOSEDPORCH', '3SSNPORCH', 'SCREENPORCH', 'POOLAREA',
                      'LOWQUALFINSF', 'BSMTFULLBATH', 'BSMTHALFBATH', 'FULLBATH',
                      'HALFBATH', 'BEDROOMABVGR', 'KITCHENABVGR']]

target = data_clean.SALEPRICE

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
print(predvar)
for k in predvar.columns:
    print(k)
    predictors[k] = preprocessing.scale(predictors[k].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients, sorted by absolute size
var_imp = pd.DataFrame(data={'predictors': list(predictors.columns.values),
                             'coefficients': model.coef_})
var_imp['sort'] = var_imp.coefficients.abs()
print(var_imp.sort_values(by='sort', ascending=False))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
    coefficients     predictors  sort
4           0.36    OVERALLQUAL  0.36
0           0.26      GRLIVAREA  0.26
18          0.12     GARAGECARS  0.12
11          0.07     BSMTFINSF1  0.07
2           0.07      YEARBUILT  0.07
7           0.05   YEARREMODADD  0.05
8           0.05       1STFLRSF  0.05
15         -0.04     MSSUBCLASS  0.04
3           0.04     FIREPLACES  0.04
14          0.04    TOTALBSMTSF  0.04
20          0.02     WOODDECKSF  0.02
27          0.01   BSMTFULLBATH  0.01
1           0.01        LOTAREA  0.01
24          0.00    SCREENPORCH  0.00
25          0.00       POOLAREA  0.00
26          0.00   LOWQUALFINSF  0.00
31          0.00   BEDROOMABVGR  0.00
22          0.00  ENCLOSEDPORCH  0.00
28          0.00   BSMTHALFBATH  0.00
29          0.00       FULLBATH  0.00
30          0.00       HALFBATH  0.00
23          0.00      3SSNPORCH  0.00
16          0.00        MISCVAL  0.00
21          0.00    OPENPORCHSF  0.00
19          0.00     GARAGEAREA  0.00
17          0.00         MOSOLD  0.00
13          0.00      BSMTUNFSF  0.00
12          0.00     BSMTFINSF2  0.00
10          0.00         YRSOLD  0.00
9           0.00       2NDFLRSF  0.00
6           0.00   TOTRMSABVGRD  0.00
5           0.00    OVERALLCOND  0.00
32          0.00   KITCHENABVGR  0.00
training data R-square 0.777169556607
test data R-square 0.81016173881
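The listing above doesn't show how those R-square figures were produced; one way to compute them from the fitted model is with sklearn's r2_score, sketched here using the train/test splits defined earlier:

from sklearn.metrics import r2_score

# R-square on the training and test sets
print('training data R-square', r2_score(tar_train, model.predict(pred_train)))
print('test data R-square', r2_score(tar_test, model.predict(pred_test)))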
Wesleyan’s Regression Modeling in Practice – Week 2
Continuing with the Kaggle data set from House Prices: Advanced Regression Techniques, I plan to build a very simple linear regression model to see if house sale price (the response variable) has a linear relationship with above-ground living area (GrLivArea), my primary explanatory variable. Even though there are 80 variables and 1460 observations in this dataset, my hypothesis is that there is a linear relationship between house sale price and above-ground living area.
The data set, sample, procedure, and methods were detailed in week 1’s post.
import numpy
import pandas
import statsmodels.api
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn
from sklearn import preprocessing

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# call in data set
data = pandas.read_csv('homes_train.csv')

print(data['SalePrice'].describe())
count     1460.00
mean    180921.20
std      79442.50
min      34900.00
25%     129975.00
50%     163000.00
75%     214000.00
max     755000.00
Name: SalePrice, dtype: float64
There is quite a sizable difference between the mean and the median – almost 18,000, or just under 10% of our mean.
So we can center the variables as follows:
# center the variables (note: with_mean/with_std must be booleans, not strings)
data['GrLivArea'] = preprocessing.scale(data['GrLivArea'], with_mean=True, with_std=False)
data['SalePrice'] = preprocessing.scale(data['SalePrice'], with_mean=True, with_std=False)
print(data['GrLivArea'].mean())
print(data['SalePrice'].mean())

# convert variables to numeric format
data['GrLivArea'] = pandas.to_numeric(data['GrLivArea'], errors='coerce')
data['SalePrice'] = pandas.to_numeric(data['SalePrice'], errors='coerce')

# view the centering
data['SalePrice'].diff().hist()

# BASIC LINEAR REGRESSION
scat1 = seaborn.regplot(x="SalePrice", y="GrLivArea", scatter=True, data=data)
plt.xlabel('Sale Price')
plt.ylabel('Ground Living Area')
plt.title('Scatterplot for the Association Between Sale Price and Ground Living Area')
print(scat1)
print("OLS regression model for the association between sale price and ground living area")
reg1 = smf.ols('SalePrice ~ GrLivArea', data=data).fit()
print(reg1.summary())
OLS regression model for the association between sale price and ground living area
                            OLS Regression Results
========================================================================
Dep. Variable:              SalePrice   R-squared:                 0.502
Model:                            OLS   Adj. R-squared:            0.502
Method:                 Least Squares   F-statistic:               1471.
Date:                Mon, 03 Oct 2016   Prob (F-statistic):    4.52e-223
Time:                        00:13:00   Log-Likelihood:          -18035.
No. Observations:                1460   AIC:                   3.607e+04
Df Residuals:                    1458   BIC:                   3.608e+04
Df Model:                           1
Covariance Type:            nonrobust
========================================================================
               coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------
Intercept  1.857e+04   4480.755      4.144      0.000  9779.612  2.74e+04
GrLivArea   107.1304      2.794     38.348      0.000   101.650   112.610
========================================================================
Omnibus:                      261.166   Durbin-Watson:             2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       3432.287
Skew:                           0.410   Prob(JB):                   0.00
Kurtosis:                      10.467   Cond. No.               4.90e+03
========================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.9e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Looking at the graphs and summary statistics, my hypothesis is supported better than I expected. Remember the null hypothesis (H0) was that there is no linear relationship between house sale price and above-ground living space; the alternative hypothesis (H1) was that there is a statistically significant relationship. Considering there are 79 explanatory variables and I selected only one to explain the response variable, both my R-squared and adjusted R-squared are .502, so a little over 50% of the variance in sale price is explained with just one explanatory variable.
My p-value of 4.52e-223 is far below .05, so the model shows a significant linear relationship between sale price and above-ground living area; I can reject my null hypothesis and accept my alternative hypothesis that there is a relationship between house price and above-ground living space. Both the intercept (p-value = 3.61e-05) and the living-area coefficient (p-value < 2e-16) contribute to the significance, with both p-values equal to 0.000 to three decimal places and both t-values greater than zero, so it is a positive linear relationship.
From the graph, the sale price data appears skewed – the median residual is -1124 rather than zero (where we'd like it to be) – which is why the data was centered.
I realise I still need to examine the residuals and test for normality (normal or log-normal distribution).
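As a placeholder for that step, here is a sketch of how the residuals from reg1 could be examined in Python, mirroring the R diagnostics below:

import statsmodels.api as sm
import matplotlib.pyplot as plt

# histogram of the residuals from the fitted model
reg1.resid.hist(bins=50)
plt.title('Residuals of SalePrice ~ GrLivArea')

# QQ plot of the residuals against a fitted normal distribution
sm.qqplot(reg1.resid, line='s')
plt.show()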
Note the linear regression can also be done in R as follows:
house = read.csv('train.csv')

house_model = lm(house$SalePrice ~ house$GrLivArea, house)
summary(house_model)

plot(house$GrLivArea, house$SalePrice)
hist(house$SalePrice)
shapiro.test(house$SalePrice)

## Plot using a qqplot
qqnorm(house$SalePrice)
qqline(house$SalePrice, col = 2)
Call:
lm(formula = house$SalePrice ~ house$GrLivArea, data = house)

Residuals:
    Min      1Q  Median      3Q     Max
-462999  -29800   -1124   21957  339832

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     18569.026   4480.755   4.144 3.61e-05 ***
house$GrLivArea   107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021,  Adjusted R-squared:  0.5018
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
To improve the performance of my model I now need to look at treating multiple explanatory variables which will be done in next week’s blog post.
Will Mayo Ever Win an All-Ireland? Will Dublin win 3 in a Row?
On a bulletin board yesterday a Mayo man posed the following questions. Calculate the probabilities of:
- Mayo winning the All Ireland within the next 65 years
- Dublin getting three in a row
He will be delighted to know that the probability of Mayo winning an All Ireland in the next 65 years is almost 100% that they will, no matter what way the data is sliced.
They have won 3 of 131 championships, so approximately 1 in 44.
They have won 3 of the 15 finals they have appeared in, so 1 in 5 (.2), and they have now appeared in 8 finals in a row without winning one.
They have been in 5 of the last 15 finals = one in 3 (.33).
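Treating each of the 65 years as an independent trial at one of those rates (a back-of-the-envelope sketch, not a serious model), the chance of at least one win is 1 - (1 - p)^65:

# probability of at least one All-Ireland in 65 independent years
for label, p in [('1 in 44', 1 / 44.0), ('1 in 5', 0.2), ('1 in 3', 1 / 3.0)]:
    print(label, 1 - (1 - p) ** 65)

The finals-based rates round to essentially 100%, while even the raw 1-in-44 base rate still gives roughly a 78% chance.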
Which led me onto the Dublin question:
As of today, the chance of the Dubs getting 3 in a row, without putting much thought into it, should be 1 in 33.
That's the 31 counties taking part (Kilkenny doesn't, and they shouldn't be allowed hurl if they don't play football) plus London and New York.
However Dublin only play in Leinster and winning that gets them to the quarter-final – so if they win Leinster then that is 1 in 8.
But they are not guaranteed to win Leinster – they have only won 9 of the last 10 – so a 90% chance of getting to the last 8.
So 9/10 * 1/8 = 9/80 = 0.1125.
But this seems a bit too low to price Dublin to win next year.
From another view Dublin have won four of the last six = 4/6 = 2/3
But I s’pose this last figure ignores the nerves of doing a three-in-a-row – it is 93 years since Dublin did it. Kerry are the only team to have done it in the last 50 years, and they only did it twice in that time; it has not been done at all in the last 30 years. Only two teams in the last 30 years have been in a position to do it, and both failed – including Kerry getting to 6 finals in a row, winning 4 of the 6, and still failing to win 3 in a row.
And now, what odds would I want to place a bet at the bookmakers? Probably 1 in 4 sounds right: if they can beat any two out of Kerry, Mayo, and the Ulster champions, that would win it for them.
Wesleyan’s Machine Learning for Data Analysis Week 2
Week 2's assignment for this Machine Learning for Data Analysis course, delivered by Wesleyan University in conjunction with Coursera, was to build a random forest to test nonlinear relationships among a series of explanatory variables and a categorical response variable. I continued using Fisher's Iris data set, comprising 3 different types of iris (Setosa, Versicolour, and Virginica) with 4 explanatory variables representing sepal length, sepal width, petal length, and petal width.
Using the Spyder IDE via Anaconda Navigator, I began by importing the necessary Python libraries:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
Now load our Iris dataset of 150 rows of 5 variables:
# Load the iris dataset
iris = pd.read_csv("iris.csv")

# or, if not on file, it could be loaded from scikit-learn instead:
# iris = datasets.load_iris()
Now we begin our modelling and prediction. We define our predictors and target as follows:
predictors = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
targets = iris.Name
Next we split our data into our training and test datasets with a 60%, 40% split respectively:
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
Training data set of length 90, and test data set of length 60.
Now it is time to build our classification model and we use the random forest classifier class to do this.
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
Finally we make our predictions on our test data set and verify the accuracy.
predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[1]: 0.94999999999999996
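To make the confusion matrix computed above easier to read, the rows and columns can be labelled with the class names; a small sketch using the objects defined above:

import pandas as pd
from sklearn.metrics import confusion_matrix

labels = sorted(targets.unique())
cm = pd.DataFrame(confusion_matrix(tar_test, predictions, labels=labels),
                  index=labels, columns=labels)
print(cm)  # rows are the true classes, columns the predicted classes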
Next we figure out the relative importance of each of the attributes:
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
print(model.feature_importances_)
[ 0.09603246  0.06664688  0.40937484  0.42794582]
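To see which score belongs to which variable, the array can be paired with the predictor column names; a sketch using the objects defined above:

# pair each importance with its column name and sort, largest first
for name, score in sorted(zip(predictors.columns, model.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(name, round(score, 3))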
Finally displaying the performance of the random forest was achieved with the following:
trees = range(1, 26)   # number of trees, 1 to 25, so the x-axis matches n_estimators
accuracy = np.zeros(25)

for idx in range(len(accuracy)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
And the accuracy-versus-number-of-trees plot was output:
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating the type of iris: petal width, petal length, sepal width, and sepal length.
The explanatory variables with the highest relative importance scores were petal width (42.8%) and petal length (40.9%), followed by sepal length (9.6%) and sepal width (6.7%). The accuracy of the random forest was 95%, with the subsequent growing of multiple trees, rather than a single tree, adding little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be appropriate.
So our model seems to be behaving very well at categorising the iris flowers based on the variables we have available to us.