Developing a Research Question

While trying to buy a house in Dublin I realised I had no way of knowing whether I was paying a fair price, getting a bargain, or over-paying. The data scientist in me wants a hypothesis and a research question, so that my decisions rest on sound science rather than gut instinct. So for the last couple of weeks I have been developing algorithms to estimate this fair price. My research question is:
Is house sales price associated with socio-economic location?

I stumbled upon similar research by Dean De Cock, published in 2011, on determining house prices in Ames, Iowa, so that is the data set I will use. See the Kaggle page "House Prices Advanced Regression Techniques" to get the data.

I would like to study the association between the neighborhood (location) and the house price, to determine whether location influences the sale price and whether the difference in mean sale price between locations is significant.

This dataset has 79 independent variables, with sale price as the dependent variable. Initially I am focusing on only one independent variable, the neighborhood, so I can reduce the dataset to two variables, which simplifies the computation my analysis of variance needs to perform.
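As a rough illustration of what that analysis of variance computes, here is a minimal sketch of the one-way F statistic for (neighborhood, sale price) pairs, using only the Python standard library. The function name `one_way_anova_f` and the toy prices are my own assumptions for illustration and are not taken from the Ames data; on the real data the pairs would come from the `Neighborhood` and `SalePrice` columns of the Kaggle training file.

```python
from collections import defaultdict
from statistics import mean


def one_way_anova_f(rows):
    """Return (F, df_between, df_within) for (group, value) pairs.

    Hypothetical helper for illustration only. It assumes at least two
    groups and some variation inside the groups (otherwise the within-group
    mean square is zero and the ratio is undefined).
    """
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(float(value))

    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (made up): three neighborhoods, prices in thousands.
toy_rows = [
    ("A", 100), ("A", 110), ("A", 120),
    ("B", 150), ("B", 160), ("B", 170),
    ("C", 200), ("C", 210), ("C", 220),
]
f_stat, df_b, df_w = one_way_anova_f(toy_rows)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the neighborhood means differ by more than the spread within neighborhoods would suggest. The p-value comes from the F distribution with these degrees of freedom, which a statistics library such as SciPy (`scipy.stats.f_oneway`) reports directly.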

Having decided to study location, I might also want to look at house size, not as raw square footage but as bands: less than 1000 square feet, 1000 to 1250, 1250 to 1500, and more than 1500. I can then see whether the mean sale price varies among these categories.
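The banding could be sketched as below. The function name `size_band` and the band labels are my own, and the handling of the edges is an assumption, since the bands as described do not say which band a house of exactly 1250 square feet belongs to: here 1000 and 1250 go into the higher band, and 1500 stays in the 1250 to 1500 band.

```python
def size_band(sq_ft: float) -> str:
    """Map living area in square feet to one of four size bands.

    Edge handling is an assumption: 1000 and 1250 fall into the higher
    band, and exactly 1500 stays in the "1250-1500" band.
    """
    if sq_ft < 1000:
        return "<1000"
    if sq_ft < 1250:
        return "1000-1250"
    if sq_ft <= 1500:
        return "1250-1500"
    return ">1500"


# Made-up living areas, one per band plus the edge cases.
for area in [856, 1000, 1250, 1500, 1710]:
    print(area, size_band(area))
```

With the data in a pandas DataFrame, `pd.cut` could do the same job in one call, but the explicit version makes the edge decisions visible.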

I can now take the above-ground living space variable (square footage) and add it to my codebook. I will also add any other variables related to square footage, such as first floor, second floor and basement.

I then searched Google Scholar, Kaggle and the DBS library for previous studies in these areas, finding:
* A paper from 2001 discussing previous research in Dublin. However, it was written when a bubble was about to begin, and the big property crash of 2008 had not been foreseen.
* Dean De Cock’s research on house prices in Iowa.

Based on my literature review I believe that there might be a statistically significant association between house location (neighborhood) and sales price. Secondly, I believe there will be a statistically significant association between size bands (square footage bands) and sales price. I further believe that there might be an interaction effect between location and square footage bands on sales price, which I would like to investigate too.

So I have developed three null hypotheses:
* There is NO association between location and sales price
* There is NO association between bands of square footage and sales price
* There is NO interaction effect between location and bands of square footage on sales price.
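
For readers who prefer notation, the three null hypotheses above can be written against the standard two-way ANOVA model with interaction. This is a sketch of the textbook formulation, not something taken from the literature I reviewed. The symbols are my own labels: $y_{ijk}$ is the sale price of house $k$ in neighborhood $i$ and size band $j$, $\mu$ is the overall mean, $\alpha_i$ is the neighborhood effect, $\beta_j$ is the size band effect, $(\alpha\beta)_{ij}$ is the interaction, and $\varepsilon_{ijk}$ is the error term.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \\
H_0^{(1)} &: \alpha_i = 0 \quad \text{for all neighborhoods } i \\
H_0^{(2)} &: \beta_j = 0 \quad \text{for all size bands } j \\
H_0^{(3)} &: (\alpha\beta)_{ij} = 0 \quad \text{for all pairs } (i, j)
\end{align*}
```

Each hypothesis gets its own F test, and rejecting the third would mean that the effect of size band on price differs from one neighborhood to another.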