Predicting House Prices in King County, Washington

John Luo, Sean Lin

Introduction

King County is the most populous county in Washington State. Within it lies Seattle, Washington's most populous city. Around two-thirds of King County's population lives in Seattle's suburbs. Moving eastward, the county slowly transforms into countryside and farmland. King County also hosts the headquarters of 10 Fortune 500 companies, including Starbucks, Nordstrom, Alaska Airlines, Costco, Expedia, Microsoft, and Amazon. Microsoft and Amazon are two of the Big Five tech giants and coveted employers for anyone looking to work in the tech industry.

For those interested in working at a Fortune 500 company based in the Seattle area, it is only natural to wonder about the price of housing in King County. The question gets more complicated once we start to qualify the houses we're interested in: 3 bed, 2 bath on 2 floors versus 4 bed, 4 bath on 1 floor. Being able to estimate the price of a house from a selection of desired features would clearly be convenient: putting a price tag on our "dream house" makes the dream more attainable.

In this tutorial, we walk through creating a linear regression model in Python to predict the price of properties in King County given some key attributes. Along the way, we'll learn about data parsing, cleaning, and analysis in the context of data science.

Required Tools

Before starting, we'll need the following Python libraries. urllib, io, and zipfile ship with the Python standard library; the others can be installed with pip.

You can learn more about these libraries through their documentation!

Library Documentation
pandas https://pandas.pydata.org/
numpy https://numpy.org/doc/
urllib https://docs.python.org/3/library/urllib.html
io https://docs.python.org/3/library/io.html
zipfile https://docs.python.org/3/library/zipfile.html
seaborn https://seaborn.pydata.org/
matplotlib https://matplotlib.org/3.3.3/contents.html
sklearn https://scikit-learn.org/stable/
statsmodels.api https://www.statsmodels.org/stable/api.html

Data Collection

We will be working with a dataset collected by the University of Chicago's Center for Spatial Data Science on "Home Sales in King County, WA from 2014-2015". In your own time, we highly recommend you learn more about the Center for Spatial Data Science and dig into more of their other awesome datasets!

Let's start by loading the housing sales dataset into a pandas DataFrame!
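The exact download link and the file layout inside the archive depend on the CSDS distribution, so treat the snippet below as a minimal sketch: the URL and the CSV name inside the zip are placeholders to adjust.

```python
import io
import urllib.request
import zipfile

import pandas as pd

# Placeholder URL -- swap in the actual link to the CSDS King County archive.
DATA_URL = "https://example.org/kingcounty.zip"

# Download the zip archive into memory and open it.
with urllib.request.urlopen(DATA_URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))

# The CSV name inside the archive is an assumption; check archive.namelist().
with archive.open("kc_house_data.csv") as f:
    df = pd.read_csv(f)

df.head()
```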

There are 21,613 observations and 21 variables in our dataset. Let's take a closer look at what each variable name represents:

Variable Description
id Unique identifier of the house
date Date of sale
price Sale price
bedrooms Number of bedrooms
bathrooms Number of bathrooms. Noninteger values exist due to "1/2 bathrooms" and "3/4 bathrooms"
sqft_living Size of interior living space in square feet
sqft_lot Size of land lot in square feet
floors Number of floors. Noninteger values exist due to "half floor" architecture
waterfront '1' if property has a waterfront, '0' if not
view An index from 0 to 4 of how good the property's view is
condition Condition of the house, ranked from 1 to 5, 5 being the greatest condition
grade Classification by construction materials and workmanship quality. Numeric scale with higher numbers being better. For more information, see the King County glossary
sqft_above Square feet above ground
sqft_basement Square feet of basement (below-ground) space
yr_built Year built
yr_renovated Year renovated. '0' if never renovated
zipcode 5 digit zip code
lat Latitude
long Longitude
sqft_living15 Average interior living space of the 15 nearest houses, in square feet
sqft_lot15 Average land lot size of the 15 nearest houses, in square feet

Data Preprocessing

Armed with a better understanding of the dataset, let's clean it up in preparation for data analysis! We first want to check for any null or missing values across the DataFrame. Keep in mind that there are 21,613 observations, so every column should contain 21,613 values. Let's verify this.
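A quick way to verify this with pandas:

```python
# Every column should report 21,613 non-null values.
df.info()

# Equivalently, count missing entries per column -- all counts should be zero.
df.isna().sum()
```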

Nice! The DataFrame does not contain any None or numpy.nan values. But this doesn't mean we're completely out of the woods yet. Let's take a deeper look at each column's summary statistics and see if there are any misreported observations.
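pandas gives us those per-column summary statistics in one call:

```python
# Count, mean, standard deviation, min, quartiles, and max for each numeric column.
df.describe()
```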

Hm, some of these values don't make a lot of sense in the context of house sales. For example, it should not be possible to sell a house with 0 bathrooms (interestingly, zero-bedroom dwellings are a defined and legal type of residence). Additionally, something seems fishy about a house with 33 bedrooms but only 1.75 bathrooms in 1,620 square feet of living space. Let's remove these extraneous values from the DataFrame. This will help us build a more accurate model later on.
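One way to apply the filters described above:

```python
# Drop the observations flagged above: houses reporting 0 bathrooms and the
# implausible 33-bedroom listing.
df = df[(df["bathrooms"] > 0) & (df["bedrooms"] < 33)]
```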

Since the DataFrame stores house sales as observations, it is not guaranteed that each observation is for a unique house. It may be the case that a house was sold several times during 2014-2015, which could later introduce bias into our regression model. Let's remove all instances of duplicate houses. Since different observations of the same house represent different times the house was sold, let's resolve these entities by always keeping the most recent sale. This way, we always preserve the most recent information about the house.
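A sketch of this entity resolution, assuming the date column parses cleanly with pandas:

```python
# Parse the sale date, then keep only the most recent sale for each house id.
# Sorting by date first means keep="last" retains the latest record.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").drop_duplicates(subset="id", keep="last")
```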

Lastly, to aid in data analysis, let's "tidy up" this DataFrame by following the Tidy Data standard. To do this, we must ensure that each variable is stored in its own column. Currently, the date and yr_renovated columns store more than one variable. Let's split the date column into separate year, month, and day columns. For fun, we'll also add a day_of_week column. Let's also add an is_renovated column and replace all values of 0 in yr_renovated with numpy.nan.
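These tidying steps might look like the following:

```python
import numpy as np

# Split the sale date into separate year, month, and day columns
# (plus day_of_week for fun), then drop the original date column.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.day_name()
df = df.drop(columns=["date"])

# Flag renovated houses, and treat 0 in yr_renovated as "never renovated".
df["is_renovated"] = df["yr_renovated"] > 0
df["yr_renovated"] = df["yr_renovated"].replace(0, np.nan)
```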

Our cleaned and tidied DataFrame now has 21,425 observations, each representing a unique house sold in King County, WA between 2014 and 2015. With that, we are ready to begin data analysis!

Exploratory Data Analysis and Visualization

Summary Statistics

First, let's look at some summary statistics, mapping price distributions across the dataset.

Let's take a look at the overall price distribution. What's the ballpark for housing prices in the county?
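One way to draw this distribution with seaborn (the styling choices are our own):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of sale prices across the county.
sns.histplot(df["price"])
plt.xlabel("Sale price ($)")
plt.title("Distribution of house sale prices")
plt.show()
```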

As we can see, there is a noticeable right skew. This suggests that prices vary widely on the more expensive end of the market.

Let's see if a log transform will make the distribution more symmetric and therefore more interpretable.
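The same plot on a log10 scale:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Log-transform the prices to reduce the right skew.
sns.histplot(np.log10(df["price"]))
plt.xlabel("log10(sale price)")
plt.title("Distribution of log-transformed sale prices")
plt.show()
```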

Looks good! Applying a logarithm transform made the distribution much more symmetric. This is something we can put to work later.

Next, let's take a look at the impact of the time of year on housing, in terms of both prices and purchase frequency.

What days and months of the year were purchases made the most over 2014-2015?
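The plots captioned below can be produced along these lines, using the month column we created during preprocessing:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of sales per month.
sns.countplot(x="month", data=df)
plt.title("Number of house sales by month")
plt.show()

# Average sale price per month.
sns.barplot(x="month", y="price", data=df)
plt.title("Average sale price by month")
plt.show()
```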

Month's impact on frequency

Month's impact on price

Housing prices stayed relatively stable throughout the months, with small fluctuations in April, June, August, September, and October. While there doesn't seem to be much of a relationship between the month and housing prices, there is a noticeable relationship between the month and the frequency of purchases. Specifically, the spring and summer months of March, May, June, and July have the highest frequency of purchases. There could be a variety of reasons for this, such as warmer weather leading to more people touring houses, or the time of year coinciding with the end of the school year, giving families more time to shop for houses.

Day's impact on Frequency

Day's impact on Price

Similar to months, the day of the week didn't seem to affect the average purchase price very much, with Saturday slightly edging out the other days. However, there's a clear trend in purchase frequency. Tuesday and Wednesday stand out as the most common, and all weekdays were far more common than weekends. While we might expect weekends to be more convenient for buyers, it seems that the schedules of sellers and their agents, who often work in real estate full-time, may matter more.

How about the number of floors, bedrooms, and bathrooms? How do these affect price?

It seems that one-floor houses are the most common, followed by two-floor houses. Because there are so few houses with other floor counts, any outliers will have a greater effect, as we'll see later.

Floors

As we can see, 2.5 and 3.5 floors have very noticeable outliers in house prices. Since these are pretty uncommon houses with very low frequency, we'll need to be careful in interpreting the information to account for these outliers.

As mentioned earlier, these averages can be misleading. The 2.5 and 3.5 floor averages are inflated by the combination of their low frequency and large outliers. It's hard to account for this with so little data, but it seems that, outside of those two outliers, the 2-floor houses have the highest overall prices. It's hard to tell whether there's a strong correlation with the number of floors, since the trend is so irregular and floor counts only range from 1 to 3.5.

Bedrooms

It seems that 3-bedroom houses are the most common. As before, with lower counts for 0-, 1-, and 5+-bedroom houses, we need to be wary of the presence and effect of outliers. That 3 and 4 bedrooms are the most common isn't much of a surprise: that's enough to accommodate a family or roommates, while more bedrooms may be excessive and fewer may lose out to the convenience of apartments.

For houses with 7 and 8 bedrooms, there seems to be a noticeable skew toward higher prices. Combined with their low frequencies, we can expect these two groups to have some of the highest averages.

Due to the aforementioned skews, 7 and 8 bedroom houses sold for the most on average. While there is a generally increasing trend with number of bedrooms and price, it drops after 8 bedrooms. This could possibly be due to the small amount of data for higher bedroom houses, or those houses with that number of bedrooms may be undesirable in one way or another. Either way, there seems to be a noticeable positive correlation between number of bedrooms and price up until 8 bedrooms.

Bathrooms

It seems that most of the houses are concentrated within the 1-2.5 bathroom range, with relatively few houses outside of that range.

Once bathrooms reach the 4.75 mark, the spread of prices becomes far greater. Combined with the extremely small number of data points for houses with that many bathrooms, there is little we can reliably glean from them.

Overall, it seems that there is a consistent increasing trend with the number of bathrooms. As mentioned before, above 4.5 bathrooms is a territory with few datapoints, but even excluding those, the trend seems to still hold.

Analysis of floor/bedroom/bathroom

How many renovations occurred? What was the distribution of years between construction and renovation? Did it differ by zip code?

Based on this chart, we can see that renovations were fairly rare in this county.

It seems that there's a slightly bimodal distribution of years until renovation, with the highest frequency in the 40-60 year range.

What about the condition and grade of the house? How are those qualities distributed?

It seems that the majority of houses are in condition 3, or middle of the road. It's interesting to note that condition 1 and 2 houses are very infrequently sold. This could be due to some degree of self-selection: houses that are in worse condition may be less likely to be sold.

It seems that the grades are distributed fairly unimodally, with the majority of houses at grade 7. There are very few houses below that grade, which, as mentioned with condition, could be due to a degree of self-selection.

Now let's see some information on a map. The following code produces a map from a sample of 5,000 observations, color coded by price bracket, and lets you filter by the condition of the house.
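One way to build such a map is with folium (not in the tool list above, so consider this an illustrative sketch; the price brackets and the layer-per-condition filter are our own choices):

```python
import folium

# Work with a random sample of 5,000 sales to keep the map responsive.
sample = df.sample(5000, random_state=0)

def price_color(price):
    # Hypothetical price brackets used only for color coding.
    if price < 300_000:
        return "green"
    elif price < 600_000:
        return "orange"
    return "red"

# Center the map roughly on King County.
m = folium.Map(location=[47.5, -122.1], zoom_start=9)

# One toggleable layer per condition value acts as the "filter by condition".
for condition in sorted(sample["condition"].unique()):
    layer = folium.FeatureGroup(name=f"Condition {condition}")
    for _, row in sample[sample["condition"] == condition].iterrows():
        folium.CircleMarker(
            location=[row["lat"], row["long"]],
            radius=2,
            color=price_color(row["price"]),
            fill=True,
        ).add_to(layer)
    layer.add_to(m)

folium.LayerControl().add_to(m)
m  # displays the map in a notebook
```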

We can see that a lot of the higher-priced properties are clustered around Seattle and Bellevue. This is expected, since those are dense, popular urban areas. As we move south, further away from those two cities, property prices start dropping. Furthermore, we can see that waterfront properties along the coast are more expensive, while properties further inland tend to be a bit cheaper. It's interesting to note that, even though the vast majority of properties are condition 3, many of the more expensive properties are also condition 3. Additionally, cheaper properties have a strong presence at condition 4 and a decent presence at condition 5, suggesting that condition and price aren't necessarily strongly correlated, which is something we'll examine a little later. Waterfront status and condition don't seem to have much of a relationship either, except at condition 5, where most of the properties are waterfront.

Correlation Analysis

Now, let's take another look at condition and its effect on price.

As we can see, when the condition of the house is 3 or higher, its price is far greater. That said, houses of condition 1 and 2 are fairly rare, which might reflect some kind of selection bias: houses in poorer condition might be less likely to be sold in the first place. It's also interesting to note that, amongst houses of conditions 3-5, prices do not strictly increase. Condition 4 houses are on average less costly than condition 3 houses, but are also a little less common. Condition 5 houses are the clear highest, but, as we gleaned from the map, this may also be due to many of them being waterfront properties, which we will examine next.

Waterfront

There are a few outliers in both groups, but because there are far fewer houses with a waterfront, the outliers affect that group more, which is something we need to be careful of. However, waterfront houses are more spread out in price overall than those without, so it's reasonable to assume that their average will be higher regardless of outliers.

Keeping in mind the presence of outliers from earlier, it still seems that there is a big difference in price between houses with a waterfront and houses without. This is in line with what we saw on the map earlier. There are various reasons why a waterfront could command a higher price, but the explanations are out of scope for this exploratory data analysis.

Hypothesis Testing

Based on our exploratory analysis, we have a good idea of which features might affect the price of a house in King County. We will now focus on developing a linear regression model to predict the price of a house. For this project, we will fit our regression line using ordinary least squares, the most common fitting method.

Before continuing, let's first prepare the DataFrame for regression. For our simple model, we will exclude the id, day_of_week, day, month, year, lat, long, and zipcode columns, as these variables do not provide intrinsic value in the prediction of prices. We also exclude yr_renovated and is_renovated, as only 910 houses in the dataset were ever renovated. Similarly, we exclude waterfront, as there are only 163 waterfront houses in the DataFrame.

Linear regression has 5 key assumptions:

  1. The relationship between the independent variables and the dependent variable is linear.
  2. The independent variables are not highly correlated with one another (no multicollinearity).
  3. The errors are independent of one another (no autocorrelation).
  4. The errors have constant variance (homoscedasticity).
  5. The errors are normally distributed.

We've shown through EDA that we should be safe assuming that our features have a linear relationship with price. What we need to do now is ensure that our features are independent from one another. To verify this, let's create a correlation matrix to view the relationship between our features.
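Assuming the trimmed-down feature set described above, the correlation matrix can be computed and drawn like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Columns we plan to use as model inputs (price is the target).
features = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "view",
    "condition", "grade", "sqft_above", "sqft_basement", "yr_built",
    "sqft_living15", "sqft_lot15",
]

# Pairwise correlations, drawn as a red (positive) / blue (negative) heatmap.
corr = df[features].corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix of input features")
plt.show()
```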

We have chosen to view the correlation matrix as a heatmap. Red cells indicate variables with positive correlation; blue cells indicate variables with negative correlation. The intensity of the color matches the strength of the correlation. It looks like some of our variables are collinear, which violates the no-multicollinearity assumption required for linear regression.

In order to create independent inputs for our model, we will find orthogonal components among our input variables using Principal Component Analysis (PCA). PCA provides the added benefit of dimensionality reduction, reducing the number of input variables for our model. This better captures the "essence" of our dataset and lowers the chances of overfitting.

The downside to PCA comes in the form of reduced interpretability. Coefficients on principal components in a regression model do not provide as much insight into the relationships between the original independent variables and the dependent variable. Thus, we sacrifice interpretability for accuracy.

First, we standardize our input data so that every feature has zero mean and unit variance; PCA is sensitive to scale, and standardization keeps high-variance features from dominating the components. We then use PCA to select components until they explain at least 80% of the variance in the data. Next, we transform our collinear input variables into orthogonal principal components and store them in a new DataFrame.
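A sketch of those steps with scikit-learn, reusing the features list from the correlation step; passing a fraction as n_components tells PCA to keep enough components to explain that share of the variance:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance so that no single
# feature dominates the components because of its scale.
X = StandardScaler().fit_transform(df[features])

# Keep as many components as needed to explain at least 80% of the variance.
pca = PCA(n_components=0.80)
components = pca.fit_transform(X)

# Store the principal components in a new DataFrame.
pca_df = pd.DataFrame(
    components,
    columns=[f"PC{i + 1}" for i in range(components.shape[1])],
    index=df.index,
)
print(pca.explained_variance_ratio_)
```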

PCA transforms our original DataFrame using a weight matrix. This weight matrix indicates the importance of each input variable in creating each principal component. Below is a heatmap of the PCA weight matrix.
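The weights live in pca.components_, so the heatmap can be drawn much like the correlation matrix above:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Rows are principal components, columns are the original input features.
weights = pd.DataFrame(pca.components_, columns=features, index=pca_df.columns)
sns.heatmap(weights, cmap="coolwarm", center=0)
plt.title("PCA weight matrix")
plt.show()
```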

Note, for example, that sqft_lot and sqft_lot15 are highly correlated and are largely captured by a single principal component.

We are now ready to fit an ordinary least squares linear regression model to our DataFrame. Let's split our data into training and testing sets and train our model!
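A sketch of the split and fit, using statsmodels for the regression. The test size and random seed are arbitrary choices, and we fit on log10(price), since the error discussion below interprets the RMSE as a factor of $10^{0.3}$:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Log-transform the target so it is roughly symmetric (see the EDA above).
y = np.log10(df["price"])

# Add an intercept term, then split into training and testing sets.
X = sm.add_constant(pca_df)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit ordinary least squares and inspect the coefficients and p-values.
model = sm.OLS(y_train, X_train).fit()
print(model.summary())

# Root mean square error on the held-out test set (in log10 dollars).
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("Test RMSE:", rmse)
```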

We pose a null hypothesis that there is no relationship between house pricing and our principal components.

Thus we have created a linear regression model that predicts the price for a house in King County, WA between 2014-2015 given desired parameters for bedrooms, bathrooms, sqft_living, sqft_lot, floors, view, condition, grade, sqft_above, sqft_basement, yr_built, sqft_living15, and sqft_lot15.

We are able to reject the null hypothesis that there is no relationship between house prices and our principal components. Each principal component has a coefficient with a p-value significantly less than $\alpha = 0.01$. This means that our predictors have an effect on the dependent variable. For example, PC3 was mostly weighted by sqft_lot and sqft_lot15 and had a strong negative relationship with house prices: as the square footage of a property's lot increased, the price was seen to go down. This may be explained by the fact that houses with larger land lots tend to be situated in rural areas, where prices are much lower than in urban areas.

However, our model is not perfect. Its coefficient of determination, $R^2$, is just below 0.60. Since we are working with real-world data that is inherently noisy, this is not a bad value, but it still means that more than 40% of the variance in housing prices cannot be explained by the selected principal components, or in turn by the original input features. Additionally, the model's root mean square error on the log-transformed prices is 0.3, which means the predicted prices are typically within a factor of $10^{0.3} \approx 2$ of the observed prices. This harms the model's usability, as doubling a house's price significantly changes its affordability. Since our features already account for the basic characteristics of a house, we are unlikely to produce more accurate price predictions without additional information.

Conclusion

Shopping for houses is an important step in many people's lives. Everyone wants to buy low and, possibly in the future, sell high. Figuring out what traits affect the prices of houses in what ways can go a long way in learning how to shop for houses and what strategies to employ to maximize gain while minimizing loss. So, what have we learned?

  1. We reject the null hypothesis that there's no relationship between house prices and factors like the number of bedrooms, bathrooms, square foot living space, etc. This suggests that these factors do have an effect on the price.
  2. Location matters! The location of the house seemed to play a large role in the price, with waterfront houses and houses near Seattle or Bellevue being more expensive than inland houses near Kent.
  3. Bigger is better, for the most part. Square foot lot space was king in the correlations. While aspects like the number of bedrooms also showed a steadily increasing trend, the number of bathrooms and floors were a little more inconsistent. Furthermore, houses with high values for those factors were fairly rare, making it difficult to evaluate their overall trends.

Overall, we learned a lot about housing and the factors that go into price. Even with all this information, there's still plenty of room for further learning. For example, what about the proximity of these houses to the local grocery store? How about to good schools? Transportation? There's plenty more data exploration that can be done to further refine our model and figure out what makes a house worth what it's worth.