---
title: Predicting Property Sale Values in Ames, Iowa
subtitle: Using regression, random forests, time series, and neural networks
author: Rishi Goutam, Srikar Pamidi, James Goudreault
github-url: 'https://github.com/rishigoutam/ames-iowa'
date: '2022-03-10'
abstract: Using regression and random forest models to predict property prices in Ames, IA, with a bonus time-series analysis of the housing market
keywords: Ames Housing Project; NYC Data Science Academy
description:
publish: false
category: project
---

The Ames housing [dataset](https://www.kaggle.com/competitions/stat101ahouseprice/overview) is famous as an open Kaggle competition and for its use in undergraduate statistical methods courses as an end-of-semester regression project[^decock]. Unlike the similar Boston housing dataset, it has a relatively large number of variables (81 versus 14) and observations (2,580 versus 506). In tackling this problem, we had to go beyond a simple, automatic algorithm such as stepwise selection to construct our final model.

[^decock]: [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf)

<!-- Boston Housing
Harrison, D. and Rubinfeld, D. L. (1978). "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102 -->

In this project, we took on the persona of a data science team at a fictional local realty firm, Regression Realty, in early 2011. Our stakeholders were primarily the firm's listing and selling agents, and our goals were to:

1. Provide insight into Ames' housing market conditions, given that the country was just emerging from a major financial crisis.
2. Surface advice agents could give their clients: for instance, telling a buyer that a house is overpriced, or a seller that they could make money by remodeling a fireplace.
3. Create predictive models that can be meaningfully used in the firm's mobile or web app. This means not requiring too many features (agents have to input the data) and keeping model runtime short.

## Data

### Cleaning

The data comprises information collected by the [Ames City Assessor](https://www.cityofames.org/government/departments-divisions-a-h/city-assessor)'s Office[^not-mls] during 2006–2010, with variables that are nominal, ordinal, continuous, and discrete in nature. As a real-world dataset, it required extensive cleaning and feature engineering before we could garner insights and build models.

[^not-mls]: Not from traditional MLS data sources

We dropped one duplicate observation and removed two outliers[^outliers], as recommended by the dataset's author[^decock]. Another approach would have been to define outliers as properties whose sale prices were more than, say, four standard deviations from the mean; that would have dropped five points in our case.

[^outliers]: `GrLivArea` > 4,000 ft<sup>2</sup>

### Imputation

We imputed missing values by taking the mean, median, or mode of the feature, as applicable. Sometimes imputation was done by making reasonable assumptions: for instance, a missing `GarageYrBlt` can be imputed with the year the property was built. Finally, we imputed `None` (`MasVnrType`) or zero (`MasVnrArea`) for features that might simply not exist and for which no educated guess could be made.
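Below is a minimal `pandas` sketch of this imputation strategy (not the notebook's exact code); the file name matches the raw CSV in the pipeline diagram at the end of this post, and the choice of mode for `Electrical` and median for `LotFrontage` is illustrative:

```python
import pandas as pd

df = pd.read_csv("Ames_Housing_Price_Data.csv")

# Reasonable assumption: a garage is typically built with the house,
# so a missing GarageYrBlt defaults to YearBuilt
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])

# Mode for a categorical feature, median for a skewed numeric one
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# None/zero where the feature may simply be absent
df["MasVnrType"] = df["MasVnrType"].fillna("None")
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)
```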
For implementation details, see [`clean.ipynb`](https://github.com/rishigoutam/ames-iowa/blob/main/rishi/clean.ipynb).

## Feature Engineering

In addition to the given dataset's features, we added two more: the public school district for each property, and the interest rate for the month the property was sold (determined by [TNX](https://finance.yahoo.com/quote/%5ETNX/), the 10-year Treasury yield).

Ames has five public school districts, and we might expect property value to vary with the quality of the local school. By determining whether a property's latitude and longitude coordinates fall within a district's boundaries, we can assign it a school district. First, we used `geopy` to query OpenStreetMap's `Nominatim` geocoding service to acquire the property coordinates[^geo_locations]. Several `python` and `R` tools can then solve the point-in-polygon problem; we used `R`'s `sf` package, as sketched below.

[^geo_locations]: See [`geo_locations.ipynb`](https://github.com/rishigoutam/ames-iowa/blob/main/rishi/geo_locations.ipynb)
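For illustration, here is the equivalent lookup in Python (our actual implementation geocoded with `geopy` and did the point-in-polygon step in `R`'s `sf`); the shapefile path and the `district_name` column are hypothetical stand-ins for the district boundary data:

```python
import geopandas as gpd
from geopy.geocoders import Nominatim
from shapely.geometry import Point

# Geocode a street address to coordinates via OpenStreetMap's Nominatim
geolocator = Nominatim(user_agent="ames-housing-project")
location = geolocator.geocode("123 Main St, Ames, Iowa")  # example address

# Point-in-polygon: find the district polygon containing the property
districts = gpd.read_file("ames_school_districts_sf/districts.shp")  # hypothetical path
point = Point(location.longitude, location.latitude)  # note (x=lon, y=lat)
match = districts[districts.geometry.contains(point)]
print(match["district_name"].iloc[0])  # hypothetical column name
```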
![Map showing outlines of Ames public school districts labeled by district name](map-districts.png)

We also derived features we believed might be useful, such as whether a property is near a park, an arterial road, or a rail line; whether it is a Planned Unit Development (PUD); whether it has been renovated; whether it has a pool; the number of floors; and so on. Feature engineering was also conducted to:

* Ordinalize some categorical features (`*Qual`, `*Cond`, `Neighborhood`, etc.)
* Combine multiple features into a single feature (`StreetAlley`, total outdoor ft<sup>2</sup>, etc.)
* Collapse features into a smaller set of categories (`MSSubClass`, etc.)

Full details can be found in [`engineer.ipynb`](https://github.com/rishigoutam/ames-iowa/blob/main/rishi/engineer.ipynb).

## Notable Findings

Comprehensive exploratory data analysis produced many insights and informed feature selection and engineering; here follows the distilled version.

First, we see Ames properties plotted against school districts. We also plot the schools, a hospital, and Iowa State University.

![Map of properties colored by sale price on a scale where darker green indicates higher prices. Properties are overlaid on the previous map of public school districts](map-sale-price.png)

We will later see whether school districts matter (and how much) for a city like Ames; compared to features such as `Neighborhood`, `district` turns out to be less important. Still, all that mapping was useful as an EDA tool and for checking our assumptions.

![Map of properties colored by year built, overlaid on the previous map of public school districts](map-year-built.png)

Area features display increasing variance, as the plots below of `SalePrice` (y) versus `GrLivArea` (x) and their residuals show. We went with a log-linear transformation in our models, as it substantially reduced the "<"-shaped spread in property sale prices. Log-log showed marginal further improvement but is harder to explain to a layperson. When taking a transformation, we must remember to reverse it after our model makes a prediction.

![Scatterplots of property sale price versus living area along with residual plots, for both the log-linear and log-log transformations](area-features-display-variance.png)
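As a sketch of that reverse-transform step (assuming the cleaned DataFrame `df` from earlier):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Log-linear model: regress log(SalePrice) on living area
X = df[["GrLivArea"]]
y = np.log(df["SalePrice"])
model = LinearRegression().fit(X, y)

# Predictions come back on the log scale; exponentiate to get
# dollars before quoting a price to an agent or client
pred_dollars = np.exp(model.predict(X))
```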
Size matters…in most cases. The price of a single-family home is strongly correlated with the number of rooms in the home; not so with other home types.

![Boxplots of sale price versus number of rooms faceted by building type](price-depends-on-numrooms.png)

Neighborhood is very strongly predictive of price. As they say, *location, location, location!* In practice, this means we can treat `Neighborhood` as an ordinal feature in our models.

![Neighborhood versus SalePrice boxplots show that the neighborhood feature can be ordinalized](neighborhood-is-predictive.png)

### Market Analysis

In addition to exploring the dataset itself, we wanted to focus on the dynamics of the Ames housing market. We came into this with many assumptions: the 2008 crash *must* have caused house prices to fall; seasonality patterns in the housing market must have changed too; surely single-family homes would be more resilient than multi-dwelling units to the depressed market conditions?

We used a SARIMA model to look at average `SalePrice` for the dataset, splitting the properties by quintile to see if expensive and cheap houses behaved differently.

![Average price of properties by quintile from 2006 to 2010 in Ames, IA](quintiles-prices-over-time.png)

We can see that from 2006 to 2010, house prices in every quintile stayed relatively constant. This is counterintuitive: the [housing market collapsed](https://en.wikipedia.org/wiki/Financial_crisis_of_2007%E2%80%932008) in 2008, yet that collapse is not reflected in any of the quintiles. Having lived through this recession and knowing house prices fell where we grew up, we were perplexed at what was going on in Ames.

We conducted a Ljung-Box test to see if we could reject H<sub>0</sub>, that the data are independently distributed; i.e., that the correlations in the population from which the sample is taken are zero, so that any observed correlations in the data result from the randomness of the sampling process[^ljung-box].

[^ljung-box]: See [Ljung-Box test](https://en.wikipedia.org/wiki/Ljung%E2%80%93Box_test)

```r
# Ljung-Box test per quintile: we test to reject H_0, looking for p < .05
test1 <- Box.test(pur.ts, type = "Ljung-Box", lag = log(nrow(pur.ts)))  # p = .99
test2 <- Box.test(o.ts,   type = "Ljung-Box", lag = log(nrow(o.ts)))    # p = .14
test3 <- Box.test(t.ts,   type = "Ljung-Box", lag = log(nrow(t.ts)))    # p = .34
test4 <- Box.test(th.ts,  type = "Ljung-Box", lag = log(nrow(th.ts)))   # p = .94
test5 <- Box.test(fo.ts,  type = "Ljung-Box", lag = log(nrow(fo.ts)))   # p = .33
test6 <- Box.test(fi.ts,  type = "Ljung-Box", lag = log(nrow(fi.ts)))   # p = .31
```

The Ljung-Box test for each quintile reports a p-value above the 0.05 threshold, so we cannot reject H<sub>0</sub>: each quintile series behaves like white noise around a constant average. These are remarkably stationary series. U.S. Federal Housing Finance Agency data from 2006–2010 corroborate this: in Iowa, prices stayed remarkably constant even through the housing market crisis. This accounts for our counterintuitive results; housing in Iowa itself behaved in a counterintuitive way.

![Sales exhibit seasonality, with a shift towards cheap houses over expensive houses after the crash](sales-seasonality.svg)

In the left panel, the average number of houses sold at each date oscillates with a regular seasonality through the year. When we split the housing stock into the upper and lower half by price, however, we can see that before 2008 the more expensive houses sold more often, while after the crash the cheaper houses sold more often. The original pattern starts to return around 2010, the year in which the housing market began to significantly recover. This is important for Regression Realty's agents: it suggests they should switch back to focusing on expensive houses, and that the market may be recovering.

Finally, we use a SARIMA model[^sarima] to predict property prices given past data, with an 85% confidence interval. Our model uses the past four years of data to predict the following year with an RMSE of $14,167. Remember that this is a prediction for the housing market, not for a particular house. We could adapt this model to train on a particular `Neighborhood` or quintile to get more targeted average predictions for a realtor.

![One-year property price predictions using SARIMA](one-year-sarima.png)

[^sarima]: The Seasonal Autoregressive Integrated Moving Average model, or SARIMA, is a statistical model that uses seasonality, autocorrelation, differencing, and moving averages to predict future data given only past data.

## Predictive Models

We built several families of models: a regularized linear regression (elastic-net), tree ensembles (random forest and gradient boosting), support vector regression, a backpropagation neural network, and the SARIMA time-series model introduced above.

### SARIMA

As described in the market analysis, the SARIMA model predicts the next year of average housing prices given the previous four years, with an RMSE of $14,167. In other words, it can predict the average housing price up to a year out to within roughly ±$14,200.
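For reference, here is how such a model could be fit with `statsmodels` in Python (a sketch only: the orders shown are illustrative placeholders, not the orders we selected; `MoSold`, `YrSold`, and `SalePrice` are columns in the cleaned dataset):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Build a monthly average-sale-price series from the sale month/year columns
dates = pd.to_datetime(
    df[["YrSold", "MoSold"]]
      .rename(columns={"YrSold": "year", "MoSold": "month"})
      .assign(day=1)
)
monthly = df.set_index(dates)["SalePrice"].resample("MS").mean()

# Seasonal ARIMA with a 12-month season; orders here are illustrative
model = SARIMAX(monthly, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fit = model.fit(disp=False)

# Forecast the next 12 months with an 85% confidence interval
forecast = fit.get_forecast(steps=12)
predicted = forecast.predicted_mean
interval = forecast.conf_int(alpha=0.15)  # alpha = 1 - 0.85
```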
### Model Scoring

| Model | R² Train | R² Test | RMSE |
| ----------------- | -------- | ------- | -------- |
| Elastic-Net | 0.933 | 0.922 | 0.046 |
| Random Forest | 0.986 | 0.915 | 0.047 |
| Gradient Boosting | 0.994 | 0.927 | 0.043 |
| SVR | 0.926 | 0.922 | 0.045 |
| Backprop | 0.937 | 0.895 | 0.032 |
| SARIMA | 0.528 | 0.077 | $14,167 |

RMSE for the regression models is on the transformed (log sale price) scale, while SARIMA's RMSE is in dollars. One caveat: the backprop model has the worst test R² yet reports the lowest RMSE, which is inconsistent; this is a red flag, and that RMSE should be rechecked.

### Effect of a 1-unit change on mean sale price (and other insights)

| Change | Expected Cost | Sale Price Increase |
| ------------------------------ | ------------- | ------------------- |
| +1 sqft living area | N/A | +$57 |
| +1 sqft basement area | N/A | +$22 |
| +1 sqft finished outdoor space | N/A | +$11 |
| Installing a fireplace | $2–5K | +$4,380 |
| Installing central air | $3–15K | +$23,599 |
| Proximity to rail line | N/A | -$7,675 |

## Conclusion

## Future Development

![](https://hackmd.io/_uploads/BJHvdyaQ5.png)

## Engineering Process

```mermaid
flowchart LR
    d_r[(Ames_Real_Estate\n_Data.csv)]-->geopy
    d_TNX[(TNX)]-->tnx
    d_shapefile[(ames_school_districts_sf)]-->districts
    subgraph geo [New Features <various R/python>]
    direction LR
        geopy[[get coordinates\n using geopy]]
        tnx[[get interest rate\n using Quandl]]
        districts[[get school district\n using shapefile]]
    end
    geo-->d_e
    d_h[(Ames_Housing_Price\n_Data.csv)]-->clean
    subgraph clean [Data Cleaning <clean.ipynb>]
    direction TB
        dupe[[remove \nduplicates]]-->outlier
        outlier[[remove \noutliers]]-->fillna
        fillna[[fill \nempty values]]-->impute
        impute[[impute \nmissing values]]
    end
    clean-->d_c[(cleaned.csv)]
    subgraph encode [Feature Engineering <engineer.ipynb>]
    direction TB
        ordinate[[convert categorical \nfeatures to ordinal]]-->indicator
        indicator[[create indicator \nfeatures]]-->new
        new[[create derived \nfeatures]]
    end
    d_c-->encode-->d_e[(engineered.csv)]
    subgraph eda [EDA <eda.ipynb, eda.R>]
    direction TB
        histograms-.- boxplots
        scatterplots-.- regplots
    end
    subgraph lin_regression [Multiple Linear Regression <linreg.ipynb, linreg.R>]
    direction TB
        subgraph lin_features [Select Model Features]
        direction LR
        style lin_features stroke-dasharray: 5 5, fill: khaki
            cat_encoded--->dummify[[dummify]]
            numerical---|plus|cat_encoded[categorical]---|plus|boolean
        end
        lin_features---> linear_model[[create model]]---> validate--->|iterate|lin_features
        subgraph validate [Validate and Score]
        direction LR
        style validate stroke-dasharray: 5 5, fill:khaki
            stats_significance[feature significance: p-values]
            stats_cor[multicollinearity: vif]
            stats_score[score: R^2, RMSE, AIC/BIC]
        end
    end
    subgraph algo [Additional Models]
    direction LR
        decision_tree>Decision Tree]
        arima>SARIMA]
        svr>SVR]
        backprop>Neural Net]
        decision_tree.->random_forest>Random Forest]
        decision_tree.->boosted>Boosting]
    end
    d_c-->eda
    d_e-->lin_regression
    d_e-->decision_tree
    d_e-->arima
    d_e-->svr
    d_e-->backprop
```