Using Machine Learning to Predict Urban Development

This post presents a published study the author conducted while working at Skidmore, Owings & Merrill. Together with his team, the author developed a complete workflow for constructing and applying several Machine Learning models to analyze historical development data for the City of San Francisco. The project predicts potential real estate development sites in San Francisco based on historic trends, property characteristics, and geospatial proximity.

Project Background

In this project, I used Machine Learning algorithms to predict future development potential across the City of San Francisco, CA. The model takes into consideration a wide range of parcel information and geospatial features derived from the City’s publicly available development pipeline data from the past 10 years.

Where are the next real estate developments likely to happen?


The Wealth of Data

Representative features were selected for each parcel, including parcel size, year the structure was built, assessed value, zoning regulations, building square footage, etc. These are common factors a real estate developer considers when investigating sites for potential development, and they form the basis of the parcel dataset the Machine Learning models are trained on.
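As a minimal sketch of this assembly step (using pandas, with hypothetical parcel records and column names standing in for the City's actual data), the feature table might be built like this:

```python
import pandas as pd

# Hypothetical parcel records; the real data comes from the City's open data sources.
parcels = pd.DataFrame({
    "parcel_id": ["A1", "A2", "A3"],
    "lot_sqft": [2500, 5400, 3200],
    "year_built": [1924, 1968, 2001],
    "assessed_value": [850_000, 1_200_000, 2_300_000],
    "zoning": ["RH-2", "NC-3", "RM-1"],
    "building_sqft": [1800, 4200, 2900],
})

# Derive a simple feature: building age relative to a reference year.
parcels["building_age"] = 2020 - parcels["year_built"]

# One-hot encode the categorical zoning codes so models can consume them.
features = pd.get_dummies(parcels, columns=["zoning"])
print(features.shape)
```

The one-hot encoding turns each zoning code into its own binary column, which most classifiers handle better than raw category strings.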

The Geospatial Layer of Things

Then, a series of geospatial analyses was conducted to generate parcel-based accessibility values for different types of urban amenities (transit stops, parks, schools, retail, etc.). The resulting accessibility values were appended to the main parcel dataset as additional geospatial features.
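The proximity analyses in the study were produced with GIS tooling; a simplified stand-in for one such feature (nearest-transit-stop distance on projected coordinates, with hypothetical point locations in meters) can be computed with plain NumPy:

```python
import numpy as np

# Hypothetical projected coordinates (meters) for parcel centroids and transit stops.
parcel_xy = np.array([[0.0, 0.0], [500.0, 0.0], [1000.0, 1000.0]])
transit_xy = np.array([[100.0, 0.0], [900.0, 950.0]])

# Distance from every parcel to every transit stop, then keep the nearest one.
pairwise = np.linalg.norm(parcel_xy[:, None, :] - transit_xy[None, :, :], axis=2)
dist_to_transit = pairwise.min(axis=1)
print(dist_to_transit)
```

Each value becomes one accessibility column in the parcel table; the same pattern repeats for parks, schools, and the other amenity layers.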

Below are some study maps visualizing the geospatial analyses performed to measure spatial proximity to various amenities that are important to real estate development.

All parcels that appeared in the development pipeline over the past 10 years were marked as “Susceptible”, and all other parcels were marked as “Not Susceptible”. A first look at this labeled parcel data reveals some interesting patterns: for example, parcels near BART stations are more likely to appear in the development pipeline than those farther away, yet parcels at a moderate distance from parks and open spaces appeared in the pipeline most frequently.
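The labeling step amounts to a set-membership check against the pipeline records; a minimal sketch with hypothetical parcel IDs:

```python
import pandas as pd

# Hypothetical parcel IDs; pipeline_ids holds parcels seen in the 10-year pipeline.
parcels = pd.DataFrame({"parcel_id": ["A1", "A2", "A3", "A4"]})
pipeline_ids = {"A2", "A4"}

# Binary label: in the pipeline -> "Susceptible", otherwise "Not Susceptible".
parcels["label"] = parcels["parcel_id"].map(
    lambda pid: "Susceptible" if pid in pipeline_ids else "Not Susceptible"
)
print(parcels["label"].tolist())
```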

Machine Learning Models and Preliminary Result

To predict each parcel’s future development susceptibility, the City’s parcel dataset was randomly shuffled and split into three subsets: a training set, a testing/validation set, and a prediction set. Three Machine Learning algorithms were trained and compared: Logistic Regression, Naive Bayes, and Random Forest. Among the three, the Random Forest model had the lowest classification error. A separate prediction was then made on the prediction set with the Logistic Regression model, since we were interested in explicitly identifying the subset of features that significantly influence development susceptibility. The modeling was implemented in Orange, an open-source data mining and analytics tool.
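The study ran this comparison in Orange; an equivalent sketch in scikit-learn, using synthetic data as a stand-in for the labeled parcel table, would look like the following. Model names and accuracy scoring are standard scikit-learn; the data itself is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the labeled parcel feature table.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and score held-out accuracy (1 - classification error).
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Which model wins depends on the data; on the real parcel dataset the Random Forest had the lowest error, while Logistic Regression remained useful for interpreting individual feature effects.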

Feature Importance Ranking and Prediction Results (mapped) from the Random Forest Model

A variety of data mining and modeling tools were used in the analysis process, including Python, Orange, and QGIS.
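The feature importance ranking shown above comes from the Random Forest model; in scikit-learn terms it can be extracted as below (synthetic data and hypothetical feature names for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for the real parcel columns.
feature_names = ["lot_sqft", "building_age", "assessed_value",
                 "dist_to_transit", "dist_to_parks"]

# Synthetic data with the same number of features.
X, y = make_classification(n_samples=400, n_features=5,
                           n_informative=3, random_state=1)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Importances sum to 1; sort descending to get the ranking.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda kv: -kv[1])
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```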

Interactive Mapping Tool

Finally, the prediction results were visualized in an interactive map, built with a combination of Mapbox Studio, Mapbox GL, and custom JavaScript, with toggles and additional layers of information for reference.
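One common way to hand prediction results to Mapbox Studio or Mapbox GL is to export them as GeoJSON; a minimal sketch with hypothetical parcel centroids and susceptibility scores:

```python
import json

# Hypothetical predicted parcels: (lon, lat) centroid and susceptibility score.
predictions = [
    {"parcel_id": "A1", "lon": -122.42, "lat": 37.77, "susceptibility": 0.81},
    {"parcel_id": "A2", "lon": -122.41, "lat": 37.76, "susceptibility": 0.12},
]

# GeoJSON FeatureCollection: one Point feature per parcel, score in properties.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [p["lon"], p["lat"]]},
            "properties": {"parcel_id": p["parcel_id"],
                           "susceptibility": p["susceptibility"]},
        }
        for p in predictions
    ],
}
print(json.dumps(geojson)[:40])
```

The `susceptibility` property can then drive data-driven styling (color ramps, toggles) in the map client.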

Explore the map here!