This notebook demonstrates how GeoWrangler makes it easier to generate ML-ready or analytics-ready datasets from raw geospatial data.
Concretely, this shows a sample workflow for a wealth estimation model trained on a dataset constructed using:
- Ground Truth - DHS household clusters + wealth indices
- Features - Derived from Night Time Lights, OSM POIs, and Ookla internet speeds.
Note: the goal here is to showcase GeoWrangler's functions for geospatial data wrangling and feature engineering. Hence, this notebook keeps the ML modelling portions as simple as possible.
from pathlib import Path
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
from shapely import wkt
from sklearn.ensemble import RandomForestRegressor
import geowrangler.area_zonal_stats as azs
import geowrangler.distance_zonal_stats as dzs
import geowrangler.raster_zonal_stats as rzs
import geowrangler.vector_zonal_stats as vzs
from geowrangler import grids
from geowrangler.datasets import geofabrik, ookla
# download data if not yet available
![ ! -e ../data/phl_adm0.geojson ] && curl -s -o ../data/phl_adm0.geojson https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_adm0.geojson
![ ! -e ../data/phl_dhs_cluster_level.csv ] && curl -s -o ../data/phl_dhs_cluster_level.csv https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_dhs_cluster_level.csv
![ ! -e ../data/phl_ntl.tif ] && curl -s -o ../data/phl_ntl.tif https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_ntl.tif
DATA_DIR = Path("../data/")
# Auto-creates the folder if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
At this point, your data directory should contain the downloaded files.
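A quick way to verify is to list whatever is currently in the data directory; based on the downloads above you should at least see phl_adm0.geojson, phl_dhs_cluster_level.csv, and phl_ntl.tif.
# List the contents of the data directory
for path in sorted(DATA_DIR.iterdir()):
    print(path.name)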
DHS Ground Truth
The ground truth used in this demo notebook comes from a 2017 DHS on-the-ground household survey conducted in the Philippines, which collected socio-demographic information about households.
The file we provide is a pre-processed version of the data, aggregated at the household-cluster level, meaning the information cannot be tied back to any individual household. Specifically, we only provide a list of household clusters with their corresponding (jittered) GPS coordinates and DHS-computed wealth index.
Due to the sensitive nature of the data and the DHS program terms of use, we cannot provide the raw data. You can, however, request access to the raw data yourself on the DHS website. In that case, you can use GeoWrangler's DHS processing utils to help perform this pre-processing.
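We can't demonstrate those utilities on the raw data here, but as a rough illustration (not GeoWrangler's actual API), the pre-processing boils down to aggregating household-level records up to the cluster level, roughly like the sketch below. The file and column names are hypothetical.
# Hypothetical sketch only: the real DHS recode files use different formats and column codes.
household_df = pd.read_csv("dhs_household_records.csv")  # hypothetical input
cluster_df = (
    household_df.groupby("cluster_id")
    .agg(
        wealth_index=("wealth_index", "mean"),
        longitude=("longitude", "first"),
        latitude=("latitude", "first"),
    )
    .reset_index()
)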
Our first step is to create a GeoDataFrame from the data.
GROUND_TRUTH_CSV = DATA_DIR / "phl_dhs_cluster_level.csv"
df = pd.read_csv(GROUND_TRUTH_CSV)
# Some of the coordinates in the data are invalid. This filters them out.
df = df[(df.longitude > 0) & (df.latitude > 0)]
# Create a GeoDataFrame from the longitude, latitude columns.
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="epsg:4326"
)
print(f"There are {len(gdf):,} clusters.")
gdf.head()
Next, we want to create a buffer around each cluster centroid that represents the area's neighborhood. This is so we can engineer some features and characterize these neighborhoods using open data.
This is a design decision, but for this demo we'll create a circular area with a 2km radius, which accounts for the random displacement DHS applies to cluster coordinates to preserve household privacy.
gdf = gdf.to_crs("epsg:3123")
gdf.geometry = gdf.geometry.buffer(2000)
gdf = gdf.to_crs("epsg:4326")
gdf.head(5)
We can visualize what we've done so far on a map.
# gdf.explore()
Ookla
Ookla has released global open data gathered from speed tests made on their platform. This gives us access to internet speed information across various geographies, which in our context can provide a useful predictive signal.
First, let's download a local copy of data on fixed internet in the 1st quarter of 2019 (earliest data available).
We can use GeoWrangler's Ookla data utility to automatically download and cache the desired data on your machine given the type, year, and quarter.
This is just a simplification for the demo. In practice, you might want to aggregate data across multiple time periods or incorporate wireless data.
ookla_fixed_2019_q1_filepath = ookla.download_ookla_file(
type_="fixed", year="2019", quarter="1"
)
# This is where the downloaded file is located.
# By default this downloads to your data/ folder, but you can customize this.
ookla_fixed_2019_q1_filepath
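If you wanted to combine multiple time periods, as noted above, a sketch using the same download utility might look like the following. Keep in mind each quarter is a sizeable download, and how you reconcile overlapping tiles across quarters is a design choice left to you.
# Sketch: download all four quarters of 2019 fixed-line data and stack them.
quarter_files = [
    ookla.download_ookla_file(type_="fixed", year="2019", quarter=q)
    for q in ["1", "2", "3", "4"]
]
ookla_2019_df = pd.concat(
    [pd.read_parquet(f) for f in quarter_files], ignore_index=True
)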
def load_ookla_data(filename, mask=None):
    # Ookla's parquet files don't carry geo metadata, so we read them with pandas first
    # and build the geometry from the WKT in the "tile" column.
ookla_df = pd.read_parquet(filename)
ookla_gdf = gpd.GeoDataFrame(
ookla_df,
geometry=ookla_df["tile"].apply(lambda x: wkt.loads(x)),
crs="epsg:4326",
)
ookla_gdf.drop(columns=["tile"], inplace=True)
# Ookla's data files contain data for the whole world.
# For our case, we're retaining only tiles that intersect with the given mask.
# This is to speed-up any processing we do later on.
if mask is not None:
keep_cols = ookla_gdf.columns
ookla_gdf = ookla_gdf.sjoin(mask, how="inner", predicate="intersects")
ookla_gdf = ookla_gdf[keep_cols]
# Convert kbps to mbps for easier reading
ookla_gdf["avg_d_mbps"] = ookla_gdf["avg_d_kbps"] / 1000
ookla_gdf["avg_u_mbps"] = ookla_gdf["avg_u_kbps"] / 1000
return ookla_gdf
The Ookla data is quite large, and takes around 5 minutes to load.
%%time
ookla_gdf = load_ookla_data(ookla_fixed_2019_q1_filepath, mask=gdf)
print(
f"{len(ookla_gdf):,} Ookla data tiles retained that intersect with our DHS cluster neighborhoods."
)
ookla_gdf.head()
OSM (OpenStreetMap)
One source of open data on points of interest (POIs) is OSM. We'll use the data extracts provided by Geofabrik rather than the OSM Overpass API, so we aren't bottlenecked by API calls.
Similar to Ookla, we can also use GeoWrangler's OSM data download utils to download and cache OSM data on your machine. All you have to do is specify the country you want.
Geofabrik's OSM data is a zip file containing different files for POIs, waterways, roads, etc. For this demo notebook, we'll only use the POIs.
ph_osm_zip_path = geofabrik.download_geofabrik_region("philippines")
# This unzips the archive if we haven't done so yet (-n skips files that already exist).
!unzip -n $ph_osm_zip_path -d data/ph_osm/
# !ls data/ph_osm
ph_osm_pois_filepath = "data/ph_osm/gis_osm_pois_free_1.shp"
ph_osm = gpd.read_file(ph_osm_pois_filepath)
print(f"There are {len(ph_osm):,} OSM POIs.")
ph_osm.head()
Feature Engineering
Now that we've prepared the DHS data and loaded Ookla and OSM data into memory, we're ready to generate some neighborhood features for our ML model.
For convenience, we're wrapping all the feature engineering code in functions so we can re-use them later when we apply the model to the whole country.
Note: if you are modifying the feature engineering portions, it is recommended to run the whole section end-to-end. This ensures you're starting fresh from a new copy of the DHS clusters GeoDataFrame. This also ensures the function definitions are up-to-date, and can be re-used properly later for model rollout.
gdf_with_features = gdf.copy()
OSM
Our goal with OSM data is to generate neighborhood characteristics based on counts and distance to certain POIs, such as schools, hospitals, etc.
To do this, we utilize GeoWrangler's vector zonal stats and distance zonal stats features.
# ph_osm.fclass.sort_values().unique()
def generate_osm_features(aoi, osm, metric_crs="epsg:3123"):
aoi = aoi.copy()
    # GeoWrangler: count all POIs that intersect each AOI (neighborhood)
aoi = vzs.create_zonal_stats(
aoi,
osm,
overlap_method="intersects",
aggregations=[{"func": "count", "output": "poi_count", "fillna": True}],
)
# Count specific aoi types
poi_types = ["restaurant", "school", "bank", "supermarket", "mall", "atm"]
for poi_type in poi_types:
# GeoWrangler: Count with vector zonal stats
aoi = vzs.create_zonal_stats(
aoi,
osm[osm["fclass"] == poi_type],
overlap_method="intersects",
aggregations=[
{"func": "count", "output": f"{poi_type}_count", "fillna": True}
],
)
# GeoWrangler: Distance with distance zonal stats
col_name = f"{poi_type}_nearest"
aoi = dzs.create_distance_zonal_stats(
aoi.to_crs(metric_crs),
osm[osm["fclass"] == poi_type].to_crs(metric_crs),
max_distance=10_000,
aggregations=[],
distance_col=col_name,
).to_crs("epsg:4326")
# If no POI was found within the distance limit, set the distance to a really high value
aoi[col_name] = aoi[col_name].fillna(value=999999)
return aoi
%%time
gdf_with_features = generate_osm_features(gdf_with_features, ph_osm)
gdf_with_features.head()
Ookla
Our goal with Ookla data is to generate neighborhood characteristics based on internet speeds (download, upload, latency).
To do this, we utilize GeoWrangler's area zonal stats feature.
def generate_ookla_features(aoi, ookla_gdf, metric_crs="epsg:3123"):
aoi = aoi.copy()
orig_aoi_crs = aoi.crs
aoi = aoi.to_crs(metric_crs)
ookla_gdf = ookla_gdf.to_crs(metric_crs)
# For better formatting, rename some columns
ookla_gdf = ookla_gdf.rename(
columns={"avg_d_mbps": "d_mbps", "avg_u_mbps": "u_mbps", "avg_lat_ms": "lat_ms"}
)
# GeoWrangler: Compute stats on the various columns
aoi = azs.create_area_zonal_stats(
aoi,
ookla_gdf,
aggregations=[
# Count number of devices in the area
dict(column="devices", func=["raw_sum"], fillna=True),
# Get stats on the download speeds
dict(
column="d_mbps",
func=["imputed_mean", "max", "min", "std"],
fillna=[True, True, True, True],
),
# Get stats on the upload speeds
dict(
column="u_mbps",
func=["imputed_mean", "max", "min", "std"],
fillna=[True, True, True, True],
),
],
# Don't include the land area that intersected as a column
include_intersect=False,
# Don't set minimum values to 0 if the neighborhood's area doesn't fully intersect with Ookla tiles.
fix_min=False,
)
    aoi = aoi.fillna(value=0)
aoi = aoi.to_crs(orig_aoi_crs)
return aoi
%%time
gdf_with_features = generate_ookla_features(gdf_with_features, ookla_gdf)
gdf_with_features.head()
Night Time Lights
Lastly, we'll generate features based on night time lights. GeoWrangler provides a raster zonal stats function that acts as a thin wrapper around the rasterstats package's zonal_stats function. It adds a little convenience in terms of output formatting and automatically joins the computed features back to your input GeoDataFrame.
def generate_ntl_features(aoi, raster_path):
aoi = aoi.copy()
aoi = rzs.create_raster_zonal_stats(
aoi,
raster_path,
aggregation=dict(
func=["mean", "min", "max", "std"],
column="ntl",
),
extra_args=dict(nodata=0),
)
aoi.fillna(0, inplace=True)
return aoi
%%time
gdf_with_features = generate_ntl_features(gdf_with_features, DATA_DIR / "phl_ntl.tif")
# gdf_with_features.explore()
With all the features generated, we can now separate the feature columns from the target wealth index and train a simple model.
feature_cols = gdf_with_features.drop(
    columns=["DHSCLUST", "DHSID", "Wealth Index", "longitude", "latitude", "geometry"]
).columns
# Separate data (X) and the target (y)
X = gdf_with_features[feature_cols]
y = gdf_with_features["Wealth Index"]
# Bin y into 5 buckets for stratification when splitting the dataset
cat_y = pd.cut(y, bins=5, labels=list(range(1, 6)))
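As a quick sanity check on the binning, you can look at how the clusters are distributed across the five buckets:
# Number of clusters per wealth-index bucket
cat_y.value_counts().sort_index()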
# Generate train and test partitions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, test_size=0.2, random_state=2022, stratify=cat_y
)
X_train.head()
y_train.head()
Next, we define an evaluation helper, then fit a simple random forest regressor and check the R² score on both the train and test sets, along with a predicted-vs-actual scatter plot.
def evaluate(model, X_train, X_test, y_train, y_test):
# R^2
train_predictions = model.predict(X_train)
print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
test_predictions = model.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))
# Plot
# Train and test predictions vs actual values
plt.scatter(train_predictions, y_train, label="Train samples", c="#d95f02")
plt.scatter(test_predictions, y_test, label="Test samples", c="#7570b3")
# Identity line
xpoints = ypoints = plt.xlim()
plt.plot(
xpoints, ypoints, linestyle="--", color="k", lw=3, scalex=False, scaley=False
)
# Labels and legends
plt.xlabel("Predicted value")
plt.ylabel("True value")
plt.legend()
plt.tight_layout()
plt.show()
model = RandomForestRegressor()
model.fit(X_train, y_train)
evaluate(model, X_train, X_test, y_train, y_test)
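Since this is a random forest, a quick way to see which neighborhood features drive the predictions is to inspect the fitted model's impurity-based feature importances (a rough indication only, since correlated features share importance):
# Rank features by impurity-based importance
feature_importances = pd.Series(
    model.feature_importances_, index=feature_cols
).sort_values(ascending=False)
feature_importances.head(10)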
Generate country-wide tiles
To roll out the model across the whole country, we first need to split the country up into smaller areas.
That usually means generating grid tiles or hex tiles over your area of interest. In this notebook, we'll generate Bing tiles at zoom level 13, whose tiles are the closest in area to the 2km-radius neighborhoods we used in training.
GeoWrangler provides various grid generation utilities for these different scenarios.
# This file is a simplified version of the admin boundary to make the file size smaller and load times faster.
phl_adm = gpd.read_file(f"{DATA_DIR}/phl_adm0.geojson")
# Uncomment this line if you want to visualize it
# phl_adm.explore()
The generation can take around 1-2 minutes.
%%time
phl_tiles = grids.BingTileGridGenerator(13).generate_grid(phl_adm)
print(f"There are {len(phl_tiles):,} tiles.")
phl_tiles.head()
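As mentioned above, GeoWrangler also ships other grid generators (e.g. square grids and H3 hexagons) for different scenarios. The commented-out sketch below shows how they might be called; the parameter names are my assumption from the grids module, so check the GeoWrangler docs before relying on them. For the rest of the rollout we'll stick with the Bing tiles and generate the same features for them.
# Optional alternatives to Bing tiles (parameter names are assumptions; see the GeoWrangler grids docs)
# square_tiles = grids.SquareGridGenerator(cell_size=2_500).generate_grid(phl_adm)
# hex_tiles = grids.H3GridGenerator(resolution=7).generate_grid(phl_adm)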
phl_tiles_with_features = phl_tiles.copy()
%%time
phl_tiles_with_features = generate_osm_features(phl_tiles_with_features, ph_osm)
%%time
phl_tiles_with_features = generate_ookla_features(phl_tiles_with_features, ookla_gdf)
%%time
phl_tiles_with_features = generate_ntl_features(
phl_tiles_with_features, f"{DATA_DIR}/phl_ntl.tif"
)
phl_tiles_with_features["predicted_wealth_index"] = model.predict(
phl_tiles_with_features[feature_cols]
)
def viz_choropleth(gdf, target_feature="predicted_wealth_index"):
    m = folium.Map(location=[14.6091, 121.0223], width=1000, height=800, zoom_start=7)
    subset = gdf.copy()
    # Add a sequential id column so the choropleth values can be joined back to the geometries
    subset["id"] = list(range(len(subset)))
    folium.Choropleth(
        geo_data=subset,
        data=subset,
        name="Wealth Predictions",
        columns=["id", target_feature],
        key_on="feature.properties.id",
        fill_color="Spectral",
        legend_name=target_feature,
    ).add_to(m)
    return m
Visualize the output
# viz_choropleth(phl_tiles_with_features)
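Finally, if you want to persist the country-wide predictions for use elsewhere, you can write the GeoDataFrame out with standard GeoPandas I/O (the file name here is just a suggestion):
# Save the tiles with their predicted wealth index; GeoPackage or (Geo)Parquet also work
phl_tiles_with_features.to_file(
    DATA_DIR / "phl_wealth_predictions.geojson", driver="GeoJSON"
)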