Explore Poverty Mapping Machine Learning Models

Building poverty mapping machine learning (ML) models with Geowrangler

The aim of this notebook is to demonstrate how Geowrangler can be used to make the process of generating ML-ready or analytics-ready datasets from raw geospatial data easier.

Concretely, this shows a sample workflow for a wealth estimation model trained on a dataset constructed using:

- DHS household survey data (the ground truth wealth index)
- Ookla internet speed data
- OSM points of interest
- night time lights raster data

Note

The goal here is to showcase Geowrangler’s functions for geospatial data wrangling and feature engineering. Hence, this notebook keeps the ML modelling portions as simple as possible.

Set-up

Install and import some libraries.

from pathlib import Path

import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
from shapely import wkt
from sklearn.ensemble import RandomForestRegressor

import geowrangler.area_zonal_stats as azs
import geowrangler.distance_zonal_stats as dzs
import geowrangler.raster_zonal_stats as rzs
import geowrangler.vector_zonal_stats as vzs
from geowrangler import grids
from geowrangler.datasets import geofabrik, ookla

Pre-requisite: Manual Data Download

DATA_DIR = Path("../data/")
# Auto-creates the folder if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

# download data if not yet available
![ ! -e ../data/phl_adm0.geojson ] && curl -s -o ../data/phl_adm0.geojson https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_adm0.geojson
![ ! -e ../data/phl_dhs_cluster_level.csv ] && curl -s -o ../data/phl_dhs_cluster_level.csv https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_dhs_cluster_level.csv
![ ! -e ../data/phl_ntl.tif ] && curl -s -o ../data/phl_ntl.tif https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_ntl.tif

Your data directory should look something like this.

(Screenshot: contents of the data/ directory)

Data Preparation

DHS Ground Truth

The ground truth used in this demo notebook comes from the 2017 DHS on-the-ground household survey conducted in the Philippines, which collected socio-demographic information about households.

The file we provide is already a pre-processed version of the data that is aggregated on a household-cluster level, meaning the information cannot be tied back to any individual household. Specifically, we only provide a list of household clusters with their corresponding (jittered) GPS coordinate and DHS-computed wealth index.

Due to the sensitive nature of the data and the DHS program’s terms of use, we cannot provide the raw data. You can, however, request access to the raw data yourself on the DHS website. In that case, you can use GeoWrangler’s DHS processing utils to help perform the said pre-processing.
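For reference, the cluster-level aggregation is conceptually simple. Below is a minimal pandas sketch of the idea, assuming a hypothetical household-level DataFrame with cluster_id and wealth_index columns; actual DHS recode files use survey-specific column names, which the GeoWrangler utils handle for you.

# Minimal sketch of cluster-level aggregation (hypothetical column names;
# real DHS recode files name these differently).
def aggregate_to_cluster_level(household_df):
    # Average the household wealth index per cluster so that no row can
    # be traced back to an individual household.
    return household_df.groupby("cluster_id", as_index=False).agg(
        wealth_index=("wealth_index", "mean")
    )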

Our first step is to create a GeoDataFrame from the data.

# Load ground truth data as a DataFrame first
GROUND_TRUTH_CSV = DATA_DIR / "phl_dhs_cluster_level.csv"
df = pd.read_csv(GROUND_TRUTH_CSV)

# Some of the coordinates in the data are invalid. This filters them out.
df = df[(df.longitude > 0) & (df.latitude > 0)]

# Create a GeoDataFrame from the longitude, latitude columns.
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="epsg:4326"
)

print(f"There are {len(gdf):,} clusters.")
gdf.head()
There are 1,213 clusters.
DHSCLUST Wealth Index DHSID longitude latitude geometry
0 1 -31881.60870 PH201700000001 122.109807 6.674652 POINT (122.10981 6.67465)
1 2 -2855.37500 PH201700000002 122.132027 6.662256 POINT (122.13203 6.66226)
2 3 -57647.04762 PH201700000003 122.179496 6.621822 POINT (122.1795 6.62182)
3 4 -54952.66667 PH201700000004 122.137965 6.485298 POINT (122.13796 6.4853)
5 6 -80701.69565 PH201700000006 121.916094 6.629457 POINT (121.91609 6.62946)

Next, we want to create a buffer around each cluster centroid that represents the area’s neighborhood. This is so we can engineer some features and characterize these neighborhoods using open data.

This is a design decision, but for this demo we’ll create a circular area with a 2km radius, in line with the random displacement DHS applies to preserve household privacy.

# Make sure to convert to PH crs first for the buffer in meters, and then back to WGS84
gdf = gdf.to_crs("epsg:3123")
gdf.geometry = gdf.geometry.buffer(2000)
gdf = gdf.to_crs("epsg:4326")

gdf.head(5)
DHSCLUST Wealth Index DHSID longitude latitude geometry
0 1 -31881.60870 PH201700000001 122.109807 6.674652 POLYGON ((122.12789 6.67461, 122.1278 6.67284,...
1 2 -2855.37500 PH201700000002 122.132027 6.662256 POLYGON ((122.15011 6.66221, 122.15002 6.66044...
2 3 -57647.04762 PH201700000003 122.179496 6.621822 POLYGON ((122.19758 6.62178, 122.19749 6.62001...
3 4 -54952.66667 PH201700000004 122.137965 6.485298 POLYGON ((122.15604 6.48526, 122.15595 6.48349...
5 6 -80701.69565 PH201700000006 121.916094 6.629457 POLYGON ((121.93418 6.62942, 121.93409 6.62765...

We can visualize what we’ve done so far on a map.

(Map: DHS Clusters Philippines (2017))

# Uncomment the next line to display an interactive map
# gdf.explore()

Download data and load it into memory

Next, we’ll download some data from Ookla (internet speeds) and OSM (points of interest), which we’ll use to generate neighborhood characteristics to be used as ML features.

Ookla

Ookla has released global open data gathered from speed tests run on their platform. This gives us access to internet speed information across various geographies, which can serve as a useful predictor of wealth.

First, let’s download a local copy of the fixed broadband data for the first quarter of 2019 (the earliest data available).

We can use GeoWrangler’s Ookla data utility to automatically download and cache the desired data on your machine given the type, year, and quarter.

This is just a simplification for the demo. In practice, you might want to aggregate data across multiple time periods or incorporate wireless data; a sketch follows below.
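As a rough sketch, the same download utility can be called once per quarter and the results combined after loading. The quarters listed are illustrative, and the snippet is left commented out since each file is a sizeable download.

# Sketch: download several quarters (illustrative list), load each with the
# helper defined below, then combine with pd.concat before aggregating.
# quarterly_paths = [
#     ookla.download_ookla_file(type_="fixed", year="2019", quarter=q)
#     for q in ["1", "2", "3", "4"]
# ]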

ookla_fixed_2019_q1_filepath = ookla.download_ookla_file(
    type_="fixed", year="2019", quarter="1"
)
# This is where the downloaded file is located.
# By default this downloads to your data/ folder, but you can customize this.
ookla_fixed_2019_q1_filepath
2024-08-12 22:32:01.747 | INFO     | geowrangler.datasets.utils:urlretrieve:25 - Retrieving https://ookla-open-data.s3.us-west-2.amazonaws.com/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet into data/2019-01-01_performance_fixed_tiles.parquet
100.00% [227835904/227829253 00:16<00:00]
Path('data/2019-01-01_performance_fixed_tiles.parquet')
# This is a function to load and do some light pre-processing on the Ookla data.


def load_ookla_data(filename, mask=None):
    # Ookla's parquet file doesn't have geo metadata, so we read it with
    # pandas first and build geometries from the WKT `tile` column.
    ookla_df = pd.read_parquet(filename)
    ookla_gdf = gpd.GeoDataFrame(
        ookla_df,
        geometry=ookla_df["tile"].apply(wkt.loads),
        crs="epsg:4326",
    )
    ookla_gdf = ookla_gdf.drop(columns=["tile"])

    # Ookla's data files cover the whole world.
    # For our case, we retain only tiles that intersect with the given mask.
    # This speeds up any processing we do later on.
    if mask is not None:
        keep_cols = ookla_gdf.columns
        ookla_gdf = ookla_gdf.sjoin(mask, how="inner", predicate="intersects")
        ookla_gdf = ookla_gdf[keep_cols]

    # Convert kbps to mbps for easier reading
    ookla_gdf["avg_d_mbps"] = ookla_gdf["avg_d_kbps"] / 1000
    ookla_gdf["avg_u_mbps"] = ookla_gdf["avg_u_kbps"] / 1000

    return ookla_gdf

The Ookla data is quite large, so loading and filtering it can take a few minutes depending on your machine.

ookla_gdf = load_ookla_data(ookla_fixed_2019_q1_filepath, mask=gdf)
CPU times: user 22.3 s, sys: 3.6 s, total: 25.9 s
Wall time: 28.3 s
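Since this load-and-filter step is slow, you may optionally cache the filtered GeoDataFrame to GeoParquet so later sessions can skip it. This is plain GeoPandas parquet I/O, and the file name here is arbitrary.

# Optional: cache the filtered tiles so subsequent runs can skip the slow step.
# The file name is arbitrary.
ookla_gdf.to_parquet(DATA_DIR / "ookla_ph_fixed_2019q1.parquet")
# On later runs: ookla_gdf = gpd.read_parquet(DATA_DIR / "ookla_ph_fixed_2019q1.parquet")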
print(
    f"{len(ookla_gdf):,} Ookla data tiles retained that intersect with our DHS cluster neighborhoods."
)
ookla_gdf.head()
17,245 Ookla data tiles retained that intersect with our DHS cluster neighborhoods.
quadkey avg_d_kbps avg_u_kbps avg_lat_ms tests devices geometry avg_d_mbps avg_u_mbps
4368240 1323011210311031 10 21 126 1 1 POLYGON ((121.96472 20.4579, 121.97021 20.4579... 0.010 0.021
4368240 1323011210311031 10 21 126 1 1 POLYGON ((121.96472 20.4579, 121.97021 20.4579... 0.010 0.021
4368240 1323011210311031 10 21 126 1 1 POLYGON ((121.96472 20.4579, 121.97021 20.4579... 0.010 0.021
4368240 1323011210311031 10 21 126 1 1 POLYGON ((121.96472 20.4579, 121.97021 20.4579... 0.010 0.021
4368241 1323011210311033 5931 10633 228 3 2 POLYGON ((121.96472 20.45275, 121.97021 20.452... 5.931 10.633

OSM (Open Street Maps)

One source of open data on points of interest is OSM. We’ll use the data extracts provided by Geofabrik rather than querying the OSM Overpass API directly, which keeps us from being bottlenecked by API calls.

Similar to Ookla, we can also use GeoWrangler’s OSM data download utils to download and cache OSM data on your machine. All you have to do is specify the country you want.

Geofabrik’s OSM data is a zip file containing different files for POIs, waterways, roads, etc. For this demo notebook, we’ll only use the POIs.

# This download is over 1 GB and can take several minutes.
ph_osm_zip_path = geofabrik.download_geofabrik_region("philippines")

# This unzips the archive; piping "None" answers any overwrite prompts
# with N(one), so files that were already extracted are skipped.
!echo 'None' | unzip $ph_osm_zip_path -d data/ph_osm/
2024-08-12 22:32:49.191 | INFO     | geowrangler.datasets.utils:urlretrieve:25 - Retrieving https://download.geofabrik.de/asia/philippines-latest-free.shp.zip into data/philippines-latest-free.shp.zip
100.00% [1287127040/1287122270 06:08<00:00]
Archive:  data/philippines-latest-free.shp.zip
  inflating: data/ph_osm/README      
 extracting: data/ph_osm/gis_osm_buildings_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_buildings_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_buildings_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_buildings_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_buildings_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_landuse_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_landuse_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_landuse_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_landuse_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_landuse_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_natural_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_natural_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_natural_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_natural_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_natural_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_natural_free_1.cpg  
  inflating: data/ph_osm/gis_osm_natural_free_1.dbf  
  inflating: data/ph_osm/gis_osm_natural_free_1.prj  
  inflating: data/ph_osm/gis_osm_natural_free_1.shp  
  inflating: data/ph_osm/gis_osm_natural_free_1.shx  
 extracting: data/ph_osm/gis_osm_places_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_places_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_places_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_places_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_places_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_places_free_1.cpg  
  inflating: data/ph_osm/gis_osm_places_free_1.dbf  
  inflating: data/ph_osm/gis_osm_places_free_1.prj  
  inflating: data/ph_osm/gis_osm_places_free_1.shp  
  inflating: data/ph_osm/gis_osm_places_free_1.shx  
 extracting: data/ph_osm/gis_osm_pofw_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_pofw_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_pofw_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_pofw_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_pofw_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_pofw_free_1.cpg  
  inflating: data/ph_osm/gis_osm_pofw_free_1.dbf  
  inflating: data/ph_osm/gis_osm_pofw_free_1.prj  
  inflating: data/ph_osm/gis_osm_pofw_free_1.shp  
  inflating: data/ph_osm/gis_osm_pofw_free_1.shx  
 extracting: data/ph_osm/gis_osm_pois_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_pois_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_pois_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_pois_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_pois_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_pois_free_1.cpg  
  inflating: data/ph_osm/gis_osm_pois_free_1.dbf  
  inflating: data/ph_osm/gis_osm_pois_free_1.prj  
  inflating: data/ph_osm/gis_osm_pois_free_1.shp  
  inflating: data/ph_osm/gis_osm_pois_free_1.shx  
 extracting: data/ph_osm/gis_osm_railways_free_1.cpg  
  inflating: data/ph_osm/gis_osm_railways_free_1.dbf  
  inflating: data/ph_osm/gis_osm_railways_free_1.prj  
  inflating: data/ph_osm/gis_osm_railways_free_1.shp  
  inflating: data/ph_osm/gis_osm_railways_free_1.shx  
 extracting: data/ph_osm/gis_osm_roads_free_1.cpg  
  inflating: data/ph_osm/gis_osm_roads_free_1.dbf  
  inflating: data/ph_osm/gis_osm_roads_free_1.prj  
  inflating: data/ph_osm/gis_osm_roads_free_1.shp  
  inflating: data/ph_osm/gis_osm_roads_free_1.shx  
 extracting: data/ph_osm/gis_osm_traffic_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_traffic_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_traffic_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_traffic_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_traffic_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_traffic_free_1.cpg  
  inflating: data/ph_osm/gis_osm_traffic_free_1.dbf  
  inflating: data/ph_osm/gis_osm_traffic_free_1.prj  
  inflating: data/ph_osm/gis_osm_traffic_free_1.shp  
  inflating: data/ph_osm/gis_osm_traffic_free_1.shx  
 extracting: data/ph_osm/gis_osm_transport_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_transport_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_transport_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_transport_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_transport_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_transport_free_1.cpg  
  inflating: data/ph_osm/gis_osm_transport_free_1.dbf  
  inflating: data/ph_osm/gis_osm_transport_free_1.prj  
  inflating: data/ph_osm/gis_osm_transport_free_1.shp  
  inflating: data/ph_osm/gis_osm_transport_free_1.shx  
 extracting: data/ph_osm/gis_osm_water_a_free_1.cpg  
  inflating: data/ph_osm/gis_osm_water_a_free_1.dbf  
  inflating: data/ph_osm/gis_osm_water_a_free_1.prj  
  inflating: data/ph_osm/gis_osm_water_a_free_1.shp  
  inflating: data/ph_osm/gis_osm_water_a_free_1.shx  
 extracting: data/ph_osm/gis_osm_waterways_free_1.cpg  
  inflating: data/ph_osm/gis_osm_waterways_free_1.dbf  
  inflating: data/ph_osm/gis_osm_waterways_free_1.prj  
  inflating: data/ph_osm/gis_osm_waterways_free_1.shp  
  inflating: data/ph_osm/gis_osm_waterways_free_1.shx  
# Uncomment the following line to see all the available OSM files aside from POIs.
# !ls data/ph_osm
ph_osm_pois_filepath = "data/ph_osm/gis_osm_pois_free_1.shp"
ph_osm = gpd.read_file(ph_osm_pois_filepath)
print(f"There are {len(ph_osm)} OSM POIs.")
ph_osm.head()
There are 150311 OSM POIs.
osm_id code fclass name geometry
0 21717820 2907 camera_surveillance None POINT (121.0212 14.57608)
1 21717872 2722 museum Ayala Museum POINT (121.02324 14.55358)
2 24078959 2907 camera_surveillance None POINT (121.05945 14.60098)
3 24797511 2542 bicycle_shop Christine Sports Cycle Marketing POINT (120.99506 14.55224)
4 24797535 2518 beverages G and K wine POINT (120.99267 14.55285)
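Before deciding which POI types to engineer features for, it can help to check which classes are most common in the data. A quick value count (standard pandas, nothing GeoWrangler-specific) does the trick.

# Inspect the most common POI classes to guide feature selection.
ph_osm["fclass"].value_counts().head(10)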

Feature Engineering

Now that we’ve prepared the DHS data and loaded Ookla and OSM data into memory, we’re ready to generate some neighborhood features for our ML model.

For convenience, we’re wrapping all the feature engineering code in functions so we can re-use them later when we apply the model to the whole country.

Note: if you are modifying the feature engineering portions, it is recommended to re-run this whole section end-to-end. This ensures you start fresh from a new copy of the DHS clusters GeoDataFrame, that the function definitions are up-to-date, and that they can be re-used properly for model rollout later.

# We'll make a copy of the GDF to avoid overwriting the original DHS GeoDataFrame.
gdf_with_features = gdf.copy()

OSM

Our goal with OSM data is to generate neighborhood characteristics based on counts of and distances to certain POIs, such as restaurants, schools, and banks.

To do this, we utilize GeoWrangler’s vector zonal stats and distance zonal stats features.

# Uncomment the following line if you want to see the different POI classes available
# ph_osm.fclass.sort_values().unique()
def generate_osm_features(aoi, osm, metric_crs="epsg:3123"):

    aoi = aoi.copy()

    # GeoWrangler: Count number of all POIs per tile
    aoi = vzs.create_zonal_stats(
        aoi,
        osm,
        overlap_method="intersects",
        aggregations=[{"func": "count", "output": "poi_count", "fillna": True}],
    )

    # Count specific POI types
    poi_types = ["restaurant", "school", "bank", "supermarket", "mall", "atm"]

    for poi_type in poi_types:
        # GeoWrangler: Count with vector zonal stats
        aoi = vzs.create_zonal_stats(
            aoi,
            osm[osm["fclass"] == poi_type],
            overlap_method="intersects",
            aggregations=[
                {"func": "count", "output": f"{poi_type}_count", "fillna": True}
            ],
        )

        # GeoWrangler: Distance with distance zonal stats
        col_name = f"{poi_type}_nearest"
        aoi = dzs.create_distance_zonal_stats(
            aoi.to_crs(metric_crs),
            osm[osm["fclass"] == poi_type].to_crs(metric_crs),
            max_distance=10_000,
            aggregations=[],
            distance_col=col_name,
        ).to_crs("epsg:4326")

        # If no POI was found within the distance limit, set the distance to a really high value
        aoi[col_name] = aoi[col_name].fillna(value=999999)

    return aoi
gdf_with_features = generate_osm_features(gdf_with_features, ph_osm)
CPU times: user 746 ms, sys: 29.4 ms, total: 775 ms
Wall time: 793 ms
gdf_with_features.head()
DHSCLUST Wealth Index DHSID longitude latitude geometry poi_count restaurant_count restaurant_nearest school_count school_nearest bank_count bank_nearest supermarket_count supermarket_nearest mall_count mall_nearest atm_count atm_nearest
0 1 -31881.60870 PH201700000001 122.109807 6.674652 POLYGON ((122.12789 6.67461, 122.1278 6.67284,... 0.0 0.0 3152.996430 0.0 1392.503631 0.0 1764.736950 0.0 999999.0 0.0 999999.0 0.0 999999.000000
1 2 -2855.37500 PH201700000002 122.132027 6.662256 POLYGON ((122.15011 6.66221, 122.15002 6.66044... 6.0 0.0 5788.165918 0.0 3601.132502 1.0 0.000000 0.0 999999.0 0.0 999999.0 0.0 999999.000000
2 3 -57647.04762 PH201700000003 122.179496 6.621822 POLYGON ((122.19758 6.62178, 122.19749 6.62001... 0.0 0.0 999999.000000 0.0 642.871037 0.0 4026.300327 0.0 999999.0 0.0 999999.0 0.0 999999.000000
3 4 -54952.66667 PH201700000004 122.137965 6.485298 POLYGON ((122.15604 6.48526, 122.15595 6.48349... 0.0 0.0 999999.000000 0.0 2287.839903 0.0 999999.000000 0.0 999999.0 0.0 999999.0 0.0 999999.000000
5 6 -80701.69565 PH201700000006 121.916094 6.629457 POLYGON ((121.93418 6.62942, 121.93409 6.62765... 0.0 0.0 8408.925995 0.0 3930.606257 0.0 8332.324084 0.0 999999.0 0.0 999999.0 0.0 8311.332736

Ookla

Our goal with Ookla data is to generate neighborhood characteristics based on internet speeds (download, upload, latency).

To do this, we utilize GeoWrangler’s area zonal stats feature.

def generate_ookla_features(aoi, ookla_gdf, metric_crs="epsg:3123"):

    aoi = aoi.copy()

    orig_aoi_crs = aoi.crs
    aoi = aoi.to_crs(metric_crs)
    ookla_gdf = ookla_gdf.to_crs(metric_crs)

    # For better formatting, rename some columns
    ookla_gdf = ookla_gdf.rename(
        columns={"avg_d_mbps": "d_mbps", "avg_u_mbps": "u_mbps", "avg_lat_ms": "lat_ms"}
    )

    # GeoWrangler: Compute stats on the various columns
    aoi = azs.create_area_zonal_stats(
        aoi,
        ookla_gdf,
        aggregations=[
            # Count number of devices in the area
            dict(column="devices", func=["raw_sum"], fillna=True),
            # Get stats on the download speeds
            dict(
                column="d_mbps",
                func=["imputed_mean", "max", "min", "std"],
                fillna=[True, True, True, True],
            ),
            # Get stats on the upload speeds
            dict(
                column="u_mbps",
                func=["imputed_mean", "max", "min", "std"],
                fillna=[True, True, True, True],
            ),
        ],
        # Don't include the land area that intersected as a column
        include_intersect=False,
        # Don't set minimum values to 0 if the neighborhood's area doesn't fully intersect with Ookla tiles.
        fix_min=False,
    )

    # Fill any remaining missing stats (e.g., areas with no Ookla tiles) with 0
    aoi = aoi.fillna(value=0)
    aoi = aoi.to_crs(orig_aoi_crs)

    return aoi
gdf_with_features = generate_ookla_features(gdf_with_features, ookla_gdf)
CPU times: user 1.01 s, sys: 27.6 ms, total: 1.04 s
Wall time: 1.05 s
gdf_with_features.head()
DHSCLUST Wealth Index DHSID longitude latitude geometry poi_count restaurant_count restaurant_nearest school_count ... atm_nearest devices_sum d_mbps_mean d_mbps_max d_mbps_min d_mbps_std u_mbps_mean u_mbps_max u_mbps_min u_mbps_std
0 1 -31881.60870 PH201700000001 122.109807 6.674652 POLYGON ((122.12789 6.67461, 122.1278 6.67284,... 0.0 0.0 3152.996430 0.0 ... 999999.000000 0.0 0.000000 0.000 0.000 0.000000 0.000000 0.000 0.000 0.000000
1 2 -2855.37500 PH201700000002 122.132027 6.662256 POLYGON ((122.15011 6.66221, 122.15002 6.66044... 6.0 0.0 5788.165918 0.0 ... 999999.000000 38.0 0.401851 14.628 2.177 4.982053 0.255031 6.867 1.866 1.753667
2 3 -57647.04762 PH201700000003 122.179496 6.621822 POLYGON ((122.19758 6.62178, 122.19749 6.62001... 0.0 0.0 999999.000000 0.0 ... 999999.000000 0.0 0.000000 0.000 0.000 0.000000 0.000000 0.000 0.000 0.000000
3 4 -54952.66667 PH201700000004 122.137965 6.485298 POLYGON ((122.15604 6.48526, 122.15595 6.48349... 0.0 0.0 999999.000000 0.0 ... 999999.000000 0.0 0.000000 0.000 0.000 0.000000 0.000000 0.000 0.000 0.000000
5 6 -80701.69565 PH201700000006 121.916094 6.629457 POLYGON ((121.93418 6.62942, 121.93409 6.62765... 0.0 0.0 8408.925995 0.0 ... 8311.332736 0.0 0.000000 0.000 0.000 0.000000 0.000000 0.000 0.000 0.000000

5 rows × 28 columns

Night Time Lights

Lastly, we’ll generate features based on night time lights. GeoWrangler provides a raster zonal stats function that acts as a thin layer on top of the rasterstats package’s zonal_stats. It adds some convenience in formatting and automatically joins the computed features back to your input GeoDataFrame.

def generate_ntl_features(aoi, raster_path):
    aoi = aoi.copy()

    aoi = rzs.create_raster_zonal_stats(
        aoi,
        raster_path,
        aggregation=dict(
            func=["mean", "min", "max", "std"],
            column="ntl",
        ),
        extra_args=dict(nodata=0),
    )

    aoi = aoi.fillna(0)

    return aoi
gdf_with_features = generate_ntl_features(gdf_with_features, DATA_DIR / "phl_ntl.tif")
CPU times: user 1.24 s, sys: 198 ms, total: 1.44 s
Wall time: 1.57 s
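As an optional sanity check, we can glance at summary statistics of the newly generated night time lights columns (this assumes the ntl_ column prefix configured above).

# Optional sanity check: summary stats for the generated NTL features.
gdf_with_features.filter(like="ntl_").describe()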

Visualize the final dataset

Finally, let’s visualize our cluster neighborhoods, now with the generated features. Hover over each cluster to see its data.

(Map: DHS Cluster Features (Zoomed))
# Visualize the clusters with the generated features
# gdf_with_features.explore()

ML Training

In this section, we train an ML model to predict the wealth index of an area given its features.

Prepare data into X, y and generate train/test partitions

# For convenience, store the list of features in a variable. Remove all extraneous columns.
feature_cols = gdf_with_features.drop(
    columns=["DHSCLUST", "DHSID", "Wealth Index", "longitude", "latitude", "geometry"]
).columns

# Separate data (X) and the target (y)
X = gdf_with_features[feature_cols]
y = gdf_with_features["Wealth Index"]

# Bin y into 5 buckets for stratification when splitting the dataset
cat_y = pd.cut(y, bins=5, labels=list(range(1, 6)))

# Generate train and test partitions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=cat_y
)
X_train.head()
poi_count restaurant_count restaurant_nearest school_count school_nearest bank_count bank_nearest supermarket_count supermarket_nearest mall_count ... d_mbps_min d_mbps_std u_mbps_mean u_mbps_max u_mbps_min u_mbps_std ntl_mean ntl_min ntl_max ntl_std
944 1.0 0.0 4332.445808 1.0 0.000000 0.0 999999.000000 0.0 5595.357740 0.0 ... 0.000 0.000000 0.000000 0.000 0.000 0.000000 0.279636 0.279636 0.279636 0.000000
328 1203.0 143.0 0.000000 23.0 0.000000 73.0 0.000000 14.0 0.000000 1.0 ... 8.948 6.519399 0.033496 30.564 2.496 5.933372 9.881545 4.587300 17.190298 2.768707
696 0.0 0.0 4396.530482 0.0 7331.181601 0.0 999999.000000 0.0 999999.000000 0.0 ... 0.000 0.000000 0.000000 0.000 0.000 0.000000 0.000000 0.000000 0.000000 0.000000
155 2.0 0.0 352.510311 0.0 669.141767 0.0 2606.478381 0.0 2566.227871 0.0 ... 1.215 31.305303 0.863634 151.303 0.106 35.998549 0.299591 0.182455 0.519110 0.125637
614 286.0 12.0 0.000000 4.0 0.000000 14.0 0.000000 8.0 0.000000 1.0 ... 2.986 6.722738 0.174738 32.774 0.527 7.027801 0.335981 0.216280 0.543996 0.084581

5 rows × 26 columns

y_train.head()
944    -27114.20833
328    175912.44440
696    -29987.93750
155     75826.10526
614     54849.37500
Name: Wealth Index, dtype: float64

Evaluation Function

We define an evaluation function that computes the R^2 score and plots predicted vs. actual values.

def evaluate(model, X_train, X_test, y_train, y_test):

    # R^2
    train_predictions = model.predict(X_train)
    print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))

    test_predictions = model.predict(X_test)
    print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))

    # Plot

    # Train and test predictions vs actual values
    plt.scatter(train_predictions, y_train, label="Train samples", c="#d95f02")
    plt.scatter(test_predictions, y_test, label="Test samples", c="#7570b3")

    # Identity line
    xpoints = ypoints = plt.xlim()
    plt.plot(
        xpoints, ypoints, linestyle="--", color="k", lw=3, scalex=False, scaley=False
    )

    # Labels and legends
    plt.xlabel("Predicted value")
    plt.ylabel("True value")
    plt.legend()
    plt.tight_layout()
    plt.show()

Fit and Evaluate Model

For ML modelling, let’s fit a simple random forest regressor.

As mentioned, we are keeping the ML modelling as simple as possible in this notebook. This is definitely not the best-performing wealth estimation model out there, but should be decent enough for demo purposes.

model = RandomForestRegressor()
model.fit(X_train, y_train)
evaluate(model, X_train, X_test, y_train, y_test)
Train R2 score: 0.9244278211179482
Test R2 score: 0.5882081532536176
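To get a feel for which neighborhood characteristics drive the predictions, we can inspect the forest’s impurity-based feature importances. Treat this as a rough diagnostic rather than a causal statement.

# Rough diagnostic: rank features by impurity-based importance.
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances.sort_values(ascending=False).head(10)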

ML Rollout (Phl)

Now that we have a wealth estimation model, we can apply it to the whole country.

Generate country-wide tiles

First, we need to split the country up into smaller areas.

That usually means generating square grid tiles or hex tiles over your area of interest. In this notebook, we’ll generate Bing tiles at zoom level 13 (whose tiles are closest in area to the 2km-radius neighborhoods used in training).

GeoWrangler provides various grid generation utilities for these different scenarios.

# Read in the admin boundary for PHL.
# This file is a simplified version of the admin boundary to make the file size smaller and load times faster.
phl_adm = gpd.read_file(f"{DATA_DIR}/phl_adm0.geojson")

# Uncomment this line if you want to visualize it
# phl_adm.explore()
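GeoWrangler’s grids module also supports other tiling schemes. As a sketch (left commented out, and assuming SquareGridGenerator takes a cell size in meters per the grids documentation), a square grid could be generated instead:

# Alternative (not run here): a square grid over the same boundary.
# Assumes grids.SquareGridGenerator takes a cell size in meters.
# square_tiles = grids.SquareGridGenerator(2500).generate_grid(phl_adm)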

The generation can take around 1-2 minutes.

phl_tiles = grids.BingTileGridGenerator(13).generate_grid(phl_adm)
print(f"There are {len(phl_tiles):,} tiles.")
phl_tiles.head()
There are 16,257 tiles.
CPU times: user 7.34 s, sys: 57.5 ms, total: 7.39 s
Wall time: 7.5 s
quadkey geometry
0 1323230031202 POLYGON ((119.17969 4.65308, 119.17969 4.69688...
1 1323230030300 POLYGON ((119.00391 4.69688, 119.00391 4.74068...
2 1323230030311 POLYGON ((119.13574 4.69688, 119.13574 4.74068...
3 1323230031201 POLYGON ((119.22363 4.69688, 119.22363 4.74068...
4 1323212220302 POLYGON ((118.30078 6.05316, 118.30078 6.09686...

Feature Engineering

This part is simpler, as we’re just re-using the feature engineering functions we created earlier. Note though that we are now generating features for a much bigger area, so the runtime is a little longer (around two minutes in total).

phl_tiles_with_features = phl_tiles.copy()
phl_tiles_with_features = generate_osm_features(phl_tiles_with_features, ph_osm)
CPU times: user 1.17 s, sys: 39.2 ms, total: 1.21 s
Wall time: 1.24 s
phl_tiles_with_features = generate_ookla_features(phl_tiles_with_features, ookla_gdf)
CPU times: user 465 ms, sys: 16.3 ms, total: 481 ms
Wall time: 485 ms
phl_tiles_with_features = generate_ntl_features(
    phl_tiles_with_features, f"{DATA_DIR}/phl_ntl.tif"
)
CPU times: user 13.4 s, sys: 1.8 s, total: 15.2 s
Wall time: 15.4 s

Make predictions and viz output

Finally, we are ready to apply the model and make predictions across the whole country. We can also visualize the results on a choropleth map.

phl_tiles_with_features["predicted_wealth_index"] = model.predict(
    phl_tiles_with_features[feature_cols]
)
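Before visualizing, it can be handy to persist the rollout so it can be reloaded without re-running the pipeline. This is a plain GeoPandas export, and the file name is arbitrary.

# Optional: save the rollout for later use.
phl_tiles_with_features[["quadkey", "predicted_wealth_index", "geometry"]].to_file(
    DATA_DIR / "phl_wealth_rollout.geojson", driver="GeoJSON"
)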
def viz_choropleth(gdf, target_feature="predicted_wealth_index"):

    m = folium.Map(location=[14.6091, 121.0223], width=1000, height=800, zoom_start=7)
    subset = gdf.copy()
    subset["id"] = list(range(len(subset)))

    folium.Choropleth(
        geo_data=subset,
        data=subset,
        name="Wealth Predictions",
        columns=["id", target_feature],
        key_on="feature.properties.id",
        fill_color="Spectral",
        legend_name=target_feature,
    ).add_to(m)

    return m

Visualize the output

(Map: Predicted Wealth Index Rollout (Philippines))
# Uncomment the line below to visualize the rollout
# viz_choropleth(phl_tiles_with_features)