from pathlib import Path
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
from shapely import wkt
from sklearn.ensemble import RandomForestRegressor
import geowrangler.area_zonal_stats as azs
import geowrangler.distance_zonal_stats as dzs
import geowrangler.raster_zonal_stats as rzs
import geowrangler.vector_zonal_stats as vzs
from geowrangler import grids
from geowrangler.datasets import geofabrik, ookla
Explore Poverty Mapping Machine Learning Models
Building poverty mapping machine learning (ML) models with Geowrangler
The aim of this notebook is to demonstrate how Geowrangler simplifies the process of turning raw geospatial data into ML-ready or analytics-ready datasets.
Concretely, this shows a sample workflow for a wealth estimation model trained on a dataset constructed using:
- Ground Truth - DHS household clusters + wealth indices
- Features - Derived from Night Time Lights, OSM POIs, and Ookla internet speeds.
The goal here is to showcase Geowrangler’s functions for geospatial data wrangling and feature engineering. Hence, this notebook keeps the ML modelling portions as simple as possible.
Set-up
Install and import some libraries.
Pre-requisite: Manual Data Download
# download data if not yet available
![ ! -e ../data/phl_adm0.geojson ] && curl -s -o ../data/phl_adm0.geojson https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_adm0.geojson
![ ! -e ../data/phl_dhs_cluster_level.csv ] && curl -s -o ../data/phl_dhs_cluster_level.csv https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_dhs_cluster_level.csv
![ ! -e ../data/phl_ntl.tif ] && curl -s -o ../data/phl_ntl.tif https://raw.githubusercontent.com/thinkingmachines/geowrangler/master/data/phl_ntl.tif
DATA_DIR = Path("../data/")
# Auto-creates the folder if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
Your data directory should now contain phl_adm0.geojson, phl_dhs_cluster_level.csv, and phl_ntl.tif.
Data Preparation
DHS Ground Truth
The ground truth used in this demo notebook comes from a 2017 DHS on-the-ground household survey conducted in the Philippines, which collected socio-demographic information about the surveyed households.
The file we provide is already a pre-processed version of the data that is aggregated on a household-cluster level, meaning the information cannot be tied back to any individual household. Specifically, we only provide a list of household clusters with their corresponding (jittered) GPS coordinate and DHS-computed wealth index.
Due to the sensitive nature of the data and the DHS program terms of use, we cannot provide the raw data. You can, however, request access to the raw data yourself on the DHS website. In that case, you can use GeoWrangler’s DHS processing utils to help perform this pre-processing.
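For intuition, that pre-processing boils down to aggregating household-level records up to the cluster level, roughly like the pandas sketch below. The file and column names (dhs_household_raw.csv, cluster_id, wealth_index) are hypothetical stand-ins; actual DHS recode files use coded variable names.

# Hypothetical sketch of the cluster-level aggregation, not the actual
# GeoWrangler DHS utils; the file and column names are illustrative only.
raw_df = pd.read_csv("dhs_household_raw.csv")
cluster_df = raw_df.groupby("cluster_id", as_index=False).agg(
    {"wealth_index": "mean"}  # average household wealth index per cluster
)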
Our first step is to create a GeoDataFrame from the data.
# Load ground truth data as a DataFrame first
GROUND_TRUTH_CSV = DATA_DIR / "phl_dhs_cluster_level.csv"
df = pd.read_csv(GROUND_TRUTH_CSV)

# Some of the coordinates in the data are invalid. This filters them out.
df = df[(df.longitude > 0) & (df.latitude > 0)]

# Create a GeoDataFrame from the longitude, latitude columns.
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="epsg:4326"
)
print(f"There are {len(gdf):,} clusters.")
gdf.head()
There are 1,213 clusters.
 | DHSCLUST | Wealth Index | DHSID | longitude | latitude | geometry
---|---|---|---|---|---|---
0 | 1 | -31881.60870 | PH201700000001 | 122.109807 | 6.674652 | POINT (122.10981 6.67465) |
1 | 2 | -2855.37500 | PH201700000002 | 122.132027 | 6.662256 | POINT (122.13203 6.66226) |
2 | 3 | -57647.04762 | PH201700000003 | 122.179496 | 6.621822 | POINT (122.1795 6.62182) |
3 | 4 | -54952.66667 | PH201700000004 | 122.137965 | 6.485298 | POINT (122.13796 6.4853) |
5 | 6 | -80701.69565 | PH201700000006 | 121.916094 | 6.629457 | POINT (121.91609 6.62946) |
Next, we want to create a buffer around each cluster centroid that represents the area’s neighborhood. This is so we can engineer some features and characterize these neighborhoods using open data.
This is a design decision, but for this demo, we’ll create a circular area with a 2km radius, in line with the random displacement DHS applies to cluster coordinates to preserve household privacy.
# Make sure to convert to the PH CRS first so the buffer is in meters, then convert back to WGS84
gdf = gdf.to_crs("epsg:3123")
gdf.geometry = gdf.geometry.buffer(2000)
gdf = gdf.to_crs("epsg:4326")

gdf.head(5)
 | DHSCLUST | Wealth Index | DHSID | longitude | latitude | geometry
---|---|---|---|---|---|---
0 | 1 | -31881.60870 | PH201700000001 | 122.109807 | 6.674652 | POLYGON ((122.12789 6.67461, 122.1278 6.67284,... |
1 | 2 | -2855.37500 | PH201700000002 | 122.132027 | 6.662256 | POLYGON ((122.15011 6.66221, 122.15002 6.66044... |
2 | 3 | -57647.04762 | PH201700000003 | 122.179496 | 6.621822 | POLYGON ((122.19758 6.62178, 122.19749 6.62001... |
3 | 4 | -54952.66667 | PH201700000004 | 122.137965 | 6.485298 | POLYGON ((122.15604 6.48526, 122.15595 6.48349... |
5 | 6 | -80701.69565 | PH201700000006 | 121.916094 | 6.629457 | POLYGON ((121.93418 6.62942, 121.93409 6.62765... |
We can visualize what we’ve done so far on a map.
# Uncomment the next line to display an interactive map
# gdf.explore()
Download data and load them into memory
Next, we’ll download some data from Ookla (internet speeds) and OSM (points of interest), which we’ll use to generate neighborhood characteristics to be used as ML features.
Ookla
Ookla has released global open data gathered from speedtests made on their platform. This gives us access to internet speed information across various geographies. In our context, this can provide a useful predictive signal for wealth.
First, let’s download a local copy of data on fixed internet in the 1st quarter of 2019 (earliest data available).
We can use GeoWrangler’s Ookla data utility to automatically download and cache the desired data on your machine given the type, year, and quarter.
This is just a simplification for the demo. In practice, you might want to aggregate data across multiple time periods or incorporate wireless data.
ookla_fixed_2019_q1_filepath = ookla.download_ookla_file(
    type_="fixed", year="2019", quarter="1"
)

# This is where the downloaded file is located.
# By default this downloads to your data/ folder, but you can customize this.
ookla_fixed_2019_q1_filepath
2024-08-12 22:32:01.747 | INFO | geowrangler.datasets.utils:urlretrieve:25 - Retrieving https://ookla-open-data.s3.us-west-2.amazonaws.com/parquet/performance/type=fixed/year=2019/quarter=1/2019-01-01_performance_fixed_tiles.parquet into data/2019-01-01_performance_fixed_tiles.parquet
Path('data/2019-01-01_performance_fixed_tiles.parquet')
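If you do want to aggregate across multiple time periods, the same download utility can be called once per quarter and the raw tables stacked. A minimal sketch, not part of this demo:

# Sketch: download two quarters and stack the raw tiles before any processing.
quarters = [("2019", "1"), ("2019", "2")]
multi_q_paths = [
    ookla.download_ookla_file(type_="fixed", year=year, quarter=quarter)
    for year, quarter in quarters
]
multi_q_df = pd.concat(
    [pd.read_parquet(path) for path in multi_q_paths], ignore_index=True
)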
# This is a function to load and do some light pre-processing on the Ookla data.
def load_ookla_data(filename, mask=None):
    # Ookla's parquet file doesn't seem to have geo metadata, so we need to read it through pandas first
    ookla_df = pd.read_parquet(filename)
    ookla_gdf = gpd.GeoDataFrame(
        ookla_df,
        geometry=ookla_df["tile"].apply(lambda x: wkt.loads(x)),
        crs="epsg:4326",
    )
    ookla_gdf = ookla_gdf.drop(columns=["tile"])

    # Ookla's data files contain data for the whole world.
    # For our case, we're retaining only tiles that intersect with the given mask.
    # This is to speed up any processing we do later on.
    if mask is not None:
        keep_cols = ookla_gdf.columns
        ookla_gdf = ookla_gdf.sjoin(mask, how="inner", predicate="intersects")
        ookla_gdf = ookla_gdf[keep_cols]

    # Convert kbps to mbps for easier reading
    ookla_gdf["avg_d_mbps"] = ookla_gdf["avg_d_kbps"] / 1000
    ookla_gdf["avg_u_mbps"] = ookla_gdf["avg_u_kbps"] / 1000

    return ookla_gdf
The Ookla data is quite large and can take a few minutes to load.
ookla_gdf = load_ookla_data(ookla_fixed_2019_q1_filepath, mask=gdf)
CPU times: user 22.3 s, sys: 3.6 s, total: 25.9 s
Wall time: 28.3 s
print(
    f"{len(ookla_gdf):,} Ookla data tiles retained that intersect with our DHS cluster neighborhoods."
)
ookla_gdf.head()
17,245 Ookla data tiles retained that intersect with our DHS cluster neighborhoods.
 | quadkey | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices | geometry | avg_d_mbps | avg_u_mbps
---|---|---|---|---|---|---|---|---|---
4368240 | 1323011210311031 | 10 | 21 | 126 | 1 | 1 | POLYGON ((121.96472 20.4579, 121.97021 20.4579... | 0.010 | 0.021 |
4368240 | 1323011210311031 | 10 | 21 | 126 | 1 | 1 | POLYGON ((121.96472 20.4579, 121.97021 20.4579... | 0.010 | 0.021 |
4368240 | 1323011210311031 | 10 | 21 | 126 | 1 | 1 | POLYGON ((121.96472 20.4579, 121.97021 20.4579... | 0.010 | 0.021 |
4368240 | 1323011210311031 | 10 | 21 | 126 | 1 | 1 | POLYGON ((121.96472 20.4579, 121.97021 20.4579... | 0.010 | 0.021 |
4368241 | 1323011210311033 | 5931 | 10633 | 228 | 3 | 2 | POLYGON ((121.96472 20.45275, 121.97021 20.452... | 5.931 | 10.633 |
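As an aside, if load time or memory becomes a problem, pandas can read just a subset of the parquet columns. A small sketch using the columns visible above:

# Optional sketch: read only the columns we actually use,
# which can cut load time and memory.
slim_df = pd.read_parquet(
    ookla_fixed_2019_q1_filepath,
    columns=["tile", "avg_d_kbps", "avg_u_kbps", "avg_lat_ms", "tests", "devices"],
)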
OSM (OpenStreetMap)
One source of open data on points of interest is OSM. We’ll use the data extracts provided by Geofabrik rather than accessing the OSM Overpass API directly, which keeps us from being bottlenecked by API calls.
Similar to Ookla, we can also use GeoWrangler’s OSM data download utils to download and cache OSM data on your machine. All you have to do is specify the country you want.
Geofabrik’s OSM data is a zip file containing different files for POIs, waterways, roads, etc. For this demo notebook, we’ll only use the POIs.
# This download can take around 1-2 minutes.
ph_osm_zip_path = geofabrik.download_geofabrik_region("philippines")

# This line unzips the zip file if we haven't done so yet.
!echo 'None' | unzip $ph_osm_zip_path -d data/ph_osm/
2024-08-12 22:32:49.191 | INFO | geowrangler.datasets.utils:urlretrieve:25 - Retrieving https://download.geofabrik.de/asia/philippines-latest-free.shp.zip into data/philippines-latest-free.shp.zip
Archive: data/philippines-latest-free.shp.zip
inflating: data/ph_osm/README
extracting: data/ph_osm/gis_osm_buildings_a_free_1.cpg
inflating: data/ph_osm/gis_osm_buildings_a_free_1.dbf
inflating: data/ph_osm/gis_osm_buildings_a_free_1.prj
inflating: data/ph_osm/gis_osm_buildings_a_free_1.shp
inflating: data/ph_osm/gis_osm_buildings_a_free_1.shx
extracting: data/ph_osm/gis_osm_landuse_a_free_1.cpg
inflating: data/ph_osm/gis_osm_landuse_a_free_1.dbf
inflating: data/ph_osm/gis_osm_landuse_a_free_1.prj
inflating: data/ph_osm/gis_osm_landuse_a_free_1.shp
inflating: data/ph_osm/gis_osm_landuse_a_free_1.shx
extracting: data/ph_osm/gis_osm_natural_a_free_1.cpg
inflating: data/ph_osm/gis_osm_natural_a_free_1.dbf
inflating: data/ph_osm/gis_osm_natural_a_free_1.prj
inflating: data/ph_osm/gis_osm_natural_a_free_1.shp
inflating: data/ph_osm/gis_osm_natural_a_free_1.shx
extracting: data/ph_osm/gis_osm_natural_free_1.cpg
inflating: data/ph_osm/gis_osm_natural_free_1.dbf
inflating: data/ph_osm/gis_osm_natural_free_1.prj
inflating: data/ph_osm/gis_osm_natural_free_1.shp
inflating: data/ph_osm/gis_osm_natural_free_1.shx
extracting: data/ph_osm/gis_osm_places_a_free_1.cpg
inflating: data/ph_osm/gis_osm_places_a_free_1.dbf
inflating: data/ph_osm/gis_osm_places_a_free_1.prj
inflating: data/ph_osm/gis_osm_places_a_free_1.shp
inflating: data/ph_osm/gis_osm_places_a_free_1.shx
extracting: data/ph_osm/gis_osm_places_free_1.cpg
inflating: data/ph_osm/gis_osm_places_free_1.dbf
inflating: data/ph_osm/gis_osm_places_free_1.prj
inflating: data/ph_osm/gis_osm_places_free_1.shp
inflating: data/ph_osm/gis_osm_places_free_1.shx
extracting: data/ph_osm/gis_osm_pofw_a_free_1.cpg
inflating: data/ph_osm/gis_osm_pofw_a_free_1.dbf
inflating: data/ph_osm/gis_osm_pofw_a_free_1.prj
inflating: data/ph_osm/gis_osm_pofw_a_free_1.shp
inflating: data/ph_osm/gis_osm_pofw_a_free_1.shx
extracting: data/ph_osm/gis_osm_pofw_free_1.cpg
inflating: data/ph_osm/gis_osm_pofw_free_1.dbf
inflating: data/ph_osm/gis_osm_pofw_free_1.prj
inflating: data/ph_osm/gis_osm_pofw_free_1.shp
inflating: data/ph_osm/gis_osm_pofw_free_1.shx
extracting: data/ph_osm/gis_osm_pois_a_free_1.cpg
inflating: data/ph_osm/gis_osm_pois_a_free_1.dbf
inflating: data/ph_osm/gis_osm_pois_a_free_1.prj
inflating: data/ph_osm/gis_osm_pois_a_free_1.shp
inflating: data/ph_osm/gis_osm_pois_a_free_1.shx
extracting: data/ph_osm/gis_osm_pois_free_1.cpg
inflating: data/ph_osm/gis_osm_pois_free_1.dbf
inflating: data/ph_osm/gis_osm_pois_free_1.prj
inflating: data/ph_osm/gis_osm_pois_free_1.shp
inflating: data/ph_osm/gis_osm_pois_free_1.shx
extracting: data/ph_osm/gis_osm_railways_free_1.cpg
inflating: data/ph_osm/gis_osm_railways_free_1.dbf
inflating: data/ph_osm/gis_osm_railways_free_1.prj
inflating: data/ph_osm/gis_osm_railways_free_1.shp
inflating: data/ph_osm/gis_osm_railways_free_1.shx
extracting: data/ph_osm/gis_osm_roads_free_1.cpg
inflating: data/ph_osm/gis_osm_roads_free_1.dbf
inflating: data/ph_osm/gis_osm_roads_free_1.prj
inflating: data/ph_osm/gis_osm_roads_free_1.shp
inflating: data/ph_osm/gis_osm_roads_free_1.shx
extracting: data/ph_osm/gis_osm_traffic_a_free_1.cpg
inflating: data/ph_osm/gis_osm_traffic_a_free_1.dbf
inflating: data/ph_osm/gis_osm_traffic_a_free_1.prj
inflating: data/ph_osm/gis_osm_traffic_a_free_1.shp
inflating: data/ph_osm/gis_osm_traffic_a_free_1.shx
extracting: data/ph_osm/gis_osm_traffic_free_1.cpg
inflating: data/ph_osm/gis_osm_traffic_free_1.dbf
inflating: data/ph_osm/gis_osm_traffic_free_1.prj
inflating: data/ph_osm/gis_osm_traffic_free_1.shp
inflating: data/ph_osm/gis_osm_traffic_free_1.shx
extracting: data/ph_osm/gis_osm_transport_a_free_1.cpg
inflating: data/ph_osm/gis_osm_transport_a_free_1.dbf
inflating: data/ph_osm/gis_osm_transport_a_free_1.prj
inflating: data/ph_osm/gis_osm_transport_a_free_1.shp
inflating: data/ph_osm/gis_osm_transport_a_free_1.shx
extracting: data/ph_osm/gis_osm_transport_free_1.cpg
inflating: data/ph_osm/gis_osm_transport_free_1.dbf
inflating: data/ph_osm/gis_osm_transport_free_1.prj
inflating: data/ph_osm/gis_osm_transport_free_1.shp
inflating: data/ph_osm/gis_osm_transport_free_1.shx
extracting: data/ph_osm/gis_osm_water_a_free_1.cpg
inflating: data/ph_osm/gis_osm_water_a_free_1.dbf
inflating: data/ph_osm/gis_osm_water_a_free_1.prj
inflating: data/ph_osm/gis_osm_water_a_free_1.shp
inflating: data/ph_osm/gis_osm_water_a_free_1.shx
extracting: data/ph_osm/gis_osm_waterways_free_1.cpg
inflating: data/ph_osm/gis_osm_waterways_free_1.dbf
inflating: data/ph_osm/gis_osm_waterways_free_1.prj
inflating: data/ph_osm/gis_osm_waterways_free_1.shp
inflating: data/ph_osm/gis_osm_waterways_free_1.shx
# Uncomment the following line to see all the available OSM files aside from POIs.
# !ls data/ph_osm

ph_osm_pois_filepath = "data/ph_osm/gis_osm_pois_free_1.shp"
ph_osm = gpd.read_file(ph_osm_pois_filepath)
print(f"There are {len(ph_osm)} OSM POIs.")
ph_osm.head()
There are 150311 OSM POIs.
 | osm_id | code | fclass | name | geometry
---|---|---|---|---|---
0 | 21717820 | 2907 | camera_surveillance | None | POINT (121.0212 14.57608) |
1 | 21717872 | 2722 | museum | Ayala Museum | POINT (121.02324 14.55358) |
2 | 24078959 | 2907 | camera_surveillance | None | POINT (121.05945 14.60098) |
3 | 24797511 | 2542 | bicycle_shop | Christine Sports Cycle Marketing | POINT (120.99506 14.55224) |
4 | 24797535 | 2518 | beverages | G and K wine | POINT (120.99267 14.55285) |
Feature Engineering
Now that we’ve prepared the DHS data and loaded Ookla and OSM data into memory, we’re ready to generate some neighborhood features for our ML model.
For convenience, we’re wrapping all the feature engineering code in functions so we can re-use them later when we apply the model to the whole country.
Note: if you are modifying the feature engineering portions, it is recommended to run the whole section end-to-end. This ensures you’re starting fresh from a new copy of the DHS clusters GeoDataFrame. This also ensures the function definitions are up-to-date, and can be re-used properly later for model rollout.
# We'll make a copy of the GDF to avoid overwriting the original DHS GeoDataFrame.
gdf_with_features = gdf.copy()
OSM
Our goal with OSM data is to generate neighborhood characteristics based on counts and distance to certain POIs, such as schools, hospitals, etc.
To do this, we utilize GeoWrangler’s vector zonal stats and distance zonal stats features.
# Uncomment the following line if you want to see the different POI classes available
# ph_osm.fclass.sort_values().unique()
def generate_osm_features(aoi, osm, metric_crs="epsg:3123"):
    aoi = aoi.copy()

    # GeoWrangler: Count number of all POIs per tile
    aoi = vzs.create_zonal_stats(
        aoi,
        osm,
        overlap_method="intersects",
        aggregations=[{"func": "count", "output": "poi_count", "fillna": True}],
    )

    # Count specific POI types
    poi_types = ["restaurant", "school", "bank", "supermarket", "mall", "atm"]

    for poi_type in poi_types:
        # GeoWrangler: Count with vector zonal stats
        aoi = vzs.create_zonal_stats(
            aoi,
            osm[osm["fclass"] == poi_type],
            overlap_method="intersects",
            aggregations=[
                {"func": "count", "output": f"{poi_type}_count", "fillna": True}
            ],
        )

        # GeoWrangler: Distance with distance zonal stats
        col_name = f"{poi_type}_nearest"
        aoi = dzs.create_distance_zonal_stats(
            aoi.to_crs(metric_crs),
            osm[osm["fclass"] == poi_type].to_crs(metric_crs),
            max_distance=10_000,
            aggregations=[],
            distance_col=col_name,
        ).to_crs("epsg:4326")

        # If no POI was found within the distance limit, set the distance to a really high value
        aoi[col_name] = aoi[col_name].fillna(value=999999)

    return aoi


gdf_with_features = generate_osm_features(gdf_with_features, ph_osm)
CPU times: user 746 ms, sys: 29.4 ms, total: 775 ms
Wall time: 793 ms
gdf_with_features.head()
 | DHSCLUST | Wealth Index | DHSID | longitude | latitude | geometry | poi_count | restaurant_count | restaurant_nearest | school_count | school_nearest | bank_count | bank_nearest | supermarket_count | supermarket_nearest | mall_count | mall_nearest | atm_count | atm_nearest
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | -31881.60870 | PH201700000001 | 122.109807 | 6.674652 | POLYGON ((122.12789 6.67461, 122.1278 6.67284,... | 0.0 | 0.0 | 3152.996430 | 0.0 | 1392.503631 | 0.0 | 1764.736950 | 0.0 | 999999.0 | 0.0 | 999999.0 | 0.0 | 999999.000000 |
1 | 2 | -2855.37500 | PH201700000002 | 122.132027 | 6.662256 | POLYGON ((122.15011 6.66221, 122.15002 6.66044... | 6.0 | 0.0 | 5788.165918 | 0.0 | 3601.132502 | 1.0 | 0.000000 | 0.0 | 999999.0 | 0.0 | 999999.0 | 0.0 | 999999.000000 |
2 | 3 | -57647.04762 | PH201700000003 | 122.179496 | 6.621822 | POLYGON ((122.19758 6.62178, 122.19749 6.62001... | 0.0 | 0.0 | 999999.000000 | 0.0 | 642.871037 | 0.0 | 4026.300327 | 0.0 | 999999.0 | 0.0 | 999999.0 | 0.0 | 999999.000000 |
3 | 4 | -54952.66667 | PH201700000004 | 122.137965 | 6.485298 | POLYGON ((122.15604 6.48526, 122.15595 6.48349... | 0.0 | 0.0 | 999999.000000 | 0.0 | 2287.839903 | 0.0 | 999999.000000 | 0.0 | 999999.0 | 0.0 | 999999.0 | 0.0 | 999999.000000 |
5 | 6 | -80701.69565 | PH201700000006 | 121.916094 | 6.629457 | POLYGON ((121.93418 6.62942, 121.93409 6.62765... | 0.0 | 0.0 | 8408.925995 | 0.0 | 3930.606257 | 0.0 | 8332.324084 | 0.0 | 999999.0 | 0.0 | 999999.0 | 0.0 | 8311.332736 |
Ookla
Our goal with Ookla data is to generate neighborhood characteristics based on internet speeds (download, upload, latency).
To do this, we utilize GeoWrangler’s area zonal stats feature.
def generate_ookla_features(aoi, ookla_gdf, metric_crs="epsg:3123"):
    aoi = aoi.copy()

    orig_aoi_crs = aoi.crs
    aoi = aoi.to_crs(metric_crs)
    ookla_gdf = ookla_gdf.to_crs(metric_crs)

    # For better formatting, rename some columns
    ookla_gdf = ookla_gdf.rename(
        columns={"avg_d_mbps": "d_mbps", "avg_u_mbps": "u_mbps", "avg_lat_ms": "lat_ms"}
    )

    # GeoWrangler: Compute stats on the various columns
    aoi = azs.create_area_zonal_stats(
        aoi,
        ookla_gdf,
        aggregations=[
            # Count number of devices in the area
            dict(column="devices", func=["raw_sum"], fillna=True),
            # Get stats on the download speeds
            dict(
                column="d_mbps",
                func=["imputed_mean", "max", "min", "std"],
                fillna=[True, True, True, True],
            ),
            # Get stats on the upload speeds
            dict(
                column="u_mbps",
                func=["imputed_mean", "max", "min", "std"],
                fillna=[True, True, True, True],
            ),
        ],
        # Don't include the intersected land area as a column
        include_intersect=False,
        # Don't set minimum values to 0 if the neighborhood's area doesn't fully intersect with Ookla tiles.
        fix_min=False,
    )

    aoi = aoi.fillna(value=0)
    aoi = aoi.to_crs(orig_aoi_crs)

    return aoi


gdf_with_features = generate_ookla_features(gdf_with_features, ookla_gdf)
CPU times: user 1.01 s, sys: 27.6 ms, total: 1.04 s
Wall time: 1.05 s
gdf_with_features.head()
 | DHSCLUST | Wealth Index | DHSID | longitude | latitude | geometry | poi_count | restaurant_count | restaurant_nearest | school_count | ... | atm_nearest | devices_sum | d_mbps_mean | d_mbps_max | d_mbps_min | d_mbps_std | u_mbps_mean | u_mbps_max | u_mbps_min | u_mbps_std
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | -31881.60870 | PH201700000001 | 122.109807 | 6.674652 | POLYGON ((122.12789 6.67461, 122.1278 6.67284,... | 0.0 | 0.0 | 3152.996430 | 0.0 | ... | 999999.000000 | 0.0 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
1 | 2 | -2855.37500 | PH201700000002 | 122.132027 | 6.662256 | POLYGON ((122.15011 6.66221, 122.15002 6.66044... | 6.0 | 0.0 | 5788.165918 | 0.0 | ... | 999999.000000 | 38.0 | 0.401851 | 14.628 | 2.177 | 4.982053 | 0.255031 | 6.867 | 1.866 | 1.753667 |
2 | 3 | -57647.04762 | PH201700000003 | 122.179496 | 6.621822 | POLYGON ((122.19758 6.62178, 122.19749 6.62001... | 0.0 | 0.0 | 999999.000000 | 0.0 | ... | 999999.000000 | 0.0 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
3 | 4 | -54952.66667 | PH201700000004 | 122.137965 | 6.485298 | POLYGON ((122.15604 6.48526, 122.15595 6.48349... | 0.0 | 0.0 | 999999.000000 | 0.0 | ... | 999999.000000 | 0.0 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
5 | 6 | -80701.69565 | PH201700000006 | 121.916094 | 6.629457 | POLYGON ((121.93418 6.62942, 121.93409 6.62765... | 0.0 | 0.0 | 8408.925995 | 0.0 | ... | 8311.332736 | 0.0 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
5 rows × 28 columns
Night Time Lights
Lastly, we’ll generate features based on night time lights. GeoWrangler provides a raster zonal stats function that acts as a thin wrapper over the rasterstats package’s zonal_stats function. It adds some formatting conveniences and automatically joins the computed features back to your input GeoDataFrame.
def generate_ntl_features(aoi, raster_path):
    aoi = aoi.copy()

    aoi = rzs.create_raster_zonal_stats(
        aoi,
        raster_path,
        aggregation=dict(
            func=["mean", "min", "max", "std"],
            column="ntl",
        ),
        extra_args=dict(nodata=0),
    )

    aoi = aoi.fillna(0)

    return aoi


gdf_with_features = generate_ntl_features(gdf_with_features, DATA_DIR / "phl_ntl.tif")
CPU times: user 1.24 s, sys: 198 ms, total: 1.44 s
Wall time: 1.57 s
Visualize the final dataset
Finally, let’s visualize our cluster neighborhoods now with the generated features. Hover over each cluster to see its data.
# Visualize the clusters with the generated features
# gdf_with_features.explore()
ML Training
In this section, we will now train an ML model to predict the wealth index for an area given its features.
Prepare data into X, y and generate train/test partitions
# For convenience, store the list of features in a variable. Remove all extraneous columns.
feature_cols = gdf_with_features.drop(
    columns=["DHSCLUST", "DHSID", "Wealth Index", "longitude", "latitude", "geometry"]
).columns

# Separate the data (X) and the target (y)
X = gdf_with_features[feature_cols]
y = gdf_with_features["Wealth Index"]

# Bin y into 5 buckets for stratification when splitting the dataset
cat_y = pd.cut(y, bins=5, labels=list(range(1, 6)))

# Generate train and test partitions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=cat_y
)
X_train.head()
 | poi_count | restaurant_count | restaurant_nearest | school_count | school_nearest | bank_count | bank_nearest | supermarket_count | supermarket_nearest | mall_count | ... | d_mbps_min | d_mbps_std | u_mbps_mean | u_mbps_max | u_mbps_min | u_mbps_std | ntl_mean | ntl_min | ntl_max | ntl_std
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
944 | 1.0 | 0.0 | 4332.445808 | 1.0 | 0.000000 | 0.0 | 999999.000000 | 0.0 | 5595.357740 | 0.0 | ... | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.279636 | 0.279636 | 0.279636 | 0.000000 |
328 | 1203.0 | 143.0 | 0.000000 | 23.0 | 0.000000 | 73.0 | 0.000000 | 14.0 | 0.000000 | 1.0 | ... | 8.948 | 6.519399 | 0.033496 | 30.564 | 2.496 | 5.933372 | 9.881545 | 4.587300 | 17.190298 | 2.768707 |
696 | 0.0 | 0.0 | 4396.530482 | 0.0 | 7331.181601 | 0.0 | 999999.000000 | 0.0 | 999999.000000 | 0.0 | ... | 0.000 | 0.000000 | 0.000000 | 0.000 | 0.000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
155 | 2.0 | 0.0 | 352.510311 | 0.0 | 669.141767 | 0.0 | 2606.478381 | 0.0 | 2566.227871 | 0.0 | ... | 1.215 | 31.305303 | 0.863634 | 151.303 | 0.106 | 35.998549 | 0.299591 | 0.182455 | 0.519110 | 0.125637 |
614 | 286.0 | 12.0 | 0.000000 | 4.0 | 0.000000 | 14.0 | 0.000000 | 8.0 | 0.000000 | 1.0 | ... | 2.986 | 6.722738 | 0.174738 | 32.774 | 0.527 | 7.027801 | 0.335981 | 0.216280 | 0.543996 | 0.084581 |
5 rows × 26 columns
y_train.head()
944 -27114.20833
328 175912.44440
696 -29987.93750
155 75826.10526
614 54849.37500
Name: Wealth Index, dtype: float64
Evaluation Function
We define an evaluation function that computes R^2 and makes a plot of the actual vs predicted values.
def evaluate(model, X_train, X_test, y_train, y_test):
    # R^2
    train_predictions = model.predict(X_train)
    print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
    test_predictions = model.predict(X_test)
    print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))

    # Plot train and test predictions vs actual values
    plt.scatter(train_predictions, y_train, label="Train samples", c="#d95f02")
    plt.scatter(test_predictions, y_test, label="Test samples", c="#7570b3")

    # Identity line
    xpoints = ypoints = plt.xlim()
    plt.plot(
        xpoints, ypoints, linestyle="--", color="k", lw=3, scalex=False, scaley=False
    )

    # Labels and legends
    plt.xlabel("Predicted value")
    plt.ylabel("True value")
    plt.legend()
    plt.tight_layout()
    plt.show()
Fit and Evaluate Model
For ML modelling, let’s fit a simple random forest regressor.
As mentioned, we are keeping the ML modelling as simple as possible in this notebook. This is definitely not the best-performing wealth estimation model out there, but should be decent enough for demo purposes.
model = RandomForestRegressor()
model.fit(X_train, y_train)

evaluate(model, X_train, X_test, y_train, y_test)
Train R2 score: 0.9244278211179482
Test R2 score: 0.5882081532536176
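A single train/test split can give a noisy estimate. As an optional sanity check (not part of the core workflow), we can cross-validate the same model; a minimal sketch using scikit-learn:

# Optional sketch: 5-fold cross-validated R^2 for a fresh regressor.
cv_scores = sklearn.model_selection.cross_val_score(
    RandomForestRegressor(random_state=2022), X, y, cv=5, scoring="r2"
)
print(f"Cross-validated R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")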
ML Rollout (Phl)
Now that we have a wealth estimation model, we can apply it to the whole country.
Generate country-wide tiles
First, we need to split the country up into smaller areas.
That usually means generating grid tiles or hex tiles for your area of interest. In this notebook, we’ll generate Bing tiles at zoom level 13, whose tile area is the closest match to the 2km neighborhoods we used in training.
GeoWrangler provides various grid generation utilities for these different scenarios.
# Read in the admin boundary for PHL.
# This file is a simplified version of the admin boundary to make the file size smaller and load times faster.
phl_adm = gpd.read_file(f"{DATA_DIR}/phl_adm0.geojson")
# Uncomment this line if you want to visualize it
# phl_adm.explore()
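As noted above, geowrangler.grids also provides other generators (e.g., square grids and H3 hexagons) that share the same generate_grid interface. A hedged sketch with illustrative parameters, which may differ by version:

# Illustrative alternatives, commented out (these can take a while on a country-wide boundary):
# square_tiles = grids.SquareGridGenerator(2500).generate_grid(phl_adm)  # ~2.5km square cells
# hex_tiles = grids.H3GridGenerator(resolution=7).generate_grid(phl_adm)  # H3 hexagons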
The generation can take around 1-2 minutes.
phl_tiles = grids.BingTileGridGenerator(13).generate_grid(phl_adm)
print(f"There are {len(phl_tiles):,} tiles.")
phl_tiles.head()
There are 16,257 tiles.
CPU times: user 7.34 s, sys: 57.5 ms, total: 7.39 s
Wall time: 7.5 s
 | quadkey | geometry
---|---|---
0 | 1323230031202 | POLYGON ((119.17969 4.65308, 119.17969 4.69688... |
1 | 1323230030300 | POLYGON ((119.00391 4.69688, 119.00391 4.74068... |
2 | 1323230030311 | POLYGON ((119.13574 4.69688, 119.13574 4.74068... |
3 | 1323230031201 | POLYGON ((119.22363 4.69688, 119.22363 4.74068... |
4 | 1323212220302 | POLYGON ((118.30078 6.05316, 118.30078 6.09686... |
Feature Engineering
This part is simpler, as we’re just re-using the feature engineering functions we created earlier. Note though that we’re generating features for a much bigger area, so the runtime is a little longer (around 2 minutes in total).
phl_tiles_with_features = phl_tiles.copy()
phl_tiles_with_features = generate_osm_features(phl_tiles_with_features, ph_osm)
CPU times: user 1.17 s, sys: 39.2 ms, total: 1.21 s
Wall time: 1.24 s
phl_tiles_with_features = generate_ookla_features(phl_tiles_with_features, ookla_gdf)
CPU times: user 465 ms, sys: 16.3 ms, total: 481 ms
Wall time: 485 ms
phl_tiles_with_features = generate_ntl_features(
    phl_tiles_with_features, f"{DATA_DIR}/phl_ntl.tif"
)
CPU times: user 13.4 s, sys: 1.8 s, total: 15.2 s
Wall time: 15.4 s
Make predictions and visualize output
Finally, we are ready to apply the model and make predictions throughout the whole country. We can also visualize the results on a choropleth map.
"predicted_wealth_index"] = model.predict(
phl_tiles_with_features[
phl_tiles_with_features[feature_cols] )
def viz_chloropleth(gdf, target_feature="predicted_wealth_index"):
    m = folium.Map(location=[14.6091, 121.0223], width=1000, height=800, zoom_start=7)
    subset = gdf.copy()
    subset["id"] = list(range(len(subset)))

    folium.Choropleth(
        geo_data=subset,
        data=subset,
        name="Wealth Predictions",
        columns=["id", target_feature],
        key_on="feature.properties.id",
        fill_color="Spectral",
        legend_name=target_feature,
    ).add_to(m)

    return m
Visualize the output
# Uncomment the line below to visualize rollout
# viz_chloropleth(phl_tiles_with_features)
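If you want to keep the rollout for downstream use, the tiles with predictions can be persisted like any other GeoDataFrame; an optional sketch (the output filename is illustrative):

# Optional: save the tiles with predictions, e.g. as GeoParquet.
# phl_tiles_with_features.to_parquet(DATA_DIR / "phl_rollout_predictions.parquet")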