A simple use case demo for using geowrangler modules to find the Philippine provinces/towns with the slowest/fastest internet speeds.
Summary
This Use Case Demo shows how to use the geowrangler.datasets.ookla and the geowrangler.area_zonal_stats modules to find the slowest/fastest internet speeds within an area or region of interest.
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import geowrangler.area_zonal_stats as azs
Download Admin Areas
Next, we get the administrative boundary geodataset using data from the Humanitarian Data Exchange.
REGION_FILTER = "National Capital Region"  # limit to 1 region to speed up processing
Download the geodataset containing the admin areas of your country.
Here we are using the data for the Philippines.
phl_admin3_file = "phl_adminboundaries_candidate_adm3"
phl_admin3_zip = f"{phl_admin3_file}.zip"
# shapefiles
phl_admin3_link = f"https://data.humdata.org/dataset/caf116df-f984-4deb-85ca-41b349d3f313/resource/12457689-6a86-4474-8032-5ca9464d38a8/download/{phl_admin3_zip}"
Download the zipped file. Depending on your internet connection, it can take several minutes.
%%time
# no_test
![ ! -e ../data/{phl_admin3_zip} ] && curl -L -o ../data/{phl_admin3_zip} {phl_admin3_link}
!mkdir -p ../data/{phl_admin3_file}
main_file = "phl_admbnda_adm3_psa_namria_20200529"
phl_admin3_shp = f"../data/{phl_admin3_file}/{main_file}.shp"
%%time
![ ! -e {phl_admin3_shp} ] && unzip -d ../data/{phl_admin3_file} ../data/{phl_admin3_zip}
Load the admin area geodataset. In our example, we load the .shp (shapefile) as a GeoPandas GeoDataFrame.
%%time
print(f"loading {phl_admin3_shp}")
admin3 = gpd.read_file(phl_admin3_shp)
admin3.head()
list(admin3.columns.values)
admin3.ADM1_EN.unique()
Limit the admin regions to only one in order to make the process run faster. The REGION_FILTER is set in the Set Region Filter section.
if REGION_FILTER:
admin3 = admin3[admin3.ADM1_EN == REGION_FILTER]
ax = plt.axes()
ax = admin3.plot(ax=ax, facecolor="none", edgecolor="blue")
%%time
ax = plt.axes()
ax = admin3[admin3.ADM3_EN == "Pateros"].plot(ax=ax, facecolor="none", edgecolor="blue")
admin3.crs
admin3.total_bounds
len(admin3)
from geowrangler.datasets import ookla
List the publicly available Ookla datasets.
%%time
ookla_dsets = ookla.list_ookla_files()
ookla_dsets
Download the latest available data (as of this writing, Q2 2022).
ookla_params = dict(year="2022", quarter="2", directory="../data")
%%time
ookla_fixed = ookla.download_ookla_file(type_="fixed", **ookla_params)
ookla_fixed
import pandas as pd
The downloaded Ookla data contains internet speed data for the whole world and can take a minute or two to load. Later, we will show how to filter it down to only the data relevant to our area of interest (AOI).
%%time
fixed = pd.read_parquet(ookla_fixed)
len(fixed)
fixed.head()
fixed.dtypes
The data is now a pandas DataFrame, but it needs to be converted into a GeoDataFrame by parsing the tile column (WKT strings) into a geometry column with a CRS (Coordinate Reference System). See EPSG:4326 for more details about the CRS.
Converting the data can also take a minute or two.
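Before running the full conversion, here is a minimal sketch of what `GeoSeries.from_wkt` does for a single row. The tile polygon below is made up for illustration, not actual Ookla data:

```python
import geopandas as gpd

# A hypothetical Ookla-style tile: a small polygon in WKT (well-known text)
tiles = ["POLYGON((120.9 14.5, 120.91 14.5, 120.91 14.51, 120.9 14.51, 120.9 14.5))"]

# from_wkt parses each WKT string into a shapely geometry, and crs tags
# the coordinates as longitude/latitude (EPSG:4326)
geoms = gpd.GeoSeries.from_wkt(tiles, crs="EPSG:4326")
print(geoms.iloc[0].geom_type)  # Polygon
```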
%%time
fixed["geometry"] = gpd.GeoSeries.from_wkt(fixed.tile, crs="EPSG:4326")
fixed.head()
%%time
fixed.drop(columns=["tile"], inplace=True)
%%time
fixed = gpd.GeoDataFrame(fixed, geometry="geometry", crs="EPSG:4326")
from geowrangler.validation import GeometryValidation
admin3_gvm = GeometryValidation(admin3)
%%time
valid_admin3 = admin3_gvm.validate_all()
valid_admin3.head()
valid_cols = [
"is_oriented_properly",
"is_not_null",
"is_not_self_intersecting",
"is_within_crs_bounds",
"area_is_not_zero",
]
[valid_admin3[col].value_counts() for col in valid_cols]
So some of the admin areas' geometries were improperly oriented (their exterior rings were wound clockwise, the usual shapefile convention, rather than counterclockwise); the validator has fixed this, and they pass all the other default validations.
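To illustrate what "improperly oriented" means, here is a minimal sketch using shapely's `orient` helper on a hypothetical square. Assuming the validator follows the standard counterclockwise-exterior convention, its fix is equivalent in effect (the exact internals are not shown here):

```python
from shapely.geometry import Polygon
from shapely.geometry.polygon import orient

# A square whose exterior ring is wound clockwise (negative signed area)
cw = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
print(cw.exterior.is_ccw)  # False

# orient(..., sign=1.0) rewinds the exterior ring counterclockwise
ccw_poly = orient(cw, sign=1.0)
print(ccw_poly.exterior.is_ccw)  # True
```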
valid_admin3.drop(
columns=valid_cols,
inplace=True,
)
Validate Data Geometries
Before validating, let's filter the data to only those intersecting our AOI because we don't need to check all the data from around the world when we're only interested in our AOI.
Before filtering, we create the spatial indexes for both the AOI and data geometries, which speeds up the spatial join. Generating the spatial index of the data geometries can take a minute or two due to the size of the data and their geometries.
%%time
valid_admin3.geometry.sindex
%%time
fixed.geometry.sindex
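As a minimal illustration of what the spatial index provides (with hypothetical points, not the actual data): `sindex.query` returns the positions of geometries whose bounding boxes hit a search geometry, which is how the spatial join quickly narrows down candidates:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# A tiny GeoSeries standing in for the real data geometries
pts = gpd.GeoSeries([Point(0, 0), Point(5, 5)], crs="EPSG:4326")

# First access builds the index; later spatial operations reuse it
idx = pts.sindex

# Which geometries' bounding boxes intersect the search box?
hits = idx.query(box(-1, -1, 1, 1))
print(list(hits))  # positions of candidate geometries
```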
Let's now filter the data (into filtered_fixed). Filtering the data using a spatial join can also take a minute or two.
%%time
filtered_fixed = fixed.sjoin(
valid_admin3[["geometry"]], how="inner", predicate="intersects"
)
filtered_fixed.drop(columns=["index_right"], inplace=True)
fixed_gvm = GeometryValidation(filtered_fixed)
%%time
valid_fixed = fixed_gvm.validate_all()
%%time
[valid_fixed[col].value_counts() for col in valid_cols]
Again, some of the data geometries were improperly oriented and have been fixed by the validator.
valid_fixed.drop(columns=valid_cols, inplace=True)
filtered_fixed = valid_fixed
Let's now generate the zonal stats: the statistics of the data we are interested in, computed per admin area.
filtered_fixed.head()
list(filtered_fixed.columns.values)
len(filtered_fixed)
Convert to a planar CRS, since we are computing areas.
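To see why the reprojection matters, here is a small sketch with a hypothetical roughly 1 km square near Manila. In EPSG:4326, `.area` yields square degrees, which is not a physical area; in EPSG:3857 it yields square meters (keeping in mind that Web Mercator areas are inflated away from the equator):

```python
import geopandas as gpd
from shapely.geometry import box

# A hypothetical ~1 km square near Manila, in geographic coordinates
sq = gpd.GeoSeries([box(121.0, 14.5, 121.009, 14.509)], crs="EPSG:4326")

# Square degrees -- not meaningful as a physical area (geopandas will warn)
deg_area = sq.area.iloc[0]

# Square meters after reprojecting to a planar CRS
m2_area = sq.to_crs("EPSG:3857").area.iloc[0]
print(deg_area, m2_area)
```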
%%time
valid_admin3 = valid_admin3.to_crs("EPSG:3857")
%%time
filtered_fixed = filtered_fixed.to_crs("EPSG:3857")
Let's save the files so we can reload them later without having to reprocess them.
%%time
valid_admin3.to_file("../data/valid_admin3.geojson", driver="GeoJSON")
%%time
filtered_fixed.to_file("../data/filtered_fixed.geojson", driver="GeoJSON")
# valid_admin3 = gpd.read_file("../data/valid_admin3.geojson")
# filtered_fixed = gpd.read_file("../data/filtered_fixed.geojson")
We want the following statistics: the mean (average), min (minimum), max (maximum), and standard deviation (std) for the three data columns that Ookla provides: average download speed (avg_d_kbps), average upload speed (avg_u_kbps), and average latency in milliseconds (avg_lat_ms).
funcs = ["mean", "min", "max", "std"]
columns = ["avg_d_kbps", "avg_u_kbps", "avg_lat_ms"]
aggregations = [dict(func=funcs, column=c) for c in columns]
These are the aggregations to be performed for each data column:
aggregations
The output columns use the default {column}_{func} format if not explicitly specified.
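For example, the default output names can be reproduced with a simple comprehension (`output_cols` here is only an illustration, not a variable the notebook actually uses):

```python
funcs = ["mean", "min", "max", "std"]
columns = ["avg_d_kbps", "avg_u_kbps", "avg_lat_ms"]

# The default output column names follow the {column}_{func} pattern
output_cols = [f"{c}_{f}" for c in columns for f in funcs]
print(output_cols[0])   # avg_d_kbps_mean
print(output_cols[-1])  # avg_lat_ms_std
```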
%%time
aoi = azs.create_area_zonal_stats(
valid_admin3, filtered_fixed, aggregations=aggregations
)
These are the results: the same dataframe as the original, with the zonal statistics we specified as the aggregations added as new columns.
aoi.head()
Let's save the results.
%%time
aoi.to_file("../data/admin3_internet_aoi.geojson", driver="GeoJSON")
list(aoi.columns.values)
Let's sort the different admin level 3 areas (cities/municipalities/districts) by average download speed (avg_d_kbps_mean) in descending order.
%%time
fastest_mean_download = aoi.sort_values("avg_d_kbps_mean", ascending=False)
len(fastest_mean_download)
So according to the latest Ookla data (as of this writing), the district/city/municipality within the NCR region with the fastest average download speed is Binondo!! :)
fastest_mean_download.iloc[0]
Meanwhile, the district/city/municipality within the NCR region with the slowest average download speed is Quezon City!!!
fastest_mean_download.iloc[-1]
The top 5 fastest areas are:
fastest_mean_download[["ADM3_EN", "ADM2_EN", "avg_d_kbps_mean"]].head()
The top 5 slowest areas are:
fastest_mean_download[["ADM3_EN", "ADM2_EN", "avg_d_kbps_mean"]].tail()
ax = plt.axes()
# outline all areas in blue
ax = fastest_mean_download.plot(ax=ax, facecolor="none", edgecolor="blue")
# highlight the fastest area in red
ax = fastest_mean_download.iloc[:1].plot(ax=ax, facecolor="red", edgecolor="red")
# highlight the slowest areas (ranked 30th onward) in green
ax = fastest_mean_download.iloc[29:].plot(ax=ax, facecolor="green", edgecolor="green")
# overlay a choropleth of the mean download speed
ax = fastest_mean_download.plot(ax=ax, column="avg_d_kbps_mean", alpha=0.8)
So now, we've answered the question of where the fastest and slowest internet speeds are in the National Capital Region (NCR).
Additional Exercises
You can further experiment with the Ookla data, or even try a different country using another geodataset from the Humanitarian Data Exchange site.
The following are more exercises you can try:
- Repeat the same process to find the fastest and slowest mobile internet speeds
- Repeat the same process for a different REGION_FILTER
- Aggregate by adm2 level instead of adm3