A basic introduction to Vector Area Zonal Stats
import geopandas as gpd
import matplotlib.pyplot as plt
import geowrangler.area_zonal_stats as azs
simple_aoi = gpd.read_file("../data/simple_planar_aoi.geojson")
simple_data = gpd.read_file("../data/simple_planar_data.geojson")
Given an aoi (simple_aoi
) and geodataframe containing sample data (simple_data
)
simple_aoi
simple_data
In order correctly apportion the statistic, we need to make sure that the aoi and data geodataframes are using a planar
CRS (i.e. gdf.crs.is_geographic == False
)
simple_aoi.crs
simple_data.crs
ax = plt.axes()
ax = simple_data.plot(
ax=ax, color=["orange", "brown", "purple"], edgecolor="yellow", alpha=0.4
)
ax = simple_aoi.plot(ax=ax, facecolor="none", edgecolor=["r", "g", "b"])
The red,green,blue outlines are the 3 regions of interest (aoi) while the orange,brown, purple areas are the data areas.
empty_aoi_results = azs.create_area_zonal_stats(simple_aoi, simple_data)
If no aggregations are specified, the include_intersect=True
arg specifies that the sum of the data areas intersecting our aoi is computed in the column intersect_area_sum
.
empty_aoi_results
Apportioning aggregated statistics
To aggregate statistics over the aoi areas intersecting with the data areas, the default behavior of "apportioning" the statistic over the intersection of the data area overlapping the aoi area depends on the statistic.
- For
sum
it apportions the total value of the statistic over the proportion of the data area overlapping the aoi area divided by the total area of thedata
. - For
mean
it apportions the total value of the statistic over the proportion of the data area overlapping the aoi area divided by the total area of theaoi
. - For other statistics, there is no apportioning done and uses the
raw
statistics from the data areas overlapping the aoi area.Note: this default behavior can be overridden by specifying a prefix (raw_
,data_
,aoi_
) to the statistic (i.e. func) -- e.g. if you wish the apportion thesum
as a proportion of the aoi area instead of the data area, specify the func asaoi_sum
instead ofsum
.
%%time
simple_aoi_results = azs.create_area_zonal_stats(
simple_aoi,
simple_data,
[
dict(func=["sum", "count"], column="population"),
dict(func=["mean", "max", "min", "std"], column="internet_speed"),
],
)
simple_aoi_results
simple_aoi_results.population_sum.sum(axis=None)
Note the value of the mean of the internet speed for the 1st area above.
While there is one data area overlapping with an internet speed of 20.0
the computed aggregate statistic for the mean is 15.0
due to the apportioning which only accounts for 0.75
(the total aoi area is 1.0
) -- meaning it assigns a value of 0.0
for the 0.25
area with no overlapping data area.
If you wish to override this behavior and impute the value of the statistic as the means of all the overlapping areas (in proportion to the overlap of their area over the total area of the aoi), you can additionally add a prefix of imputed_
as shown in the next example. Note that adding a prefix doesn't change the default output column name, which is why we explicitly specified an output column name (internet_speed_imputed_mean
in order to differentiate it from the original internet_speed_mean
column.
Notice as well that adding a raw_
prefix to the min
,max
,std
statistics doesn't change their values since by default these statistics don't use any apportioning (so specifying raw_
is redundant in this case).
Lastly, check the effect of setting of the fix_min
arg to False
(the default value is True
) -- since it uses the raw column to compute the min
, it is not aware that there are areas in the aoi that have no overlapping data areas and uses the min
value from the overlapping areas (in this case, since there is only 1 partially overlapping area, it uses that as the min
value). The fix_min
arg "fixes" this by checking if the data areas completely overlap the aoi area and sets the minimum to 0
if there is a portion of the aoi area that is not completely intersected by a data area.
%%time
corrected_aoi_results = azs.create_area_zonal_stats(
simple_aoi,
simple_data,
[
dict(func=["sum", "count"], column="population"),
dict(
func=["mean", "imputed_mean", "raw_max", "raw_min", "raw_std"],
column="internet_speed",
output=[
"internet_speed_mean",
"internet_speed_imputed_mean",
"internet_speed_max",
"internet_speed_min",
"internet_speed_std",
],
),
],
fix_min=False,
)
corrected_aoi_results
Custom Grids over admin area data
Another example using aggregated statistics over admin areas (barangay level) that are apportioned into grid areas by their overlap.
Notice that since the grid areas does not completely overlap the admin areas (and vice versa) , the sum of their total populations and land areas are not equal.
%%time
region3_admin_grids = gpd.read_file("../data/region3_admin_grids.geojson")
region3_admin_grids = region3_admin_grids.to_crs("EPSG:3857") # convert to planar
%%time
region3_pop_bgy_level = gpd.read_file("../data/region3_population_bgy_level.geojson")
%%time
aoi_result = azs.create_area_zonal_stats(
region3_admin_grids,
region3_pop_bgy_level,
[
dict(func=["sum", "count"], column="population"),
],
)
aoi_result
(
aoi_result.geometry.area.sum(),
aoi_result.intersect_area_sum.sum(),
aoi_result.population_sum.sum(),
)
(region3_pop_bgy_level.geometry.area.sum(), region3_pop_bgy_level.population.sum())
gdf = region3_admin_grids
gdf2 = region3_pop_bgy_level
ax = plt.axes()
ax = gdf2.plot(ax=ax, column="population", alpha=0.4)
ax = gdf.plot(ax=ax, facecolor="none", edgecolor="blue", alpha=0.4)
ax = plt.axes()
ax = gdf2.plot(ax=ax, facecolor="none", alpha=0.4, edgecolor="red")
ax = aoi_result.plot(column="population_sum", ax=ax, alpha=0.4, edgecolor="blue")
ax = plt.axes()
ax = gdf2.plot(ax=ax, column="population", alpha=0.4, edgecolor="red")
ax = aoi_result.plot(column="intersect_area_sum", ax=ax, alpha=0.4, edgecolor="blue")