A basic introduction to using spatial join using highest intersection
type | default | optional/ required | details | |
---|---|---|---|---|
gdf1 | GeoDataFrame | none | required | basis of output geometry |
gdf2 | GeoDataFrame | none | required | data to be included during intersection |
proj_crs | String | none | required | metric CRS (e.g., Philippines uses EPSG:32651) |
A technical step-by-step explanation of how get_highest_intersection
works is detailed in the cell blocks below. An example on how to use it with its arguments is shown in the sample use case section thereafter.
- Define the function and its arguments.
def get_highest_intersection(
gdf1: gpd.GeoDataFrame,
gdf2: gpd.GeoDataFrame,
proj_crs: str,
) -> gpd.GeoDataFrame:
- Create a copy of the two geodataframes.
gdf1 = gdf1.copy()
gdf2 = gdf2.copy()
- Rename new columns with "__" prefixes and suffixes to prevent overwriting existing columns in the original geodataframes.
uid_col = "__uid__"
area_col = "__area_highest_intersection__"
auxiliary_cols = [uid_col, area_col]
- Conduct checks to make sure we are not overwriting existing columns.
for col in auxiliary_cols:
if col in gdf1.columns:
raise ValueError(f"Make sure {col} isn't already a column in gdf1")
if col in gdf2.columns:
raise ValueError(f"Make sure {col} isn't already a column in gdf2")
- Assign a unique ID to the first geodataframe. Note that uid_col is also defined in Step 2.
gdf1[uid_col] = range(len(gdf1))
- Get intersection of the geodataframes.
overlay = gdf1.overlay(gdf2, how="intersection")
- Add a column for overlapping area. Note that area_col is also defined in Step 2.
overlay["geometry"] = overlay["geometry"].to_crs(proj_crs)
overlay[area_col] = (overlay.geometry.area)
- Sort values by area. Drop duplicates and null values.
overlay = overlay.sort_values(by=area_col, ascending=True)
overlay = overlay.drop_duplicates(subset=[uid_col], keep="last")
overlay = overlay.dropna(subset=[uid_col])
assert not overlay[uid_col].duplicated().any()
overlay = overlay.sort_values(by=[uid_col], ascending=True)
- Drop geometry from the overlay dataframe and merge the original geometry from gdf1. Also drop additional columns (uid_col, area_col) used to accomplish calculate overlapping area for the function.
overlay_merge = overlay.drop("geometry", axis=1)
output = pd.merge(
left=gdf1[[uid_col, "geometry"]],
right=overlay_merge,
on=uid_col,
how="left",
validate="one_to_one",
)
output = output.drop(columns=auxiliary_cols)
- Drop the additional columns (uid_col, area_col) and return the final output.
return output
Input:
grid
- GeoDataFrame that will become the basis of geometries for the outputadm_bounds
- GeoDataFrame containing attributes that we want to append to the gridproj_crs
- metric coordinate reference system (e.g., "EPSG:32651" for the Philippines)
Output
- grid with one province name per cell
import geopandas as gpd
import geowrangler.grids as grids
import geowrangler.spatialjoin_highest_intersection as spatial_join
grid
grid.plot(facecolor="none", edgecolor="blue");
admin_bounds
admin_bounds.plot(facecolor="none", edgecolor="blue");
output = spatial_join.get_highest_intersection(grid, admin_bounds, "EPSG:32651")
Note that each grid now has admin bounds columns (shapeName, shapeISO, shapeID, ... etc.) based on the intersection with the admin boundaries with the highest overlapping area over each grid.
output
output.plot(facecolor="none", edgecolor="blue");