A spatial join operation is analogous to a database table join or DataFrame merge operation, but considers the geographic relationships between records. In the context of EarthAI Notebooks, a spatial join is an operation that merges two DataFrames, each having a geometric object column, by some spatial relationship of their geometries.
The spatial join is important because it allows a variety of geographic data sources to be combined and reasoned over. We can use spatial joins to combine domain-specific information with raster @ref:catalogs.
This page discusses the case where both of the DataFrames are PySpark DataFrames. See also the @ref:GeoPandas spatial join discussion.
Let's get started with some basic imports.
```python
from earthai.init import *
from shapely.geometry import Point
from pyspark import SparkFiles
import os
```
Next, we read US state (admin level 1) boundaries from a GeoJSON file and filter down to Alabama and Arkansas.

```python
geo_admin_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'
spark.sparkContext.addFile(geo_admin_url)

# Read the state boundaries and keep only Alabama and Arkansas
adm1 = spark.read.geojson(SparkFiles.get(os.path.basename(geo_admin_url))).drop('id')
adm1_alar = adm1[adm1.state_code.isin('AL', 'AR')]
adm1_alar
```
Now we will construct a small DataFrame containing city locations as point geometries. Note the inclusion of Charlotte, North Carolina.
```python
city_df = spark.createDataFrame([
    {'city_name': 'Hot Springs', 'geom': Point(-93.055278, 34.497222)},
    {'city_name': 'Tuscaloosa', 'geom': Point(-87.534607, 33.20654)},
    {'city_name': 'Mobile', 'geom': Point(-88.043056, 30.694444)},
    {'city_name': 'Little Rock', 'geom': Point(-92.331111, 34.736111)},
    {'city_name': 'Charlotte', 'geom': Point(-80.843056, 35.227222)}
]).hint('broadcast')
```
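If you want to double-check how the point geometries were captured, you can inspect the schema. This is an optional sanity check, not part of the original walkthrough; the exact type name reported for the `geom` column depends on the geometry user-defined type registered by the environment.

```python
# Optional sanity check: the geom column should appear as a geometry
# user-defined type rather than a plain string or struct.
city_df.printSchema()
```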
Spatial Join
We use the standard PySpark DataFrame join, with GeoMesa spatial column filters as the join condition. Let's join with the city DataFrame as the left-hand side. Because this is an ordinary PySpark join, the resulting DataFrame carries the columns from both sides, including both geometry columns. With the joined data we can, for example, color the city locations by the state each one falls in.
Note that in the code above we used `.hint('broadcast')` on the city DataFrame. This hint is useful for joins where we know one of the DataFrames is small enough to be copied to every executor in Spark. In this case either DataFrame could be broadcast, but choosing a broadcast side is an important consideration for the performance of spatial joins.
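As an alternative to calling `.hint('broadcast')` when the DataFrame is created, PySpark's built-in `broadcast` function can mark a DataFrame for broadcasting at join time. The following is a minimal sketch of that option; it is not part of the example above, just the same hint expressed a different way.

```python
from pyspark.sql.functions import broadcast

# Equivalent to the .hint('broadcast') used when city_df was created:
# mark the smaller DataFrame for broadcast at join time instead.
adm1_alar.join(broadcast(city_df), st_intersects(adm1_alar.geometry, city_df.geom))
```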
```python
sjoin_city_state = city_df.join(adm1_alar, st_intersects(city_df.geom, adm1_alar.geometry))
sjoin_city_state
```
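Since the join result includes columns from both DataFrames, it can be convenient to select just the fields of interest. A minimal sketch using the column names from the example above:

```python
# Keep just the city name, its point geometry, and the matching state code.
# This also sidesteps having two geometry columns in downstream steps.
sjoin_city_state.select('city_name', 'state_code', 'geom')
```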
If we invert the join, the result in PySpark is logically equivalent. This is one difference from the spatial join with @ref:GeoPandas, which attempts to return a single geometry column.
```python
sjoin_state_city = adm1_alar.join(city_df, st_intersects(adm1_alar.geometry, city_df.geom))
sjoin_state_city
```
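Because the two join orders are logically equivalent, aggregations over either result give the same answer. As a small illustrative sketch (not part of the original example), we can count how many of our sample cities fall in each state:

```python
# Both join orders contain the same matched rows, so either can be aggregated.
# Count the sample cities that fall within each state.
sjoin_state_city.groupBy('state_code').count()
```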
Both joins above are inner joins, so Charlotte, North Carolina, which does not intersect Alabama or Arkansas, is dropped from the results. To keep unmatched cities, with null values in the state columns, use a left join:

```python
city_df.join(adm1_alar, st_intersects(city_df.geom, adm1_alar.geometry), how='left')
```
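To see which cities did not match a state, we can filter the left join for rows where the state columns are null. A small sketch, where the `sjoin_left` name is introduced here purely for illustration:

```python
# Assign the left join to a name (introduced here for illustration only).
sjoin_left = city_df.join(adm1_alar, st_intersects(city_df.geom, adm1_alar.geometry), how='left')

# Charlotte should appear here with null values in the state columns,
# since it does not intersect Alabama or Arkansas.
sjoin_left.filter(sjoin_left.state_code.isNull())
```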