In this article we'll look at how to define EarthAI Catalog API queries using vector files that you import into the EarthAI Notebook environment from an external source. It is often useful in geospatial analysis to associate raster data with vector shapes and their attributes. In this case we'll use the polygons of two U.S. states, Florida and Washington, then use these to query the EarthAI Catalog.
We'll start by importing the needed libraries.
from earthai.init import *
import pyspark.sql.functions as F
import geopandas as gpd
import pandas as pd
The first step in the workflow is reading in the vector data we'll use to query the imagery catalog. We will read in these data from a GeoJSON file by referencing its URL and loading it into a GeoPandas DataFrame. We will use state polygons from Natural Earth, which we access through DataHub.io. Note that you can also upload the GeoJSON file into EarthAI Notebook and access it locally if you prefer.
To get the URL, go to https://github.com/datasets/geo-admin1-us/blob/master/data/admin1-us.geojson. You will see an embedded web map as shown in the screenshot below.
Click on the Raw tab in the upper right of the embedded web map; this will open the raw GeoJSON file in a web browser. Copy this URL and assign a variable to it. You can then load the geometries into a GeoPandas DataFrame.
states_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'
states_df = gpd.read_file(states_url)
Now select the desired states, downsample to the area of interest, and save it to a new DataFrame.
We selected Florida and Washington here, but you can choose whichever state(s) you like for this exercise. In the code below, we're using the condition that the "state_code" attribute is in the list of the states abbreviations we want, and only selecting that data. You can plot the state outlines to confirm the shapes.
states_flwa = states_df[states_df.state_code.isin(['FL', 'WA'])]
Now we have geographic constraints for an EarthAI Catalog query, namely the geographic areas of Florida and Washington. These are be geometry coordinates of the two polygons representing the states. We'll query using specified start and end dates, a maximum cloud cover, and the Landsat 8 collection. For the "geo" argument we'll simply pass in our selected states vector data.
catalog = earth_ondemand.read_catalog(
start_datetime = '2018-01-01',
end_datetime = '2018-03-31',
max_cloud_cover = 10,
collections = 'landsat8_l1tp')
The "geo" argument is very flexible and can take a wide range of vector geometry representations such as GeoPandas DataFrames and Well-known text (WKT).
Now we need to attach the states' features information to the imagery catalog generated from our query. We'll do this using a spatial join in GeoPandas. Part of this involves ensuring both the imagery and the vector data are in the same coordinate reference system (CRS). In this case we'll convert the query catalog's CRS to the states' vector data CRS. You can also check that the resulting join is still a GeoPandas DataFrame.
states_catalog = gpd.sjoin(states_flwa, catalog.to_crs(states_flwa.crs))
We'll also quickly check to see how many images were returned for each state from the query. The resulting line of code shows that there are 48 images for Florida and 11 for Washington.
Just as another check look to see what columns (attributes) are in the newly joined DataFrame.
For this exercise, we're interested in the short-wave and near-IR bands of Landsat 8 for our raster data. For clarity we'll check what the band names are in Landsat 8 and then convert those to common names.
For Landsat 8, the near-IR ('nir') corresponds to band 'B5' and shortwave-IR ('swir16') corresponds to band 'B6'. We'll now read these two raster bands (B5 and B6) into a new SparkDataFrame, df, and then rename each column with the common names.
df = spark.read.raster(states_catalog, catalog_col_names=['B6', 'B5']) \
.withColumnRenamed('B6', 'swir16').withColumnRenamed('B5', 'nir')
Finally, we'll create a new column in the df DataFrame that reprojects the state polygons to the same CRS as the imagery data. This column is called "geo_native". We'll also filter out rows in the DataFrame where the image does not cover the entire state and include a filter to remove fill values that appear in Landsat imagery. The fill values are the result of null values in the imagery data.
df = df.withColumn('geo_native',
st_reproject('geometry', rf_mk_crs(str(states_flwa.crs)), rf_crs('swir16'))) \
.filter(st_intersects(rf_geometry('swir16'), 'geo_native')).filter(rf_tile_max('swir16') > 0)
Now look at the first five rows of the data. The code below shows the shortwave-IR and near-IR bands along with the state name and state code. It also shows the coordinates of the center of the state and the tile center. Since these first five are all for Florida the state center remains the same for each, but the tiles (each row) represent different portions of the larger scenes. As such, the center of the tiles will vary.
df.select('swir16', 'nir', 'state_code', 'name', st_centroid('geo_native') \
.alias('state_center'), st_centroid(rf_geometry('swir16')) \