This example introduces the use of EarthAI's chip reading function, spark.read.chip
. Chipping is a procedure for taking an input vector representation of geometries and reading only the subset of an image that intersects with those bounds. Here, we compare the use of spark.read.chip
and the spark.read.raster
function, which reads in full scenes only.
We demonstrate the use of spark.read.chip
on a vector data set of hand-labelled sites and their land cover classifications to load Landsat 8 imagery from the EarthAI Catalog.
Note: if you would like to run through this example in EarthAI Notebook, you can download the companion notebook and vector data source from the attachments provided at the end of this article.
Import Libraries
from earthai.init import * import geopandas import earthai.chipping.strategy import pyspark.sql.functions as F import ipyleaflet
The STEP Dataset
The vector file we will use is the land use / land cover labelled dataset from the System for Terrestrial Ecosystem Parameterization (STEP). This dataset includes land cover classifications for 1,983 labelled regions that was collected from very high resolution imagery in 2014. The labelled sites range from 1 to 200 square kilometers in size, and are selected for long term stability of land cover / land use.
There are 17 land cover classes in this dataset. These classes come from the International Geosphere-Biosphere Programme (IGBP), which has detailed descriptions of each land cover type.
IGBP Class | Land Cover Desription |
---|---|
1 | Evergreen needleleaf forest |
2 | Evergreen broadleaf forest |
3 | Deciduous needleleaf forest |
4 | Deciduous broadleaf forest |
5 | Mixed forest |
6 | Closed shrubland |
7 | Open shrubland |
8 | Woody savanna |
9 | Savanna |
10 | Grassland |
11 | Permanent wetland |
12 | Cropland |
13 | Urban |
14 | Cropland and natural vegetation mosaic |
15 | Snow and ice |
16 | Barren |
17 | Water |
Read and Display STEP Dataset
The STEP data can be downloaded from http://www.gofcgold.wur.nl/sites/gofcgold_refdataportal-step.php. We first upload the GeoJSON file for this data set to a subfolder called "data", and then read the STEP data into a GeoDataFrame.
step_gdf = geopandas.read_file("data/step_september152014_70rndsel_igbpcl.geojson")
In this example, we are only going to use a subset of the dataset, so we can quickly explore the data. We filter step_gdf down to the cropland and urban sites, which are represented by the igbp labels 12 and 13, respectively.
step_subset_gdf = step_gdf[step_gdf.igbp.isin([12, 13])] len(step_subset_gdf)
To see the distribution of cropland and urban sites, you can view them on a basemap using ipyleaflet.
m = ipyleaflet.Map(center=(24.9, 3.2), zoom=2) cropland_layer = ipyleaflet.GeoData(geo_dataframe=step_subset_gdf[step_subset_gdf.igbp == 12], style={'color': 'green', 'weight': 5}) urban_layer = ipyleaflet.GeoData(geo_dataframe=step_subset_gdf[step_subset_gdf.igbp == 13], style={'color': 'red', 'weight': 5}) legend = ipyleaflet.LegendControl({"Cropland":"green", "Urban":"red"}, position="topright") m.add_layer(cropland_layer) m.add_layer(urban_layer) m.add_control(legend) m

Query Imagery at STEP Sites
To obtain a catalog of imagery for each STEP site, we next pass the STEP site polygons in the geometry column of the step_subset_gdf GeoDataFrame to the earth_ondemand.read_catalog
function. This function returns a GeoDataFrame with rows of file paths to each image that meets your query parameters. In the query below, we are going to request Landsat 8 imagery for the first half of June 2014 with a maximum of 10% cloud cover over our STEP sites.
cat = earth_ondemand.read_catalog( step_subset_gdf.geometry, start_datetime='2014-06-01', end_datetime='2014-06-15', max_cloud_cover=10, collections='landsat8_c2l1t1' )
We then join the catalog to the vector data in order to match the STEP sites to the image scenes that intersect with their geometries. This step is critical for use of the chip reader.
step_cat = geopandas.sjoin(step_subset_gdf, cat)
step_cat can include multiple Landsat 8 scenes for each STEP site, taken at different dates/times. For simplicity in demonstrating chipping, we select just a single scene for each site. The code below selects the scene with the least cloud coverage.
step_cat['grp_col'] = step_cat['siteid'] step_cat = step_cat.sort_values('eo_cloud_cover').groupby(['grp_col']).first()
Read in Full Scenes: spark.read.raster
spark.read.raster
reads in entire scenes of imagery broken up into a gridded set of tiles. Each row in the returned Spark DataFrame is a set of tiles (dimensions of 256 x 256 by default) of the bands selected. Below, we use spark.read.raster
to read in the red, green, and blue Landsat 8 bands.
scene_rf = spark.read.raster(step_cat, ['B4', 'B3', 'B2']) \ .withColumnRenamed('B4', 'red') \ .withColumnRenamed('B3', 'green') \ .withColumnRenamed('B2', 'blue')
As shown below, the chip dimensions are uniformly 256 x 256.
scene_rf.select('siteid', 'red', 'green', 'blue', rf_dimensions('blue'), rf_extent('blue')) \ .filter(rf_tile_max('red') > 0)

If you count the number of rows, you will see that there are over 198,000 tiles, because we are reading in all Landsat 8 scenes that intersect with any of the STEP sites. Image scenes tend to be very large, so reading them all in uses a lot of memory. However, if you need imagery covering a large area surrounding each site, then it makes sense to use spark.read.raster
.
scene_rf.count()
Read in Chips: spark.read.chip
The term chipping refers to the process of cropping small extents from a scene, guided by input labeled vector data. Below we use spark.read.chip
to read in the Landsat 8 red, green, and blue bands from the same STEP data set. However this function only reads in tiles that intersect the STEP site geometries.
chip_rf = spark.read.chip(step_cat, ['B4', 'B3', 'B2'], chipping_strategy=earthai.chipping.strategy.IntersectingExtent()) \ .withColumnRenamed('B4', 'red') \ .withColumnRenamed('B3', 'green') \ .withColumnRenamed('B2', 'blue')
In the above code block, the chipping_strategy
parameter allows the user to define different ways to create image chips from the input vector geometries. We discuss these further and demonstrate different chipping strategies in another article. The IntersectingExtent
option used here creates one tile per polygon, cropped to where the polygon intersects with the Landsat 8 scene.
As shown below, the dimensions of each tile varies because the boundaries for each STEP site are different.
chip_rf.select('siteid', 'red', 'green', 'blue', rf_dimensions('blue')) \ .filter(rf_tile_max('red') > 0)

Counting the number of rows, we see that we are now only reading in 207 tiles. This demonstrates that when you have small polygons and only need the imagery that intersects with them, it makes more sense to use spark.read.chip
so that you don't read more data into memory than necessary, which will speed up downstream computations.
chip_rf.count()
View Scene and Chip Extents
The code cells below create a map that shows the extents of image tiles generated using spark.read.raster
(blue) and spark.read.chip
(red) for a single urban site.
The chip reader creates a single, tightly-cropped chip for the site. It only returns the parts of the image scene that intersects with the site geometry, while the raster reader returns entire image scenes (gridded into smaller tiles) that intersects with the site geometry.
If you only need the imagery that intersects with the site, note that the raster reader will return a lot of unnecessary imagery that may slow down data processing in your notebook.
scene_df = scene_rf.filter(F.col('siteid') == 108763744000) \ .withColumn('extent_4326', st_reproject(st_geometry(rf_extent('red')), rf_crs('red'), lit('EPSG:4326'))) \ .select('extent_4326').toPandas() scene_gdf = geopandas.GeoDataFrame(scene_df, geometry="extent_4326", crs="EPSG:4326")
chip_df = chip_rf.filter(F.col('siteid') == 108763744000) \ .withColumn('extent_4326', st_reproject(st_geometry(rf_extent('red')), rf_crs('red'), lit('EPSG:4326'))) \ .select('extent_4326').toPandas() chip_gdf = geopandas.GeoDataFrame(chip_df, geometry="extent_4326", crs="EPSG:4326")
m1 = ipyleaflet.Map(center=(44.0, 88.0), zoom=9) raster_layer = ipyleaflet.GeoData(geo_dataframe=scene_gdf, style={'color': 'blue'}, name='Raster Outlines') chip_layer = ipyleaflet.GeoData(geo_dataframe=chip_gdf, style={'color': 'red'}, name='Chip Outlines') m1.add_layer(raster_layer) m1.add_layer(chip_layer) m1.add_control(ipyleaflet.LayersControl()) m1

As mentioned previously, spark.read.chip
includes several different types of chipping strategies, which we discuss in this companion article.
Comments
0 comments
Article is closed for comments.