In a previous article, we introduced the spark.read.chip function for reading in subsets of scenes from Earth observation data, and in another article, we demonstrated the different chipping strategies available with the spark.read.chip function. In this article, we will show how to write out your chips in GeoTIFF format.
Note: if you would like to run through this example in EarthAI Notebook, you can download the companion notebook and vector data source from the attachments provided at the end of this article.
We will start by importing all of the Python libraries used in this example.
from earthai.init import *
import earthai.chipping.strategy
import pyspark.sql.functions as F
import os
import geopandas
import rasterio
import ipyleaflet
Query Imagery at STEP Sites
In a previous article, we introduced the System for Terrestrial Ecosystem Parameterization (STEP) data set, and used it to query the EarthAI Catalog to identify Landsat 8 scenes that intersect with cropland and urban sites around the world. The code in the cell block below replicates those steps for use in the following sections. Please refer to the previous article for more details on these operations.
# Read in the STEP data set
step_gdf = geopandas.read_file("data/step_september152014_70rndsel_igbpcl.geojson")

# Filter to include only the cropland and urban classes
step_subset_gdf = step_gdf[step_gdf.igbp.isin([12, 13])]

# Query Landsat 8 imagery at STEP sites
cat = earth_ondemand.read_catalog(
    step_subset_gdf.geometry,
    start_datetime='2014-06-01',
    end_datetime='2014-06-15',
    max_cloud_cover=10,
    collections='landsat8_c2l1t1'
)

# Join the imagery catalog back to the STEP data
step_cat = geopandas.sjoin(step_subset_gdf, cat)
step_cat can include multiple Landsat 8 scenes for each STEP site, taken at different dates/times. For simplicity in demonstrating chip writing, we select just a single scene for each site. The code below selects the scene with the least cloud coverage.
step_cat['grp_col'] = step_cat['siteid']
step_cat = step_cat.sort_values('eo_cloud_cover').groupby(['grp_col']).first()
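The selection logic can be checked on a toy table. The sketch below uses plain pandas with made-up site IDs, cloud-cover values, and a hypothetical id column; it is not real catalog output. One detail worth knowing: pandas GroupBy.first() returns the first non-null value in each column, so if the least cloudy row has a missing value in some column, that value is taken from a later row. Sorting and then dropping duplicate site IDs, as sketched here, keeps whole rows instead.

```python
import pandas as pd

# Toy stand-in for step_cat: two scenes for site 1, one for site 2 (made-up values)
scenes = pd.DataFrame({
    "siteid": [1, 1, 2],
    "eo_cloud_cover": [7.5, 2.0, 4.0],
    "id": ["scene_a", "scene_b", "scene_c"],  # hypothetical scene identifier column
})

# Sort so the least cloudy scene comes first, then keep the first row per site
least_cloudy = (
    scenes.sort_values("eo_cloud_cover")
          .drop_duplicates("siteid", keep="first")
          .sort_values("siteid")
          .reset_index(drop=True)
)

print(least_cloudy)
```

Either approach leaves one row per site; which one fits better depends on whether your catalog columns can contain missing values.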
We use the centroid-centered chipping strategy, which creates chips of the specified dimensions centered at a point or at the centroid of each polygon, depending on the input geometry. The returned RasterFrame will have chips of uniform dimensions, one for each input geometry. This chipping strategy is useful for deep learning applications, which typically expect inputs of a fixed size.
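To make the idea of a centroid-centered chip concrete, here is a minimal sketch of the underlying arithmetic. It is an illustration only, not EarthAI's implementation: the function name centered_window, the north-up raster assumption, and the example coordinates are all hypothetical, and the sketch does not handle chips that run past the edge of a scene.

```python
import math

def centered_window(x, y, origin_x, origin_y, pixel_size, width=50, height=50):
    """Return (row_start, row_stop, col_start, col_stop) of a chip centered on (x, y).

    Assumes a north-up raster whose origin is the upper-left corner and
    whose pixels are square with side `pixel_size` in map units.
    """
    # Pixel that contains the point
    col = math.floor((x - origin_x) / pixel_size)
    row = math.floor((origin_y - y) / pixel_size)
    # Start half a chip before that pixel so the window is exactly width x height
    col_start = col - width // 2
    row_start = row - height // 2
    return row_start, row_start + height, col_start, col_start + width

# Hypothetical 30 m raster with its upper-left corner at (500000, 4100000)
window = centered_window(503015.0, 4096985.0, 500000.0, 4100000.0, 30.0)
print(window)  # (75, 125, 75, 125)
```

Because every window has the same width and height regardless of the size of the input polygon, all chips come out with uniform dimensions.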
We pass the chipping strategy, earthai.chipping.strategy.CentroidCentered, to the spark.read.chip function. We specify the chip dimensions as 50 by 50 pixels.
To see a list of all chipping strategies and a description of their behavior, run
rf = spark.read.chip(step_cat, ['B4', 'B3', 'B2'],
                     chipping_strategy=earthai.chipping.strategy.CentroidCentered(50, 50)) \
    .withColumnRenamed('B4', 'red') \
    .withColumnRenamed('B3', 'green') \
    .withColumnRenamed('B2', 'blue') \
    .filter(rf_tile_max('red') > 0).cache()  # filter out chips with all NoData values
To write chips in GeoTIFF format, we use the rf.write.chip function. This function requires a file path and file name column as input. The file path points to the directory that will store the chips when they are written out. The file name column provides the file name to use for each chip. The file name column can also include a subdirectory structure if desired.
In the cell below, we create the file_path_name column, which concatenates the igbp label with the unique siteid value to create a subdirectory structure that organizes the chips by label. The cropland chips will be written out to the "12" folder and the urban chips will be written out to the "13" folder within the main directory.
rf = rf.withColumn('file_path_name', F.concat_ws('/', F.col('igbp'), F.col('siteid')))
As specified below, the main folder containing the chips will be called "chips". It will be created in the same directory where your notebook resides.
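As a quick sanity check on the naming scheme, the plain-Python sketch below reproduces the per-row string that F.concat_ws builds and joins it to the main folder. The helper chip_path is hypothetical, and the ".tif" extension is an assumption inferred from the sample chip path opened later in this article.

```python
import posixpath

def chip_path(base_dir, igbp, siteid, extension=".tif"):
    """Build the expected output location for one chip (illustrative helper)."""
    # Same result as F.concat_ws('/', igbp, siteid) for a single row
    file_path_name = "/".join([str(igbp), str(siteid)])
    # Assumption: the chip writer appends the GeoTIFF extension to the file name
    return posixpath.join(base_dir, file_path_name + extension)

path = chip_path("chips", 12, 100283546777)
print(path)  # chips/12/100283546777.tif
```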
The remaining parameters in rf.write.chip are optional. We pass True to the catalog parameter, which tells the chip writer to write a CSV file that catalogs all of the chips written out. This CSV file includes the metadata columns specified in the metadata parameter as well as CRS and bounding box information for each chip.
A single GeoTIFF will be written out for each row of your DataFrame. If there are multiple tile columns in your RasterFrame, each GeoTIFF will be multi-band. Run the cell below to start writing chips.
It takes 3-4 minutes to write out the 149 chips in this RasterFrame on a Dedicated Instance type.
rf.write.chip('chips',
              filenameCol='file_path_name',
              catalog=True,
              metadata=['siteid', 'igbp', 'geometry', 'datetime'])
Once the chips are written out, you can navigate through the chip directory in the left menu, right click on any of the files, and select Download to save the file to your local machine.
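If you prefer to verify the output programmatically rather than browsing the file menu, a short standard-library sketch like the one below can count the GeoTIFFs in each label folder. The helper count_chips_by_label is hypothetical, and it assumes the one-subfolder-per-label layout and ".tif" extension described in this article.

```python
from collections import Counter
from pathlib import Path

def count_chips_by_label(chip_dir):
    """Count GeoTIFF files in each label subdirectory of `chip_dir`."""
    counts = Counter()
    # Chips are expected one level down, e.g. chips/12/<siteid>.tif
    for tif in Path(chip_dir).glob("*/*.tif"):
        counts[tif.parent.name] += 1
    return dict(counts)

# Only report if the output folder exists in the current working directory
if Path("chips").is_dir():
    print(count_chips_by_label("chips"))
```

The counts across all label folders should add up to the number of rows in the RasterFrame, since one GeoTIFF is written per row.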
Each chip contains a lot of metadata, including the metadata columns we passed to the chip writer. We open a single chip using Rasterio to view some of the available metadata.
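sample_chip = 'chips/12/100283546777.tif'

with rasterio.open(sample_chip) as src:
    # Raster-level metadata (driver, dtype, dimensions, CRS, transform, ...)
    for k, v in src.meta.items():
        print(k, '\t', v)
    print('\n')

    # Tags written by the chip writer, including the metadata columns
    print('T A G S :')
    for k, v in src.tags().items():
        print(k, '\t', v)
    print('\n')

    # Color interpretation of each band
    print('B A N D S :')
    for b in range(1, src.count + 1):
        print("Band", b, '\t', src.colorinterp[b-1])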