{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This example introduces the use of EarthAI's chip reading function, `spark.read.chip`. Chipping is a procedure for taking an input vector representation of geometries and reading only the subset of an image that intersects with those bounds. Here, we compare the use of `spark.read.chip` and the `spark.read.raster` function, which reads in full scenes only.\n", "\n", "We demonstrate the use of `spark.read.chip` on a vector data set of hand-labelled sites and their land cover classifications to load Landsat 8 imagery from the EarthAI Catalog.\n", "\n", "*Note: if you would like to run through this example in EarthAI Notebook, you can download the companion notebook and vector data source from the attachments provided at the end of this article.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from earthai.init import *\n", "\n", "import geopandas\n", "import earthai.chipping.strategy\n", "import pyspark.sql.functions as F\n", "import ipyleaflet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The STEP Dataset\n", "\n", "The vector file we will use is the land use / land cover labelled dataset from the [System for Terrestrial Ecosystem Parameterization](http://www.gofcgold.wur.nl/sites/gofcgold_refdataportal-step.php#:~:text=The%20System%20for%20Terrestrial%20Ecosystem,%2C%20ecosystems%2C%20and%20vegetation%20types.) (STEP). This dataset includes land cover classifications for 1,983 labelled regions that was collected from very high resolution imagery in 2014. The labelled sites range from 1 to 200 square kilometers in size, and are selected for long term stability of land cover / land use. \n", "\n", "There are 17 land cover classes in this dataset. 
These classes come from the International Geosphere-Biosphere Programme (IGBP), which has detailed descriptions of each land cover type.\n", "\n", "| IGBP Class | Land Cover Description |\n", "|---|---|\n", "| 1 | Evergreen needleleaf forest |\n", "| 2 | Evergreen broadleaf forest |\n", "| 3 | Deciduous needleleaf forest |\n", "| 4 | Deciduous broadleaf forest |\n", "| 5 | Mixed forest |\n", "| 6 | Closed shrubland |\n", "| 7 | Open shrubland |\n", "| 8 | Woody savanna |\n", "| 9 | Savanna |\n", "| 10 | Grassland |\n", "| 11 | Permanent wetland |\n", "| 12 | Cropland |\n", "| 13 | Urban |\n", "| 14 | Cropland and natural vegetation mosaic |\n", "| 15 | Snow and ice |\n", "| 16 | Barren |\n", "| 17 | Water |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Read and Display STEP Dataset\n", "\n", "The STEP data can be downloaded from http://www.gofcgold.wur.nl/sites/gofcgold_refdataportal-step.php. We first upload the GeoJSON file for this data set to a subfolder called __\"data\"__, and then read the STEP data into a GeoDataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step_gdf = geopandas.read_file(\"data/step_september152014_70rndsel_igbpcl.geojson\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we are only going to use a subset of the dataset, so we can quickly explore the data. We filter **step_gdf** down to the cropland and urban sites, which are represented by the **igbp** labels 12 and 13, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step_subset_gdf = step_gdf[step_gdf.igbp.isin([12, 13])]\n", "len(step_subset_gdf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the distribution of cropland and urban sites, you can view them on a basemap using ipyleaflet. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m = ipyleaflet.Map(center=(24.9, 3.2), zoom=2)\n", "\n", "cropland_layer = ipyleaflet.GeoData(geo_dataframe=step_subset_gdf[step_subset_gdf.igbp == 12], \n", " style={'color': 'green', 'weight': 5})\n", "\n", "urban_layer = ipyleaflet.GeoData(geo_dataframe=step_subset_gdf[step_subset_gdf.igbp == 13], \n", " style={'color': 'red', 'weight': 5})\n", "\n", "legend = ipyleaflet.LegendControl({\"Cropland\":\"green\", \"Urban\":\"red\"}, position=\"topright\")\n", "\n", "m.add_layer(cropland_layer)\n", "m.add_layer(urban_layer)\n", "m.add_control(legend)\n", "m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Query Imagery at STEP Sites\n", "\n", "To obtain a catalog of imagery for each STEP site, we next pass the STEP site polygons in the **geometry** column of the **step_subset_gdf** GeoDataFrame to the `earth_ondemand.read_catalog` function. This function returns a GeoDataFrame with rows of file paths to each image that meets your query parameters. In the query below, we are going to request Landsat 8 imagery for the first half of June 2014 with a maximum of 10% cloud cover over our STEP sites." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat = earth_ondemand.read_catalog(\n", " step_subset_gdf.geometry,\n", " start_datetime='2014-06-01', \n", " end_datetime='2014-06-15',\n", " max_cloud_cover=10,\n", " collections='landsat8_l1tp'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then join the catalog to the vector data in order to match the STEP sites to the image scenes that intersect with their geometries. This step is critical for use of the chip reader." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step_cat = geopandas.sjoin(step_subset_gdf, cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**step_cat** can include multiple Landsat 8 scenes for each STEP site, taken at different dates/times. For simplicity in demonstrating chipping, we select just a single scene for each site. The code below selects the scene with the least cloud coverage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step_cat['grp_col'] = step_cat['siteid']\n", "step_cat = step_cat.sort_values('eo_cloud_cover').groupby(['grp_col']).first()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Read in Full Scenes: `spark.read.raster`\n", "\n", "`spark.read.raster` reads in entire scenes of imagery broken up into a gridded set of tiles. Each row in the returned Spark DataFrame is a set of tiles (dimensions of 256 x 256 by default) of the bands selected. Below, we use `spark.read.raster` to read in the red, green, and blue Landsat 8 bands." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scene_rf = spark.read.raster(step_cat, ['B4', 'B3', 'B2']) \\\n", " .withColumnRenamed('B4', 'red') \\\n", " .withColumnRenamed('B3', 'green') \\\n", " .withColumnRenamed('B2', 'blue')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown below, the chip dimensions are uniformly 256 x 256." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scene_rf.select('siteid', 'red', 'green', 'blue', \n", " rf_dimensions('blue'), rf_extent('blue')) \\\n", " .filter(rf_tile_max('red') > 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you count the number of rows, you will see that there are over 198,000 tiles, because we are reading in all Landsat 8 scenes that intersect with any of the STEP sites. 
Image scenes tend to be very large, so reading them all in uses a lot of memory. However, if you need imagery covering a large area surrounding each site, then it makes sense to use `spark.read.raster`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scene_rf.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Read in Chips: `spark.read.chip`\n", "\n", "The term _chipping_ refers to the process of cropping small extents from a scene, guided by input labelled vector data. Below, we use `spark.read.chip` to read in the Landsat 8 red, green, and blue bands from the same STEP data set. However, this function only reads in tiles that intersect the STEP site geometries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chip_rf = spark.read.chip(step_cat, ['B4', 'B3', 'B2'], \n", "                          chipping_strategy=earthai.chipping.strategy.IntersectingExtent()) \\\n", "    .withColumnRenamed('B4', 'red') \\\n", "    .withColumnRenamed('B3', 'green') \\\n", "    .withColumnRenamed('B2', 'blue')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above code block, the `chipping_strategy` parameter allows the user to define different ways to create image chips from the input vector geometries. We discuss these further and demonstrate different chipping strategies in [another article](TBD link). The `IntersectingExtent` option used here creates one tile per polygon, cropped to where the polygon intersects with the Landsat 8 scene.\n", "\n", "As shown below, the dimensions of each tile vary because the boundaries for each STEP site are different. 
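" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the actual tile dimensions, the crop that `IntersectingExtent` performs can be sketched with plain `shapely` (the footprints below use toy coordinates, purely for illustration): the chip extent is the bounding box of the part of the site polygon that falls inside the scene footprint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from shapely.geometry import Polygon, box\n", "\n", "# Hypothetical scene footprint and site polygon (toy coordinates)\n", "scene_extent = box(0, 0, 100, 100)\n", "site = Polygon([(90, 90), (120, 90), (120, 120), (90, 120)])\n", "\n", "# The chip covers only the part of the site that falls inside the scene\n", "chip_extent = site.intersection(scene_extent).envelope\n", "chip_extent.bounds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With real data, each chip is cropped in the same way to the intersection of its site polygon with the Landsat 8 scene. 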
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chip_rf.select('siteid', 'red', 'green', 'blue', rf_dimensions('blue')) \\\n", " .filter(rf_tile_max('red') > 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting the number of rows, we see that we are now only reading in 207 tiles. This demonstrates that when you have small polygons and only need the imagery that intersects with them, it makes more sense to use `spark.read.chip` so that you don't read more data into memory than necessary, which will speed up downstream computations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chip_rf.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# View Scene and Chip Extents\n", "\n", "The code cells below create a map that shows the extents of image tiles generated using `spark.read.raster` (blue) and `spark.read.chip` (red) for a single urban site.\n", "\n", "The chip reader creates a single, tightly-cropped chip for the site. It only returns the parts of the image scene that intersects with the site geometry, while the raster reader returns entire image scenes (gridded into smaller tiles) that intersects with the site geometry. \n", "\n", "If you only need the imagery that intersects with the site, note that the raster reader will return a lot of unnecessary imagery that may slow down data processing in your notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scene_df = scene_rf.filter(F.col('siteid') == 108763744000) \\\n", " .withColumn('extent_4326', st_reproject(st_geometry(rf_extent('red')), rf_crs('red'), lit('EPSG:4326'))) \\\n", " .select('extent_4326').toPandas()\n", "\n", "scene_gdf = geopandas.GeoDataFrame(scene_df, geometry=\"extent_4326\", crs=\"EPSG:4326\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chip_df = chip_rf.filter(F.col('siteid') == 108763744000) \\\n", " .withColumn('extent_4326', st_reproject(st_geometry(rf_extent('red')), rf_crs('red'), lit('EPSG:4326'))) \\\n", " .select('extent_4326').toPandas()\n", "\n", "chip_gdf = geopandas.GeoDataFrame(chip_df, geometry=\"extent_4326\", crs=\"EPSG:4326\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m1 = ipyleaflet.Map(center=(44.0, 88.0), zoom=9)\n", "\n", "raster_layer = ipyleaflet.GeoData(geo_dataframe=scene_gdf, style={'color': 'blue'}, name='Raster Outlines')\n", "chip_layer = ipyleaflet.GeoData(geo_dataframe=chip_gdf, style={'color': 'red'}, name='Chip Outlines')\n", "\n", "m1.add_layer(raster_layer)\n", "m1.add_layer(chip_layer)\n", "m1.add_control(ipyleaflet.LayersControl())\n", "m1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned previously, `spark.read.chip` includes several different types of chipping strategies, which we discuss [in this companion article](https://astraeahelp.zendesk.com/hc/en-us/articles/TBD)." 
] } ], "metadata": { "kernelspec": { "display_name": "EarthAI Environment", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "zendesk": { "draft": true, "id": 360043452652, "section_id": 360008732711, "title": "Chipping in EarthAI" } }, "nbformat": 4, "nbformat_minor": 4 }