{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Crop Type Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use multispectral Landsat 8 imagery acquired during the growing season to classify the type of crop growing in known crop fields.\n", "\n", "The main components of this job are very similar to those in the [supervised learning example](https://rasterframes.io/supervised-learning.html) shown in the RasterFrames documentation.\n", "There are two key differences: the target data is already in the form of a raster,\n", "and this raster is in a different [CRS](https://rasterframes.io/concepts.html#coordinate-reference-system-crs-) than the imagery.\n", "To deal with this, we will perform a _raster join_ between the two DataFrames." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Create a SparkSession with options to improve the way tasks are spread across workers.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": null, "option_string": "", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "from earthai.all import *\n", "import pyspark.sql.functions as F\n", "parallelism = 1000\n", "spark = create_earthai_spark_session(**{\n", " 'spark.default.parallelism': parallelism,\n", " 'spark.sql.shuffle.partitions': parallelism,\n", " 'spark.driver.memory': '4G'\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Pull Landsat 8 from Earth OnDemand\n", "\n", "We can use the `earth_ondemand.collections` and `earth_ondemand.bands` functions to learn more about the data.\n", 
"Here we will query Landsat 8 Level-1 (L1TP) data using the Operational Land Imager (OLI) bands B2 through B7, which cover the visible, near-infrared, and shortwave-infrared ranges and leave out the ultra-blue band.\n", "We also pull the BQA band in order to mask out cloudy pixels.\n", "\n", "We use the `earth_ondemand.grid` package to specify a desired Landsat product grid in place of a spatial query.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "read_feature_raster", "option_string": "name = \"read_feature_raster\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "from earthai.earth_ondemand.grid import LandsatGrid\n", "\n", "catalog = earth_ondemand.read_catalog(\n", " max_cloud_cover=10,\n", " collections='landsat8_l1tp',\n", " start_datetime='2018-07-01T00:00:00',\n", " end_datetime='2018-08-31T23:59:59',\n", " grid_ids=LandsatGrid(30, 27)\n", ")\n", "\n", "df1 = spark.read.raster(\n", " catalog,\n", " catalog_col_names=['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'BQA'],\n", " lazy_tiles=False\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Cloud Masking\n", "\n", "As discussed in detail [here](https://astraeahelp.zendesk.com/hc/en-us/articles/360043452492), we will mask for high-probability clouds, high-probability cloud shadows, and fill areas.\n", "\n", "In this job, it is sufficient to mask a single band. This will be discussed later when we apply the [`TileVectorizer`](https://astraeahelp.zendesk.com/hc/en-us/articles/360043452472).\n", "\n", "\n", "After masking we drop the BQA column because we no longer need it, and we do not want it to be included in our feature set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "masking", "option_string": "name = \"masking\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "df2 = df1.select(\n", " rf_interpret_cell_type_as('B2', 'uint16').alias('blue'),\n", " rf_interpret_cell_type_as('B3', 'uint16').alias('green'),\n", " rf_interpret_cell_type_as('B4', 'uint16').alias('red'),\n", " rf_interpret_cell_type_as('B5', 'uint16').alias('nir'),\n", " rf_interpret_cell_type_as('B6', 'uint16').alias('swir1'),\n", " rf_interpret_cell_type_as('B7', 'uint16').alias('swir2'),\n", " 'BQA'\n", ")\n", "\n", "df_masked = df2.withColumn('blue_masked', # cloud, high probability (bits 5-6 == 3)\n", " rf_mask_by_bits('blue', 'BQA', 5, 2, [3])) \\\n", " .withColumn('blue_masked', # cloud shadow, high probability (bits 7-8 == 3)\n", " rf_mask_by_bits('blue_masked', 'BQA', 7, 2, [3])) \\\n", " .withColumn('blue_masked', # designated fill (bit 0 set)\n", " rf_mask_by_bit('blue_masked', 'BQA', 0, 1)) \\\n", " .filter(rf_data_cells('blue_masked') > 0) \\\n", " .drop('blue', 'BQA') \\\n", " .cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Read Crop Target\n", "\n", "Let's inspect the metadata about our crop target. The data come from the [USDA Cropland Data Layer](https://nassgeodata.gmu.edu/CropScape/).\n", "\n", "In this raster the value 100 indicates areas that are not crop fields. 
We will exclude them from our analysis by setting them to NoData.\n", "This also means that our ML model will not have been trained on a background class.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": null, "option_string": "", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "target_url = 'https://s22s-test-geotiffs.s3.amazonaws.com/crop_class/scene_30_27_target.tif'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": false, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": null, "option_string": "evaluate=False", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "! 
gdalinfo /vsicurl/{target_url} 2> /dev/null" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "read_target_raster", "option_string": "name = \"read_target_raster\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "df3 = spark.read.raster(target_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "target_mask", "option_string": "name = \"target_mask\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "target_df = df3.select(rf_mask_by_value('proj_raster', 'proj_raster', 100).alias('target')).cache()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Join Features and Target\n", "\n", "The _raster join_ operation uses the embedded spatial information from the raster read to bring rows from each table together. 
Then each right-hand-side record that spatially intersects the left-hand side is reprojected (warped) to align with the grid of the left-hand side.\n", "\n", "The _raster join_ is a left outer join, so we will filter away any null rows.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "raster_join", "option_string": "name = \"raster_join\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "joined = df_masked.raster_join(target_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Feature Creation\n", "\n", "We will compute some commonly used derived features for this kind of task, and put the DataFrame in the final form for training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "features", "option_string": "name = \"features\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "analytics_base_table = joined.select(\n", " 'crs', 'extent', \n", " rf_tile('green').alias('green'), rf_tile('red').alias('red'), rf_tile('nir').alias('nir'),\n", " rf_tile('blue_masked').alias('blue_masked'),\n", " rf_tile(rf_normalized_difference('nir', 'red')).alias('ndvi'),\n", " rf_tile(rf_local_divide(rf_local_add('swir1', 'swir2'), F.lit(2.0))).alias('mean_swir'),\n", " rf_tile(rf_normalized_difference('nir', 'swir1')).alias('ndwi'),\n", " rf_tile('target').alias('target')\n", ")\n", "display(analytics_base_table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Train - Test Split\n", "\n", "We will divide the rows of the resulting DataFrame into an approximately 70/30 train/test split.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df, test_df = analytics_base_table.randomSplit([.7, .3], 2345)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Create ML Pipeline\n", "\n", "The key components of the ML pipeline are:\n", "\n", "* [`TileVectorizer`](https://astraeahelp.zendesk.com/hc/en-us/articles/360043452472) - packs Tile cells into ML Vectors\n", "* [`StringIndexer`](https://spark.apache.org/docs/latest/ml-features.html#stringindexer) - indexes the categorical labels\n", "* [`DecisionTreeClassifier`](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-trees) - fits the classification model\n", "\n", "\n" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": true, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "pipeline", "option_string": "name = \"pipeline\"", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "\n", "from earthai.transformers import TileVectorizer\n", "\n", "from pyspark.ml.feature import StringIndexer\n", "from pyspark.ml.classification import DecisionTreeClassifier\n", "from pyspark.ml import Pipeline\n", "\n", "exploder = TileVectorizer().setFilterNA(True).setTargetLabelCol('target')\n", "\n", "labelIndexer = StringIndexer() \\\n", " .setInputCol('target') \\\n", " .setOutputCol('indexedTarget')\n", "\n", "classifier = DecisionTreeClassifier(maxDepth=5) \\\n", " .setLabelCol('indexedTarget') \\\n", " .setFeaturesCol('features')\n", "\n", "pipeline = Pipeline() \\\n", " .setStages([exploder, labelIndexer, classifier])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Train the Model\n", "\n", "This may take two or three minutes to evaluate on the Extra Large launch option.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": false, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "fit_model", "option_string": "name = \"fit_model\", evaluate=False", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "model = pipeline.fit(train_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Optionally, 
you can save the model to read in later to make predictions.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": false, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "save_pipe", "option_string": "name = \"save_pipe\", evaluate=False", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "model.write().overwrite().save('crop_model/decision_tree')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Evaluate the Model\n", "\n", "Key:\n", "- 0 = other\n", "- 1 = corn\n", "- 2 = soybeans\n", "- 3 = alfalfa\n", "- 4 = durum/spring wheat\n", "- 5 = sugar beets\n", "- 6 = dry beans\n", "\n", "\n", "\n", "Note that it may take about one minute to evaluate the `toPandas` statement.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, "complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": false, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": "predict_test", "option_string": "name = \"predict_test\", evaluate=False", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "prediction_df = model.transform(test_df)\n", "vals = prediction_df.select(classifier.getPredictionCol(), classifier.getLabelCol()).toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "autoscroll": "auto", "collapsed": false, "jupyter": { "outputs_hidden": false }, "options": { "caption": false, 
"complete": true, "display_data": true, "display_stream": true, "dpi": 200, "echo": true, "evaluate": false, "f_env": null, "f_pos": "htpb", "f_size": [ 6, 4 ], "f_spines": true, "fig": true, "include": true, "name": null, "option_string": "evaluate=False", "results": "verbatim", "term": false, "wrap": "output" } }, "outputs": [], "source": [ "from earthai.ml import *\n", "showConfusionMatrix(vals.prediction.values, vals.indexedTarget.values)\n", "printOverallStats(vals.prediction.values, vals.indexedTarget.values)\n", "printClassStats(vals.prediction.values, vals.indexedTarget.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] } ], "metadata": { "kernel_info": { "name": "echo" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "zendesk": { "id": 360043452532, "position": 30 } }, "nbformat": 4, "nbformat_minor": 4 }