This article describes some first steps in loading, analyzing, and visualizing imagery data in EarthAI Notebook. We will first generate a new query that will be used in a simple example to demonstrate the types of raster-based DataFrame analyses that can be carried out in EarthAI.
First, log in to your account and open an EarthAI Notebook. In the first cell we’ll run an import statement for the necessary libraries and modules.
from earthai.init import *
As a refresher, a * appears in the brackets beside the cell while the import is executing. When it finishes, the brackets show a number, along with output indicating that a SparkSession has been created.
In this example we will generate a query using the EarthAI Catalog API. This process was covered in detail in a separate article. If you want to know what data collections are available, you can run the following in a new cell:
earth_ondemand.collections()
We are going to carry out a normalized difference vegetation index (NDVI) calculation over the Yellowstone National Park region. NDVI is a calculated quantity derived from the red and near-infrared (NIR) radiation bands measured by remote sensing platforms. NDVI provides an index for classifying land cover with a focus on vegetation. In this case, we’ll be using data from the MODIS spectroradiometer onboard the Terra and Aqua satellites. There are multiple MODIS collections available. For our analysis we’ll use the MCD43A4 product; note from the "id" column in the collections DataFrame above that this product's collection identification is "mcd43a4".
In a new code cell run the following and, by convention, assign the query result to the variable name catalog.
# Illustrative parameters; the exact query was covered in the earlier Catalog API
# article. yellowstone_geom stands in for a geometry covering the park region,
# and the dates are assumed from the two-month window used in this example.
catalog = earth_ondemand.read_catalog(
    geo=yellowstone_geom,
    start_datetime='2018-06-01',
    end_datetime='2018-07-31',
    collections='mcd43a4'
)
The catalog is a DataFrame with a shape of (62, 40); that is, there are 62 rows and 40 columns. You can run
catalog.columns to see a list of the columns which are the attributes of the data. Included in these attributes are the radiation bands that we will use to calculate NDVI and need to access. The rows represent images acquired by MODIS, or scenes. Scenes are raster data with a specific spatial extent, date-time, and coordinate reference system (CRS). You can check the CRS of the catalog by running
catalog.crs. MODIS images the planet once every day, so the 62 rows represent daily images of our region over the two-month period of the query. catalog is thus a collection of 62 scenes, with 40 attributes for each.
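Taken together, the quick checks described above might look like this in a single cell (a sketch, assuming the catalog variable from the query step):

```python
print(catalog.shape)    # (62, 40): 62 daily MODIS scenes, 40 attributes
print(catalog.columns)  # attribute names, including the 'B01' and 'B02' bands
print(catalog.crs)      # the coordinate reference system of the scenes
```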
For the NDVI analysis we only need to read in the red and NIR bands. They have the non-intuitive names of ‘B01’ and ‘B02’, but those can be renamed for clarity when they are selected as we’ll do below.
We are going to read these bands in as a Spark DataFrame. The differences and pros/cons between Spark DataFrames and GeoDataFrames will be covered in a later article. In this example we choose to load the raster data into a Spark DataFrame in order to demonstrate geospatial operations in Spark.
The code below takes the rows of MODIS scenes from catalog and loads the imagery data into a Spark DataFrame called bandsNDVI. We select only the "B01" (red) and "B02" (NIR) columns, which avoids reading in bands that we do not intend to use. We want to rename the columns to something a bit more descriptive. This is accomplished by the
withColumnRenamed method that is chained onto the original
spark.read.raster function. Method chaining can be thought of as a “do this then that” process from left to right. In this case the columns of interest are read and then each is renamed. If you’ve not seen it before, the backslash at the end of each line signifies that the code continues onto the next line.
bandsNDVI = spark.read.raster(catalog, catalog_col_names=['B01', 'B02']) \
    .withColumnRenamed('B01', 'red') \
    .withColumnRenamed('B02', 'nir')
You can run
type(bandsNDVI) and see it is now a Spark DataFrame, and running
bandsNDVI.columns shows the renamed columns.
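These two checks, as a single cell:

```python
print(type(bandsNDVI))    # pyspark.sql.dataframe.DataFrame
print(bandsNDVI.columns)  # includes the renamed 'red' and 'nir' columns
```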
So far these procedures have all dealt with MODIS imagery, but we’ve yet to see any of it visually. We can run the
select method on bandsNDVI to return visual representations of the "red" and "nir" bands. We’ll also select the "id" and "datetime" columns, and return the extent and CRS using the functions
rf_extent() and rf_crs(). We’ll also give them aliases for clarity in the output. Running this cell may take a few seconds, and it returns the top five rows of the selection.
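A sketch of that cell, assuming the RasterFrames functions rf_extent() and rf_crs() applied to one of the raster columns:

```python
# Select identifying columns, the derived extent/CRS, and the two band tiles;
# in an EarthAI notebook the result renders with tile previews.
bandsNDVI.select('id', 'datetime',
                 rf_extent('red').alias('extent'),
                 rf_crs('red').alias('crs'),
                 'red', 'nir')
```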
After running the cell you should see a DataFrame, now with images. Double-clicking an image enlarges it a bit. These images are false-color representations of the red and NIR bands. More specifically, each of these images represents a tile contained within a scene. In this case the scene is from July 7, 2018, as evidenced by the rows all sharing the same datetime, id, and crs. What differs for each is the extent: you are seeing how a scene is represented as a group of smaller tiles.
Now we will use the tile data from the red and NIR scenes to calculate the NDVI. NDVI is the ratio of (NIR - red) to (NIR + red) and ranges from -1.0 to 1.0. This calculation will take place in every cell or pixel in the tiles. The resulting NDVI is itself a tile of raster data. EarthAI has a built-in function
rf_normalized_difference() that performs the calculations with the two bands as input parameters. The code below calculates this and appends it to the bandsNDVI DataFrame.
bandsNDVI = bandsNDVI.withColumn('ndvi', rf_normalized_difference('nir', 'red'))
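As a minimal plain-Python illustration of the per-pixel arithmetic behind rf_normalized_difference() (this sketch is not part of the EarthAI API, and the pixel values are made up):

```python
def normalized_difference(nir, red):
    """Per-pixel NDVI formula: (nir - red) / (nir + red)."""
    return (nir - red) / (nir + red)

print(normalized_difference(0.45, 0.05))  # dense vegetation: 0.8
print(normalized_difference(0.05, 0.45))  # water or bare ground: -0.8
```

The built-in function applies this same arithmetic cell-by-cell across entire tiles rather than to single values.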
You can check to see that NDVI is appended to the DataFrame by again running:
bandsNDVI.columns
and see that it is listed at the end. You can also visualize the newly created NDVI tiles by running:
bandsNDVI.select('ndvi')
Finally, we’ll look at how NDVI changed over the course of the time period we used for the original query.
First, we’ll import the libraries needed for this. One quick note: all of the needed libraries could have been imported in the first cell of the notebook; they do not need to be imported piecemeal throughout the notebook.
import pyspark.sql.functions as F
import matplotlib.pyplot as plt
%matplotlib inline
The last statement ensures that plots appear in the notebook and not in a separate window.
The time series for change detection can be generated in any number of ways, but here we will calculate weekly averages of NDVI over the two months of the period. We’ll first group the data by year (trivial in this case, since everything falls in a single year) and then group it by week of the year. Finally, we’ll compute the mean NDVI for each week and label it “mean_ndvi”. This may take some time to complete after running.
# The aggregation uses rf_agg_mean to reduce each group's NDVI tiles to a
# scalar mean; parameter choices here are a sketch consistent with the text.
time_series = bandsNDVI.groupBy(F.year('datetime').alias('year'),
                                F.weekofyear('datetime').alias('week')) \
                       .agg(rf_agg_mean('ndvi').alias('mean_ndvi'))
ts_pd = time_series.toPandas()
Let’s unpack this a bit before proceeding. The code above applies two operations to the bandsNDVI DataFrame. First, it groups the data by year and week, unpacking the "datetime" attribute with PySpark functions designed to parse datetime values. Next, it calculates the mean of the NDVI values within each group. Finally, we convert the time_series Spark DataFrame to a Pandas DataFrame to make plotting easier. If you wish, you can check the type of ts_pd to confirm the conversion.
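Checking the conversion:

```python
print(type(ts_pd))  # pandas.core.frame.DataFrame
```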
Below is the code to plot our new weekly averages of NDVI. Much of it may be hard to follow if you have not used Matplotlib before. This is fine; at this point it is more to show you that we calculated mean NDVI from imagery and can display it. Data visualization in Python is a rich and extensive topic. The Matplotlib module that we’ve used here is quite verbose, but gives you a great deal of control over the graphics. There are numerous other libraries that abstract this verbosity away and allow for dynamic graphics, dashboard development, and web deployment. A future article will survey the options for geospatial data visualization.
plt.plot(ts_pd.week, ts_pd.mean_ndvi, 'g*-')
plt.xlabel('Week of 2018')
plt.ylabel('Mean NDVI')
The resulting plot shows an overall decreasing trend in average NDVI over the time period. This represents a de-greening of the land, which could be a result of drought, wildfire, forest mortality, the slowing of seasonal photosynthetic activity, or other causes. The take-home points of this article are that raster imagery was queried and loaded, used to calculate NDVI, and visualized as a time series to detect changes across the scenes.