## What is Masking?

When processing Earth observation (EO) data, _masking_ is a common operation that marks certain pixel locations as _null_ (NoData) values. For more discussion, see the [RasterFrames masking page](https://rasterframes.io//masking.html).

## MODIS MCD43A4


Let's demonstrate masking on the red, green, and blue bands of [MODIS MCD43A4](https://lpdaac.usgs.gov/products/mcd43a4v006/) data. Each MCD43A4 band has quality information stored in a separate QA band.

The specifications for the QA band are as follows:

```
Data conversions:
 Mandatory QA
   0 = processed, good quality (full BRDF inversions)
   1 = processed, see other QA (magnitude BRDF inversions)
   255 = Fill Value
```

To mask out poor-quality pixels, we keep a pixel only where its QA value is 0.
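
As a minimal sketch of this rule, the plain-Python snippet below encodes the three QA values from the specification and flags which pixels would be kept. The names `QA_MEANINGS`, `is_good_quality`, and `sample_qa` are illustrative assumptions and are not part of RasterFrames or Earth OnDemand.

```python
# Meanings of the Mandatory QA values, copied from the specification above.
QA_MEANINGS = {
    0: "processed, good quality (full BRDF inversions)",
    1: "processed, see other QA (magnitude BRDF inversions)",
    255: "Fill Value",
}


def is_good_quality(qa_value):
    """Return True only for QA value 0, the one value we want to keep."""
    return qa_value == 0


# Hypothetical QA values for four pixels (illustrative data only).
sample_qa = [0, 1, 255, 0]
keep = [is_good_quality(q) for q in sample_qa]
print(keep)  # [True, False, False, True]
```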

Using Earth OnDemand (EOD), let's query some MODIS imagery: the three measurement bands along with their QA bands. First, let's see which image collections are available to us.




In [None]:
from earthai.init import *

from shapely.geometry import Point
from pyspark.sql.functions import col, lit

In [None]:
earth_ondemand.collections()


In the DataFrame above, note that the collection we want has an id of `mcd43a4`. Let's see which band names correspond to red, green, and blue.




In [None]:
earth_ondemand.item_assets('mcd43a4').sort_values('asset_name')


The bands for red, green, and blue are `B01`, `B04`, and `B03` respectively. We also see that each band has a QA band named `B0Xqa`, where `X` is the band number.

Now that we know the appropriate collection id and band names, let's perform our query. We will use a Shapely point geometry near Charlottesville, Virginia, USA to query an Earth OnDemand catalog. Then we will read the bands we want into a Spark DataFrame named `df`.




In [None]:
# Query the catalog for MCD43A4 scenes covering the point of interest
catalog = spark.read.earth_ondemand_catalog(
    geo=Point(-78.461530, 38.039243),
    start_datetime='20190901',
    end_datetime='20190905',
    collections='mcd43a4',
    max_cloud_cover=10
)
band_names = ['B01', 'B04', 'B03',
              'B01qa', 'B04qa', 'B03qa']

df = spark.read.raster(catalog, catalog_col_names=band_names)

# Keep only the tile columns and rename them
df = df.select(
    col('B01').alias('red'),
    col('B01qa').alias('red_qa'),
    col('B04').alias('green'),
    col('B04qa').alias('green_qa'),
    col('B03').alias('blue'),
    col('B03qa').alias('blue_qa'),
)


MCD43A4 measurement bands already have a NoData value defined, because the product is the output of a time-composited model of surface reflectance. We can inspect this as shown below. The cell type `int16ud32767` indicates that the value 32767 is interpreted as NoData. Having a NoData value defined is what allows us to do the masking.




In [None]:
df.select(rf_cell_type('red'), rf_cell_type('green'), rf_cell_type('blue'))
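
To make the cell type name concrete, here is a small sketch that splits a name such as `int16ud32767` into its base type and NoData value. It assumes the `ud` suffix marks a user-defined NoData value; `parse_cell_type` is a hypothetical helper written for this illustration, not a RasterFrames function, and it only handles names of this one form.

```python
import re


def parse_cell_type(name):
    """Split a name like 'int16ud32767' into (base type, NoData value).

    Hypothetical helper: only names with an explicit 'ud<value>' suffix
    are accepted; anything else raises ValueError.
    """
    match = re.fullmatch(r"([a-z]+\d+)ud(-?\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized cell type: {name!r}")
    base, nodata = match.groups()
    return base, int(nodata)


print(parse_cell_type("int16ud32767"))  # ('int16', 32767)
```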


Let's now use `rf_mask_by_value` to set the measurement band to NoData at every pixel location where the QA band is 1.




In [None]:
masked = df.withColumn('red_masked', rf_mask_by_value('red', 'red_qa', 1)) \
 .withColumn('green_masked', rf_mask_by_value('green', 'green_qa', 1)) \
 .withColumn('blue_masked', rf_mask_by_value('blue', 'blue_qa', 1))
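
The effect of masking by value can be illustrated on a tiny tile in plain Python. This is only a sketch of the semantics described above, not the RasterFrames implementation; the function `mask_by_value` and the sample tiles are assumptions made up for the example.

```python
NODATA = 32767  # NoData value taken from the int16ud32767 cell type


def mask_by_value(data, mask, mask_value, nodata=NODATA):
    """Set data cells to nodata wherever the mask cell equals mask_value."""
    return [
        [nodata if m == mask_value else d for d, m in zip(data_row, mask_row)]
        for data_row, mask_row in zip(data, mask)
    ]


# Hypothetical 2x3 reflectance tile and its QA tile (illustrative data only).
red = [[120, 340, NODATA],
       [560, 780, 910]]
red_qa = [[0, 1, 255],
          [1, 0, 0]]

red_masked = mask_by_value(red, red_qa, 1)
print(red_masked)  # [[120, 32767, 32767], [32767, 780, 910]]
```

In this made-up tile, the cell whose QA value is 255 is left untouched by the mask; it is NoData in the output only because it was already NoData in the input.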


Use the filter below to inspect tiles in which many pixels were masked out, here those with more than 2500 NoData cells in both the red and blue masked bands.




In [None]:
masked.select('red_qa', 'red', 'red_masked', 'blue_qa', 'blue', 'blue_masked') \
 .filter(rf_no_data_cells('red_masked') > 2500) \
 .filter(rf_no_data_cells('blue_masked') > 2500)
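
The filter logic can be sketched in plain Python as counting NoData cells per tile and keeping the tiles that exceed a threshold. The helper `no_data_cells`, the sample tiles, and the small threshold are all assumptions chosen for illustration; they stand in for `rf_no_data_cells` and the value 2500 used above.

```python
NODATA = 32767  # NoData value taken from the int16ud32767 cell type


def no_data_cells(tile, nodata=NODATA):
    """Count the cells in a tile (list of rows) equal to the NoData value."""
    return sum(cell == nodata for row in tile for cell in row)


# Hypothetical 2x2 tiles keyed by name (illustrative data only).
tiles = {
    "tile_a": [[NODATA, NODATA], [NODATA, 5]],
    "tile_b": [[1, 2], [3, NODATA]],
}

threshold = 2  # small stand-in for the 2500 used in the filter above
heavily_masked = [
    name for name, tile in tiles.items() if no_data_cells(tile) > threshold
]
print(heavily_masked)  # ['tile_a']
```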