In this article, we provide an introduction to Pandas, a data analysis library built on top of NumPy.

Pandas Series, from which DataFrames can be constructed, are built on NumPy arrays. In addition to the wide range of ways Pandas DataFrames can be manipulated, one major advantage Pandas has over NumPy is that indexes and columns can have labels provided by the user. This allows Pandas Series and DataFrames to be more expressive in the information they convey.

As NumPy and Pandas are tightly coupled, they are commonly imported together. The typical convention is to alias numpy as np and pandas as pd.

import numpy as np
import pandas as pd

# Pandas Series

Before exploring the Pandas DataFrame object, we'll first build up to it with the simpler Pandas Series. You can convert a common Python list, a NumPy array, or a Python dictionary to a Pandas Series. First, let's create those objects and see how they are converted to a Pandas Series.

my_list = [2, 23, 42, 538, 1024]  # python list
my_labels = ['A', 'B', 'C', 'D', 'E']  # another python list
my_array = np.array([1, 1, 2, 3, 5])  # numpy array
my_dict = {'foo': 15, 'bar': 23, 'hi': 72}  # python dictionary

We can use the `pd.Series` constructor in Pandas and pass in a list to create a Pandas Series object.

my_series = pd.Series(data=my_list)

my_series

type(my_series)

We have now created **my_series**, which is of type pandas.core.series.Series, or a Pandas Series. You'll also notice there is an index ranging from 0 to 4. One other thing to note is that we assigned the list to the `data` parameter by name. This isn't strictly required, since `data` is the first positional argument, but naming it keeps things clear because, as mentioned above, we can also pass in labels for our data:

my_labeled_series = pd.Series(data=my_list, index=my_labels)
my_labeled_series

Now the list values have the labels we assigned to them. The real advantage is that we can now retrieve values in the Series by labels that may be more intuitive. Let's change our labels to student names.

student_names = ['Grumpy', 'Sneezy', 'Sleepy', 'Happy', 'Bashful']

my_new_series = pd.Series(data=my_list, index=student_names)

my_new_series

Let's say we wanted the value for Sleepy. With a NumPy array, we would need to know the position of Sleepy's value. That may not be much to ask in a list of five values, but with thousands it could pose a problem. In Pandas, we can access Sleepy's value by passing in his name as follows:

sleepy_value = my_new_series['Sleepy']
sleepy_value

With a Pandas Series, we just need to pass in the index name or label to access a value.
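You can also pass a list of labels in the brackets to retrieve several values at once. A minimal, self-contained sketch, reusing the example data from above:

```python
import pandas as pd

# Same example data as in the article
my_list = [2, 23, 42, 538, 1024]
student_names = ['Grumpy', 'Sneezy', 'Sleepy', 'Happy', 'Bashful']
my_new_series = pd.Series(data=my_list, index=student_names)

# Passing a list of labels returns a new Series with just those entries
subset = my_new_series[['Sleepy', 'Happy']]
```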

You can also create a Pandas Series from a Python Dictionary or NumPy Array:

dict_to_series = pd.Series(my_dict)  # the dictionary is constructed with labels already in place
arr_to_series = pd.Series(my_array, my_labels)  # NumPy arrays have default numerical labels only

# Pandas DataFrames

DataFrames are a convenient way to interact with Earth observation data in EarthAI. Pandas DataFrames are similar to DataFrames in R if you are familiar with those. They can be thought of as objects that combine multiple Series together. More fundamentally, they are a 2-dimensional grid of data with labeled rows and columns. It may be most useful to just build one and take a look at it.

my_names = ['Larry', 'Moe', 'Curly', 'Shemp']
my_tests = ['test1', 'test2', 'test3', 'test4']

from numpy.random import randn
np.random.seed(42)

The first two lines of code create lists that we will use to name the rows and columns of our DataFrame. The third line imports the `randn` random number generator from NumPy. The fourth line is not strictly necessary, but setting the seed lets you generate the same set of random numbers repeatedly. Below we'll use the Pandas `pd.DataFrame` constructor to build a 4-by-4 DataFrame with rows and columns named from the lists above:

my_df = pd.DataFrame(randn(4, 4), index=my_names, columns=my_tests)
my_df

So now we've created **my_df** that has the appearance of a spreadsheet with labeled rows and columns. We can check the type of **my_df** and see that it is of type pandas.core.frame.DataFrame or more simply a Pandas DataFrame object.

type(my_df)

One of the first things you'll likely do is run the `head` method on a new DataFrame. By default, this returns the first 5 rows of the DataFrame, though you can pass an argument to control how many rows it returns. This method lets you quickly see the structure of a DataFrame: which columns are included and what form the data takes (float, string, etc.). Our DataFrame only has four rows, so it will just return the full DataFrame. It is particularly useful on very large DataFrames.

my_df.head()
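As noted, `head` accepts a row count. A self-contained sketch, rebuilding the same seeded DataFrame as above:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(4, 4),
                  index=['Larry', 'Moe', 'Curly', 'Shemp'],
                  columns=['test1', 'test2', 'test3', 'test4'])

# head(2) returns only the first two rows
first_two = df.head(2)
```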

seriesFromDf = my_df['test1']
seriesFromDf

type(seriesFromDf)

The above code shows that Pandas DataFrames are indeed made up of a set of Pandas Series. We pulled out the **test1** Series and assigned it to a variable, **seriesFromDf**. More importantly, it demonstrates how to subset data in a DataFrame. Earth OnDemand queries often contain many columns or attributes, such as spectral bands, ids, datetimes, coordinate reference systems, and much more. Most of the time we are only interested in a smaller selection of the data, so for computational, memory, and simple visual reasons it does not make sense to carry extra information around in the complete DataFrames. The above code used bracket notation to select the column **test1**. If we want multiple columns, we can just pass a list into the brackets.

multiCols = my_df[['test2', 'test4']]
multiCols

You may want to create a new column in your DataFrame that represents some sort of combination of spectral bands, for instance, which is commonplace in EarthAI analyses. In this case, let's sum all of the tests into a new column:

my_df['sum'] = my_df['test1'] + my_df['test2'] + my_df['test3'] + my_df['test4']
my_df

Dropping a column is just as simple:

my_df.drop('sum', axis = 1)

my_df

We used the `drop` method on our DataFrame, specifying the column name and `axis=1`. `axis=1` refers to columns; `axis=0` refers to rows. But when we take another look at **my_df** in the following cell, the column **sum** is still there, so the drop was not permanent. To make it permanent you need to add `inplace=True` as shown below. Another way to make the drop stick is to assign the result to a new variable.

my_df.drop('sum', axis=1, inplace=True) # now the column drop will be permanent

my_df
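The assign-to-a-new-variable alternative mentioned above can be sketched as follows, using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical two-column DataFrame with a computed 'sum' column
df = pd.DataFrame({'test1': [1, 2], 'test2': [3, 4]})
df['sum'] = df['test1'] + df['test2']

# drop returns a new DataFrame; assigning it leaves the original intact
trimmed = df.drop('sum', axis=1)
```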

We can likewise drop rows:

my_df.drop('Shemp', axis=0)

We previously selected columns by simply passing the column name(s) in brackets. That does not work for rows. Instead, we need to use a different method as below. Let's look at Moe's scores.

moe = my_df.iloc[1]
moe

moe_again = my_df.loc['Moe']
moe_again

If we know the index position, we can pass it to `iloc`, the integer-location method. Or, more straightforwardly, we can just pass the row name "Moe" to the `loc` method.

Or, if we are only interested in Moe's **test1** score, we just pass in the `['row', 'column']` coordinates:

my_df.loc['Moe', 'test1']
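`loc` also accepts lists of labels, so you can pull several rows and columns at once. A self-contained sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'test1': [10, 20, 30], 'test2': [1, 2, 3]},
                  index=['Larry', 'Moe', 'Curly'])

# Lists of row and column labels return a sub-DataFrame
subset = df.loc[['Moe', 'Curly'], ['test1']]
```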

You can also easily check the DataFrame for values that meet some specified condition. In our DataFrame here we have both positive and negative floats. Let's say we're interested in the positive values.

my_df > 0

my_df[my_df > 0]

The first code cell above replaces the values with booleans: True for where the condition is met and False otherwise. The second code cell returns the values in place that meet the condition and NaN (not a number) otherwise. We'll see how to handle those NaNs below.

You can also perform a conditional selection on individual columns of the DataFrame:

my_df[my_df['test1'] > 0]

In this case, a DataFrame is returned excluding those rows where **test1** failed to meet the condition.
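Column conditions can also be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. A self-contained sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'test1': [0.5, -0.3, 1.2],
                   'test2': [-0.1, 0.8, 0.4]})

# Keep only rows where both columns are positive
both_positive = df[(df['test1'] > 0) & (df['test2'] > 0)]
```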

my_df

Performing a conditional test on a DataFrame does not change the DataFrame itself. The results of the conditional tests can be assigned to new variables if they need to be retained.

new_test = my_df[my_df > -0.2]
new_test

We've constructed a new DataFrame based on the condition that values are greater than -0.2 and see there are a handful of NaN values. A quick way to deal with NaN is the `dropna` method:

new_test.dropna()

By default, this only returns rows in the DataFrame without any NaN values. We could do the same for columns by setting `axis=1` in `dropna`. In both cases, a great deal of data is lost. A more common use case is to replace the NaN values with some other value:

new_test.fillna(value=2)

The above code cell fills any NaN value with the value 2 instead. Or you can fill with something like the mean of the **new_test** DataFrame.

new_test.fillna(value=new_test.mean())

So there are ways of dealing with NaN values in DataFrames that do not compromise other valid data.
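For completeness, the column-wise drop mentioned above (`dropna` with `axis=1`) can be sketched as follows, with a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

# Column 'a' contains a NaN; column 'b' does not
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

# axis=1 drops any *column* containing a NaN, keeping all rows
no_nan_cols = df.dropna(axis=1)
```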

Next, let's create some basketball data to demonstrate a few other nice properties of Pandas DataFrames.

hoopsData = {'Team': ['Bulls', 'Bulls', 'Lakers', 'Lakers', 'Pacers', 'Pacers'],
             'Player': ['Jordan', 'Pippen', 'Johnson', 'James', 'Miller', 'Oladipo'],
             'Points': [50, 25, 26, 40, 33, 30]}
hoopsDF = pd.DataFrame(hoopsData)
hoopsDF

In this case, we have multiple entries for each team that we can group together and operate on. The `groupby` method is useful when you want to aggregate values by a group. Below we'll compute the total points for each team.

hoopsDF.groupby('Team').sum()
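If you only want to aggregate one column, you can select it before aggregating. A self-contained sketch using a subset of the data above:

```python
import pandas as pd

hoopsDF = pd.DataFrame({
    'Team': ['Bulls', 'Bulls', 'Lakers', 'Lakers'],
    'Player': ['Jordan', 'Pippen', 'Johnson', 'James'],
    'Points': [50, 25, 26, 40],
})

# Selecting the Points column first restricts the aggregation to it
points_by_team = hoopsDF.groupby('Team')['Points'].sum()
```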

hoopsDF.groupby('Team').describe() # perform some basic statistics for each team

Pandas also allows you to apply functions you have written to columns of the DataFrame with the `apply` method. First, let's write a little function that squares a number:

def squared(x):
    return x ** 2

hoopsDF['Points'].apply(squared)

Above we've applied our squared function to the **Points** column of the DataFrame.
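For a one-off transform like this, the same thing can be written inline with a lambda. A self-contained sketch with hypothetical points data:

```python
import pandas as pd

points = pd.Series([50, 25, 26], name='Points')

# An inline lambda avoids defining a named function
squared_points = points.apply(lambda x: x ** 2)
```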

Finally, we can read and write csv and Excel files easily. The cell below reads in a file called **"data/sample.csv"**, and writes out a file called **"new_file.csv"**.

df = pd.read_csv('data/sample.csv')  # read csv into Pandas
df.to_csv('new_file.csv', index=False)  # write Pandas DataFrame to csv file

Similarly, for the Excel file **"data/sample.xlsx"**, we write out the file **"new_file.xlsx"**. To write to xlsx, we need to install the openpyxl library.

!pip install openpyxl

df = pd.read_excel('data/sample.xlsx', sheet_name='Sheet1')  # you have to specify a sheet, and this will not read in Excel formulas, etc.
df.to_excel('new_file.xlsx', sheet_name='Sheet1')

*These sample files can be downloaded from the attachments to this article below.*

This is just scratching the surface of the wide range of functionality in Pandas.
