Useful Pandas Tutorial¶

Source Jupyter Notebook.

In [35]:
# Basic Imports.
import pandas as pd
import numpy as np

Opening a Catalogue into a DataFrame¶

I usually work woth astro data. Here we typically import catalogues. This data can be in multiple formats usually .fits, .h5, .csv .

All the methods below will return a pandas dataframe.

Opening .fits¶

A fits is a Flexible Image Transport System (FITS) and is used for storing images, and tables or both and it is the most used file format in astronomy and was designed specifically for it.

In [36]:
from astropy.table import Table

Catalogue = Table.read('catalogue.fits', format='fits')   # Read the catalogue.
Catalogue = Catalogue.to_pandas()                         # Convert to pandas.
Catalogue = Catalogue.head(50)
Catalogue.head(5)
Out[36]:
Source_id Isl_id RA E_RA DEC E_DEC Total_flux E_Total_flux Peak_flux E_Peak_flux ... E_DC_Min_img_plane DC_PA_img_plane E_DC_PA_img_plane Isl_Total_flux E_Isl_Total_flux Isl_rms Isl_mean Resid_Isl_rms Resid_Isl_mean S_Code
0 0 2 167.140371 0.000225 57.729393 0.000070 0.000393 0.000121 0.000378 0.000064 ... 0.000158 0.000000 19.033791 0.000352 0.000075 0.000068 0.0 0.000003 -4.429939e-10 b"b'S'"
1 1 3 167.122289 0.000007 57.655640 0.000010 0.011554 0.000268 0.007201 0.000071 ... 0.000018 89.712254 2.578199 0.011192 0.000116 0.000071 0.0 0.000048 -5.317485e-06 b"b'M'"
2 2 4 167.124132 0.000051 57.680628 0.000047 0.003458 0.000223 0.001642 0.000076 ... 0.000093 127.119915 14.134439 0.003215 0.000129 0.000070 0.0 0.000036 6.057428e-07 b"b'S'"
3 3 5 167.181706 0.000088 58.053840 0.000085 0.000772 0.000140 0.000641 0.000071 ... 0.000189 53.352430 68.238954 0.000741 0.000102 0.000068 0.0 0.000010 1.471807e-07 b"b'S'"
4 4 6 167.143644 0.000085 57.869804 0.000081 0.003440 0.000297 0.001360 0.000081 ... 0.000107 137.025938 7.810522 0.003502 0.000185 0.000081 0.0 0.000023 1.233269e-06 b"b'M'"

5 rows × 45 columns

This data list sources from a radio image. Specifically from the LoFARs LoTSS survey of the northen hemisphere. This data contains alot of columns and alot of rows, so we have limited them to only 50.

.h5¶

.h5 is the file extension for a hdf5 file. This is a hierachical data format (version 5) file. Here we will open the file with a key, this is not always nessesary, this is an identifer for the group contained. For a single group it can be omitted.

In [37]:
Catalogue = pd.read_hdf('catalogue.hdf', key='df')   # Read the catalogue.
Catalogue.head(5)
Out[37]:
Source_id Isl_id RA E_RA DEC E_DEC Total_flux E_Total_flux Peak_flux E_Peak_flux ... E_DC_Min_img_plane DC_PA_img_plane E_DC_PA_img_plane Isl_Total_flux E_Isl_Total_flux Isl_rms Isl_mean Resid_Isl_rms Resid_Isl_mean S_Code
0 0 2 167.140371 0.000225 57.729393 0.000070 0.000393 0.000121 0.000378 0.000064 ... 0.000158 0.000000 19.033791 0.000352 0.000075 0.000068 0.0 0.000003 -4.429939e-10 b'S'
1 1 3 167.122289 0.000007 57.655640 0.000010 0.011554 0.000268 0.007201 0.000071 ... 0.000018 89.712254 2.578199 0.011192 0.000116 0.000071 0.0 0.000048 -5.317485e-06 b'M'
2 2 4 167.124132 0.000051 57.680628 0.000047 0.003458 0.000223 0.001642 0.000076 ... 0.000093 127.119915 14.134439 0.003215 0.000129 0.000070 0.0 0.000036 6.057428e-07 b'S'
3 3 5 167.181706 0.000088 58.053840 0.000085 0.000772 0.000140 0.000641 0.000071 ... 0.000189 53.352430 68.238954 0.000741 0.000102 0.000068 0.0 0.000010 1.471807e-07 b'S'
4 4 6 167.143644 0.000085 57.869804 0.000081 0.003440 0.000297 0.001360 0.000081 ... 0.000107 137.025938 7.810522 0.003502 0.000185 0.000081 0.0 0.000023 1.233269e-06 b'M'

5 rows × 45 columns

.csv¶

A .csv file is a textfile that has data saved in a particular pattern. In this case it is a comma seperated values. so each column value in a row is seperated with a comma. Each row is a line in the text file. These files are very common.

In [38]:
Catalogue = pd.read_csv('catalogue.csv')   # Read the catalogue.
Catalogue.head(3)
Out[38]:
Source_id Isl_id RA E_RA DEC E_DEC Total_flux E_Total_flux Peak_flux E_Peak_flux ... E_DC_Min_img_plane DC_PA_img_plane E_DC_PA_img_plane Isl_Total_flux E_Isl_Total_flux Isl_rms Isl_mean Resid_Isl_rms Resid_Isl_mean S_Code
0 0 2 167.140371 0.000225 57.729393 0.000070 0.000393 0.000121 0.000378 0.000064 ... 0.000158 0.000000 19.033791 0.000352 0.000075 0.000068 0.0 0.000003 -4.429939e-10 b'S'
1 1 3 167.122289 0.000007 57.655640 0.000010 0.011554 0.000268 0.007201 0.000071 ... 0.000018 89.712254 2.578199 0.011192 0.000116 0.000071 0.0 0.000048 -5.317485e-06 b'M'
2 2 4 167.124132 0.000051 57.680628 0.000047 0.003458 0.000223 0.001642 0.000076 ... 0.000093 127.119915 14.134439 0.003215 0.000129 0.000070 0.0 0.000036 6.057428e-07 b'S'

3 rows × 45 columns

Saving a Data Frame.¶

After we have done the desired tasked to a dataset, we also want to save the file. This can be done easitly using the following. Note that with the fits file we need to transform it into a astropy table object first as pandas doest save directly to a fits file.

In [39]:
# Saving a DataFrame to a fits file.
Table.from_pandas(Catalogue).write('catalogue.fits', format='fits', overwrite=True)
# Saving a DataFrame to a csv file.
Catalogue.to_csv('catalogue.csv', index=False)            
#Saving a DataFrame to a hdf file.
Catalogue.to_hdf('catalogue.hdf', key='df', mode='w')      

lets make the catalogue shorter for simplicity.

In [40]:
Catalogue = Catalogue[['RA','DEC','Maj','Min','PA']]   # Drop the columns.
Catalogue.head(3)
Out[40]:
RA DEC Maj Min PA
0 167.140371 57.729393 0.002311 0.001251 97.185263
1 167.122289 57.655640 0.002162 0.001799 94.249325
2 167.124132 57.680628 0.002708 0.002164 131.659779

Preforming Complex operations on Columns within a DataFrame.¶

Sometimes we need to preform operations on the values in a row to create a new column value. Some of these fuctions are given to you in the pandas package but if we have a custom fuction we need to be abit more creative.

In [41]:
# To appy a basic math fuction to all rows that change with column values we can simply do the following.

Catalogue['SQRT'] = np.sqrt(Catalogue.RA**2 - Catalogue.DEC**2)
Catalogue.head(3)
Out[41]:
RA DEC Maj Min PA SQRT
0 167.140371 57.729393 0.002311 0.001251 97.185263 156.854139
1 167.122289 57.655640 0.002162 0.001799 94.249325 156.861999
2 167.124132 57.680628 0.002708 0.002164 131.659779 156.854776

pd.DataFrame.apply¶

The .apply function parses the row values to the function so we can easily alter the fucntion to apply any complex function we like. However as we are using a python function to do this. If we preform these functions on a large dataset and the function is complex, it will not take advantage of the speed up that is a native pandasd function can access due to its implementaion of C++ libaries. If this is important to you, its best to check pandas's documentation for a native function.

In [42]:
def SQRT_fun(row):                           # Define a function to apply to the DataFrame.
    return np.sqrt(row.RA**2 - row.DEC**2)   # Return the value of the function.

Catalogue['Sqrt_1'] = Catalogue.apply(SQRT_fun, axis=1) 
Catalogue.head(3)
Out[42]:
RA DEC Maj Min PA SQRT Sqrt_1
0 167.140371 57.729393 0.002311 0.001251 97.185263 156.854139 156.854139
1 167.122289 57.655640 0.002162 0.001799 94.249325 156.861999 156.861999
2 167.124132 57.680628 0.002708 0.002164 131.659779 156.854776 156.854776
In [ ]:
 

pd.DataFrame.pipe¶

This is a powerful function that allows for chaining functions together to generate a resulting dataframe.

In [50]:
def create_name(row):   # Define a function to apply to the DataFrame.
    def create_name_series(row):
        return 'Name_' + str(round(row.RA,3))
    
    res = row.apply(create_name_series, axis=1)
    row['Name'] = res
    return row

processed_cat = (
    Catalogue
    .pipe(create_name)
)
processed_cat.head(3)
Out[50]:
RA DEC Maj Min PA SQRT Sqrt_1 Name
0 167.140371 57.729393 0.002311 0.001251 97.185263 156.854139 156.854139 Name_167.14
1 167.122289 57.655640 0.002162 0.001799 94.249325 156.861999 156.861999 Name_167.122
2 167.124132 57.680628 0.002708 0.002164 131.659779 156.854776 156.854776 Name_167.124

We can increase the number of functions here so that a pipline of multiple function is created.

In [55]:
def find_eigenvalues(row):   # Define a function to apply to the DataFrame.
    def find_eigenvalues_series(row):
        return np.linalg.eigvals([[row.Maj, row.PA], [row.PA, row.Min]])
    
    res = row.apply(find_eigenvalues_series, axis=1)
    row['eignval 1'] = res
    return row

processed_cat = (
    Catalogue
    .pipe(create_name)
    .pipe(find_eigenvalues)
)
processed_cat.head(3)
Out[55]:
RA DEC Maj Min PA SQRT Sqrt_1 Name Eigenvalues eignval 1
0 167.140371 57.729393 0.002311 0.001251 97.185263 156.854139 156.854139 Name_167.14 [97.1870440225773, -97.18348223941352] [97.1870440225773, -97.18348223941352]
1 167.122289 57.655640 0.002162 0.001799 94.249325 156.861999 156.861999 Name_167.122 [94.25130548539278, -94.24734465874354] [94.25130548539278, -94.24734465874354]
2 167.124132 57.680628 0.002708 0.002164 131.659779 156.854776 156.854776 Name_167.124 [131.66221497815906, -131.65734332902065] [131.66221497815906, -131.65734332902065]
In [59]:
# convert ipynb to .html
!jupyter nbconvert --to html pandas-tut.ipynb
[NbConvertApp] Converting notebook pandas-tut.ipynb to html
[NbConvertApp] Writing 620266 bytes to pandas-tut.html