# Basic Imports.
import pandas as pd
import numpy as np
I usually work woth astro data. Here we typically import catalogues. This data can be in multiple formats usually .fits, .h5, .csv .
All the methods below will return a pandas dataframe.
A fits is a Flexible Image Transport System (FITS) and is used for storing images, and tables or both and it is the most used file format in astronomy and was designed specifically for it.
from astropy.table import Table
Catalogue = Table.read('catalogue.fits', format='fits') # Read the catalogue.
Catalogue = Catalogue.to_pandas() # Convert to pandas.
Catalogue = Catalogue.head(50)
Catalogue.head(5)
Source_id | Isl_id | RA | E_RA | DEC | E_DEC | Total_flux | E_Total_flux | Peak_flux | E_Peak_flux | ... | E_DC_Min_img_plane | DC_PA_img_plane | E_DC_PA_img_plane | Isl_Total_flux | E_Isl_Total_flux | Isl_rms | Isl_mean | Resid_Isl_rms | Resid_Isl_mean | S_Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 167.140371 | 0.000225 | 57.729393 | 0.000070 | 0.000393 | 0.000121 | 0.000378 | 0.000064 | ... | 0.000158 | 0.000000 | 19.033791 | 0.000352 | 0.000075 | 0.000068 | 0.0 | 0.000003 | -4.429939e-10 | b"b'S'" |
1 | 1 | 3 | 167.122289 | 0.000007 | 57.655640 | 0.000010 | 0.011554 | 0.000268 | 0.007201 | 0.000071 | ... | 0.000018 | 89.712254 | 2.578199 | 0.011192 | 0.000116 | 0.000071 | 0.0 | 0.000048 | -5.317485e-06 | b"b'M'" |
2 | 2 | 4 | 167.124132 | 0.000051 | 57.680628 | 0.000047 | 0.003458 | 0.000223 | 0.001642 | 0.000076 | ... | 0.000093 | 127.119915 | 14.134439 | 0.003215 | 0.000129 | 0.000070 | 0.0 | 0.000036 | 6.057428e-07 | b"b'S'" |
3 | 3 | 5 | 167.181706 | 0.000088 | 58.053840 | 0.000085 | 0.000772 | 0.000140 | 0.000641 | 0.000071 | ... | 0.000189 | 53.352430 | 68.238954 | 0.000741 | 0.000102 | 0.000068 | 0.0 | 0.000010 | 1.471807e-07 | b"b'S'" |
4 | 4 | 6 | 167.143644 | 0.000085 | 57.869804 | 0.000081 | 0.003440 | 0.000297 | 0.001360 | 0.000081 | ... | 0.000107 | 137.025938 | 7.810522 | 0.003502 | 0.000185 | 0.000081 | 0.0 | 0.000023 | 1.233269e-06 | b"b'M'" |
5 rows × 45 columns
This data list sources from a radio image. Specifically from the LoFARs LoTSS survey of the northen hemisphere. This data contains alot of columns and alot of rows, so we have limited them to only 50.
.h5 is the file extension for a hdf5 file. This is a hierachical data format (version 5) file. Here we will open the file with a key, this is not always nessesary, this is an identifer for the group contained. For a single group it can be omitted.
Catalogue = pd.read_hdf('catalogue.hdf', key='df') # Read the catalogue.
Catalogue.head(5)
Source_id | Isl_id | RA | E_RA | DEC | E_DEC | Total_flux | E_Total_flux | Peak_flux | E_Peak_flux | ... | E_DC_Min_img_plane | DC_PA_img_plane | E_DC_PA_img_plane | Isl_Total_flux | E_Isl_Total_flux | Isl_rms | Isl_mean | Resid_Isl_rms | Resid_Isl_mean | S_Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 167.140371 | 0.000225 | 57.729393 | 0.000070 | 0.000393 | 0.000121 | 0.000378 | 0.000064 | ... | 0.000158 | 0.000000 | 19.033791 | 0.000352 | 0.000075 | 0.000068 | 0.0 | 0.000003 | -4.429939e-10 | b'S' |
1 | 1 | 3 | 167.122289 | 0.000007 | 57.655640 | 0.000010 | 0.011554 | 0.000268 | 0.007201 | 0.000071 | ... | 0.000018 | 89.712254 | 2.578199 | 0.011192 | 0.000116 | 0.000071 | 0.0 | 0.000048 | -5.317485e-06 | b'M' |
2 | 2 | 4 | 167.124132 | 0.000051 | 57.680628 | 0.000047 | 0.003458 | 0.000223 | 0.001642 | 0.000076 | ... | 0.000093 | 127.119915 | 14.134439 | 0.003215 | 0.000129 | 0.000070 | 0.0 | 0.000036 | 6.057428e-07 | b'S' |
3 | 3 | 5 | 167.181706 | 0.000088 | 58.053840 | 0.000085 | 0.000772 | 0.000140 | 0.000641 | 0.000071 | ... | 0.000189 | 53.352430 | 68.238954 | 0.000741 | 0.000102 | 0.000068 | 0.0 | 0.000010 | 1.471807e-07 | b'S' |
4 | 4 | 6 | 167.143644 | 0.000085 | 57.869804 | 0.000081 | 0.003440 | 0.000297 | 0.001360 | 0.000081 | ... | 0.000107 | 137.025938 | 7.810522 | 0.003502 | 0.000185 | 0.000081 | 0.0 | 0.000023 | 1.233269e-06 | b'M' |
5 rows × 45 columns
A .csv file is a textfile that has data saved in a particular pattern. In this case it is a comma seperated values. so each column value in a row is seperated with a comma. Each row is a line in the text file. These files are very common.
Catalogue = pd.read_csv('catalogue.csv') # Read the catalogue.
Catalogue.head(3)
Source_id | Isl_id | RA | E_RA | DEC | E_DEC | Total_flux | E_Total_flux | Peak_flux | E_Peak_flux | ... | E_DC_Min_img_plane | DC_PA_img_plane | E_DC_PA_img_plane | Isl_Total_flux | E_Isl_Total_flux | Isl_rms | Isl_mean | Resid_Isl_rms | Resid_Isl_mean | S_Code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 167.140371 | 0.000225 | 57.729393 | 0.000070 | 0.000393 | 0.000121 | 0.000378 | 0.000064 | ... | 0.000158 | 0.000000 | 19.033791 | 0.000352 | 0.000075 | 0.000068 | 0.0 | 0.000003 | -4.429939e-10 | b'S' |
1 | 1 | 3 | 167.122289 | 0.000007 | 57.655640 | 0.000010 | 0.011554 | 0.000268 | 0.007201 | 0.000071 | ... | 0.000018 | 89.712254 | 2.578199 | 0.011192 | 0.000116 | 0.000071 | 0.0 | 0.000048 | -5.317485e-06 | b'M' |
2 | 2 | 4 | 167.124132 | 0.000051 | 57.680628 | 0.000047 | 0.003458 | 0.000223 | 0.001642 | 0.000076 | ... | 0.000093 | 127.119915 | 14.134439 | 0.003215 | 0.000129 | 0.000070 | 0.0 | 0.000036 | 6.057428e-07 | b'S' |
3 rows × 45 columns
After we have done the desired tasked to a dataset, we also want to save the file. This can be done easitly using the following. Note that with the fits file we need to transform it into a astropy table object first as pandas doest save directly to a fits file.
# Saving a DataFrame to a fits file.
Table.from_pandas(Catalogue).write('catalogue.fits', format='fits', overwrite=True)
# Saving a DataFrame to a csv file.
Catalogue.to_csv('catalogue.csv', index=False)
#Saving a DataFrame to a hdf file.
Catalogue.to_hdf('catalogue.hdf', key='df', mode='w')
lets make the catalogue shorter for simplicity.
Catalogue = Catalogue[['RA','DEC','Maj','Min','PA']] # Drop the columns.
Catalogue.head(3)
RA | DEC | Maj | Min | PA | |
---|---|---|---|---|---|
0 | 167.140371 | 57.729393 | 0.002311 | 0.001251 | 97.185263 |
1 | 167.122289 | 57.655640 | 0.002162 | 0.001799 | 94.249325 |
2 | 167.124132 | 57.680628 | 0.002708 | 0.002164 | 131.659779 |
Sometimes we need to preform operations on the values in a row to create a new column value. Some of these fuctions are given to you in the pandas package but if we have a custom fuction we need to be abit more creative.
# To appy a basic math fuction to all rows that change with column values we can simply do the following.
Catalogue['SQRT'] = np.sqrt(Catalogue.RA**2 - Catalogue.DEC**2)
Catalogue.head(3)
RA | DEC | Maj | Min | PA | SQRT | |
---|---|---|---|---|---|---|
0 | 167.140371 | 57.729393 | 0.002311 | 0.001251 | 97.185263 | 156.854139 |
1 | 167.122289 | 57.655640 | 0.002162 | 0.001799 | 94.249325 | 156.861999 |
2 | 167.124132 | 57.680628 | 0.002708 | 0.002164 | 131.659779 | 156.854776 |
The .apply function parses the row values to the function so we can easily alter the fucntion to apply any complex function we like. However as we are using a python function to do this. If we preform these functions on a large dataset and the function is complex, it will not take advantage of the speed up that is a native pandasd function can access due to its implementaion of C++ libaries. If this is important to you, its best to check pandas's documentation for a native function.
def SQRT_fun(row): # Define a function to apply to the DataFrame.
return np.sqrt(row.RA**2 - row.DEC**2) # Return the value of the function.
Catalogue['Sqrt_1'] = Catalogue.apply(SQRT_fun, axis=1)
Catalogue.head(3)
RA | DEC | Maj | Min | PA | SQRT | Sqrt_1 | |
---|---|---|---|---|---|---|---|
0 | 167.140371 | 57.729393 | 0.002311 | 0.001251 | 97.185263 | 156.854139 | 156.854139 |
1 | 167.122289 | 57.655640 | 0.002162 | 0.001799 | 94.249325 | 156.861999 | 156.861999 |
2 | 167.124132 | 57.680628 | 0.002708 | 0.002164 | 131.659779 | 156.854776 | 156.854776 |
This is a powerful function that allows for chaining functions together to generate a resulting dataframe.
def create_name(row): # Define a function to apply to the DataFrame.
def create_name_series(row):
return 'Name_' + str(round(row.RA,3))
res = row.apply(create_name_series, axis=1)
row['Name'] = res
return row
processed_cat = (
Catalogue
.pipe(create_name)
)
processed_cat.head(3)
RA | DEC | Maj | Min | PA | SQRT | Sqrt_1 | Name | |
---|---|---|---|---|---|---|---|---|
0 | 167.140371 | 57.729393 | 0.002311 | 0.001251 | 97.185263 | 156.854139 | 156.854139 | Name_167.14 |
1 | 167.122289 | 57.655640 | 0.002162 | 0.001799 | 94.249325 | 156.861999 | 156.861999 | Name_167.122 |
2 | 167.124132 | 57.680628 | 0.002708 | 0.002164 | 131.659779 | 156.854776 | 156.854776 | Name_167.124 |
We can increase the number of functions here so that a pipline of multiple function is created.
def find_eigenvalues(row): # Define a function to apply to the DataFrame.
def find_eigenvalues_series(row):
return np.linalg.eigvals([[row.Maj, row.PA], [row.PA, row.Min]])
res = row.apply(find_eigenvalues_series, axis=1)
row['eignval 1'] = res
return row
processed_cat = (
Catalogue
.pipe(create_name)
.pipe(find_eigenvalues)
)
processed_cat.head(3)
RA | DEC | Maj | Min | PA | SQRT | Sqrt_1 | Name | Eigenvalues | eignval 1 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 167.140371 | 57.729393 | 0.002311 | 0.001251 | 97.185263 | 156.854139 | 156.854139 | Name_167.14 | [97.1870440225773, -97.18348223941352] | [97.1870440225773, -97.18348223941352] |
1 | 167.122289 | 57.655640 | 0.002162 | 0.001799 | 94.249325 | 156.861999 | 156.861999 | Name_167.122 | [94.25130548539278, -94.24734465874354] | [94.25130548539278, -94.24734465874354] |
2 | 167.124132 | 57.680628 | 0.002708 | 0.002164 | 131.659779 | 156.854776 | 156.854776 | Name_167.124 | [131.66221497815906, -131.65734332902065] | [131.66221497815906, -131.65734332902065] |
# convert ipynb to .html
!jupyter nbconvert --to html pandas-tut.ipynb
[NbConvertApp] Converting notebook pandas-tut.ipynb to html [NbConvertApp] Writing 620266 bytes to pandas-tut.html