Manipulate data with Pandas
Pull up a chair and make yourself comfortable as Mihalis Tsoukalos explains how to use Pandas for storing and manipulating tabular data.
Pandas is a Python library that gives you a rich set of tools for carrying out data analysis. It's no exaggeration to say that if you're going to work with Machine Learning and Data Analysis in Python, then you need to learn how to use Pandas, which excels at data conversions, selections and manipulations using simple-to-understand code.
The core elements of Pandas are the DataFrame and Series structures, which are used for data storage because without data you have nothing to process, explore or work with. The Series structure is a one-dimensional labelled array that can hold any kind of data, whereas the DataFrame structure is a two-dimensional, size-mutable data structure. Before we see these two data structures in action, we need to learn how to install Pandas.
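To preview the two structures before we get to installation, here's a minimal sketch using made-up values (not the lxf.csv data used later):

```python
import pandas as pd

# A Series: one-dimensional and labelled, holding a single column of values.
s = pd.Series([48, 50, 52], index=['a', 'b', 'c'])

# A DataFrame: two-dimensional and size-mutable - columns can be
# added or removed after creation.
df = pd.DataFrame({'pages': [100, 116], 'tutorials': [9, 11]})
df['year'] = [2020, 2021]   # size-mutable: add a column in place

print(s['b'])        # access an element by its label
print(df.shape)      # two rows, three columns after adding year
```

Note how the Series is accessed through its labels, while the DataFrame grows a new column with a plain assignment.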
Installing Pandas
Although you can install Pandas on its own, the recommended way is under the Anaconda environment. Installing Anaconda (https://anaconda.org) on an Arch Linux system is as simple as running the pacman -S anaconda command with root privileges – use your favourite package manager for installing Anaconda on your own Linux system. You can operate Anaconda using the conda command line utility once Anaconda is activated. You can activate Anaconda on Arch Linux by running source /opt/anaconda/bin/activate root . After that, your Linux shell prompt will most likely change to inform you about the active Anaconda environment.
Once Anaconda is installed, you should have Pandas, NumPy, SciPy and Matplotlib installed as well (if not, install them using the conda utility). You can make sure that Pandas is installed by running
$ conda list pandas
# packages in environment at /home/mtsouk/.conda/ envs/LXFormat:
#
# Name      Version   Build    Channel
pandas      1.2.3     pypi_0   pypi
Should you wish to see the complete list of installed packages, run conda list instead.
If you don’t want to install Anaconda, you can install Pandas using pip – in that case, execute pip install pandas with root privileges and run pip show pandas to learn more about the current Pandas installation. Do the same to install NumPy, SciPy and Matplotlib. The following interaction with the Python shell makes sure that Pandas is working:
>>> import pandas as pd
>>> print(pd.__version__)
1.2.3
The previous output verifies that Pandas is accessible and that we’re using Pandas version 1.2.3. If the Pandas library can’t be found, we’ll receive an error message similar to ModuleNotFoundError: No module named 'pandas' . For simplicity, from now on this tutorial uses Anaconda for running Python and all the presented libraries.
Data Frames
Let’s kick off with manipulating two-dimensional data. The data file that’s going to be used is related to Linux Format and has the following columns: issue (id), number of pages (pages), number of tutorials (tutorials), year of publication (year), month as a string value (month) and a column with a random integer number from 1 to 200 (value), to have some extra information in the data file.
The following interaction with the Python shell shows how to read a CSV data file (lxf.csv) that contains data in the aforementioned format and load its contents into a data frame named df, as well as how to build DataFrames from a list and from a dictionary:
>>> df = pd.read_csv('lxf.csv')
>>> list = [ ... ]
>>> dfList = pd.DataFrame(list, columns=['Name', 'Value'], index=['in0', 'in1', 'in2'])
>>> dict = [{'id': '0001', 'Year': 2020, 'month': 'January'}, {'id': '0201', 'Year': 2019, 'month': 'March'}]
>>> dfDict = pd.DataFrame.from_dict(dict)
In the list case, we’re defining our own index values and column names. For the dictionary case, the column names are taken from the dictionary data.
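A complete, runnable version of the same idea follows, with made-up rows standing in for the list data:

```python
import pandas as pd

# From a list of rows: we supply the index values and column names ourselves.
rows = [['LXF260', 100], ['LXF261', 116], ['LXF262', 100]]  # hypothetical data
dfList = pd.DataFrame(rows, columns=['Name', 'Value'],
                      index=['in0', 'in1', 'in2'])

# From a list of dictionaries: the column names come from the dictionary keys.
records = [{'id': '0001', 'Year': 2020, 'month': 'January'},
           {'id': '0201', 'Year': 2019, 'month': 'March'}]
dfDict = pd.DataFrame.from_dict(records)

print(dfList.index.tolist())    # the index values we supplied
print(dfDict.columns.tolist())  # column names taken from the keys
```

Avoid shadowing the built-in names list and dict in real scripts, as the original snippet does – descriptive names such as rows and records are safer.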
A common way of using Pandas is in a Jupyter (https://jupyter.org) notebook environment. Jupyter should be installed in an Anaconda environment using conda ( conda install -c conda-forge jupyter ). Jupyter works both as an interactive graphical application and as an interactive command line application via Jupyter’s console. The Jupyter console binary is already installed by the Anaconda suite and you can start it with jupyter-console . Start the Jupyter GUI by running the jupyter notebook command in your terminal – this launches Jupyter in a web browser, where you can begin working. If you don’t already have a working Python 3 notebook, create a new one from the Files page by selecting New>Python 3. By default, the Jupyter server accepts connections from localhost only. However, if you use the --ip option ( jupyter notebook --ip 0.0.0.0 ), you’ll be able to access the Jupyter web interface from other machines as well.
The screenshot (left) shows the output of the following Python commands when executed in Jupyter GUI on a web browser window:
import pandas as pd
print(pd.__version__)
df = pd.read_csv('lxf.csv') # Provided that lxf.csv is in the current working directory
df.head() # See a small part of the actual data
df.describe() # Learn more about the data values
df.info() # Get more information about the data
There’s nothing stopping you from using Pandas in Python scripts – this script is saved as explore.py: #!/usr/bin/env python
import pandas as pd

print(pd.__version__)
df = pd.read_csv('lxf.csv')
head = df.head()
print(head)
df.info()
You can add any statements you want – this is just an example of how to use Pandas in Python scripts and not in an interactive environment such as Jupyter. The main differences are that in Jupyter you don’t need to load the Python interpreter and that Jupyter can display graphical output.
The next section tells us more about working with data using Pandas.
Data, in a DataFrame?!
Loading data into a DataFrame isn’t enough – you should also be able to manipulate and select the data you want. You can select a single column from the df DataFrame presented in the previous section using either of the following two techniques:
Y1 = df.year
Y2 = df['year']
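Both forms return the same Series – here's a quick check using a throwaway DataFrame, since lxf.csv isn't reproduced here:

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020, 2020], 'pages': [100, 116, 100]})

Y1 = df.year       # attribute access - only works for valid Python identifiers
Y2 = df['year']    # bracket access - works for any column name

print(Y1.equals(Y2))  # both refer to the same column of data
```

The bracket form is the more general one: a column named, say, "number of pages" can only be reached with df['number of pages'].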
You can select the id, year and month columns along with their data from the df DataFrame using one of these two methods:
selected1 = df.loc[:, ['id', 'year', 'month']]
selected2 = df[['id', 'year', 'month']]
The previous code selects three specific columns and puts them into two new DataFrame variables called selected1 and selected2. Be aware that such selections don’t necessarily give you an independent copy of the data – in Python, ‘=’ creates a reference to an existing object, and Pandas may hand back a view of the original data. We’re going to learn how to create actual copies, if that’s what you want, later on in this tutorial.
Select records based on conditions as follows:
year2020 = df.loc[df['year'] == 2020]
The previous statement selects all rows that have a year value of 2020. You can also perform group-by operations on records as follows:
group_by_year = df[['year', 'pages', 'tutorials']].groupby('year').mean()
This code takes a subset of the data, the year, pages and tutorials values, groups them by year, applies an aggregate function and finds the average values of each column apart from the column used in groupby() .
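With a small stand-in DataFrame (invented values, in the same shape as lxf.csv), the group-by looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'year':      [2019, 2019, 2020, 2020],
    'pages':     [100, 104, 116, 100],
    'tutorials': [8, 10, 11, 9],
})

# Take the subset, group by year and average the remaining columns.
group_by_year = df[['year', 'pages', 'tutorials']].groupby('year').mean()
print(group_by_year)
```

The result has one row per distinct year, with year itself promoted to the index and pages and tutorials holding the per-year averages.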
The screenshot (page 93) shows the output of the presented commands using jupyter-console. However, the GUI of Jupyter is usually a much better option for data exploration, especially when you’re trying to learn and understand your data because it enables you to create visualisations and edit your commands.
Working with Series
As with a NumPy array, a Series has a data type. Although a pandas.Series has a single data column, it carries an additional column, called the index, that enables you to access individual elements. If you don’t specify the index values on your own, Pandas generates them for you. Define a new Series variable for storing float64 values as follows:
sv = pd.Series(dtype='float64')
You can also convert a NumPy array stored in npData into a pandas.Series using a pd.Series(npData) call. Should you wish to add your own index values, which in this case are strings, you can do the following:
sIndex = pd.Series(npData, index=['a', 'b', 'c', 'd'])
You can convert a Series to a DataFrame using to_frame() . Additionally, you can convert a DataFrame into a Series using squeeze() . However, if the DataFrame has more than one column, you should specifically select the column you want to use and then call squeeze() : df['index'].squeeze() . Finally, you can put a Series into an existing DataFrame using the pandas.merge() command. This also works for combining multiple Series into a new DataFrame.
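A minimal round-trip between the two structures (with invented data) looks like this:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='value')

frame = s.to_frame()      # Series -> single-column DataFrame
back = frame.squeeze()    # single-column DataFrame -> Series again

# With more than one column you must narrow things down first:
df = pd.DataFrame({'value': [1, 2, 3], 'year': [2019, 2020, 2021]})
years = df[['year']].squeeze()  # select one column, then squeeze to a Series

print(type(frame).__name__)  # DataFrame
print(back.equals(s))        # the round-trip preserves the data
```

Naming the Series matters: to_frame() uses the Series name as the resulting column name.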
The screenshot (below) shows the output of series. py inside a Jupyter notebook.
What you should remember from this section is that the Series data type holds a single column of data along with an index, whereas a DataFrame is a collection of Series. Additionally, the to_frame() function converts a Series object into a DataFrame and squeeze() can do the opposite, provided that there’s no ambiguity about which DataFrame column is going to be used.
Data selection
Let’s take a look at how to add, delete and extract data from DataFrames based on given criteria, as well as how to copy existing data instead of referencing it, using the copy() function. All these techniques are illustrated in columns.py. The first code excerpt shows how to rename the value column to random :
df.rename({'value':'random'}, axis=1, inplace=True)
The inplace=True parameter tells Pandas that the changes should take place in the df variable itself. Next, we’re going to learn how to select specific rows using loc :
df.loc[[1, 5]]
df.loc[7:]
The first statement selects the rows with index values 1 and 5, whereas the second selects all rows from index value 7 to the end.
The selected = df.loc[(df['year'] == 2020) & (df.tutorials >= 10)] statement selects all records from df that have 10 or more tutorials and were published in 2020, and saves the results to the selected variable.
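The same selection, runnable against a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'year':      [2019, 2020, 2020, 2020],
    'tutorials': [12, 9, 10, 11],
})

# Rows from 2020 with at least 10 tutorials. Note that & requires
# each comparison to be wrapped in parentheses.
selected = df.loc[(df['year'] == 2020) & (df.tutorials >= 10)]
print(len(selected))  # two rows satisfy both conditions
```

Using plain Python and instead of & here raises an error, because and can't operate element-wise on a Series.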
The df_copy = df.copy() statement creates a full copy of the df DataFrame called df_copy, whereas the df_dates = df[['id', 'year', 'month']].copy() statement creates a full copy of the id, year and month columns of the df DataFrame called df_dates. It’s important to understand the difference between using copy() and not including copy() in a statement: the former creates a new copy of existing data whereas the latter references existing data without copying it. Therefore, df_copy is a totally independent variable.
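You can see that independence directly – modifying the copy leaves the original untouched (again with invented data):

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020], 'pages': [100, 116]})

df_copy = df.copy()          # a full, independent copy of the data
df_copy.loc[0, 'pages'] = 0  # change the copy only...

print(df.loc[0, 'pages'])    # ...and the original value survives
```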
This statement inserts a new column with boolean values into df based on the value of the “pages” column: df['Big'] = np.where(df['pages']> 100, True, False)
So, if the value in the pages column is greater than 100, True is put in the newly created “Big” column; otherwise, False is put there. This statement uses functionality from the NumPy package, hence the use of np.where() .
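Here's the same np.where() technique on a three-row stand-in DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'pages': [84, 116, 100]})

# True where pages > 100, False everywhere else.
df['Big'] = np.where(df['pages'] > 100, True, False)
print(df['Big'].tolist())  # only the 116-page row qualifies
```

Note that 100 itself does not count as Big, because the comparison is strictly greater-than.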
Finally, the df.drop('random', axis=1, inplace=True) statement deletes the “random” column from df. As usual, we should be extra careful with delete operations.
The output of selectData.py can be seen in the screenshot (left), as executed in the Jupyter environment.
Missing values
Missing values are an important subject in data analysis, and the important functionality of missing.py can be found in the following Python statements:
ts = pd.Series([0, np.nan, 2, np.nan, 4], dtype=pd.Int64Dtype())
ts.fillna(0)
The first statement creates a Series with two missing values, whereas the second replaces the missing values with 0 using fillna(0) . Using fillna(ts.mean()) instead of fillna(0) replaces missing values with the mean of the existing values. Moreover, dropna() drops any rows with missing values. Finally, if you want to create sample datetime data, you can use the date_range() function:
from datetime import datetime
datelist = pd.date_range(datetime.today(), periods=5, freq="1h").tolist()
The contents of datelist variable should be similar to the following:
[Timestamp('2021-05-04 11:52:39.694832', freq='H'),
...
Timestamp('2021-05-04 15:52:39.694832', freq='H')]
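Stepping back to the missing-value calls, here they are together in a runnable sketch:

```python
import numpy as np
import pandas as pd

# Int64Dtype is a nullable integer type, so the Series can hold NaN
# without being forced to float.
ts = pd.Series([0, np.nan, 2, np.nan, 4], dtype=pd.Int64Dtype())

filled_zero = ts.fillna(0)          # missing values become 0
filled_mean = ts.fillna(ts.mean())  # missing values become the mean
dropped = ts.dropna()               # missing values removed entirely

print(filled_zero.tolist())  # gaps replaced with zeros
print(len(dropped))          # only the three real values remain
```

Note that fillna() and dropna() return new Series objects; ts itself still contains the missing values afterwards.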
Data visualisation
Finally, we’ll see how Pandas can help you visualise data – all statements are saved in plot.py. df.plot() creates a line plot, plot(kind='hist', bins=4) creates a histogram with four bins, plot.box() creates a box plot, plot.area() creates an area plot and plot.pie() creates a pie chart. Behind the scenes, all these Pandas functions use Matplotlib. The screenshot (above) shows part of the plot.py output in Jupyter.
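If you run these calls outside Jupyter, there may be no display to draw on; a minimal sketch that renders to image files instead, using Matplotlib's non-interactive Agg backend and invented data:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render to files, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'pages': [100, 116, 100], 'tutorials': [9, 11, 10]})

ax = df.plot()                  # line plot of every numeric column
ax.get_figure().savefig('line.png')

ax = df['pages'].plot(kind='hist', bins=4)  # histogram with four bins
ax.get_figure().savefig('hist.png')
plt.close('all')  # free the figures once they're saved
```

In Jupyter the savefig() calls are unnecessary because the notebook displays each figure inline.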
Pandas can generate better visualisations with a little help from Matplotlib. The statements that are going to be executed are the following:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('lxf.csv')
df[['pages', 'tutorials']].plot()
plt.title('Plotting with Matplotlib')
plt.ylabel('Value')
plt.xlabel('Index')
df.plot(x='id', y='pages', kind='bar', figsize=(10, 10))
plt.title('Pages per LXF issue')
plt.ylabel('Pages')
plt.xlabel('Issue number')
The new output is more professional-looking than before mainly because we now have titles and axis labels. The screenshot (below) shows the output of the previous code in a Jupyter notebook.
Pandas is a powerful library and is being used everywhere. Put simply, if you have to manipulate data in Python, you’re most likely going to need Pandas not because it’s free but because it’s the best tool for the job. As usual, experiment with Pandas as much as you can in order to make mistakes and learn before using it in real-world projects. Pandas’ official website is at https://pandas.pydata.org whereas the official website for Jupyter is at https://jupyter.org. Finally, the official Anaconda site is at www.anaconda.com.