paint-brush
Pandas Refresher: Using Join on Tablesby@svitlasystems
146 reads

Pandas Refresher: Using Join on Tables

by Svitla Systems4mOctober 24th, 2021
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Business Intelligence and analytics services are more than trendy concepts; they're in demand in an economy that is in dire need of data-driven decisions and solutions. To achieve true and effective business intelligence and analytics, you need numerous information systems for massive data processing. To the rescue, we call upon reliable tools like Python and Pandas. By writing a script in Python and pairing it with the Pandas library, you'll be able to solve the problem relatively quickly while yielding accurate results for higher consumption.

Company Mentioned

Mention Thumbnail
featured image - Pandas Refresher: Using Join on Tables
Svitla Systems HackerNoon profile picture

Business Intelligence and analytics services are more than trendy concepts; they're in growing demand in an economy that is in dire need of data-driven decisions and solutions.

To achieve true and effective business intelligence and analytics, you need numerous information systems for massive data processing.  Additionally, you need to change the reference terms for these heavy-duty information systems several times a day, as businesses need to have operational data about specific parameters to quickly respond and adapt to shifting markets and external conditions. 

For this purpose, it is common practice to use modern databases, be it relational or NoSQL, to quickly and efficiently build multiple dependencies between large entities. To the rescue, we call upon reliable tools like Python and Pandas. 

To top it off, it very often happens that certain information is needed immediately and only once, even when dealing with critical business information, key parameters, or market Intel. Tomorrow, that data may no longer be relevant, and you will need to process new data in a new format.

Unfortunately, the development cycle of these types of relational databases can be quite costly and time-consuming. Luckily, you can leverage smart tools to help you alleviate these concerns.

By writing a script in Python and pairing it with the Pandas library, you'll be able to solve the problem relatively quickly while yielding accurate results for higher consumption. Let's take a look at how you can process data from various tables in Pandas.

How to load tables in Pandas

The first and most important operation when it comes to data is loading tables. The Pandas library uses a very compact and efficient method of loading tables, where the most common format is CSV.

import pandas as pd
 
df = pd.read_table('data.csv')
 
print (df)

Pandas can read SQL data as well:

...
pd.read_sql_table('table_name', 'postgres:///db_name') 
...

Why do we need to join data in tables?

Relational databases are designed to join data in tables. Each table in the database contains data of a certain form and content. To get a new representation of information, it is necessary to combine data and obtain tables with different columns and lines.

Please refer to the following article for more information about joins in SQL, “Visual Representation of SQL Joins” written by C.L. Moffatt. In it, the author proposed the following visual notation of joins:


Image from: https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

When combining two tables, they are linked using the chosen characteristic to get a new table or view. Keep in mind that the column used to combine the tables should contain only unique values. Let’s define two tables with some information for the Pandas join experiment. 

import pandas as pd
 
df1 = pd.DataFrame({'lkey': ['Alice', 'Brian', 'Joe', 'Jessica', 'Nick'],
                   'age': [18, 21, 30, 25, 33]})
 
df2 = pd.DataFrame({'rkey': ['Alice', 'Brian', 'Joe', 'Jessica', 'Zac'],
              'LastName': ['Smith', 'Simson', 'Goodman', 'Hammond', "Bell"]})

We will make all possible variants of joins in Panda: left, right, inner, outer, and we'll merge two tables. Pandas documentation states that “JOINs can be performed with join() or merge().

By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).”

Please refer to “Comparison with SQL” if you need more information about similarities and differences between SQL and Pandas. Another great article to read is “Pandas vs SQL in 5 Examples”.

Left join

Let’ start with left joining. Join two tables by column 

lkey
 in the dataframe 
df1
 and by column 
rkey
 in the dataframe 
df2
.

df1.set_index('lkey').join(df2.set_index('rkey'))

If you do not have a 

Nick
 record in the right table, 
NaN
 will be marked in the corresponding column after joining.

Right join

With the right joining, you can join two tables by column 

lkey
 in the dataframe 
df1
 and by column 
rkey
 in the dataframe 
df2
 with the right intersection from the table.

df1.set_index('lkey').join(df2.set_index('rkey'), how="right")

If you do not have a 

Zac
 record in the right table, 
NaN
 will be marked in the corresponding column after joining.

Outer join

Outer joining allows you to join two tables and includes all lines that are present in both tables. NaN will be placed in the proper column if you don’t have such records on the left side or on the right side.

df1.set_index('lkey').join(df2.set_index('rkey'), how="outer")

If you do not have a Nick record in the right table, 

NaN
 will be marked in the column 
LastName
. If you do not have a 
Zac
 record in the left table, 
NaN
 will be marked in the column 
age
 after joining.

Inner join

Inner join allows you to get all the records from the tables that are present on the left table and right table.

df1.set_index('lkey').join(df2.set_index('rkey'), how="inner")

In this case, you won't have lines with 

Nick
 and 
Zac
.

Merging data in tables

If you need to keep all columns from the left and right tables; and perform inner joining, Pandas allows you to perform functions like merge:

df1.merge(df2, left_on='lkey', right_on='rkey')

By default, merge uses the inner variant of table union. In this case, you will have

Alice
,
Brian
,
Joe
,
Jessica
, and 
lkey
, age 
rkey
LastName
 columns. This is very useful when merging data in tables.

Conclusion

In this article, we aimed to cover the basic techniques for joining tables in Pandas. This is very convenient since you get new dataframes (i.e. tables) as the output. What's more, this information is conveniently processed by all available Pandas algorithms.

When it comes to speed, you will be satisfied with how Pandas works with your data volumes. Of course, Pandas will run slower than SQL queries at some point, but when large amounts of data are involved, it is worth switching to NoSQL databases for quick information retrieval in the required output you need.

Pandas will be an indispensable tool if you need to quickly and efficiently solve a problem in data science or data analytics. Oftentimes, you need to perform the joining of tables only once, since you will likely need a new set of tasks for your information the next day.

It is worth spending time writing and debugging SQL queries when you run this code quite often and on large amounts of data.