Samreena Aslam – Linux Hint https://linuxhint.com Exploring and Master Linux Ecosystem Tue, 02 Mar 2021 03:08:29 +0000

How to Plot Data in Pandas Python https://linuxhint.com/plot-data-pandas-python/ Mon, 01 Mar 2021 15:12:14 +0000

Data visualization plays an important role in data analysis. Pandas is a powerful data analysis library for Python, and it offers several options for data visualization through the .plot() method. Even if you are a beginner, you can easily plot your data using Pandas. For the examples here, you need to import the pandas and matplotlib.pyplot packages.

In this article, we will explore various data plotting methods in Pandas. We have executed all examples in the PyCharm source code editor by using the matplotlib.pyplot package.

Plotting in Pandas Python

In Pandas, the .plot() method has several parameters that you can use based on your needs. Most importantly, the ‘kind’ parameter defines which type of plot is created.

The Syntax for Plotting Data using Pandas Python

The following syntax is used to plot a DataFrame in Pandas Python:

# Import the pandas and matplotlib.pyplot packages
import pandas as pd
import matplotlib.pyplot as plt
# Prepare the data for the DataFrame
data_frame = {
    'Column1': ['field1', 'field2', 'field3', 'field4',...],
    'Column2': ['field1', 'field2', 'field3', 'field4',...]
    }
var_df = pd.DataFrame(data_frame, columns=['Column1', 'Column2'])
print(var_df)
# Plot a bar graph
var_df.plot.bar(x='Column1', y='Column2')
plt.show()

You can also define the plot kind by using the kind parameter as follows:

var_df.plot(x='Column1', y='Column2', kind='bar')

Pandas DataFrame objects have the following plot methods:

  • Scatter Plotting: plot.scatter()
  • Bar Plotting: plot.bar() and plot.barh(), where the h stands for horizontal bars.
  • Line Plotting: plot.line()
  • Pie Plotting: plot.pie()

If you call the plot() method without the ‘kind’ parameter, it creates a line graph by default.
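As a minimal sketch of this default, the following snippet calls plot() with and without the kind parameter on a small made-up DataFrame. The sales data and all variable names are hypothetical and not part of the original examples. The Agg backend line is only an assumption that lets the snippet run without a display; remove it to see the plot windows.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, not taken from the article
sales = pd.DataFrame({"units": [3, 5, 4, 6]}, index=[2018, 2019, 2020, 2021])

default_ax = sales.plot()               # no 'kind' given, so a line plot is drawn
explicit_ax = sales.plot(kind="line")   # the same chart, requested explicitly

# Each call returns a matplotlib Axes that holds one line per numeric column
n_default_lines = len(default_ax.get_lines())
n_explicit_lines = len(explicit_ax.get_lines())
plt.close("all")
```

Both calls should produce the same kind of chart, which is why the article's later examples can use either plot(kind='line') or plot.line().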

We will now elaborate on some major types of plotting in detail with the help of some examples.

Scatter Plotting in Pandas

A scatter plot represents the relationship between two variables. Let’s take an example.

Example

For example, we have data on two variables, GDP_growth and Oil_Price. To plot the relationship between the two variables, we have executed the following piece of code in our source code editor:

import matplotlib.pyplot as plt
import pandas as pd
gdp_cal = {
    'GDP_growth': [6.1, 5.8, 5.7, 5.7, 5.8, 5.6, 5.5, 5.3, 5.2, 5.2],
    'Oil_Price': [1500, 1520, 1525, 1523, 1515, 1540, 1545, 1560, 1555, 1565]
}
df = pd.DataFrame(gdp_cal, columns=['Oil_Price', 'GDP_growth'])
print(df)
df.plot(x='Oil_Price', y='GDP_growth', kind = 'scatter', color= 'red')
plt.show()

Line Charts Plotting in Pandas

A line chart is a basic type of plot in which the given information is displayed as a series of data points connected by straight line segments. Using line charts, you can also show trends in the information over time.

Example

In the below-mentioned example, we have taken data about the inflation rate in past years. First, prepare the data and then create the DataFrame. The following source code plots the line graph of the available data:

import pandas as pd
import matplotlib.pyplot as plt

infl_cal = {'Year': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011],
        'Infl_Rate': [5.8, 10, 7, 6.7, 6.8, 6, 5.5, 8.2, 8.5, 9, 10]
        }
data_frame = pd.DataFrame(infl_cal, columns=['Year', 'Infl_Rate'])
data_frame.plot(x='Year', y='Infl_Rate', kind='line')
plt.show()

In the above example, you need to set kind='line' for line chart plotting.

Method 2: Using the plot.line() method

You can also implement the above example using the following method:

import pandas as pd
import matplotlib.pyplot as plt

inf_cal = {'Year': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011],
        'Inflation_Rate': [5.8, 10, 7, 6.7, 6.8, 6, 5.5, 8.2, 8.5, 9, 10]
        }
data_frame = pd.DataFrame(inf_cal, columns=['Inflation_Rate'], index=[2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011])
data_frame.plot.line()

plt.title('Inflation Rate Summary of Past 11 Years')
plt.ylabel('Inflation_Rate')
plt.xlabel('Year')
plt.show()

Running the above code displays a line graph of the inflation rate by year.

Bar Chart Plotting in Pandas

A bar chart is used to represent categorical data. In this type of plot, rectangular bars with different heights are plotted based on the given information. A bar chart can be plotted in either the vertical or the horizontal direction.

Example

We have taken the literacy rate of several countries in the following example. A DataFrame is created in which ‘Country_Names’ and ‘litr_Rate’ are the two columns. Using Pandas, you can plot the information as a bar graph as follows:

import pandas as pd
import matplotlib.pyplot as plt

lit_cal = {
    'Country_Names': ['Pakistan', 'USA', 'China', 'India', 'UK', 'Austria', 'Egypt', 'Ukraine', 'Saudia', 'Australia',
                      'Malaysia'],
    'litr_Rate': [5.8, 10, 7, 6.7, 6.8, 6, 5.5, 8.2, 8.5, 9, 10]
    }
data_frame = pd.DataFrame(lit_cal, columns=['Country_Names', 'litr_Rate'])
print(data_frame)
data_frame.plot.bar(x='Country_Names', y='litr_Rate')
plt.show()

You can also implement the above example using the following method. Set kind='bar' for bar chart plotting in this line:

data_frame.plot(x='Country_Names', y='litr_Rate', kind='bar')
plt.show()

Horizontal bar chart plotting

You can also plot the data on horizontal bars by executing the following code:

import matplotlib.pyplot as plt
import pandas as pd

data_chart = {'litr_Rate': [5.8, 10, 7, 6.7, 6.8, 6, 5.5, 8.2, 8.5, 9, 10]}
df = pd.DataFrame(data_chart, columns=['litr_Rate'], index=['Pakistan', 'USA', 'China', 'India', 'UK', 'Austria', 'Egypt', 'Ukraine', 'Saudia', 'Australia',
                      'Malaysia'])

df.plot.barh()

plt.title('Literacy Rate in Various Countries')
plt.ylabel('Country_Names')
plt.xlabel('litr_Rate')
plt.show()

In df.plot.barh(), barh is used for horizontal plotting. Running the above code displays a horizontal bar chart with one bar per country.

Pie Chart Plotting in Pandas

A pie chart represents data in a circular graphic in which the data is divided into slices sized according to the given quantities.

Example

In the following example, we have displayed the information in the ‘Earth_Part’ column as different slices of a pie chart. First, create the DataFrame; then, by using Pandas, display all the details on the graph.

import pandas as pd
import matplotlib.pyplot as plt

material_per = {'Earth_Part': [71,18,7,4]}
dataframe = pd.DataFrame(material_per,columns=['Earth_Part'],index = ['Water','Mineral','Sand','Metals'])

dataframe.plot.pie(y='Earth_Part',figsize=(7, 7),autopct='%1.1f%%', startangle=90)
plt.show()

The above source code plots a pie chart of the available data, with each slice labeled by its percentage.

Conclusion

In this article, you have seen how to plot DataFrames in Pandas. Different kinds of plots are covered in the above article. To plot more kinds such as box, hexbin, hist, kde, density, and area, you can use similar source code by changing the plot kind, although some kinds need extra arguments or suitable data.
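As a rough sketch of switching kinds, the snippet below draws the inflation figures from the line chart example as a histogram, a box plot, and an area plot by changing only the kind argument. The variable names are hypothetical, and the Agg backend line is an assumption that lets the snippet run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Inflation figures reused from the line chart example
rates = pd.DataFrame({"Infl_Rate": [5.8, 10, 7, 6.7, 6.8, 6, 5.5, 8.2, 8.5, 9, 10]})

# Only the 'kind' argument changes between the three plots
kinds = ["hist", "box", "area"]
axes_by_kind = {kind: rates.plot(kind=kind, title=kind) for kind in kinds}

titles = [ax.get_title() for ax in axes_by_kind.values()]
plt.close("all")
```

Kinds such as hexbin additionally need x and y columns, so they are left out of this sketch.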

How to Use Group by in Pandas Python https://linuxhint.com/use-group-by-pandas-python/ Fri, 26 Feb 2021 17:49:41 +0000

The Pandas groupby function groups the rows of a DataFrame based on the values in one or more columns or on other rules. Grouping makes a dataset easier to manage because all related records are arranged together. A typical groupby operation has three steps. First, the data is split into groups based on some condition. Then, a function is applied to each group. Finally, the results are combined into a single data structure.
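The three steps can be sketched by hand and compared with the built-in shortcut. This is only an illustration on a small made-up DataFrame; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Designation": ["Staff", "HR", "Staff", "HR"],
    "Employee_Age": [23, 43, 26, 30],
})

# 1. Split: iterating over a groupby object yields (key, sub-DataFrame) pairs
# 2. Apply: compute the mean age of each piece
pieces = {key: group["Employee_Age"].mean() for key, group in df.groupby("Designation")}

# 3. Combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="Employee_Age")

# The built-in aggregation performs the same three steps in one call
built_in = df.groupby("Designation")["Employee_Age"].mean()
print(manual)
print(built_in)
```

In practice the one-line form is preferred; the manual loop is shown only to make the split, apply, and combine steps visible.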

In this article, we will walk through the basic uses of the groupby function in Pandas. All commands are executed in the PyCharm editor.

Let’s discuss the main concept of the group with the help of the employee’s data. We have created a dataframe with some useful employee details (Employee_Names, Designation, Employee_city, Age).

String Concatenation using Group by Function

Using the groupby function, you can concatenate strings. Records that share the same key can be joined with ‘,’ in a single cell.

Example

In the following example, we have grouped the data by the ‘Designation’ column and joined the names of the employees who have the same designation. The lambda function is applied on ‘Employee_Names’.

import pandas as pd
df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df1=df.groupby("Designation")['Employee_Names'].apply(lambda Employee_Names: ','.join(Employee_Names))
print(df1)

When the above code is executed, the output lists each designation followed by the comma-separated names of its employees.

Sorting Values in Ascending Order

Convert the groupby result into a regular DataFrame by calling ‘.to_frame()’, and then use reset_index() to turn the group keys back into a column. Sort the column values by calling sort_values(), which sorts in ascending order by default.

Example

In this example, we will sort the employees’ ages in ascending order. Using the following piece of code, we have retrieved ‘Employee_Age’ in ascending order along with ‘Employee_Names’.

import pandas as pd

df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})

df1=df.groupby('Employee_Names')['Employee_Age'].sum().to_frame().reset_index().sort_values(by='Employee_Age')

print(df1)

Use of aggregates with groupby

Several aggregation functions can be applied to data groups, such as count(), sum(), mean(), median(), std(), min(), and max(). A groupby object has no mode() method of its own, but the mode can be computed with apply(pd.Series.mode).

Example

In this example, we have used the ‘count()’ function with groupby to count the employees who belong to the same ‘Employee_city’.

import pandas as pd
df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df1=df.groupby('Employee_city').count()
print(df1)

In the output, the Employee_Names, Designation, and Employee_Age columns each show the number of records that belong to the same city.
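Several aggregations can also be computed in one call by passing a list of function names to agg(). The following is a small sketch on a shortened, hypothetical version of the employee data.

```python
import pandas as pd

# Shortened, hypothetical version of the employee data
df = pd.DataFrame({
    "Employee_city": ["Karachi", "Karachi", "Islamabad", "Islamabad", "Lahore"],
    "Employee_Age": [60, 23, 25, 32, 26],
})

# One row per city, one column per aggregation
age_stats = df.groupby("Employee_city")["Employee_Age"].agg(["count", "min", "max", "mean"])
print(age_stats)
```

Selecting the ‘Employee_Age’ column before aggregating keeps text columns out of numeric aggregations such as mean.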

Visualize data using groupby

By importing matplotlib.pyplot, you can visualize grouped data as graphs.

Example

Here, the following example visualizes ‘Employee_Age’ against ‘Employee_Names’ from the given DataFrame by using the groupby statement.

import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
# Sum only the numeric age column so that the text columns are not aggregated
dataframe.groupby('Employee_Names')['Employee_Age'].sum().plot(kind='bar')
plt.show()

Example

To plot a stacked graph using groupby, set ‘stacked=True’ and use the following code:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df.groupby(['Employee_city','Employee_Names']).size().unstack().plot(kind='bar', stacked=True, fontsize=6)
plt.show()

In the resulting graph, the employees who belong to the same city are stacked in a single bar.

Change Column Name with the group by

You can also rename the aggregated column as follows:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df1 = df.groupby('Employee_Names')['Designation'].sum().reset_index(name='Employee_Designation')
print(df1)

In the above example, the ‘Designation’ name is changed to ‘Employee_Designation’.

Retrieve Group by key or value

Using the groupby statement, you can retrieve similar records or values from the dataframe.

Example

In the below-given example, we have grouped the data based on ‘Designation’. Then, the ‘Staff’ group is retrieved by using .get_group('Staff').

import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})

extract_value = df.groupby('Designation')
print(extract_value.get_group('Staff'))

Running the code prints only the rows whose ‘Designation’ is ‘Staff’.
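Before calling get_group(), it can help to check which keys exist. The sketch below, on a shortened and hypothetical version of the employee data, lists the group keys through the groups attribute and then retrieves one group.

```python
import pandas as pd

# Shortened, hypothetical version of the employee data
df = pd.DataFrame({
    "Employee_Names": ["Sam", "Ali", "Mahwish", "Hania"],
    "Designation": ["Manager", "Staff", "HR", "Staff"],
})

grouped = df.groupby("Designation")

# 'groups' maps each group key to the row labels that belong to it
group_keys = sorted(grouped.groups.keys())

# get_group() returns the rows of a single group as a DataFrame
staff_names = grouped.get_group("Staff")["Employee_Names"].tolist()
print(group_keys)
print(staff_names)
```

Requesting a key that is not in the groups raises a KeyError, so checking the keys first avoids that.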

Add Value into group List

Similar data can be displayed in the form of a list by using the groupby statement. First, group the data based on a condition. Then, by applying a function, you can easily collect each group into a list.

Example

In this example, we have collected similar records into lists. All the employees are divided into groups based on ‘Employee_city’, and then, by applying a lambda function, each group is retrieved in the form of a list.

import pandas as pd

df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df1=df.groupby('Employee_city')['Employee_Names'].apply(lambda group_series: group_series.tolist()).reset_index()
print(df1)

Use of Transform function with groupby

In the following code, the employees are grouped by name, the ages within each group are added together, and the ‘transform’ function writes the result back to every row as a new ‘sum’ column:

import pandas as pd

df = pd.DataFrame({
   'Employee_Names':['Sam', 'Ali' , 'Umar', 'Raees', 'Mahwish', 'Hania', 'Mirha', 'Maria', 'Hamza'],
   'Designation':['Manager', 'Staff', 'IT officer', 'IT officer', 'HR', 'Staff', 'HR', 'Staff', 'Team Lead'],
   'Employee_city':['Karachi', 'Karachi', 'Islamabad', 'Islamabad', 'Quetta', 'Lahore', 'Faislabad', 'Lahore', 'Islamabad'],
   'Employee_Age':[60, 23, 25, 32, 43, 26, 30, 23, 35]
})
df['sum']=df.groupby(['Employee_Names'])['Employee_Age'].transform('sum')
print(df)
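The difference between an aggregation and transform() is easier to see when a group holds more than one row. The following sketch groups a shortened, hypothetical version of the data by city; the ‘city_age_sum’ column name is made up for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the employee data
df = pd.DataFrame({
    "Employee_Names": ["Sam", "Ali", "Umar", "Raees"],
    "Employee_city": ["Karachi", "Karachi", "Islamabad", "Islamabad"],
    "Employee_Age": [60, 23, 25, 32],
})

# Aggregation: one value per city
per_city = df.groupby("Employee_city")["Employee_Age"].sum()

# transform: one value per original row, aligned with the DataFrame's index
df["city_age_sum"] = df.groupby("Employee_city")["Employee_Age"].transform("sum")
print(per_city)
print(df)
```

Because transform() keeps the original shape, its result can be assigned directly as a new column.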

Conclusion

We have explored the different uses of the groupby statement in this article. We have shown how you can divide the data into groups and, by applying different aggregations or functions, easily retrieve or summarize these groups.

How to Create Pandas DataFrame in Python? https://linuxhint.com/create-pandas-dataframe-in-python/ Thu, 18 Feb 2021 16:14:04 +0000

A Pandas DataFrame is a 2D (two-dimensional) labeled data structure in which data is arranged in tabular form with rows and columns. For easier understanding, a DataFrame behaves like a spreadsheet with three components: index, columns, and data. DataFrames are the most commonly used Pandas objects.

Pandas DataFrames can be created using different methods. This article explains several common methods through which you can create a Pandas DataFrame in Python. We have run all examples in the PyCharm tool. Let’s start the implementation of each method one by one.

Basic Syntax

Use the following syntax to create a DataFrame in Pandas:

pd.DataFrame(Df_data)

Example: Let’s explain with an example. In this case, we have stored the students’ names and percentages in a ‘Students_Data’ variable. Then, using pd.DataFrame(), we have created a DataFrame for displaying the students’ results.

import pandas as pd
Students_Data = {
   'Name':['Samreena', 'Asif', 'Mahwish', 'Raees'],
   'Percentage':[90,80,70,85]}
result = pd.DataFrame(Students_Data)
print (result)

Methods to Create Pandas DataFrames

Pandas DataFrames can be created in the different ways that we will discuss in the rest of the article. We will print the students’ course results in the form of a DataFrame. Using any one of the following methods, you can create a similar DataFrame with the columns ‘Student_Name’, ‘Course_Title’, and ‘GPA’.

Method # 01: Creating Pandas DataFrame from the dictionary of lists

In the following example, a DataFrame is created from a dictionary of lists related to the students’ course results. First, import the pandas library and then create a dictionary of lists. The dict keys represent the column names such as ‘Student_Name’, ‘Course_Title’, and ‘GPA’. The lists represent the columns’ data or content. The ‘dictionary_lists’ variable contains the student data, which is passed to pd.DataFrame() and assigned to the ‘dframe’ variable. Using the print statement, print all the content of the DataFrame.

Example:

# Import the pandas library
import pandas as pd
# Create a dictionary of list
dictionary_lists = {
   'Student_Name': ['Samreena', 'Raees', 'Sara', 'Sana'],
   'Course_Title': ['SQA','SRE','IT Basics', 'Artificial intelligence'],
   'GPA': [3.1, 3.3, 2.8, 4.0]}
# Create the DataFrame
dframe = pd.DataFrame(dictionary_lists)
print(dframe)

After executing the above code, the DataFrame is printed with one row per student.

Method # 02: Create Pandas DataFrame from the dictionary of NumPy arrays

A DataFrame can be created from a dict of arrays or lists. For this purpose, all the ndarrays must have the same length. If an index is passed, then its length should be equal to the arrays’ length. If no index is passed, the default index is range(n), where n represents the arrays’ length.

Example:

import numpy as np
import pandas as pd
# Create a NumPy array (mixed types are all stored as strings)
nparray = np.array(
   [['Samreena', 'Raees', 'Sara', 'Sana'],
    ['SQA', 'SRE', 'IT Basics','Artificial Intelligence'],
    [3.1, 3.3, 2.8, 4.0]])
# Create a dictionary from the rows of nparray
dictionary_of_nparray = {
   'Student_Name': nparray[0],
   'Course_Title': nparray[1],
   'GPA': nparray[2].astype(float)}  # convert the GPA strings back to numbers
# Create the DataFrame
dframe = pd.DataFrame(dictionary_of_nparray)
print(dframe)
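The index rules described for this method can be sketched as follows. The ‘Credits’ column and the variable names are hypothetical and serve only to illustrate the default index, a custom index, and the error raised for arrays of different lengths.

```python
import numpy as np
import pandas as pd

gpa = np.array([3.1, 3.3, 2.8, 4.0])
credits = np.array([3, 4, 3, 2])  # hypothetical column, not in the article

# No index passed: the default index is range(n)
default_index = pd.DataFrame({"GPA": gpa, "Credits": credits})

# Index passed: its length must equal the arrays' length
custom_index = pd.DataFrame(
    {"GPA": gpa, "Credits": credits},
    index=["Samreena", "Raees", "Sara", "Sana"],
)

# Arrays of different lengths are rejected with a ValueError
try:
    pd.DataFrame({"GPA": gpa, "Credits": credits[:3]})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(default_index)
print(custom_index)
```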

Method # 03: Creating pandas DataFrame using the list of lists

In the following code, each inner list represents a single row.

Example:

# Import library Pandas pd
import pandas as pd
# Create a list of lists
group_lists = [
   ['Samreena', 'SQA', 3.1],
   ['Raees', 'SRE', 3.3],
   ['Sara', 'IT Basics', 2.8],
   ['Sana', 'Artificial Intelligence', 4.0]]
# Create the DataFrame
dframe = pd.DataFrame(group_lists, columns = ['Student_Name', 'Course_Title', 'GPA'])
print(dframe)

Method # 04: Creating pandas DataFrame using the list of dictionaries

In the following code, each dictionary represents a single row, and its keys represent the column names.

Example:

# Import library pandas
import pandas as pd
# Create a list of dictionaries
dict_list = [
   {'Student_Name': 'Samreena', 'Course_Title': 'SQA', 'GPA': 3.1},
   {'Student_Name': 'Raees', 'Course_Title': 'SRE', 'GPA': 3.3},
   {'Student_Name': 'Sara', 'Course_Title': 'IT Basics', 'GPA': 2.8},
   {'Student_Name': 'Sana', 'Course_Title': 'Artificial Intelligence', 'GPA': 4.0}]
# Create the DataFrame
dframe = pd.DataFrame(dict_list)
print(dframe)

Method # 05: Creating pandas DataFrame from dict of pandas Series

The dict keys represent the names of columns and each Series represents column contents. In the following lines of code, we have taken three types of series: Name_series, Course_series, and GPA_series.

Example:

# Import library pandas
import pandas as pd
# Create one Series per column
Name_series = pd.Series(['Samreena', 'Raees', 'Sara', 'Sana'])
Course_series = pd.Series(['SQA', 'SRE', 'IT Basics', 'Artificial intelligence'])
GPA_series = pd.Series([3.1, 3.3, 2.8, 4.0])
# Create a dictionary of Series; the keys become the column names
dictionary_of_series = {'Student_Name': Name_series, 'Course_Title': Course_series, 'GPA': GPA_series}
# DataFrame creation
dframe = pd.DataFrame(dictionary_of_series)
print(dframe)

Method # 06: Create Pandas DataFrame by using the zip() function

Different lists can be merged through the list(zip()) function. In the following example, three different lists are created and merged into a list of tuples. The pandas DataFrame is then created by calling the pd.DataFrame() function.

Example:

import pandas as pd
# List1
Student_Name = ['Samreena', 'Raees', 'Sara', 'Sana']
# List2
Course_Title = ['SQA', 'SRE', 'IT Basics', 'Artificial Intelligence']
# List3
GPA = [3.1, 3.3, 2.8, 4.0]
# Merge the three lists into a list of tuples by using zip().
tuples = list(zip(Student_Name, Course_Title, GPA))
# Convert the list of tuples into a pandas DataFrame.
dframe = pd.DataFrame(tuples, columns=['Student_Name', 'Course_Title', 'GPA'])
# Print data.
print(dframe)
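To see what zip() hands to pd.DataFrame(), the intermediate list of tuples can be inspected directly. The sketch below uses shortened lists with hypothetical variable names and also shows that zip() stops at the shortest input, so lists of unequal length silently lose rows.

```python
import pandas as pd

names = ["Samreena", "Raees", "Sara"]
gpas = [3.1, 3.3, 2.8]

# Each tuple pairs the items found at the same position in every list
rows = list(zip(names, gpas))

# zip() stops at the shortest list, so the third name is dropped here
short_rows = list(zip(names, gpas[:2]))

# Each tuple becomes one row of the DataFrame
dframe = pd.DataFrame(rows, columns=["Student_Name", "GPA"])
print(rows)
print(dframe)
```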

Conclusion

Using the above methods, you can create Pandas DataFrames in Python. We have printed the students’ course GPAs by creating Pandas DataFrames. Hopefully, you will get useful results after running the above-mentioned examples. All programs are commented for better understanding. If you know more ways to create Pandas DataFrames, then do not hesitate to share them with us. Thanks for reading this tutorial.

How to Join DataFrames in Pandas Python? https://linuxhint.com/join-dataframes-in-pandas-python/ Wed, 17 Feb 2021 10:32:27 +0000

A Pandas DataFrame is a two-dimensional (2D) data structure that is arranged in a tabular format. DataFrames can be combined using different methods such as concat(), merge(), and join(). Pandas provides high-performance, full-featured join operations that resemble those of SQL relational databases. Using the merge function, join operations can be implemented between DataFrame objects.

In this article, we will explore the uses of the merge function, the concat function, and different types of join operations in Pandas. All examples are executed in the PyCharm editor. Let’s start with the details!

Use of Merge Function

The basic, commonly used syntax of the merge() function is given below:

pd.merge(df_obj1, df_obj2, how='inner', on=None, left_on=None, right_on=None)

Let’s explain the details of the parameters:

The first two arguments, df_obj1 and df_obj2, are the DataFrame objects or tables to be joined.

The “how” parameter selects the type of join operation: “left”, “right”, “outer”, or “inner”. The merge function uses the “inner” join operation by default.

The argument “on” contains the column name on which the join operation is performed. This column must be present in both DataFrame objects.

The “left_on” argument is the name of the column used as the key in the left DataFrame, and “right_on” is the name of the column used as the key in the right DataFrame. Use these two arguments when the key columns have different names.
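As a small sketch of these parameters, the snippet below joins two made-up DataFrames, once with a shared key name through “on” and once with differently named keys through “left_on” and “right_on”. The orders and items tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables, not taken from the article
orders = pd.DataFrame({"order_id": [1, 2, 3], "item_code": [101, 102, 999]})
items = pd.DataFrame({"code": [101, 102, 103], "item": ["headphones", "Bag", "Shoes"]})

# Same key name in both frames: use 'on' (inner join by default)
same_name = pd.merge(orders.rename(columns={"item_code": "code"}), items, on="code")

# Different key names: use 'left_on' and 'right_on'; both key columns are kept
diff_name = pd.merge(orders, items, left_on="item_code", right_on="code")
print(same_name)
print(diff_name)
```

Because the default is an inner join, the order with code 999 and the item with code 103 do not appear in either result.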

To elaborate on the concept of joining DataFrames, we have taken two DataFrame objects- product and customer. The following details are present in the product DataFrame:

product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})

The customer DataFrame contains the following details:

customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})

Join DataFrames on a Key

Using these two DataFrames, we can easily find the products sold online and the customers who purchased them. So, based on the key “Product_ID”, we have performed an inner join operation on both DataFrames as follows:

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,on='Product_ID'))

Running the above code prints the rows whose “Product_ID” appears in both DataFrames.

If the key columns have different names in the two DataFrames, then explicitly specify each column with the left_on and right_on arguments as follows:

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,left_on='Product_Name',right_on='Product_Purchased'))

The output contains the rows in which “Product_Name” matches “Product_Purchased”.

Join DataFrames using How Argument

In the following examples, we will explain four types of Joins operations on Pandas DataFrames:

  • Inner Join
  • Outer Join
  • Left Join
  • Right Join

Inner Join in Pandas

We can perform an inner join on multiple keys. To display more details about the product sales, take “Product_ID” and “Seller_City” from the product DataFrame and “Product_ID” and “Customer_City” from the customer DataFrame to find the sales in which the seller and the customer belong to the same city. Implement the following lines of code:

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,how='inner',left_on=['Product_ID','Seller_City'],right_on=['Product_ID','Customer_City']))

Running the above code prints only the rows in which both the product ID and the city match.

Full/outer join in Pandas

An outer join returns all rows from both the left and right DataFrames, whether or not they have a match in the other DataFrame. To implement the outer join, set the “how” argument to “outer”. Let’s modify the above example by using the outer join concept. The code below returns all values of both the left and right DataFrames.

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,on='Product_ID',how='outer'))

Set the indicator argument to True. You will notice that a new “_merge” column is added at the end.

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,on='Product_ID',how='outer',indicator=True))

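As the output shows, the “_merge” column values explain which DataFrame each row came from: “left_only” for rows found only in the left DataFrame, “right_only” for rows found only in the right DataFrame, and “both” for rows matched in both.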
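
The “_merge” column can also be used to filter the merged result. The sketch below assumes reduced copies of the two DataFrames; the names merged, counts, and unsold are introduced for illustration only.

```python
import pandas as pd

# Reduced copies of the article's DataFrames
product = pd.DataFrame({'Product_ID': [101, 102, 103, 104, 105, 106, 107]})
customer = pd.DataFrame({
    'Customer_Name': ['Sara', 'Sana', 'Ali', 'Raees', 'Mahwish',
                      'Umar', 'Mirha', 'Asif', 'Maria'],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
})

merged = pd.merge(product, customer, on='Product_ID', how='outer', indicator=True)

# How many rows came from each side, and how many matched
counts = merged['_merge'].value_counts()
print(counts)

# Keep only the products that no customer purchased
unsold = merged.loc[merged['_merge'] == 'left_only', 'Product_ID'].tolist()
print(sorted(unsold))
```

Filtering on “left_only” or “right_only” in this way is a common approach for finding the rows that failed to match.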

Left Join in Pandas

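A left join keeps every row of the left DataFrame and adds the matching values from the right DataFrame. Left rows that have no match get NaN in the right DataFrame's columns, and unlike the outer join, right rows that have no match are dropped. To use it, change the “how” argument value to “left”. Try the following code to implement the idea of the left join: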

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,on='Product_ID',how='left'))
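
To check what the left join keeps, the sketch below repeats the merge on reduced copies of the two DataFrames. The name left_join is an assumption made for this illustration.

```python
import pandas as pd

# Reduced copies of the article's DataFrames
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107],
    'Product_Name': ['headphones', 'Bag', 'Shoes', 'Smartphone',
                     'Teeth brush', 'wrist watch', 'Laptop'],
})
customer = pd.DataFrame({
    'Customer_Name': ['Sara', 'Sana', 'Ali', 'Raees', 'Mahwish',
                      'Umar', 'Mirha', 'Asif', 'Maria'],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
})

left_join = pd.merge(product, customer, on='Product_ID', how='left')
print(left_join)

# Every product is still present, in its original order
print(left_join['Product_ID'].tolist())

# Products 102 and 105 were never purchased, so Customer_Name is NaN for them
print(left_join.loc[left_join['Customer_Name'].isna(), 'Product_ID'].tolist())
```

The customers with a Product_ID of 0 do not appear in the result because that key does not exist in the left DataFrame. Note that if a product had been purchased by several customers, the left join would return one row per matching customer.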

Right Join in Pandas

The right join keeps all rows of the right DataFrame and adds the matching values from the left DataFrame. Right rows that have no match in the left DataFrame get NaN in the left DataFrame's columns. In this case, the “how” argument is set to “right”. Run the following code to implement the right join concept:

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.merge(product,customer,on='Product_ID',how='right'))

In the following screenshot, you can see the result after running the above code:
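
A right join is the mirror image of a left join: swapping the two DataFrames and using “left” should give the same rows, only with the columns in a different order. The sketch below illustrates this on reduced copies of the data; the names right_join, swapped, and same are introduced here for illustration.

```python
import pandas as pd

# Reduced copies of the article's DataFrames
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107],
    'Product_Name': ['headphones', 'Bag', 'Shoes', 'Smartphone',
                     'Teeth brush', 'wrist watch', 'Laptop'],
})
customer = pd.DataFrame({
    'Customer_Name': ['Sara', 'Sana', 'Ali', 'Raees', 'Mahwish',
                      'Umar', 'Mirha', 'Asif', 'Maria'],
    'Product_ID': [101, 0, 106, 0, 103, 104, 0, 0, 107],
})

# Right join: every customer row is kept
right_join = pd.merge(product, customer, on='Product_ID', how='right')
print(right_join)

# The same join written as a left join with the arguments swapped
swapped = pd.merge(customer, product, on='Product_ID', how='left')

# Compare after aligning column order and row order
a = right_join[swapped.columns].sort_values('Customer_Name').reset_index(drop=True)
b = swapped.sort_values('Customer_Name').reset_index(drop=True)
same = a.equals(b)
print(same)
```

All nine customers are kept, and the four customers whose Product_ID is 0 have NaN in Product_Name.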

Joining DataFrames using the concat() function

Two DataFrames can be joined using the concat function. The basic syntax of the concatenation function is given below:

pd.concat([df_obj1, df_obj2])

The DataFrame objects to be joined are passed together as a list.

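Let’s combine the product and customer DataFrames through the concat function. Unlike merge, concat does not match rows on a key. By default, it stacks the rows of the second DataFrame below the first and fills the columns that exist in only one of them with NaN. Run the following lines of code to join the two DataFrames: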

# import Pandas library

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_Name':['headphones','Bag','Shoes','Smartphone','Teeth brush','wrist watch','Laptop'],
    'Category':['Electronics','Fashion','Fashion','Electronics','Grocery','Fashion','Electronics'],
    'Price':[300.0,1000.50,2000.0,21999.0,145.0,1500.0,90999.0],
    'Seller_City':['Islamabad','Lahore','Karachi','Rawalpindi','Islamabad','Karachi','Faisalabad']
})
customer=pd.DataFrame({
    'ID':[1,2,3,4,5,6,7,8,9],
    'Customer_Name':['Sara','Sana','Ali','Raees','Mahwish','Umar','Mirha','Asif','Maria'],
    'Age':[20,21,15,10,31,52,15,18,16],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Product_Purchased':['headphones','NA','wrist watch','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'Customer_City':['Lahore','Islamabad','Faisalabad','Karachi','Karachi','Islamabad','Rawalpindi','Islamabad',
    'Lahore']
})
print (pd.concat([product,customer]))
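
The sketch below shows how concat behaves on shortened versions of the two DataFrames, including the ignore_index and axis parameters of pd.concat. The shortened data and the names stacked, renumbered, and side_by_side are assumptions made for this illustration.

```python
import pandas as pd

# Shortened versions of the article's DataFrames
product = pd.DataFrame({
    'Product_ID': [101, 102],
    'Product_Name': ['headphones', 'Bag'],
})
customer = pd.DataFrame({
    'ID': [1, 2, 3],
    'Customer_Name': ['Sara', 'Sana', 'Ali'],
    'Product_ID': [101, 0, 106],
})

# Default: rows are stacked and the columns are the union of both DataFrames
stacked = pd.concat([product, customer])
print(stacked)
print(stacked.shape)           # (5, 4)
print(stacked.index.tolist())  # [0, 1, 0, 1, 2] (original indexes are kept)

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([product, customer], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]

# axis=1 places the DataFrames side by side, aligned on the row index
side_by_side = pd.concat([product, customer], axis=1)
print(side_by_side.shape)      # (3, 5)
```

Because concat aligns on the index rather than on Product_ID, the side-by-side result pairs rows by position and not by product. When rows need to be matched on a key, merge is the better fit.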

Conclusion:

In this article, we have discussed the merge() and concat() functions and the join operations in Pandas Python. Using the above methods, you can easily join two DataFrames, and you have learned how to implement the “inner, outer, left, and right” join operations in Pandas. Hopefully, this tutorial will guide you in implementing the join operations on different types of DataFrames. Please let us know if you run into any errors.

]]>