150+ Python/PySpark Pandas DataFrame Practice Exercises with Solutions (Beginner to Expert Level)
As a data analyst, working with tabular data is a fundamental part of your role. Pandas, a popular data manipulation library in Python, offers a powerful tool called the DataFrame to handle and analyze structured data. In this comprehensive guide, we will cover a wide range of exercises that will help you master DataFrame operations using Pandas, including some examples in PySpark.
1. Creating a Simple DataFrame
Let's start by creating a simple DataFrame from scratch. We'll use Pandas to create a DataFrame from a dictionary of data.
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 22, 28],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Los Angeles
3 David 28 Chicago
Explanation:
In the example above, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. Each column contains data for different individuals.
2. Viewing Data
You can use various methods to view and inspect your DataFrame.
# Display the first few rows of the DataFrame
print(df.head())
# Display basic statistics about the DataFrame
print(df.describe())
# Display column names
print(df.columns)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Los Angeles
3 David 28 Chicago
Age
count 4.000000
mean 26.250000
std 3.304038
min 22.000000
25% 24.750000
50% 26.500000
75% 28.000000
max 30.000000
Index(['Name', 'Age', 'City'], dtype='object')
Explanation:
You can use the head() method to view the first few rows of the DataFrame, the describe() method to display basic statistics, and the columns attribute to display column names.
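A few more inspection helpers are worth knowing alongside head() and describe(); a quick sketch on the same data (assuming only pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago'],
})

# Shape: (rows, columns)
print(df.shape)          # (4, 3)

# Per-column dtypes
print(df.dtypes)

# Last two rows instead of the first ones
print(df.tail(2))
```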
3. Selecting Columns
You can select specific columns from a DataFrame using the column names.
# Select the 'Name' and 'Age' columns
selected_columns = df[['Name', 'Age']]
print(selected_columns)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 22
3 David 28
Explanation:
In the example above, we selected only the 'Name' and 'Age' columns from the DataFrame.
4. Filtering Data
You can filter rows based on conditions.
# Filter individuals above the age of 25
filtered_data = df[df['Age'] > 25]
print(filtered_data)
Output:
Name Age City
1 Bob 30 San Francisco
3 David 28 Chicago
Explanation:
The example above filters the DataFrame to include only individuals above the age of 25.
5. Sorting Data
You can sort the DataFrame based on a specific column.
# Sort the DataFrame by 'Age' in ascending order
sorted_data = df.sort_values(by='Age')
print(sorted_data)
Output:
Name Age City
2 Charlie 22 Los Angeles
0 Alice 25 New York
3 David 28 Chicago
1 Bob 30 San Francisco
Explanation:
The example above sorts the DataFrame based on the 'Age' column in ascending order.
6. Aggregating Data
You can perform aggregation functions like sum, mean, max, and min on DataFrame columns.
# Calculate the mean age
mean_age = df['Age'].mean()
print("Mean Age:", mean_age)
# Calculate the maximum age
max_age = df['Age'].max()
print("Max Age:", max_age)
Output:
Mean Age: 26.25
Max Age: 30
Explanation:
In the example above, we calculated the mean and maximum age from the 'Age' column.
7. Data Transformation: Adding a New Column
You can add a new column to the DataFrame.
# Add a new column 'Salary' with random salary values
import random
df['Salary'] = [random.randint(40000, 90000) for _ in range(len(df))]
print(df)
Output:
Name Age City Salary
0 Alice 25 New York 78500
1 Bob 30 San Francisco 62000
2 Charlie 22 Los Angeles 42000
3 David 28 Chicago 74600
Explanation:
In this example, a new column 'Salary' is added to the DataFrame, and random salary values between 40000 and 90000 are assigned to each row.
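random.randint produces different salaries on every run; seeding the generator first makes the example reproducible. A sketch (the seed value is arbitrary):

```python
import random
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 22, 28]})

# Fix the seed so the "random" salaries are the same on every run
random.seed(42)
df['Salary'] = [random.randint(40000, 90000) for _ in range(len(df))]

# All values fall in the requested range
print(df['Salary'].between(40000, 90000).all())  # True
```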
8. Data Transformation: Removing a Column
You can remove a column from the DataFrame using the drop method.
# Remove the 'City' column
df_without_city = df.drop('City', axis=1)
print(df_without_city)
Output:
Name Age Salary
0 Alice 25 78500
1 Bob 30 62000
2 Charlie 22 42000
3 David 28 74600
Explanation:
The 'City' column is removed from the DataFrame using the drop method with axis=1.
9. Filtering Data: Select Rows Based on Condition
You can filter the DataFrame to select rows that meet certain conditions.
# Select rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
Output:
Name Age City Salary
1 Bob 30 San Francisco 62000
3 David 28 Chicago 74600
Explanation:
The filtered_df contains rows where the 'Age' column value is greater than 25.
10. Aggregation: Calculating Mean Salary
You can calculate the mean salary of the employees.
# Calculate the mean salary
mean_salary = df['Salary'].mean()
print("Mean Salary:", mean_salary)
Output:
Mean Salary: 64275.0
Explanation:
The mean salary of the employees is calculated using the mean function on the 'Salary' column.
11. Grouping and Aggregation: Calculate Maximum Age per City
You can group the data by a specific column and calculate aggregation functions within each group.
# Group by 'City' and calculate maximum age
max_age_per_city = df.groupby('City')['Age'].max()
print(max_age_per_city)
Output:
City
Chicago 28
Los Angeles 22
New York 25
San Francisco 30
Name: Age, dtype: int64
Explanation:
The groupby function is used to group the data by the 'City' column, and then the max function is applied to calculate the maximum age within each group.
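From exercise 12 onward, the sample outputs assume a slightly different set of employees than the frame built in exercise 1. A self-contained setup reproducing the data those outputs imply (hypothetical values, reconstructed from the outputs themselves):

```python
import pandas as pd

# Employee frame implied by the outputs of exercises 12-20
df = pd.DataFrame({
    'Name': ['John', 'Robert', 'Mary', 'Bob'],
    'Age': [25, 22, 24, 30],
    'City': ['New York', 'New York', 'San Francisco', 'San Francisco'],
    'Salary': [78500, 65000, 73000, 62000],
})

# Sanity checks against the printed outputs that follow
print(df.groupby('City')['Salary'].mean())       # New York 71750.0, San Francisco 67500.0
print(df[df['Salary'] > 70000]['Name'].tolist())  # ['John', 'Mary']
```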
12. Joining DataFrames: Merge Employee and Department Data
You can merge two DataFrames based on a common column.
# Sample Department DataFrame
department_data = {'City': ['New York', 'San Francisco'],
'Department': ['HR', 'Finance']}
department_df = pd.DataFrame(department_data)
# Merge Employee and Department DataFrames
merged_df = pd.merge(df, department_df, on='City')
print(merged_df)
Output:
Name Age City Salary Department
0 John 25 New York 78500 HR
1 Robert 22 New York 65000 HR
2 Mary 24 San Francisco 73000 Finance
3 Bob 30 San Francisco 62000 Finance
Explanation:
The merge function is used to combine the Employee DataFrame with the Department DataFrame based on the common 'City' column.
13. Filtering Data: Select Employees with Salary Greater Than 70000
You can filter rows based on certain conditions.
# Select employees with salary greater than 70000
high_salary_employees = df[df['Salary'] > 70000]
print(high_salary_employees)
Output:
Name Age City Salary
0 John 25 New York 78500
2 Mary 24 San Francisco 73000
Explanation:
The DataFrame is filtered using the condition df['Salary'] > 70000 to select only those employees whose salary is greater than 70000.
14. Sorting Data: Sort Employees by Age in Descending Order
You can sort the DataFrame based on one or more columns.
# Sort employees by age in descending order
sorted_by_age_desc = df.sort_values(by='Age', ascending=False)
print(sorted_by_age_desc)
Output:
Name Age City Salary
3 Bob 30 San Francisco 62000
0 John 25 New York 78500
2 Mary 24 San Francisco 73000
1 Robert 22 New York 65000
Explanation:
The sort_values function is used to sort the DataFrame based on the 'Age' column in descending order.
15. Grouping and Aggregating Data: Calculate Average Salary by City
You can group data based on a column and then perform aggregation functions.
# Group by city and calculate average salary
avg_salary_by_city = df.groupby('City')['Salary'].mean()
print(avg_salary_by_city)
Output:
City
New York 71750.0
San Francisco 67500.0
Name: Salary, dtype: float64
Explanation:
The groupby function is used to group the data by the 'City' column, and then the mean function is applied to the 'Salary' column to calculate the average salary for each city.
16. Merging DataFrames: Merge Employee and Department DataFrames
You can merge two DataFrames based on a common column.
# Give each employee a department, then create a Department DataFrame
df['DepartmentID'] = [1, 1, 2, 3]
department_data = {'DepartmentID': [1, 2, 3],
'DepartmentName': ['HR', 'Finance', 'IT']}
departments = pd.DataFrame(department_data)
# Merge Employee and Department DataFrames on the shared column
merged_df = pd.merge(df, departments, on='DepartmentID')
print(merged_df)
Output:
Name Age City Salary DepartmentID DepartmentName
0 John 25 New York 78500 1 HR
1 Robert 22 New York 65000 1 HR
2 Mary 24 San Francisco 73000 2 Finance
3 Bob 30 San Francisco 62000 3 IT
Explanation:
The merge function is used to combine the Employee DataFrame and the Department DataFrame based on the 'DepartmentID' column.
17. Sorting Data: Sort Employees by Salary in Descending Order
You can sort the DataFrame based on one or more columns.
# Sort employees by salary in descending order
sorted_df = df.sort_values(by='Salary', ascending=False)
print(sorted_df)
Output:
Name Age City Salary DepartmentID
0 John 25 New York 78500 1
2 Mary 24 San Francisco 73000 2
1 Robert 22 New York 65000 1
3 Bob 30 San Francisco 62000 3
Explanation:
The sort_values function is used to sort the DataFrame by the 'Salary' column in descending order.
18. Dropping Columns: Remove the DepartmentID Column
You can drop unnecessary columns from the DataFrame.
# Drop the DepartmentID column
df_without_dept = df.drop(columns='DepartmentID')
print(df_without_dept)
Output:
Name Age City Salary
0 John 25 New York 78500
1 Robert 22 New York 65000
2 Mary 24 San Francisco 73000
3 Bob 30 San Francisco 62000
Explanation:
The drop function is used to remove the 'DepartmentID' column from the DataFrame.
19. Filtering Data: Get Employees with Salary Above 70000
You can filter rows based on a condition.
# Filter employees with salary above 70000
high_salary_df = df[df['Salary'] > 70000]
print(high_salary_df)
Output:
Name Age City Salary DepartmentID
0 John 25 New York 78500 1
2 Mary 24 San Francisco 73000 2
Explanation:
We use boolean indexing to filter rows where the 'Salary' column is greater than 70000.
20. Grouping Data: Calculate Average Salary by City
You can group data based on one or more columns and perform aggregate functions.
# Group by city and calculate average salary
average_salary_by_city = df.groupby('City')['Salary'].mean()
print(average_salary_by_city)
Output:
City
New York 71750.0
San Francisco 67500.0
Name: Salary, dtype: float64
Explanation:
We use the groupby function to group the data by the 'City' column and then calculate the mean of the 'Salary' column for each group.
21. Renaming Columns: Rename DepartmentID to DeptID
You can rename columns in a DataFrame using the rename method.
# Rename DepartmentID column to DeptID
df.rename(columns={'DepartmentID': 'DeptID'}, inplace=True)
print(df)
Output:
Name Age City Salary DeptID
0 John 25 New York 78500 1
1 Alice 28 San Francisco 62000 2
2 Mary 24 San Francisco 73000 2
Explanation:
We use the rename method and provide a dictionary with the old column name as the key and the new column name as the value. The inplace=True argument applies the change in place.
22. Merging DataFrames: Merge Employee and Department Data
You can merge two DataFrames using the merge function.
# Create department DataFrame
department_data = {'DeptID': [1, 2], 'DepartmentName': ['HR', 'Finance']}
department_df = pd.DataFrame(department_data)
# Merge employee and department DataFrames
merged_df = pd.merge(df, department_df, on='DeptID')
print(merged_df)
Output:
Name Age City Salary DeptID DepartmentName
0 John 25 New York 78500 1 HR
1 Alice 28 San Francisco 62000 2 Finance
2 Mary 24 San Francisco 73000 2 Finance
Explanation:
We create a new DataFrame department_df to represent the department information. Then, we use the merge function to merge the df DataFrame with the department_df DataFrame based on the 'DeptID' column.
23. Grouping and Aggregation: Calculate Average Salary by Department
You can use the groupby method to group the DataFrame by a specific column and then apply aggregation functions.
# Group by DepartmentName and calculate average salary
average_salary_by_department = merged_df.groupby('DepartmentName')['Salary'].mean()
print(average_salary_by_department)
Output:
DepartmentName
Finance 67500.0
HR 78500.0
Name: Salary, dtype: float64
Explanation:
We use the groupby method to group the merged DataFrame by the 'DepartmentName' column, and then calculate the average salary for each department using the mean function on the 'Salary' column within each group.
24. Pivot Table: Create a Pivot Table of Average Salary by Department and Age
Pivot tables allow you to create multi-dimensional summaries of data.
# Create a pivot table of average salary by DepartmentName and Age
pivot_table = merged_df.pivot_table(values='Salary', index='DepartmentName', columns='Age', aggfunc='mean')
print(pivot_table)
Output:
Age 24 25 28
DepartmentName
Finance 73000.0 NaN 62000.0
HR NaN 78500.0 NaN
Explanation:
We use the pivot_table method to create a pivot table that displays the average salary for each combination of 'DepartmentName' and 'Age'. The aggfunc='mean' argument specifies that the aggregation function should be the mean.
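From exercise 25 onward the examples work with a compact three-employee frame that includes an EmployeeID; a self-contained setup matching the outputs that follow (hypothetical values, reconstructed from those outputs):

```python
import pandas as pd

merged_df = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Alice'],
    'Age': [25, 28, 24],
    'DepartmentName': ['Finance', 'HR', 'Finance'],
    'Salary': [80000, 78500, 72000],
})

# Matches the department averages printed later (Finance 76000.0, HR 78500.0)
print(merged_df.groupby('DepartmentName')['Salary'].mean())
```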
25. Selecting Rows Based on Conditions
You can filter rows from a DataFrame based on certain conditions using boolean indexing.
# Select rows where 'Age' is greater than 25
selected_rows = merged_df[merged_df['Age'] > 25]
print(selected_rows)
Output:
EmployeeID Name Age DepartmentName Salary
1 2 Jane 28 HR 78500
Explanation:
We use boolean indexing to filter rows where the 'Age' column is greater than 25.
26. Sorting DataFrame by Columns
You can sort a DataFrame based on one or more columns using the sort_values function.
# Sort DataFrame by 'Salary' column in descending order
sorted_df = merged_df.sort_values(by='Salary', ascending=False)
print(sorted_df)
Output:
EmployeeID Name Age DepartmentName Salary
0 1 John 25 Finance 80000
1 2 Jane 28 HR 78500
2 3 Alice 24 Finance 72000
Explanation:
We use the sort_values function to sort the DataFrame based on the 'Salary' column in descending order.
27. Grouping Data
You can group data based on one or more columns using the groupby function.
# Group data by 'DepartmentName' and calculate the average salary
grouped_data = merged_df.groupby('DepartmentName')['Salary'].mean()
print(grouped_data)
Output:
DepartmentName
Finance 76000.0
HR 78500.0
Name: Salary, dtype: float64
Explanation:
We use the groupby function to group the data by the 'DepartmentName' column and then calculate the average salary for each group.
28. Merging DataFrames
You can merge two DataFrames using the merge function.
# Create a new DataFrame with department-wise average salary
department_avg_salary = merged_df.groupby('DepartmentName')['Salary'].mean().reset_index()
# Merge the original DataFrame with the department-wise average salary DataFrame
merged_with_avg_salary = pd.merge(merged_df, department_avg_salary, on='DepartmentName', suffixes=('', '_avg'))
print(merged_with_avg_salary)
Output:
EmployeeID Name Age DepartmentName Salary Salary_avg
0 1 John 25 Finance 80000 76000.0
1 3 Alice 24 Finance 72000 76000.0
2 2 Jane 28 HR 78500 78500.0
Explanation:
We first calculate the average salary for each department using the groupby function and create a new DataFrame. Then, we use the merge function to combine the original DataFrame with the department-wise average salary DataFrame based on the 'DepartmentName' column.
29. Pivoting Data
You can pivot data using the pivot_table function.
# Create a pivot table to display average salary for each department and age
pivot_table = merged_df.pivot_table(index='DepartmentName', columns='Age', values='Salary', aggfunc='mean')
print(pivot_table)
Output:
Age 24 25 28
DepartmentName
Finance 72000.0 80000.0 NaN
HR NaN NaN 78500.0
Explanation:
We use the pivot_table function to create a pivot table that displays the average salary for each department and age combination.
30. Working with Missing Data
Missing data can be handled using various functions.
# Check for missing values in the DataFrame
missing_values = merged_df.isnull().sum()
print(missing_values)
Output:
EmployeeID 0
Name 0
Age 0
DepartmentName 0
Salary 0
dtype: int64
Explanation:
The isnull() function checks for missing values and returns a boolean DataFrame; sum() then totals the missing values per column.
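The frame here happens to be complete, so every count is zero. With a deliberately missing entry (hypothetical data), the counts become nonzero:

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({'Name': ['John', 'Jane', 'Alice'],
                     'Age': [25.0, np.nan, 24.0]})

# One missing value in 'Age', none in 'Name'
print(demo.isnull().sum())
```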
31. Handling Missing Data
Missing data can be filled using the fillna function.
# Fill missing values in the 'Age' column with the mean age
merged_df['Age'] = merged_df['Age'].fillna(merged_df['Age'].mean())
print(merged_df)
Output:
EmployeeID Name Age DepartmentName Salary
0 1 John 25 Finance 80000
1 2 Jane 28 HR 78500
2 3 Alice 24 Finance 72000
Explanation:
The fillna() function fills the missing values in the 'Age' column with the column's mean; assigning the result back updates the DataFrame. (Calling fillna with inplace=True on a single column is deprecated in recent pandas versions.)
32. Exporting Data to CSV
DataFrames can be exported to CSV files using the to_csv function.
# Export the DataFrame to a CSV file
merged_df.to_csv('employee_data.csv', index=False)
Output:
A CSV file named 'employee_data.csv' will be created in the working directory.
Explanation:
The to_csv() function exports the DataFrame to a CSV file. The index=False parameter prevents the index from being written as an extra column.
33. Exporting Data to Excel
DataFrames can be exported to Excel files using the to_excel function.
# Export the DataFrame to an Excel file
merged_df.to_excel('employee_data.xlsx', index=False)
Output:
An Excel file named 'employee_data.xlsx' will be created in the working directory.
Explanation:
The to_excel() function exports the DataFrame to an Excel file; index=False leaves out the index column. Note that writing Excel files requires an engine such as openpyxl to be installed.
34. Merging DataFrames
DataFrames can be merged using the merge function.
# Merge two DataFrames based on a common column
merged_data = pd.merge(employee_df, department_df, on='DepartmentID')
print(merged_data)
Output:
EmployeeID Name Age DepartmentID Salary DepartmentName
0 1 John 25 1 80000 Finance
1 3 Alice 24 1 72000 Finance
2 2 Jane 28 2 78500 HR
Explanation:
The merge() function merges two DataFrames based on a common column, here 'DepartmentID' (this assumes employee_df and department_df were created earlier with that column). The resulting DataFrame contains columns from both original DataFrames.
35. Grouping and Aggregating Data
Data can be grouped and aggregated using the groupby function.
# Group data by department and calculate average salary
average_salary = merged_data.groupby('DepartmentName')['Salary'].mean()
print(average_salary)
Output:
DepartmentName
Finance 76000.0
HR 78500.0
Name: Salary, dtype: float64
Explanation:
The groupby() function groups the data by the 'DepartmentName' column, and mean() calculates the average salary for each department.
36. Pivot Tables
Pivot tables can be created using the pivot_table function.
# Create a pivot table to display average salary by department and age
pivot_table = merged_data.pivot_table(values='Salary', index='DepartmentName', columns='Age', aggfunc='mean')
print(pivot_table)
Output:
Age 24 25 28
DepartmentName
Finance 72000.0 80000.0 NaN
HR NaN NaN 78500.0
Explanation:
The pivot_table() function creates a pivot table showing the average salary by department and age. The values parameter specifies the column to aggregate, index the rows (DepartmentName), columns the columns (Age), and aggfunc the aggregation function to use.
37. Handling Missing Data
Missing data can be handled using functions like fillna and dropna.
# Fill missing values with a specific value
df_filled = df.fillna(value=0)
# Drop rows with missing values
df_dropped = df.dropna()
Explanation:
The fillna() function fills missing values with a specified value (0 here), while dropna() removes rows that contain missing values.
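A minimal side-by-side of the two strategies on a frame with one gap per column (hypothetical data):

```python
import pandas as pd
import numpy as np

df_gaps = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                        'B': [4.0, 5.0, np.nan]})

# fillna keeps every row, replacing gaps with 0
df_filled = df_gaps.fillna(value=0)

# dropna keeps only rows without any gap
df_dropped = df_gaps.dropna()

print(len(df_filled), len(df_dropped))  # 3 1
```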
38. Sorting Data
DataFrames can be sorted using the sort_values function.
# Sort DataFrame by 'Salary' in ascending order
sorted_df = df.sort_values(by='Salary')
print(sorted_df)
Output:
EmployeeID Name Age DepartmentID Salary
1 3 Alice 24 1 72000
2 2 Jane 28 2 78500
0 1 John 25 1 80000
Explanation:
The sort_values() function sorts the DataFrame by a specified column, in this case 'Salary'. The result is in ascending order by default.
39. Merging DataFrames with Different Columns
DataFrames with different columns can be merged using the merge function with the how parameter.
# Merge two DataFrames with different columns using an outer join
merged_data = pd.merge(df1, df2, how='outer', left_on='ID', right_on='EmployeeID')
print(merged_data)
Output:
ID Name Age EmployeeID Department
0 1 John 25 1 Finance
1 2 Jane 28 2 HR
2 3 Alice 24 3 Finance
3 4 Bob 30 NaN NaN
Explanation:
The merge() function can merge DataFrames whose key columns have different names using left_on and right_on, combined with different types of joins. In this example, an outer join keeps every row from both DataFrames, filling unmatched entries with NaN.
40. Applying Functions to Columns
You can apply custom functions to columns using the apply function.
# Define a custom function
def double_salary(salary):
    return salary * 2
# Apply the custom function to the 'Salary' column
df['Doubled Salary'] = df['Salary'].apply(double_salary)
print(df)
Output:
EmployeeID Name Age DepartmentID Salary Doubled Salary
0 1 John 25 1 80000 160000
1 2 Jane 28 2 78500 157000
2 3 Alice 24 1 72000 144000
Explanation:
The apply() function applies a custom function to each element of a column. Here, double_salary doubles each employee's salary, and df['Salary'].apply(double_salary) stores the result in a new 'Doubled Salary' column.
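For a one-off transformation like this, an inline lambda (or plain vectorized arithmetic) does the same job without a named function; a sketch:

```python
import pandas as pd

emp = pd.DataFrame({'Name': ['John', 'Jane', 'Alice'],
                    'Salary': [80000, 78500, 72000]})

# Same effect as apply(double_salary), written inline
emp['Doubled Salary'] = emp['Salary'].apply(lambda s: s * 2)

# For simple arithmetic, a vectorized expression is faster still
emp['Doubled Again'] = emp['Salary'] * 2

print(emp)
```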
41. Creating Pivot Tables
Pivot tables can be created using the pivot_table function.
# Create a pivot table with 'Department' as columns and 'Age' as values
pivot_table = df.pivot_table(values='Age', columns='Department', aggfunc='mean')
print(pivot_table)
Output:
Department Finance HR
Age 24.5 28.0
Explanation:
The pivot_table() function creates a pivot table from a DataFrame. Here 'Department' supplies the columns and 'Age' the values, with aggfunc='mean' computing the average age for each department.
42. Grouping and Aggregating Data
Data can be grouped and aggregated using the groupby function.
# Group data by 'Department' and calculate the average age and salary
grouped_data = df.groupby('Department').agg({'Age': 'mean', 'Salary': 'mean'})
print(grouped_data)
Output:
Age Salary
Department
Finance 24.5 76000.0
HR 28.0 78500.0
Explanation:
The groupby() function groups the data by 'Department', and agg() applies aggregation functions to the grouped data; here the average age and salary are computed for each department.
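agg() also accepts a list of functions per column, producing one output column per function; a sketch on hypothetical data:

```python
import pandas as pd

staff = pd.DataFrame({'Department': ['Finance', 'Finance', 'HR'],
                      'Salary': [80000, 72000, 78500]})

# Mean and max salary per department in one call
summary = staff.groupby('Department')['Salary'].agg(['mean', 'max'])
print(summary)
```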
43. Merging DataFrames
DataFrames can be merged using the merge function.
# Create two DataFrames
df1 = pd.DataFrame({'EmployeeID': [1, 2, 3],
'Name': ['John', 'Jane', 'Alice'],
'DepartmentID': [1, 2, 1]})
df2 = pd.DataFrame({'DepartmentID': [1, 2],
'DepartmentName': ['Finance', 'HR']})
# Merge the DataFrames based on 'DepartmentID'
merged_df = pd.merge(df1, df2, on='DepartmentID')
print(merged_df)
Output:
EmployeeID Name DepartmentID DepartmentName
0 1 John 1 Finance
1 3 Alice 1 Finance
2 2 Jane 2 HR
Explanation:
The merge() function merges two DataFrames on the common 'DepartmentID' column, producing a new DataFrame with the combined data from both.
44. Handling Missing Values
Missing values can be handled using functions like dropna and fillna.
# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)
# Fill missing values with a specific value
filled_df = df.fillna(value=0)
print(filled_df)
Output:
EmployeeID Name Age DepartmentID Salary
0 1 John 25 1 80000
1 2 Jane 28 2 78500
2 3 Alice 24 1 72000
EmployeeID Name Age DepartmentID Salary
0 1 John 25 1 80000
1 2 Jane 28 2 78500
2 3 Alice 24 1 72000
Explanation:
The dropna() function removes rows with any missing values, while fillna() fills missing values with a specified value, in this case 0.
45. Changing Column Data Types
Column data types can be changed using the astype function.
# Change the data type of 'Salary' column to float
df['Salary'] = df['Salary'].astype(float)
print(df.dtypes)
Output:
EmployeeID int64
Name object
Age int64
DepartmentID int64
Salary float64
dtype: object
Explanation:
The astype() function changes a column's data type; here the 'Salary' column is converted from integer to float.
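astype(float) raises if a value cannot be parsed; when a column may contain bad entries, pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN. A sketch:

```python
import pandas as pd

raw = pd.Series(['80000', '72000', 'n/a'])

# 'n/a' cannot be parsed; coerce turns it into NaN instead of raising
salaries = pd.to_numeric(raw, errors='coerce')
print(salaries)
```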
46. Grouping and Aggregating Data
Data can be grouped and aggregated using the groupby function.
# Group data by 'DepartmentID' and calculate the average salary
grouped_df = df.groupby('DepartmentID')['Salary'].mean()
print(grouped_df)
Output:
DepartmentID
1 76000.0
2 78500.0
Name: Salary, dtype: float64
Explanation:
The groupby() function groups the data by 'DepartmentID', and mean() calculates the average salary for each department.
47. Pivoting Data
Data can be pivoted using the pivot_table function.
# Pivot the data to show average salary by department and age
pivot_df = df.pivot_table(values='Salary', index='DepartmentID', columns='Age', aggfunc='mean')
print(pivot_df)
Output:
Age 24 25 28
DepartmentID
1 72000.0 80000.0 NaN
2 NaN NaN 78500.0
Explanation:
The pivot_table() function builds a pivot table of average salary by department and age: values selects the column to aggregate, index the rows, columns the columns, and aggfunc the aggregation function to use.
48. Exporting Data to CSV
Data can be exported to a CSV file using the to_csv function.
# Export DataFrame to a CSV file
df.to_csv('employee_data.csv', index=False)
Explanation:
The to_csv() function exports a DataFrame to a CSV file; index=False excludes the index column from the exported file.
49. Exporting Data to Excel
Data can be exported to an Excel file using the to_excel function.
# Export DataFrame to an Excel file
df.to_excel('employee_data.xlsx', index=False)
Explanation:
The to_excel() function exports a DataFrame to an Excel file; index=False excludes the index column from the exported file.
50. Joining DataFrames
DataFrames can be joined using the merge function.
# Join two DataFrames based on a common column
result_df = pd.merge(df1, df2, on='EmployeeID')
print(result_df)
Output:
EmployeeID Name_x Salary_x Name_y Salary_y
0 1 John 60000 Alice 55000
1 2 Mary 75000 Bob 60000
Explanation:
The merge() function combines two DataFrames on the common 'EmployeeID' column. Columns that exist in both frames receive the default _x and _y suffixes, as seen in the output.
51. Merging DataFrames
DataFrames can be merged using the merge function with different types of joins.
# Merge DataFrames with different types of joins
inner_join_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
left_join_df = pd.merge(df1, df2, on='EmployeeID', how='left')
right_join_df = pd.merge(df1, df2, on='EmployeeID', how='right')
outer_join_df = pd.merge(df1, df2, on='EmployeeID', how='outer')
Explanation:
The merge() function can perform different types of joins based on the how parameter. The available options are 'inner' (intersection of keys), 'left' (keys from the left DataFrame), 'right' (keys from the right DataFrame), and 'outer' (union of keys).
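A self-contained comparison of the four join types on two small frames with partially overlapping keys (hypothetical IDs):

```python
import pandas as pd

left = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['John', 'Jane', 'Alice']})
right = pd.DataFrame({'EmployeeID': [2, 3, 4], 'Department': ['HR', 'Finance', 'IT']})

# inner: keys in both (2, 3)   -> 2 rows
# left:  all keys from left    -> 3 rows
# right: all keys from right   -> 3 rows
# outer: union of keys (1..4)  -> 4 rows
sizes = {how: len(pd.merge(left, right, on='EmployeeID', how=how))
         for how in ['inner', 'left', 'right', 'outer']}
print(sizes)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```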
52. Handling Missing Data
Missing data can be handled using the fillna function.
# Fill missing values with a specific value
df['Salary'] = df['Salary'].fillna(0)
Explanation:
The fillna() function fills missing values in a specific column with a specified value; assigning the result back updates the DataFrame.
53. Grouping and Aggregating Data
Data can be grouped and aggregated using the groupby function.
# Grouping and aggregating data
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)
Output:
Department
HR 65000.0
IT 70000.0
Sales 60000.0
Name: Salary, dtype: float64
Explanation:
The groupby() function groups the data by a specific column (in this case, 'Department'), and mean() is applied to the 'Salary' column to calculate the average salary for each department.
54. Pivot Tables
Pivot tables can be created using the pivot_table function.
# Creating a pivot table
pivot_table = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
print(pivot_table)
Output:
Salary
Department
HR 65000.0
IT 70000.0
Sales 60000.0
Explanation:
The pivot_table() function creates a pivot table that summarizes and aggregates data based on specified columns. Here the 'Department' column is the index, and the 'Salary' values are aggregated with the mean.
55. Creating a Bar Plot
Bar plots can be created using the plot function (with matplotlib installed).
# Creating a bar plot
import matplotlib.pyplot as plt

df['Salary'].plot(kind='bar')
plt.xlabel('Employee')
plt.ylabel('Salary')
plt.title('Employee Salaries')
plt.show()
Output:
An interactive bar plot will be displayed.
Explanation:
The plot() function can create various types of plots, including bar plots; kind='bar' selects a bar plot. xlabel(), ylabel(), and title() label the axes and the chart, and show() displays the plot.
56. Creating a Histogram
Histograms can be created using the hist function.
# Creating a histogram
df['Age'].hist(bins=10)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
Output:
An interactive histogram will be displayed.
Explanation:
The hist() function draws a histogram; the bins parameter sets the number of intervals. xlabel(), ylabel(), and title() add labels, and show() displays the plot.
57. Creating a Box Plot
Box plots can be created using the boxplot function.
# Creating a box plot
df.boxplot(column='Salary', by='Department')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.title('Salary Distribution by Department')
plt.suptitle('')
plt.show()
Output:
An interactive box plot will be displayed.
Explanation:
The boxplot() function visualizes the distribution of a numerical variable ('Salary') across categories ('Department'): the column parameter selects the column to plot, and by selects the grouping variable. Labels and a title are added with xlabel(), ylabel(), and title(); suptitle('') clears the automatic title pandas adds above the plot, and show() displays it.
58. Creating a Scatter Plot
Scatter plots can be created using the plot.scatter function.
# Creating a scatter plot
df.plot.scatter(x='Age', y='Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Age vs Salary')
plt.show()
Output:
An interactive scatter plot will be displayed.
Explanation:
The plot.scatter() function visualizes the relationship between two numerical variables ('Age' and 'Salary'); the x and y parameters select the axes. Labels and a title are added with xlabel(), ylabel(), and title(), and show() displays the plot.
59. Filtering Rows with Multiple Conditions
You can filter rows in a DataFrame based on multiple conditions using the & (AND) and | (OR) operators.
# Filtering rows with multiple conditions
filtered_df = df[(df['Age'] >= 30) & (df['Salary'] >= 50000)]
Explanation:
The example demonstrates how to filter rows in a DataFrame based on multiple conditions. In this case, we are filtering for rows where the 'Age' is greater than or equal to 30 and the 'Salary' is greater than or equal to 50000. The &
operator is used to perform an element-wise AND operation on the conditions.
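To see the filter in action, here is a minimal self-contained sketch. The data is made up, but the column names match the example above:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cara', 'Dan'],
    'Age': [28, 35, 31, 24],
    'Salary': [48000, 62000, 55000, 51000],
})

# Keep rows where BOTH conditions hold; each condition must be parenthesized
filtered_df = df[(df['Age'] >= 30) & (df['Salary'] >= 50000)]
```

Only Ben (35, 62000) and Cara (31, 55000) satisfy both conditions, so the result has two rows.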
62. Grouping Data and Calculating Aggregates
You can use the groupby
function to group data by one or more columns and then apply aggregate functions to the grouped data.
# Grouping data and calculating aggregates
grouped_df = df.groupby('Department')['Salary'].mean()
Explanation:
The groupby()
function is used to group the data by a specific column ('Department' in this case). Then, the ['Salary'].mean()
expression calculates the mean salary for each department. The result is a Series with department names as the index and the corresponding mean salaries as values.
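A runnable sketch with hypothetical data shows the Series that comes back:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR'],
    'Salary': [60000, 70000, 50000],
})

# Mean salary per department, returned as a Series indexed by Department
grouped_df = df.groupby('Department')['Salary'].mean()
```

Here `grouped_df['IT']` is 65000.0 and `grouped_df['HR']` is 50000.0.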
63. Merging DataFrames
You can merge two DataFrames based on a common column using the merge
function.
# Merging DataFrames
merged_df = pd.merge(df1, df2, on='EmployeeID')
Explanation:
The merge()
function is used to merge two DataFrames ('df1' and 'df2') based on a common column ('EmployeeID' in this case). The result is a new DataFrame containing the combined data from both original DataFrames.
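With two small hypothetical tables, the default inner join keeps only the keys present in both:

```python
import pandas as pd

# Hypothetical tables sharing an 'EmployeeID' key
df1 = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Ann', 'Ben', 'Cara']})
df2 = pd.DataFrame({'EmployeeID': [2, 3, 4], 'Salary': [62000, 55000, 48000]})

# Default how='inner': only IDs present in both frames survive
merged_df = pd.merge(df1, df2, on='EmployeeID')
```

Only EmployeeIDs 2 and 3 appear in both frames, so the merged result has two rows and the columns of both inputs.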
64. Handling Missing Data
Missing data can be handled using the fillna
function or by dropping rows with missing values using the dropna
function.
# Handling missing data
df.fillna(value=0, inplace=True)
Explanation:
The fillna()
function is used to fill missing values in the DataFrame with a specified value (in this case, 0). The inplace=True
parameter updates the DataFrame in place with the filled values.
65. Pivoting DataFrames
You can pivot a DataFrame using the pivot
function to reshape the data based on column values.
# Pivoting a DataFrame
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
Explanation:
The pivot()
function is used to reshape the DataFrame. In this example, the DataFrame is pivoted based on the 'Date' and 'Product' columns, and the 'Sales' column values are used as the values for the pivoted DataFrame.
66. Melting DataFrames
Melting a DataFrame can help convert it from a wide format to a long format using the melt
function.
# Melting a DataFrame
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Product_A', 'Product_B'])
Explanation:
The melt()
function is used to transform the DataFrame from wide format to long format. The 'Date' column is kept as the identifier variable; the 'Product_A' and 'Product_B' column names go into a single 'variable' column, with their corresponding values in the 'value' column.
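A small sketch with made-up sales figures makes the wide-to-long reshape concrete:

```python
import pandas as pd

# Hypothetical wide-format data: one column per product
df = pd.DataFrame({
    'Date': ['2024-01', '2024-02'],
    'Product_A': [10, 12],
    'Product_B': [7, 9],
})

# Long format: one row per (Date, product) pair
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Product_A', 'Product_B'])
```

Two dates times two products gives four rows, with columns 'Date', 'variable', and 'value'.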
67. Reshaping DataFrames with Stack and Unstack
You can use the stack
and unstack
functions to reshape DataFrames by stacking and unstacking levels of the index or columns.
# Stacking and unstacking DataFrames
stacked_df = df.stack()
unstacked_df = df.unstack()
Explanation:
The stack()
function is used to stack the specified level(s) of columns to produce a Series with a MultiIndex. The unstack()
function is used to unstack the specified level(s) of the index to produce a DataFrame with reshaped columns.
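A round trip on a tiny hypothetical frame shows what stacking actually produces:

```python
import pandas as pd

# Hypothetical 2x2 frame
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# stack(): columns become the inner level of a MultiIndex -> a Series
stacked = df.stack()

# unstack(): the inner index level goes back to columns -> original shape
unstacked = stacked.unstack()
```

`stacked` is a Series indexed by (row label, column label) pairs, e.g. `stacked.loc[('x', 'B')]` is 3, and unstacking recovers the original DataFrame.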
68. Creating Pivot Tables
Pivot tables can be created using the pivot_table
function to summarize and analyze data.
# Creating a pivot table
pivot_table_df = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
Explanation:
The pivot_table()
function is used to create a pivot table. In this example, the pivot table is based on the 'Department' column, and the 'Salary' column values are aggregated using the mean function.
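With hypothetical data, this pivot table gives the same numbers as a groupby mean, but as a table indexed by the grouping column:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR'],
    'Salary': [60000, 70000, 50000],
})

# One row per department, mean salary as the value
pivot_table_df = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
```

`pivot_table_df.loc['IT', 'Salary']` is 65000.0 and `pivot_table_df.loc['HR', 'Salary']` is 50000.0.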
69. Grouping Data in a DataFrame
You can group data in a DataFrame using the groupby
function to perform aggregate operations on grouped data.
# Grouping data and calculating mean
grouped_df = df.groupby('Category')['Price'].mean()
Explanation:
The groupby()
function is used to group data based on a column ('Category' in this case). The mean()
function is then applied to the 'Price' column within each group to calculate the average price for each category.
70. Merging DataFrames
DataFrames can be merged using the merge
function to combine data from different sources based on common columns.
# Merging DataFrames
merged_df = pd.merge(df1, df2, on='common_column')
Explanation:
The merge()
function is used to combine data from two DataFrames based on a common column ('common_column' in this case). The resulting DataFrame contains columns from both original DataFrames, aligned based on the matching values in the common column.
71. Concatenating DataFrames
DataFrames can be concatenated using the concat
function to combine them vertically or horizontally.
# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2])
# Concatenating DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis=1)
Explanation:
The concat()
function is used to concatenate DataFrames either vertically (default) or horizontally (if axis=1
is specified). This is useful when you want to combine data from different sources into a single DataFrame.
72. Handling Missing Data
Missing data can be handled using functions like dropna
, fillna
, and interpolate
.
# Dropping rows with missing values
cleaned_df = df.dropna()
# Filling missing values with a specific value
filled_df = df.fillna(value)
# Interpolating missing values
interpolated_df = df.interpolate()
Explanation:
Missing data can be handled using various methods. The dropna()
function removes rows with missing values, the fillna()
function fills missing values with a specified value, and the interpolate()
function fills missing values using interpolation methods.
73. Reshaping DataFrames
DataFrames can be reshaped using functions like pivot
, melt
, and stack/unstack
.
# Pivot table
pivot_table = df.pivot_table(index='Index', columns='Column', values='Value')
# Melt DataFrame
melted_df = pd.melt(df, id_vars=['ID'], value_vars=['Var1', 'Var2'])
# Stack and unstack
stacked_df = df.stack()
unstacked_df = df.unstack()
Explanation:
DataFrames can be reshaped to change the layout of the data. The pivot_table()
function creates a pivot table based on the provided columns, melt()
function is used to transform wide data into long format, and stack()
and unstack()
functions move levels between the columns and the row index: stacking turns column labels into an inner index level, and unstacking does the reverse.
74. Aggregating Data in Groups
DataFrames can be grouped and aggregated using functions like groupby
and agg
.
# Grouping and aggregating data
grouped = df.groupby('Category')['Value'].agg(['mean', 'sum'])
Explanation:
The groupby()
function is used to group data based on a column ('Category' in this case), and the agg()
function is then used to perform aggregate operations (e.g., mean, sum) on the grouped data.
75. Applying Functions to Columns
You can apply functions to DataFrame columns using apply
or applymap
.
# Applying a function to a column
df['NewColumn'] = df['Column'].apply(function)
# Applying a function element-wise to all columns
transformed_df = df.applymap(function)
Explanation:
The apply()
function can be used to apply a function to a specific column. The applymap()
function is used to apply a function element-wise to every cell of the DataFrame (in pandas 2.1+ it is deprecated in favour of DataFrame.map()).
76. Using Lambda Functions
Lambda functions can be used for concise operations within DataFrames.
# Applying a lambda function
df['NewColumn'] = df['Column'].apply(lambda x: x * 2)
Explanation:
Lambda functions provide a concise way to define small operations directly within a function call. In this case, the lambda function is applied to each element of the 'Column' and the result is assigned to the 'NewColumn'.
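A complete runnable version of the snippet above, with hypothetical data:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Column': [1, 2, 3]})

# The lambda doubles each element of 'Column'
df['NewColumn'] = df['Column'].apply(lambda x: x * 2)
```

The new column holds [2, 4, 6].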
77. Handling Missing Data
Dealing with missing data is a common task in data analysis. Pandas provides various functions to handle missing values.
# Check for missing values
missing_values = df.isnull().sum()
# Drop rows with missing values
cleaned_df = df.dropna()
# Fill missing values with a specific value
df_filled = df.fillna(value)
Explanation:
The isnull()
function is used to identify missing values in the DataFrame. The dropna()
function is used to remove rows containing missing values, and the fillna()
function is used to fill missing values with a specified value.
78. Removing Duplicates
Removing duplicate rows is essential to ensure data accuracy and consistency.
# Removing duplicates based on all columns
deduplicated_df = df.drop_duplicates()
# Removing duplicates based on specific columns
deduplicated_specific_df = df.drop_duplicates(subset=['Column1', 'Column2'])
Explanation:
The drop_duplicates()
function removes duplicate rows from the DataFrame. You can specify columns using the subset
parameter to consider only certain columns for duplicate removal.
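The effect of the subset
parameter is easiest to see on a small hypothetical frame where two rows agree on some columns but not all:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 agree on Column1/Column2 but differ in Other
df = pd.DataFrame({
    'Column1': ['a', 'a', 'b'],
    'Column2': [1, 1, 2],
    'Other':   [10, 99, 20],
})

# All columns considered: no two rows are fully identical, nothing is dropped
deduplicated_df = df.drop_duplicates()

# Only Column1/Column2 considered: row 1 duplicates row 0 and is dropped
deduplicated_specific_df = df.drop_duplicates(subset=['Column1', 'Column2'])
```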
79. Sorting DataFrames
DataFrames can be sorted using the sort_values
function.
# Sorting by a single column
sorted_df = df.sort_values(by='Column')
# Sorting by multiple columns
sorted_multi_df = df.sort_values(by=['Column1', 'Column2'], ascending=[True, False])
Explanation:
The sort_values()
function is used to sort the DataFrame based on one or more columns. The by
parameter specifies the columns to sort by, and the ascending
parameter determines whether the sorting is in ascending or descending order.
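A sketch with made-up data shows how mixed ascending/descending sorting behaves:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Column1': ['b', 'a', 'a'],
    'Column2': [1, 2, 3],
})

# Column1 ascending, then Column2 descending within each Column1 group
sorted_multi_df = df.sort_values(by=['Column1', 'Column2'], ascending=[True, False])
```

The 'a' rows come first with Column2 descending (3, then 2), followed by the 'b' row.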
80. Exporting DataFrames
DataFrames can be exported to various file formats using functions like to_csv
, to_excel
, and to_sql
.
# Export to CSV
df.to_csv('output.csv', index=False)
# Export to Excel
df.to_excel('output.xlsx', index=False)
# Export to SQL database
df.to_sql('table_name', connection_object, if_exists='replace')
Explanation:
DataFrames can be exported to various file formats using functions like to_csv()
for CSV files, to_excel()
for Excel files, and to_sql()
to store data in a SQL database. The index
parameter specifies whether to include the index in the exported file.
81. Grouping and Aggregating Data
Grouping data allows you to perform aggregate operations on specific subsets of data.
# Grouping by a single column and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()
# Grouping by multiple columns and calculating sum
grouped_sum = df.groupby(['Category1', 'Category2'])['Value'].sum()
Explanation:
The groupby()
function is used to group the DataFrame based on one or more columns. Aggregate functions like mean()
, sum()
, count()
, etc., can then be applied to the grouped data to calculate summary statistics.
82. Reshaping Data
DataFrames can be reshaped using functions like melt
and pivot
.
# Melting the DataFrame
melted_df = pd.melt(df, id_vars=['ID'], value_vars=['Value1', 'Value2'])
# Creating a pivot table
pivot_table = df.pivot_table(index='Category', columns='Date', values='Value', aggfunc='sum')
Explanation:
The melt()
function is used to transform the DataFrame from wide format to long format. The pivot_table()
function is used to create a pivot table, aggregating data based on specified rows, columns, and values.
83. Combining DataFrames
DataFrames can be combined using functions like concat
, merge
, and join
.
# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2], axis=0)
# Merging DataFrames based on a common column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
# Joining DataFrames based on index
joined_df = df1.join(df2, how='outer')
Explanation:
The concat()
function is used to concatenate DataFrames vertically or horizontally. The merge()
function is used to merge DataFrames based on common columns, and the join()
function is used to join DataFrames based on index.
84. Time Series Analysis
Pandas provides functionality for working with time series data.
# Converting to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Resampling time series data (requires a DatetimeIndex, or pass on='Date')
resampled_df = df.set_index('Date').resample('W').sum()
# Rolling mean calculation
rolling_mean = df['Value'].rolling(window=7).mean()
Explanation:
Pandas allows you to work with time series data by converting date columns to datetime format, resampling data at different frequencies, and calculating rolling statistics like moving averages.
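A tiny hypothetical series makes the resampling and rolling steps concrete:

```python
import pandas as pd

# Hypothetical daily observations (note the gap between Jan 2 and Jan 8)
df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-08']),
    'Value': [1, 2, 3],
})

# Weekly sums; on='Date' tells resample which column holds the timestamps
weekly = df.resample('W', on='Date')['Value'].sum()

# 3-point rolling mean over the raw values (first two entries are NaN)
rolling_mean = df['Value'].rolling(window=3).mean()
```

Jan 1 and Jan 2 fall in the week ending Sunday Jan 7 (sum 3), and Jan 8 in the following week (sum 3); the last rolling mean is (1+2+3)/3 = 2.0.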
85. Visualizing Data
Data visualization is crucial for understanding patterns and trends in data.
import matplotlib.pyplot as plt
import seaborn as sns
# Line plot
plt.plot(df['Date'], df['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Value over Time')
plt.show()
# Scatter plot
sns.scatterplot(x='X', y='Y', data=df)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()
Explanation:
Matplotlib and Seaborn libraries are commonly used for data visualization in Python. You can create various types of plots, including line plots and scatter plots, to visualize relationships and trends in your data.
86. Handling Missing Data
Dealing with missing data is essential for accurate analysis.
# Checking for missing values
missing_values = df.isnull().sum()
# Dropping rows with missing values
df_cleaned = df.dropna()
# Filling missing values with a specific value
df_filled = df.fillna(0)
Explanation:
The isnull()
function is used to identify missing values in a DataFrame. You can then use dropna()
to remove rows or columns with missing values, and fillna()
to replace missing values with a specific value.
87. Data Transformation
You can perform various data transformation operations to prepare data for analysis.
# Applying a function to a column
df['Transformed_Column'] = df['Value'].apply(lambda x: x * 2)
# Applying a function element-wise
df_transformed = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
# Binning data into categories
df['Category'] = pd.cut(df['Value'], bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])
Explanation:
Data transformation involves modifying, adding, or removing columns in a DataFrame to create new features or prepare data for analysis. You can use functions like apply()
and applymap()
to transform data based on custom functions.
88. Working with Categorical Data
Categorical data requires special handling to encode and analyze properly.
# Encoding categorical variables
encoded_df = pd.get_dummies(df, columns=['Category'], prefix=['Cat'], drop_first=True)
# Mapping categories to numerical values
category_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Category'] = df['Category'].map(category_mapping)
Explanation:
Categorical data needs to be transformed into numerical format for analysis. You can use one-hot encoding with get_dummies()
to create binary columns for each category, or use map()
to map categories to specific numerical values.
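Both encodings side by side on hypothetical labels:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'Category': ['Low', 'High', 'Medium', 'Low']})

# Ordinal encoding: map labels to ordered integer codes
category_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Code'] = df['Category'].map(category_mapping)

# One-hot encoding: one indicator column per category
encoded_df = pd.get_dummies(df['Category'], prefix='Cat')
```

Four rows and three distinct categories yield a 4x3 one-hot frame with columns like 'Cat_High'.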
89. Data Aggregation and Pivot Tables
Aggregating data and creating pivot tables helps summarize information.
# Creating a pivot table
pivot_table = df.pivot_table(index='Category', columns='Month', values='Value', aggfunc='sum')
# Grouping and aggregating data
grouped = df.groupby('Category')['Value'].agg(['sum', 'mean', 'max'])
Explanation:
Pivot tables allow you to create multidimensional summaries of data. You can also use the groupby()
function to group data based on specific columns and then apply aggregate functions to calculate summary statistics.
90. Exporting Data
After analysis, you might need to export your DataFrame to different formats.
# Exporting to CSV
df.to_csv('output.csv', index=False)
# Exporting to Excel
df.to_excel('output.xlsx', index=False)
# Exporting to JSON
df.to_json('output.json', orient='records')
Explanation:
Pandas provides methods to export DataFrames to various file formats, including CSV, Excel, and JSON. You can use the to_csv()
, to_excel()
, and to_json()
functions to save your data.
91. Merging DataFrames
Combining data from multiple DataFrames can be useful for analysis.
# Inner join
merged_inner = pd.merge(df1, df2, on='ID', how='inner')
# Left join
merged_left = pd.merge(df1, df2, on='ID', how='left')
# Concatenating DataFrames
concatenated = pd.concat([df1, df2], axis=0)
Explanation:
You can merge DataFrames using different types of joins (inner, outer, left, right) with the merge()
function. Use concat()
to concatenate DataFrames along a specified axis.
92. Time Series Analysis
Pandas supports time series analysis and manipulation.
# Converting a column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Resampling time series data
df_resampled = df.resample('D', on='Date').sum()
# Shifting time series data
df_shifted = df['Value'].shift(1)
Explanation:
For time series analysis, it's crucial to convert time-related columns to datetime format using pd.to_datetime()
. You can resample time series data to a different frequency and apply aggregation functions using resample()
. Shifting data can help in calculating differences between consecutive time periods.
93. Plotting Data
Pandas provides built-in methods for data visualization.
# Line plot
df.plot(x='Date', y='Value', kind='line', title='Line Plot')
# Bar plot
df.plot(x='Category', y='Value', kind='bar', title='Bar Plot')
# Histogram
df['Value'].plot(kind='hist', title='Histogram')
Explanation:
Pandas provides easy-to-use methods for creating various types of plots directly from DataFrames. You can create line plots, bar plots, histograms, and more using the plot()
function.
94. Advanced Indexing and Selection
Pandas offers advanced indexing and selection capabilities.
# Indexing using boolean conditions
filtered_data = df[df['Value'] > 10]
# Indexing using loc and iloc
selected_data = df.loc[df['Category'] == 'High', 'Value']
# Multi-level indexing
multi_indexed = df.set_index(['Category', 'Date'])
Explanation:
You can use boolean conditions to filter rows that meet specific criteria. The loc
and iloc
indexers allow you to select data by label or integer-based location, respectively. Multi-level indexing lets you create hierarchical index structures.
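The three selection styles, shown on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Category': ['High', 'Low', 'High'],
    'Value': [15, 5, 25],
})

# Boolean mask: rows with Value above 10
filtered_data = df[df['Value'] > 10]

# loc (label-based): 'Value' for rows whose Category is 'High'
selected_data = df.loc[df['Category'] == 'High', 'Value']

# iloc (position-based): first row, second column
first_value = df.iloc[0, 1]
```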
95. Handling Duplicate Data
Duplicate data can affect analysis accuracy, so it's important to handle it.
# Checking for duplicates
duplicate_rows = df.duplicated()
# Dropping duplicates
df_deduplicated = df.drop_duplicates()
# Keeping the first occurrence of duplicates
df_first_occurrence = df.drop_duplicates(keep='first')
Explanation:
Use the duplicated()
function to identify duplicate rows in a DataFrame. You can then use drop_duplicates()
to remove duplicate rows, either by dropping all duplicates or keeping only the first occurrence.
96. Handling Missing Data
Missing data can be problematic for analysis, so it's important to handle it properly.
# Checking for missing values
missing_values = df.isnull()
# Dropping rows with missing values
df_no_missing = df.dropna()
# Filling missing values with a specific value
df_filled = df.fillna(0)
Explanation:
The isnull()
function helps you identify missing values in your DataFrame. You can use dropna()
to remove rows with missing values or fillna()
to replace missing values with a specific value.
97. Aggregating Data
You can perform aggregation operations to summarize data in various ways.
# Grouping data and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()
# Grouping data and calculating sum
grouped_sum = df.groupby('Category')['Value'].sum()
# Pivot tables
pivot_table = df.pivot_table(index='Category', columns='Date', values='Value', aggfunc='mean')
Explanation:
Use the groupby()
function to group data based on specific columns and perform aggregation functions such as mean, sum, count, etc. Pivot tables allow you to create a table summarizing data based on multiple dimensions.
98. Reshaping Data
You can reshape data to fit different formats using Pandas.
# Melting data from wide to long format
melted = pd.melt(df, id_vars=['Category'], value_vars=['Jan', 'Feb', 'Mar'])
# Pivoting data from long to wide format
pivoted = melted.pivot_table(index='Category', columns='variable', values='value')
Explanation:
The melt()
function helps you reshape data from wide to long format, where each row represents a unique combination of variables. The pivot_table()
function can then be used to reshape the long format data back to wide format.
99. Working with Text Data
Pandas supports text data manipulation and analysis.
# Extracting substrings
df['First Name'] = df['Full Name'].str.split().str[0]
# Counting words in each row
word_count = df['Text Column'].str.split().apply(len)
# Finding and replacing text
df['Text Column'] = df['Text Column'].str.replace('old', 'new')
Explanation:
Pandas provides methods for working with text data within columns. You can use str.split()
to split text into substrings, apply()
to perform operations on each element, and str.replace()
to find and replace specific text within columns.
100. Exporting Data
Exporting data is essential for sharing analysis results.
# Export to CSV
df.to_csv('data.csv', index=False)
# Export to Excel
df.to_excel('data.xlsx', index=False)
# Export to JSON
df.to_json('data.json', orient='records')
Explanation:
Pandas allows you to export DataFrames to various file formats, including CSV, Excel, and JSON. Use the respective to_*
functions and specify the file name. Set index=False
to exclude the index column from the export.
101. Working with DateTime Data
Pandas provides tools to work with datetime data efficiently.
# Converting strings to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Extracting year, month, day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
# Calculating time differences
df['TimeDiff'] = df['End Time'] - df['Start Time']
Explanation:
Pandas provides the to_datetime()
function to convert strings to datetime objects. You can use dt.year
, dt.month
, and dt.day
to extract date components. Calculating time differences becomes straightforward by subtracting datetime columns.
102. Merging DataFrames
Combining data from multiple DataFrames can provide valuable insights.
# Merging based on common column
merged_df = pd.merge(df1, df2, on='ID')
# Merging with different column names
merged_df = pd.merge(df1, df2, left_on='ID1', right_on='ID2')
# Merging on multiple columns
merged_df = pd.merge(df1, df2, on=['ID', 'Date'])
Explanation:
Pandas offers the merge()
function to combine DataFrames based on shared columns. You can specify the column to merge on using the on
parameter or different columns using left_on
and right_on
. Merging on multiple columns is also possible by passing a list of column names.
103. Combining DataFrames
Concatenating DataFrames is useful for combining data vertically or horizontally.
# Concatenating vertically
concatenated_df = pd.concat([df1, df2])
# Concatenating horizontally
concatenated_df = pd.concat([df1, df2], axis=1)
Explanation:
The concat()
function allows you to concatenate DataFrames vertically (along rows) or horizontally (along columns). Use axis=0
for vertical concatenation and axis=1
for horizontal concatenation.
104. Applying Functions to Columns
Applying functions to DataFrame columns can transform or manipulate data.
# Applying a function element-wise
df['New Column'] = df['Column'].apply(lambda x: x * 2)
# Applying a function to multiple columns
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].applymap(lambda x: x.strip())
Explanation:
You can use the apply()
function to apply a function element-wise to a column. To apply a function to multiple columns, use applymap()
. The example demonstrates how to double the values in a column and strip whitespace from multiple columns.
105. Categorical Data
Converting data to categorical format can save memory and improve performance.
# Converting to categorical
df['Category'] = df['Category'].astype('category')
# Displaying categories
categories = df['Category'].cat.categories
# Mapping categories to numerical values
df['Category Code'] = df['Category'].cat.codes
Explanation:
By converting categorical data to the 'category'
type, you can save memory and improve performance. Use the cat.categories
property to display the unique categories and cat.codes
to map them to numerical values.
106. Handling Missing Data
Dealing with missing data is essential for data analysis and modeling.
# Checking for missing values
missing_values = df.isnull().sum()
# Dropping rows with missing values
df_cleaned = df.dropna()
# Filling missing values with a specific value
df_filled = df.fillna(value=0)
Explanation:
Use isnull()
to identify missing values in a DataFrame. The sum()
function calculates the number of missing values per column. You can drop rows with missing values using dropna()
or fill missing values with a specific value using fillna()
.
107. Aggregating Data
Aggregating data provides insights into summary statistics.
# Calculating mean, median, and sum
mean_value = df['Column'].mean()
median_value = df['Column'].median()
sum_value = df['Column'].sum()
# Grouping and aggregating
grouped_data = df.groupby('Category')['Value'].sum()
Explanation:
Aggregating data helps analyze summary statistics. Use mean()
, median()
, and sum()
to calculate these statistics. Grouping data using groupby()
allows for aggregation based on specific columns.
108. Reshaping Data
Reshaping data allows for different representations of the same information.
# Pivoting data
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
# Melting data
melted_df = pd.melt(df, id_vars='Date', value_vars=['Col1', 'Col2'], var_name='Category', value_name='Value')
Explanation:
Pivoting reshapes data by creating a new table with columns based on unique values from another column. The pivot_table()
function allows for customization of aggregation functions. Melting data converts wide-format data to long-format, making it more suitable for analysis.
109. Working with Text Data
Manipulating text data is common in data analysis.
# Extracting substring
df['Substr'] = df['Text'].str[0:5]
# Splitting text into columns
df[['First Name', 'Last Name']] = df['Name'].str.split(expand=True)
# Counting occurrences of a substring
df['Count'] = df['Text'].str.count('pattern')
Explanation:
Text manipulation is possible using string methods like str[0:5]
to extract a substring. The str.split()
function splits text into separate columns. The str.count()
function counts occurrences of a substring in a column.
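All three string operations together on hypothetical text:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Name': ['Ada Lovelace', 'Alan Turing'],
    'Text': ['abcdefgh', 'ababab'],
})

# First five characters of each string
df['Substr'] = df['Text'].str[0:5]

# Split on whitespace into two new columns
df[['First Name', 'Last Name']] = df['Name'].str.split(expand=True)

# Non-overlapping occurrences of the substring 'ab'
df['Count'] = df['Text'].str.count('ab')
```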
110. Exporting Data
Exporting data is essential for sharing analysis results.
# Exporting to CSV
df.to_csv('data.csv', index=False)
# Exporting to Excel
df.to_excel('data.xlsx', index=False)
# Exporting to JSON
df.to_json('data.json', orient='records')
Explanation:
Use to_csv()
to export data to a CSV file. The to_excel()
function exports to an Excel file, and to_json()
exports to a JSON file with various orientations, such as 'records'
.
111. Merging DataFrames
Merging data from multiple DataFrames can provide a comprehensive view of the data.
# Inner join
merged_inner = pd.merge(df1, df2, on='common_column', how='inner')
# Left join
merged_left = pd.merge(df1, df2, on='common_column', how='left')
# Concatenating DataFrames
concatenated_df = pd.concat([df1, df2], axis=0)
Explanation:
Merging DataFrames is useful for combining related data. pd.merge()
performs inner and left joins on specified columns using the how
parameter. The pd.concat()
function concatenates DataFrames along the specified axis.
112. Time Series Analysis
Working with time series data requires specialized techniques.
# Converting to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Resampling data
daily_data = df.resample('D', on='Date').sum()
# Shifting data
df['Shifted'] = df['Value'].shift(1)
Explanation:
Time series analysis involves converting date columns to datetime format using pd.to_datetime()
. Resampling data using resample()
aggregates data over specified time intervals. Shifting data using shift()
offsets data by a specified number of periods.
113. Working with Categorical Data
Categorical data can be encoded and analyzed effectively.
# Encoding categorical data
df['Category'] = df['Category'].astype('category')
df['Category_encoded'] = df['Category'].cat.codes
# One-hot encoding
one_hot_encoded = pd.get_dummies(df['Category'], prefix='Category')
Explanation:
Encode categorical data using astype('category')
and cat.codes
to assign unique codes to categories. Use pd.get_dummies()
for one-hot encoding, creating separate columns for each category.
114. Data Visualization with Pandas
Data visualization helps in understanding patterns and trends.
import matplotlib.pyplot as plt
# Line plot
df.plot(x='Date', y='Value', kind='line', title='Line Plot')
# Histogram
df['Value'].plot(kind='hist', bins=10, title='Histogram')
plt.show()
Explanation:
Data visualization libraries like Matplotlib can be used to create various plots. df.plot()
generates line plots, and plot(kind='hist')
creates histograms for numeric data.
115. Pivot Tables
Pivot tables help summarize and analyze data from multiple dimensions.
# Creating a pivot table
pivot_table = df.pivot_table(values='Value', index='Category', columns='Date', aggfunc='sum')
# Handling missing values
pivot_table_filled = pivot_table.fillna(0)
Explanation:
Pivot tables are used to summarize data across multiple dimensions. pivot_table()
creates a pivot table using specified values, index, columns, and aggregation function. fillna()
is used to handle missing values by filling them with a specific value.
116. Groupby and Aggregation
Grouping data and applying aggregation functions helps in obtaining insights.
# Grouping data
grouped_data = df.groupby('Category')['Value'].sum()
# Multiple aggregations
aggregated_data = df.groupby('Category').agg({'Value': ['sum', 'mean']})
Explanation:
groupby()
is used to group data based on specified columns. Aggregation functions like sum()
and mean()
can be applied to the grouped data. agg()
allows performing multiple aggregations on different columns.
117. Working with Datetime Index
Datetime index provides flexibility in time-based analysis.
# Setting datetime index
df.set_index('Date', inplace=True)
# Resampling with datetime index
resampled_data = df.resample('M').sum()
Explanation:
Setting a datetime index using set_index()
enables time-based analysis. resample()
with a datetime index can be used to aggregate data over different time periods.
118. Handling Outliers
Detecting and handling outliers is crucial for accurate analysis.
# Detecting outliers using z-score
import numpy as np
from scipy.stats import zscore
outliers = df[np.abs(zscore(df['Value'])) > 3]
# Handling outliers
df_no_outliers = df[(np.abs(zscore(df['Value'])) < 3)]
Explanation:
Outliers can be detected using the z-score method from the scipy.stats
library. Values with z-scores greater than a threshold (e.g., 3) can be considered outliers. Removing outliers helps in obtaining more reliable analysis results.
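If SciPy is not available, the same check can be written with pandas alone. This sketch uses made-up data with one obvious extreme value; the threshold of 3 is a common convention, not a universal rule:

```python
import pandas as pd

# Fifteen typical readings plus one extreme value
df = pd.DataFrame({'Value': [10] * 15 + [300]})

# Population z-score (ddof=0 matches scipy.stats.zscore's default)
z = (df['Value'] - df['Value'].mean()) / df['Value'].std(ddof=0)
outliers = df[z.abs() > 3]
df_no_outliers = df[z.abs() <= 3]
```

Only the 300 exceeds three standard deviations from the mean, so it is the single flagged outlier.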
119. Exporting Data
Exporting DataFrame data to various formats is essential for sharing and collaboration.
# Exporting to CSV
df.to_csv('data.csv', index=False)
# Exporting to Excel
df.to_excel('data.xlsx', index=False)
# Exporting to JSON
df.to_json('data.json', orient='records')
Explanation:
DataFrames can be exported to various formats like CSV, Excel, and JSON using to_csv()
, to_excel()
, and to_json()
methods. Specifying index=False
excludes the index column from the exported data.
120. Merging DataFrames
Merging DataFrames helps in combining data from different sources.
# Inner merge
merged_df_inner = pd.merge(df1, df2, on='Key', how='inner')
# Left merge
merged_df_left = pd.merge(df1, df2, on='Key', how='left')
# Right merge
merged_df_right = pd.merge(df1, df2, on='Key', how='right')
# Outer merge
merged_df_outer = pd.merge(df1, df2, on='Key', how='outer')
Explanation:
Merging DataFrames using the pd.merge()
function combines data based on a common key. Different types of merges such as inner, left, right, and outer can be performed based on the requirement.
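The four join types compared on two tiny hypothetical frames whose keys only partially overlap:

```python
import pandas as pd

# Hypothetical frames: only Key 2 appears in both
df1 = pd.DataFrame({'Key': [1, 2], 'A': ['x', 'y']})
df2 = pd.DataFrame({'Key': [2, 3], 'B': ['p', 'q']})

inner = pd.merge(df1, df2, on='Key', how='inner')  # keys in both: 2
left  = pd.merge(df1, df2, on='Key', how='left')   # all left keys: 1, 2
outer = pd.merge(df1, df2, on='Key', how='outer')  # union of keys: 1, 2, 3
```

Rows with no match on the other side get NaN in that side's columns (e.g. Key 1 has NaN for 'B' in the left and outer joins).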
121. Handling Duplicates
Identifying and removing duplicate rows from DataFrames.
# Identifying duplicates
duplicate_rows = df[df.duplicated()]
# Removing duplicates
df_no_duplicates = df.drop_duplicates()
Explanation:
Duplicate rows can be identified using the duplicated() method. Duplicates can be removed using the drop_duplicates() method, which retains the first occurrence of each duplicated row.
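A short sketch with a toy frame whose third row exactly repeats the first:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

duplicate_rows = df[df.duplicated()]     # the repeated 'Alice' row only
df_no_duplicates = df.drop_duplicates()  # first occurrence is kept
```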
122. Handling Missing Values
Dealing with missing values is crucial for accurate analysis.
# Checking for missing values
missing_values = df.isnull().sum()
# Dropping rows with missing values
df_no_missing = df.dropna()
# Filling missing values
df_filled = df.fillna(0)
Explanation:
Missing values can be identified using isnull(), and the count of missing values per column can be calculated by chaining sum(). Rows with missing values can be dropped using dropna(), and missing values can be filled using fillna().
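A runnable sketch on a toy frame with one NaN in each column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})

missing_values = df.isnull().sum()  # one NaN per column
df_no_missing = df.dropna()         # only the last row has no NaN
df_filled = df.fillna(0)            # NaNs replaced with 0
```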
123. String Operations
Performing string operations on DataFrame columns.
# Changing case
df['Column'] = df['Column'].str.lower()
# Extracting substrings
df['Substring'] = df['Column'].str.extract(r'(\d{3})')
Explanation:
String operations can be performed through the str accessor of DataFrame columns. Changing case, extracting substrings, and applying regular expressions are some common string operations.
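A small sketch with illustrative data; passing expand=False to str.extract() returns a Series rather than a one-column DataFrame, which is cleaner for column assignment:

```python
import pandas as pd

df = pd.DataFrame({'Column': ['Order 123', 'ORDER 456']})

df['Column'] = df['Column'].str.lower()
# Extract the first three-digit run from each string
df['Substring'] = df['Column'].str.extract(r'(\d{3})', expand=False)
```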
124. Grouping and Aggregating Data
Grouping data by one or more columns and applying aggregation functions.
# Grouping and summing
grouped_sum = df.groupby('Category')['Value'].sum()
# Grouping and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()
Explanation:
Grouping data allows you to perform aggregate calculations on subsets of the data. Common aggregation functions include sum, mean, and count. Use the groupby() method to specify the grouping columns, then apply the desired aggregation function.
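A runnable sketch on a toy two-category frame:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Value': [10, 20, 30, 40]})

grouped_sum = df.groupby('Category')['Value'].sum()    # A: 40, B: 60
grouped_mean = df.groupby('Category')['Value'].mean()  # A: 20.0, B: 30.0
```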
125. Pivoting and Reshaping Data
Pivoting and reshaping data to transform its structure.
# Pivoting data
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
# Melting data
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Category1', 'Category2'])
Explanation:
Pivoting reshapes data by turning the unique values of one column into new columns. The pivot_table() function is used for this purpose. Melting converts wide-format data into long format by stacking multiple columns into a single column using the melt() function.
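A sketch of both directions with toy data (the wide frame for melting is hypothetical, mirroring the Category1/Category2 columns above):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
                   'Category': ['X', 'Y', 'X'],
                   'Value': [1, 2, 3]})

# Long -> wide: one column per category, NaN where no data exists
pivot_table = df.pivot_table(index='Date', columns='Category',
                             values='Value', aggfunc='sum')

# Wide -> long: each value column becomes a (variable, value) row
wide = pd.DataFrame({'Date': ['2024-01-01'], 'Category1': [1], 'Category2': [2]})
melted_df = pd.melt(wide, id_vars=['Date'], value_vars=['Category1', 'Category2'])
```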
126. Time Series Analysis
Analyzing time series data using pandas.
# Converting to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Setting date as index
df.set_index('Date', inplace=True)
# Resampling (aggregating into monthly totals)
monthly_data = df.resample('M').sum()  # the 'M' alias is being renamed to 'ME' in pandas >= 2.2
Explanation:
Time series data analysis involves working with date and time data. Converting date strings to datetime objects, setting the date column as the index, and resampling data (e.g., aggregating daily data into monthly data) are common operations in time series analysis.
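A self-contained sketch with a few illustrative dates; 'MS' (month start) is used here as a frequency alias that behaves the same across pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2024-01-10', '2024-01-20', '2024-02-05'],
                   'Value': [1, 2, 3]})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Aggregate to monthly totals: Jan = 1 + 2 = 3, Feb = 3
monthly_data = df.resample('MS').sum()
```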
127. Plotting Data
Visualizing data using pandas plotting capabilities.
import matplotlib.pyplot as plt
# Line plot
df.plot(kind='line', x='Date', y='Value', title='Line Plot')
# Bar plot
df.plot(kind='bar', x='Category', y='Value', title='Bar Plot')
plt.show()
Explanation:
Pandas provides built-in plotting capabilities for visualizing data. Different types of plots, such as line plots, bar plots, and histograms, can be created using the plot() method. Matplotlib is the default backend for pandas plotting.
128. Handling Missing Data
Dealing with missing data in pandas DataFrames.
# Checking for missing values
missing_values = df.isnull().sum()
# Dropping rows with missing values
df_cleaned = df.dropna()
# Filling missing values
df_filled = df.fillna(0)
Explanation:
Handling missing data is crucial in data analysis. You can check for missing values using the isnull() method, then use dropna() to remove rows with missing values or fillna() to fill them with a specified value.
129. Merging DataFrames
Combining multiple DataFrames using merge and join operations.
# Merging based on a common column
merged_df = pd.merge(df1, df2, on='ID')
# Inner join
inner_join_df = df1.merge(df2, on='ID', how='inner')
# Outer join
outer_join_df = df1.merge(df2, on='ID', how='outer')
Explanation:
Merging DataFrames involves combining them based on common columns. The merge() function can perform different types of joins, such as inner, outer, left, and right. The how parameter specifies the type of join to perform.
130. Combining DataFrames
Concatenating multiple DataFrames vertically or horizontally.
# Concatenating vertically
concatenated_df = pd.concat([df1, df2])
# Concatenating horizontally
concatenated_horizontal_df = pd.concat([df1, df2], axis=1)
Explanation:
Combining DataFrames involves stacking them vertically or horizontally. The concat() function is used for this purpose. When concatenating horizontally, set the axis parameter to 1.
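A short sketch with two toy single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Vertical: rows stacked; ignore_index renumbers the result 0..3
stacked = pd.concat([df1, df2], ignore_index=True)
# Horizontal: columns placed side by side, aligned on the index
side_by_side = pd.concat([df1, df2], axis=1)
```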
131. Grouping and Aggregation
Performing group-wise analysis and aggregations on DataFrame columns.
# Grouping by a column and calculating mean
grouped_df = df.groupby('Category')['Value'].mean()
# Applying multiple aggregations
aggregated_df = df.groupby('Category')['Value'].agg(['mean', 'sum', 'count'])
Explanation:
Grouping and aggregation are commonly used for summarizing data by category. The groupby() method groups the DataFrame by the specified column, and aggregation functions such as mean(), sum(), and count() can then be applied to calculate statistics for each group.
132. Pivoting and Reshaping
Reshaping DataFrames using pivot tables and stacking/unstacking.
# Creating a pivot table
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
# Stacking and unstacking
stacked_df = pivot_table.stack()
unstacked_df = stacked_df.unstack()
Explanation:
Pivot tables reshape your data into a new structure: the pivot_table() function creates a DataFrame with the chosen index, columns, and aggregated values. Stacking and unstacking reshape between a multi-level index and a single-level index, and back again.
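A minimal round trip with a small wide table standing in for a pivot table (the row labels are illustrative):

```python
import pandas as pd

pivot = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]}, index=['d1', 'd2'])

stacked = pivot.stack()        # Series indexed by (row, column) pairs
unstacked = stacked.unstack()  # back to the original wide shape
```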
133. Time Series Analysis
Working with time series data and performing time-based operations.
# Converting column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Setting datetime column as index
df.set_index('Date', inplace=True)
# Resampling time series data
resampled_df = df.resample('W').mean()
Explanation:
Time series data is indexed by time. You can convert a column to datetime format using pd.to_datetime(), set it as the DataFrame index using set_index(), and then aggregate the data at a specified time frequency (e.g., weekly) using the resample() method.
134. Working with Text Data
Performing text-based operations on DataFrame columns.
# Converting column to uppercase
df['Name'] = df['Name'].str.upper()
# Extracting text using regular expressions
df['Digits'] = df['Text'].str.extract(r'(\d+)')
# Counting occurrences of a substring
df['Count'] = df['Text'].str.count('apple')
Explanation:
Text-based operations manipulate and extract information from text data in DataFrame columns. You can convert text to uppercase with str.upper(), extract specific patterns with regular expressions via str.extract(), and count occurrences of substrings with str.count().
135. Working with Categorical Data
Dealing with categorical data and performing categorical operations.
# Converting column to categorical
df['Category'] = df['Category'].astype('category')
# Creating dummy variables
dummy_df = pd.get_dummies(df['Category'])
# Merging with original DataFrame
df = pd.concat([df, dummy_df], axis=1)
Explanation:
Categorical data represents values that belong to a fixed set of categories. You can convert a column to the categorical dtype using astype('category') and create dummy variables using pd.get_dummies(). Dummy variables represent a categorical variable as binary columns, which can then be merged back into the original DataFrame.
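A runnable sketch with an illustrative color column; note that pd.get_dummies() produces one indicator column per category, named after the category values in sorted order:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['red', 'blue', 'red']})
df['Category'] = df['Category'].astype('category')

# One indicator column per category (boolean dtype in pandas >= 2.0)
dummy_df = pd.get_dummies(df['Category'])
df = pd.concat([df, dummy_df], axis=1)
```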