Libraries
Pandas is a popular open-source Python library used for data manipulation and analysis. It provides data structures and functions that make working with structured data, such as tabular data (like Excel spreadsheets or SQL tables), easy and intuitive.
To install Pandas, you can use the following command in your command-line interface (such as Terminal or Command Prompt):
pip install pandas
The plotting methods of pandas are built on top of Matplotlib, which makes them easy to use with dataframes and series. For this reason, you might also need to import the matplotlib library when building charts with Pandas.
This also means that they accept most of the same arguments, and if you already know Matplotlib, you'll have no trouble learning plots with Pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
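As a small illustration of how closely the two libraries work together, the sketch below checks that a histogram drawn from a pandas Series gives back a regular Matplotlib Axes object, which can then be customized with the usual Matplotlib methods. The toy series `s` and the title text are made up for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# Toy data, invented for this illustration
s = pd.Series([1, 2, 2, 3, 3, 3])

# Series.hist() draws with Matplotlib and returns the Axes it drew on
ax = s.hist(bins=3)

# The returned object is a Matplotlib Axes, so Matplotlib methods apply
print(isinstance(ax, Axes))
ax.set_title('A pandas histogram is a Matplotlib chart')
```

Call plt.show() afterwards to display the chart.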
Dataset
In order to create graphics with Pandas, we need to use pandas objects: Dataframes and Series. A dataframe can be seen as an Excel table, and a series as a column in that table. This means that we must systematically convert our data into a format used by pandas.
Since histograms need quantitative variables, we will create a dataset with 2 columns. The first column is called "type", which stores the categories "group1" and "group2", each repeated 1000 times.
The second column is named "value". It holds numbers. The first 1000 numbers are random values from a normal distribution with a mean of 0 and a standard deviation of 1. The next 1000 numbers are random values from another normal distribution with a mean of 4 and a standard deviation of 1. We join them into a single column with the concatenate() function from numpy.
# Create 2 columns: one categorical and one numerical
sample_size = 1000
data = {
    'type': ['group1'] * sample_size + ['group2'] * sample_size,
    'value': np.concatenate([np.random.normal(0, 1, sample_size),
                             np.random.normal(4, 1, sample_size)])
}
df = pd.DataFrame(data)
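Because the values are drawn at random, the dataset changes on every run. If you want the same numbers each time, one option is a seeded NumPy generator. The sketch below is a variation on the code above, not part of the original example: the seed value 42 and the names `rng`, `df_seeded` and `summary` are arbitrary choices. It also prints a per-group summary to check that the two groups look as described.

```python
import numpy as np
import pandas as pd

# Assumption: the seed value 42 is arbitrary, any integer works
rng = np.random.default_rng(42)

sample_size = 1000
df_seeded = pd.DataFrame({
    'type': ['group1'] * sample_size + ['group2'] * sample_size,
    'value': np.concatenate([rng.normal(0, 1, sample_size),
                             rng.normal(4, 1, sample_size)])
})

# Per-group check: means should be close to 0 and 4, standard deviations close to 1
summary = df_seeded.groupby('type')['value'].agg(['count', 'mean', 'std'])
print(summary)
```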
Basic histogram with 2 groups
Now that our dataset is ready, we'll create a simple histogram representing the distribution of the 'value' variable for the 2 groups. We will iterate over all distinct values in the 'type' variable and use the hist() function.
# Plot the histograms of each group
for group in df['type'].unique():
    # Filter the dataset on the group
    filtered_df = df[df['type']==group]
    # Add the histogram to the graphic
    filtered_df['value'].hist(figsize=(8, 4))
# Display the plot
plt.show()
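When each group is passed to hist() separately, each call computes its own bins, so the bars of the two groups may not line up, and the second histogram can hide part of the first. A possible refinement, sketched below, is to compute one set of bin edges for the whole column and to draw the bars with some transparency. The choice of 20 bins and an alpha of 0.6 are assumptions made for this illustration, and the data is regenerated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same kind of data as in the post, regenerated so this snippet is self-contained
sample_size = 1000
df = pd.DataFrame({
    'type': ['group1'] * sample_size + ['group2'] * sample_size,
    'value': np.concatenate([np.random.normal(0, 1, sample_size),
                             np.random.normal(4, 1, sample_size)])
})

# One set of bin edges computed on the whole column (20 bins is an arbitrary choice)
bin_edges = np.histogram_bin_edges(df['value'], bins=20)

# Draw both groups on the same Axes, with identical bins and some transparency
fig, ax = plt.subplots(figsize=(8, 4))
for group in df['type'].unique():
    df.loc[df['type'] == group, 'value'].hist(ax=ax, bins=bin_edges, alpha=0.6)
```

Call plt.show() afterwards to display the chart.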
Customize histogram with 2 groups
The above histograms can be easily customized with the following features:
- change the bins argument to the value we want
- change the color argument to the color we want
- change the edgecolor argument to the color we want
- add a title and axis label
- add a legend
Our first step will be to get a list of the labels in the type variable and then define a list of colors of the same length as the first list.
# Get group names and define colors
group_name = df['type'].unique()
colors = ['purple', 'orange']
# Plot the histograms
for i, group in enumerate(group_name):
    ax = df[df['type']==group]['value'].hist(figsize=(8, 4),
                                             edgecolor='gray',
                                             bins=12,
                                             color=colors[i]
                                             )
# Add a legend
ax.legend(group_name)
# Add a title and axis label
ax.set_title('Distribution of 2 different groups')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
# Show the plot
plt.show()
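The legend above relies on the group names being listed in the same order as the histograms were drawn. An alternative, sketched below, is to attach a label to each histogram when it is drawn and then call legend() without arguments, so each name stays tied to its own bars. The dictionary `colors` that maps group names to colors is a convenience introduced for this sketch, and the data is regenerated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same kind of data as in the post, regenerated so this snippet is self-contained
sample_size = 1000
df = pd.DataFrame({
    'type': ['group1'] * sample_size + ['group2'] * sample_size,
    'value': np.concatenate([np.random.normal(0, 1, sample_size),
                             np.random.normal(4, 1, sample_size)])
})

# Map each group to its color (a convenience for this sketch)
colors = {'group1': 'purple', 'group2': 'orange'}

fig, ax = plt.subplots(figsize=(8, 4))
for group, color in colors.items():
    # The label keyword is forwarded to Matplotlib and picked up by the legend
    df.loc[df['type'] == group, 'value'].hist(ax=ax, bins=12, edgecolor='gray',
                                              color=color, label=group)

# No arguments needed: the legend uses the labels set above
legend = ax.legend()
```

Call plt.show() afterwards to display the chart.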
Histogram with small multiples
Now we will see how to create a chart with small multiple histograms that display the distribution of several variables at the same time. First we need a dataset with more variables with different distributions.
Create the dataset
For this, we will use numpy random functions and generate 9 different numeric variables. Don't worry if this seems complicated to you: it's only useful for generating fake data and making the graphs readable!
# Number of data points
num_data_points = 1000
# Generate data for each distribution
normal_data = np.random.normal(loc=0, scale=1, size=num_data_points)
uniform_data = np.random.uniform(low=-1, high=1, size=num_data_points)
bimodal_data = np.concatenate((np.random.normal(loc=-2, scale=1, size=num_data_points // 2),
                               np.random.normal(loc=2, scale=1, size=num_data_points // 2)))
poisson_data = np.random.poisson(lam=5, size=num_data_points)
exponential_data = np.random.exponential(scale=2, size=num_data_points)
gamma_data = np.random.gamma(shape=2, scale=2, size=num_data_points)
beta_data = np.random.beta(a=2, b=5, size=num_data_points)
lognormal_data = np.random.lognormal(mean=0, sigma=1, size=num_data_points)
triangular_data = np.random.triangular(left=-1, mode=0, right=1, size=num_data_points)
# Create a DataFrame
data = {
    'Normal': normal_data,
    'Uniform': uniform_data,
    'Bimodal': bimodal_data,
    'Poisson': poisson_data,
    'Exponential': exponential_data,
    'Gamma': gamma_data,
    'Beta': beta_data,
    'LogNormal': lognormal_data,
    'Triangular': triangular_data
}
df = pd.DataFrame(data)
Create the chart
Now we can create small multiple histograms with pandas and matplotlib:
- The following code goes through each column of the dataframe and creates a histogram plot
- For each subplot, the code adds a histogram of a specific column's data from the dataframe
- It adds a title and axis label
- The code adjusts the layout (thanks to the tight_layout() function) to make sure the subplots fit well together in the figure
- Finally, it displays the entire set of subplots as a single plot
# Initialize a 3x3 grid of charts
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(8, 8))
# Flatten the axes array (makes it easier to iterate over)
axes = axes.flatten()
# Loop through each column and plot a histogram
for i, column in enumerate(df.columns):
    # Add the histogram
    df[column].hist(ax=axes[i], # Define which ax we're working on
                    edgecolor='white', # Color of the border
                    color='#69b3a2' # Color of the bins
                    )
    # Add title and axis label
    axes[i].set_title(f'{column} distribution')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
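For a quick look at the data, pandas can also build this kind of grid in a single call: the hist() method of a dataframe draws one histogram per numeric column. The sketch below is a shorter alternative to the loop above, with less control over each subplot. It uses a smaller made-up dataframe of 4 columns, named `df_multi` here, with a 2x2 layout so that it runs on its own. The seed is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Small stand-in dataset with 4 numeric columns (seed chosen arbitrarily)
rng = np.random.default_rng(0)
n = 1000
df_multi = pd.DataFrame({
    'Normal': rng.normal(0, 1, n),
    'Uniform': rng.uniform(-1, 1, n),
    'Exponential': rng.exponential(2, n),
    'Poisson': rng.poisson(5, n),
})

# One call draws one histogram per numeric column and returns a grid of Axes
axes = df_multi.hist(layout=(2, 2), figsize=(8, 8),
                     edgecolor='white', color='#69b3a2')

# Each subplot is titled with its column name; axis labels can still be added
for ax in axes.flatten():
    ax.set_ylabel('Frequency')

plt.tight_layout()
```

Call plt.show() afterwards to display the chart. The explicit loop remains the better choice when each subplot needs its own title, color or number of bins.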
Going further
This post explains how to show the distribution of multiple groups and variables with pandas.
For more examples of how to create or customize your plots with Pandas, see the pandas section. You may also be interested in how to customize your histograms with Matplotlib and Seaborn.