Small multiples
This post addresses 2 scenarios:
- you want to represent the distribution of a large number of variables
- you want to represent the distribution of different groups within a single variable
For both of these scenarios, you may want to use small multiple histograms. This article explains how to implement both options with Python.
Libraries
First, you need to install the following libraries:
- matplotlib is used for creating the plot
- numpy is used to generate some data
- pandas is used for data manipulation
# Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Histogram for several variables
Dataset
In our case, we'll create a 4x4 window, for a total of 16 histograms. To do this, we need to generate 16 random variables.
To ensure that the variables are distributed differently, we'll randomly generate means and standard deviations using the numpy function random.uniform(). Our variable names will simply be "Variable_1", "Variable_2", etc.
# Number of variables wanted
num_variables = 16
# Initialize the lists that will contain our variable parameters
columns = []
means = []
stds = []
# Generate random data for each variable
for i in range(num_variables):
    # Assign a name for each variable
    column_name = f"Variable_{i+1}"  # Variable_1, Variable_2, etc
    columns.append(column_name)
    # Generate random mean and standard deviation for each variable
    mean = np.random.uniform(0, 100)
    std = np.random.uniform(5, 100)
    means.append(mean)
    stds.append(std)
# Generate random data for the DataFrame
data = np.random.normal(loc=means, scale=stds, size=(1000, num_variables))
# Create the DataFrame
df = pd.DataFrame(data, columns=columns)
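Because the calls above are unseeded, every run produces different histograms. If you want figures that can be reproduced, one option is to draw everything from a seeded generator. The following is only a sketch: the helper name `make_variables` and the seed value are assumptions made for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd


def make_variables(num_variables=16, num_rows=1000, seed=0):
    """Build a DataFrame of normally distributed columns from a seeded generator."""
    rng = np.random.default_rng(seed)
    # One random mean and one random standard deviation per variable
    means = rng.uniform(0, 100, size=num_variables)
    stds = rng.uniform(5, 100, size=num_variables)
    # loc and scale broadcast across the columns of the (rows, variables) array
    data = rng.normal(loc=means, scale=stds, size=(num_rows, num_variables))
    columns = [f"Variable_{i + 1}" for i in range(num_variables)]
    return pd.DataFrame(data, columns=columns)


df_seeded = make_variables()
```

Calling `make_variables()` twice with the same seed should return identical data, while changing the seed gives a new set of distributions.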
Small multiple plot
The following code creates a 4x4 grid with a total of 16 histograms using small multiples. It does so in several steps:
- defines a 4x4 grid, for a total of 16 subplots
- iterates over our variables, adding a title and an x-axis label to each histogram
- removes extra subplots (this happens only if the number of variables is not equal to num_rows*num_cols)
Colors are generated using matplotlib's tab20 colormap. The plt.tight_layout() function is used to avoid overlap between subplots.
# Number of histograms to display
num_histograms = 16
# Create a 4x4 grid of subplots to accommodate 16 histograms
num_rows = 4
num_cols = 4
# Create a figure and subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(8, 8))
# Flatten the axes array to iterate through subplots easily
axes_flat = axes.flatten()
# Get a list of (16) distinct colors from the tab20 colormap
colors = plt.cm.tab20.colors[:num_histograms]
# Iterate through the DataFrame columns and plot histograms with distinct colors
for i, (column, ax) in enumerate(zip(df.columns, axes_flat)):
    df[column].plot.hist(ax=ax, bins=15, alpha=0.7, color=colors[i], edgecolor='black')
    ax.set_title(f'Histogram of {column}', fontsize=7)
    ax.set_xlabel(column, fontsize=7)
# Remove any extra empty subplots if there are fewer variables than grid cells
for ax in axes_flat[len(df.columns):]:
    fig.delaxes(ax)
# Adjust layout and display the plot
plt.tight_layout()
plt.show()
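The grid above is hard-coded to 4x4. When the number of variables changes, the number of rows and the number of leftover subplots can be derived instead. Here is a minimal sketch of that arithmetic; the function name `grid_shape` is hypothetical and not part of matplotlib.

```python
import math


def grid_shape(num_plots, num_cols=4):
    """Return (num_rows, num_cols, unused) for a grid that holds num_plots panels."""
    if num_plots < 1 or num_cols < 1:
        raise ValueError("num_plots and num_cols must be positive")
    # Enough rows to hold every panel, rounding up
    num_rows = math.ceil(num_plots / num_cols)
    # Grid cells left empty, which are the subplots to remove
    unused = num_rows * num_cols - num_plots
    return num_rows, num_cols, unused


print(grid_shape(16))  # (4, 4, 0)
print(grid_shape(13))  # (4, 4, 3)
```

The first two values can be passed to `plt.subplots()`, and the third tells you how many trailing axes would need to be removed.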
Histogram for several groups
Dataset
This code generates a pandas DataFrame called df with two columns: Continuous_Variable and Categorical_Variable.
The Continuous_Variable column contains 1000 randomly generated continuous values drawn from a normal distribution with a mean of 10 and a standard deviation of 5.
The Categorical_Variable column contains 1000 randomly chosen categories from a list of 16 different modalities, such as Category_1, Category_2, and so on.
The resulting DataFrame df contains 1000 rows and 2 columns, where each row represents a data point with a continuous value and a corresponding categorical value.
size = 1000
# Generating continuous variable
continuous_data = np.random.normal(loc=10, scale=5, size=size)
# Generating categorical variable with 16 different modalities
categories = ['Category_{}'.format(i) for i in range(1, 17)]
categorical_data = np.random.choice(categories, size=size)
# Creating pandas DataFrame
df = pd.DataFrame({
    'Continuous_Variable': continuous_data,
    'Categorical_Variable': categorical_data
})
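Since the 1000 rows are spread at random over 16 categories, each histogram rests on roughly 62 observations (1000 / 16), and the exact group sizes vary from run to run. A quick way to check this is to count the rows per category. The sketch below rebuilds a comparable dataset with a seeded generator; the names `rng` and `example_df` and the seed value are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

example_categories = [f"Category_{i}" for i in range(1, 17)]
example_df = pd.DataFrame({
    "Continuous_Variable": rng.normal(loc=10, scale=5, size=1000),
    "Categorical_Variable": rng.choice(example_categories, size=1000),
})

# Number of rows available for each category's histogram
counts = example_df["Categorical_Variable"].value_counts()
print(counts.min(), counts.max())
```

If some groups turn out to be very small, reducing the number of bins may give more readable histograms.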
Small multiple plot
The following code creates a 4x4 grid of subplots to plot histograms for each category of a categorical variable. It uses the tab20 colormap to get a list of 16 distinct colors for the histograms. Then, it iterates over each category, retrieves the data corresponding to that category, plots a histogram using the data, and sets the title, x-axis label, and y-axis label for each subplot.
# Create a figure and 16 subplots (one for each category)
fig, axs = plt.subplots(4, 4, figsize=(8, 8))
fig.suptitle('Histograms for Each Modality of the Categorical Variable', fontsize=16)
# Flatten the axs array to make it easier to iterate over
axs = axs.flatten()
# Get one distinct color per category from the tab20 colormap
colors = plt.cm.tab20.colors[:len(categories)]
# Iterate over each category and plot the histogram
for i, category in enumerate(categories):
    category_data = df[df['Categorical_Variable'] == category]['Continuous_Variable']
    axs[i].hist(category_data, bins=15, alpha=0.7, edgecolor="black", color=colors[i])
    axs[i].set_title(category, fontsize=7)
    axs[i].set_xlabel('Value', fontsize=7)
    axs[i].set_ylabel('Frequency', fontsize=7)
# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
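In the plot above, each subplot picks its own axis limits and bin edges, which can make groups harder to compare at a glance. One possible variation, sketched below, shares the axes and the bin edges across all panels. The helper name `plot_group_histograms` and its parameters are hypothetical; it relies only on the `sharex` and `sharey` arguments of `plt.subplots()` and on `np.histogram_bin_edges()`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_group_histograms(df, group_col, value_col, num_cols=4, bins=15):
    """Draw one histogram per group on a grid with shared axes and shared bins."""
    groups = list(df[group_col].unique())  # order of first appearance
    num_rows = -(-len(groups) // num_cols)  # ceiling division
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(8, 8),
                            sharex=True, sharey=True, squeeze=False)
    axs = axs.flatten()
    # Compute the bin edges once on the full column so bars line up across panels
    bin_edges = np.histogram_bin_edges(df[value_col], bins=bins)
    for ax, group in zip(axs, groups):
        values = df.loc[df[group_col] == group, value_col]
        ax.hist(values, bins=bin_edges, alpha=0.7, edgecolor="black")
        ax.set_title(str(group), fontsize=7)
    # Hide grid cells that have no group to display
    for ax in axs[len(groups):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axs
```

With the DataFrame from this section, calling `plot_group_histograms(df, 'Categorical_Variable', 'Continuous_Variable')` followed by `plt.show()` should display the same 16 groups on a common scale. Whether shared axes are preferable depends on the data: they help compare groups, but can flatten groups with few observations.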
Going further
This post explained how to create histograms with small multiples using matplotlib.
For more examples of how to customize your histogram, check the histogram section. You might be interested in how to make a histogram with seaborn for a better-looking chart, or even how to show several distributions with a mirror histogram.