A rainplot is a visualization that combines half-density plots, jittered points, and boxplots. This combination is a transparent and intuitive form of visualization that provides an overview of the raw data, the probability density function, and key summary statistics at the same time.

This post explains how to create a raincloud plot with the ptitprince library and Matplotlib. This type of visualization combines a half-violin, raw data points, and a boxplot to make up a rich and transparent chart.

Libraries

Let's get started by importing some libraries:

  • numpy for numerical operations
  • palmerpenguins for the dataset
  • matplotlib for the visualization
  • ptitprince for the rainplot
# For the data
import palmerpenguins
import matplotlib.pyplot as plt
import numpy as np
import ptitprince as pt

Dataset

Today we're gonna use the popular Palmer Penguins dataset which we have used in other posts too.

penguins = palmerpenguins.load_penguins().dropna()

Basic half-violin plot

The PtitPrince library makes it very easy to obtain half-violin plots with the function half_violinplot(). We can use a pandas dataframe and the names of the variables for each axis, or simply the array of values.

Obtaining a horizontal half-violin plot is as simple as passing the name of the numerical variable to x and the name of the categorical variable to y.

pt.half_violinplot(x="bill_length_mm", y="species", data=penguins);

If you change the order of the variables you obtain the vertical version of the same plot:

pt.half_violinplot(x="species", y="bill_length_mm", data=penguins);

Getting started with out rainplot

One of the first things we do is defining creating lists to store the names of the species and the colors. The first is going to be used to subset the dataframe, and colors ara matter of good-taste. A nice-looking palette of colors can make a simple plot much more attractive.

SPECIES = ["Adelie", "Gentoo", "Chinstrap"]
COLORS = ["#FF5A5F", "#FFB400", "#007A87"]

# Create our own Axes with custom size
fig, ax = plt.subplots(figsize=(7, 6))

# inner=None removes the default boxplot
# The 'width' argument allows us to scale the density
# We pass our color palette in the 'palette' argument

pt.half_violinplot(
    x="bill_length_mm", y="species", palette=COLORS, 
    inner=None, data=penguins, width=0.6, ax=ax
);

Add rain

The rain, which is just another name for the mono-dimensional representation of the data points, is useful to identify potential outliers or other patterns. These points are jittered to reduce the overlap and obtain a clearer view of their distribution.

fig, ax = plt.subplots(figsize=(7, 6))

# Half-violin plot as above
pt.half_violinplot(
    x="bill_length_mm", y="species", palette=COLORS, 
    inner=None, data=penguins, width=1, ax=ax
)

# Iterate over the species
for i, species in enumerate(SPECIES):
    # Subset the data
    data = penguins[penguins["species"] == species]
    
    # Jitter the values on the vertical axis
    y = i + np.random.uniform(high=0.2, size=len(data))
    
    # Select the values of the horizontal axis
    x = data["bill_length_mm"]
    
    # Add the rain using the scatter method.
    ax.scatter(x, y, color=COLORS[i], alpha=0.6)

Add boxplots

The last step of this process is to add a boxplot on top of the dots. This is helpful to quickly visualize the median and obtain clear bounds for the bulk of the distributions.

The boxplots are added with the .boxplot() method. This method expects to receive a list of arrays with the values for each group. If you're not very familiar with boxplots in Matplotlib this is going to be very helpful.

# Generate list of arrays
boxplot_data = [
    penguins[penguins["species"] == species]["bill_length_mm"].values 
    for species in SPECIES
]

# Vertical positions for the boxplots
POSITIONS = [0, 1, 2]

ax.boxplot(boxplot_data, vert=False, positions=POSITIONS, manage_ticks=False)

fig

Customize the raincloud plot

One problem with the boxplots above is that they are not aligned with the rain below. This last step consists of shifting the boxplots a little and customizing their appearance.

The following are two dictionaries with styles for the line representing the mean and the box.

# The style of the line that represents the median.
medianprops = {"linewidth": 1.5, "color": "#a9a9a9", "solid_capstyle": "butt"}
# The style of the box ... This is also used for the whiskers
boxprops = {"linewidth": 1.5, "color": "#a9a9a9"}
fig, ax = plt.subplots(figsize=(7, 6))

pt.half_violinplot(
    x="bill_length_mm", y="species", scale = "area", 
    palette=COLORS, inner=None, data=penguins, width=1,
    ax=ax
)

for i, species in enumerate(SPECIES):
    data = penguins[penguins["species"] == species]
    y = i + np.random.uniform(high=0.2, size=len(data))
    x = data["bill_length_mm"]
    ax.scatter(x, y, color=COLORS[i], alpha=0.6)

# Positions are shifted now
SHIFT = 0.1
POSITIONS = [0 + SHIFT, 1 + SHIFT, 2 + SHIFT]

ax.boxplot(
    boxplot_data, 
    vert=False, 
    positions=POSITIONS, 
    manage_ticks=False,
    showfliers = False, # Do not show the outliers beyond the caps.
    showcaps = False,   # Do not show the caps
    medianprops = medianprops,
    whiskerprops = boxprops,
    boxprops = boxprops
)

# Finally, add labels and a title
ax.set_xlabel("Bill length (mm)", fontsize=15)
ax.set_ylabel("Species", fontsize=15)
ax.set_title("Bill length by Species", fontsize=18)

# Change size of tick labels
ax.tick_params(labelsize=13)

Going further

In this post we've seen how to combine a density plot and a boxplot to create a raincloud plot. This is a very powerful way to visualize the distribution of a variable across different groups.

You might be interested how to create multiple density plots, also called ridgeline plot

Contact & Edit


👋 This document is a work by Yan Holtz. You can contribute on github, send me a feedback on twitter or subscribe to the newsletter to know when new examples are published! 🔥

This page is just a jupyter notebook, you can edit it here. Please help me making this website better 🙏!