Sankey Diagram with Python and the pySankey library

This post shows how to create a Sankey diagram using the pySankey library. It covers how the dataset must be formatted, which customizations are available, and how to save the diagram as a png image.

Introduction

A Sankey diagram is a visualisation technique for displaying flows. Several entities (nodes) are represented by rectangles or text. The links between them are drawn as arrows or arcs whose width is proportional to the size of the flow.

The pySankey library, which is based on Matplotlib, makes it easy to draw Sankey diagrams in Python. This post is based on the library's documentation and walks through its main use cases.

The pySankey library can be installed with pip install pysankey. Note that the import name is spelled differently: use pySankey, not pysankey, when importing the library or anything from it.

import pandas as pd

# Import the sankey function from the sankey module within pySankey
from pySankey.sankey import sankey

Basic Sankey diagram

Let's load the fruits.txt dataset that comes with the library. Here we read it directly from the GitHub repository.

The dataset has only 2 columns. Each row describes a connection, with the origin in the first column and the destination in the second. If a connection occurs several times in the dataset (the same row appears many times), its weight is higher and the connection is drawn wider on the diagram.
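To make the idea of implicit weights concrete, here is a minimal sketch that counts how often each origin/destination pair occurs. The rows are made up for illustration (they are not the real fruits.txt content), and the counting only mirrors the principle that repeated rows produce a wider link; it is not pySankey's internal code.

```python
from collections import Counter

# Hypothetical rows in the same shape as the dataset: (origin, destination)
rows = [
    ("apple", "apple"),
    ("apple", "lime"),
    ("apple", "apple"),
    ("banana", "banana"),
    ("lime", "apple"),
    ("apple", "apple"),
]

# Each distinct pair is one link; its count plays the role of the weight
link_weights = Counter(rows)

for (origin, destination), weight in link_weights.most_common():
    print(f"{origin} -> {destination}: {weight}")
```

With a pandas DataFrame, df.groupby(["true", "predicted"]).size() gives the same kind of count per pair.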

The sankey() function draws the diagram. It requires at least 2 arguments, the origin and destination columns:

url = "https://raw.githubusercontent.com/anazalea/pySankey/master/pysankey/fruits.txt"
df = pd.read_csv(url, sep=" ", names=["true", "predicted"])

colors = {
    "apple": "#f71b1b",
    "blueberry": "#1b7ef7",
    "banana": "#f3f71b",
    "lime": "#12e23f",
    "orange": "#f78c1b"
}

sankey(df["true"], df["predicted"], aspect=20, colorDict=colors, fontsize=12)
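When a color dictionary is passed, it is worth checking beforehand that every label appearing in either column has an entry, since a forgotten label is an easy mistake to make (pySankey is expected to complain in that case, though the exact behaviour may depend on the version). The sketch below uses a hypothetical helper, missing_colors, and made-up label lists standing in for df["true"] and df["predicted"].

```python
def missing_colors(left, right, color_dict):
    """Return the labels found in left or right that have no entry in color_dict."""
    labels = set(left) | set(right)
    return sorted(labels - set(color_dict))


# Made-up labels standing in for the two DataFrame columns
left = ["apple", "banana", "lime", "apple"]
right = ["apple", "banana", "orange", "blueberry"]

# Deliberately incomplete color dictionary
colors = {
    "apple": "#f71b1b",
    "banana": "#f3f71b",
    "lime": "#12e23f",
}

missing = missing_colors(left, right, colors)
print(missing)
```

The helper only iterates over its inputs, so pandas Series can be passed in place of the lists.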

Dataset with weights

It's also possible to use weights. The following diagram is based on the customers-goods.csv data from the pySankey library. This time each connection has only 1 row in the dataset, but its weight is explicitly provided in a column called revenue. We can pass this column to the leftWeight and rightWeight arguments to draw the connections with the corresponding sizes.

url = "https://raw.githubusercontent.com/anazalea/pySankey/master/pysankey/customers-goods.csv"
df = pd.read_csv(url, sep=",")

sankey(
    left=df["customer"], right=df["good"],
    leftWeight=df["revenue"], rightWeight=df["revenue"],
    aspect=20, fontsize=20
)
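The effect of explicit weights can be sketched as follows: with one row per link, the size of each node is the sum of the weights of the links attached to it. The customer names, goods and revenue figures below are invented for illustration and differ from the real customers-goods.csv values; the summation is an assumption about how the diagram is sized, not a copy of pySankey's code.

```python
from collections import defaultdict

# Invented rows: (customer, good, revenue), one row per link
rows = [
    ("John", "fruit", 5.5),
    ("John", "meat", 11.0),
    ("Mike", "fruit", 2.5),
    ("Mike", "drinks", 4.0),
]

left_totals = defaultdict(float)   # total flow leaving each left-hand node
right_totals = defaultdict(float)  # total flow reaching each right-hand node

for customer, good, revenue in rows:
    left_totals[customer] += revenue
    right_totals[good] += revenue

print(dict(left_totals))
print(dict(right_totals))
```

Because the same revenue column feeds both sides, the totals on the left and on the right add up to the same overall amount.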

Save the figure (.png)

To save the diagram with a custom size, use Matplotlib to retrieve the current figure, resize it, and write it to a file:

import matplotlib.pyplot as plt

# Create Sankey diagram again
sankey(
    left=df["customer"], right=df["good"],
    leftWeight=df["revenue"], rightWeight=df["revenue"],
    aspect=20, fontsize=20
)

# Get current figure
fig = plt.gcf()

# Set size in inches
fig.set_size_inches(6, 6)

# Set the color of the background to white
fig.set_facecolor("w")

# Save the figure
fig.savefig("customers-goods.png", bbox_inches="tight", dpi=150)

Contact & Edit


👋 This document is a work by Yan Holtz. You can contribute on GitHub, send feedback on Twitter, or subscribe to the newsletter to know when new examples are published! 🔥