This page describes how to build a basic dendrogram with Python. To build one, you first need a numeric matrix. Each row represents an entity (here a car). Each column is a variable that describes the cars. The objective is to cluster the entities to find out which ones are similar to each other.
In the end, entities that are highly similar sit close together in the tree. Let's start by loading a dataset and the required libraries:
# Libraries
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np

# Import the mtcars dataset from the web + keep only numeric variables
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
df = df.set_index('model')  # car names become the index, leaving only numeric columns
df.index.name = None        # remove the index label
df
All right, now that we have our numeric matrix, we can calculate the distance between each pair of cars and run the hierarchical clustering. Both steps are done by the linkage function. I won't go into the details here, but I strongly advise visiting graph #401 for more about this crucial step.
# Calculate the distance between each sample
# You have to think about the metric you use (how to measure similarity)
# and about the clustering method you use (how to group cars)
Z = linkage(df, 'ward')
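To get a feel for what linkage returns, here is a minimal sketch on a made-up toy matrix (the array `X` below is hypothetical and is not part of the mtcars data). It assumes the Ward method with the Euclidean metric, as in the call above. The result has one row per merge: the two clusters being merged, the distance between them, and the number of original entities in the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: 4 entities (rows) described by 2 numeric variables (columns)
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [5.0, 6.0]])

# Ward linkage requires the Euclidean metric
Z_toy = linkage(X, method='ward', metric='euclidean')

# One row per merge, so n entities give n - 1 rows
print(Z_toy.shape)

# Columns: cluster id 1, cluster id 2, merge distance, size of the new cluster
print(Z_toy)
```

The last row always describes the final merge, so its fourth column equals the total number of entities.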
Last but not least, you can easily plot this object as a dendrogram using the dendrogram function. See graph #401 for possible customisation.
# Make the dendrogram
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance (Ward)')
dendrogram(Z, labels=df.index.tolist(), leaf_rotation=90)
plt.show()
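If you only want to check the order of the leaves without drawing anything, the dendrogram function can also be called with no_plot=True. The sketch below uses a made-up toy matrix and hypothetical labels ('a' to 'd') rather than the cars, and simply reads the leaf labels from the returned dictionary.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 'a' and 'b' are close together, as are 'c' and 'd'
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [5.0, 6.0]])
names = ['a', 'b', 'c', 'd']

Z_toy = linkage(X, 'ward')

# no_plot=True computes the tree layout without drawing it
tree = dendrogram(Z_toy, labels=names, no_plot=True)

# 'ivl' holds the leaf labels in the order they would appear on the axis
leaf_order = tree['ivl']
print(leaf_order)
```

Similar entities end up next to each other in this list, which is the same reading you make on the plotted tree.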