#404 Dendrogram with heat map

When you use a dendrogram to display the result of a cluster analysis, it is good practice to add the corresponding heatmap. Together they let you visualise the structure of your entities (dendrogram) and check whether this structure makes sense (heatmap). This is easy thanks to the seaborn library, which provides an awesome ‘clustermap’ function. This page describes how it works; note that, once more, the seaborn documentation is awesome.

  • Before starting complicated stuff, let’s start with a basic dendrogram with heat map. As input you need a numeric matrix: each row is an entity (a car here), and each column is a numerical variable that describes the cars. Once you have it, just call the clustermap function!

    The figure is quite disappointing: the heatmap is almost all black! Why? Well, have a look at the dataset. Most variables take small values (many stay between 0 and 5). However, the variables ‘disp’ and ‘hp’ have values over 100. On a shared colour scale, all the other variables look tiny by comparison and therefore appear black.

    All right, then we need to normalize our dataset.
    
    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    df
    
    # Default plot
    sns.clustermap(df)
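
A quick way to spot this kind of scale mismatch before plotting is to look at the minimum and maximum of each column. The sketch below uses a small made-up table (the values and the `car_a`, `car_b`, `car_c` labels are hypothetical, not the real dataset) and the threshold of 100 is only taken from the remark above.

```python
import pandas as pd

# Hypothetical toy values, chosen only to mimic the scale mismatch described above
toy = pd.DataFrame(
    {
        "wt": [2.6, 3.2, 3.4],
        "drat": [3.9, 3.1, 2.8],
        "disp": [160.0, 258.0, 360.0],
        "hp": [110.0, 105.0, 175.0],
    },
    index=["car_a", "car_b", "car_c"],
)

# One row per variable, with its smallest and largest value
ranges = toy.agg(["min", "max"]).T
print(ranges)

# Variables whose maximum exceeds 100 will dominate a shared colour scale
dominant = ranges.index[ranges["max"] > 100].tolist()
print(dominant)
```

Running the same two lines on your own data frame tells you whether rescaling is worth considering before calling clustermap.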
    
    
  • Fortunately, seaborn provides options to standardise (left) or normalize (right) the data. Note that you can do it by row (0) or by column (1). Normalizing (z_score) means that for each cell of the matrix you subtract the mean of the row (or column), and then divide by the standard deviation of the row (or column). Standardizing (standard_scale) means subtracting the minimum and then dividing by the maximum, so that values range from 0 to 1.

    
    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    df
    
    # Standardize or Normalize every column in the figure
    # Standardize:
    sns.clustermap(df, standard_scale=1)
    # Normalize
    sns.clustermap(df, z_score=1)
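
To make the two rescalings concrete, here is the same arithmetic done by hand with pandas on a tiny made-up table. This is a sketch of the computation described in the text, applied column by column; it is not seaborn's own code, and the `toy` values are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: two variables on very different scales
toy = pd.DataFrame(
    {"wt": [2.0, 3.0, 4.0], "hp": [100.0, 150.0, 300.0]},
    index=["car_a", "car_b", "car_c"],
)

# "Normalize" (z-score) each column: subtract the mean, divide by the standard deviation
# Note: pandas uses the sample standard deviation (ddof=1) by default
z = (toy - toy.mean()) / toy.std()

# "Standardize" each column: subtract the minimum, then divide by the maximum
shifted = toy - toy.min()
scaled = shifted / shifted.max()

print(z.round(3))
print(scaled.round(3))
```

After either transformation both columns live on a comparable scale, which is why the heatmap colours become informative again.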
    
    

     

  • Now that the data are normalised, we have to decide how to calculate the distance between individuals. Several methods are available; the most common ones are the Pearson correlation and the Euclidean distance. Note that they can give really different results, so you HAVE TO think about it (see chart below). This doc can give you more info.

    
    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    
    
    # OK, now we can compare our individuals. But how do you determine the similarity between 2 cars?
    # There are several ways to calculate it. The 2 most common are correlation and Euclidean distance.
    sns.clustermap(df, metric="correlation", standard_scale=1)
    sns.clustermap(df, metric="euclidean", standard_scale=1)
    
    

    Take into account the difference between Pearson correlation and Euclidean distance. Here are 4 cases. These 2 metrics can tell the same story (top right), but can also give completely different results.
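
The disagreement between the two metrics can be reproduced with three small vectors. This sketch uses `scipy.spatial.distance`, where the correlation distance is 1 minus the Pearson correlation; the vectors `a`, `b` and `c` are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import correlation, euclidean

a = np.array([1.0, 2.0, 3.0, 4.0])
b = a * 10                          # same profile as a, ten times larger
c = np.array([4.0, 3.0, 2.0, 1.0])  # same magnitude as a, reversed profile

# Correlation distance only looks at the shape of the profile
corr_ab = correlation(a, b)
corr_ac = correlation(a, c)

# Euclidean distance looks at the absolute gap between values
eucl_ab = euclidean(a, b)
eucl_ac = euclidean(a, c)

print(f"correlation: a-b = {corr_ab:.2f}, a-c = {corr_ac:.2f}")
print(f"euclidean:   a-b = {eucl_ab:.2f}, a-c = {eucl_ac:.2f}")
```

With correlation, `a` is closest to `b`; with Euclidean distance, `a` is closest to `c`. Which answer is right depends on whether you care about profiles or magnitudes.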

  • The last tricky statistical part of this graphic is the clustering algorithm used to group the individuals. Once more, it can greatly change the result of your analysis. Do not hesitate to visit this doc for more info. If you have no idea which algorithm to use, the Ward method is probably a good starting point.

    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    df
    
    # OK, now we have determined the distance between 2 individuals. But how do we cluster them? Several methods exist.
    # If you have no idea, ward is probably a good start.
    sns.clustermap(df, metric="euclidean", standard_scale=1, method="single")
    sns.clustermap(df, metric="euclidean", standard_scale=1, method="ward")
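
To see what the `method` argument changes, you can run the hierarchical clustering directly with `scipy.cluster.hierarchy.linkage`, which accepts the same method names. The sketch below uses eight made-up points forming two tight squares; it only illustrates how the linkage rule changes the merge heights and is not a reproduction of the figures above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical points: two small squares far apart from each other
points = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11]],
    dtype=float,
)

# Same distance metric, two different linkage rules
Z_single = linkage(points, method="single", metric="euclidean")
Z_ward = linkage(points, method="ward", metric="euclidean")

# Height of the final merge (third column of the last row of the linkage matrix)
print("single:", Z_single[-1, 2])
print("ward:  ", Z_ward[-1, 2])

# Cut each tree into 2 groups
labels_single = fcluster(Z_single, t=2, criterion="maxclust")
labels_ward = fcluster(Z_ward, t=2, criterion="maxclust")
print(labels_single, labels_ward)
```

On data this cleanly separated both methods find the same two groups, and only the dendrogram heights differ. On noisier data, single linkage tends to chain points together one by one, which is one reason the resulting trees can look very different.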
    
    
  • You can pass any color palette to the clustermap function. Here are 3 examples. Read this page for more information concerning color palettes.

    
    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    df
    
    # Change color palette
    sns.clustermap(df, metric="euclidean", standard_scale=1, method="ward", cmap="mako")
    sns.clustermap(df, metric="euclidean", standard_scale=1, method="ward", cmap="viridis")
    sns.clustermap(df, metric="euclidean", standard_scale=1, method="ward", cmap="Blues")
    
    
  • Sometimes a few cells in your input have extreme values. In a heatmap, this makes every other cell look the same color, which is not desired. The clustermap function lets you avoid that with the ‘robust’ argument. Here is an example with (left) and without (right) this option.

    
    # Libraries
    import seaborn as sns
    import pandas as pd
    from matplotlib import pyplot as plt
    
    # Data set
    url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
    df = pd.read_csv(url)
    df = df.set_index('model')
    df.index.name = None  # remove the 'model' label from the index
    df
    
    # Ignore outliers
    # Let's create an outlier in the dataset:
    df.iloc[15, df.columns.get_loc('drat')] = 1000  # row 15 (by position), column 'drat'
    # use robust color limits
    sns.clustermap(df, robust=True)
    
    # do not use it
    sns.clustermap(df, robust=False)
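
The idea behind a robust color range can be sketched with numpy: instead of stretching the colors from the minimum to the maximum, the limits are taken from quantiles, so a single extreme cell no longer sets the scale. The 2nd and 98th percentiles used here are an assumption for illustration, and the `values` array is made up.

```python
import numpy as np

# Hypothetical data: 99 ordinary values between 0 and 5, plus one outlier
values = np.concatenate([np.linspace(0, 5, 99), [1000.0]])

# Non-robust limits: the outlier stretches the color scale up to 1000
vmin_full, vmax_full = values.min(), values.max()

# Robust limits (assumed 2nd and 98th percentiles): the outlier is ignored
vmin_rob, vmax_rob = np.percentile(values, [2, 98])

print("full range:  ", vmin_full, vmax_full)
print("robust range:", round(vmin_rob, 2), round(vmax_rob, 2))
```

With the full range, every ordinary value falls in the bottom 0.5% of the color scale; with the robust range, the ordinary values use almost the whole palette and the outlier is simply clipped to the top color.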
    
    
