#39 Hidden data under boxplot


This page is dedicated to a dangerous feature of boxplots. A boxplot summarizes the distribution of a numeric variable for several groups. The problem is that summarizing also means losing information, and that can lead to wrong conclusions.

Looking at the boxplot produced by the code below, it is easy to conclude that group ‘C’ has higher values than the others. However, we can see neither the underlying distribution of the points within each group nor the number of observations in each.

Let’s see a few techniques that avoid this pitfall.

Code of the boxplot



# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

# Dataset: group B is built from two normal samples, so it is bimodal
a = pd.DataFrame({ 'group' : np.repeat('A',500), 'value': np.random.normal(10, 5, 500) })
b = pd.DataFrame({ 'group' : np.repeat('B',500), 'value': np.random.normal(13, 1.2, 500) })
c = pd.DataFrame({ 'group' : np.repeat('B',500), 'value': np.random.normal(18, 1.2, 500) })
d = pd.DataFrame({ 'group' : np.repeat('C',20), 'value': np.random.normal(25, 4, 20) })
# Explicit bounds: the original call passed only 12, leaving the upper bound at its default of 1
e = pd.DataFrame({ 'group' : np.repeat('D',100), 'value': np.random.uniform(1, 12, size=100) })
# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
df = pd.concat([a, b, c, d, e], ignore_index=True)

# Usual boxplot
sns.boxplot(x='group', y='value', data=df)
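
To see how little the boxplot retains, the quartiles it draws can be printed next to the group sizes. The sketch below is only an illustration and is not part of the original page: it rebuilds a dataset of the same shape with a fixed seed (the seed and the `summary` name are assumptions), then computes the count and the quartiles per group with pandas.

```python
import numpy as np
import pandas as pd

# Rebuild a dataset with the same shape as above; the seed is an assumption
# added only to make the numbers reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C", "D"], [500, 1000, 20, 100]),
    "value": np.concatenate([
        rng.normal(10, 5, 500),
        rng.normal(13, 1.2, 500),   # first mode of group B
        rng.normal(18, 1.2, 500),   # second mode of group B
        rng.normal(25, 4, 20),
        rng.uniform(1, 12, 100),
    ]),
})

# Sample size plus the three quartiles a boxplot draws, one row per group
summary = df.groupby("group")["value"].describe()[["count", "25%", "50%", "75%"]]
print(summary)
```

The `count` column is exactly what the plot never shows: the quartiles of group C rest on 20 values, and nothing in the three quartiles of group B reveals that it has two modes.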

 

 

Proposed corrections


  •  

     

     

    The first common solution is to add all the data points over the boxplot, with transparency and jitter. Jitter means that each data point is shifted by a small random amount along the X axis, so that points with similar values do not hide each other. Learn more in the dedicated chart #36.

    This is the best option when the amount of data is limited. With too many points, the dots overlap and the figure becomes unreadable, so a violin plot is a better choice.

     

     

     

     

    Code

    
    ax = sns.boxplot(x='group', y='value', data=df)
    ax = sns.stripplot(x='group', y='value', data=df, color="orange", jitter=0.2, size=2.5)
    plt.title("Boxplot with jitter", loc="left")
    
    
  •  

     

     

     

    If the amount of data in each group is really large, the best option is a violin plot, which represents the distribution of each group. See how the bimodal distribution of group B was hidden in the boxplot? See more in the dedicated section of the gallery.

    However, be careful to check the number of data points in each group, since this information is hidden in violin plots. Here, group C has almost no data compared to the others, and we do not see it!

     

     

     

     

    Code

    
    sns.violinplot( x='group', y='value', data=df)
    plt.title("Violin plot", loc="left")
    
    
  •  

     

     

    It is always good practice to display the number of observations for each group. Look here: group C is the highest one, but it has only 20 observations, whereas the others have between 100 and 1000!

    This is definitely information you want to have before making any decision.

     

     

     

     

     

    Code

    
    # Start with a basic boxplot and keep the Axes it returns
    ax = sns.boxplot(x="group", y="value", data=df)
    
    # Calculate number of obs per group & median to position labels
    # (groupby sorts the groups alphabetically, which matches the plot order here)
    medians = df.groupby("group")["value"].median().values
    nobs = df.groupby("group").size().values
    nobs = ["n: " + str(x) for x in nobs]
    
    # Add it to the plot, one label per x tick
    pos = range(len(nobs))
    for tick, label in zip(pos, ax.get_xticklabels()):
        ax.text(pos[tick], medians[tick] + 0.4, nobs[tick], horizontalalignment='center', size='medium', color='w', weight='semibold')
    
    # add title
    plt.title("Boxplot with number of observations", loc="left")
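
The jitter described in the first option can also be written by hand, which makes clear what it does to the data: only the X position is randomized, the values themselves are untouched. The sketch below uses plain matplotlib and numpy. The two groups, the seed, and `jitter_width` are assumptions made for illustration, and the code mirrors the idea only, not seaborn's internals.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical groups, drawn at x = 0 and x = 1
values = {"A": rng.normal(10, 5, 200), "C": rng.normal(25, 4, 20)}

fig, ax = plt.subplots()
jitter_width = 0.2  # assumed maximum horizontal shift on each side
for position, (name, y) in enumerate(values.items()):
    # Every point gets a random x offset in [-jitter_width, jitter_width)
    x = position + rng.uniform(-jitter_width, jitter_width, size=len(y))
    ax.scatter(x, y, s=6, alpha=0.5, color="orange")

ax.set_xticks(range(len(values)))
ax.set_xticklabels(list(values))
ax.set_title("Hand-made jitter", loc="left")
```

With only 20 points, the second group is visibly sparse next to the first, which is the information a bare boxplot would hide.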
    
    
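
Since a violin plot hides the number of observations, one way to keep it visible is to write the group size into the tick labels. This is a minimal sketch under stated assumptions: it uses matplotlib's own `Axes.violinplot` rather than seaborn so that the mechanics are explicit, and the two-group dataset, the seed, and the label format are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "group": np.repeat(["A", "C"], [500, 20]),
    "value": np.concatenate([rng.normal(10, 5, 500), rng.normal(25, 4, 20)]),
})

# One array of values per group, in a fixed order
order = sorted(df["group"].unique())
samples = [df.loc[df["group"] == g, "value"].to_numpy() for g in order]
positions = np.arange(len(order))

fig, ax = plt.subplots()
ax.violinplot(samples, positions=positions, showmedians=True)

# Put the sample size directly into each tick label
labels = [f"{g}\n(n={len(s)})" for g, s in zip(order, samples)]
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_title("Violin plot with group sizes", loc="left")
```

Because the counts live in the axis labels, the same trick works unchanged for a boxplot or a jittered strip plot.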
