This page is dedicated to a dangerous feature of boxplots. A boxplot summarizes the distribution of a numeric variable for several groups. The problem is that summarizing also means losing information, and that can lead to mistakes.
If we consider the boxplot shown here, it is easy to conclude that the ‘C’ group has a higher value than the others. However, we cannot see the underlying distribution of the points within each group, nor the number of observations in each.
Let’s look at a few techniques to avoid this pitfall:
Code of the boxplot
    # libraries and data
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    import pandas as pd

    # Dataset: group B is built from two normal samples, so it is bimodal
    a = pd.DataFrame({'group': np.repeat('A', 500), 'value': np.random.normal(10, 5, 500)})
    b = pd.DataFrame({'group': np.repeat('B', 500), 'value': np.random.normal(13, 1.2, 500)})
    c = pd.DataFrame({'group': np.repeat('B', 500), 'value': np.random.normal(18, 1.2, 500)})
    d = pd.DataFrame({'group': np.repeat('C', 20), 'value': np.random.normal(25, 4, 20)})
    e = pd.DataFrame({'group': np.repeat('D', 100), 'value': np.random.uniform(12, size=100)})

    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    df = pd.concat([a, b, c, d, e], ignore_index=True)

    # Usual boxplot
    sns.boxplot(x='group', y='value', data=df)
Proposition to correct it

The first common solution is to add all the data points with transparency and jitter over the boxplot. Jitter means that we shift each data point by a small random amount along the X axis. Learn more in the dedicated chart #36.
This is the best option if you have a limited amount of data. If you have too many points, the dots will overlap and the figure becomes hard to read, so a violin plot is a better choice.
Code
    # Boxplot with every observation drawn on top as small, jittered, semi-transparent dots
    ax = sns.boxplot(x='group', y='value', data=df)
    ax = sns.stripplot(x='group', y='value', data=df, color="orange", jitter=0.2, size=2.5, alpha=0.6)
    plt.title("Boxplot with jitter", loc="left")
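To make the idea of jitter concrete, here is a minimal sketch of doing it by hand with NumPy. The names `group_positions` and `jitter_width`, the toy positions, and the seed are all assumptions made for illustration; this is not necessarily how seaborn computes its offsets internally.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=0)

# Hypothetical X position of each data point: 0 for the first group,
# 1 for the second, 2 for the third
group_positions = np.array([0, 0, 0, 1, 1, 2])

# Jitter: add a small uniform random offset to every X position
jitter_width = 0.2  # hypothetical maximum shift on each side
offsets = rng.uniform(-jitter_width, jitter_width, size=group_positions.size)
x_jittered = group_positions + offsets

print(x_jittered)
```

Only the X coordinate is perturbed, so the values read on the Y axis stay exact while overlapping points get spread apart.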

If each group contains a large amount of data, a violin plot is usually the better option. It shows the shape of the distribution of each group. Notice how the bimodal distribution of group B was hidden in the boxplot. See more in the dedicated section of the gallery.
However, be careful to check the number of data points in each group, since this information is hidden in violin plots. Here, group C has almost no data compared to the others, and we do not see it!
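A quick way to run that check is to count the rows per group before plotting. The sketch below builds a small stand-in dataset (`df_demo`, with made-up values and an arbitrary seed) so that it runs on its own; with the dataset from this page you would apply the same `groupby` call to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in dataset with very unbalanced group sizes
df_demo = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], [500, 1000, 20]),
    "value": rng.normal(10, 2, 1520),
})

# Number of observations in each group
counts = df_demo.groupby("group").size()
print(counts)
```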
Code
    sns.violinplot(x='group', y='value', data=df)
    plt.title("Violin plot", loc="left")
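The bimodality that the boxplot hides can also be checked numerically. The sketch below is an illustration under assumptions: it regenerates a two-component sample similar to group B with an arbitrary seed, and `count_between` is a hypothetical helper. It compares the quartiles a boxplot would draw with the number of points near each mode and in the gap between them.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical stand-in for group B: a mixture of two normal samples
values = np.concatenate([rng.normal(13, 1.2, 500), rng.normal(18, 1.2, 500)])

# What the box of a boxplot is built from: the three quartiles
q1, median, q3 = np.percentile(values, [25, 50, 75])

def count_between(low, high):
    """Count the points falling in the half-open interval [low, high)."""
    return int(np.sum((values >= low) & (values < high)))

near_first_mode = count_between(12.5, 13.5)
in_valley = count_between(15.0, 16.0)
near_second_mode = count_between(17.5, 18.5)

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}")
print(near_first_mode, in_valley, near_second_mode)
```

The quartiles alone look unremarkable, while the counts show far fewer points in the middle interval than around either mode, which is the shape a violin plot makes visible.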

It is always good practice to display the number of observations for each group. Look here: group C has the highest values, but only 20 observations, whereas the others have between 100 and 1000!
This is definitely information you want to know before making any decision.
Code
    # Start with a basic boxplot and keep the Axes it returns
    ax = sns.boxplot(x="group", y="value", data=df)

    # Calculate number of obs per group & median to position labels
    # (groupby sorts the groups alphabetically, which here matches the A, B, C, D order on the X axis)
    medians = df.groupby('group')['value'].median().values
    nobs = df.groupby('group').size().values
    nobs = ["n: " + str(x) for x in nobs]

    # Add the labels to the plot, just above each median line
    for tick in range(len(nobs)):
        ax.text(tick, medians[tick] + 0.4, nobs[tick],
                horizontalalignment='center', size='medium', color='w', weight='semibold')

    # add title
    plt.title("Boxplot with number of observations", loc="left")