The boxplot is probably one of the most common chart types. It gives a nice summary of one or several numeric variables. The line that divides the box into two parts represents the median of the data. The ends of the box show the upper and lower quartiles. The extreme lines show the highest and lowest values, excluding outliers. Note that a boxplot hides the number of values behind each variable. Thus, it is highly advised to print the number of observations, add the individual observations with jitter, or use a violin plot if you have many observations.
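To make the anatomy described above concrete, here is a minimal sketch of the numbers a boxplot draws. It uses only the Python standard library. The function name `boxplot_stats`, the sample values, and the 1.5 × IQR rule for deciding what counts as an outlier are assumptions for illustration: the text above does not say how outliers are defined, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Return the values a boxplot displays (hypothetical helper).

    Assumption: a point is an outlier if it lies more than
    `whisker` * IQR beyond the quartiles (a common convention).
    """
    data = sorted(values)
    # Quartiles: lower edge of the box, median line, upper edge of the box.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "n": len(data),                # the count a boxplot does not show
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],      # lowest value that is not an outlier
        "whisker_high": inside[-1],    # highest value that is not an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(sample)
print(stats)
```

With this sample, the value 30 falls outside the upper fence, so the upper whisker stops at 9 and 30 would be drawn as a separate point.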

Input format

Format 1: one numerical variable (for the Y axis) + one categorical variable (gives the groups). This is the ‘long’ or ‘tidy’ format.


Format 2: several numerical variables, one per group. This is the ‘wide’ format.
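The two input formats can be sketched with plain Python structures. This is only an illustration: the group names, the values, and the helper names `long_to_wide` and `wide_to_long` are made up, and in practice the data would usually live in a data frame rather than in lists and dictionaries.

```python
from collections import defaultdict

# Long ("tidy") format: one row per observation, as (group, value) pairs.
long_rows = [
    ("A", 1.2), ("A", 1.9), ("A", 2.4),
    ("B", 3.1), ("B", 2.8),
]

def long_to_wide(rows):
    """Collect the values of each group into its own column."""
    wide = defaultdict(list)
    for group, value in rows:
        wide[group].append(value)
    return dict(wide)

def wide_to_long(columns):
    """Flatten one column per group back into (group, value) rows."""
    return [(group, value)
            for group, values in columns.items()
            for value in values]

# Wide format: one numerical column per group.
wide_columns = long_to_wide(long_rows)
print(wide_columns)
print(wide_to_long(wide_columns))
```

Note that in the wide format the columns can have different lengths, since groups rarely contain the same number of observations.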




Boxplot and hidden data



A boxplot summarizes the distribution of a numerical variable for one or several groups. In doing so, it hides the underlying distribution and the number of points in each group, which can make this chart misleading. This post gives an example of a possible mistake and three solutions to fix it.
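As a small illustration of how much a boxplot can hide, the sketch below builds two made-up groups that would produce the same box and whiskers even though they differ in size and shape. The data and the helper `five_number_summary` are assumptions for illustration, not the example used in the post, and the quartile method is one of several in common use.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])

# Group A: 5 evenly spread observations.
group_a = [1, 3, 5, 7, 9]
# Group B: 13 observations, mostly piled up at two values (3 and 7).
group_b = [1, 3, 3, 3, 3, 3, 5, 7, 7, 7, 7, 7, 9]

summary_a = five_number_summary(group_a)
summary_b = five_number_summary(group_b)

# Printing the sample size next to the summary reveals what the box hides.
for name, group, summary in [("A", group_a, summary_a),
                             ("B", group_b, summary_b)]:
    print(f"group {name}: n={len(group)}, summary={summary}")
```

Both groups share the same minimum, quartiles, median, and maximum, so their boxplots would look identical. Only the sample size and the individual points show that group B is larger and concentrated around two values.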