Overplotting is one of the most common problems in dataviz. When your dataset is big, the dots of your scatterplot tend to overlap, and your graphic becomes unreadable.
This problem is illustrated by the scatterplot beside, made with Matplotlib (code below). A first look might lead to the conclusion that there is no relationship between X and Y. We will see below how wrong that is.
In this post, I propose 10 charts that help avoid overplotting. Of course, reproducible code is provided for each.
Code: a scatterplot with overplotting
# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_theme()  # stands in for plt.style.use('seaborn'), which recent Matplotlib versions no longer accept

# Dataset: three groups of 20,000 points each
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 20000), 'y': np.random.normal(10, 1.2, 20000), 'group': np.repeat('A', 20000)})
tmp1 = pd.DataFrame({'x': np.random.normal(14.5, 1.2, 20000), 'y': np.random.normal(14.5, 1.2, 20000), 'group': np.repeat('B', 20000)})
tmp2 = pd.DataFrame({'x': np.random.normal(9.5, 1.5, 20000), 'y': np.random.normal(15.5, 1.5, 20000), 'group': np.repeat('C', 20000)})
# DataFrame.append was removed in pandas 2.0, so concatenate instead
df = pd.concat([df, tmp1, tmp2], ignore_index=True)

# plot
plt.plot('x', 'y', data=df, linestyle='', marker='o')
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting looks like that:', loc='left')
Let’s see how to avoid it

The first option is really easy to apply, yet really effective: just reduce the size of the dots! In this case it works very well. See: our messy scatterplot was actually composed of three groups of data!
Please note that, as with many of the methods in this post, the drawback is that outliers become really hard to detect.
# Plot with small marker size
plt.plot('x', 'y', data=df, linestyle='', marker='o', markersize=0.7)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to reduce the dot size', loc='left')

Transparency indirectly lets you see the density of dots in each part of the scatterplot: the more dots overlap, the darker the area. Once more, the 3 groups are now obvious.
# Plot with transparency
plt.plot('x', 'y', data=df, linestyle='', marker='o', markersize=3, alpha=0.05, color="purple")

# Titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to use transparency', loc='left')
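To see why transparency reveals density, here is a small sketch of how opacity builds up when semi-transparent dots stack. It assumes standard alpha compositing, where each dot hides a fraction alpha of whatever is still visible beneath it. The function name stacked_opacity is mine, not part of Matplotlib.

```python
def stacked_opacity(alpha, n_dots):
    """Opacity of n_dots overlapping markers that each have the given alpha.

    Assumes standard "over" alpha compositing: every extra dot hides a
    fraction `alpha` of what is still visible beneath it.
    """
    return 1 - (1 - alpha) ** n_dots


# How opaque does an area look as more and more dots pile up?
for n in (1, 5, 20, 50, 100):
    print(f"{n:>3} overlapping dots -> opacity {stacked_opacity(0.05, n):.2f}")
```

With alpha=0.05, a lone dot is nearly invisible while crowded areas approach full color, which is why the dense cores of the three groups stand out.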

One of my favorite options is to make a 2D density chart. Note that several variations of this solution exist and are presented in a dedicated section, so visit it for more explanation!
In addition to being one of the best options in this post, it is a one-liner thanks to the seaborn library!
# 2D density plot (recent seaborn versions need keyword x/y and use fill instead of shade):
sns.kdeplot(x=df.x, y=df.y, cmap="Reds", fill=True)
plt.title('Overplotting? Try 2D density graph', loc='left')
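As a rough illustration of one such variation, a 2D histogram simply counts the points that fall in each cell of a grid, which is much cheaper than a kernel density estimate on a big dataset. The sketch below is not the seaborn approach above: it uses np.histogram2d and pcolormesh on a smaller stand-in dataset that I generate here for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data for this example only: three Gaussian blobs of 5,000 points
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1.2, 5000),
                    rng.normal(14.5, 1.2, 5000),
                    rng.normal(9.5, 1.5, 5000)])
y = np.concatenate([rng.normal(10, 1.2, 5000),
                    rng.normal(14.5, 1.2, 5000),
                    rng.normal(15.5, 1.5, 5000)])

# Count the points falling in each cell of a 50 x 50 grid
counts, xedges, yedges = np.histogram2d(x, y, bins=50)

# histogram2d puts x on the first axis; pcolormesh expects rows = y, so transpose
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xedges, yedges, counts.T, cmap="Reds")
fig.colorbar(mesh, ax=ax, label='Number of points')
ax.set_xlabel('Value of X')
ax.set_ylabel('Value of Y')
ax.set_title('Overplotting? Try a 2D histogram', loc='left')
```

The bin count (50 here) plays the same role as the bandwidth of the density estimate: too few bins blur the groups together, too many bring the noise back.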

Simple but effective: you can randomly sample a subset of your data frame. It works, but it is probably not the best option: if you are unlucky, you could miss an interesting pattern.
Thanks to pandas, this is really easy to do:
# Sample 1000 random rows
df_sample = df.sample(1000)

# Make the plot with this subset
plt.plot('x', 'y', data=df_sample, linestyle='', marker='o')

# titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Sample your data', loc='left')
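One way to lower the risk of missing a pattern, when a grouping column is available, is to sample each group separately so that a small group cannot vanish from the subset. This is only a sketch: the helper sample_per_group and the demo data frame with a rare group 'C' are made up for the illustration, and a fixed seed keeps the figure reproducible.

```python
import numpy as np
import pandas as pd


def sample_per_group(data, group_col, n_per_group, seed=0):
    """Draw up to n_per_group random rows from every group of data."""
    parts = [
        rows.sample(n=min(n_per_group, len(rows)), random_state=seed)
        for _, rows in data.groupby(group_col)
    ]
    return pd.concat(parts)


# Demo data (hypothetical): two big groups and one rare group
rng = np.random.default_rng(1)
sizes = {'A': 20000, 'B': 20000, 'C': 150}
n_total = sum(sizes.values())
df_demo = pd.DataFrame({
    'x': rng.normal(10, 1.5, n_total),
    'y': rng.normal(10, 1.5, n_total),
    'group': np.repeat(list(sizes), list(sizes.values())),
})

# At most 300 rows per group; the rare group is kept in full
df_strat = sample_per_group(df_demo, 'group', n_per_group=300)
print(df_strat['group'].value_counts())
```

The result can be passed to the same plt.plot call as df_sample. Keep in mind that this evens out the group sizes, so the subset no longer reflects how common each group really is.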

If the entities are grouped according to a categorical variable, it is a good idea to visualise each group one by one. Personally, I like to keep the whole dataset in the background to be able to compare:
# Keep only the group to highlight
df_filtered = df[df['group'] == 'A']

# Plot the whole dataset in the background
plt.plot('x', 'y', data=df, linestyle='', marker='o', markersize=1.5, color="grey", alpha=0.3, label='other group')

# Add the group to study
plt.plot('x', 'y', data=df_filtered, linestyle='', marker='o', markersize=1.5, alpha=0.3, label='group A')

# Add titles and legend
plt.legend(markerscale=8)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Show a specific group', loc='left')

If the entities are grouped according to a categorical variable, coloring the dots by group can give more insight into the data.
# Plot
sns.lmplot(x="x", y="y", data=df, fit_reg=False, hue='group', legend=False, palette="Accent", scatter_kws={"alpha": 0.1, "s": 15})

# Legend
plt.legend(loc='lower right', markerscale=2)

# titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Show putative structure', loc='left')

If the entities are grouped according to a categorical variable, showing these groups can give more insight into the data. A really good way to do that is with small multiples (also called faceting), where each group gets its own panel.
# Use seaborn for easy faceting: one panel per group
g = sns.FacetGrid(df, col="group", hue="group")
g = g.map(plt.scatter, "x", "y", edgecolor="w")

Sometimes your data are not as continuous as you would like them to be. This results in the kind of figure below, where all the dots line up in vertical stripes.
# Dataset: x only takes the values 1 to 5 (1000 points each)
a = np.concatenate([
    np.random.normal(2, 4, 1000),
    np.random.normal(4, 4, 1000),
    np.random.normal(1, 2, 500),
    np.random.normal(10, 2, 500),
    np.random.normal(8, 4, 1000),
    np.random.normal(10, 4, 1000),
])
df = pd.DataFrame({'x': np.repeat(range(1, 6), 1000), 'y': a})

# plot
plt.plot('x', 'y', data=df, linestyle='', marker='o')
To avoid this situation, we can use jitter: we shift every dot a little to the right or to the left. That makes the figure readable. For example, we can now see that the points around x=3 are actually split into 2 parts!
Thanks to seaborn, this is easily done with the stripplot function!
# A scatterplot with jitter (recent seaborn versions require keyword x/y)
sns.stripplot(x=df.x, y=df.y, jitter=0.2, size=2)
plt.title('Overplotting? Use jitter when x data are not really continuous', loc='left')
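To make the idea concrete, here is what jitter amounts to when done by hand with NumPy and Matplotlib only. It is a minimal sketch on a small made-up dataset, and it assumes uniform noise of at most 0.2 on each side, which is not necessarily what seaborn does internally.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for this example: x only takes the values 1 to 5
rng = np.random.default_rng(0)
x = np.repeat(np.arange(1, 6), 200)
y = rng.normal(loc=x, scale=2.0)

# Jitter: shift each dot horizontally by a random amount within +/- width
width = 0.2
x_jittered = x + rng.uniform(-width, width, size=x.size)

fig, ax = plt.subplots()
ax.plot(x_jittered, y, linestyle='', marker='o', markersize=2)
ax.set_xlabel('Value of X')
ax.set_ylabel('Value of Y')
ax.set_title('Jitter by hand', loc='left')
```

Keeping the width well below half the gap between two x values ensures that the stripes never overlap, so every dot can still be traced back to its true x value.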

This option is not my favorite, as 3D charts are often hard to read, but it can work sometimes. Moreover, it makes an eye-catching figure. Learn how to make 3D plots with matplotlib in the dedicated section.
# libraries
from scipy.stats import gaussian_kde  # the scipy.stats.kde module is deprecated
from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3D projection on older Matplotlib

# Note: df is assumed to be the three-group dataset from the first code block;
# rebuild it first if the jitter example has overwritten it.

# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
nbins = 300
k = gaussian_kde([df.x, df.y])
xi, yi = np.mgrid[df.x.min():df.x.max():nbins*1j, df.y.min():df.y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

# Transform it into a dataframe
data = pd.DataFrame({'x': xi.flatten(), 'y': yi.flatten(), 'z': zi})

# Make the plot (fig.gca(projection='3d') no longer works on recent Matplotlib)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_trisurf(data.x, data.y, data.z, cmap=plt.cm.Spectral, linewidth=0.2)

# Adapt angle, first number is up/down, second number is right/left
ax.view_init(30, 80)

Last but not least, here is my favorite version. In addition to making a 2D density chart, you can easily show the marginal distributions of both the X and Y variables. And once more, it is just a one-liner thanks to seaborn!
# 2D density + marginal distribution:
sns.jointplot(x=df.x, y=df.y, kind='kde')