#134 How to avoid overplotting with Python

Overplotting is one of the most common problems in dataviz. When your dataset is big, the dots of your scatterplot tend to overlap, and your graphic becomes unreadable.

The scatterplot produced by the Matplotlib code below illustrates the problem. At first glance, you might conclude that there is no relationship between X and Y. We will see below how wrong that is.

In this post, I propose 10 charts that help to avoid overplotting. Of course, reproducible code is provided for each of them.

Code: a scatterplot with overplotting



# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
# Use the default seaborn theme for all figures
sns.set_theme()

# Dataset: three groups (A, B, C) of 20,000 points each
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 20000), 'y': np.random.normal(10, 1.2, 20000), 'group': np.repeat('A', 20000)})
tmp1 = pd.DataFrame({'x': np.random.normal(14.5, 1.2, 20000), 'y': np.random.normal(14.5, 1.2, 20000), 'group': np.repeat('B', 20000)})
tmp2 = pd.DataFrame({'x': np.random.normal(9.5, 1.5, 20000), 'y': np.random.normal(15.5, 1.5, 20000), 'group': np.repeat('C', 20000)})
# Stack the three groups into a single data frame
df = pd.concat([df, tmp1, tmp2], ignore_index=True)

# plot
plt.plot('x', 'y', data=df, linestyle='', marker='o')
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting looks like that:', loc='left')

Let’s see how to avoid it


  •  

    The first option is very easy to apply, yet really effective: just reduce the size of the dots! In this case it works very well. See: our messy scatterplot was actually composed of three groups of data!

    Please note that, as with many of the methods in this post, the drawback is that outliers become really hard to detect.
    
    # Plot with small marker size
    plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=0.7)
    plt.xlabel('Value of X')
    plt.ylabel('Value of Y')
    plt.title('Overplotting? Try to reduce the dot size', loc='left')
    
    
  •  

    Transparency lets you see, indirectly, the density of dots in each part of the scatterplot: the more dots overlap, the darker the area. Once more, the 3 groups are now obvious.
    
    # Plot with transparency
    plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=3, alpha=0.05, color="purple")
    
    # Titles
    plt.xlabel('Value of X')
    plt.ylabel('Value of Y')
    plt.title('Overplotting? Try to use transparency', loc='left')
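
To see why transparency reveals density, here is a small sketch of how opacity builds up when dots overlap. It assumes simple "over" compositing of identically coloured dots, in which case n stacked dots of opacity alpha give a combined opacity of 1 - (1 - alpha)^n. The helper name `stacked_opacity` is made up for this illustration.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n overlapping dots, each drawn with opacity alpha.

    Assumes plain 'over' compositing of dots sharing the same colour.
    """
    return 1 - (1 - alpha) ** n


# With alpha=0.05 (the value used in the plot above), compare sparse and dense areas
for n in (1, 10, 50, 100):
    print(n, "overlapping dots ->", round(stacked_opacity(0.05, n), 3))
```

With alpha=0.05, a lone dot is almost invisible while ten overlapping dots already reach roughly 40% opacity, which is why the dense cores of the three groups stand out.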
    
    
  •  

    One of my favorite options is to draw a 2D density chart. Note that several variations of this solution exist and are presented in a dedicated section, so visit it for more explanation!

    In addition to being one of the best options in this post, it is a one-liner thanks to the seaborn library!
    
    # 2D density plot: 
    sns.kdeplot(data=df, x="x", y="y", cmap="Reds", fill=True)
    plt.title('Overplotting? Try 2D density graph', loc='left')
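
As an illustration of what such variations can look like, the sketch below bins the points instead of smoothing them, using Matplotlib's `hexbin` and `hist2d`. The data are regenerated with a smaller, seeded sample so the block runs on its own; the number of bins and the colormap are arbitrary choices, not recommendations from the original post.

```python
import numpy as np
import matplotlib.pyplot as plt

# Smaller stand-in for the three-group dataset used in this post (seeded for reproducibility)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1.2, 5000), rng.normal(14.5, 1.2, 5000), rng.normal(9.5, 1.5, 5000)])
y = np.concatenate([rng.normal(10, 1.2, 5000), rng.normal(14.5, 1.2, 5000), rng.normal(15.5, 1.5, 5000)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Hexagonal binning: colour encodes the number of points falling in each hexagon
hb = ax1.hexbin(x, y, gridsize=40, cmap="Reds")
ax1.set_title("Hexagonal bins", loc="left")
fig.colorbar(hb, ax=ax1, label="count")

# 2D histogram: same idea with square bins
counts, xedges, yedges, image = ax2.hist2d(x, y, bins=40, cmap="Reds")
ax2.set_title("2D histogram", loc="left")
fig.colorbar(image, ax=ax2, label="count")
```

Binned charts are usually much faster to compute than a kernel density estimate on large datasets, at the price of a blockier look.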
    
    
  •  

    A simple but effective option is to plot only a random subset of your data frame. It works, but it is probably not the best option: if you are unlucky, you could miss an interesting pattern.

    Thanks to pandas, this is really easy to do:
    
    # Sample 1000 random rows
    df_sample=df.sample(1000)
    
    # Make the plot with this subset
    plt.plot( 'x', 'y', data=df_sample, linestyle='', marker='o')
    
    # titles
    plt.xlabel('Value of X')
    plt.ylabel('Value of Y')
    plt.title('Overplotting? Sample your data', loc='left')
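
One way to lower the risk of missing a pattern, assuming a categorical column such as `group` is available, is to draw the same number of rows from every category and to fix the random seed so the figure can be reproduced. The sketch below relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later); the data frame `demo`, the sample sizes and the seeds are made up for the example.

```python
import numpy as np
import pandas as pd

# Small stand-in data frame with three equally sized groups
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "x": rng.normal(10, 1.2, 9000),
    "y": rng.normal(10, 1.2, 9000),
    "group": np.repeat(["A", "B", "C"], 3000),
})

# Plain random sample: group sizes vary from one draw to the next,
# but random_state makes a given draw reproducible
simple_sample = demo.sample(n=900, random_state=1)

# Stratified sample: exactly 300 rows from each group
stratified_sample = demo.groupby("group").sample(n=300, random_state=1)

print(simple_sample["group"].value_counts())
print(stratified_sample["group"].value_counts())
```

If the groups have very different sizes, passing `frac=` instead of `n=` keeps their relative proportions in the sample.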
    
    
  •  

    If the entities are grouped by a categorical variable, it is a good idea to visualise each group one by one. Personally, I like to keep the whole dataset in the background to be able to compare:
    
    # Keep only the rows belonging to group A
    df_filtered = df[ df['group'] == 'A']
    # Plot the whole dataset
    plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=1.5, color="grey", alpha=0.3, label='other group')
    
    # Add the group to study
    plt.plot( 'x', 'y', data=df_filtered, linestyle='', marker='o', markersize=1.5, alpha=0.3, label='group A')
    
    # Add titles and legend
    plt.legend(markerscale=8)
    plt.xlabel('Value of X')
    plt.ylabel('Value of Y')
    plt.title('Overplotting? Show a specific group', loc='left')
    
    
  •  

    If the entities are grouped by a categorical variable, showing these groups (here with one color per group) can give more insight into the data.
    
    # Plot
    sns.lmplot( x="x", y="y", data=df, fit_reg=False, hue='group', legend=False, palette="Accent", scatter_kws={"alpha":0.1,"s":15} )
    
    # Legend
    plt.legend(loc='lower right', markerscale=2)
    
    # titles
    plt.xlabel('Value of X')
    plt.ylabel('Value of Y')
    plt.title('Overplotting? Show putative structure', loc='left')
    
    
  •  

    When the entities are grouped by a categorical variable, another really good way to show these groups is to use small multiples (faceting), with one panel per group.

    
    # Use seaborn for easy faceting
    g = sns.FacetGrid(df, col="group", hue="group")
    g = (g.map(plt.scatter, "x", "y", edgecolor="w"))
    
    
  •  

    Sometimes your data are not as continuous as you would like them to be. This results in the kind of figure below, where all the dots are aligned on a few values of X.

    # Dataset:
    a=np.concatenate([np.random.normal(2, 4, 1000), np.random.normal(4, 4, 1000), np.random.normal(1, 2, 500), np.random.normal(10, 2, 500), np.random.normal(8, 4, 1000), np.random.normal(10, 4, 1000)])
    df=pd.DataFrame({'x': np.repeat( range(1,6), 1000), 'y': a })
    
    # plot
    plt.plot( 'x', 'y', data=df, linestyle='', marker='o')
    

     

     

     

     

    To avoid this situation, we can use jitter: every dot is shifted slightly to the right or to the left. That makes the figure readable. For example, we can now see that the points around x=3 are actually split into 2 parts!

    Thanks to seaborn, this is easily done with the stripplot function!

    # A scatterplot with jitter
    sns.stripplot(data=df, x="x", y="y", jitter=0.2, size=2)
    plt.title('Overplotting? Use jitter when x data are not really continuous', loc='left')
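
The idea behind jitter can also be sketched by hand: add a small random offset to the discrete variable before plotting. The block below does this with NumPy and plain Matplotlib, assuming a uniform offset of at most 0.2 on either side; seaborn's own implementation may differ in its details, and the data are a small made-up stand-in for the dataset above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: x only takes the values 1 to 5, y is continuous
rng = np.random.default_rng(0)
x = np.repeat(np.arange(1, 6), 200)
y = rng.normal(5, 3, x.size)

# Shift every dot horizontally by a random amount in [-0.2, 0.2]
jitter_width = 0.2
x_jittered = x + rng.uniform(-jitter_width, jitter_width, x.size)

# Plot the jittered positions, but keep the ticks on the true x values
plt.plot(x_jittered, y, linestyle='', marker='o', markersize=2)
plt.xticks(np.arange(1, 6))
plt.title('Manual jitter with NumPy', loc='left')
```

Only the plotted positions are shifted; the original `x` values stay untouched for any later computation.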
    
    
  •  

    This option is not my favorite, as 3D charts are often hard to read, but it can work sometimes. Moreover, it makes an eye-catching figure. Learn how to make 3D plots with matplotlib in the dedicated section.
    
    # libraries
    from scipy.stats import gaussian_kde
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
    
    # Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
    # (df is the data frame currently in memory; the jitter example above redefines it)
    nbins=300
    k = gaussian_kde([df.x,df.y])
    xi, yi = np.mgrid[ df.x.min():df.x.max():nbins*1j, df.y.min():df.y.max():nbins*1j]
    zi = k(np.vstack([xi.flatten(), yi.flatten()]))
    
    # Store the grid coordinates and the estimated density in a data frame
    data=pd.DataFrame({'x': xi.flatten(), 'y': yi.flatten(), 'z': zi })
    
    # Make the plot on a 3D axis
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.plot_trisurf(data.x, data.y, data.z, cmap=plt.cm.Spectral, linewidth=0.2)
    # Adapt angle, first number is up/down, second number is right/left
    ax.view_init(30, 80)
    
  •  

    Last but not least, here is my favorite version. In addition to drawing a 2D density chart, you can easily show the marginal distributions of both the X and Y variables. And once more, it is just a one-liner thanks to seaborn!
    
    # 2D density + marginal distribution:
    sns.jointplot(x=df.x, y=df.y, kind='kde')
    

 

 
