import pandas as pd
from plotnine import *
Dataset
Since a histogram displays the distribution of a numerical variable, we need a dataset that contains numerical values.
For instance, we will use the iris dataset, which is a famous dataset used in data science. It contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed.
We can easily load it using the following code:
path = 'https://raw.githubusercontent.com/holtzy/The-Python-Graph-Gallery/master/static/data/iris.csv'
df = pd.read_csv(path)
df.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
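To check which columns are suitable for a histogram, one option is to ask pandas for the numeric columns. The sketch below is only an illustration: it rebuilds the five rows from the preview by hand (the `sample` name is hypothetical) so that it runs without downloading the file, and the same `select_dtypes` call should work on the full `df`.

```python
import pandas as pd

# First five rows of the iris dataset, copied from the preview table
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# Keep only the columns with a numeric dtype: these can be passed to a histogram
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

The `species` column is text, so it is left out of the list. It is better suited to grouping or coloring than to the x-axis of a histogram.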
Simplest histogram
The ggplot() function works the following way: you start by initializing a plot with ggplot() and then you add layers to it using the + operator.
In this case, we will use the geom_histogram() function to create a histogram. We will map the sepal_length column to the x-axis.
(
ggplot(df, aes(x='sepal_length')) +
geom_histogram(bins=8)
)
Control number of bins
You can control the number of bins in a histogram by setting the bins argument inside the geom_histogram() function.
(
ggplot(df, aes(x='sepal_length')) +
geom_histogram(bins=15)
)
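To get a feel for what the number of bins means, the sketch below uses NumPy's `np.histogram` on a handful of values. This is an assumption-laden illustration of equal-width binning in general, not a description of plotnine's internals: the exact bin boundaries that plotnine picks may differ from NumPy's. The `values` array is a hypothetical sample taken from the first five rows of the dataset.

```python
import numpy as np

# Hypothetical sample: the sepal_length values of the first five rows
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Split the range of the data into 4 equal-width bins
counts, edges = np.histogram(values, bins=4)

# With n bins there are n + 1 edges, all separated by the same width
bin_width = edges[1] - edges[0]

print("edges:", edges)
print("counts:", counts)
print("bin width:", bin_width)
```

More bins means narrower bars and finer detail, while fewer bins gives a smoother but coarser picture. Every observation still falls into exactly one bin, so the counts always add up to the number of rows.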
Change color and edge color
You can change the color of the bars by setting the fill argument inside the geom_histogram() function. You can also change the color of the edges by setting the color argument.
(
ggplot(df, aes(x='sepal_length')) +
geom_histogram(bins=10, fill='lightblue', color='black')
)
Change overall appearance
To improve the style of the chart, we can change the following arguments:
- fill: to change the color of the bars
- color: to change the color of the borders of the bars
- alpha: to change the transparency of the bars
- bins: to change the number of bins
(
ggplot(df, aes(x='sepal_length')) +
geom_histogram(bins=10, fill='lightblue', color='black', alpha=0.4) +
theme_minimal()
)
Going further
This article explains how to create a histogram with plotnine.
If you want to go further, you can also learn how to create a multiple histogram with plotnine and have a look at the histogram section of the gallery.