Matplotlib Histogram – How to Visualize Distributions in Python
A Matplotlib histogram visualizes the frequency distribution of a numeric array by splitting its range into small equal-sized bins. In this article, we explore practical techniques that are useful in initial data analysis and plotting, such as histogram facets, density plots, and plotting multiple histograms in the same plot.
Contents
What is a histogram?
How to plot a basic histogram in Python?
Histogram grouped by categories in same plot
Histogram grouped by categories in separate subplots
Seaborn Histogram and Density Curve on the same plot
Histogram and Density Curve in Facets
Difference between a Histogram and a Bar Chart
Practice Exercise
Conclusion
1. What is a Histogram?
A histogram is a plot of the frequency distribution of a numeric array, obtained by splitting its range into small equal-sized bins and counting the values that fall into each bin.
If you want to mathematically split a given array into bins and frequencies, use NumPy's np.histogram() function and pretty-print the result as shown below.
python
import numpy as np
x = np.random.randint(low=0, high=100, size=100)
# Compute the frequency of each bin and the bin edges (10 bins means 11 edges)
frequency, bins = np.histogram(x, bins=10, range=[0, 100])
# Pretty print: upper edge of each bin followed by one '*' per value in it
for b, f in zip(bins[1:], frequency):
    print(round(b, 1), ' '.join(np.repeat('*', f)))
The above representation, however, is not practical for large arrays. For those, use a matplotlib histogram.
2. How to plot a basic histogram in Python?
The pyplot.hist() function in matplotlib lets you draw a histogram. The array is the only required input, and you can optionally specify the number of bins.
python
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline
plt.rcParams.update({'figure.figsize':(7,5), 'figure.dpi':100})
# Plot Histogram on x
x = np.random.normal(size = 1000)
plt.hist(x, bins=50)
plt.gca().set(title='Frequency Histogram', ylabel='Frequency');
Histogram
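It can also help to know that plt.hist() returns the numbers it draws. The sketch below is only an illustration: the fixed random seed and the variable names counts, edges and patches are assumptions made here. It checks that the returned values agree with np.histogram().

```python
import numpy as np
import matplotlib.pyplot as plt

# Fixed seed so the sketch is reproducible (an assumption; the article's data is unseeded)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# plt.hist returns the bar heights, the bin edges, and the drawn bar objects
counts, edges, patches = plt.hist(x, bins=50)

# The same counts and edges can be computed without plotting
np_counts, np_edges = np.histogram(x, bins=50)

print(len(counts), len(edges))             # 50 bins are bounded by 51 edges
print(np.allclose(counts, np_counts))      # compare plotted heights with NumPy's counts
print(np.allclose(edges, np_edges))        # compare the bin edges
```

Because the counts are returned, you can inspect or reuse them (for example, to find the tallest bin) without recomputing the histogram.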
3. Histogram grouped by categories in same plot
You can plot multiple histograms in the same plot. This can be useful if you want to compare the distribution of a continuous variable grouped by different categories.
Let’s use the diamonds dataset from R’s ggplot2 package.
python
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/diamonds.csv')
df.head()
Diamonds Table
Let’s compare the distribution of diamond depth for 3 different values of diamond cut in the same plot.
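A minimal sketch of how such an overlay could be drawn follows. The helper name plot_depth_by_cut, the colors, and the alpha and bins values are assumptions chosen for illustration. So that the sketch runs without a network connection, it uses a small made-up stand-in DataFrame with the same cut and depth column names, not the real diamonds values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_depth_by_cut(data, cuts, colors, **hist_kwargs):
    """Overlay one histogram of 'depth' per value of 'cut' on a single Axes."""
    fig, ax = plt.subplots()
    for cut, color in zip(cuts, colors):
        depth = data.loc[data["cut"] == cut, "depth"]
        ax.hist(depth, color=color, label=cut, **hist_kwargs)
    ax.set(title="Frequency Histogram of Diamond Depths", ylabel="Frequency")
    ax.legend()
    return ax


# Stand-in data (made-up numbers, NOT the real diamonds dataset).
# The group sizes are deliberately unequal so one class dominates.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cut": ["Ideal"] * 2000 + ["Fair"] * 200 + ["Good"] * 500,
    "depth": np.concatenate([
        rng.normal(61.7, 0.7, 2000),
        rng.normal(64.0, 3.0, 200),
        rng.normal(62.4, 2.0, 500),
    ]),
})

# alpha < 1 keeps overlapping bars visible through each other
ax = plot_depth_by_cut(demo, ["Ideal", "Fair", "Good"], ["g", "b", "r"],
                       alpha=0.5, bins=100)
```

With the diamonds data loaded into df as above, the equivalent call would be plot_depth_by_cut(df, ["Ideal", "Fair", "Good"], ["g", "b", "r"], alpha=0.5, bins=100).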
Well, the distributions for the 3 different cuts are distinctly different. But since the Ideal cut has many more data points, it dominates the plot.
So how do you keep the dominant class from overwhelming the plot while still keeping the distributions separate?
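You can normalize it by setting density=True (the examples here also pass stacked=True). With density=True, the bar heights are rescaled so that the total area under each distribution becomes 1, which makes groups of different sizes directly comparable.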
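The sketch below illustrates the normalization on two made-up groups of very different sizes. The group names, sizes, and distribution parameters are assumptions chosen for illustration, not values from the diamonds data. It also checks numerically that the area under each histogram is 1.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in groups of very different sizes (made-up numbers, not the diamonds data)
groups = {
    "large group": rng.normal(61.7, 0.7, 5000),
    "small group": rng.normal(64.0, 3.0, 300),
}

fig, ax = plt.subplots()
areas = {}
for label, values in groups.items():
    # density=True rescales bar heights so that sum(height * bin width) equals 1
    heights, edges, _ = ax.hist(values, bins=100, density=True, alpha=0.5, label=label)
    areas[label] = float(np.sum(heights * np.diff(edges)))

ax.set(title="Probability Histogram", ylabel="Density")
ax.legend()
print(areas)  # each area is 1 up to floating-point rounding
```

Note that the y-axis now shows density rather than counts, so a bar's height depends on the bin width as well as on the number of values in the bin.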
4. Histogram grouped by categories in separate subplots
The histograms can be created as facets using plt.subplots().
Below I draw one histogram of diamond depth for each category of diamond cut. It’s convenient to do it in a for-loop.
python
import pandas as pd
import matplotlib.pyplot as plt
# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/diamonds.csv')
# Plot: one subplot per cut, with shared axes so the panels are comparable
fig, axes = plt.subplots(1, 5, figsize=(10,2.5), dpi=100, sharex=True, sharey=True)
colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:pink', 'tab:olive']
for i, (ax, cut) in enumerate(zip(axes.flatten(), df.cut.unique())):
    x = df.loc[df.cut==cut, 'depth']
    ax.hist(x, alpha=0.5, bins=100, density=True, stacked=True, label=str(cut), color=colors[i])
    ax.set_title(cut)
plt.suptitle('Probability Histogram of Diamond Depths', y=1.05, size=16)
# The axes are shared, so setting limits on the last subplot applies to all of them
ax.set_xlim(50, 70); ax.set_ylim(0, 1);
plt.tight_layout();
Histograms Facets
5. Seaborn Histogram and Density Curve on the same plot
If you wish to have both the histogram and the density curve in the same plot, the seaborn package (imported as sns) lets you do that via distplot(). Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(). Since seaborn is built on top of matplotlib, you can call sns and plt functions one after the other.