Learn what box plots are, how to read one, the advantages and disadvantages of box plots, and how you can transform your data into this powerful visualization.
Box plots are a widely used type of data visualization. As a professional, you can use box plots to show a high-level overview of your data, compare data sets, and provide a quick visual without taking up much space. In this article, you can further explore what a box plot is, what type of data is appropriate, its advantages and disadvantages, and how to build your own.
Box plots, or box-and-whisker plots, are a visual tool used to represent the distribution of a data set. This type of graph shows key statistics of your data, including the median, quartiles, and outliers. You can use box plots to gain insight into some aspects of the frequency distribution of your data, including:
Central tendency: This measure identifies a single value that represents the center of the distribution. In the case of box plots, it is the median, indicated by a line drawn inside the box.
Spread: This is the range of the data set. In a box plot, the whiskers extend to the lowest and highest values that are not outliers, and any outliers appear as individual points beyond them. This helps you see how dispersed your data is.
Variability: This shows how clustered or not clustered your data is. If the box of your box plot is long, it shows that the values of your data are highly variable. If it is short, you can see that the data points are more clustered (less varied) around a certain value.
Because of the statistical measures represented by box plots, they are typically suited best to numerical data. This is because you are using metrics such as the median, upper and lower quartiles, and spread of the data to graph it appropriately. This type of visual representation requires the data to be naturally ordered and is less suitable for categorical data or data without a natural order.
Knowing how to read a box plot can help you gain relevant insights from it. When looking at the visual, take yourself through the following steps.
You can find several elements of the data set by examining the box at the heart of the diagram. The box represents the middle two quarters of the data, which is the middle 50 percent of the data. The length of the box is the interquartile range (IQR).
The top line of the box represents the 75th percentile of the data (Quartile 3 or Q3), meaning that 75 percent of values in the data set fall below this value. Similarly, the bottom line of the box represents the 25th percentile of the data (Quartile 1 or Q1), with 25 percent of data falling below this line.
As mentioned above, a longer box represents greater variability in your data, showing that the middle 50 percent of data are spread out. A shorter box shows that the middle 50 percent of data are close in values and have less variability.
The median represents your measure of central tendency and shows the point where 50 percent of data lies above it and 50 percent below it.
You can find the whiskers extending from the edges of the box. Each whisker extends to the smallest or largest value in your data set that lies within 1.5 times the IQR of the box edge (Q1 for the lower whisker, Q3 for the upper whisker). This shows the range of your data, excluding outliers.
Beyond the whiskers, individual data points are displayed with a dot or other marker. These outliers are values that differ significantly from the typical values in your data set. You should look carefully at your outliers to make sure they are not mistakes in your data set and represent actual, unbiased data.
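The quantities you read off a box plot can also be computed directly. The following is a minimal sketch in Python using only the standard library. The function name `box_plot_summary` and the sample values are made up for illustration, and the "inclusive" quartile method is an assumption: spreadsheet and statistics packages use several quartile conventions, so your tool may report slightly different numbers.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the quantities a box plot displays for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile convention. Other tools may use
    # a different convention and return slightly different quartiles.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # length of the box
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are still within bounds.
    inside = [x for x in data if lower_bound <= x <= upper_bound]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < lower_bound or x > upper_bound],
    }

sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]  # made-up values for illustration
summary = box_plot_summary(sample)
print(summary)
```

For this sample, the box runs from 4 to 9 with the median at 7, the whiskers reach 2 and 11, and the value 30 falls beyond the upper bound of 16.5, so it is flagged as an outlier.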
When choosing to use a box plot, be aware of the pros and cons. Depending on your data type and needs, different advantages or disadvantages might be more important to you.
Easy comparison between data sets: Box plots allow you to visualize numerical data sets side by side to see how they differ in centrality, distribution, and variability.
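Able to visualize skew: By examining where the quartiles and median fall, along with the whiskers, you can see if your data set is skewed. For example, if the median sits closer to Q1 and the upper whisker is longer than the lower one, the data is skewed toward higher values (right-skewed).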
Can represent large data sets: Because only certain measures of the data set are represented in a box plot (e.g., median, quartiles), you can represent large data sets simply. This can give a high-level overview to a general audience.
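If you want a number to go with the skew you see in a box plot, one option is the quartile (Bowley) skewness, which compares how far the median sits from each edge of the box. The sketch below is only an illustration: the function name `quartile_skew`, the two sample data sets, and the choice of the "inclusive" quartile method are assumptions.

```python
from statistics import quantiles

def quartile_skew(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    # Assumes Q3 > Q1; a zero-length box would make the ratio undefined.
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Two made-up data sets to compare side by side.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 8, 12, 20]

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name, round(quartile_skew(data), 2))
```

A value near 0 means the median sits in the middle of the box. A positive value means the median is closer to Q1, which matches a right-skewed box plot, and a negative value means it is closer to Q3.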
Simple overview of data: You can’t tell finer details about the data, such as if you have multiple clusters in your distribution.
Not appropriate for all data sets: If you have data that is not numerical, has limited data points, or only represents a small range of values, a box plot may not be the right choice.
May be limited with certain software: Certain software packages might naturally exclude outliers or otherwise misrepresent the data if your data has unusual data points. In this case, you might miss certain aspects of your data.
Building your own box plot involves several steps, including calculations and data visualization. To create a box plot, follow these steps:
Gather your data. Ensure your data set is complete and has enough data points over a numerical range to be effectively represented. Sorting your data in ascending order makes it easier to divide it into quarters.
Calculate your key box statistics. You will want to calculate your median, Q1, Q3, and IQR.
Calculate your key whisker statistics. Determine the lower and upper bounds for potential outliers using the IQR. The lower bound equals Q1 - 1.5 * IQR, while the upper bound equals Q3 + 1.5 * IQR.
Identify your outliers. After calculating your whisker statistics, data points outside this range are typically classified as outliers.
Create your box plot. You can do this by hand or in software like R or Excel.
1. Draw a number line (vertical or horizontal) for your axis.
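2. Draw a box with Q1 as the bottom and Q3 as the top, and draw a line across the box at the median.
3. Draw your whiskers from the box edges to the smallest and largest values that fall within the lower and upper bounds.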
4. Plot any potential outliers as individual data points beyond the whiskers.
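To make the drawing steps concrete, here is a small sketch that renders a horizontal box plot as text. It covers only the drawing sub-steps and assumes you have already calculated the statistics. The function `draw_box_plot`, its parameters, and the symbols it uses are hypothetical choices for illustration, not the output format of R, Excel, or any other tool.

```python
def draw_box_plot(whisker_low, q1, median, q3, whisker_high, outliers=(), width=57):
    """Render a horizontal box plot as two lines of text: plot and axis."""
    # The axis must cover the whiskers and every outlier.
    # Assumes the largest value is greater than the smallest one.
    lo = min([whisker_low, *outliers])
    hi = max([whisker_high, *outliers])

    def col(value):
        # Map a data value to a character column on the number line.
        return round((value - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    for c in range(col(whisker_low), col(whisker_high) + 1):
        row[c] = "-"  # whiskers
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="  # box spanning Q1 to Q3
    row[col(q1)], row[col(q3)] = "[", "]"  # box edges
    row[col(median)] = "|"  # median line inside the box
    for value in outliers:
        row[col(value)] = "o"  # outliers as individual points

    # Simple number line: label only the two ends of the axis.
    hi_label = str(hi)
    axis = str(lo).ljust(width - len(hi_label)) + hi_label
    return "".join(row) + "\n" + axis

# Statistics for the made-up sample 2, 4, 4, 5, 7, 8, 9, 11, 30.
plot = draw_box_plot(2, 4, 7, 9, 11, outliers=[30])
print(plot)
```

In the printed result, the brackets mark Q1 and Q3, the vertical bar marks the median, the dashes are the whiskers, and the lone `o` at the far right is the outlier.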
You can continue building your data visualization skills on Coursera. As a beginner, consider broad overview classes offered by top universities and organizations, such as Data Visualization With Advanced Excel or Data Visualization and Communication With Tableau.