What Are Outliers in Data Science?

Written by Coursera Staff • Updated on Oct 2, 2024

Outliers are data points that lie abnormally far from the rest of the values in a data set. Discover several methods that you, as a statistician or data analyst, might use to determine whether a value is an outlier.

[Featured Image] A smiling data scientist analyzes data featuring outliers at her desk and has graphs displayed on her computer monitors.

As data science continues to expand, understanding the concept of outliers is critical for accurate data analysis and interpretation. Outliers may indicate that a data point is incorrect, or they may skew some of the findings from your data when not handled correctly. To dive further into the basics of outliers, explore what outliers are, the role they play in data analytics, methods that you can use to define outliers, and how to deal with outliers once you identify them.

What are outliers?

Outliers are data points that lie outside the majority of the data in a particular data set. These values might be much higher or lower than other points and may impact the results of the data analysis in ways that misrepresent the data sample. By learning how to identify and handle outliers, data analysts can improve the validity and reliability of their results.

The role of outliers in data analytics

Outliers play an important role in data analytics, and that role varies with where the outliers come from and how they affect the analysis. For example, in some fields, outliers may provide insight into rare occurrences, indicating the need for further analysis. In the health care industry, an outlier data point may represent someone with an abnormal set of symptoms or recovery pattern. This could indicate that you should explore further, such as looking at patients with similar characteristics to see potential outcomes.

In other cases, outliers may represent sources of errors. Measurement inaccuracies, typos, or other factors may introduce noise into the data set that does not represent the actual data. The presence of outliers in data sets may also signal low data quality and introduce bias into your analysis. If there were systematic errors during data collection, you would have to make an informed decision on how best to proceed.

How to find outliers

You can find outliers in data through several detection methods. Which ones you choose depends on your role and the purpose of outlier detection. Some of the methods include:

Sorting data

When you sort your data into ascending or descending order, it may become apparent that certain data points are much higher or lower than others. For example, if you had the data set:

1, 1, 3, 4, 5, 5, 102

You would likely determine that 102 is an outlier. You would then examine that data point more closely to identify where it came from.
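The sorting approach can be sketched in a few lines of Python. This is a minimal illustration, not a formal test: it sorts the example values (given here in a shuffled order, which is an assumption) and reports the largest gap between neighboring values, which is one simple way to see where a value breaks away from the rest.

```python
# The example values from above, in an arbitrary (assumed) starting order
data = [5, 1, 102, 4, 1, 5, 3]

ordered = sorted(data)

# Difference between each value and the one before it in sorted order
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

largest_gap = max(gaps)
idx = gaps.index(largest_gap)

print(ordered)
print(f"Largest gap: {largest_gap}, between {ordered[idx]} and {ordered[idx + 1]}")
```

A large gap only suggests where to look. Whether the value beyond it is truly an outlier still depends on what you know about the data.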

Data visualization

Another way to determine whether you have outliers in your data set is to visualize the data. You can do this by graphing your data set. You can choose any graphical representation that suits you, but scatter plots and histograms are two common choices to identify outliers.

Histograms display data in “bins” that represent segments of the data. The height of each bin shows how many data points have a specific value or fall within a range of values. This can show you when a data point is far out of range. For example, if you have tall bins between the values of 10 and 30 and then a short bin at a value of 200, you might look more closely at the 200 value.
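As a rough illustration of how binning exposes an isolated value, the sketch below counts values per bin and prints a text histogram. The data values, the bin width of 10, and the helper name bin_counts are all hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical values: most fall between 10 and 30, one sits at 200
values = [12, 15, 18, 21, 22, 24, 25, 27, 29, 200]
bin_width = 10  # assumed bin width


def bin_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's start."""
    return Counter((v // bin_width) * bin_width for v in values)


counts = bin_counts(values, bin_width)

# Print one row per non-empty bin; empty bins are skipped in this sketch
for start in sorted(counts):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * counts[start]}")
```

The single short bar far from the others is the visual cue described above. A plotting library would show the same pattern with the empty bins in between drawn as blank space.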

Scatter plots plot values on a standard graph with an x-axis and a y-axis. Most points tend to group together in a cluster, so a point that sits far away from the cluster stands out as a possible outlier.

Interquartile range

Assessing the interquartile range (IQR) of a data set is another way to detect outliers. You calculate the IQR by subtracting the first quartile (Q1) value from the third quartile (Q3) value. You can visualize this through boxplots, which you draw by creating a box along a y-axis. The bottom of the box is the value of the first quartile, and the top of the box is the value of the third quartile of the data.

In the data set, 25 percent of the values will fall below the first quartile (Q1), and 75 percent will fall below the third quartile (Q3). Outliers are often defined as values that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
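The IQR rule above can be sketched with Python's standard statistics module. This example reuses the seven values from the sorting section. Quartile conventions differ between tools, so the exact Q1 and Q3 values depend on the method used (here, the default of statistics.quantiles), and the variable names are illustrative.

```python
import statistics

data = [1, 1, 3, 4, 5, 5, 102]

# quantiles(n=4) returns the three cut points Q1, Q2 (the median), and Q3
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences from the rule: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

Because the rule is built from quartiles rather than the mean, the extreme value itself has little effect on where the fences land.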

Z-score

For data that follows a normal distribution, Z-scores can be one way to find how far away a data point is from the mean of the data set. A normal distribution indicates that the data follows a bell-shaped curve. The Z-score is the number of standard deviations (a measure of spread) that a point lies away from the mean. In most cases, a Z-score above 3 or below -3 indicates an outlier. Before choosing this method as your form of outlier detection, it’s important that you test to ensure that your data follows a normal distribution. When your data follows a normal distribution, about 68 percent of the data points will lie within 1 standard deviation of the mean, about 95 percent will lie within 2 standard deviations, and about 99.7 percent will lie within 3, which is why values beyond 3 standard deviations are treated as unusual.
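A minimal sketch of the Z-score check follows, using only the standard library. The data values are hypothetical (20 readings near 50 plus one reading of 120), the threshold of 3 follows the rule of thumb above, and the sample standard deviation is one of two reasonable choices (the population version, statistics.pstdev, is the other).

```python
import statistics

# Hypothetical readings: 20 values close to 50 and one far away at 120
values = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
          51, 52, 48, 50, 51, 49, 50, 52, 48, 50, 120]

mean = statistics.mean(values)
stdev = statistics.stdev(values)  # sample standard deviation

threshold = 3  # rule-of-thumb cutoff from the text

# Z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / stdev for x in values]

outliers = [x for x, z in zip(values, z_scores) if abs(z) > threshold]

print(f"mean={mean:.2f}, stdev={stdev:.2f}")
print(f"Outliers: {outliers}")
```

One caution: the outlier itself inflates the mean and standard deviation, so in very small samples no point can reach a Z-score of 3. For the seven-value example in the sorting section, the Z-score of 102 works out to roughly 2.3, so this method would not flag it even though the IQR rule does.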

How to deal with outliers

After you identify outliers in your data set, the next step will be to determine how best to deal with these outliers. To do this, you can consider several options:

Remove or correct outliers: If you find that the outliers are from measurement errors, you may benefit from removing them from the data set or correcting them if possible. However, you should do this carefully to prevent bias or sample misrepresentation.

Apply data transformations: Logarithmic, square root, or inverse transformations can help reduce outliers' influence on the analysis. Transformations such as these often stabilize the variances of the data and make them more suitable for certain statistical tests.

Use robust statistical methods: Using methods for your analysis that are less sensitive to outliers, like choosing the median of your data set instead of the mean, can lead to more reliable results without the need to remove outliers.
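To make these options concrete, the sketch below compares the mean and median of the seven-value example with and without the value 102, then applies a base-10 logarithmic transformation. Treating 102 as a confirmed error that can be removed is an assumption made only for illustration.

```python
import math
import statistics

data = [1, 1, 3, 4, 5, 5, 102]

# Option 1: remove the outlier (assumes 102 was confirmed as an error)
trimmed = [x for x in data if x != 102]

# Option 3: compare a robust statistic (median) with a sensitive one (mean)
mean_with = statistics.mean(data)
mean_without = statistics.mean(trimmed)
median_with = statistics.median(data)
median_without = statistics.median(trimmed)

print(f"Mean:   {mean_with:.2f} with outlier, {mean_without:.2f} without")
print(f"Median: {median_with} with outlier, {median_without} without")

# Option 2: a log transformation compresses large values (inputs must be > 0)
logged = [math.log10(x) for x in data]
print(f"Largest value after log10: {max(logged):.2f}")
```

The mean shifts substantially when the outlier is dropped while the median barely moves, which is the sense in which the median is less sensitive to outliers. After the transformation, 102 maps to about 2, much closer to the other values.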

Learn more about data science with Coursera.

Identifying and managing outliers in your data set can help you accurately analyze information without introducing unnecessary bias. You can choose between several outlier detection methods, including visual and mathematical representations. Once you identify your outliers, you can choose to remove them, correct them, or transform them, depending on the nature of your outlier.

To continue building your statistical skills, consider taking online courses on Coursera. To learn the basics of outliers and data analysis, you can take the Introduction to Data Analysis beginner course by IBM or complete the Google Data Analytics Professional Certificate.

This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.