What Is Clustering?

Written by Coursera Staff

Professionals across industries use cluster analysis to explore data and inform decision-making. Learn more about different types of clustering, why cluster analysis is important, and techniques to visualize your clusters.

[Featured Image] A data scientist sits at a desk with a laptop and relies on clustering to gain insights.

Clustering is a data analysis technique that organizes data into groups, called clusters, based on similar features. Data points within a cluster are more similar to one another than to points in other clusters, which reveals natural groupings within the data. You can cluster on different types of attributes, such as color, size, or type. Cluster analysis is a form of unsupervised learning, meaning it doesn’t rely on predefined categories or labels and instead discovers inherent groups in the data.

In this article, we will explore what clustering is, why it is important, how you might use this method, and examples of different types of cluster analyses. 

What is data mining?

Data mining involves finding patterns, trends, and information in large volumes of data. It uses algorithms and statistical methods to uncover relationships and insights within data that may not be immediately obvious. Cluster analysis is one type of data mining technique, used to uncover data characteristics through a natural grouping of the information.

Why is clustering important?

Clustering is important in data analysis for several reasons, including identifying patterns and structures within large data sets that may not be immediately obvious. By organizing data into clusters, analysts can more easily interpret and understand the data, leading to more informed decision-making.

Clustering is important across various fields and applications, helping professionals explore their data and identify directions for further analysis. For example, in business and marketing, companies can segment their customers into groups they can use for targeted marketing strategies, helping them optimize resources and enhance customer satisfaction. By clustering customers based on purchasing behavior, a business can decide how to market to each group most effectively.

What are examples of clustering in different industries?

Professionals use clustering methods in a wide variety of industries to group data and inform decision-making. Some ways you might see clustering applied include the following:

  • Business: Companies use clustering for customer segmentation, which means grouping customers based on their behavior and characteristics.

  • Machine learning: Clustering can organize large data sets and improve model performance.

  • Ecology: Clustering can classify plants or animals based on genetic or physical characteristics, aiding in biodiversity studies and conservation efforts.

  • Social networking: Clustering helps identify communities within social networks by looking at characteristics and relationships.

  • Investment: Clustering can group stocks with similar price behavior, which can inform the analysis of price trends and the design of investment algorithms.

  • Finance: Financial institutions cluster transactions to detect fraudulent activity that common detection methods might miss.

  • Climate analysis: Cluster analysis can identify weather trends and patterns, informing scientists on metrics such as atmospheric pressure.

  • Resource allocation: Companies can use cluster analysis to identify areas that require more attention, such as needing more personnel or certain types of resources.

Benefits and drawbacks of using clustering algorithms

Choosing cluster analyses for your data can offer many benefits. Some advantages you might experience include:

  • Improved understanding of your data

  • No need for predefined labels or previous knowledge of data features

  • Several methods suited for different applications

  • Informed decision-making

  • Diverse applications across various industries

When considering advantages, it’s also important to consider disadvantages. Limitations to be aware of include:

  • Not designed to make predictions, since clustering describes groupings rather than forecasting an outcome

  • Difficulty with clusters of different sizes and densities with some methods

  • Sensitive to outliers

Different types of clustering

When choosing a clustering algorithm, understanding how each method works can help you decide which is right for your data. While methods differ, each algorithm has the same goal: to organize data into groups of similar items.

What is hierarchical clustering?

Hierarchical clustering builds a tree of nested groups using either a top-down approach, known as divisive clustering, or a bottom-up approach, known as agglomerative clustering. In divisive clustering, the top of the tree (the root) contains all of the data. The data then branches into large subgroups, which branch into smaller subgroups, and so on. For example, you might start with a group of animals, split it into mammals, reptiles, and so on, and then split each of those into species. Agglomerative clustering reverses the process. Every animal starts as its own cluster, the most similar clusters merge into small groups, and these groups keep merging into larger ones until all of the data is in one group.
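
To make the bottom-up (agglomerative) process more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function name agglomerative, the sample points, and the choice of single linkage (measuring the distance between two clusters by their closest pair of members) are all assumptions made for this example. Other linkage rules exist and can produce different groupings.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering sketch using single linkage (an assumed choice)."""
    # Every point starts as its own cluster, stored as a list of point indices.
    clusters = [[i] for i in range(len(points))]
    merges = []  # Record of (cluster_a, cluster_b, distance) for each merge.

    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)

        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

    return clusters, merges


# Made-up sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)
```

Stopping at a chosen number of clusters is one way to cut the tree. Letting the loop run until one cluster remains would record the full merge history, which is the information a dendrogram displays.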

What is K-means clustering?

K-means clustering groups data into a predefined number of clusters, k. The algorithm starts by placing k centroids, each representing the center of one cluster. It then assigns every data point to its nearest centroid and moves each centroid to the mean of the points assigned to it. These two steps repeat until the assignments stop changing. At that point, the total distance between the data points and their nearest centroids has reached a minimum (though not necessarily the smallest possible one), so the clusters represent the underlying organization of the data. In standard k-means, each data point belongs to only one cluster. However, “fuzzy” k-means algorithms (also called fuzzy c-means) allow a data point to belong to several clusters.
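
The assign-then-update loop behind k-means can be sketched in a few lines of plain Python. This is a minimal sketch, not a production implementation: the function name kmeans, the sample points, and the choice to pass the starting centroids in by hand (rather than picking them at random) are assumptions made so the example is easy to follow and repeatable.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Minimal k-means sketch: assign points, update centroids, repeat."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return labels, centroids


# Made-up sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Because the result depends on where the centroids start, practical implementations often run the algorithm several times from different starting positions and keep the best result.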

What is mixture-based clustering?

Mixture-based clustering models, such as Gaussian mixture models (GMMs), assume the data comes from a mix of several probability distributions. Each cluster corresponds to one distribution, and each data point is grouped according to how likely it is to belong to each distribution. This method is flexible and can accommodate clusters of different sizes and shapes.
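
As an illustration of grouping by likelihood, the sketch below fits a two-component Gaussian mixture to one-dimensional data using the expectation-maximization (EM) procedure, a common way to fit such models. Everything specific here is an assumption for the example: the function names gaussian_pdf and fit_gmm_1d, the sample data, the hand-picked starting means, the fixed number of iterations, and the small variance floor that keeps the arithmetic stable.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, means, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k      # Assumed starting variances.
    weights = [1.0 / k] * k    # Start with equally weighted components.
    resp = []

    for _ in range(n_iter):
        # E-step: how likely is each point to belong to each component?
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # Floor avoids a zero variance.

    return weights, means, variances, resp


# Made-up sample data: values gathered around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
weights, means, variances, resp = fit_gmm_1d(data, means=[0.0, 6.0])

# Hard assignment: give each point to its most likely component.
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print(labels)
print(means)
```

Unlike k-means, the model keeps a probability for every point and every cluster (the values in resp), so you can see how confident each assignment is before converting it to a single label.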

When to use clustering

In general, you use clustering when your goal is to uncover natural groupings in your data and reveal underlying relationships. You can use this for professional applications, such as grouping your customers by buying habits or patients by response to treatment, or in quality control to detect outliers that fall outside your typical groupings. If you have a large, unlabeled data set, clustering can help you simplify your information and identify patterns, making it easier for you to derive meaningful insights.

Visualization tools in cluster analysis

Both heat maps and self-organizing maps (SOMs) can enrich your cluster analysis by offering distinct ways to visualize and interpret your data. While heat maps provide a direct visual representation of the structure of your data with color gradients, SOMs show data similarities through proximity between points. 

Heat maps

Heat maps in cluster analysis are graphical representations of data in which color represents each value. They are particularly useful for visualizing the presence or magnitude of phenomena, allowing you to quickly see patterns, correlations, and trends in your data. In a heat map, closely related data points have similar colors, which suggests they belong to the same cluster. This visualization can help you see underlying structures in the data before using formal clustering methods and is commonly used in fields such as biology and genetics.
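
One common way to apply this idea is to draw a heat map of the distances between every pair of data points. The sketch below is a rough, text-only illustration in plain Python, where shading characters stand in for colors. The function names distance_matrix and render_heat_map, the sample points, and the character scale are assumptions made for this example. A plotting library would normally draw the same matrix with a real color gradient.

```python
import math

# Assumed scale: dense characters for small distances, blanks for large ones.
SHADES = "#*:. "


def distance_matrix(points):
    """Distance between every pair of points, as a list of rows."""
    return [[math.dist(p, q) for q in points] for p in points]


def render_heat_map(matrix):
    """Turn a matrix of non-negative values into rows of shading characters."""
    largest = max(max(row) for row in matrix) or 1.0
    lines = []
    for row in matrix:
        cells = []
        for value in row:
            # Scale the value to an index into SHADES, capped at the last entry.
            index = min(int(value / largest * len(SHADES)), len(SHADES) - 1)
            cells.append(SHADES[index])
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Made-up sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
matrix = distance_matrix(points)
heat_map = render_heat_map(matrix)
print(heat_map)
```

When points from the same group sit next to each other in the ordering, the dense characters form square blocks along the diagonal, which is the visual cue that hints at clusters.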

Self-organizing maps

A self-organizing map is a type of artificial neural network that takes high-dimensional data and outputs a low-dimensional, typically two-dimensional, representation. You might see this type of representation used in business applications, bioinformatics, and data mining. SOMs can reveal clusters and relationships that might not be apparent from traditional clustering techniques. By mapping high-dimensional data onto a two-dimensional grid, you can use SOMs to understand complex data better, helping you identify patterns and explore data relationships.

Learn how to use clustering algorithms on Coursera

Learn more about cluster analysis with exciting courses on the Coursera learning platform. You can choose between several courses depending on your field and skill level, including the Google Data Analytics Professional Certificate, Business Analytics for Decision Making, or Statistics Foundations.

This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.