What Is an N-gram?

Written by Coursera Staff

Explore N-grams to learn what they are, their benefits, and how you can use them in natural language processing to help computers understand and predict language.


An N-gram is a sequence of "N" consecutive items, typically words, from a piece of text. Many machine learning algorithms count N-grams to measure how often different word sequences occur and to build language models that capture common language patterns. With this type of model, you can predict how sentences will end, transcribe spoken language, and flag likely errors in text.


What is an N-gram in NLP?

In natural language processing (NLP), an N-gram is a sequence of “N” items from a text entry or speech. Computers can analyze sequences of words or characters to create N-grams, which provide a statistical representation of text that helps computers understand language patterns and predict which words will come next.

Essentially, an N-gram model is a probabilistic model that estimates how likely a word is to appear given the words that come before it. For example, you could measure how often the word "I" appears in a text and then how often the word "am" appears immediately after "I," and so on. This helps machines recognize spoken sentences (speech recognition), correct spelling, and translate text.
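
To make this concrete, here is a minimal sketch of a bigram model in Python. The toy corpus is purely illustrative; real models are trained on far larger text collections.

```python
from collections import Counter

# Toy corpus; real models train on much larger text collections.
corpus = "i am hungry . i am tired . i want pizza .".split()

unigram_counts = Counter(corpus)                  # how often each word appears
bigram_counts = Counter(zip(corpus, corpus[1:]))  # how often each pair appears

def bigram_probability(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_probability("i", "am"))    # 0.666... ("am" follows "i" 2 of 3 times)
print(bigram_probability("i", "want"))  # 0.333...
```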

Types of N-grams

You can use several types of N-grams to break text into manageable chunks that help build predictive and analytical models. Some of the most common include:

Unigram

Unigrams use single words such as “I”, “want”, and “pizza”. This type of N-gram is useful for basic frequency analysis as you assess the presence of single items and how often they appear in your text.

Bigram

Bigrams assess pairs of consecutive words such as “I want” and “want pizza”. This helps to explore how pairs of words relate and how often one word appears after another. 

Trigram

Trigrams go a step further to analyze sequences of three words, such as “I want pizza”. This helps to provide deeper contextual information about the words and how they relate to each other. You can assign probabilities to the entire sequence of words to gain a better understanding of how often they appear together.

Higher-order N-grams

You can go beyond trigrams to assess phrases containing four words, five words, and so on. You can determine the appropriate N-gram based on your task and data set. The functionality will remain the same, except you will compute the probability of words appearing in larger sequences.
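
To make the sliding-window idea behind all of these types concrete, here is a small Python sketch that extracts unigrams, bigrams, and trigrams from a sentence; the same function works for any "N."

```python
def extract_ngrams(tokens, n):
    """Slide a window of size n across the tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i want pizza for dinner".split()
print(extract_ngrams(tokens, 1))  # unigrams: [('i',), ('want',), ...]
print(extract_ngrams(tokens, 2))  # bigrams: [('i', 'want'), ('want', 'pizza'), ...]
print(extract_ngrams(tokens, 3))  # trigrams: [('i', 'want', 'pizza'), ...]
```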

Uses of N-grams

You can use N-grams in a variety of applications ranging from grammar and spelling to speech recognition and text prediction. Common ways N-grams assist these applications are as follows:

Speech recognition

By understanding the probability of certain words appearing after others, computers can “listen” to your speech and transcribe it. For example, imagine you said the sentence, “I knew a sweet bear at the zoo.” A machine listening to this sentence wouldn’t necessarily know whether you were saying “knew” versus “new,” “sweet” versus “suite,” or “bear” versus “bare.” Using text prediction, the computer can make an educated guess about which homophone you’re choosing in the context of the sentence.
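
As an illustration of this idea, the sketch below scores two candidate transcriptions by combining bigram probabilities and keeps the likelier one. The probability values are hypothetical, chosen only to show the mechanics; a real recognizer would estimate them from a large corpus.

```python
import math

# Hypothetical bigram probabilities; a real system would estimate
# these from a large corpus of transcribed speech or text.
bigram_prob = {
    ("a", "sweet"): 0.010, ("sweet", "bear"): 0.020,
    ("a", "suite"): 0.004, ("suite", "bare"): 0.0001,
}

def sentence_score(words, unseen=1e-6):
    # Sum log probabilities so long sentences don't underflow to zero.
    return sum(math.log(bigram_prob.get(pair, unseen))
               for pair in zip(words, words[1:]))

candidates = ["a sweet bear".split(), "a suite bare".split()]
print(max(candidates, key=sentence_score))  # ['a', 'sweet', 'bear']
```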

Spelling correction

When typing, you might make a mistake, such as omitting a word or switching a letter. If you typed, “I though I bought bananas,” your computer algorithm could predict with reasonable certainty that you meant to say, “I thought I bought bananas,” and offer a spelling correction. This is because “I thought I” appears in text much more frequently than “I though I,” allowing the algorithm to learn to identify anomalies and flag them as potential errors.
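
A minimal sketch of this kind of frequency comparison might look like the following; the counts are hypothetical stand-ins for real corpus statistics.

```python
# Hypothetical counts standing in for statistics from a large corpus.
trigram_counts = {
    ("i", "thought", "i"): 5400,  # common phrase
    ("i", "though", "i"): 3,      # rare: likely a typo
}

def suggest_correction(typed, alternative, ratio=100):
    """Flag the typed trigram if the alternative is vastly more frequent."""
    if trigram_counts.get(alternative, 0) >= ratio * trigram_counts.get(typed, 1):
        return "Did you mean: " + " ".join(alternative) + "?"
    return "Looks fine."

print(suggest_correction(("i", "though", "i"), ("i", "thought", "i")))
# Did you mean: i thought i?
```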

Text prediction

When you type an email, you might notice certain platforms predict the rest of your sentence. This works through the use of N-grams in NLP algorithms, which analyze the structure and patterns of language to determine the most likely continuation of your sentence.

For example, if an NLP algorithm notices a sentence starting “I hope you have a great” will end in “day” 80 percent of the time, and you start your sentence this way, it might show “day” as a possible next word as you’re typing. Language models often analyze large text datasets, such as social media or news sites, to collect data on common sentence structures and build a strong predictive algorithm.
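
To illustrate, the short sketch below predicts the next word from the two words most recently typed, using counts from a tiny made-up training text.

```python
from collections import Counter, defaultdict

# Tiny made-up training text; real systems learn from huge datasets.
corpus = ("i hope you have a great day . "
          "i hope you have a great day . "
          "i hope you have a great weekend .").split()

# Map each two-word prefix to a tally of the words that followed it.
next_word = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    next_word[(w1, w2)][w3] += 1

def predict(prev_two):
    """Return the most frequent continuation of a two-word prefix, if any."""
    followers = next_word[tuple(prev_two)]
    return followers.most_common(1)[0][0] if followers else None

print(predict(["a", "great"]))  # 'day' (it followed "a great" 2 of 3 times)
```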

Who uses N-grams?

Professionals who want to analyze or predict text use N-grams for a variety of purposes. Common uses by professionals include:

  • Marketers use N-grams to understand customer search trends and market themes.

  • Search engine engineers use N-grams to understand user activities and train user models.

  • Educators use N-grams to detect plagiarism and compare styles between texts. 

  • NLP engineers use N-grams to build language models by breaking down text into smaller, more meaningful segments. 

Advantages of N-grams

N-grams offer several advantages when it comes to text mining and building language models. One of the main advantages of N-grams is that they reduce the resources needed to analyze large volumes of text. By using N-gram indexes, algorithms can handle more data, and there will be lower costs associated with data manipulation, searching, and storage. Algorithms can use N-grams to quickly locate data, record instances, analyze patterns, find correlations, and compare data sets. 

By breaking down large bodies of text, N-grams also offer benefits when it comes to text prediction and speech recognition. Algorithms can understand the probability of certain sequences of words appearing and use this probability to communicate with humans and assist in speech- or text-related tasks. 

Challenges of N-grams

Depending on your data set, you might encounter certain challenges when using N-grams. If you have a limited data set, you might not find enough repeated instances of each sequence, making it difficult for your algorithm to develop a reliable predictive model. In addition, the number of possible N-grams grows exponentially with "N," so larger values of "N" can strain storage capacity and computational power.
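
To see the scale of the problem, consider an illustrative vocabulary of 10,000 distinct words: there are 10,000² = 100 million possible bigrams and 10,000³ = 1 trillion possible trigrams, the vast majority of which will never appear even in a very large corpus. This sparsity is why counts for longer N-grams become unreliable and why storage needs grow so quickly with "N."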

As with any machine learning model, training is an important step to ensure your algorithm can generalize outside of training data. When training your natural language model, you’ll need to use high-quality training data so your model can recognize patterns and make predictions accurately when exposed to new information.

Tools and libraries for implementing N-grams

After building a basic understanding of computer programming and natural language processing applications, you can start implementing N-grams with a few common libraries. The Python ecosystem offers many libraries that can streamline these tasks and help you start experimenting (a short example using two of them follows the list). When starting, consider exploring the following libraries:

  • NLTK: A library offering comprehensive tools like ngrams() for tokenization, text analysis, and N-gram generation.

  • spaCy: An NLP library in Python designed for large-scale text processing and efficient N-gram analysis.

  • TextBlob: A beginner-friendly NLP library for text processing built on NLTK with tutorials you can follow to practice your skills.

  • Scikit-learn: A machine learning library that helps you extract N-gram features using tools like its CountVectorizer() class.
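
As a quick, illustrative example of two of these libraries in action (assuming nltk and scikit-learn are installed):

```python
# Requires: pip install nltk scikit-learn
from nltk import ngrams
from sklearn.feature_extraction.text import CountVectorizer

tokens = "we want pizza for dinner".split()
print(list(ngrams(tokens, 2)))
# [('we', 'want'), ('want', 'pizza'), ('pizza', 'for'), ('for', 'dinner')]

# CountVectorizer extracts unigram and bigram features in one pass.
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(["we want pizza", "we want pasta"])
print(vectorizer.get_feature_names_out())
# ['pasta' 'pizza' 'want' 'want pasta' 'want pizza' 'we' 'we want']
```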

Learn more about N-grams and NLP on Coursera

Many NLP applications, such as language models and text mining algorithms, use N-grams, which are sequences of "N" items. You can learn more about how N-grams fit into natural language processing and machine learning with the Generative AI with Large Language Models course by AWS & DeepLearning.AI. Or, for a more comprehensive overview, consider the Machine Learning Specialization offered by Stanford & DeepLearning.AI.


course

Generative AI with Large Language Models

In Generative AI with Large Language Models (LLMs), you’ll learn the fundamentals of how generative AI works, and how to deploy it in real-world ...

4.8

(3,257 ratings)

356,549 already enrolled

Intermediate level

Average time: 16 hour(s)

Learn at your own pace


specialization

Machine Learning

#BreakIntoAI with Machine Learning Specialization. Master fundamental AI concepts and develop practical machine learning skills in the beginner-friendly, 3-course program by AI visionary Andrew Ng

4.9

(32,112 ratings)

598,672 already enrolled

Beginner level

Average time: 2 month(s)

Learn at your own pace

Skills you'll build:

Logistic Regression, Artificial Neural Network, Linear Regression, Decision Trees, Recommender Systems, TensorFlow, Advice for Model Development, XGBoost, Tree Ensembles, Regularization to Avoid Overfitting, Logistic Regression for Classification, Gradient Descent, Supervised Learning, Anomaly Detection, Unsupervised Learning, Reinforcement Learning, Collaborative Filtering


This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.
