Lemmatization vs. Stemming: Understanding NLP Methods

Written by Coursera Staff

What are lemmatization and stemming? Let’s consider each NLP method in detail to understand the algorithm methodology, strengths and weaknesses, and how to choose the right one for your requirements.

[Featured Image] A person explains to their colleague the difference between lemmatization and stemming.

Key takeaways

Lemmatization provides higher accuracy and context relevance, while stemming offers greater speed at the cost of potentially more errors.

  • Two natural language processing (NLP) methods for breaking down text are lemmatization and stemming, each focusing on different aspects of natural language and on recognizing the root meanings of words.

  • Stemming typically offers faster algorithms, simpler algorithms, and broader search matching, while lemmatization offers contextual understanding, reduced ambiguity, and language compliance.

  • To choose between lemmatization and stemming, decide how precise your output needs to be and whether a rough root of each word is enough.

Explore lemmatization versus stemming, their differences, and the pros and cons of each. If you’re ready to build your NLP skill set, enroll in the Natural Language Processing Specialization from DeepLearning.AI, where in as little as three months, you can learn about supervised learning, data preprocessing, applied machine learning, algorithms, and more.

What is natural language processing?

NLP is a complex field that combines computer science, artificial intelligence, and linguistics. NLP algorithms enable computers to communicate with humans using human language in a conversational way. By analyzing and processing natural language data, NLP systems can extract insights, automate text functions, and respond to user input and requests.

To extract language data, computers need methods for breaking down and processing text. Two such methods are lemmatization and stemming, which take different approaches to recognizing the root meaning of words.

Both methods help reduce the dimensionality of large bodies of text and make it easier for machines to group related words. However, stemming takes a simpler approach that’s more prone to errors, while lemmatization is more computationally intensive yet may return more accurate results.
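To make the idea of reduced dimensionality concrete, here is a minimal sketch that counts distinct tokens before and after mapping word forms to a shared base form. The `BASE_FORM` table and the sample sentence are made up for illustration; in a real pipeline, a stemmer or lemmatizer would supply that mapping.

```python
from collections import Counter

# Hypothetical mapping from surface forms to a shared base form.
# A real pipeline would get this from a stemmer or lemmatizer.
BASE_FORM = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "jumps": "jump", "jumped": "jump",
}

tokens = "she walks he walked they are walking we walk she jumps he jumped".split()

# Count each distinct token as-is, then again after normalization.
raw_counts = Counter(tokens)
normalized_counts = Counter(BASE_FORM.get(t, t) for t in tokens)

print(len(raw_counts), "distinct tokens before normalization")
print(len(normalized_counts), "distinct tokens after normalization")
print("occurrences grouped under 'walk':", normalized_counts["walk"])
```

Fewer distinct tokens means a smaller vocabulary for any downstream model or index, and related forms such as “walks” and “walked” are counted together.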


What is stemming, and what does stemming do?

Stemming is a simple form of word reduction that removes word endings (suffixes) to obtain a base form, often producing strings that are not dictionary words. Although less precise than lemmatization, stemming is quick and efficient when processing large volumes of text.

Stemming algorithms work from a defined set of suffixes and a minimum word length. If a word is long enough, the algorithm scans it for one of the predefined suffixes (such as “-ed” or “-ing”) and removes the suffix to return the root of the word. The length requirement helps prevent errors, such as stripping “-ish” from “fish” to leave “f.” This method is highly efficient and helps computers process written text in an organized way.
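The suffix-and-length approach described above can be sketched in a few lines. This is a toy illustration, not a published algorithm such as the Porter stemmer: the suffix list `SUFFIXES` and the threshold `MIN_STEM_LEN` are assumptions chosen to keep the example short.

```python
# Toy suffix-stripping stemmer: a predefined suffix list plus a minimum stem length.
# SUFFIXES and MIN_STEM_LEN are illustrative assumptions, not a standard rule set.
SUFFIXES = ("ing", "ed", "s")  # checked in this order, longest first
MIN_STEM_LEN = 3               # never leave fewer than three characters


def simple_stem(word: str) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word  # no suffix matched, or the stem would be too short


for w in ["walk", "walking", "walks", "walked", "sing", "red"]:
    print(w, "->", simple_stem(w))
```

Here the length check is what keeps short words such as “sing” and “red” intact, even though they end in a listed suffix.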

If you choose stemming, your algorithm may return stems such as the following (the exact output depends on the stemmer you use):

  • “Walk,” “walking,” and “walks” would become “walk.”

  • “Write,” “written,” and “writer” could become “writ” with an aggressive stemmer.

  • “Requirement” and “require” would become “requir.”

Advantages of stemming

  • Speed and efficiency: Stemming algorithms are generally faster as they follow simple rule-based approaches.

  • Simplicity: The algorithms for stemming use simple heuristic rules, so they are less complex to implement and understand than other methods.

  • Improved search performance: In search engines and information retrieval systems, stemming helps connect different word forms, potentially increasing the breadth of search results.
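As a rough sketch of the search benefit, the example below indexes a few made-up documents by stem, so a query in one word form can match documents that use another. The documents in `DOCS`, the `toy_stem` rules, and the `search` helper are all hypothetical.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Illustrative stemmer: strip one suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Made-up documents for illustration.
DOCS = {
    1: "investing in index funds",
    2: "she invested early",
    3: "walking the dog",
}

# Build an inverted index keyed by stem rather than by surface form.
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for token in text.lower().split():
        index[toy_stem(token)].add(doc_id)


def search(query: str) -> list[int]:
    """Return the sorted IDs of documents that share the query's stem."""
    return sorted(index.get(toy_stem(query.lower()), set()))


print(search("invests"))
print(search("walked"))
```

Without the stemming step, the query “invests” would match neither document, because that exact form never appears in the text.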

Disadvantages of stemming

  • Over-stemming and under-stemming: Stemming can often be imprecise, leading to over-stemming (where words are overly reduced and unrelated words are conflated) and under-stemming (where related words don’t appear related).

  • Language limitations: A stemming algorithm becomes less effective when words appear in irregular forms (e.g., irregularly conjugated verbs).
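Both failure modes show up quickly if you group words by the output of a simple stemmer. The sketch below uses an illustrative rule set (`toy_stem` is an assumption, not a standard algorithm) and a hand-picked word list.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Illustrative stemmer: strip one suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["new", "news", "sing", "sings", "sang", "sung"]

# Group the words by the stem they are reduced to.
groups = defaultdict(list)
for w in words:
    groups[toy_stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
```

In this run, “new” and “news” land in the same group even though they mean different things (over-stemming), while “sang” and “sung” stay separate from “sing” even though they are forms of the same verb (under-stemming).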

What is lemmatization?

Lemmatization goes beyond simple truncation and analyzes the context of the sentence, considering the word's use in the larger text and its inflected forms. After determining the word's context, the lemmatization algorithm returns the word's base form (lemma) from a dictionary reference.

This technique effectively handles different grammatical categories and tenses, providing a more accurate language representation. For example:

  • “Saw” would return “see” when used as a verb and “saw” when used as a noun, so the result depends on the word’s part of speech in the sentence.

  • “Ponies” would return “pony.”

  • “Requirement” and “required” would return separate lemmas: “requirement” and “require.”
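The examples above can be mimicked with a tiny dictionary lookup keyed by word and part of speech. This is only a sketch: the `LEMMAS` table is hand-written for illustration, and the caller supplies the part-of-speech tag, whereas a real lemmatizer would infer it from the sentence and rely on a full dictionary.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("ponies", "NOUN"): "pony",
    ("required", "VERB"): "require",
    ("requirement", "NOUN"): "requirement",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(lemmatize("saw", "VERB"))      # "I saw the film": verb
print(lemmatize("saw", "NOUN"))      # "Hand me the saw": noun
print(lemmatize("Ponies", "NOUN"))
print(lemmatize("required", "VERB"))
```

The same surface form “saw” maps to two different lemmas, which is something a suffix-stripping stemmer cannot do because it never looks at how the word is used.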

Advantages of lemmatization

  • Accuracy and contextual understanding: Lemmatization is more accurate as it considers word context and morphological analysis. It can distinguish between different word uses based on their part of speech.

  • Reduced ambiguity: By converting words to their dictionary form, lemmatization reduces ambiguity and enhances the clarity of text analysis.

  • Language and grammar compliance: Lemmatization adheres more closely to the grammar and vocabulary of the target language, leading to linguistically meaningful outputs.

Disadvantages of lemmatization

  • Computational complexity: Lemmatization algorithms are more complex and computationally intensive than stemming. They require more processing power and time.

  • Dependency on language resources: Lemmatization depends on extensive language-specific resources like dictionaries and morphological analyzers, making it harder to apply to languages where the morphology is complex or such resources are limited, such as Arabic.

What is an example of stemming?

A stem, or the output of stemming, is a shortened version of a word after removing affixes. For example, stemming would turn the words “investment,” “invest,” and “investing” into the shortened stem “invest.” Stems are produced only by removing letters, which can sometimes result in tokens that aren’t complete words. For example, the stem of the word “ponies” could be “poni.”

A lemma, on the other hand, is the base form of a word that you could look up in the dictionary. It comes from normalizing the word rather than only trimming affixes. To compare how this is different from a stem in practice, consider the word “applied.” The stem of “applied” would be “appli,” while the lemma would be “apply.”

When to use lemmatization vs. stemming

When deciding between lemmatization and stemming, consider the type of output you want from your text and the strengths and limitations of each method. Lemmatization is a more resource-intensive process because it requires comprehensive linguistic knowledge. Stemming is a simpler and faster method. While lemmatization provides high accuracy and context relevance, stemming offers greater speed, so it is up to you to determine your priority.

You can see the differences between stemming and lemmatization in the output. For example, let’s say your text has the words “sung,” “sang,” “sings,” “singer,” and “singing.” A simple suffix-stripping stemmer would link only three of the five words to “sing,” returning “sing” for “sings,” “singer,” and “singing.” Because “sung” and “sang” do not have a recognized suffix, they wouldn’t appear in the same category as the others.

Lemmatization, conversely, would recognize the irregular conjugations and return “sing” for “sung,” “sang,” “sings,” and “singing.” It would typically keep “singer” as its own lemma, because “singer” is a noun with its own dictionary entry rather than a form of the verb “sing.”
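Here is a side-by-side sketch of the “sing” example. Both functions are toys: `toy_stem` uses an assumed three-suffix rule set, and `LEMMA_TABLE` is a hand-written lookup standing in for a real dictionary.

```python
def toy_stem(word: str) -> str:
    """Illustrative stemmer: strip one suffix if at least three letters remain."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written lookup standing in for a real lemmatizer's dictionary.
LEMMA_TABLE = {
    "sung": "sing", "sang": "sing", "sings": "sing",
    "singing": "sing", "singer": "singer",
}


def toy_lemmatize(word: str) -> str:
    """Return the table entry for the word, or the word unchanged if unknown."""
    return LEMMA_TABLE.get(word, word)


words = ["sung", "sang", "sings", "singer", "singing"]
for w in words:
    print(f"{w:8} stem={toy_stem(w):6} lemma={toy_lemmatize(w)}")
```

The toy stemmer groups “sings,” “singer,” and “singing” under “sing” but leaves “sung” and “sang” untouched, while the lookup table maps the irregular verb forms to “sing” and keeps the noun “singer” separate.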

Another example would be words that share the same stem but have different meanings. For example, “requirement” and “requires” mean different things, but stemming algorithms would return “requir” for both. Lemmatization algorithms would return “requirement” and “require,” keeping the two words distinct.

To choose the most suitable method, consider how precise your output needs to be and whether a rough root of each word is enough.


