Explore tokenization and learn about one of the key pieces of natural language processing. Plus, learn about tokenization uses across professional industries and how to decide whether tokenization is the right NLP method for your task.
Tokenization is a fundamental part of natural language processing (NLP). In this article, we will examine what NLP is and the role tokenization plays in this process, along with the types of tokenization and the types of tasks you might see tokenization being used for.
Natural language processing (NLP) is a type of artificial intelligence focused on developing computer algorithms that allow computers to read, understand, and generate language. This includes written or spoken language, with the goal of computers being able to comprehend and communicate language in the same way that humans do. To do this, NLP models use a combination of pre-defined rules and machine learning algorithms to analyze the content and intent of messages and interpret the underlying meaning.
Several key components work in tandem to allow NLP algorithms to process and manipulate human language. Some of the common ones include:
Tokenization: Tokenization involves segmenting text into smaller units that are analyzed individually.
Embeddings: Embeddings are words or phrases represented as numerical vectors to make it easier for computers to read and process the input.
Architecture type: NLP systems can use various architectures, including recurrent neural networks (RNNs) and transformers, to perform specific language tasks effectively. The architecture type will play a role in how the algorithm performs.
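To see how the first two components fit together, here is a minimal sketch of turning tokens into integer IDs and then into vectors. The function names, the vector size of 4, and the random values are assumptions for illustration only; real systems learn their embedding values during training.

```python
import random

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_embedding_table(vocab_size, dim, seed=0):
    """Create one random vector per token id (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

# Step 1: tokenization (here, a simple split on spaces).
tokens = "the grass is green and the sky is blue".split()

# Step 2: map tokens to ids, then look up one vector per id.
vocab = build_vocab(tokens)
table = make_embedding_table(len(vocab), dim=4)
ids = [vocab[t] for t in tokens]
vectors = [table[i] for i in ids]

print(ids)  # [0, 1, 2, 3, 4, 0, 5, 2, 6]
```

Repeated tokens such as "the" and "is" share an ID and therefore a vector, which is what lets a downstream architecture treat them as the same unit.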
Tokenization is a term that describes breaking a document or body of text into small units called tokens. You can define tokens by certain character sequences, punctuation, or other definitions depending on the type of tokenization. Doing so makes it easier for a machine to process the text.
When you use tokenization, you must ensure you classify your tokens correctly and that your algorithm is set up to capture your information accurately. Different languages will have different concerns with tokenization, and English is no exception. When it comes to English, one of the tricky aspects of tokenization will be determining how to define your tokens. For example, let’s say you define your words as a set of characters separated by a space or punctuation. This works well with a sentence like “She is at the house.” Your words are all separated nicely by spaces or by punctuation. However, imagine your sentence now says, “She isn’t at the house.” Suddenly, your algorithm classifies “isn” and “t” as separate tokens because punctuation separates the word fragments.
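The apostrophe problem described above can be reproduced in a few lines. This is a sketch using Python's built-in re module; the two function names and the regular expressions are illustrative assumptions, not a standard tokenizer.

```python
import re

def naive_tokenize(text):
    """Treat every run of letters or digits as a token; anything else separates tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def apostrophe_aware_tokenize(text):
    """Same idea, but allow a straight or curly apostrophe inside a word."""
    return re.findall(r"[A-Za-z0-9]+(?:['\u2019][A-Za-z]+)*", text)

print(naive_tokenize("She is at the house."))
# ['She', 'is', 'at', 'the', 'house']

print(naive_tokenize("She isn't at the house."))
# ['She', 'isn', 't', 'at', 'the', 'house']

print(apostrophe_aware_tokenize("She isn't at the house."))
# ['She', "isn't", 'at', 'the', 'house']
```

Keeping the contraction whole is only one possible choice. Some tokenizers instead split it into separate pieces for the verb and the negation, so the right rule depends on what your downstream task needs.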
Apostrophes aren’t the only complication for tokenization algorithms. You also have to consider word variations, multiword combinations that represent a single entity, and informal language. For example, you might want to treat “San Diego” as a single entity, but an NLP algorithm that uses spaces to define words would separate it into “San” and “Diego.” Bodies of text might also contain hyphenated compounds such as “state-of-the-art.” When choosing your tokenization type and architecture, it is important to consider how you want to define your tokens and which strategy will minimize token loss.
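One way to handle hyphenated compounds and multiword entities is sketched below. The KNOWN_PHRASES set is a hypothetical, hand-written list used only for illustration; production systems typically rely on larger dictionaries or trained entity recognizers instead.

```python
import re

# Hypothetical list of two-word entities that should stay together.
KNOWN_PHRASES = {("San", "Diego"), ("Los", "Angeles")}

def tokenize_words(text):
    """Words may contain internal hyphens, so 'state-of-the-art' stays whole."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def merge_phrases(tokens, phrases=KNOWN_PHRASES):
    """Join adjacent token pairs that appear in the phrase list."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = merge_phrases(tokenize_words("San Diego has a state-of-the-art zoo."))
print(tokens)  # ['San Diego', 'has', 'a', 'state-of-the-art', 'zoo']
```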
Depending on your algorithm, you can choose to define your tokens at various levels of granularity. This will often depend on your use case, and different languages may lend themselves better to different types of tokenization. While words are a common choice for token type, you can also choose to reduce your token size to characters or morphemes, or you can expand it to phrases or sentences. Some types of tokens you can define through tokenization include:
Character tokenization: At the most basic level, you can tokenize text into individual characters. Doing so keeps the vocabulary small, because only a limited set of distinct characters needs to be defined.
Word tokenization: Word tokenization involves splitting text into individual words. For instance, the sentence “The grass is green” is tokenized into four tokens with this method: [“The”, “grass”, “is”, “green”].
Phrase tokenization: You can also tokenize text into phrases or chunks that convey a specific meaning. For instance, the phrase “Los Angeles” might be a single token instead of two separate words.
Sentence tokenization: This type of tokenization segments text into sentences. It separates paragraphs or long blocks of text into distinct sentences for analysis.
Subword tokenization: Subword tokenization splits words into smaller units, which often (though not always) correspond to morphemes, the smallest units of meaning in a language. For instance, “unusual” might become [“un”, “usual”].
Number tokenization: Number tokenization uses digits as the primary component of the token and segments numbers from the rest of the body of text. For example, if the phrase was, “She had 15 cats,” then “15” would be the number token.
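Several of the token types listed above can be sketched with Python's standard library alone. The regular expressions, the greedy_subwords helper, and the small set of subword pieces are assumptions chosen to reproduce the examples in the list; they are not the rules of any particular tokenizer.

```python
import re

text = "She had 15 cats. The grass is green."

# Character tokenization: every character of a word becomes a token.
characters = list("green")

# Word tokenization: runs of letters or digits.
words = re.findall(r"[A-Za-z0-9]+", text)

# Sentence tokenization: split on whitespace that follows ., !, or ?.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Number tokenization: pull out runs of digits.
numbers = re.findall(r"\d+", text)

def greedy_subwords(word, pieces):
    """Greedy longest-match split; returns None if the word cannot be covered."""
    result, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in pieces:
                result.append(word[start:end])
                start = end
                break
        else:
            return None  # no known piece starts at this position
    return result

# Hypothetical subword inventory, for illustration only.
pieces = {"un", "usual", "us", "ual"}

print(characters)                          # ['g', 'r', 'e', 'e', 'n']
print(words)                               # ['She', 'had', '15', 'cats', 'The', 'grass', 'is', 'green']
print(sentences)                           # ['She had 15 cats.', 'The grass is green.']
print(numbers)                             # ['15']
print(greedy_subwords("unusual", pieces))  # ['un', 'usual']
```

The sentence splitter here would break on abbreviations such as "Dr." or "e.g.", which is one reason practical sentence tokenizers use additional rules or trained models.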
Many professionals use tokenization for NLP tasks. Regardless of your industry, you can employ tokenization techniques for various information retrieval and analysis tasks. Some common uses include the following:
Information retrieval: Tokenization is important in search engines and information retrieval systems. These systems break down information into tokens to better index and analyze information.
Text preparation: Tokenization helps classify and prepare text by breaking it into well-defined tokens. It helps create feature vectors (numerical representations of the text), making the text suitable as input for computer models.
Sentiment analysis: Sentiment analysis relies on tokenization to assess the sentiments expressed in text, such as positive or negative connotations online.
Generative AI: Early chatbots date back to the 1960s, and chatbots remain one of the most common generative AI applications. They use NLP and tokenization to interpret user language and respond in a human-like way.
Feedback analysis: Businesses or organizations might want to understand how their users respond to their products or services. In this case, NLP algorithms can tokenize the feedback, assess its tone and content, and help businesses or organizations make informed changes.
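As an illustration of the text preparation use above, the following sketch turns two short pieces of feedback into count-based feature vectors, often called a bag-of-words representation. The function names and sample sentences are made up for this example, and real pipelines usually add steps such as stop-word handling or weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters or digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

docs = ["The service was great", "The food was not great"]
vocab, vectors = bag_of_words(docs)

print(vocab)    # ['food', 'great', 'not', 'service', 'the', 'was']
print(vectors)  # [[0, 1, 0, 1, 1, 1], [1, 1, 1, 0, 1, 1]]
```

Each position in a vector corresponds to one vocabulary token, so a model for sentiment or feedback analysis can work with numbers rather than raw text.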
When deciding whether to employ tokenization, you should consider its strengths and weaknesses. While their relative importance will vary depending on your needs, knowing the technique's benefits and constraints can help you make an educated decision. The first three points below are benefits, and the remaining four are limitations.
Enhances data preparation: Tokenization is a fundamental step in preparing text data for NLP tasks to make the text more suitable for machine learning models.
Able to control granularity: With different levels of tokenization, you can decide how granular you want your tokens (e.g., characters, subwords, words).
Adaptable across languages: Tokenization techniques can be adapted to different languages and scripts.
May struggle with ambiguity: Tokenization may struggle with handling language ambiguities. In these cases, you might need a trained model and statistical methods to avoid losing tokens.
Can be resource intensive: Tokenization can be time-consuming if you need to define a high volume of specialized rules.
Might have token loss: Some tokenization processes lose information if the algorithm does not record it, such as when it encounters an unknown word.
May struggle with punctuation: Segmenting tokens that include punctuation, such as apostrophes or dashes, can sometimes be tricky for NLP algorithms.
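The token loss limitation above is easy to demonstrate. In this sketch, any word missing from a small, hypothetical vocabulary is mapped to a placeholder token; the "<unk>" marker is a common convention but is used here as an assumption, and the vocabulary is invented for the example.

```python
UNK = "<unk>"  # placeholder for any token not in the vocabulary

def encode(tokens, vocab):
    """Map tokens to ids, sending unknown tokens to the placeholder id."""
    unk_id = vocab[UNK]
    return [vocab.get(tok, unk_id) for tok in tokens]

def decode(ids, vocab):
    """Map ids back to tokens using the inverted vocabulary."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    return [id_to_token[i] for i in ids]

# Hypothetical vocabulary that does not contain the word "bungalow".
vocab = {UNK: 0, "she": 1, "is": 2, "at": 3, "the": 4, "house": 5}

ids = encode("she is at the bungalow".split(), vocab)
round_trip = decode(ids, vocab)

print(ids)         # [1, 2, 3, 4, 0]
print(round_trip)  # ['she', 'is', 'at', 'the', '<unk>']
```

After the round trip, the original word cannot be recovered. Smaller token units such as subwords or characters reduce this kind of loss because unfamiliar words can still be assembled from known pieces.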
You can continue learning about the exciting field of machine learning and NLP with courses on Coursera from top universities. For a comprehensive overview while learning at your own pace, consider completing the Deep Learning Specialization offered by DeepLearning.AI.
Editorial Team