Natural language processing (NLP) and text queries can run more effectively when you identify stop words. But what are they, and what can you do about them? Learn more about stop words and why it’s useful to avoid them.
Words like “um,” “like,” and “you know” are filler words that carry little information. When it comes to programming natural language processing (NLP) models and doing data retrieval, computers need to be told not to include these words. These uninformative words that don’t add substance are called stop words.
Stop words can make it more challenging to extract meaningful information from your data. A common approach for those working in machine intelligence is to create a stop words list. Identifying stop words in advance typically makes it easier for the model to decrease the “noise.” The NLP model can ignore these uninformative words and, as a result, move more quickly through larger, more diverse amounts of data to surface insights.
This article explains what stop words are and why to avoid them. We’ll also discuss creating a stop words list and where stop words are used.
In NLP, stop words are inconsequential words with little value in helping processors answer queries. The specific stop words can vary based on context. Someone typically needs to manually filter out words that would not help select relevant content.
In English, examples of stop words include:
Articles: a, an, the
Conjunctions: and, but, or
Prepositions: in, on, at, with
Pronouns: he, she, it, they
Common verbs: is, am, are, was, were, be, being, been
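To make that concrete, here is a minimal Python sketch that strips these kinds of words from a sentence. The small stop word set below is an illustrative sample, not a complete list.

```python
# A minimal, illustrative stop word filter (the word set is a small sample, not exhaustive).
STOP_WORDS = {
    "a", "an", "the",            # articles
    "and", "but", "or",          # conjunctions
    "in", "on", "at", "with",    # prepositions
    "he", "she", "it", "they",   # pronouns
    "is", "am", "are", "was", "were", "be", "being", "been",  # common verbs
}

def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

print(remove_stop_words("The model was trained on a large and diverse corpus"))
# ['model', 'trained', 'large', 'diverse', 'corpus']
```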
Yet different languages have different stop words (“and” in English, “und” in German). So do different databases, based on their subject matter. What ProQuest considers a frequently used and, therefore, uninformative word can vary from what Web of Science considers a stop word, for example.
Knowing the stop words for databases you use regularly can also help you hone your search statements. You’ll know which words to exclude and can describe your topic with the most significant words.
Information retrieval systems typically work with stop lists that collect uninformative words to discard during indexing. Filtering out stop words can help the system weigh the relevance of content to the topic searched. After all, deciding what data to store or retrieve often relies on the ratio of topic-related words in a text to the total number of words in that text. By cutting the stop list words, you reduce the total number of words considered, which can yield more accurate results.
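As a simplified illustration of that ratio (a toy heuristic, not how any real search engine scores documents), here is how removing stop words changes the share of topic-related words in a short text:

```python
# Toy illustration: the share of topic-related words in a document,
# computed before and after stop word removal. Real retrieval systems
# use more sophisticated scoring (e.g., TF-IDF or BM25); this only
# shows why discarding stop words can sharpen the signal.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "on", "to", "it"}

def topic_ratio(document: str, topic_terms: set[str], drop_stop_words: bool) -> float:
    tokens = document.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    matches = sum(1 for t in tokens if t in topic_terms)
    return matches / len(tokens)

doc = "the engine is in the car and the engine is loud"
topic = {"engine", "car"}
print(topic_ratio(doc, topic, drop_stop_words=False))  # 3 of 11 words ≈ 0.27
print(topic_ratio(doc, topic, drop_stop_words=True))   # 3 of 4 words = 0.75
```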
You might need to know about stop words if you want a career as an NLP engineer, NLP data scientist, machine learning engineer, artificial intelligence engineer, or software engineer. Understanding stop words can also help you in fields that rely on text mining. You might not design and develop the algorithms, but knowing how to search more effectively could aid your text analysis in a customer service, risk management, maintenance, health care research, or cybersecurity role.
Removing stop words improves accuracy and efficiency for information retrieval and search engines. It also comes in handy when classifying text (e.g., for sentiment analysis) and when mining and analyzing large volumes of text. Eliminating stop words simplifies the identification of themes and patterns, which helps surface important information.
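For instance, here is a brief sketch, using scikit-learn on a tiny invented dataset, of how stop word removal typically fits into a sentiment classification pipeline:

```python
# Sketch of a sentiment classifier where the vectorizer drops English
# stop words before learning; the four example reviews are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "the movie was wonderful and the acting was great",
    "it was a boring film with a terrible plot",
    "an excellent story that they told beautifully",
    "the movie was awful and it was a waste of time",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # built-in English stop list
    LogisticRegression(),
)
model.fit(reviews, labels)
print(model.predict(["a wonderful and excellent film"]))  # likely [1]
```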
Some topic modeling techniques, such as latent Dirichlet allocation (LDA), work better when stop words are removed before the model identifies topics in document collections.
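As a rough sketch of that workflow, with a tiny invented corpus, stop words are usually stripped at the vectorization step before LDA looks for topics:

```python
# Sketch: drop English stop words during vectorization, then fit LDA.
# The six one-line documents are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the bank raised interest rates on the loan",
    "a loan from the bank was approved quickly",
    "interest rates and the bank are in the news",
    "the team won the game in the final minute",
    "it was a great game and the team played well",
    "the final score of the game surprised the team",
]

vectorizer = CountVectorizer(stop_words="english")  # built-in stop list
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")  # e.g., one finance-ish and one sports-ish topic
```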
You could also encounter stop word filtering in machine translation, where removing unimportant words can reduce noise in the output.
You can find much discussion of the value of culling stop words. Researchers in the field continue to weigh the benefits and drawbacks of the stop words approach. In this section, we summarize some of the main points to consider.
Generally, the stop words approach, when done well, can benefit model quality. Databases programmed to ignore common words can provide more accurate results more efficiently. The model’s search improves, and you simultaneously will get fewer (yet more focused) results returned.
You can’t find a single, standardized list of stop words. That’s because the list needs to evolve continually, reflect domain knowledge, and be language specific. For example, Python’s Natural Language Toolkit (NLTK) provides its own generic stop word lists. Still, even when using NLTK, users within the field of finance and accounting might develop their own stop words around auditing or currencies.
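For example, NLTK’s built-in lists can serve as a starting point that you then extend with domain terms; the finance-specific additions below are purely illustrative.

```python
# Start from NLTK's generic stop word lists, then extend them with
# domain-specific terms (the finance examples here are hypothetical).
import nltk
nltk.download("stopwords")  # one-time download of the stop word corpus
from nltk.corpus import stopwords

english_stops = set(stopwords.words("english"))
german_stops = set(stopwords.words("german"))  # lists vary by language

# A finance team might treat ubiquitous domain terms as noise, too.
finance_stops = english_stops | {"audit", "fiscal", "usd", "eur"}

text = "the audit found that the fiscal report was accurate"
filtered = [w for w in text.split() if w not in finance_stops]
print(filtered)  # ['found', 'report', 'accurate']
```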
The time it takes to curate a stop words list is another limitation. After all, someone has to construct that word list. Plus, having a human compile the list can enshrine bias, as whether a word qualifies or not is a subjective decision. For example, if someone aggressively prunes words from a model, the results could skew in the direction of whatever that analyst thought important from the outset (before the model even runs).
Generating a stop words list is a common solution to avoid the distraction of all those uninformative words. The source of your stop list depends on the context. You’ll need to consider the tools or programming language you’re using, any generic stop list they provide, and the scope and context of your searches. For example, a company with proprietary products might want to add even more specific terms.
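One simple and common approach, sketched below on an invented mini-corpus, is to derive a corpus-specific stop list from document frequency: words that appear in nearly every document carry little discriminating power.

```python
# Sketch: build a corpus-specific stop list from document frequency.
# Words appearing in more than a chosen share of documents (the
# threshold below is arbitrary) are treated as uninformative here.
from collections import Counter

docs = [
    "acme widget manual explains the widget setup",
    "acme support guide for the widget battery",
    "acme warranty terms for the widget",
]

doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.lower().split()))  # count each word once per doc

threshold = 0.8  # appears in >80% of documents -> candidate stop word
custom_stops = {w for w, df in doc_freq.items() if df / len(docs) > threshold}
print(sorted(custom_stops))  # ['acme', 'the', 'widget']
```

The threshold here is a tuning decision: set it too low and you discard meaningful terms, set it too high and the list filters out almost nothing.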
Beyond these basics, you can find rigorous research studies into different approaches to developing stop words lists. Those who research data science balance the computational effort required against the cost of the method.
Understanding stop words and stop words lists can benefit both the people doing database searches or topic queries and those working to program the models and systems doing the work.
Interested in getting involved with stop words lists? Pursuing computer science, specifically data science, is a good starting point, as stop words are common in text analysis and information retrieval. Natural language processing can also call on an understanding of linguistics and statistics.
You can prepare for a career as a data scientist in under five months with the IBM Data Science Professional Certificate on Coursera. Ready to master NLP? The intermediate Natural Language Processing Specialization on Coursera can help you learn cutting-edge techniques over four courses.