The 2022 Definitive Guide to Natural Language Processing (NLP)

Natural language processing (NLP) is a field of study that deals with the interactions between computers and human languages. It is also called computational linguistics. Natural language processing aims to computationally understand natural languages, which will enable them to be used in many different applications such as machine translation, information extraction, speech recognition, text mining, and summarization.

NLP technology has come a long way in recent years with the emergence of advanced deep learning models. There are now many different software applications and online services that offer NLP capabilities. Moreover, with the growing popularity of large language models like GPT3, it is becoming increasingly easier for developers to build advanced NLP applications. This guide will introduce you to the basics of NLP and show you how it can benefit your business.

What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It helps computers to understand, interpret, and manipulate human language, like speech and text. The simplest way to understand natural language processing is to think of it as a process that allows us to use human languages with computers. Computers can only work with data in certain formats, and they do not speak or write as we humans can.

Harness the full potential of AI for your business

Natural language refers to the way we, humans, communicate with each other. It is the most natural form of human communication with one another. Speakers and writers use various linguistic features, such as words, lexical meanings, syntax (grammar), semantics (meaning), etc., to communicate their messages. However, once we get down into the nitty-gritty details about vocabulary and sentence structure, it becomes more challenging for computers to understand what humans are communicating.

That’s why NLP helps bridge the gap between human languages and computer data. NLP gives people a way to interface with computer systems by allowing them to talk or write naturally without learning how programmers prefer those interactions to be structured.

To provide a nuanced understanding of natural language processing, it is necessary to understand the different levels with which machines can process and understand natural languages. These levels are as follows:

Phonetical and Phonological level - This level deals with understanding the patterns present in the sound and speeches related to the sound as a physical entity.
Morphological level - This level deals with understanding the structure of the words and the systematic relations between them.
Lexical level - This level deals with understanding the part of speech of the word.
Syntactic level - This level deals with understanding the structure of the sentence.
Semantic level - This level deals with understanding the literal meaning of the words, phrases, and sentences.
Discourse level - This level deals with understanding units larger than a single sentence utterance.
Pragmatic level - This level deals with using real-world knowledge to understand the bigger context of the sentence.

The earliest NLP applications were rule-based systems that only performed certain tasks. These programs lacked exception handling and scalability, hindering their capabilities when processing large volumes of text data. This is where the statistical NLP methods are entering and moving towards more complex and powerful NLP solutions based on deep learning techniques.

From the computational perspective, natural language processing is a branch of artificial intelligence (AI) that combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in text or voice data and extract meaning incorporated with intent and sentiment.

Natural Language Processing is usually divided into two separate fields - natural language understanding (NLU) and natural language generation (NLG).

Natural language understanding vs generation

Natural Language Understanding (NLU)

Natural Language Understanding deals with the ability of computers to understand human language. NLU is all about the computer’s ability to capture meaning and knowledge from human language. The applications of this ability are almost endless, as it provides an appealing solution: automatically construct a structured knowledge base by reading natural language text. Much of information extraction can be described in terms of entities, relations, sentiment, and events.

Natural Language Generation (NLG)

In many of the most interesting problems in natural language processing, language is the output. Natural language generation focuses on three main scenarios:

data to text - text is generated to explain or describe a structured record or unstructured perceptual input;
text to text - typically involves fusing information from multiple linguistic sources into a single coherent summary;
dialogue - text is generated as part of an interactive conversation with one or more human participants.

Text Analytics Computational Steps

There is a significant difference between NLP and traditional machine learning tasks, with the former dealing with unstructured text data while the latter usually deals with structured tabular data. Therefore, it is necessary to understand human language is constructed and how to deal with text before applying deep learning techniques to it. This is where text analytics computational steps come into the picture.

Speech-to-text

Speech-to-Text or speech recognition is converting audio, either live or recorded, into a text document. This can be done by concatenating words from an existing transcript to represent what was said in the recording; with this technique, speaker tags are also required for accuracy and precision. Alternatively, machine learning algorithms might be applied to automatically extract features like intonation patterns that then trigger phoneme sequences that correspond to specific word types, which will lead to more accurate results than using only a transcription dictionary or language model.

OCR

Optical character recognition, or OCR in short, is the process of converting digital images (or pictures) of typed or handwritten text into machine-readable text. OCR software scans an image for one or more characters that resemble letters, numbers, punctuation marks like dashes ("-"), asterisks (*), etc., followed by a space in between each letter. The OCR program then produces what looks like regular printed words on paper - but all the odd-looking symbols have been replaced with letters from your chosen language’s alphabet. This is the first necessary step to take if your application of NLP starts with documents that are not digitized and converted to machine-readable text.

Language Identification

A language is a set of words and their grammatical structure that users of one particular dialect (a.k.a., “language variant”) use to communicate with one another and perform other functions like literature or advertising in certain contexts. Languages like English, Chinese, and French are written in different alphabets. Each language has its own unique set of rules and idiosyncrasies. As basic as it might seem from the human perspective, language identification is a necessary first step for every natural language processing system or function.

Tokenization

The next step in natural language processing is to split the given text into discrete tokens. These are words or other symbols that have been separated by spaces and punctuation and form a sentence.

Tokenization algorithms can be classified on their input and output: Some are restricted to just single sentences; some accept arbitrary blocks of texts; still, others break down text data into individual words only. For alphabetic languages such as English, deterministic scripts usually suffice to achieve accurate tokenization. However, in logographic writing systems such as Chinese script, words are typically composed of a small number of characters without intervening whitespace. The simplest approach matches character sequences against a known dictionary, using additional statistical information about word frequency. However, no dictionary is completely comprehensive, and dictionary-based approaches can struggle with such out-of-vocabulary words. Hence, there is increasing popularity of deep learning methods used for that purpose.

Stemming and Lemmatization

Another important computational process for text normalization is eliminating inflectional affixes, such as the -ed and -s suffixes in English. Stemming is the process of finding the same underlying concept for several words, so they should be grouped into a single feature by eliminating affixes.

The stemming process may lead to incorrect results (e.g., it won’t give good effects for ‘goose’ and ‘geese’). To overcome such problems, we make use of lemmatization. Lemmatization is the process of extracting the root form of a word. It converts words to their base grammatical form, as in “making” to “make,” rather than just randomly eliminating affixes. An additional check is made by looking through a dictionary to extract the root form of a word in this process.

Stemming vs. Lemmatization

Stemming and lemmatization are language-specific processes: an English stemmer or lemmatizer is of little use on a text written in another language.

Sentence Breaking

Check this series

Sentence breaking refers to the computational process of dividing a sentence into at least two pieces or breaking it up. It can be done to understand the content of a text better so that computers may more easily parse it. Still, it can also be done deliberately with stylistic intent, such as creating new sentences when quoting someone else’s words to make them easier to read and follow. Breaking up sentences helps software parse content more easily and understand its meaning better than if all of the information were kept.

Sentence breaking is done manually by humans, and then the sentence pieces are put back together again to form one coherent text. For a computer to do this correctly, it must first be programmed with specific rules that know where sentences should be broken and how they should be reassembled from those segments; computers typically find these segmentation points through simple pattern matching or turning the problem into a graph search. The software divides a sentence into at least two parts. Sentences are broken on punctuation marks, commas in lists, conjunctions like “and” or “or” etc. It also needs to consider other sentence specifics, like that not every period ends a sentence (e.g., like the period in “Dr.”).

Part of Speech Tagging

Part of Speech tagging (or PoS tagging) is a process that assigns parts of speech (or words) to each word in a sentence. For example, the tag “Noun” would be assigned to nouns and adjectives (e.g., “red”); “Adverb” would be applied to adverbs or other modifiers.

The basic idea behind Part-of-speech tagging is that different parts of speech have syntactic rules associated with them: verbs change depending on tense, subjects replace pronouns, determiners like ‘a’ or ’the’ don’t show up after certain prepositions, etc. By assigning tags for every word in language content, one can create more specific machine learning models and rephrase sentences according to data inputs from text mining software.

Chunking

Chunking refers to the process of breaking the text down into smaller pieces. The most common way to do this is by dividing sentences into phrases or clauses. However, a chunk can also be defined as any segment with meaning independently and does not require the rest of the text for understanding.

This breaks up long-form content and allows for further analysis based on component phrases (noun phrases, verb phrases, prepositional phrases, and others).

Syntax Parsing

Syntax parsing is the process of segmenting a sentence into its component parts. It’s important to know where subjects start and end, what prepositions are being used for transitions between sentences, how verbs impact nouns and other syntactic functions to parse syntax successfully. Syntax parsing is a critical preparatory task in sentiment analysis and other natural language processing features as it helps uncover the meaning and intent. In addition, it helps determine how all concepts in a sentence fit together and identify the relationship between them (i.e., who did what to whom). This part is also the computationally heaviest one in text analytics.

Sentence Chaining

Sentence chaining is the process of understanding how sentences are linked together in a text to form one continuous thought. All natural languages rely on sentence structures and interlinking between them. This technique uses parsing data combined with semantic analysis to infer the relationship between text fragments that may be unrelated but follow an identifiable pattern. One of the techniques used for sentence chaining is lexical chaining, which connects certain phrases that follow one topic.

The sentence chaining process is typically applied to NLU tasks. As a result, it has been used in information extraction and question answering systems for many years. For example, in sentiment analysis, sentence chains are phrases with a high correlation between them that can be translated into emotions or reactions. Sentence chain techniques may also help uncover sarcasm when no other cues are present.

NLP Tasks

Many NLP tasks target particular problem areas. These tasks can be broken down into several different categories.

Topic Modeling

Check this series

The main goal for topic segmentation is extracting the main topics from a document. A cohesive topic segment forms a unified whole, using various linguistic operators: repeated references to an entity or event; the use of conjunctions to link related ideas; and the repetition of meaning through lexical choices. Each of these cohesive devices can be measured and then used as features for topic modeling.

Topic models can be constructed using statistical methods or other machine learning techniques like deep neural networks. The complexity of these models varies depending on what type you choose and how much information there is available about it (i.e., co-occurring words). Statistical models generally don’t rely too heavily on background knowledge, while machine learning ones do. Still, they’re also more time-consuming to construct and evaluate their accuracy with new data sets.

In Wikipedia biographies, these segments often pertain to various aspects of the subject’s life: early years, major events, impact on others, etc. Alternatively, scientific research articles are often organized by functional themes: the introduction, a survey of previous research, experimental setup, and results.

Text Classification

The text classification task involves assigning a category or class to an arbitrary piece of natural language input such as documents, email messages, or tweets. Text classification has many applications, from spam filtering (e.g., spam, not spam) to the analysis of electronic health records (classifying different medical conditions).

Deep learning methods prove very good at text classification, achieving state-of-the-art results on a suite of standard academic benchmark problems.

Text Summarization

The text summarization task produces a short extract of arbitrary natural language input, typically a document or article. The goal in the sentence-level text summarization (SLTS) tasks is to create a summary that retains the meaning and style of the source: synthesizing high-level concepts while maintaining factual accuracy without excessive detail.

Keyword Extraction

The keyword extraction task aims to identify all the keywords from a given natural language input. Utilizing keyword extractors aids in different uses, such as indexing data to be searched or creating tag clouds, among other things.

Services like PubMed auto-tag their articles based on AI keyword extraction.

Named Entity Recognition

The entity recognition task involves detecting mentions of specific types of information in natural language input. Typical entities of interest for entity recognition include people, organizations, locations, events, and products.

Examples of highlighting entities in text

Named Entity Disambiguation

Named Entity Disambiguation (NED), or Named Entity Linking, is a natural language processing task that assigns a unique identity to entities mentioned in the text. It is used when there’s more than one possible name for an event, person, place, etc. The goal is to guess which particular object was mentioned to correctly identify it so that other tasks like relation extraction can use this information.

Relation Extraction

The task of relation extraction involves the systematic identification of semantic relationships between entities in natural language input. For example, given the sentence “Jon Doe was born in Paris, France.”, a relation classifier aims at predicting the relation of “bornInCity.” Relation Extraction is the key component for building relation knowledge graphs. It is crucial to natural language processing applications such as structured search, sentiment analysis, question answering, and summarization.

Semantic Search

Semantic Search is the process of search for a specific piece of information with semantic knowledge. It can be understood as an intelligent form or enhanced/guided search, and it needs to understand natural language requests to respond appropriately.

To explain in detail, the semantic search engine processes the entered search query, understands not just the direct sense but possible interpretations, creates associations, and only then searches for relevant entries in the database. Since the program always tries to find a content-wise synonym to complete the task, the results are much more accurate and meaningful.

Sentiment Analysis

Sentiment analysis is a task that aids in determining the attitude expressed in a text (e.g., positive/negative). Sentiment Analysis can be applied to any content from reviews about products, news articles discussing politics, tweets that mention celebrities. It is often used in marketing and sales to assess customer satisfaction levels. The goal here is to detect whether the writer was happy, sad, or neutral reliably.

Question Answering

Question answering (QA) refers to tasks requiring a user to ask a question and receiving an answer. QA systems are of two types: “closed-domain” QA, where the answer comes from a limited set of possible answers (e.g., about France), or " open-domain" QA that is more general in content. The most famous and popular QA systems are the ones that provide answers to factual questions - such as “What is the capital of France?”. The victory of the Watson question-answering system against three top human players on the game show Jeopardy! was a landmark moment for NLP. However, we can also have a system answer more open-ended or subjective questions like “Why do people drink coffee?”. Question answering is commonly applied by search engine providers, such as Google Search.

Predictive Text

Autocorrect, autocomplete, predict analysis text are some of the examples of utilizing Predictive Text Entry Systems. Predictive Text Entry Systems uses different algorithms to create words that a user is likely to type next. For example, the first word will be entered by default. Then for each key pressed from the keyboard, it will predict a possible word based on its dictionary database it can already be seen in various text editors (mail clients, doc editors, etc.). In addition, the system often comes with an auto-correction function that can smartly correct typos or other errors not to confuse people even more when they see weird spellings. These systems are commonly found in mobile devices where typing long texts may take too much time if all you have is your thumbs.

Machine Translation

Machine translation is the task of converting a piece of text from one language to another. Its goal is to enable us to read or listen in a different language without knowing the said one. There are several challenges of Machine Translation: accuracy (not every translated phrase will be accurate), lack of context (can translations contain hidden meaning?), cultural interpretation, and nuances (language may have variations that are hard to capture). Tools like Google Translate are the most popular applications of this task. They come with multiple applications, including automatic website translation services, text to speech, and language learning.

Conversational Agents

The main task of a conversational agent is to have conversations with humans. The most popular type of conversational agent is chatbots – they use simple responses based on a given input. Their function is to provide the answer or perform the requested action. They are used in many different fields: telecommunications (providing support), marketing and sales (24/7 sales and helping customers), education (languages learning). However, some challenges come along with designing this kind of technology: not being able to answer all questions using only natural language understanding, the fact that it may feel dehumanizing if an AI doesn’t act like a human would when talking about emotions, etc., lack of accuracy due to the complexity inherent to natural languages. Currently, most conversational agents operate within a certain field or subject where most of the scenarios have been well defined.

Related Case Study

Amygdala is a mobile app designed to help people better manage their mental health by translating evidence-based Cognitive Behavioral Therapy to technology-delivered interventions. Amygdala has a friendly, conversational interface that allows people to track their daily emotions and habits and learn and implement concrete coping skills to manage troubling symptoms and emotions better. This AI-based chatbot holds a conversation to determine the user’s current feelings and recommends coping mechanisms. Here you can read more on the design process for Amygdala with the use of AI Design Sprints.

Virtual chatbot app screens that supports mental health

Virtual chatbot that supports mental health

Challenges of NLP for Human Language

One big challenge for natural language processing is that it’s not always perfect; sometimes, the complexity inherent in human languages can cause inaccuracies and lead machines astray when trying to understand our words and sentences. Data generated from conversations, declarations, or even tweets are examples of unstructured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases and represent the vast majority of data available in the actual world. It is messy and hard to manipulate.

Ambiguity

In natural language, there is rarely a single sentence that can be interpreted without ambiguity. Ambiguity in natural language processing refers to sentences and phrases interpreted in two or more ways. Ambiguous sentences are hard to read and have multiple interpretations, which means that natural language processing may be challenging because it cannot make sense out of these sentences. Word sense disambiguation is a process of deciphering the sentence meaning.

Irony and Sarcasm

NLP software is challenged to reliably identify the meaning when humans can’t be sure even after reading it multiple times or discussing different possible meanings in a group setting. Irony, sarcasm, puns, and jokes all rely on this natural language ambiguity for their humor. These are especially challenging for sentiment analysis, where sentences may sound positive or negative but actually mean the opposite.

Domain-specific Knowledge

Natural language processing isn’t limited just to understanding what words mean; there’s also interpreting how they should be used within the wider context; background information that may not be explicitly stated but inferred by the program based on surrounding text and domain-specific knowledge.

Models that are trained on processing legal documents would be very different from the ones that are designed to process healthcare texts. Same for domain-specific chatbots - the ones designed to work as a helpdesk for telecommunication companies differ greatly from AI-based bots for mental health support.

Support for Multiple Languages

For example, the most popular languages, English or Chinese, often have thousands of pieces of data and statistics that are available to analyze in-depth. However, many smaller languages only get a fraction of the attention they deserve and consequently gather far less data on their spoken language. This problem can be simply explained by the fact that not every language market is lucrative enough for being targeted by common solutions.

It is inspiring to see new strategies like multilingual transformers and sentence embeddings that aim to account for language differences and identify the similarities between various languages.

Lack of Trust Towards Machines

Another challenge is designing NLP systems that humans feel comfortable using without feeling dehumanized by their interactions with AI agents who seem apathetic about emotions rather than empathetic as people would typically expect.

NLP Use Cases - What is Natural Language Processing Good For?

There are multiple real-world applications of natural language processing.

Media analysis is one of the most popular and known use cases for NLP. It can be used to analyze social media posts, blogs, or other texts for the sentiment. Companies like Twitter, Apple, and Google have been using natural language processing techniques to derive meaning from social media activity.

Multiple solutions help identify business-relevant content in feeds from SM sources and provide feedback on the public’s opinion about companies’ products or services. This type of technology is great for marketers looking to stay up to date with their brand awareness and current trends.

Content Creation

Artificial intelligence and machine learning methods make it possible to automate content generation. Some companies specialize in automated content creation for Facebook and Twitter ads and use natural language processing to create text-based advertisements. To some extent, it is also possible to auto-generate long-form copy like blog posts and books with the help of NLP algorithms.

Sentiment Analysis

Sentiments are a fascinating area of natural language processing because they can measure public opinion about products, services, and other entities. Sentiment analysis aims to tell us how people feel towards an idea or product. This type of analysis has been applied in marketing, customer service, and online safety monitoring.

Automated Document Processing

Manual document processing is the bane of almost every industry. Automated document processing is the process of extracting information from documents for business intelligence purposes. A company can use AI software to extract and analyze data without any human input, which speeds up processes significantly.

Automated Report Generation

Summarizing documents and generating reports is yet another example of an impressive use case for AI. We can generate reports on the fly using natural language processing tools trained in parsing and generating coherent text documents.

Chatbots for Customer Support

Chatbots are currently one of the most popular applications of NLP solutions. Virtual agents provide improved customer experience by automating routine tasks (e.g., helpdesk solutions or standard replies to frequently asked questions). Chatbots can work 24/7 and decrease the level of human work needed.

State-of-the-Art Machine Learning Methods - Large Language Models and Transformers Architecture

The large language models (LLMs) are a direct result of the recent advances in machine learning. In particular, the rise of deep learning has made it possible to train much more complex models than ever before. The recent introduction of transfer learning and pre-trained language models to natural language processing has allowed for a much greater understanding and generation of text. Applying transformers to different downstream NLP tasks has become the primary focus of advances in this field.

Transformer Model Architecture

The transformer architecture was introduced in the paper “ Attention is All You Need” by Google Brain researchers. The paper proposed a new model architecture based on self-attention and demonstrated that this approach could be used to achieve state-of-the-art results on various natural language tasks such as machine translation, text classification, text generation, and question answering.

Since then, transformer architecture has been widely adopted by the NLP community and has become the standard method for training many state-of-the-art models. The most popular transformer architectures include BERT, GPT-2, GPT-3, RoBERTa, XLNet, and ALBERT.

The advantage of these methods is that they can be fine-tuned to specific tasks very easily and don’t require a lot of task-specific training data (task-agnostic model). However, the downside is that they are very resource-intensive and require a lot of computational power to run. If you’re looking for some numbers, the largest version of the GPT-3 model has 175 billion parameters and 96 attention layers.

Benefits of Natural Language Processing

Saves time and money - NLP can automate tasks like data entry, reporting, customer support, or finding information on the web. All these things are time-consuming for humans but not for AI programs powered by natural language processing capabilities. This leads to cost savings in hiring new employees or outsourcing tedious work to chatbots providers.
Reduces workloads - Companies can apply automated content processing and generation or utilize augmented text analysis solutions. This leads to a reduction in the total number of staff needed and allows employees to focus on more complex tasks or personal development.
Increase revenue - NLP systems can answer questions about products, provide customers with the information they need, and generate new ideas that could lead to additional sales.
Prone to error - NLP technology offers increased quality assurance for a wide range of processes.
Attract potential customers - NLP solutions can generate leads in the form of phone calls and emails that would have been difficult for humans to achieve independently.
Provide better customer service - Customers will be satisfied with a company’s response time thanks to the enhanced customer service.

Summary

If you’ve been following the recent AI trends, you know that NLP is a hot topic. It refers to everything related to natural language understanding and generation - which may sound straightforward, but many challenges are involved in mastering it. Our tools are still limited by human understanding of language and text, making it difficult for machines to interpret natural meaning or sentiment. This blog post discussed various NLP techniques and tasks that explain how technology approaches language understanding and generation. NLP has many applications that we use every day without realizing- from customer service chatbots to intelligent email marketing campaigns and is an opportunity for almost any industry.

Join AI Design Sprint workshops focusing on NLP if you want to find out more about how your company could benefit from artificial intelligence! And if you are looking for a team of engineers experienced in developing various NLP solutions make sure to contact us!

About the author

Wojciech Marusarz

Software Engineer

Wojciech enjoys working with small teams where the quality of the code and the project's direction are essential. In the long run, this allows him to have a broad understanding of the subject, develop personally and look for challenges. He deals with programming in Java and Kotlin. Additionally, Wojciech is interested in Big Data tools, making him a perfect candidate for various Data-Intensive Application implementations.

The 2022 Definitive Guide to Natural Language Processing (NLP)

What is Natural Language Processing?

Harness the full potential of AI for your business

Natural Language Understanding (NLU)

Natural Language Generation (NLG)

Text Analytics Computational Steps

Speech-to-text

OCR

Language Identification

Tokenization

Stemming and Lemmatization

Stemming vs. Lemmatization

Sentence Breaking

Part of Speech Tagging

Chunking

Syntax Parsing

Sentence Chaining

NLP Tasks

Topic Modeling

Text Classification

Text Summarization

Keyword Extraction

Named Entity Recognition

Named Entity Disambiguation

Relation Extraction

Semantic Search

Sentiment Analysis

Question Answering

Predictive Text

Machine Translation

Conversational Agents

Challenges of NLP for Human Language

Ambiguity

Irony and Sarcasm

Domain-specific Knowledge

Support for Multiple Languages

Lack of Trust Towards Machines

NLP Use Cases - What is Natural Language Processing Good For?

Social Media Monitoring

Content Creation

Sentiment Analysis

Automated Document Processing

Automated Report Generation

Chatbots for Customer Support

Are you ready for AI implementation?

State-of-the-Art Machine Learning Methods - Large Language Models and Transformers Architecture

Transformer Model Architecture

Benefits of Natural Language Processing

Summary

About the author

This article is a part of

Insights on practical AI applications just one click away

Done!

Thanks for joining the newsletter

1. Definitions

2. Cookies

3. How System Logs work on the Website

4. Cookie mechanism on the Website

5. Additional information

Want to unlock the full potential of Artificial Intelligence technology?