Last updated on October 29, 2020
By Cindy Espinosa
What is Natural Language Processing?
Natural language processing (NLP) is a way for computers to understand human language in its various forms, including speech, written text, and their idiosyncrasies [1]. NLP has applications across a wide range of disciplines, including philosophy, psychology, linguistics, and computational linguistics [2]. Our research group is currently approaching NLP through a computational linguistics lens. While linguistics can be studied at several levels, our research group is focused on syntax and semantics in written text, which concern the literal meaning of the text as well as the meaningfulness of words and phrases [2].
NLP implementations include “machine translation, information extraction, information retrieval, sentiment analysis, question and answering robots” [2]. Machine learning for NLP often relies on a corpus, which by definition is a finite body or collection of texts used as the data. The size of a corpus may grow as systems advance and can handle larger sizes, upwards of 600 million words [2].
A few of the main resources used in natural language processing are WordNet and VerbNet [2]. These databases are used for both “commercial and academic research” [2]. However, it is important to note that although data sets like these are large, and are therefore used as training data and as a foundation for building models, they contain derogatory language and were created or used by elite institutions. This is an aspect of machine learning and natural language processing that we are keeping in mind as we move forward.
Current Problems of Natural Language Processing
In each area of natural language processing, including machine translation, information extraction (IE), information retrieval (IR), sentiment analysis, and question and answering robots, there are still open questions about how to improve communication between human language and computer systems.
Machine Translation
Humans know how to translate between languages and keep the meaning from one language, the source, to another language, the target, by adjusting their speech or written text based on the context. In machine translation, however, the problems of “naturalness” and “adequacy”, or accuracy of the translation, remain, especially when the source and target languages are very different [2]. Improving existing algorithms like K-means could help accomplish this goal [3].
Information Extraction
Modern applications of information extraction include extracting structured information not only from written text, but also from a variety of media forms like audio, video, and images [2]. For multimedia, the goal of information extraction includes automatically annotating the content of an image, video, or animation [4]. This work is currently limited by the volume of unstructured data, so one way to combat the difficulty is for information extraction applications to focus on a specific domain or topic [2,5].
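To make the idea of restricting information extraction to a specific domain more concrete, here is a minimal sketch in Python. The domain (dosage instructions), the pattern, and the field names are all assumptions chosen for illustration; they do not come from the cited sources, and real systems are far more robust.

```python
import re

# Hypothetical pattern for one narrow domain: dosage mentions such as
# "200 mg of ibuprofen". The pattern and field names are illustrative only.
DOSAGE = re.compile(
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|ml|g)\s+of\s+(?P<drug>[a-z]+)",
    re.IGNORECASE,
)

def extract_dosages(text):
    """Return one structured record per dosage mention found in free text."""
    records = []
    for match in DOSAGE.finditer(text):
        records.append({
            "drug": match.group("drug").lower(),
            "amount": float(match.group("amount")),
            "unit": match.group("unit").lower(),
        })
    return records

note = "Patient was given 200 mg of ibuprofen, then 5 ml of Amoxicillin."
records = extract_dosages(note)
print(records)
```

Because the pattern only knows about one kind of statement, it ignores everything else in the text. That narrowness is what makes the unstructured input manageable.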
Applications
While all these areas of natural language processing have applications, a few are more customer- or commercial-facing, like information retrieval, sentiment analysis, and question and answering robots. Some of these products or tasks have made it into our homes or smartphones, like voice assistants, language translation applications, or autocorrect.
Information Retrieval
The main goal of information retrieval is to find relevant documents in a database [2]. Typically, the user must input the information they are looking for before the information retrieval model can give an output [2]. The information retrieval model does not offer any analysis of the documents; it only aims to return accurate output. The information retrieval model itself combines a model for documents, a model for user input, and a matching function that compares the user input against the documents in the database [6].
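The three parts described above (a document model, a user input model, and a matching function) can be sketched in a few lines of Python. This is only one possible arrangement, assuming a bag-of-words representation and cosine similarity as the matching function; the function names and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter

def to_vector(text):
    """Model a document or a query as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Matching function: cosine similarity between two count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, documents):
    """Return documents that share words with the query, best match first."""
    query_vec = to_vector(query)
    scored = [(cosine(query_vec, to_vector(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0]

documents = [
    "Machine translation converts text between languages.",
    "Sentiment analysis measures opinions in product reviews.",
    "K-means groups similar documents into clusters.",
]
results = search("opinions in reviews", documents)
print(results)
```

Note that the sketch only ranks and returns documents. It does not analyze them, which matches the role of an information retrieval model described above.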
Sentiment Analysis
Sentiment analysis is probably one of the more familiar uses of natural language processing, because of how it can extract information from the internet and social media [2]. The goal of sentiment analysis is to quantify the opinions and emotions of the public. Oftentimes, people interact with customer service online or make their opinions about products and services available via comments or forums. Sentiment analysis also uses context to form the meaning of words and phrases [2].
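As a rough illustration of how opinions can be quantified, the sketch below scores comments with a tiny word list and uses one very simple form of context: a negation word flips the polarity of the word that follows it. The lexicon, the negation list, and the sample comments are invented for this example. Production systems rely on much larger resources or on trained models.

```python
import re

# Tiny hand-made lexicon, invented for illustration only.
LEXICON = {"great": 1, "love": 1, "helpful": 1,
           "slow": -1, "broken": -1, "terrible": -1}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum word polarities, flipping the sign when a negation comes right before."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        polarity = LEXICON.get(word, 0)
        if i > 0 and words[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

comments = [
    "I love this phone, the support team was helpful",
    "The app is slow and the update is broken",
    "Customer service was not helpful",
]
scores = [sentiment_score(comment) for comment in comments]
print(scores)  # [2, -2, -1]
```

The third comment shows why context matters: without the negation check, "not helpful" would be counted as a positive opinion.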
Question and Answering Robots
Robots might seem out of the ordinary in relation to natural language processing, but on the internet, you are likely to come across a chatbot that uses this technology. Natural language processing is a field of artificial intelligence, and these types of robots require a full range of natural language processing capabilities, including human speech as well as varying levels of text (both syntax and semantics) [2]. Common types of chatbots include technical support, customer service, and language learning tutors [2]. Sometimes these functions are offered over hotlines or support services or, for the language learning tutors, in “language centers” [2].
Natural Language Processing Models and Algorithms
Neural networks (computer systems built from a series of algorithms modeled on the connections between neurons in the human brain) are popular in machine learning, specifically for recognizing images and processing speech [7]. In recent years, neural network techniques have been applied to natural language processing tasks in place of the linear models commonly used before [7]. There are also different types of neural networks depending on the type of task. Some tasks, like document classification, sentiment classification, and question answering, are better suited to neural networks for which the location of words in a text does not matter, but the sequence of certain words does, in order to identify a topic or category [7]. Next, we’ll dive into two approaches used in natural language processing: the BERT model and the K-means algorithm.
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional natural language processing model that is pre-trained and then fine-tuned for specific tasks [3]. There are also several variants of BERT for different tasks and problems, including models that use captions for videos and models for specific domains like biomedicine [8, 9]. Standard language models are unidirectional (for example, left to right), which limits fine-tuning approaches that work best with context from both sides [3]. BERT’s advantages show in both sentence-level and token-level natural language processing tasks. Sentence-level tasks include inference, prediction, and paraphrasing. Token-level tasks rely on tokenization, which breaks paragraphs into sentences or words in a machine-accessible form, to perform tasks like information extraction, which by definition finds important information in a text and classifies it as a way to sort through unstructured data [3]. BERT is primarily used for the tasks of “question and answering, sentence classification, and sentence-pair regression” [2].
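The tokenization step described above can be illustrated with a small Python sketch that splits a paragraph into sentences, splits each sentence into tokens, and maps the tokens to integer ids a model could consume. This is a simplified stand-in and not BERT's own tokenizer: BERT uses WordPiece subword tokenization, while the regular expressions and helper names here are assumptions made for illustration.

```python
import re

def split_sentences(paragraph):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def split_tokens(sentence):
    """Split a sentence into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def build_vocab(token_lists):
    """Give each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

paragraph = "BERT reads text in both directions. It can then be fine-tuned."
sentences = split_sentences(paragraph)
tokens = [split_tokens(sentence) for sentence in sentences]
vocab = build_vocab(tokens)
ids = [[vocab[token] for token in sentence] for sentence in tokens]
print(sentences)
print(tokens)
print(ids)
```

Sentence-level tasks would work with the items in `sentences`, while token-level tasks would work with the individual entries in `tokens` and their ids.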
What is K-means?
K-means is a simple machine learning algorithm for unstructured or unlabeled data. While this algorithm has applications beyond natural language processing, it is useful for problems within natural language processing like finding similarities between different text documents. K-means partitions the data into k clusters, where k is chosen by the user, and forms “natural” groups. Oftentimes, the data (for example, text documents or Tweets) works better if it has been pre-processed. Pre-processing means removing outliers, so that they do not form a cluster of their own or distort the other groupings, as well as removing “stop words” (for example “it”, “the”, “a”), which do not add meaning to the words or phrases in the data [10]. Two advantages of K-means are that it can reveal previously unknown groups in a data set and that it works well for large data sets (in the millions).
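The following is a minimal sketch of K-means on text, written in plain Python so that each step is visible. It assumes a simple word-count representation of each document, a short illustrative stop word list, and a deterministic "farthest point" choice of starting centroids so that the result is repeatable (random starting points are more common in practice). It does not attempt outlier removal, and the sample documents are made up.

```python
import re

# Short illustrative stop word list; real lists are much longer.
STOP_WORDS = {"it", "the", "a", "is", "and", "this"}

def preprocess(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

def vectorize(docs):
    """Turn each document into a list of word counts over a shared vocabulary."""
    token_lists = [preprocess(doc) for doc in docs]
    vocab = sorted({word for tokens in token_lists for word in tokens})
    return [[tokens.count(word) for word in vocab] for tokens in token_lists]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Return a cluster label (0..k-1) for each vector."""
    # Assumption of this sketch: start from the first vector, then repeatedly
    # add the vector farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "The wine is fruity and sweet",
    "A sweet wine with fruity notes",
    "The phone battery is weak",
    "This phone has a weak battery",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)  # [0, 0, 1, 1]
```

With k set to 2, the two wine descriptions end up in one cluster and the two phone comments in the other, even though no labels were provided.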
Next Steps for Natural Language Processing
Overall, natural language processing includes both older areas of study and emerging areas that need improvement and more investigation. It might not always seem like natural language processing is behind an interaction with a customer service representative, or behind the way data on a public forum is used to quantify the degree of emotions or opinions, but these processes do use machine learning. There are also areas for improvement, like machine translation, where new algorithms such as neural networks are being applied. As natural language processing becomes more accurate, the focus is also shifting toward deriving deeper meanings, especially since the way humans communicate is always evolving. At the same time, it is important to consider how natural language processing in the computational field intersects with advances being made in other disciplines like the social sciences or humanities. As natural language processing practitioners work on developing algorithms, it would be beneficial to consider the ways experts in other fields are also developing theories on oral and written language, especially considering the variety of writing systems that exist.
Works Cited
[1] Sas.com. 2020. What Is Natural Language Processing?. [online] Available at: <https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html> [Accessed 7 October 2020].
[2] Lee R.S.T. (2020) Natural Language Processing. In: Artificial Intelligence in Daily Life. Springer, Singapore. https://doi.org/10.1007/978-981-15-7695-9_6
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. (2018) BERT: pre-training of deep bidirectional transformers for language understanding, CoRR, vol. abs/1810.04805
[4] Lapata M. (2010) Image and Natural Language Processing for Multimedia Information Retrieval. In: Gurrin C. et al. (eds) Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_4
[5] Adnan, Kiran, and Rehan Akbar. (2019) Limitations of Information Extraction Methods and Techniques for Heterogeneous Unstructured Big Data. International Journal of Engineering Business Management, doi:10.1177/1847979019890771
[6] “NLP – Information Retrieval,” Tutorialspoint. [Online]. Available: https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_information_retrieval.htm. [Accessed: 17-Oct-2020].
[7] Y. Goldberg, “A Primer on Neural Network Models for Natural Language Processing,” Journal of Artificial Intelligence Research. [Online]. Available: https://www.jair.org/index.php/jair/article/view/11030. [Accessed: 17-Oct-2020].
[8] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A Joint Model for Video and Language Representation Learning,” arXiv.org, 11-Sep-2019. [Online]. Available: https://arxiv.org/abs/1904.01766. [Accessed: 17-Oct-2020].
[9] J. Lee and K. SungDong, “naver/biobert-pretrained,” GitHub. [Online]. Available: https://github.com/naver/biobert-pretrained. [Accessed: 17-Oct-2020].
[10] K. Kitagawa, “Exploring Wine Descriptions with NLP and kMeans,” Kaggle, 24-Jan-2018. [Online]. Available: https://www.kaggle.com/kitakoj18/exploring-wine-descriptions-with-nlp-and-kmeans. [Accessed: 17-Oct-2020].
[11] GiroScience. (2018). Neuromorphic Chip.