Natural Language Processing (NLP) deals with the automated handling of written or spoken sources that are generally unstructured, i.e. texts and conversations from any source and domain of application. It is a complex area of research, as is human language itself, which entails both structural and cognitive problems. Morphology, syntax, semantics and the pragmatic aspects of language are all susceptible to automation, while multilinguality is another aspect of increasing importance. Moreover, the demand for interfaces different from the traditional text-box-based forms is growing.
The main goal of this discipline is to develop tools that deal with the understanding of language, the communication of ideas and reasoning. The so-called Information Society requires not only access to all types of materials, primarily text; it also needs a way to manage the huge amount of new documents that is growing at an exponential rate. Currently, it is impossible to manage this amount of information effectively without the aid of a computer and the proper linguistic tools. Nobody, for example, can imagine their work today without a spell checker, a translator or a search engine.
However, this relatively new area of research represents a major challenge, which necessarily has to be divided into more or less complex subproblems. Our main areas of interest are the following:
Information Retrieval, which includes various tools depending on the expected result: Information Retrieval, Information Extraction, Question Answering and Text Mining.
Semantic Analysis, from word meaning to the detection of emotions and feelings in texts.
Generating Resources for NLP, which are essential for the accurate and efficient performance of language tools.
You can also consult the list of Doctoral Theses defended by our staff.
Information Retrieval (IR) addresses the search for documents, for information within those documents and for metadata describing documents, as well as searches in databases, whether online or on an intranet, for text, images, sound or other types of data.
IR draws on many disciplines, each of which usually contributes partial knowledge from a single perspective: cognitive psychology, information architecture, information design, human information behaviour, linguistics, semiotics, computer science, and library and information science.
On the one hand, it is interesting to point out that a discipline which apparently seems distant, Machine Vision, has a strong presence in this area. The reason is that in recent years one of the most active subfields within Information Retrieval has been Multimodal Information Retrieval. While classical IR has traditionally been based on documents indexed by their textual content, multimodal retrieval uses a set of heterogeneous sources (picture, sound, video and text), searching for ways to optimally combine the heterogeneity of the available sources.
On the other hand, search engines, such as Google, are the most popular Information Retrieval applications. Information Retrieval does not only search for and return documents related to some keywords; in fact, it can be specialised depending on how the retrieved information will be used or on the purpose of the search.
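As an illustration of the keyword-based retrieval just described, the following is a minimal sketch of TF-IDF indexing and cosine-similarity ranking with scikit-learn; the document collection and query are made-up examples, not part of any system described here.

```python
# Minimal keyword-based retrieval sketch using TF-IDF (scikit-learn).
# The documents and the query are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The G8 meeting will take place in Scotland.",
    "Italian restaurants in Zaragoza where you can book a table.",
    "The Ebro river runs through Zaragoza.",
]
query = "Which river runs through Zaragoza?"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)   # index the collection
query_vector = vectorizer.transform([query])       # represent the query in the same space

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_matrix).ravel()
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[idx]), 3), documents[idx])
```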
Although they are popular and widely used, search engines are not perfect tools; there is still room for improvement. The enormous quantity of documents on the network causes major management problems, and the results are not always satisfactory because they are incomplete or irrelevant. For example, the present, non-semantic Web does not offer the possibility of reducing the number of results of a search, or of specialising a search in restricted domains, so as to make searching effective and efficient.
Sometimes we do not need a mere list of documents, but rather the extraction of concrete information, such as that stored in databases, or simply the answer to a question. For these purposes two NLP tasks have been created: Information Extraction and Question Answering. Both of them require large amounts of human and technical resources.
The wide variety of digital formats on the network and the boom in multimedia content have made it necessary to develop or adapt tools for finding information based on the characteristics of these new formats, such as video and image, among others.
Nowadays, commercial multimedia search engines, such as the well-known Youtube or Flickr, base retrieval only on the text associated with an image or a video. The research area of Visual Information Retrieval (VIR) works on the development of such tools.
VIR is a sub-discipline of Information Retrieval (IR). For this reason, it started out using traditional IR systems, without any specific adaptation to VIR, in order to perform searches over the annotations related to images or videos. Thus, the collections indexed by VIR systems are composed of images or videos together with annotations describing them.
Historically, two approaches have been used in the VIR area. In the late 70s, VIR systems were based on the annotations associated with the images; such systems were therefore Text-Based (RIBT). Later, in the early 90s, in an attempt to overcome the dependence of RIBT systems on the existence of textual annotations for indexing an image or video, VIR systems based on the Content of the Image (CBIR) emerged.
In recent years, as the technologies used by CBIR systems have matured, a third approach to the VIR problem has arisen: the combination of text and image. Efforts are focused on finding appropriate methods for combining information from these extremely diverse sources.
Information Extraction (IE) is a type of Information Retrieval whose goal is to identify the relevant information within a set of texts, ignoring the irrelevant information, and to structure it for storage in a database.
The data to be extracted is defined by templates that specify the type of information desired. The construction of these templates is done beforehand and depends on the context in which the user is working: the scenario.
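As a toy illustration of such templates, the sketch below fills a hand-written set of slots with regular expressions; the slot names, patterns and example sentence are hypothetical and only show the idea of turning free text into a database record.

```python
# Illustrative sketch of template-based Information Extraction: a hand-written
# "template" of slots, each filled by a regular expression. All names and
# patterns here are hypothetical examples.
import re

TEMPLATE = {
    "date":     r"\b\d{1,2} (January|February|March|April|May|June|July|"
                r"August|September|October|November|December) \d{4}\b",
    "location": r"\bin [A-Z][a-z]+\b",
    "amount":   r"\b\d+(?:\.\d+)? million euros\b",
}

def fill_template(text):
    """Return one structured record per text, with None for unfilled slots."""
    record = {}
    for slot, pattern in TEMPLATE.items():
        match = re.search(pattern, text)
        record[slot] = match.group(0) if match else None
    return record

text = "The company opened a new plant in Alicante on 12 May 2010, investing 3 million euros."
print(fill_template(text))
```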
From the perspective of language processing systems, Information Extraction involves complete systems that work at different levels: from word recognition to sentence analysis, and from discourse understanding at the sentence level to the analysis of the full text.
Named Entity Recognition (NER) systems can be understood as a specialisation of IE, or as one of its subtasks.
Named Entity Recognition is the detection, within a text, of organizations, places, persons or any term belonging to a predetermined classification set, depending on the future use of the extracted information. These systems are considered a step towards the automatic understanding of a text, since they add knowledge about its content.
They are usually integrated into complete IE or text mining systems, although they also contribute relevant information to the success of many other NLP tasks.
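A short sketch of what a NER component produces, using the spaCy library as one possible off-the-shelf tool (it assumes the en_core_web_sm model has been installed with `python -m spacy download en_core_web_sm`); the sentence is a made-up example.

```python
# Named Entity Recognition sketch with spaCy; the input sentence is hypothetical.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The G8 summit was held in Scotland and chaired by Tony Blair.")

for ent in doc.ents:
    # ent.label_ holds the entity class, e.g. PERSON, ORG, GPE (geo-political entity)
    print(ent.text, ent.label_)
```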
In general, Information Retrieval lacks precise results, since it returns full-text documents depending on the words of the question they contain. Question Answering (QA) is a type of Information Retrieval that answers a specific question such as “Which river runs through Zaragoza?” or “Italian restaurant where I can book a table tonight.” In general, given a certain amount of documents, the system should be able to retrieve answers to questions posed in Natural Language. QA is seen as a step forward in search technology: a retrieval method which requires more complex language technology and which, after retrieving all of the texts relevant to the question, has to find and extract the answer for the user.
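As a sketch of the answer-extraction step that follows retrieval, the snippet below uses the Hugging Face transformers question-answering pipeline; this is only one possible toolkit, not the approach described here, and the question and context passage are hypothetical examples.

```python
# Extractive QA sketch: given a question and an already-retrieved passage,
# locate the answer span. Question and context are made-up examples.
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default extractive QA model
result = qa(
    question="Which river runs through Zaragoza?",
    context="Zaragoza is a Spanish city located on the banks of the Ebro river.",
)
print(result["answer"], round(result["score"], 3))
```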
Text Mining is the process of automatically analysing texts in Natural Language with the aim of discovering information and knowledge that is typically difficult to retrieve. The term is an adaptation of the well-known Data Mining, which analyses large databases looking for information that is not explicitly stored, for example “what trends are seen in the market for predicting future demand for SUVs”. Compared with Data Mining, this task needs to apply language technology, given the unpredictability of documents. One of its most dynamic applications today is bioinformatics. The amount of information generated by research on the human genome probably exceeds that generated by research in any other area. Much of this information comes in the form of scientific articles. It is impossible to manage this volume of data manually, so an automatic process is necessary.
Often, the researcher formulates a hypothesis, then designs an experiment to capture the necessary data and performs experiments to confirm or refute the hypothesis. This is a process that, if rigorously conducted, generates new knowledge. In Data Mining, instead, data are captured and processed in the hope that an appropriate hypothesis will emerge from them; the aim is to retrieve the data that describe or show why we observe a certain configuration and behaviour.
Since the objectives of this task are highly ambitious, all NLP tools can be part of a system with these features. The complexity of such a system is high, since it takes into account all aspects of language, from syntax to semantics.
When writing a text in Natural Language, we convey ideas and feelings. The ultimate goal of the NLP toolkit is the automated understanding of content, its analysis, and its use in the form of new knowledge or as an aid to decision making. Within semantic analysis we can find many problems that are currently the subject of major research efforts. Our lines of research in semantic analysis are described below.
An important part of language processing is the correct interpretation of pronominal reference. While this task sits between syntactic and semantic analysis, its influence on content is essential, since resolving the possible antecedents of pronouns helps to understand the text.
Approaching the automatic comprehension of texts written in Natural Language includes assigning meanings to words, especially polysemous ones. Just as a human does when reading a text, a system for the automatic resolution of lexical ambiguity (WSD, from Word Sense Disambiguation) determines the meaning of words depending on the context. For example, “bank” is interpreted in different ways in the sentences “I go to the bank to pay the bill” and “We sat on the grassy bank of the river”.
The importance of this task is clear if we consider that an Internet search for documents related to the word “bank” could be refined if we distinguish between the possible meanings of this word and choose the one we are interested in. Machine Translation is another task which benefits from WSD, as polysemous words often have different translations depending on their meaning in a given text.
The approaches to the task can be summarised as supervised and unsupervised, that is, based or not on the use of corpora annotated with word senses. Usually, supervised methods employ large sense-annotated corpora and Machine Learning techniques.
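By way of illustration, the snippet below runs the simplified Lesk algorithm shipped with NLTK, a knowledge-based (unsupervised) baseline rather than the supervised corpus-based methods mentioned above; the example sentence reuses the “bank” case and the WordNet data must be downloaded first.

```python
# Word Sense Disambiguation sketch with NLTK's simplified Lesk algorithm.
import nltk
nltk.download("wordnet", quiet=True)  # Lesk relies on WordNet glosses
from nltk.wsd import lesk

context = "I go to the bank to pay the bill".split()
sense = lesk(context, "bank", pos="n")   # pick the noun sense best matching the context
if sense is not None:
    print(sense.name(), "-", sense.definition())
```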
The meaning of a sentence is not only based on the words that constitute it, but also on the order, the grouping and the relations between them.
A semantic role can be defined as the relationship between a syntactic constituent (an argument of the verb) and a predicate. It identifies the function of an argument of the verb within the event that this verb expresses: for example, an agent, a beneficiary, or adjuncts such as cause or manner.
Consider the sentence “the executives gave a standing ovation to the chief”. The words of this sentence are grouped into three small syntactic constituents, each with a different role. The constituent “the executives” has the agent role, and the constituents “to the chief” and “a standing ovation” have the recipient and theme roles respectively.
The information obtained from this analysis is fundamental for other NLP tasks such as Question Answering and Information Extraction. In the case of Question Answering systems, semantic roles could help answer questions such as who or when, for example.
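A toy representation of the role analysis of the example sentence above is sketched below; the role labels follow the usual agent/theme/recipient terminology and do not reproduce the output format of any particular SRL toolkit.

```python
# Hypothetical structured output of semantic role labelling for the example sentence.
srl_analysis = {
    "predicate": "gave",
    "arguments": [
        {"constituent": "the executives",     "role": "agent"},
        {"constituent": "a standing ovation", "role": "theme"},
        {"constituent": "to the chief",       "role": "recipient"},
    ],
}

# Such structured output lets a QA system answer "who gave the ovation?"
# by looking up the argument with the agent role.
who = [a["constituent"] for a in srl_analysis["arguments"] if a["role"] == "agent"]
print(who)
```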
The term “textual entailment” is used to indicate the situation in which the semantics of one Natural Language text can be inferred from the semantics of another; more specifically, when the truth of a statement, the text, entails the truth of another statement, called the hypothesis. Consider the following two texts:
1. The three-day meeting of the G8 will take place in Scotland.
2. The meeting of the Group of Eight will last three days.
It is clear that the semantics of the second can be inferred from the semantics of the first: there is textual entailment between the two texts (the first entails the second). It can also be observed that recognising textual entailment requires both lexical processing (e.g., handling synonyms such as “gathering” and “meeting”, or between “G8” and “the Group of Eight”) and syntactic processing.
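A deliberately simple lexical-overlap baseline for recognising entailment is sketched below: the hypothesis is judged entailed if enough of its content words appear in the text, after expanding the text with a tiny synonym table. The threshold and the synonym table are illustrative assumptions; as noted above, real systems also need syntactic and deeper semantic processing.

```python
# Toy Recognising Textual Entailment baseline based on word overlap.
STOPWORDS = {"the", "of", "in", "will", "a", "an", "to"}
SYNONYMS = {"g8": {"group", "eight"}}  # toy lexical knowledge, an assumption

def content_words(sentence):
    return {w.strip(".,").lower() for w in sentence.split()} - STOPWORDS

def entails(text, hypothesis, threshold=0.5):
    text_words = content_words(text)
    for word in list(text_words):           # expand with synonyms of the text's words
        text_words |= SYNONYMS.get(word, set())
    hyp_words = content_words(hypothesis)
    overlap = len(hyp_words & text_words) / len(hyp_words)
    return overlap >= threshold

text = "The three-day meeting of the G8 will take place in Scotland."
hypothesis = "The meeting of the Group of Eight will last three days."
print(entails(text, hypothesis))
```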
Another challenge of NLP is the fact that the same ideas can be expressed in different ways. If we are talking about words or short phrases, we speak of synonymy; if the expression is more complex, a complete sentence or a set of sentences, we are referring to paraphrase.
The importance of paraphrasing is greater if we consider that it could be applied to improve current Information Retrieval, Question Answering or Information Extraction systems, in which the outcome should not rely solely on the appearance of the exact words of the search or question.
The automatic analysis of time and space is becoming a popular challenge, since much of the information we handle is of this type. The challenge derives from the need to combine the mapping of the text onto a more precise representation of temporal entities and their relationships, according to a certain ontology, with reasoning capabilities that combine common-sense inference and temporal axioms.
A common thought is that this mainly causes problems in QA systems, where certain types of questions ask “when” or “where”. However, it also affects many other areas of NLP and Artificial Intelligence.
Throughout history, humans have used language to convey knowledge, feelings and emotions and to communicate with other humans, and this function of language has developed in oral, graphic, written and signed forms.
During the eighties, Cognitive Linguistics argued and demonstrated that metaphorical, non-literal meaning is not just a common part of our linguistic ability; it is also a key issue in human cognition (Lakoff & Johnson 1980). However, attempts to submit it to examination by computational means have been rather scarce. Within NLP, most work has been based on the literal meanings of words (WSD). Only in recent years have projects been started to process non-literal meanings, at both the lexical and the sentence level. If we seek to account for the meaning of a text in general from a computational point of view, assuming the cognitive hypothesis that every text contains both metaphorical and literal meanings (produced by the very functioning of human cognition), an NLP system must be able to detect and interpret such metaphorical uses correctly. This is the primary objective of this research area: the detection and interpretation of metaphors and non-literal uses of language.
The Semantic Web is an extension of the Web that carries more meaning, in which any Internet user can find answers to their questions more quickly and easily thanks to better-defined information. By endowing the Web with more meaning and, therefore, more semantics, solutions can be obtained for common problems in information search through the use of a common infrastructure, by means of which information can easily be shared, processed and transferred. This extended, meaning-based Web relies on universal languages that solve the problems caused by a Web lacking semantics, in which access to information at times becomes a difficult and frustrating task.
The Web has profoundly changed the way we communicate, do business and perform our work. Communication with virtually everyone, at any time and at low cost, is possible today. We can perform financial transactions over the Internet. We have access to millions of resources, regardless of our geographical location and language. All these factors have contributed to the success of the Web. However, these same factors have also led to its major problems: information overload and the heterogeneity of information sources, with the attendant problem of interoperability. The Semantic Web helps to solve these two problems by allowing users to delegate tasks to software. With the Semantic Web, software is capable of processing content, reasoning with it, combining it and performing logical deductions to solve everyday problems automatically.
The Semantic Web is a thriving area born at the crossroads of Artificial Intelligence and web technologies, which proposes new techniques and paradigms for knowledge representation in order to facilitate locating, sharing and integrating resources via the Web. These new techniques are based on the introduction of explicit semantic knowledge that describes and/or structures the available information and services, so that it can be processed automatically by a program. One of the main pillars of this vision is the notion of ontology as the key tool for reaching a shared understanding between the parties (users, developers, programs) involved in this common knowledge.
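As a minimal sketch of what "explicit semantic knowledge processable by a program" can look like, the snippet below builds a few RDF triples with the rdflib library; the namespace, resources and properties are hypothetical examples, not an existing ontology.

```python
# Minimal sketch of explicit semantic knowledge as RDF triples, using rdflib.
# The example.org namespace and all resources below are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/ontology#")
g = Graph()
g.bind("ex", EX)

# City is a class; Zaragoza is an instance of it, located on the Ebro river.
g.add((EX.City, RDF.type, RDFS.Class))
g.add((EX.Zaragoza, RDF.type, EX.City))
g.add((EX.Zaragoza, EX.locatedOn, EX.Ebro))
g.add((EX.Zaragoza, RDFS.label, Literal("Zaragoza")))

print(g.serialize(format="turtle"))
```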
Authorship Identification (AI) attempts to classify documents by author. It captures the profile of the author through idiosyncratic markers that are not under his or her conscious control. AI is a multidisciplinary area that brings together different research fields (linguistics, law and information technology) working toward a common goal: to automate the linguistic characterisation of the author in legal and judicial settings.
Among the problems hampering this work is the potential complexity of reconstructing the linguistic profile of the author, which may vary depending on the genre or theme, on the time when the text was produced, or even across different sections of the same document. It is also difficult to identify the presence of each author in works written collaboratively. Besides these aspects, one of the major drawbacks in AI is the lack of standard corpora with which to assess the improvements made and compare them with existing techniques.
Regarding the techniques applied, the first studies were based on statistical techniques, and Machine Learning techniques were introduced later. Other techniques that have begun to be applied in recent years, not without controversy, are based on compression algorithms. Compression techniques look for common strings within the text and encode the longest ones with the smallest possible number of bits.
In order to characterise the author's writing, on the one hand, a number of markers at different linguistic levels (token-level features, syntax, vocabulary richness, word frequencies, grammatical errors, etc.) are used together with statistical methods and Machine Learning. On the other hand, compression algorithms are applied directly to the documents; due to their nature, they are able to capture hallmarks of the author's writing from the full text by themselves, without the need for the prior extraction of style markers.
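One common compression-based measure, sketched below under the assumption that it is representative of this family of techniques, is the Normalised Compression Distance computed with zlib: the disputed text is attributed to whichever reference author it compresses best with. The text snippets are placeholders, not real corpus data.

```python
# Compression-based authorship sketch using the Normalised Compression Distance (NCD).
# A smaller distance suggests a more similar writing style.
import zlib

def c(text):
    return len(zlib.compress(text.encode("utf-8")))

def ncd(x, y):
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

known_author = "Text known to be written by the candidate author..."
other_author = "Text written by a different author..."
disputed = "Disputed text whose authorship we want to attribute..."

print("candidate author:", round(ncd(known_author, disputed), 3))
print("other author:    ", round(ncd(other_author, disputed), 3))
```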
The main objective of this task is to obtain a reductive transformation of a source text through content condensation, by selecting and/or generalising what is important in the source. This is not a novel task: research on Text Summarization began in the late fifties, when techniques such as word frequency or position in the text were analysed in order to produce automatic summaries by computer with no human intervention. Since then, many different techniques have been developed and used in Text Summarization, and different approaches can be found in the literature. However, this task has experienced great development in recent years, mostly due to the rapid growth of the Internet. Users have to deal with a huge amount of documents in many formats. As a consequence, we need methods and tools to present all this information in a clear and concise way, easy to read and to understand, thus allowing users to save time and resources.
The approach adopted to produce a summary is one of the main issues of this task. In this respect, a summary can be an extract (i.e. a selection of “significant” sentences of the document) or an abstract, when the summary can serve as a substitute for the original document. At present, abstract generation is a big challenge, and therefore semantic knowledge, as well as other NLP techniques such as Textual Entailment, Anaphora Resolution or Information Extraction, is essential for the proper achievement of the task.
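In the spirit of the early word-frequency approaches mentioned above, the sketch below scores each sentence by the frequency of its content words and keeps the top-scoring ones; the stopword list, sentence splitting and example text are simplified assumptions, not a production summariser.

```python
# Minimal extractive summarisation sketch based on word-frequency sentence scoring.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it", "for", "on", "by"}

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-zA-Z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-zA-Z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # keep the n best-scoring sentences, preserving their original order
    best = sorted(sorted(sentences, key=score, reverse=True)[:n_sentences], key=sentences.index)
    return " ".join(best)

print(summarize("Automatic summarisation condenses a source text. "
                "Early systems scored sentences by word frequency and position. "
                "Frequency-based scoring is still a strong extractive baseline."))
```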
All NLP tasks rely on resources tailored to support their success to varying degrees. These can be collections of examples, ontologies, dictionaries, etc., which are essential for solving a particular task, depending on the approach selected.
Most Natural Language Processing tasks are addressed by Machine Learning methods, which require a sufficient number of labelled and unlabelled examples. That is why collecting texts of very different natures and subsequently labelling them with linguistic information is essential if we want to obtain robust and efficient software tools. The GPLSI has participated in several projects involving the building of parallel corpora syntactically and semantically annotated in Spanish, Catalan and Basque.
An ontology is a representation of our knowledge of the world; it is more than a dictionary, in which only the possible meanings of a word are outlined, and more than a taxonomy, in which only a few relations between terms are shown. An ontology is a set of interrelated concepts that attempts to systematise and correlate our knowledge, from the most abstract to the most concrete. It is a vital tool for new proposals for sharing information over the Internet, such as the Semantic Web or Web 2.0, and also for NLP tools such as Question Answering systems.
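As a small illustration of the difference between a dictionary listing of senses and richer conceptual relations, the snippet below queries WordNet through NLTK (assuming the WordNet data has been downloaded): beyond enumerating senses, it links concepts through hypernyms and meronyms, which is closer to the idea of an ontology.

```python
# Exploring lexical-semantic relations in WordNet via NLTK.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

bank = wn.synset("bank.n.01")   # one sense of "bank": sloping land beside water
print(bank.definition())
print([h.name() for h in bank.hypernyms()])                        # more abstract concepts above it
print([m.name() for m in wn.synset("car.n.01").part_meronyms()])   # parts of a car
```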