Closed-domain natural language approaches: methods & applications
Author(s): Moreo Fernández, Alejandro
Contributor(s): Castro Peña, Juan Luis (Universidad de Granada)
Zurita López, José Manuel (Universidad de Granada)
Abstract: Natural Language (NL) technologies are attracting increasing interest, justified by the large number of potential applications aimed at satisfying the constant information needs of today's society. NL is, however, too complex a communication mechanism to be tackled directly from a computational point of view. For this reason, much of the effort devoted to its study has been made in the field of closed domains. In contrast to open-domain systems, the information handled by closed-domain approaches concerns a delimited field of knowledge and can, in addition, be privately managed. Both retrieval and interpretation mechanisms can therefore take advantage of these a priori conditions to offer a better service. Systems of this kind are of particular interest to companies and organizations as a potential solution for accessing and managing their own knowledge. In Artificial Intelligence (AI), and more specifically in so-called Expert Systems, there is usually a clear separation between the automatic processes and the knowledge. Our hypothesis focuses on this separation. The domain knowledge is often a predefined resource, and the automatic processes should be tuned accordingly. Being aware of this fact, AI processes can apply more sophisticated strategies to better approach the problem. Occasionally, one has the time and means needed to create additional knowledge resources that better exploit the domain information. This is, however, a costly task that could be alleviated by means of certain methodologies that could, in addition, improve the final performance. Our aim is therefore to focus on the different automatic processes that can be carried out to develop rich NL applications on the basis of the various knowledge levels. The Internet represents an endless source of knowledge and a means of communication that has become strongly integrated into our society.
NL technologies can be regarded as a means of managing information in an interpretable and understandable manner. A large number of NL technologies already actively influence our daily lives [Seb05, PL08, CZ10]. However, society's information needs go further, and new and better solutions are constantly demanded. There is thus a large number of open problems of social interest for which there is still considerable room for improvement. These problems are precisely the motivation of this Thesis. This PhD dissertation aims to identify and analyze the main difficulties that arise in closed-domain NL approaches, in order to propose valuable scientific contributions to the state of the art in the form of technological solutions. The range of interesting open problems in the field of NL technologies is huge. Nevertheless, the analytic perspective we propose, based on the knowledge levels, allows these methods to be grouped naturally. Our methodology consists of approaching each group through a representative problem of general interest. Our goal is two-fold: (i) to propose techniques and solutions that exploit the available knowledge in order to better approach the selected problem, and (ii) to draw conclusions that make our methods potentially useful and extensible to the remaining problems in the group. In contrast to open-domain systems, the delimitation of the domain imposes an implicit restriction on the system at hand, but it is also a controlled knowledge resource that can be efficiently exploited in favor of performance. We can differentiate two kinds of resources in a closed domain: the knowledge and the meta-knowledge. On the one hand, the knowledge refers to all data explicitly stored in the domain. It is therefore the main source of information that justifies and brings interest to the application. In this regard, knowledge can be structured, e.g. a database, or unstructured, e.g. a free-text document.
On the other hand, meta-knowledge refers to additional information explaining the data. This resource, if present, can be used to perform semantic inferences through automatic reasoning. Ontologies, lexicons collecting the domain terminology, and keyword sets highlighting the main terms in a domain are some examples of meta-knowledge. We start with a preliminary study of closed-domain NL approaches from the lowest level of knowledge resources available: unstructured knowledge with no meta-knowledge. This starting point groups together all those systems dealing with large volumes of documents without any sort of additional information. In this study, we investigate what makes a term relevant to a given domain, and which are the main terms in a document. The interest of these open questions lies in the large number of applications that rely on the concept of term relevance. Text Classification [Seb05], Automatic Text Summarization [NM01], and Text Retrieval [Kor97] are just some examples. We adopt here the current tendency whereby the problem is usually stated as a Feature Selection problem [YP97]. Specifically, we rely on so-called filtering methods and feature selection policies to test the effectiveness of positive correlation as a predominant criterion for term relevance. This study provides the necessary background upon which our subsequent analyses are conducted. Our analysis then focuses on the intermediate levels of knowledge, paying special attention to the modeling and refinement tasks. In this regard, one of the main concerns an NL process should deal with is the identification of the main topics in the discourse, that is, what the text is about and which are the main entities involved. To this end, we exploit a helpful resource for closed domains: hierarchical lexicons, that is, structured dictionaries collecting terminology and related concepts.
Concretely, we address here a problem in high demand: the automatic analysis and summarization of opinions expressed in texts (also known as Sentiment Analysis or Opinion Mining [PL08, CZ10]), an issue of the utmost interest for companies' strategies, political campaigns, and decision making for potential users. By means of specific analysis techniques, we try to offer solutions to linguistic issues such as ambiguity, ellipsis, and anaphora resolution. We investigate how to improve the automatic analysis of opinions on news items by previously delimiting the context [MRCZ12b]. Our study continues with collaborative systems, where knowledge is usually structured in separate Information Units. These systems result from the development of the so-called Web 2.0 and represent an interesting field for emerging paradigms such as e-learning [AHBLGS+12, DSS+02]. Users are no longer mere consumers of information, but can also take an active part as producers of new knowledge. In this case, we investigate how to create scalable collaborative systems in which the shared knowledge grows according to users' actual information needs [MNCZ12]. More specifically, we offer techniques for alleviating the costs associated with the creation and maintenance of collaborative systems [MRCZ12a]. In pursuing this goal, we investigate how to bring interpretability to regular expressions, a formal mechanism broadly used in the context of NL technologies. We end our study with an analysis of NL approaches from the highest knowledge levels: structured knowledge with a suitable meta-knowledge model available. In this case, we focus on the learning phenomenon from the point of view of reasoning by analogy. Reaching an advanced interpretation of NL is a fundamental step for many NL technologies such as Virtual Assistants [ELC12], Virtual Simulated Patients [SPG09], or Natural Language Interfaces [PRGBAL+13].
The ultimate goal in this regard is to allow users to interact with the system by means of NL in order to facilitate access to information. In this respect, formal grammars represent an effective tool for NL parsing. We present a new grammar inference method based on reasoning by analogy that allows the system to conjecture about the language when facing unseen expressions or terms. Through a concrete application, we investigate new solutions to the information access problem, paying attention to some relevant linguistic problems. Providing an exhaustive analysis of all NL technologies and approaches from a theoretical point of view falls beyond the scope of this Thesis. We are rather interested in proposing concrete solutions to real problems in the form of computer applications. This PhD Thesis therefore results from an engineering process of a scientific nature. With this, we intend to emphasize the applied character of this work. On the one hand, most of the results obtained through this research have ultimately been developed as commercial applications. On the other hand, most of the methods and theoretical advances presented in this dissertation have been published in scientific journals or conferences.
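
As an illustrative aside on the term-relevance study summarized in the abstract (filter-style feature selection with positive correlation as the criterion), the following Python sketch scores terms with a chi-square statistic and discards terms that are not positively correlated with the target category. The choice of chi-square, the function names `positive_chi2` and `select_terms`, the whitespace tokenizer, and the toy corpus are all assumptions made for illustration. The abstract does not state which filter functions or selection policies the thesis actually uses.

```python
from collections import Counter


def positive_chi2(docs, labels, category):
    """Chi-square score of each term with respect to `category`,
    keeping only positively correlated terms (illustrative filter)."""
    n = len(docs)
    n_cat = sum(1 for y in labels if y == category)
    df_all = Counter()  # number of documents containing each term
    df_cat = Counter()  # same count, restricted to `category`
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())  # naive tokenizer (assumption)
        df_all.update(terms)
        if y == category:
            df_cat.update(terms)

    scores = {}
    for term, n_t in df_all.items():
        a = df_cat[term]   # in category, contains term
        b = n_t - a        # outside category, contains term
        c = n_cat - a      # in category, lacks term
        d = n - n_cat - b  # outside category, lacks term
        # Positive correlation means a*d > b*c; everything else is dropped.
        # This condition also guarantees that all four marginals are > 0.
        if a * d <= b * c:
            continue
        scores[term] = n * (a * d - b * c) ** 2 / (
            (a + b) * (c + d) * (a + c) * (b + d)
        )
    return scores


def select_terms(docs, labels, category, k):
    """Return the k best-scoring terms; ties are broken alphabetically."""
    scores = positive_chi2(docs, labels, category)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    docs = [
        "cheap flights to madrid",
        "cheap hotel in madrid",
        "stock market falls",
        "market prices rise today",
    ]
    labels = ["travel", "travel", "finance", "finance"]
    print(select_terms(docs, labels, "travel", 2))
```

In this toy corpus, terms such as "market" occur only outside the "travel" category, so the one-sided test removes them before ranking. A two-sided chi-square filter would instead keep them as strong negative indicators.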