Natural language understanding
NLU concerns the development of computational techniques for the analysis and understanding of human language.
It is at the core of AI methods able to communicate with humans in a natural way, and it seeks to extract the semantics
of human text or utterances.
Most of our work in this direction has been done as part of the European project
EMPATHIC,
aimed at creating personalized virtual coaches that help elderly people live independently.
A video showing the results of the EMPATHIC project can be seen
here.
Within the NLU work-package
we have developed algorithms for intent and topic classification of dialogues in four languages:
English, Spanish, French and Norwegian.
A general description of all the NLU tasks involved in EMPATHIC, as well as a description of tasks for the
other work-packages is presented in
Torres_et_al:2019.
The dialogue act taxonomy designed for the development of a conversational agent for the elderly is described in
Montenegro_et_al:2019.
An investigation of the factors that influence the ability to correctly detect whether a user's utterance has ended,
an important question for user understanding, was presented in
Montenegro_et_al:2021. Two methods to
address the question of
A final presentation
of the EMPATHIC project is given in
Olaso_et_al:2021.
Sentiment and topic classification
Sentiment classification arises as a common NLP problem in different contexts (bot-human dialogues,
social networks, survey analysis, etc.). Usually, it consists of assigning a sentiment label, from a predefined set
of labels, to a text. In other cases, the sentiment is expressed as a score that indicates
where the text lies within a positive-negative range.
In Roman_et_al:2019,
we have proposed algorithms based on genetically evolved Gaussian kernels and applied them to
the SemEval2007 Affective Text shared task dataset, where news headlines were manually annotated by experts into six classes.
In
Montenegro_et_al:2020,
we address the question of transfer learning between related topic classification tasks with a hierarchical relationship.
We introduce and validate a method that exploits this hierarchical structure to implement the transfer.
Text representation and analogy resolution
One of the key questions in ML approaches to NLP is how to construct a rich and expressive vectorial representation of words,
sentences, and documents. Methods that construct word embeddings are extensively applied to learn such representations.
Similarity metrics defined on these representations are equally important for semantic analysis of the text.
Another interesting and difficult NLP problem is analogy resolution, where two analogous pairs of related words are given. The task consists
of predicting one of the four words given the other three.
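The analogy task described above can be illustrated with the widely used vector-offset heuristic: for "a is to b as c is to ?", the answer is the word whose embedding is closest to b - a + c. The sketch below is only an illustration under stated assumptions. The two-dimensional embeddings are invented toy values, and the names `EMBEDDINGS`, `cosine` and `solve_analogy` are hypothetical; this is not the method of any of the cited papers.

```python
import math

# Toy 2-D embeddings, invented for illustration only (assumption).
EMBEDDINGS = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (0.0, 3.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the vector-offset heuristic."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    # The three given words are excluded from the candidate answers.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


print(solve_analogy("man", "woman", "king", EMBEDDINGS))
```

With real embeddings such as word2vec, the same function would apply to higher-dimensional vectors; only the toy dictionary would need to be replaced.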
The evaluation of distance/similarity measures for text representation was conducted in
Magalhaes_et_al:2019,
using the word2vec embedding and graph convolutional networks.
We empirically investigated the impact of thirteen measures of distance/similarity.
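To see why the choice of measure matters, the following minimal sketch ranks the same toy vectors under three common measures. The three measures shown (Euclidean, Manhattan and cosine distance) are chosen here for illustration and are not claimed to be among the thirteen evaluated in the paper; the vectors and the names `VECTORS` and `rank_neighbours` are likewise hypothetical.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus cosine similarity; it ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Toy vectors (assumption): "a" points in the query's direction but is far
# away, while "b" is nearby but points in a different direction.
VECTORS = {"a": (10.0, 0.0), "b": (1.0, 1.0)}
QUERY = (1.0, 0.0)


def rank_neighbours(query, vectors, measure):
    """Return the vector names sorted from nearest to farthest."""
    return sorted(vectors, key=lambda name: measure(query, vectors[name]))


for measure in (euclidean, manhattan, cosine_distance):
    print(measure.__name__, rank_neighbours(QUERY, VECTORS, measure))
```

In this toy setting the Euclidean and Manhattan distances rank "b" first, whereas the cosine distance ranks "a" first, so the nearest neighbour of a word can depend on the measure used.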
In
Santana:2017,
Santana:2021,
we show that it is possible to learn methods for word composition in semantic spaces. Instead of expressing the compositional method as
an algebraic operation, we encode it as a program, which can be linear, nonlinear, or involve more intricate expressions.
A genetic-programming-based algorithm is introduced that works with word embeddings and
finds algebraic expressions for solving different types of analogy tasks.
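The idea of treating a composition method as a program can be sketched as follows: each candidate program maps three word vectors to a predicted fourth, and its fitness is its accuracy on a set of analogy tasks. This is a simplified illustration, not the published algorithm. The evolutionary search itself is omitted and replaced by a fixed, hand-written set of candidates, and the embeddings, the tasks and the names `EMB`, `PROGRAMS`, `nearest` and `accuracy` are all assumptions.

```python
# Toy 2-D embeddings, invented for illustration only (assumption).
EMB = {
    "man": (1.0, 0.0), "woman": (1.0, 1.0),
    "king": (3.0, 0.0), "queen": (3.0, 1.0),
    "boy": (0.5, 0.0), "girl": (0.5, 1.0),
}

# Candidate composition programs: each maps vectors (a, b, c) to a prediction.
# A genetic-programming search would generate and evolve such expressions;
# here they are simply listed by hand.
PROGRAMS = {
    "b - a + c": lambda a, b, c: tuple(y - x + z for x, y, z in zip(a, b, c)),
    "b + c": lambda a, b, c: tuple(y + z for y, z in zip(b, c)),
    "c": lambda a, b, c: c,
}

# Analogy tasks (a, b, c, expected d), also invented for illustration.
TASKS = [
    ("man", "woman", "king", "queen"),
    ("man", "woman", "boy", "girl"),
    ("king", "queen", "boy", "girl"),
]


def nearest(vec, exclude):
    """Word whose embedding is closest (squared Euclidean) to vec."""
    candidates = [w for w in EMB if w not in exclude]
    return min(
        candidates,
        key=lambda w: sum((p - q) ** 2 for p, q in zip(EMB[w], vec)),
    )


def accuracy(program, tasks):
    """Fitness of a program: fraction of analogy tasks it solves."""
    hits = 0
    for a, b, c, d in tasks:
        predicted = nearest(program(EMB[a], EMB[b], EMB[c]), exclude={a, b, c})
        hits += predicted == d
    return hits / len(tasks)


best = max(PROGRAMS, key=lambda name: accuracy(PROGRAMS[name], TASKS))
print(best, accuracy(PROGRAMS[best], TASKS))
```

Representing candidates as callables keeps the fitness function independent of how the expressions are produced, so a search procedure could be plugged in without changing `accuracy`.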
Translation tasks
Translation tasks are common in NLP. They involve both the automatic translation of text between languages, a question
that usually requires word disambiguation, and the evaluation of the translation effort in situations where humans participate in the task.
Two methods for extending or adapting an NLU system to multiple languages were presented in
Montenegro_et_al:2019a. The first method
extracts information from the semantic network
WordNet
to perform data augmentation; the second method is based on parallel corpora from film subtitles.
ML methods for translation editing effort estimation were presented in
Roman_et_al:2019.