Samsung R&D Institute Poland at SemEval 2022

Can we build machines capable of human-like inference about facts in a defined knowledge domain? The effort has been ongoing for decades, and it is regarded as an important AI target. Task 9 of the International Workshop on Semantic Evaluation (SemEval) 2022, R2VQ: Competence-based Multimodal Question Answering, takes a step forward. This year, Samsung R&D Institute Poland (SRPOL) won first place in the task, beating the competitors by a large margin.

SemEval is a renowned series of international natural language processing (NLP) research workshops whose mission is to advance the state of the art in semantic analysis across a range of increasingly challenging problems in natural language semantics. The results are presented at a conference of the Association for Computational Linguistics (ACL), the premier international scientific and professional society for people working on computational problems involving human language.

The motivation of this year's Task 9 was to create a competence-based question answering (QA) solution involving rich semantic annotation and aligned text-video objects. It is based on the intuition that textual and visual information complement each other in semantic reasoning.

The Competence-based Multimodal Question Answering task tests how well a system comprehends the semantics of cooking recipes derived from a collection of texts and videos, and requires the system to demonstrate an understanding of how that knowledge is applied. Viewed over a conceptual domain, this ability constitutes a competence. Competence-based evaluation can thus be seen as testing how a system applies the operational knowledge it has for a conceptual domain. It is intended to reflect the linguistic and cognitive competencies that humans have when speaking and reasoning.

The winning solution was built as a hybrid system. One part combined reading comprehension methods that use deep contextual language models with transfer learning; the other part was rule-based and processed the highly structured data provided by the task organizers.
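
To make the hybrid idea concrete, here is a minimal Python sketch of how such a split might be wired together. It is an illustration under stated assumptions, not the SRPOL implementation: the recipe structure, the question pattern, and every function name are hypothetical, and the reading comprehension part is replaced by a simple word-overlap stand-in rather than a real language model.

```python
import re
from typing import Optional

# Hypothetical structured annotation for one recipe (not the real R2VQ format):
# each step carries its text and the ingredients it mentions.
RECIPE = {
    "steps": [
        {"text": "Chop the onion and the garlic.", "ingredients": ["onion", "garlic"]},
        {"text": "Fry the onion in olive oil.", "ingredients": ["onion", "olive oil"]},
    ]
}


def rule_based_answer(question: str, recipe: dict) -> Optional[str]:
    """Answer a counting question directly from the structured annotation.

    Returns None when no rule matches, so the caller can fall back.
    """
    match = re.match(r"how many times is the (.+) used\?", question.lower())
    if match is None:
        return None
    ingredient = match.group(1)
    # Count the steps whose annotation lists the ingredient.
    count = sum(ingredient in step["ingredients"] for step in recipe["steps"])
    return str(count)


def reading_comprehension_answer(question: str, recipe: dict) -> str:
    """Stand-in for a language-model reader (assumption: a real system would
    run a fine-tuned contextual model here). Returns the step text that
    shares the most words with the question."""
    question_words = set(re.findall(r"\w+", question.lower()))

    def overlap(step: dict) -> int:
        step_words = set(re.findall(r"\w+", step["text"].lower()))
        return len(question_words & step_words)

    return max(recipe["steps"], key=overlap)["text"]


def hybrid_answer(question: str, recipe: dict) -> str:
    """Try the rule-based path first; fall back to the reader otherwise."""
    answer = rule_based_answer(question, recipe)
    if answer is not None:
        return answer
    return reading_comprehension_answer(question, recipe)
```

In this sketch, a question such as "How many times is the onion used?" is handled by the rule-based path, while "What do you fry in olive oil?" matches no rule and is passed to the reader stand-in. Trying the rules first is only one possible design choice; the article does not say how the two parts of the winning system were combined.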