An AL-R Model for Multilingual Complex Named Entity Recognition

By Haojie Zhang, Samsung R&D Institute China-Beijing
By Xiao Li, Samsung R&D Institute China-Beijing


This paper describes our system for SemEval-2023 Task 2, Multilingual Complex Named Entity Recognition (MultiCoNER II). Our team, Samsung Research China - Beijing, proposes an AL-R (Adjustable Loss RoBERTa) model to improve the recognition of short and complex entities under three challenges: long-tail data distribution, entities outside the knowledge base, and noisy inputs. We first employ an adjustable dice loss as the optimization objective to address the long-tail data distribution; it also proves robust to noise, especially in combating confusion between fine-grained labels. In addition, we develop our own knowledge enhancement tool to provide related context for short inputs and to address the out-of-knowledge-base issue. Experiments verify the validity of our approaches. In the official test results, our system ranked 2nd on the English track of this task.


We introduce a Knowledge-enhancement and Adjustable-loss Named Entity Recognition system to overcome the main challenges that received much attention in this shared task. First, to alleviate the insufficient context of the input texts, which is critical for distinguishing and recognizing entities, especially in short texts, we feed the inputs into our Wikipedia-based tool to obtain knowledge-enhanced context. Second, to combat the performance degradation caused by the long-tail data distribution, we adopt a deliberately designed dice loss, here called the adjustable loss, as the optimization objective. Meanwhile, we introduce label smoothing to reduce overfitting on fine-grained labels that belong to the same coarse label. The architecture of our scheme is shown in Figure 1.

Figure 1.  The illustration of our NER scheme

1. Knowledge-enhancement Tool
Knowledge augmentation has demonstrated its effectiveness for coarse-grained entity recognition on short texts (Wang et al., 2022). For the fine-grained entity recognition task, we downloaded the latest version of the Wikipedia dump and built two separate knowledge retrieval modules using ElasticSearch: a sentence retrieval module and an entity retrieval module. Our knowledge-enhancement tool is illustrated in Figure 2.

2. Sentence Retrieval Knowledge Base (Se-Kg)
The sentence retrieval knowledge base consists of two fields: a sentence field and a paragraph field. We create an index entry for each sentence in the Wikipedia dump, storing the sentence in the sentence field and the paragraph that contains it in the paragraph field. Wiki anchors are marked to take advantage of Wikipedia's rich anchor information. Entries related to the dataset texts are retrieved through the sentence field, and the content of the paragraph field is used as sentence knowledge enhancement.

Entity Retrieval Knowledge Base (En-Kg). The entity retrieval knowledge base has two fields: a title field and a paragraph field. We use the title of each page in the Wikipedia dump as the title field and store the summary of the page in the paragraph field. The entities of the dataset are matched against the title field, and the paragraph of the matched entry serves as entity knowledge enhancement. If the matching fails, no entity knowledge augmentation is performed.
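The two lookups can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual tool: a small in-memory list and dictionary stand in for the ElasticSearch indexes, token overlap stands in for the search engine's relevance score, and the names `SentenceDoc`, `retrieve_sentence_knowledge`, `retrieve_entity_knowledge`, `enhance`, and the `[SEP]` joining scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SentenceDoc:
    sentence: str   # sentence field: one indexed Wikipedia sentence
    paragraph: str  # paragraph field: the paragraph containing that sentence


def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude analyzer)."""
    return set(text.lower().split())


def retrieve_sentence_knowledge(query, se_kg, top_k=1):
    """Rank Se-Kg entries by token overlap with the query.

    Token overlap is an assumed stand-in for a real relevance score.
    Returns the paragraph fields of the best-matching entries.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(doc.sentence)), doc) for doc in se_kg]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc.paragraph for _, doc in scored[:top_k]]


def retrieve_entity_knowledge(entity, en_kg):
    """Match an entity against the title field; None when matching fails."""
    return en_kg.get(entity.lower())


def enhance(text, entities, se_kg, en_kg):
    """Append retrieved knowledge to the input text (joining scheme assumed)."""
    contexts = retrieve_sentence_knowledge(text, se_kg)
    for entity in entities:
        summary = retrieve_entity_knowledge(entity, en_kg)
        if summary is not None:  # no augmentation if the title match fails
            contexts.append(summary)
    return " [SEP] ".join([text] + contexts)


# Toy data, purely illustrative.
se_kg = [
    SentenceDoc("the alpha river flows north", "PARA-A"),
    SentenceDoc("beta town hosts a fair", "PARA-B"),
]
en_kg = {"alpha river": "SUMMARY-ALPHA"}  # title field -> paragraph field
print(enhance("where is the alpha river", ["alpha river", "gamma"], se_kg, en_kg))
```

In this toy run, the entity "gamma" has no matching title, so it contributes no context, mirroring the rule that a failed match adds no entity knowledge.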

Figure 2.  The illustration of our Knowledge-enhancement Tool.

3. Adjustable-loss
We propose an adjustable-loss RoBERTa model, which leverages RoBERTa as the backbone to encode the text input. In sequence tagging, it is common to employ a classical conditional random field (CRF) layer to capture the dependencies among output labels, which has proved effective in boosting performance (Lafferty et al., 2001; Wang et al., 2022). In our system, however, instead of adopting a CRF loss as the optimization objective, we employ an adjustable dice loss to combat the long-tail distribution of the input data. To speed up the convergence of training, we use a variant of dice loss.
Dice Loss. It derives from the dice coefficient (DSC), which is generally used to evaluate the similarity of two sets. Dice loss is an F1-oriented loss, consistent with the final evaluation metric, the overall macro F1 score, so it is a natural candidate for the optimization objective. The final loss is as follows:
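For orientation only, the Python sketch below implements the self-adjusting dice loss in the form given by Li et al. (2019b), DSC = (2(1-p)^alpha p y + gamma) / ((1-p)^alpha p + y + gamma), with loss = 1 - DSC. It should not be read as the exact final loss of AL-R, whose variant may differ. The function name and the defaults for `alpha` and `gamma` are assumptions made for this illustration.

```python
def self_adjusting_dice_loss(probs, targets, alpha=1.0, gamma=1.0):
    """Mean of (1 - DSC) over tokens, for a single class.

    probs:   predicted probability of the class for each token
    targets: 1 if the token's gold label is the class, else 0
    alpha:   exponent of the (1 - p) factor; alpha=0 gives a plain
             smoothed dice loss (assumed default, not from the paper)
    gamma:   smoothing constant added to numerator and denominator
    """
    losses = []
    for p, y in zip(probs, targets):
        weight = (1.0 - p) ** alpha  # shrinks the term for confident tokens
        dsc = (2.0 * weight * p * y + gamma) / (weight * p + y + gamma)
        losses.append(1.0 - dsc)
    return sum(losses) / len(losses)


# A confidently rejected negative (p close to 0, y = 0) contributes almost nothing,
# so the many easy negatives of a long-tail dataset do not dominate the loss.
print(self_adjusting_dice_loss([0.01], [0]))
# A half-confident positive still contributes a sizeable term.
print(self_adjusting_dice_loss([0.5], [1]))
```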

4. Label Smoothing
In our experiments, we find that fine-grained tail labels belonging to the same coarse tail label, such as PER, are easily confused during inference. Our intuition is that the model becomes overconfident in its predictions, which leads to overfitting. To overcome this problem, we introduce a label smoothing (LS) scheme. More specifically, we add uniform noise to the original one-hot distribution while keeping it a valid probability distribution.
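A minimal sketch of this smoothing step, assuming the common formulation in which a fraction `epsilon` of the probability mass is spread uniformly over all labels. The function name and the value `epsilon=0.1` are assumptions for illustration, not settings reported here.

```python
def smooth_labels(label_index, num_labels, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution.

    The result is (1 - epsilon) * one_hot + epsilon / num_labels,
    which still sums to 1 and is therefore a valid distribution.
    """
    uniform = epsilon / num_labels
    dist = [uniform] * num_labels
    dist[label_index] += 1.0 - epsilon
    return dist


# Four labels, gold label at index 1.
target = smooth_labels(1, 4, epsilon=0.1)
print(target, sum(target))
```

Because the gold label no longer carries probability 1, the model is penalized less for assigning some mass to sibling fine-grained labels, which is the overconfidence effect described above.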

5. Ensemble Method
Ensemble models are used to boost the final prediction performance on the English track. We train several models with different random seeds. Meanwhile, to maintain heterogeneity among the models, we introduce Span-BERT-like models such as ERNIE from Baidu. Because ERNIE is pre-trained with span masking tasks such as phrase masking and entity masking (Sun et al., 2019), it benefits entity extraction as one component of the ensemble. In the end, we decide on the entities by majority voting.
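Majority voting over entities can be sketched as follows. The span representation `(start, end, label)`, the strict more-than-half threshold, and the function name are assumptions for this illustration; the exact voting rule of the system is not specified above.

```python
from collections import Counter


def majority_vote(predictions):
    """Keep the entity spans predicted by more than half of the models.

    predictions: one collection of (start, end, label) spans per model.
    With a strict majority, two different labels for the same span can
    never both survive; overlapping spans with different boundaries are
    not resolved by this simple sketch.
    """
    votes = Counter(span for spans in predictions for span in set(spans))
    threshold = len(predictions) / 2
    return sorted(span for span, count in votes.items() if count > threshold)


# Three hypothetical models; label names are illustrative only.
model_outputs = [
    {(0, 2, "Artist"), (5, 6, "Facility")},
    {(0, 2, "Artist")},
    {(0, 2, "Athlete"), (5, 6, "Facility")},
]
print(majority_vote(model_outputs))
```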

Experimental Results

Table 1.  Results of the baseline model and our models. AL-R-E4F-Ensemble achieves the best performance among all our models, with a macro-F1 of 0.8309. This shows that model ensembling makes the system more robust, enabling it to handle complex entity recognition scenarios.


In this paper, we build a system for the complex fine-grained NER task. Our system ensembles multiple AL-R models, each combining a RoBERTa-based pre-trained model with an adjustable dice loss, to overcome the issues of long-tailed data and noisy entities. We construct a knowledge retrieval module that provides sentence knowledge augmentation and entity knowledge augmentation for low-context data. In the MultiCoNER II task, our system ranks 2nd on the English track.



[1]. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.

[2]. Leslie A Baxter. 2004. Relationships as dialogues. Personal relationships, 11(1):1–22.

[3]. Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications, 114:34–45.

[4]. Thomas G Dietterich. 2000. Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21–23, 2000 Proceedings 1, pages 1–15. Springer.

[5]. Markus Eberts and Adrian Ulges. 2019. Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755.

[6]. Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2019b. Dice loss for data-imbalanced NLP tasks. arXiv preprint arXiv:1911.02855.

[7]. Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021. Improving named entity recognition by external context retrieving and cooperative learning. arXiv preprint arXiv:2105.03654.

[8]. Xinyu Wang, Yongliang Shen, Jiong Cai, Tao Wang, Xiaobin Wang, Pengjun Xie, Fei Huang, Weiming Lu, Yueting Zhuang, Kewei Tu, et al. 2022. DAMO-NLP at SemEval-2022 Task 11: A knowledge-based system for multilingual named entity recognition. arXiv preprint arXiv:2203.00545.