
A Class-Balanced Soft-Voting System for Detecting Multi-Generator Machine-Generated Text

By Renhua Gu, Samsung R&D Institute China - Beijing
By Xiangfeng Meng, Samsung R&D Institute China - Beijing

1. Introduction


SemEval-2024 Task 8 presents a challenge of detecting human-written and machine-generated text, with three subtasks covering different detection scenarios. This paper proposes a system that mainly addresses Subtask B, which aims to determine whether a given full text was written by a human or generated by a specific Large Language Model (LLM); it is therefore a multi-class text classification task. Our team AISPACE conducted a systematic study of fine-tuning transformer-based models, including encoder-only, decoder-only, and encoder-decoder models. We compared their performance on this task and found that encoder-only models performed exceptionally well. We also applied a weighted Cross Entropy loss function to address the imbalance in the number of samples across classes. Additionally, we employed a soft-voting strategy over an ensemble of multiple models to enhance the reliability of our predictions. Our system ranked 1st in Subtask B, setting a state-of-the-art benchmark for this new challenge.

2. Method


Based on an analysis of the task, we carried out preliminary studies of several methods and integrated fine-tuning of pre-trained language models, a class-balanced weighted loss function, and a soft-voting model ensemble into our system.

2.1 Data Process

Subtask B shares the same generators, domains, and language as Subtask A. Statistical analysis reveals that Subtask B lacks training data from the PeerRead source, while Subtask A provides the data needed to fill this gap. To strengthen the training data sources, we merged the Subtask A and Subtask B training data into a unified dataset, removing duplicated items and any items present in the dev set. We then relabeled all texts according to the Subtask B label scheme. The resulting training data consists of 127,755 items. It is important to note, however, that unlike the dev data, the training data does not include any PeerRead texts generated by BLOOMZ, which allows us to assess the model's generalization ability on this data.
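
As an illustration, the merging procedure can be sketched as follows, assuming the task data is distributed as JSON Lines files with "text" and "model" fields; the file names and the label mapping below are illustrative assumptions rather than the official values.

    import pandas as pd

    # Assumed Subtask B label scheme; verify against the official label definitions.
    LABEL_MAP = {"human": 0, "chatgpt": 1, "cohere": 2,
                 "davinci": 3, "bloomz": 4, "dolly": 5}

    def load(path):
        # Assumes JSON Lines files with "text" and "model" fields.
        return pd.read_json(path, lines=True)

    train_a = load("subtaskA_train_monolingual.jsonl")  # binary labels, generator name in "model"
    train_b = load("subtaskB_train.jsonl")              # multi-class labels
    dev_b = load("subtaskB_dev.jsonl")

    # Merge both training sets and relabel every text with the Subtask B scheme.
    merged = pd.concat([train_a, train_b], ignore_index=True)
    merged["label"] = merged["model"].str.lower().map(LABEL_MAP)
    merged = merged.dropna(subset=["label"])
    merged["label"] = merged["label"].astype(int)

    # Drop duplicated texts and any text that also appears in the dev set.
    merged = merged.drop_duplicates(subset="text")
    merged = merged[~merged["text"].isin(set(dev_b["text"]))]
    merged.to_json("merged_train.jsonl", orient="records", lines=True)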

2.2 Fine-tuning Transformer-based Models

Fine-tuning pre-trained models is typically an effective approach for downstream tasks. Our system utilizes a series of Transformer-based models, including encoder-only (RoBERTa, DeBERTa, Longformer), decoder-only (XLNet), and encoder-decoder (T5) models, to develop a multi-class classifier through fine-tuning. One purpose is to determine which architecture is better suited to such tasks. Another is to build a larger pool of base models that excel on different generators, which benefits the overall ensemble results.
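
A minimal fine-tuning sketch with the Hugging Face Transformers library is given below; it assumes the unified training file built in Section 2.1, and the backbone choice and hyperparameters are illustrative rather than the exact values used in our system.

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model_name = "roberta-large"  # any of the compared backbones can be substituted
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

    # "merged_train.jsonl" refers to the unified training set from Section 2.1 (assumed name).
    data = load_dataset("json", data_files={"train": "merged_train.jsonl"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    data = data.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="ckpt",
                             per_device_train_batch_size=16,
                             num_train_epochs=3,
                             learning_rate=2e-5)
    trainer = Trainer(model=model, args=args, train_dataset=data["train"])
    trainer.train()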

2.3 Class Balanced Weighted Loss

In this task, each class has a different number of samples; the number of human-written samples is roughly 5-6 times greater than that of the other classes. To address this class imbalance, we employed a weighted loss function during training to balance the contribution of each class's samples to the loss.

For multi-class classification, the commonly used loss function is ordinary cross-entropy (CE). However, when samples are imbalanced across classes, a class-balanced weighted cross-entropy (WCE) significantly improves performance.
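
A minimal sketch of such a weighted loss in PyTorch is shown below; inverse class frequency is one common weighting choice and is used here as an illustrative assumption, and the class counts are placeholders rather than the real statistics of the merged training set.

    import torch
    import torch.nn as nn

    # Placeholder per-class sample counts for the 6 classes; substitute the real
    # counts from the merged training set described in Section 2.1.
    class_counts = torch.tensor([60000., 14000., 14000., 14000., 12000., 14000.])

    # Inverse-frequency weights: rarer classes contribute more to the loss.
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=weights)

    # Dummy batch: logits of shape (batch_size, 6) and gold class ids.
    logits = torch.randn(4, 6)
    labels = torch.tensor([0, 0, 2, 5])
    loss = criterion(logits, labels)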

2.4 Soft Voting

To enhance robustness and stability across generators and domains, we employ an ensemble approach using soft voting over multiple base models. First, we obtain the confusion matrix of each base classifier. Second, we select the models that outperform the others on specific classes. Finally, we average the softmax probability distributions of all selected models and make the final decision based on this averaged distribution.
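
The averaging step can be sketched as follows; the per-class model selection is assumed to have already been done, and the toy probability values are purely illustrative.

    import numpy as np

    def soft_vote(prob_matrices):
        """prob_matrices: list of (n_samples, n_classes) softmax outputs,
        one per selected base model."""
        avg = np.mean(np.stack(prob_matrices, axis=0), axis=0)
        return avg.argmax(axis=-1)  # predicted class id per sample

    # Toy example with two base models, 2 samples, and 6 classes.
    p1 = np.array([[0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
                   [0.20, 0.50, 0.10, 0.10, 0.05, 0.05]])
    p2 = np.array([[0.60, 0.20, 0.05, 0.05, 0.05, 0.05],
                   [0.10, 0.60, 0.10, 0.10, 0.05, 0.05]])
    print(soft_vote([p1, p2]))  # -> [0 1]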

3. Results

3.1 Multiple Methods on Roberta-large Experiment

An ablation study was conducted on the RoBERTa-large baseline. The performance comparison is shown in Table 1. The experimental results indicate that supplementing the data source significantly improves performance, showing that the training data is crucial for supervised fine-tuning in such cases. Additionally, the weighted loss function mitigates the sample imbalance issue.

Table 1. The performance comparison of multiple methods on RoBERTa-large

3.2 Different Base Models Experiments

Further, we fine-tuned a series of transformer-based models to select the most suitable base models. The results in Table 2 show that encoder-only and decoder-only models can achieve top performance, while the encoder-decoder model performs poorly on this task. Regarding input size, 512 tokens outperformed 1024 tokens; including longer input does not contribute to the result.

Table 2. The performance comparison of different base models

3.3 Different Base Model Ensemble Experiments

Finally, we performed a model ensemble using the soft-voting method to ensure robustness and generalization and to reduce the effect of noise. Each base fine-tuned model was selected based on its performance on specific classes. We tested combinations of 2, 3, and 4 types of base models, and the results are shown in Table 3. Compared to the best single model, the ensembled model showed significant improvement, even with the smallest number of ensembled types. Furthermore, as the diversity among the ensembled models increased, the results improved further. Additionally, when the ensembled base models each performed well in every class, the overall result also improved.

Table 3. The performance comparison of different base model ensembles

4. Conclusion


This paper presents a systematic study on detecting machine-generated text from multiple generators and domains. We fine-tuned a series of transformer-based models and found that the encoder-only architecture is better suited for the task. We employed a weighted Cross Entropy loss function to address the sample imbalance. To improve robustness and generalization, various base models were ensembled by the soft-voting method, resulting in 99.46% accuracy on the dev set. In the final test, our system ranked 1st. Moving forward, we plan to explore more widely used LLMs and to enhance our capabilities in few-shot learning and transfer learning for similar tasks.