Samsung R&D Institute Ukraine's Document AI Team Secures Top Honors in ICDAR2025 Competition

Samsung demonstrated technical superiority in the field of Document AI by winning both tasks in the CoLiE Competition of the prestigious ICDAR 2025.

Accurate temporal analysis of texts is a cornerstone of advancing natural language processing (NLP) and Document AI. Competitions like the ICDAR 2025 Competition on Automatic Classification of Literary Epochs (CoLiE) provide a critical platform to benchmark cutting-edge solutions against global research efforts. This year, Samsung R&D Institute Ukraine (SRUKR) secured 1st place in both its tasks: Literary Epochs Classification and ChronoText Classification, showcasing their expertise in automated text analysis and temporal modeling. The results were just announced at the Competition Session at ICDAR 2025 on September 19 in Wuhan, Hubei, China. "Winning the ICDAR 2025 competition is a testament to our team's dedication and expertise in Document AI," said Olga Radyvonenko, Leader of the Input Intelligence Lab at SRUKR. "We are thrilled to be recognized on a global stage for our innovative solutions."

In Task 1 – Literary Epochs Classification, teams had to classify texts into six literary epochs spanning from Classicism (1660–1798) to Contemporary (2000–present). SRUKR’s team achieved 1st place by focusing on robust preprocessing, strategic model selection, and preprocessing techniques.

Key Contributions:

Preprocessing Pipeline: Cleaned datasets by resolving hyphenated word splits, standardizing whitespace, correcting encoding errors, and removing non-textual artifacts.

Model Selection: Leveraged Gemma-3-27B-IT (temperature=0) as the core LLM, chosen for its superior performance in date prediction tasks.

Postprocessing Strategy: Validated predictions against a 1660–2025 range, using an ensemble of models (Qwen2.5-32B-Instruct, Phi-4, and Gemma-3-27B-IT at varying temperatures) to correct out-of-range outputs. This reduced invalid predictions by 43%, with residual errors defaulted to the median year (1880).

The team’s approach balanced precision and robustness, demonstrating strong generalization across diverse literary styles and epochs.

Next, in Task 2 – ChronoText Classification, participants were challenged with pinpointing a text’s century and decade of origin. SRUKR’s team secured its 1st place by refining their Task 1 pipeline and introducing task-specific optimizations.

Key Innovations:

Unified Framework: Retained the preprocessing steps from Task 1 but adapted outputs to map 4-digit publication years to century/decade labels (e.g., 1885 → 19th century, 1880s).

Fine-Tuning with LoRA: Applied LoRA (Low-Rank Adaptation) to Qwen2.5-7B-Instruct, fine-tuning on the training dataset for 3 epochs with a learning rate of 5e-5. The model’s attention layers incorporated Rotary Position Embeddings (RoPE) to better capture temporal patterns.

Ensemble Validation: Out-of-range predictions were defaulted to 1880, ensuring consistency. Final outputs were converted to century/decade labels with high accuracy.

This method outperformed competitors by effectively bridging coarse-grained epoch classification with fine-grained decade identification.

"With our approach, we managed to tread the fine line between precision and robustness, which allowed us to generalize across diverse literary styles and epochs," said Artem Shcherbina, a key contributor to the competition.

Technical Excellence and Future Impact

SRUKR’s success underscores their commitment to advancing Document AI. Their solutions combined: LLM Expertise, Domain-Specific Pipelines, Ensemble Robustness - hybrid approaches to mitigate model uncertainty and improve reliability. "These techniques not only excel in competition settings but also hold promise for real-world applications, such as digitizing documents, enhancing productivity of working tools, and improving AI-driven content curation," said Sergii Lytvynenko, Team Leader of SRUKR's Visual Intelligence Team.

Building on this expertise, SRUKR has extended its capabilities to the realm of Document AI, focusing on universal document parsing.

During Samsung AI Forum 2025, Samsung Research presented "Universal Document Parsing and Structuring for Multimodal Generation." SRUKR team contributed to the development of a table recognition solution as part of their Document AI initiative. The contributions include: Model Training and Optimization; Evaluation and Addressing Multiple Complex Challenges: Ensured accurate parsing of documents containing text, images, and tables, with a focus on understanding relationships between these elements. These solutions aim to improve work productivity by reducing the time users spend searching for information, making document AI a versatile tool for various document types and formats.

Looking Ahead

The team’s achievements at ICDAR 2025 reflect Samsung’s dedication to pushing the boundaries of Document Intelligence. As temporal text analysis continues to grow in importance, SRUKR remains at the forefront, bridging cutting-edge research with practical innovation.

Join us in celebrating this milestone. We are excited about the future and the impact our Document AI solutions will have on various user applications.

For more details, visit the ICDAR 2025 CoLiE Competition website