Optical character recognition (OCR) is one of the crucial technologies for providing the best user experience to Samsung’s customers. It allows for the digitization of any text, both handwritten and printed, enabling users to edit, share, translate, or have text read aloud simply by taking a photo. Ensuring the high quality of their solution is vital for Samsung R&D Institute Ukraine (SRUKR)’s document intelligence team, and competitions serve as an excellent benchmark for evaluating their performance against the latest work of other researchers and developers in the field.
SRUKR has achieved notable success by ranking second in the Competition on Multi Font Group Recognition and OCR. In addition, in collaboration with the Taras Shevchenko National University of Kyiv (KNU), they secured first place in the Competition on Recognition and Visual Question Answering (VQA) on Handwritten Documents. Both competitions took place in spring 2024. The official results are available on the competitions’ websites and will be presented at the International Conference on Document Analysis and Recognition (ICDAR 2024) in September 2024 in Athens, Greece.
The organizers of the Competition on Recognition and VQA on Handwritten Documents have highlighted the shortage of high-quality tools for handwriting recognition and related tasks for languages that use scripts other than Latin. They also noted that even for English, the number of publicly available datasets is limited, posing challenges for academic research. To address this, the participants were provided with unique datasets of handwritten texts in four languages: English, Telugu, Bangla, and Hindi. They were tasked with solving three problems: isolated word recognition, page-level recognition, and VQA on handwritten documents.
The joint team from SRUKR and KNU participated in Task A: Isolated Word Recognition. They secured first place with a character recognition rate of 98% and a word recognition rate of 94%, outperforming the runner-up by more than 10%. As in the font group classification task described below, the winning solution was based on a Convolutional Recurrent Neural Network (CRNN) with a Connectionist Temporal Classification (CTC) objective function. The CRNN architecture consisted of three components: convolutional layers (in this case, an adapted *EfficientNet-V2), recurrent layers (a 1-layer **bidirectional Gated Recurrent Unit (GRU) with 128 cells), and a transcription layer. During the recognition phase, the team used an adapted version of the token-passing algorithm, which allowed a language model to be incorporated and thereby improved recognition rates. This adapted token-passing algorithm identifies the most probable sequence of complete words using a class-based 3-gram language model.
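As an illustration, here is a minimal sketch of such a CRNN with a CTC objective in PyTorch. The small convolutional stack stands in for the adapted EfficientNet-V2 backbone, and the class count, image size, and training details are placeholder assumptions, not the team’s actual configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: conv features -> BiGRU -> per-timestep classes."""
    def __init__(self, num_classes: int, rnn_hidden: int = 128):
        super().__init__()
        # Small conv stack standing in for the adapted EfficientNet-V2 backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width as time
        )
        # 1-layer bidirectional GRU with 128 cells, as described above.
        self.rnn = nn.GRU(128, rnn_hidden, num_layers=1,
                          bidirectional=True, batch_first=True)
        # Transcription layer: class scores (incl. CTC blank) per timestep.
        self.fc = nn.Linear(2 * rnn_hidden, num_classes + 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)              # (B, C, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', C)
        seq, _ = self.rnn(feats)                   # (B, W', 2 * hidden)
        return self.fc(seq).log_softmax(-1)        # (B, W', classes + 1)

# Toy training step with CTC loss (blank index 0 by default).
model = CRNN(num_classes=80)
images = torch.randn(4, 1, 32, 128)           # batch of grayscale word crops
log_probs = model(images).permute(1, 0, 2)    # CTC expects (T, B, C)
targets = torch.randint(1, 81, (4, 6))        # dummy label sequences
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 6, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```

At inference time, the CTC output lattice would be decoded not greedily but with the adapted token-passing algorithm, so that the class-based 3-gram language model can steer the search toward probable word sequences.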
The Competition on Multi Font Group Recognition and OCR focused on historical fonts in several languages. The high accuracy demonstrated by the SRUKR team’s approach indicates the robustness of their solution, effectively handling centuries-old documents where writing style, vocabulary, and orthography differ significantly from contemporary standards.
The competition comprised two tasks: font group recognition and optical character recognition. SRUKR ranked second in both tasks, achieving a character error rate of 3.04% for font group recognition and an impressive 0.85% for OCR. These results were achieved using only the data provided by the organizers.
For the OCR task, the team utilized a transformer-based Deep Neural Network (DNN) to recognize text in images. The recognition model featured an EfficientNet backbone that outputs a sequence of features, which is then passed through a series of transformer layers to predict a character distribution for each position. The model was trained using CTC loss, with Rotary Position Embedding (RoPE) employed within the attention layers. An ensemble of three similar models was trained, and for each input the output of one model was selected based on the total frequency of its 5-grams in the training set. If the sums were equal, progressively smaller n-grams, down to 2-grams, were considered.
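The n-gram-based selection step can be sketched as follows. This is one plausible reading of the described procedure rather than the team’s actual code, and the corpus and candidate strings are invented for illustration.

```python
from collections import Counter

def ngram_counts(corpus: str, n: int) -> Counter:
    """Frequency of every character n-gram in the training text."""
    return Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))

def pick_candidate(candidates: list[str], corpus: str,
                   max_n: int = 5, min_n: int = 2) -> str:
    """Choose the ensemble output whose n-grams are most frequent in the
    training set, backing off from 5-grams down to 2-grams on ties."""
    pool = list(candidates)
    for n in range(max_n, min_n - 1, -1):
        counts = ngram_counts(corpus, n)
        scores = [sum(counts[c[i:i + n]] for i in range(len(c) - n + 1))
                  for c in pool]
        best = max(scores)
        pool = [c for c, s in zip(pool, scores) if s == best]
        if len(pool) == 1:
            break
    return pool[0]

# Toy usage: three model outputs for the same line image.
train_text = "the quick brown fox jumps over the lazy dog"
print(pick_candidate(["the lazy dog", "the lazv dog", "tha lazy dog"], train_text))
```

The intuition is that the hypothesis whose character sequences look most like the training data is the one least likely to contain a recognition error.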
The approach to font group recognition was based on a Convolutional Recurrent Neural Network (CRNN) with a CTC objective function. The CRNN consisted of a Convolutional Neural Network (CNN) and a Multi-Layer Perceptron (MLP). The CNN, specifically an adapted *EfficientNet-V2, extracted features from the images, and the MLP determined the font of each character. Several post-processing rules were applied to further refine the classification results for each character based on the output for neighboring characters.
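The exact post-processing rules were not published; a simple illustrative rule in the same spirit would relabel a character whose predicted font group disagrees with both of its neighbors, since font changes rarely occur for a single character in the middle of a run:

```python
def smooth_font_labels(labels: list[str]) -> list[str]:
    """Illustrative post-processing: a character whose font-group label
    differs from two identical neighbors is assumed to be a
    misclassification and takes the neighbors' label."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]
    return out

# 'fraktur' surrounded by 'antiqua' on both sides gets relabeled.
print(smooth_font_labels(["antiqua", "antiqua", "fraktur", "antiqua"]))
# ['antiqua', 'antiqua', 'antiqua', 'antiqua']
```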
* EfficientNet-V2 is a family of image classification models that achieves better parameter efficiency and faster training speed than prior models by using neural architecture search (NAS) to jointly optimize model size and training speed.
** A Gated Recurrent Unit (GRU) is a type of recurrent neural network, which means it can store information from previous inputs to make more informed predictions about future inputs. A bidirectional GRU, or BiGRU, is a sequence-processing model that consists of two GRUs: one processes the input in its original order (forward direction), and the other processes it in reverse (backward direction).
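A tiny PyTorch sketch makes the two-direction structure concrete (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Bidirectional GRU: one pass left-to-right, one right-to-left;
# the two hidden states are concatenated at every timestep.
bigru = nn.GRU(input_size=64, hidden_size=128,
               bidirectional=True, batch_first=True)
x = torch.randn(1, 20, 64)   # (batch, timesteps, features)
out, _ = bigru(x)
print(out.shape)             # torch.Size([1, 20, 256]) = 2 x 128
```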