The Machine Translation team of Samsung R&D Institute Poland (SRPOL) has just added another chapter to its long history of development and improvement. This year the SRPOL MT team competed with the best in a task whose goal was to translate between 10 Indian languages and English in both directions. The task was organized as part of the Workshop on Asian Translation (WAT 2021). This was the 8th edition of the WAT conference and competition, which was conceived as a counterweight to the Western-focused Workshops on Machine Translation. As many Asian countries are experiencing rapid growth these days, easy and fluent communication is becoming essential to sustaining that growth. The system submitted by SRPOL was the best among all participants in 18 of the 20 translation directions that it covered.
For over 8 years, starting from a small 2-person project, SRPOL has been building its competence in Machine Translation. Following various technological paths and detours, the team has constantly improved the quality of its machine translation and broadened the language coverage it supports. Since 2017 the team has felt strong enough to compete with top research institutions and companies in international Machine Translation challenges. For several years in a row SRPOL won the International Workshop on Spoken Language Translation (IWSLT) competition with its English-German models, for both text and speech translation. The next step was to challenge themselves and do the same with other, less-resourced languages in even more prestigious competitions. Last year SRPOL stood on the podium of the Workshop on Machine Translation (WMT 2020) for Czech and Polish.
This year, they decided to throw themselves in at the deep end and compete in translating Indian languages. With a population of about 1.3 billion, almost 20% of the world population, India is becoming a key market for Samsung. This is why SRPOL’s machine translation team decided to target and cover this part of the linguistic world. They are a little ashamed to admit that at the beginning they could not even name all 10 languages included in the competition (thankfully that changed very quickly). The task was to create just 2 multilingual systems that would translate between 10 Indian languages and English. The languages were Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil and Telugu. The task was further complicated by the fact that each language (except Hindi and Marathi) has a different script, and none of these scripts resembles the Latin script that SRPOL’s engineers are familiar with. Additionally, most of the languages in this task can be considered low-resource, i.e. there are few parallel corpora between these languages and English, so all known techniques for generating synthetic data came into play.
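To make the point about scripts concrete, here is a minimal Python sketch that guesses the script of a sentence from Unicode code point ranges. It is an illustration only, not a tool the team describes using: the function name `dominant_script` and the table `SCRIPT_BLOCKS` are made up for this example, and the ranges are the standard Unicode blocks for the scripts of the 10 languages. Note that a script does not always pin down the language, since Hindi and Marathi share Devanagari.

```python
from collections import Counter

# Unicode blocks for the scripts used by the 10 task languages.
# (Hypothetical helper table for this illustration.)
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),  # Hindi, Marathi
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),    # Punjabi
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text):
    """Return the name of the most frequent Indic script in text, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_BLOCKS.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None  # e.g. pure Latin text
    return counts.most_common(1)[0][0]


print(dominant_script("नमस्ते"))  # Devanagari
print(dominant_script("hello"))   # None
```

A check like this is enough to route a sentence to the right script-specific preprocessing, but telling Hindi from Marathi would need an actual language identifier.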
Preparing the systems was very challenging, since no one on the translation team could understand a word of any of the languages, or even recognize a language by looking at its script. Fortunately, a crash course in the basic rules and linguistic tools for Indian languages was enough to keep going. As soon as the MT team took control of the “input”, they could apply all their experience to train very strong and competitive models. They trained the multilingual models drawing on the experience gained while building commercial models for the Bixby Vision service as well as the on-device translation used as a Samsung Browser plug-in.
The team proposed several training variants that help with low-resource languages, as well as various techniques for mining in-domain sentences (those as similar as possible to the test domain) from monolingual data. The models were trained with well-established, proven techniques such as backward-forward translation, added sampling noise, multi-agent dual learning, distillation, and others. For over a month the team followed the training curves, analyzed progress, and experimented with mixing and matching models and with synthetic corpora generation. Finally, having ended up with several candidate models, the team extensively optimized different methods of fusing (ensembling) them into one. The models were so good in terms of translation quality that we decided to use them to improve existing production systems for Indian languages.
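As an illustration of what fusing (ensembling) candidate models can look like, the sketch below shows one common scheme: a weighted average of the next-token probability distributions produced by several models at a single decoding step. This is an assumption made for illustration, not necessarily the method the team settled on. The function name `ensemble_log_probs`, the toy distributions and the weights are all hypothetical; in practice the weights would be tuned on a development set.

```python
import math


def ensemble_log_probs(model_log_probs, weights=None):
    """Mix next-token distributions from several models.

    model_log_probs: one list of log-probabilities per model, all over the
    same vocabulary, with every probability assumed to be strictly positive.
    weights: one non-negative weight per model, summing to 1
    (uniform if omitted).
    """
    n_models = len(model_log_probs)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    vocab_size = len(model_log_probs[0])
    mixed = []
    for token in range(vocab_size):
        # Average in probability space, then go back to log space.
        p = sum(w * math.exp(lp[token])
                for w, lp in zip(weights, model_log_probs))
        mixed.append(math.log(p))
    return mixed


# Toy example: two models disagree on the best of three tokens.
model_a = [math.log(p) for p in (0.5, 0.3, 0.2)]
model_b = [math.log(p) for p in (0.1, 0.6, 0.3)]

uniform = ensemble_log_probs([model_a, model_b])
favor_a = ensemble_log_probs([model_a, model_b], weights=[0.9, 0.1])

print(max(range(3), key=lambda t: uniform[t]))  # token chosen with equal weights
print(max(range(3), key=lambda t: favor_a[t]))  # token chosen when model A dominates
```

Changing the weights changes which token wins, which is why the choice of fusion method and weights is worth optimizing once several strong candidates exist.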
None of the preparations and experiments behind the final submissions would have been possible without strong support from Samsung Research HQ, especially the resources we used on Supercom. We used thousands of GPU hours to train, translate and then train our models again. It is also worth mentioning that, once again, we prepared the submission under strict lockdown in Poland: all work and communication was done remotely, with daily video conferences.
This success is not only ours but also SRPOL’s and Samsung’s, as a great place that enables research like this!
Machine Translation team of Samsung R&D Institute Poland (SRPOL)