New research in natural language processing (NLP) and machine translation (MT) will be presented at Empirical Methods in Natural Language Processing (EMNLP) 2023, held in Singapore on 6-10 December. Authored by Dr. Steinþór Steingrímsson in collaboration with Professor Andy Way at DCU and Associate Professor Hrafn Loftsson at Reykjavik University, the research focuses on a sentence alignment tool for parallel corpora. Along with ACL, EMNLP is one of the two leading high-impact conferences in natural language processing and artificial intelligence (AI).
Titled “SentAlign: Accurate and Scalable Sentence Alignment”, the paper describes a sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in large document collections and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. When evaluated on two evaluation sets and on a downstream MT task, SentAlign outperforms five leading sentence alignment tools.
SentAlign benefits the research community by speeding up the time-consuming task of aligning parallel corpora. Because high-quality parallel data is a foundational resource, these benefits extend to MT, cross-lingual NLP tasks, and core NLP research.
SentAlign is available at https://github.com/steinst/SentAlign
Earlier this year, Dr. Steingrímsson defended his PhD thesis, entitled “Effectively Compiling Parallel Corpora for Machine Translation in Resource-Scarce Conditions”, at Reykjavik University, where he was co-supervised by Prof. Loftsson and Prof. Way. Steinþór visited ADAPT at DCU for four months in 2020 and was a member of ADAPT from 2019 to 2023.