Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

Patil, Vaidehi; Talukdar, Partha; Sarawagi, Sunita (2022). Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.


Abstract

Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRLs). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRLs) and LRLs does not provide enough scope for co-embedding the LRL with the HRL, thereby hurting the downstream task performance of LRLs. In this paper, we argue that relatedness among languages in a language family, along the dimension of lexical overlap, may be leveraged to overcome some of the corpora limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm that enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation or accuracy. Unlike previous studies that dismissed the importance of token overlap, we show that token overlap matters in the low-resource related-language setting: synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.
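The core idea — biasing BPE merge selection so that pairs frequent in *both* an HRL and a related LRL corpus outrank equally frequent HRL-only pairs — can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's exact algorithm: the scoring blend `(1 - alpha) * max + alpha * min` over per-language pair counts, the function names, and the corpus format (words as space-separated symbols mapped to frequencies) are all assumptions made for the sketch.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across a corpus of {word: frequency}."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Apply one merge (a, b) -> ab to every word in the corpus."""
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def obpe_like_merges(hrl_corpus, lrl_corpus, num_merges, alpha=0.5):
    """Greedy BPE merges scored to reward pairs shared by both corpora.

    Hypothetical score: (1 - alpha) * max(counts) + alpha * min(counts).
    With alpha > 0, a pair attested in both languages outranks an
    HRL-only pair of comparable frequency; alpha = 0 ignores overlap.
    """
    merges = []
    for _ in range(num_merges):
        hrl, lrl = pair_counts(hrl_corpus), pair_counts(lrl_corpus)
        pairs = set(hrl) | set(lrl)
        if not pairs:
            break
        best = max(pairs, key=lambda p: (1 - alpha) * max(hrl[p], lrl[p])
                                        + alpha * min(hrl[p], lrl[p]))
        merges.append(best)
        hrl_corpus = merge_pair(hrl_corpus, best)
        lrl_corpus = merge_pair(lrl_corpus, best)
    return merges
```

On a toy pair of corpora, an HRL-only pair with raw count 10 loses to a shared pair with counts 4 (HRL) and 8 (LRL) once the overlap term is active, illustrating how the learned vocabulary shifts toward tokens the two languages share.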

Item Type: Conference or Workshop Item (Paper)
Source: Copyright of this article belongs to the Association for Computational Linguistics
ID Code: 128250
Deposited On: 18 Oct 2022 10:54
Last Modified: 15 Nov 2022 08:42
