Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only
Document type: Conference paper
Authors | Lu JL (陆金梁)2,3
Publication date | 2023-12
Conference dates | December 6-10, 2023
Conference location | Singapore
Abstract | Recent studies have revealed the remarkable cross-lingual capability of multilingual pre-trained language models (mPLMs), even when pre-trained without parallel corpora (mono-mPLMs). Intuitively, semantic alignments may underlie this capability, but they remain under-explored. In this work, we investigate alignment properties at the token level in mono-mPLMs and find that the alignments correspond to the geometric similarity of the embedding spaces across different languages. Nevertheless, mono-mPLMs tend to damage this geometric similarity at the higher layers due to the lack of cross-lingual interactions, thus limiting their cross-lingual transfer capabilities. To address this issue, we introduce token-level and semantic-level code-switched masked language modeling, which employs self-induced token alignments to explicitly improve cross-lingual interactions across the layers of mono-mPLMs without relying on parallel sentences. We evaluate our method on various natural language understanding and unsupervised machine translation tasks. The results demonstrate that our method outperforms strong baselines and achieves performance comparable to mPLMs trained with parallel corpora. (A hypothetical sketch of the token-level code-switching idea follows this record.)
Proceedings publisher | Association for Computational Linguistics
Source URL | http://ir.ia.ac.cn/handle/173211/57386
Research division | Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)
Corresponding author | Zhang JJ (张家俊)
Author affiliations | 1. Wuhan AI Research, Wuhan, China; 2. Institute of Automation, Chinese Academy of Sciences, Beijing, China; 3. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Recommended citation (GB/T 7714) | Lu JL, Zhang JJ. Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only[C]. Singapore, December 6-10, 2023.
Ingest method: OAI harvesting
Source: Institute of Automation
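The abstract describes token-level code-switched masked language modeling driven by self-induced token alignments. The following is a minimal sketch of that idea under stated assumptions: alignments are induced by nearest-neighbor search over two sets of token embeddings, and aligned tokens are spliced into a sequence before standard MLM masking. The function names, the switching and masking rates, and the separate source/target embedding matrices are illustrative assumptions, not the authors' implementation; the paper's semantic-level variant is not shown.

```python
# Hypothetical sketch: self-induced token alignments + token-level
# code-switched MLM. Names and hyperparameters are illustrative
# assumptions, not the authors' released implementation.
import numpy as np

rng = np.random.default_rng(0)

def induce_alignments(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Map each source-language token to its nearest target-language token
    by cosine similarity of their embeddings (the 'self-induced' alignment)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (src @ tgt.T).argmax(axis=1)  # shape: (|V_src|,)

def code_switch_and_mask(token_ids, alignment, mask_id, switch_p=0.15, mask_p=0.15):
    """Replace a random fraction of tokens with their aligned counterparts,
    then apply standard MLM masking to the mixed sequence."""
    ids = np.asarray(token_ids).copy()
    switch = rng.random(ids.size) < switch_p
    ids[switch] = alignment[ids[switch]]   # token-level code-switching
    labels = np.full_like(ids, -100)       # -100 = position ignored by the loss
    mask = rng.random(ids.size) < mask_p
    labels[mask] = ids[mask]               # predict the (possibly switched) token
    ids[mask] = mask_id                    # mask the input at those positions
    return ids, labels

# Toy usage: 10-token "vocabularies" with 4-dimensional embeddings.
src_emb = rng.normal(size=(10, 4))
tgt_emb = rng.normal(size=(10, 4))
alignment = induce_alignments(src_emb, tgt_emb)
input_ids, labels = code_switch_and_mask([1, 2, 3, 4, 5], alignment, mask_id=0)
print(input_ids, labels)
```

In a mono-mPLM the two embedding sets would be rows of the shared embedding matrix belonging to different languages, and the alignment would presumably be re-induced periodically as the embeddings move during continued pre-training.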