PolyLM, an open-source multilingual LLM, unveiled
Researchers from DAMO Academy and Alibaba Group have introduced PolyLM, an open-source multilingual large language model (LLM) designed to address the English-centric bias of existing open models. Available in two sizes, 1.7B and 13B parameters, PolyLM offers advanced capabilities in understanding, reasoning, and generating text across multiple languages.
PolyLM excels in major non-English languages, including Spanish, Russian, Arabic, Japanese, Korean, Thai, Indonesian, and Chinese, complementing existing English-centric models. Its training strategy facilitates knowledge transfer from English to other languages, enhancing its multilingual performance. To improve its understanding of multilingual instructions, PolyLM utilises the MULTIALPACA data set, which provides high-quality multilingual instruction data.
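As a rough illustration of how such an instruction-tuned checkpoint might be used, the sketch below loads PolyLM through the Hugging Face transformers API and issues a Spanish instruction. The repository name and prompting style are assumptions made for this example; the official model card should be treated as authoritative.

```python
# Minimal sketch of querying an instruction-tuned PolyLM checkpoint via
# Hugging Face transformers. The model identifier below is an assumed
# repository name, not confirmed official usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DAMO-NLP-MT/polylm-multialpaca-13b"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A multilingual instruction (Spanish): "Summarise the advantages of
# multilingual language models."
prompt = "Resume las ventajas de los modelos de lenguaje multilingües."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```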
The researchers trained PolyLM on a massive data set of 640B tokens drawn from sources such as Wikipedia, mC4, and CC-100. They employed a curriculum learning strategy that initially concentrated on English and gradually increased the sampling proportion of other, lower-resource languages, helping transfer general knowledge learned from English-dominant data to the other languages.
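To make the curriculum idea concrete, the sketch below shows one way such a schedule could be implemented: per-language sampling weights are interpolated from an English-heavy mixture towards a more balanced one as training progresses. The proportions and language list are illustrative placeholders, not the exact values used to train PolyLM.

```python
import random

# Illustrative curriculum schedule: start English-heavy, then linearly raise
# the sampling share of other languages over training. These proportions are
# placeholders, not PolyLM's actual training mixture.
START_WEIGHTS = {"en": 0.90, "zh": 0.04, "es": 0.02, "ru": 0.02, "other": 0.02}
END_WEIGHTS   = {"en": 0.60, "zh": 0.15, "es": 0.08, "ru": 0.08, "other": 0.09}

def language_weights(step: int, total_steps: int) -> dict:
    """Linearly interpolate per-language sampling weights over training."""
    t = min(step / total_steps, 1.0)
    return {lang: (1 - t) * START_WEIGHTS[lang] + t * END_WEIGHTS[lang]
            for lang in START_WEIGHTS}

def sample_language(step: int, total_steps: int) -> str:
    """Pick the language of the next training batch under the schedule."""
    weights = language_weights(step, total_steps)
    langs, probs = zip(*weights.items())
    return random.choices(langs, weights=probs, k=1)[0]

# Early in training the sampler is dominated by English; by the end, the
# share of the other languages has grown substantially.
print(language_weights(0, 100_000))        # ~90% English
print(language_weights(100_000, 100_000))  # ~60% English
print(sample_language(50_000, 100_000))    # a language drawn mid-training
```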
PolyLM’s evaluation involved a benchmark of multilingual tasks, including question answering, language understanding, text generation, and machine translation. In these experiments, PolyLM outperformed existing open models of similar size on non-English languages, and fine-tuning on the multilingual instruction data improved its capabilities further.
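For the translation portion of such a benchmark, evaluation typically reduces to prompting the model for a translation and scoring the output against references, for example with sacreBLEU. The sketch below assumes a hypothetical prompt template and a stand-in generation function; neither reflects PolyLM's documented evaluation setup.

```python
import sacrebleu

def translate(generate_fn, source: str, src_lang: str, tgt_lang: str) -> str:
    """Prompt a causal LM for a translation. The template is an assumed
    format for illustration, not PolyLM's documented prompt."""
    prompt = f"Translate the following {src_lang} sentence into {tgt_lang}:\n{source}\n"
    return generate_fn(prompt)

# Stand-in for a real model call (e.g. the transformers snippet above).
fake_generate = lambda prompt: "Das ist ein Test."

hypotheses = [translate(fake_generate, "This is a test.", "English", "German")]
references = [["Dies ist ein Test."]]  # one reference stream, one per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```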
With PolyLM’s introduction, the AI community now has access to a powerful multilingual LLM. Its proficiency in major non-English languages and its advanced training techniques make it a significant milestone in the field of natural language processing.