Helsinki-based Silo AI has completed the training of the Poro model, a new milestone in its mission to create large language models (LLMs) for low-resource languages.
Named after the Finnish word for “reindeer,” Poro is the first of a family of open-source multilingual LLMs. The startup is building the models alongside the University of Turku and the EU’s High Performance Language Technologies (HPLT) project.
Poro is a 34.2 billion parameter model, designed to process English, Finnish, and code. It’s been trained on a dataset of 1 trillion tokens.
“What we are proving with Poro is that we can build competitive models for low-resource languages, like Finnish,” Peter Sarlin, co-founder and CEO of Silo AI, told TNW.
Sarlin explained that high-resource languages like English dominate generic LLMs. As a result, these models' capabilities in low-resource languages extend only as far as translation, and aren't representative of the language and culture of a specific country.
According to the startup, Poro outperforms all existing open-source language models on Finnish, including Mistral, FinGPT, Llama, and the BLUUMI 176 billion parameter model.
To achieve this, the team used a novel training approach that pairs Finnish with high-resource languages. It determined optimal data reuse frequencies for low-resource languages and integrated translated paired texts between Finnish and English. This method relies on cross-lingual signals to boost the model's understanding of the connections between languages and, in turn, its performance in Finnish, without compromising its performance in English.
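As a rough illustration of the idea, and not Silo AI's actual pipeline, the sketch below builds a toy training mixture in Python: English text is used once, the smaller Finnish corpus is repeated a set number of times, and translated sentence pairs are added as joint examples. The function name `build_mixture`, the `finnish_repeats` value, and the "English:/Finnish:" pair format are all hypothetical choices made for this example.

```python
from itertools import chain


def build_mixture(english_docs, finnish_docs, translation_pairs, finnish_repeats=4):
    """Assemble a toy training mixture (hypothetical sketch, not Poro's recipe).

    - English documents appear once.
    - Finnish documents are repeated `finnish_repeats` times (data reuse).
    - Each (english, finnish) translation pair becomes one joint example,
      so the model sees both languages side by side.
    """
    # Repeat the low-resource corpus a fixed number of times.
    repeated_finnish = list(chain.from_iterable([finnish_docs] * finnish_repeats))

    # Format translation pairs as single cross-lingual examples.
    # The "English:/Finnish:" layout is an assumption made for illustration.
    paired = [f"English: {en}\nFinnish: {fi}" for en, fi in translation_pairs]

    return list(english_docs) + repeated_finnish + paired


if __name__ == "__main__":
    mixture = build_mixture(
        english_docs=["The reindeer crossed the road."],
        finnish_docs=["Poro ylitti tien."],
        translation_pairs=[("Good morning", "Hyvää huomenta")],
        finnish_repeats=4,
    )
    for text in mixture:
        print(repr(text))
```

In practice, the number of repeats would be tuned empirically, which is what the team describes as determining optimal data reuse frequencies; the value of 4 here is only a placeholder.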
Poro has also achieved another milestone: it’s the first multilingual model that has been trained on a EuroHPC supercomputer. “This is proof that we’re able to train LLMs on the AMD-based LUMI supercomputer, instead of an NVIDIA-based supercomputer,” Sarlin said.
A step towards European sovereignty
Open-source multilingual LLMs are key to ensuring language diversity, cultural representation, and democratic access in artificial intelligence. They’re also critical for Europe’s AI sovereignty.
“From a commercial perspective, these models build a baseline and infrastructure that allows European companies to innovate on top,” Sarlin noted. “This way companies can create IP, create competitive edge, and [create] great business that ensures that value stays in Europe with them.”
Poro is available for free under the Apache 2.0 License, which allows both commercial and research use. Silo AI is currently working on the Nordic languages (Swedish, Norwegian, Danish, and Icelandic), and is planning to expand to all other official languages of the EU.