The rise of artificial intelligence (AI) has sparked a global race, with nations vying for dominance in this transformative technology. Amidst this competition, Europe has emerged with a distinct vision: digital sovereignty. This vision centers on empowering Europe to control its digital destiny, fostering innovation, and ensuring that critical technologies, like AI, are developed and governed within its borders. A key component of this strategy is the development of open-source large language models (LLMs), a goal being pursued through the ambitious OpenEuroLLM project.
OpenEuroLLM represents a collaborative effort involving around 20 organizations, spearheaded by Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI. The project's central objective is to create a series of "truly" open-source LLMs that cover all 24 official languages of the European Union, as well as languages spoken in countries seeking EU membership, such as Albania. This forward-thinking approach underscores Europe's commitment to inclusivity and future-proofing its digital infrastructure.
This initiative aligns seamlessly with Europe's broader push for digital sovereignty. The continent is actively working to bring mission-critical infrastructure and tools closer to home. Major cloud providers are investing in local infrastructure to ensure EU data remains within its borders. Even global AI players like OpenAI have introduced offerings that allow customers to process and store data in Europe. The EU's recent $11 billion investment in a sovereign satellite constellation, a direct competitor to Elon Musk's Starlink, further exemplifies this commitment. OpenEuroLLM is a natural extension of this strategic direction.
The financial commitment to OpenEuroLLM reflects the significance of the project. The budget allocated specifically for model development is €37.4 million, with approximately €20 million contributed by the EU's Digital Europe Programme. While this sum may seem modest compared to the massive investments made by corporate AI giants, it's crucial to consider the broader context. The overall budget, including funding for related work, is significantly larger. Furthermore, the project leverages the substantial resources of the EuroHPC project, which boasts a budget of around €7 billion and provides access to supercomputing centers across Europe. These centers, located in Spain, Italy, Finland, and the Netherlands, will play a crucial role in training the massive LLMs.
Despite the project's ambitious goals and the substantial resources at its disposal, the sheer number of participating organizations has raised concerns about its feasibility. Some industry experts, like Anastasia Stasenko, co-founder of LLM company Pleias, have questioned whether a large consortium can achieve the same level of focus and agility as a smaller, dedicated private AI firm. Stasenko points to the recent successes of European AI companies like Mistral AI and LightOn, which have demonstrated the power of smaller, more focused teams. These companies, she argues, have a clearer sense of ownership and accountability, driving them to make decisive choices in finance, market positioning, and reputation.
However, the OpenEuroLLM project isn't starting from scratch. It builds upon the foundation laid by the High Performance Language Technologies (HPLT) project, coordinated by Hajič since 2022. HPLT has been developing free and reusable datasets, models, and workflows using high-performance computing. With HPLT scheduled to conclude in late 2025, it serves as a valuable precursor to OpenEuroLLM, providing a wealth of data, expertise, tools, and experience. Hajič emphasizes that OpenEuroLLM represents a broader participation, with a specific focus on generative LLMs. This head start should enable the project to accelerate its progress.
Hajič anticipates the release of the first versions of the LLMs by mid-2026, with the final iterations expected by the project's conclusion in 2028. While these timelines may seem ambitious, especially considering the project's nascent stage, Hajič says the groundwork has already been laid. The project officially commenced on February 1, 2025, but preparations have been underway for a year, including the tender process that began in February 2024.
The OpenEuroLLM consortium comprises a diverse range of organizations, spanning academia, research, and the corporate sector. Participating entities hail from various European countries, including Czechia, the Netherlands, Germany, Sweden, Finland, and Norway, in addition to the EuroHPC centers. Notable corporate partners include Silo AI, Aleph Alpha, Ellamind, Prompsit Language Engineering, and LightOn.
One notable absence from the consortium is Mistral AI, a prominent French AI unicorn that has positioned itself as an open-source alternative to companies like OpenAI. While Mistral's participation would have been a significant boost to the project, Hajič confirmed that initial conversations with the startup did not lead to a formal collaboration. However, the project remains open to new participants from EU organizations, although entities from non-EU countries like the U.K. and Switzerland are ineligible due to funding restrictions.
The overarching goal of OpenEuroLLM is to create "a series of foundation models for transparent AI in Europe," preserving the "linguistic and cultural diversity" of all EU languages. This translates to a core multilingual LLM designed for general-purpose tasks where accuracy is paramount. Furthermore, the project aims to develop smaller, "quantized" versions of the models for edge applications where efficiency and speed are critical.
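For readers unfamiliar with the term, quantization stores a model's weights at lower numerical precision to save memory and speed up inference, at the cost of a small rounding error. The sketch below illustrates the idea with symmetric 8-bit quantization in plain Python. OpenEuroLLM has not said which scheme it will use, so the 8-bit choice, the function names, and the sample weights are all assumptions made for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor.

    Illustrative only: production schemes typically quantize per channel or
    per block and operate on tensors rather than Python lists.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would work.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the rounding easy to follow.
weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each weight then occupies one byte instead of the two or four bytes typical of floating-point formats, which is why such models suit edge devices. The rounding error per weight is at most half the scale factor.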
The project faces the challenge of ensuring equitable performance across all EU languages, particularly those with limited digital resources. Hajič acknowledges this hurdle but emphasizes the importance of developing robust benchmarks for these languages, so that model performance in each of them can be assessed accurately and is not distorted by unrepresentative benchmarks.
Data is the lifeblood of LLMs, and OpenEuroLLM will leverage the extensive dataset developed by the HPLT project. Version 2.0 of this dataset, released four months ago, was built from 4.5 petabytes of web crawls and over 20 billion documents. OpenEuroLLM will supplement this data with additional information from Common Crawl, an open repository of web-crawled data.
The concept of "open source" in the context of AI is complex and often debated. While the Open Source Initiative (OSI) has established definitions for traditional software, the application of these principles to AI models is more nuanced. Open-source AI proponents advocate for the free availability of not only the models themselves but also the datasets, pre-trained models, and weights – the complete package. However, the OSI's definition acknowledges the challenges posed by proprietary data and redistribution restrictions, making training data availability non-mandatory.
OpenEuroLLM, despite its commitment to being "truly open," will likely face similar challenges. Hajič acknowledges the limitations and emphasizes the project's commitment to quality. While the goal is to make everything open, the project may need to make compromises to comply with copyright laws and ensure the highest quality models. This could mean keeping some training data under wraps, but accessible to auditors upon request, particularly for high-risk AI systems as defined by the EU AI Act. The project aims to maximize the openness of data, especially that sourced from Common Crawl, but will ultimately prioritize compliance with AI regulations.
Interestingly, OpenEuroLLM is not the only project pursuing open-source LLMs in Europe. EuroLLM, launched a few months prior, shares similar goals and is also co-funded by the EU. This overlap has raised concerns about duplication of effort and the potential for confusion. Andre Martins, head of research at Unbabel, a partner in the EuroLLM project, highlighted the similarities and stressed the importance of collaboration between the two initiatives. Hajič acknowledged the "unfortunate" situation and expressed hope for cooperation, while noting that OpenEuroLLM's EU funding restricts collaboration with non-EU entities, including U.K. universities involved in EuroLLM.
The emergence of models like DeepSeek, with their impressive cost-to-performance ratios, has raised questions about the true costs of building LLMs. While DeepSeek's specific development costs remain undisclosed, it suggests that significant results may be achievable with fewer resources than previously thought. Peter Sarlin, technical co-lead on OpenEuroLLM, believes that the project has access to sufficient funding, primarily to cover personnel costs. He points out that the partnership with EuroHPC centers provides access to substantial compute resources, significantly reducing the financial burden.
Sarlin emphasizes that OpenEuroLLM's focus is on developing open-source foundation models, not consumer- or enterprise-grade products. This distinction is crucial, as building a chatbot or AI assistant requires significantly more effort and resources. OpenEuroLLM's contribution lies in providing the essential AI infrastructure upon which European companies can build their own applications. Sarlin, with his experience leading Silo AI, which has already launched open models supporting several European languages, believes that the project's budget is sufficient for its objectives. The upcoming "Europa" models from Silo AI, designed to cover all European languages, further demonstrate the existing expertise and technological foundation upon which OpenEuroLLM can build.