The Voice Inside AI's Head Speaks English
On March 20, Saudi Arabia announced that it planned to invest $40 billion in AI development. This move comes on the heels of the UAE’s creation of Jais and Falcon, two large language models (LLMs) developed within the last year. However, as these Arabic-speaking nations race for AI supremacy, they are starting behind the pack.
As students of Arabic can attest, Arabic words frequently change meaning depending on unwritten short vowels, with context playing an important role in determining pronunciation. In addition, letters can take up to four different shapes (isolated, initial, medial, and final) depending on what precedes or follows them, and many of them are joined to their neighbors, which computers can struggle to parse. As a consequence, Arabic is significantly harder to represent computationally than most other languages.
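Both properties can be seen at the level of raw characters. The following is a minimal Python sketch, not a description of how any particular LLM processes Arabic. It uses only the standard `unicodedata` module. The three example words and the helper name `strip_short_vowels` are illustrative choices rather than anything drawn from Jais or Falcon.

```python
import unicodedata

def strip_short_vowels(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), which include
    the Arabic short-vowel signs that everyday writing usually omits."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Base letters and short-vowel marks, written as code points for clarity.
KAF, TEH, BEH = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

# Three different words that share the consonant skeleton k-t-b.
kataba = KAF + FATHA + TEH + FATHA + BEH + FATHA  # "he wrote"
kutub = KAF + DAMMA + TEH + DAMMA + BEH           # "books"
kutiba = KAF + DAMMA + TEH + KASRA + BEH + FATHA  # "it was written"

# Without the vowel marks, all three collapse into one written form.
unvowelled = {strip_short_vowels(w) for w in (kataba, kutub, kutiba)}
print(len(unvowelled))  # 1

# Unicode's legacy presentation forms encode the isolated, final,
# initial, and medial shapes of the letter beh as separate code points.
beh_shapes = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

# NFKC normalization folds every shape back to the single base letter.
normalized = {unicodedata.normalize("NFKC", s) for s in beh_shapes}
print(len(normalized))  # 1
```

A model reading unvowelled text sees the same string for all three words and has to recover the intended one from context, which is the ambiguity described above.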
In addition, LLMs require training on massive amounts of digital text. Arabic is the fourth-most-spoken language globally, but it constitutes less than 1% of all internet content. That leaves AI developers with much less material to work with. Over 70% of the content used to train the UAE’s Jais, considered the premier Arabic LLM, was in English due to a lack of Arabic material. Consequently, Jais handles higher-level reasoning tasks in Arabic less well than English-language models such as ChatGPT.
While the billions of investment dollars pledged by Saudi Arabia may seem like a mammoth amount, they pale in comparison to the funding put into English-language LLMs. Sam Altman, the CEO of OpenAI, recently shared that he is seeking investments of upwards of $7 trillion to enhance model training capacity, about 175 times Saudi Arabia’s planned $40 billion investment.
As more and more people in Arabic-speaking countries turn to AI to complete tasks, users have expressed fears that Arabic is “falling behind” in the world of artificial intelligence. As of now, they’re right. English-language LLMs have the advantage in language simplicity, funding, and source material. Available investment dollars are part of the equation, but they are not the only part.