When an AI speaks Nahuatl
Latin American datasets, too, are often impermeable to marginalised perspectives
Graphic: Julia Neller
In April 2024, João Antônio Trindade Bastos, a personal trainer by profession, was watching a football match in Sergipe, Brazil’s smallest federal state. Police officers suddenly picked him out of the crowd, searched him and took him into custody for questioning. Facial recognition software had identified the Black man as a wanted criminal, although he had never committed a crime: the artificial-intelligence system had simply mistaken him for someone else.
On the other side of the Andes, Chilean software developers discovered a gap while testing ChatGPT. When they asked it about contemporary Chilean literature, the large language model (LLM) brought up the writer Pablo Neruda — even though he died in 1973. It also hallucinated books that do not exist.
These examples are telling. In the first, an AI mistake harms an individual; in the second, contemporary culture is misrepresented. Both cases highlight what happens when AI systems are developed primarily in the Global North and trained on faces with predominantly lighter skin tones: Black people become algorithmically invisible or interchangeable. Similarly, when language models are largely trained on English-language input, Latin America’s own cultural production is pushed to the margins. This means that Latin Americans today are using AI systems that present a distorted picture of their culture. At the same time, our dependence on this technology deepens day by day.
Against this backdrop, Chilean President Gabriel Boric has said that “the digital future must also speak our language, with our voices, and be made for our people.” In doing so, he named both a shortcoming and a goal: the attempt to create AI infrastructure that really meets the needs of Latin American citizens.
As a result, dozens of institutions and experts from several countries in the region, under the leadership of Chile’s CENIA (Centro Nacional de Inteligencia Artificial), are currently developing the large language model Latam-GPT. Work on this open model has been under way since 2023. Latam-GPT is intended to strengthen the region’s AI capacities and reduce dependence on models from elsewhere. Organisations from across Latin America and the Caribbean are involved in the project. The Brazilian government is on board, and other countries have expressed interest.
Latam-GPT is meant to differ significantly from other global models that are primarily fed with English-language data and corresponding perspectives and often have little to do with everyday life in Latin America. Latam-GPT is designed to “understand” the region’s context, culture and diversity. Around 3.5 million US dollars are available for the development of the first Latin American LLM based on the open-source LLaMA-3 architecture with up to seventy billion parameters—roughly one sixtieth of the sum Google invested in Gemini Ultra. Latam-GPT uses a distributed training infrastructure that combines the regional high-performance computing system of the University of Tarapacá in Chile with cloud capacities from Amazon Web Services (AWS). This is a pragmatic approach: on the one hand, AWS’s computing power is used; on the other, regional infrastructure is preserved, which plays a central role on the path toward the desired technological sovereignty. The project’s supporters also include the Inter-American Development Bank and Data Observatory, a non-profit, state-supported organisation in Chile dedicated to the public-interest management of large data sets.
“Latam-GPT is meant to incorporate knowledge about the Aztecs and the Incas”
Latam-GPT is conceived entirely as an open-source platform. Because all code and models are published under open licences, the AI is transparent and can be monitored by the public and used by universities, public authorities, companies and civil-society organisations. Alongside Spanish, Portuguese and English, the regionally oriented model supports applications such as personalised learning systems, virtual assistants for the public sector, digitisation tools for cultural archives, and translation tools for Indigenous languages. Latam-GPT includes languages such as Nahuatl, Quechua and Mapudungun. Ensuring cultural diversity in the training data in a reliable way and incorporating knowledge about ancient peoples such as the Aztecs and the Incas is, as CENIA director Álvaro Soto told Wired magazine in an interview, work that “no one else is doing”.
The greatest challenge facing the project is data. Because CENIA does not have its own data sources like Google and other corporations, it cooperates with more than thirty research institutions, government bodies, archives, libraries, universities, social organisations, publishers and film productions that contribute data. Other new partners are also expected to join. Brazil and Mexico provide the lion’s share of training data. This highlights the challenge faced by any project aimed at regional sovereignty: how to develop a truly representative technology. Large Latin American economies such as Mexico and Brazil contribute more data, but in doing so they reproduce inequality. In contrast, Uruguay, Paraguay, Ecuador and some Central American countries, as a result of their respective historical developments, have fewer universities and smaller digitised archives. However, artificially limiting the volume of data contributed by Brazil and Mexico to counteract this would weaken the model as a whole. How can you reliably ensure that small countries are not pushed to the margins in Latin American projects?
There is no quick fix to this problem. Back in 2019, Paola Ricaurte, professor in the Department of Media and Digital Culture at Tecnológico de Monterrey in Mexico and a faculty member at Harvard University’s Berkman Klein Center in the United States, outlined the issues at stake: “Datasets reinforce historical forms of colonisation by assembling practices, materialities, territories, bodies and subjectivities into a complex arrangement.”
As an activist, Ricaurte has founded the Tierra Común network, which examines the mirror-image relationship between data extraction and a kind of colonial resource extraction. Brazil has not colonised Latin American AI; rather, due to its size and economic strength, it simply has more universities and digitised archives. Colonialism, however, makes itself felt in other ways: Tarcízio Silva, a fellow at the Mozilla Foundation, speaks of “algorithmic racism”. He defines this as a “new manifestation of structural racism in which the powerful can discriminate with the help of machines, cameras or a screen interface”.
The starting point for Silva’s research into how algorithms can harm minorities is Brazil. There, Black people and those who identify as pardo — that is, of mixed ancestry — make up the majority of the population, accounting for 54 per cent. But when it comes to access to the internet, they are severely disadvantaged and underrepresented. According to Brazil’s Internet Steering Committee, only 22 per cent of Brazilians over the age of ten have satisfactory access to the internet. Among Black and pardo Brazilians living in disadvantaged circumstances and in smaller towns, the figures are even lower. In other Latin American countries, too, privileged groups are overrepresented when it comes to written records and access to information.
“Women in Latin America are doubly marginalised”
Despite the structural challenges, Latam-GPT is in the process of developing something special. This is reflected in the number of partners involved. Institutional diversity is essential: the project aims to bring as many regional organisations as possible together to build a collective infrastructure. Many of the largest AI models do not fully disclose where their training data come from. Latam-GPT, by contrast, publishes its sources. This transparency is a direct response to what Paola Ricaurte and other scholars identify as one of the core problems of a kind of data colonialism: opacity. Knowing where the data come from makes it possible to examine whose perspectives they convey and whose they exclude.
In addition, the Latam-GPT team anonymises all sensitive personal information. From CENIA’s perspective, curating data according to ethical criteria is one of the project’s most important aims. This is not just a technical issue but a political statement. It involves asking whose data are included, how they are contextualised, and what risks can be averted through anonymisation. Given the international race for market dominance, commercial models generally do not pay such careful attention to how they treat sources. For the Latam-GPT team, data quality is more important than data volume. It monitors how strongly regions are represented so that no single country dominates. If a country is found to be underrepresented, the team proactively seeks partners there, explains CENIA director Álvaro Soto.
The developers also aim for thematic diversity. Politics, sport, art and other fields are to be covered in such a way that a full spectrum of Latin American lives is represented — not simply areas that are easiest to digitise. This push for balance is fuelled by an awareness that structural inequality inevitably leads to unbalanced data.
Latam-GPT takes a fundamentally different approach from commercial models, which optimise for data volume rather than representational fairness. When an AI is trained mainly on information in regional languages, it learns regional linguistic logics, idiomatic expressions and cultural contexts that English-dominated models often fail to capture or mistranslate.
Alexandra García, a project specialist at CENIA, for example, tested a commercial AI model a few years ago by asking: “How do you eat sopaipillas in Chile?” She was told that the typical dish consisted of fried bread with honey; in fact, they are fried discs of pumpkin dough served with mustard, chilli sauce or syrup. “That may sound like a small detail, but the model was wrong — and that says a lot about whom it was developed for,” García told the US radio station The World.
Latam-GPT’s critical approach to data also makes it more relevant for practical applications. When the model is used to deal with localised issues, such as reducing school dropout rates or shortening waiting times in public healthcare, computing power alone is not enough: an understanding of regional social dynamics is essential.
Because of its limited capacities, Latam-GPT is expected to focus primarily on the social sciences and humanities. This is not necessarily a disadvantage; it can also be strategic. Nevertheless, the model cannot escape infrastructural realities. Paola Ricaurte, who also leads the Latin American Feminist AI Research Network and the team behind the “AI Decolonial Manyfesto”, says that “feminist values and the needs of local communities” must be at the centre of AI development.
In this context, she introduces the concept of pluriversality—the idea that many different forms of knowledge are included rather than adopting a single perspective. Biases that generally shape AI accumulate when a disproportionate share of the training data comes from the Global North, is generated by men, and thus sets corresponding perspectives as universal. For Latin American women, this double exclusion — geographical and gender-based — means that their lived realities are often not understood by existing AI tools.
The success of Latam-GPT will not be measured by whether it can calculate complex equations as quickly as OpenAI’s GPT-5. Rather, it will be about building and expanding an infrastructure in which greater equality prevails among marginalised countries and people. This hinges on two key points: resisting dominance from the Global North, and at the same time reducing historically entrenched inequalities between countries of the Global South and within their populations.