INHUMAIN.AI
The Watchdog Platform for Inhuman Intelligence
Documenting What Happens When Intelligence Stops Being Human
AI Incidents (2026): 847 ▲ +23% | Countries with AI Laws: 41 ▲ +8 YTD | HUMAIN Partnerships: $23B ▲ +$3B | EU AI Act Fines: €14M ▲ New | AI Safety Funding: $2.1B ▲ +45% | OpenAI Valuation: $157B ▲ +34% | AI Job Displacement: 14M ▲ +2.1M | HUMAIN Watch: ACTIVE 24/7

AI and Digital Colonialism: When Silicon Valley Becomes the New Empire

Investigation into AI as a vector of digital colonialism: data extraction from the Global South, algorithmic bias against non-English speakers, African talent exploitation, Latin American content moderation labor, the 95% English training data problem, sovereign AI resistance, and indigenous data sovereignty movements.

The history of colonialism is the history of extraction: resources taken from peripheries to enrich centers, value flowing in one direction, and the populations whose labor and land produced that value receiving little or nothing in return. The specific resources change — gold, rubber, oil, labor — but the structural dynamics remain consistent across centuries.

Artificial intelligence is reproducing these dynamics with digital resources. Data is extracted from populations across the Global South to train AI systems that are owned, operated, and monetized by companies in the United States and, increasingly, China. The human labor required to make AI systems functional — data labeling, content moderation, reinforcement learning from human feedback — is disproportionately sourced from low-wage workers in Kenya, the Philippines, Venezuela, India, and other developing nations. The AI systems built with this extracted data and labor are then deployed back into these same populations, often performing poorly because they were designed for English-speaking, Western contexts and retrofitted for everywhere else.

This is not a metaphor. It is a structural analysis. The term “digital colonialism” is used by scholars, policymakers, and activists across the Global South to describe the extractive relationship between the technology platforms of the Global North and the populations of the Global South. AI has intensified this relationship by creating new categories of extraction — behavioral data, linguistic data, cultural knowledge — and new categories of exploitation — ghost labor, algorithmic imposition, infrastructure dependency.

INHUMAIN.AI covers the global AI landscape, and the global AI landscape includes the billions of people for whom AI is not a product they use but a system imposed upon them. This analysis maps the extractive dynamics of AI across data, labor, language, infrastructure, and governance.


Data Extraction: The New Resource Curse

The AI industry’s appetite for data is effectively unlimited. Large language models are trained on trillions of tokens of text. Computer vision models require billions of labeled images. Recommendation systems consume behavioral data from billions of users. The quality and quantity of training data directly determine AI system performance.

Where does this data come from? The honest answer is: from everywhere, with consent from almost no one.

The Web Scraping Pipeline

The primary source of training data for large language models is the public internet, collected through web scraping. Common Crawl, a nonprofit that maintains a regularly updated archive of the web, is the foundational data source for most major LLMs. Its archive contains petabytes of data scraped from billions of web pages worldwide.

This data includes content created by people across the Global South — blog posts, social media content, forum discussions, news articles, government documents, creative works — scraped without the knowledge or consent of the people who created it and without any compensation flowing to them or their communities.
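The scale of this pipeline is visible in Common Crawl's public CDX index, which can be queried for every capture of a given site. The sketch below shows how a training pipeline might enumerate captures; the crawl label, domain, and sample record are illustrative placeholders, not real data:

```python
# Sketch: enumerating Common Crawl captures for a domain via the public
# CDX index. Crawl label and domain below are illustrative placeholders.
import json
from urllib.parse import urlencode

CDX_HOST = "https://index.commoncrawl.org"

def cc_index_url(crawl: str, domain: str) -> str:
    """Build a CDX index query for all captures under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{CDX_HOST}/{crawl}-index?{params}"

def parse_cdx_record(line: str) -> dict:
    """Each line of the index response is a JSON object for one capture."""
    rec = json.loads(line)
    return {"url": rec["url"], "timestamp": rec["timestamp"], "mime": rec.get("mime")}

# Example (no network call): the kind of record the index returns.
sample = '{"url": "https://example.org/post", "timestamp": "20240101000000", "mime": "text/html"}'
record = parse_cdx_record(sample)
```

A real pipeline would page through thousands of such records per domain and fetch the underlying archive files. Nothing in the process asks, or even identifies, the people who created the content.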

The extractive dynamic is asymmetric along every dimension:

Dimension | Global North | Global South
Data contributed | Large volume, high-quality, English-dominant | Growing volume, diverse languages, underrepresented
Value captured | Virtually all (models owned by US/Chinese companies) | Virtually none
AI products received | Optimized for English/Western contexts | Poorly adapted, bias-prone, limited language support
Governance over data use | GDPR, state privacy laws (imperfect but existing) | Minimal regulation, weak enforcement
Consent obtained | Terms of service (dubious consent) | None (data scraped from public internet)

Platform Data Extraction

Beyond web scraping, global technology platforms — Meta, Google, TikTok, Amazon — collect behavioral data from billions of users across the Global South. This data is used to train AI systems that improve those platforms’ products, generating revenue for the platform operators while the users whose data drives the improvement receive nothing beyond access to the platform itself.

The scale is enormous. Meta’s platforms (Facebook, Instagram, WhatsApp) have approximately 3 billion monthly active users, a disproportionate share of whom are in the Global South, where these platforms have become de facto communication infrastructure. WhatsApp is the primary messaging platform for hundreds of millions of people in India, Brazil, Nigeria, and across Africa. The behavioral data generated by this usage — communication patterns, social graphs, content preferences, commercial transactions — flows to Meta’s servers and feeds its AI systems.

The argument that users “consent” to this data extraction by accepting terms of service is legally thin and ethically hollow. In contexts where a platform is the primary means of communication, refusing to accept terms of service is not a meaningful choice. It is like “consenting” to breathe polluted air because the alternative is not breathing.


The Ghost Workers: AI’s Hidden Labor Force

AI systems do not train themselves. Behind every large language model, every content recommendation algorithm, every computer vision system is an army of human workers who perform the labor that makes AI functional: labeling data, moderating content, providing feedback on AI outputs, and performing the thousands of small tasks that bridge the gap between raw computation and useful AI.

This labor force is overwhelmingly located in the Global South, overwhelmingly low-paid, and almost entirely invisible to the users of AI products.

Data Labeling

Data labeling — the process of annotating images, text, audio, and video so that AI systems can learn to recognize patterns — is a labor-intensive process that has created a global industry. Companies like Sama (formerly Samasource), Scale AI, Appen, and Remotasks operate data labeling facilities and crowdsourcing platforms that employ hundreds of thousands of workers, predominantly in Kenya, Uganda, India, the Philippines, and Venezuela.

The pay is typically between $1.50 and $5 per hour, depending on the task, the platform, and the country. By the standards of the locations where these workers are based, these wages may be competitive with local alternatives. By the standards of the value these workers create — they are producing the training data that makes billion-dollar AI systems functional — the compensation is extractive.

A data labeler in Nairobi who spends eight hours annotating images for a self-driving car dataset earns approximately $12-$20 for a day’s work. The autonomous vehicle system trained on that dataset may eventually be valued at billions of dollars. The labeler receives no equity, no royalties, no ongoing compensation, and no credit.

Content Moderation

Content moderation — the process of reviewing user-generated content to remove material that violates platform policies — is the most psychologically damaging form of AI-adjacent labor. Moderators review images and videos depicting violence, child sexual abuse, terrorism, self-harm, and other disturbing content, making decisions about what to remove at high volume and under time pressure.

Major platforms outsource content moderation to contractors in the Global South. Meta’s content moderation for Africa is largely performed by workers in Kenya, employed by outsourcing firms. These workers have described being exposed to traumatic content for hours each day, with inadequate psychological support, restrictive contracts that prevent them from discussing their work, and the constant threat of termination for missing productivity targets.

In 2023, a Kenyan court ruled in favor of former content moderators who had sued Meta and its contractor over working conditions, including allegations of inadequate support for workers exposed to traumatic content. The case highlighted the gap between the working conditions of content moderators in the Global South and the values the platforms they serve publicly profess.

RLHF: The Human Behind the AI

Reinforcement Learning from Human Feedback (RLHF) — the technique used to align large language models with human preferences — requires human evaluators to read AI outputs and rate them for quality, helpfulness, harmfulness, and accuracy. This work is intellectually demanding, often psychologically taxing (evaluators must read and assess toxic, hateful, and disturbing content), and disproportionately performed by workers in the Global South.

OpenAI’s partnership with Sama for RLHF labeling was reported on extensively. Workers in Kenya were paid approximately $1.50-$2.00 per hour to read and classify text that included descriptions of violence, abuse, and other disturbing content. The contract was eventually terminated, reportedly due to the psychological impact on workers.

The RLHF labor pipeline creates a particularly stark irony: AI systems are being made “safe” and “aligned” through the labor of workers who are themselves subjected to unsafe working conditions and whose interests are not aligned with those of the companies that employ them.


The Language Gap: 95% English, 100% Deployed

The most fundamental form of algorithmic bias in AI is linguistic. Large language models are trained predominantly on English-language text, yet they are deployed to serve populations across the world’s 7,000+ languages.

The Numbers

Statistic | Value
Estimated share of LLM training data in English | 90-95%
Share of world population that speaks English natively | ~5%
Share of world population that speaks English (any proficiency) | ~17%
Number of languages with meaningful LLM support | <100
Number of living languages worldwide | ~7,000
Languages with virtually no digital text corpus | ~5,000+

This disparity means that AI systems perform dramatically better for English speakers than for speakers of other languages. A user interacting with ChatGPT, Claude, or Gemini in English receives responses that are more accurate, more nuanced, more culturally contextual, and more linguistically sophisticated than a user interacting in Swahili, Yoruba, Bengali, or Quechua.

The performance gap is not merely a quality-of-service issue. It is a structural inequality that compounds existing disadvantages. When AI systems are used in healthcare, education, legal services, and government — and they increasingly are — a performance gap between English and other languages translates directly into a gap in the quality of services available to different populations.

Arabic: A Case Study

Arabic illustrates the challenges particularly well, and is relevant to INHUMAIN.AI’s coverage of Gulf AI strategies.

Arabic is spoken by approximately 400 million people across 25 countries. It exists in three broad registers: Classical Arabic (the language of the Quran and classical literature), Modern Standard Arabic (the formal written and broadcast language), and numerous spoken dialects (Egyptian, Levantine, Gulf, Maghrebi, and others) that differ from one another as much as the Romance languages do.

Most AI training data in Arabic is in Modern Standard Arabic, which is used in formal writing but is not the native spoken language of any Arabic speaker. The spoken dialects that people actually use in daily communication are poorly represented in training data because they are primarily oral or informal (text messages, social media posts with non-standard spelling and grammar).

The result is AI systems that can process formal Arabic reasonably well but struggle with the language as it is actually spoken. An Egyptian user texting in Egyptian Arabic, a Moroccan user posting in Darija, or a Gulf user communicating in Khaliji encounters AI systems that misunderstand, mistranslate, or simply fail.

The UAE’s Technology Innovation Institute developed Jais specifically to address this gap. But one model, trained by one institution, does not solve the structural problem. The Arabic language is underserved by AI because Arabic speakers’ data is undervalued by the companies that build AI systems. See: Gulf States AI.

African Languages

The situation for African languages is far worse than for Arabic. Africa is home to approximately 2,000 languages, very few of which have meaningful digital text corpora. Hausa, Yoruba, Igbo, Amharic, Swahili, Zulu, and a handful of others have limited but growing digital presence. The vast majority of African languages have virtually no digital text that could serve as AI training data.

This means that AI systems are effectively non-functional in the languages spoken by hundreds of millions of Africans. The AI revolution, as experienced by a Hausa speaker in northern Nigeria or a Wolof speaker in Senegal, is a revolution that does not include them.

Projects like Masakhane, a grassroots research initiative focused on NLP for African languages, are working to address this gap. But these projects operate with budgets that are infinitesimal compared to the resources available to commercial AI labs, and they face the fundamental challenge that the economic incentives for multilingual AI development do not align with the needs of underserved language communities.


Cloud Dependency: Infrastructure as Control

The Global South’s relationship with AI is mediated by cloud infrastructure that is overwhelmingly controlled by three American companies: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). These three providers control approximately 65% of the global cloud market.

This concentration creates multiple forms of dependency:

Computational dependency. Any organization in the Global South that wants to train, deploy, or use AI systems must do so on infrastructure owned and operated by American companies. The compute required for AI is not available locally, and building local alternatives requires capital and expertise that most developing nations lack.

Data residency. When organizations in the Global South use American cloud services, their data is stored on infrastructure governed by US law, including the CLOUD Act, which allows US law enforcement to compel American companies to produce data stored anywhere in the world. This creates sovereignty concerns: data about citizens of Kenya, Brazil, or Indonesia may be subject to American legal jurisdiction regardless of where the data was generated.

Pricing power. The concentration of cloud infrastructure gives the dominant providers pricing power over Global South users. Cloud computing costs are set by American companies for global markets, and they may not reflect the purchasing power or economic context of developing nations.

Exit barriers. Once an organization’s AI systems are deployed on a particular cloud platform, migrating to another provider is technically complex and expensive. This creates lock-in effects that reduce the bargaining power of developing-nation users.

India’s response to this dependency has been the development of domestic cloud infrastructure, including the India AI compute initiative, which aims to build sovereign GPU clusters. But India is one of the few developing nations with the scale, capital, and technical base to attempt this. For most Global South nations, cloud dependency is a structural condition, not a problem they can solve.


Algorithmic Bias: Encoded Inequality

AI systems encode the biases present in their training data, and their training data is predominantly Western, English-language, and reflective of the perspectives, values, and demographics of the Global North. When these systems are deployed in the Global South, they import those biases.

Facial Recognition

Facial recognition systems have been extensively documented to perform worse on darker-skinned individuals, particularly darker-skinned women. Research published in 2018 demonstrated that commercial facial recognition systems from IBM, Microsoft, and Face++ had error rates for darker-skinned women that were 10-34 percentage points higher than for lighter-skinned men.
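Findings like these come from disaggregated evaluation: computing the error rate separately for each demographic group rather than reporting a single aggregate. A minimal sketch of such an audit, using synthetic counts invented purely to illustrate the computation:

```python
# Sketch: auditing a classifier's error rate per demographic group.
# The prediction outcomes below are synthetic, chosen only to show how
# a single aggregate accuracy can hide a large per-group disparity.
from collections import Counter

# (group, prediction_correct) for each test example -- synthetic data.
results = (
    [("lighter_male", True)] * 99
    + [("lighter_male", False)] * 1
    + [("darker_female", True)] * 68
    + [("darker_female", False)] * 32
)

def error_rates(results):
    """Error rate per group: errors / total, disaggregated."""
    totals, errors = Counter(), Counter()
    for group, correct in results:
        totals[group] += 1
        if not correct:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

rates = error_rates(results)
gap = max(rates.values()) - min(rates.values())  # disparity in error rate
```

With these synthetic counts the aggregate error rate looks respectable, while the gap between groups is over thirty percentage points: exactly the pattern the 2018 research documented in commercial systems.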

When these systems are deployed in the Global South — for border control, law enforcement, identity verification, and access to services — their higher error rates translate into disproportionate harm to the populations they are deployed among. A facial recognition system that works well for European faces and poorly for African faces, deployed in an African country, is a colonial technology: it was built for one population and imposed on another.

Credit and Financial Systems

AI-driven credit scoring systems, increasingly deployed in developing nations through mobile banking platforms, encode assumptions about creditworthiness that may not apply across economic and cultural contexts. A credit model trained on US consumer data embeds American assumptions about employment patterns, spending behavior, and financial infrastructure that may not transfer to the informal economies, communal financial practices, and different institutional contexts of the Global South.

The deployment of these systems without adequate localization and validation can result in systematic denial of credit to populations that are creditworthy by local standards but do not match the patterns the AI was trained to recognize. This is not a technical error. It is the predictable consequence of deploying a system designed for one context in another.

Healthcare AI

AI diagnostic tools trained on medical imaging from predominantly Western populations perform worse on patients with darker skin or with conditions that present differently in non-Western populations. Dermatological AI, for example, has been documented to perform poorly on darker skin because its training data disproportionately represents lighter-skinned patients.

When these tools are deployed in healthcare settings in the Global South — often promoted as solutions to the shortage of specialist physicians — they risk misdiagnosis, delayed treatment, and inappropriate care. The promise of AI democratizing healthcare access becomes, in practice, AI exporting Western medical biases to populations that need contextually appropriate medical AI.


Sovereign AI as Resistance

Across the Global South, a growing movement of scholars, policymakers, technologists, and activists is articulating alternatives to the extractive model of AI development. The concept of “sovereign AI” — AI capability developed under national or regional control, using local data, serving local needs — has become a rallying point.

African Union AI Strategy

The African Union’s continental AI strategy, adopted in 2024, emphasizes data sovereignty, local capacity building, and the development of AI systems that address African priorities. Key principles include:

  • African data should be governed by African institutions
  • AI systems deployed in Africa should be evaluated for contextual appropriateness
  • African AI talent should be developed and retained on the continent
  • The benefits of AI should be distributed equitably, not extracted by foreign platforms

The strategy is aspirational. Its implementation depends on the capacity and political will of individual African nations, many of which face severe resource constraints. But it represents a significant articulation of an alternative framework for AI development.

India’s Data Localization

India has pursued data localization requirements that mandate certain categories of data be stored and processed within Indian territory. The Digital Personal Data Protection Act (2023) and sectoral rules impose requirements on cross-border data transfers that are explicitly designed to prevent the extraction model that has characterized India's relationship with American technology platforms.

India’s AI initiatives also emphasize the development of models for Indian languages (Hindi, Tamil, Telugu, Bengali, and others that are spoken by hundreds of millions of people but poorly served by existing AI systems) and the creation of Indian-controlled compute infrastructure.

Latin American Initiatives

Several Latin American nations have articulated AI strategies that emphasize sovereignty and equity:

  • Brazil has developed AI ethics guidelines that emphasize the rights of vulnerable populations and has proposed AI regulation modeled on elements of the EU AI Act
  • Chile has established an AI policy framework that addresses the concentration of AI power in foreign companies
  • Colombia has developed AI ethics frameworks through multi-stakeholder processes that include indigenous and Afro-Colombian communities

These initiatives face the same structural challenge: the capital, compute, talent, and institutional capacity required for sovereign AI development are concentrated in the Global North, and building alternatives requires resources that most developing nations do not have.


Indigenous Data Sovereignty

The most radical articulation of resistance to AI extractionism comes from indigenous data sovereignty movements, which assert the rights of indigenous peoples to control data about their communities, cultures, territories, and traditional knowledge.

Te Mana Raraunga (New Zealand)

Te Mana Raraunga, the Māori Data Sovereignty Network, was founded in 2015 and has become a globally influential articulation of indigenous data rights. The network's principles assert that data about Māori people, collected by any entity, remains subject to Māori governance and that the use of such data must serve Māori interests.

This framework challenges the fundamental assumption of AI development: that data on the public internet is available for training without restriction. Te Mana Raraunga argues that data about indigenous peoples is not a freely available resource but a taonga (treasure) that carries obligations and relationships.

CARE Principles

The CARE Principles for Indigenous Data Governance — Collective Benefit, Authority to Control, Responsibility, and Ethics — provide a framework that complements the FAIR data principles (Findable, Accessible, Interoperable, Reusable) with an explicitly equity-oriented framework:

FAIR Principles | CARE Principles
Focus on data quality and access | Focus on people and purpose
Technology-oriented | Rights-oriented
Enable data sharing | Enable data sovereignty
Emphasize openness | Emphasize governance

The tension between FAIR and CARE is directly relevant to AI. The AI industry’s demand for open, accessible training data conflicts with indigenous communities’ demand for control over data that represents their cultural heritage, traditional knowledge, and collective identity. An AI system trained on scraped indigenous cultural material without permission is engaged in a form of cultural extraction that indigenous communities have explicitly rejected.

First Nations Information Governance (Canada)

The First Nations Information Governance Centre (FNIGC) in Canada has established the OCAP principles (Ownership, Control, Access, Possession) as a framework for governing research data about First Nations peoples. OCAP asserts that First Nations communities own, control, have access to, and physically possess data about their communities.

These frameworks have influenced broader discussions about data governance, and they present a fundamental challenge to the AI industry’s operating model. If data sovereignty becomes a widely recognized principle — not just for indigenous communities but for all communities — the extractive model of AI training data collection becomes legally and ethically untenable.


The Content Moderation Colony

The global content moderation industry has been described as a “colonial economy” by researchers studying the labor conditions of content moderators in the Global South.

The structure is straightforward: American and Chinese technology platforms generate enormous volumes of user content that must be reviewed for policy compliance. This review is performed by workers in the Global South — Kenya, the Philippines, India, and increasingly Venezuela and other countries with large populations of educated, English-speaking workers willing to accept low wages.

The work is inherently harmful. Moderators are exposed to traumatic content — child sexual abuse material, graphic violence, terrorism, self-harm — for hours each day. The psychological consequences are well-documented and include PTSD, anxiety, depression, and substance abuse.

The economic structure is extractive. Moderators are typically employed by outsourcing firms (Sama, Majorel, Teleperformance, Accenture) rather than directly by the platforms they serve. This insulates the platforms from labor liability while ensuring that the lowest-wage workers in the chain bear the highest psychological costs.

The content moderation economy also reveals a disturbing irony: AI systems are being made “safe” through the labor of workers who are themselves subjected to conditions that are manifestly unsafe. The toxicity removed from platform users’ experience does not disappear. It is absorbed by workers in the Global South, and the AI systems trained on their moderation decisions learn to replicate their judgments at scale — displacing the workers whose labor made the AI functional in the first place.


Resistance and Alternatives

The dynamics described in this analysis are not inevitable. They are the product of specific policy choices, business models, and power structures that can be challenged and changed.

Data trusts and cooperatives. Several proposals envision data governance models in which communities collectively control their data and negotiate terms for its use, including compensation. Data trusts could give Global South communities bargaining power over the use of their data in AI training.

Local AI development. Initiatives like Masakhane (African NLP), AI4D Africa, and various national AI programs aim to build AI capability within the Global South, using local data, serving local needs, and retaining value locally.

Regulatory frameworks. The African Union’s AI strategy, India’s data localization requirements, and Brazil’s AI ethics guidelines represent emerging regulatory responses to digital extraction. These frameworks are imperfect and unevenly enforced, but they represent the beginning of a governance infrastructure for equitable AI.

Fair labor standards. The Fairwork Foundation and other organizations have developed frameworks for assessing and improving the labor conditions of digital workers, including AI data laborers. These frameworks provide a basis for advocacy and accountability.

Open-source multilingual AI. Projects that develop open-source AI models with genuine multilingual capability — not English models with other languages bolted on as an afterthought — could reduce dependency on American commercial AI systems. But these projects require sustained funding and institutional support that has been difficult to secure.


The Central Question

The central question of AI and digital colonialism is the same question that has defined colonial dynamics for centuries: who benefits?

AI is creating enormous value. That value is being captured overwhelmingly by a small number of companies in a small number of countries. The data, labor, and attention of billions of people in the Global South contribute to this value creation but receive almost nothing in return. The AI systems built with their contributions are then deployed back to them, performing poorly, encoding biases, and deepening dependency on foreign infrastructure.

This is not a technology problem. It is a power problem. The technology could be developed differently — with consent, with compensation, with local participation, with genuine multilingual capability, with respect for data sovereignty. It is not, because the current extractive model is enormously profitable for those who benefit from it, and because the populations who bear its costs lack the political and economic power to demand change.

INHUMAIN.AI exists to make these dynamics visible. The first step in challenging digital colonialism is documenting it. That is what this analysis does. The second step is building alternatives. That is the work of the movements, institutions, and communities described here. We watch. We document. We amplify.

For the broader geopolitical context, see: AI Geopolitics: Who Controls Inhuman Intelligence Controls the Century.

For analysis of how Gulf states are positioning themselves within these dynamics, see: Gulf States AI: The $100 Billion Desert Bet.

For investigation of HUMAIN’s labor and data practices, see: HUMAIN Watch.

For the AI safety implications of biased and extractive AI development, see: The Complete Guide to AI Safety.