AI’s treatment of historically loaded and emotionally charged topics is becoming a governance issue
This is a guest blog authored by Humane Intelligence volunteers, exploring AI evaluations and related sociotechnical topics. The co-author is Nkechika Ibe.
As large language models are increasingly deployed across borders, their ability to handle historically loaded and emotionally charged topics is becoming a governance issue, not just a technical one. AI systems are often described as neutral and global, yet their behavior can change noticeably depending on language, context, and collective memory. When global systems learn local narratives unevenly, neutrality becomes fragile rather than guaranteed. The issue is not intent or hostility, but overconfident generalization based on incomplete cultural signals. These effects are especially visible in low-resource languages, where training data is uneven and historical narratives are deeply layered.
Recent research supports these concerns. Studies on multilingual language models show that cultural and social biases are not removed by multilingual capability; instead, they are redistributed across languages, often reflecting dominant English-language or Western cultural frames. As a result, the same model may appear aligned or neutral in one language while drifting toward bias or distortion in another.
At the same time, low-resource languages remain systematically under-represented in training datasets and evaluation benchmarks. Most widely used benchmarks focus on a small number of high-resource languages, which leads to weaker performance, poorer alignment, and higher error rates elsewhere. This imbalance is reinforced by market concentration: a small number of large technology companies dominate the development of foundational models, and these systems are overwhelmingly optimized on English and Chinese data (~90%), leading to cultural and linguistic inaccuracies in other languages. For countries without strong representation in this ecosystem, this creates dependency and limits their influence over how AI systems behave, are governed, or align with local legal and cultural norms.
Hallucination is another well-documented failure mode in multilingual and low-resource contexts. When models attempt to sound neutral or balanced on sensitive topics, they may fabricate facts or invent compromise narratives that never existed. In historically complex settings, this tendency can distort reality rather than clarify it, producing fluent but misleading outputs. Together, these findings suggest that multilingual models cannot be assumed to be culturally neutral simply because they operate across many languages.
To explore this firsthand, Humane Intelligence volunteer Károly Boczka built a trilingual benchmark focused on Croatian, Serbian, and Hungarian. The countries where these languages are spoken share much of their history, but their interpretations of it often differ. The benchmark was intentionally provocative and designed to test how models respond when prompts touch on identity, nationalism, and unresolved historical disputes.
Across all three languages, bias appeared. When pressure increased, some answers moved closer to the dominant narrative associated with that language rather than keeping analytical distance. The same underlying question could receive meaningfully different treatment simply by changing the language of the prompt. In one particularly striking case, Claude followed the role-play instructions with exceptional fidelity and embodied a nationalist persona more convincingly than other models. However, that realism came with a downside: parts of the response crossed into hate speech, framing violence against another population not as a crime but as justified retaliation. The model did not invent hostility; it reproduced inherited historical narratives with high fluency, without recognizing their moral or ethical implications.
Hallucinations also played a role. Several prompts triggered fabricated historical details; most often the models seemed to create compromise positions that never existed, favoring a reconciliatory story over an accurate but uncomfortable reality. Taken together, these results act as a mirror of how machines absorb cultural heritage.
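The cross-language comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual implementation: the `Rating` record, the stance labels, the language codes, and the sample data are all hypothetical, and it assumes a human reviewer has already labeled each model response.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One human-reviewed model response (hypothetical schema)."""
    question_id: str  # the same underlying question across languages
    language: str     # prompt language, e.g. "hr", "sr", "hu"
    model: str
    stance: str       # reviewer-assigned label, e.g. "neutral" or "partisan"


def find_language_divergence(ratings):
    """Return the (question, model) pairs whose stance changes with prompt language."""
    grouped = defaultdict(dict)
    for r in ratings:
        grouped[(r.question_id, r.model)][r.language] = r.stance
    # Keep only the pairs where reviewers assigned more than one distinct stance.
    return {
        key: by_language
        for key, by_language in grouped.items()
        if len(set(by_language.values())) > 1
    }


# Made-up sample data for illustration; not results from the benchmark.
ratings = [
    Rating("q1", "hr", "model_a", "neutral"),
    Rating("q1", "sr", "model_a", "neutral"),
    Rating("q1", "hu", "model_a", "partisan"),
    Rating("q2", "hr", "model_a", "neutral"),
    Rating("q2", "sr", "model_a", "neutral"),
    Rating("q2", "hu", "model_a", "neutral"),
]

divergent = find_language_divergence(ratings)
for (question_id, model), by_language in divergent.items():
    print(question_id, model, by_language)
```

Keeping the stance label a human judgment, rather than something the script computes, reflects the point that cultural nuance is hard to score automatically. The code only surfaces where the same question is treated differently across languages so that reviewers know where to look.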
As a second case study, Humane Intelligence volunteer Nkechika Ibe explored how inaccurate translations in low-resource languages misrepresent the original meaning of a word or sentence. She observed that such misrepresentation can distort the user's understanding of what the original text actually means, unintentionally shifting the user's cultural viewpoint and knowledge.


All three models tested, ChatGPT, DeepSeek, and Gemini, mistranslated text in Igbo, an Indigenous language of one of the dominant tribes in Nigeria. The three screenshots above show the translated responses from each model. The original text should translate to “The Nigerian president has a lot of ceremonies across the country this Christmas season.” While ChatGPT and DeepSeek offered translations that were a bit closer to the intended meaning, all three models translated the text literally, losing the context and the main idea of the original. Gemini deviated the most from the intended meaning.
Bad actors can exploit poor or literal translations, and the distorted understanding they create, to spread misinformation capable of fueling violence and hate speech. The inability to capture the contextual meaning of a word or sentence in Indigenous languages can undermine the quality of the information these AI systems produce.
These two case studies underline why multilingual and culturally aware human oversight remains essential. Even strong models can reproduce distortion when fluent language hides bias, omission, or invented consensus. For AI governance, this means paying attention not only to datasets and algorithms, but also to nuances in language and culture. Low-resource languages should not be treated as edge cases. They function as early warning signals, revealing where neutrality breaks first and where governance needs to focus before similar failures scale globally.