The Sum of Human Knowledge? Not in One Wikipedia Language Edition
Image credit: Denis Schroeder (WMDE), Wikidata Items Map 2014 - 2017.
“The sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension.” Ezra Pound
Every language edition has freedom and editorial independence
Though I had used Wikipedia for years, it was only ten years ago that I discovered how each language edition community can freely organize its content, as there is no central editorial board. The Catalan version of the encyclopedia, in my native tongue, can have pages dedicated to its culture without impediment. Some might take this for granted, but I cherished this principle because of my memories of my grandfather, who was forbidden to speak his language in public during the forty years of Franco’s dictatorship, and of my mother, who did not have the chance to be educated in her mother tongue. I did not immediately become a contributor, but I wanted to learn more and, hopefully, one day give back. Today, I am doing so as a researcher with the Wikipedia Cultural Diversity Observatory (WCDO). Though the English Wikipedia has brought much attention to the larger Wikimedia project, that project’s future and potential growth lie in many smaller languages and cultures, which are often overlooked and under threat, as many human languages are likely to disappear by the end of the century.
The poet Ezra Pound said that “the sum of human wisdom is not contained in any one language, and no single language is capable of expressing all forms and degrees of human comprehension”1. Obviously, the same is true of Wikipedia. At the observatory, we work to discover the knowledge that is local to each language, the cultural pearls from every place in the world, and promote its exchange. I believe this can be advanced by means of a model assessing project cultural diversity. Such a model will then allow us to better encourage Wikipedia language communities to raise awareness, organize events, adopt tools, and incorporate cultural diversity as part of their own strategic plans.
Researching the cultures in Wikipedia language editions
Although cultural diversity now appears to be a crystal-clear priority for the movement, it was not that obvious in 2011, when I attended my first Wikimania. At the most popular and crowded Wikipedia conference, the multitude of nationalities reminded me of an encyclopaedic version of the United Nations. Our apparent differences were in clothing, colors, gestures and many other details. Before the conference, a friend of mine asked me a key question: if English Wikipedia has most of the articles, why should there be hundreds of other language editions? I hesitated a bit, and my answer was that for the different language editions to exist, they had to be different.
Finding these differences became my main interest in Wikipedia. Even though I was initially more focused on the Catalan Wikipedia, I found an exciting quest in using algorithms to compare the contents from any language edition. I could see the extent and particularities of the coverage of each topic in each language, as if they were patterns revealed in an aerial view, imperceptible to the eyes of other editors. Analyzing the editors’ behaviour and the extent of topics in articles became the object of my Master’s thesis and later of my PhD thesis. By understanding how this editing process unfolds in the data and in other researchers’ work, I found many reasons to justify the need for multiple language editions. I will try to summarize them into three.
The first aspect I saw during my research was that the articles of every language edition are limited to specific groups of points of view or have a “linguistic point of view.” This was something intuitive to any Wikipedia user. Some topics are dealt with very differently in the Catalan and Spanish Wikipedias, especially those concerning politics and culture. Hecht and Gergle2 showed us that these variations in points of view between the language versions of the same article could be measured by taking into account the outgoing links in the text they have in common. Even in general topics, like ‘Psychology’, one can find differences of 20% in the links pointing at different articles. Massa and Scrinzi3 pointed out that topics that elicit controversy, such as the articles about the terrorist “Osama Bin Laden” or the international struggle “Israeli-Palestinian conflict”, showed the fewest links in common.
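To make the idea of comparing outgoing links more tangible, here is a minimal sketch in Python. It is not the method used by Hecht and Gergle or by Massa and Scrinzi; it simply assumes that the links of each article version have already been mapped to language-independent identifiers (for instance, Wikidata item IDs) and computes a Jaccard index over them. The function name and the toy data are hypothetical.

```python
def link_overlap(links_a, links_b):
    """Share of outgoing links that two versions of an article have in common.

    Computed as a Jaccard index: |A & B| / |A | B|. Both arguments are
    iterables of language-independent identifiers (an assumption of this sketch).
    """
    a, b = set(links_a), set(links_b)
    union = a | b
    if not union:
        # Two versions without any links are treated as having no overlap.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data: identifiers of the articles each version links to.
catalan_links = {"Q1", "Q2", "Q3", "Q4"}
spanish_links = {"Q2", "Q3", "Q4", "Q5"}

overlap = link_overlap(catalan_links, spanish_links)
print(f"Links in common: {overlap:.0%}")
```

A low value for a pair of versions would suggest that the two communities frame the topic through different related concepts, which is the kind of difference described above.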
This led me to think that even though Wikipedia asks for a neutral point of view (NPOV) (i.e. a fair representation of the different available points of view on a topic), we know this is an ideal. Since a language edition is a community phenomenon, group interests and power dynamics tend to reinforce or undermine certain points of view. Some perspectives are unknown or simply ignored, and very few are novel or exclusive to that particular group of speakers. This latter category is very valuable. Such novelty and uniqueness is in fact a contribution, and should be seen as a valuable complement to other language editions.
Linguists sometimes defend a linguistic perspective by saying that every language is a specific worldview, or at least, one of a particular context. Each language you speak gives you concepts to map things and situations, and classify them according to the experience of generations. Any language accumulates knowledge in the vocabulary used to label the species of plants, the nouns to describe climatological changes in the natural environment, and the idioms and adjectives that have originated in order to understand human character and history in a specific way. Being able to compare linguistic differences and observe from multiple perspectives allows you to contrast and understand reality better.
The eminent linguist Benjamin Lee Whorf went a bit further with this perspective and reinforced the idea that we need more than one language to gain depth in thinking. He claimed that all knowledge is provisional, and therefore, multilingual competences allow you to advance faster in its development. “Western culture has made, through language, a provisional analysis of reality and, without correctives, holds resolutely to that analysis as final. The only correctives lie in all those other tongues which by aeons of independent evolution have arrived at different, but equally logical, provisional analyses”4. This quote inevitably reminded me of how Wikipedia allows us to compare the different points of view, jumping through the parallel versions of an article that exists in several language editions.
The second aspect I saw during my research was that the language editions are influenced by the territories where the language is spoken and they are the most complete at creating content about them. Hecht and Gergle5 measured in several language editions the number of links directed to articles geolocated on the territories where the language is spoken. With such a simple metric they could determine that each Wikipedia tends to be self-focused, as results indicated that these articles received many more links than other geolocated articles, i.e., they were more prominent in the linked graph structure.
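As an illustration only, the following Python sketch shows one simple way such a comparison could be computed. It is not Hecht and Gergle's exact metric: it assumes each geolocated article has already been assigned to a territory and that its number of incoming links is known, and it compares the average number of incoming links of articles inside and outside the territories where the language is spoken. All names and figures are hypothetical.

```python
from statistics import mean


def self_focus_ratio(articles, home_territories):
    """Ratio of average incoming links: home-territory articles vs. the rest.

    `articles` is a list of dicts with the keys 'territory' and 'inlinks'.
    A ratio above 1 means home-territory articles are, on average, more
    prominent in the link graph of this language edition.
    """
    home = [a["inlinks"] for a in articles if a["territory"] in home_territories]
    other = [a["inlinks"] for a in articles if a["territory"] not in home_territories]
    if not home or not other:
        raise ValueError("need geolocated articles both inside and outside the home territories")
    if mean(other) == 0:
        raise ValueError("articles outside the home territories have no incoming links")
    return mean(home) / mean(other)


# Hypothetical toy data for a single language edition.
sample_articles = [
    {"title": "A", "territory": "Catalonia", "inlinks": 120},
    {"title": "B", "territory": "Andorra", "inlinks": 80},
    {"title": "C", "territory": "Japan", "inlinks": 30},
    {"title": "D", "territory": "Peru", "inlinks": 20},
]

ratio = self_focus_ratio(sample_articles, {"Catalonia", "Andorra"})
print(f"Self-focus ratio: {ratio:.1f}")
```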
Even though geolocated articles show relevant language differences, one could argue that this is only a small portion of each Wikipedia. The articles about many other topics such as traditions, history, organizations, politics, and so on can explain the idiosyncrasies of a culture and the territories where the language is spoken. This way, by collecting all the articles about these topics, I thought we could get a better idea of what is really genuine in the cultural and geographical contexts of every language edition.
I hence proposed an algorithm to collect such articles and I entitled the selection of articles “Cultural Context Content” (or CCC). My first questions were (1) how many articles would each Wikipedia dedicate to their cultural contexts, and more importantly, (2) what would be the extent of this group of articles.
As far as the Catalan Wikipedia was concerned, my expectation was that it would overcompensate for the linguistic and cultural genocide suffered during the past century, and that it would also be influenced by the current political self-determination struggle. This might result in an exaggerated number and proportion of articles set in this cultural context, which would be centered around Catalonia, Valencia, the Balearic Islands, Andorra and a few scattered territories in the south of France and in the Aragonese autonomous community. Surprisingly, the proportion was only 20%, and since the first measurement it has decreased to the current 17.09%6. In fact, taking into account the top forty language editions, the average proportion of content dedicated to their cultural context is a quarter of each Wikipedia7. Some, like the English and the Japanese, dedicated more than half of their articles to it. Others, like the German, French and Italian, had lower proportions (33.7%, 26.9% and 18.8% respectively).
It is difficult to answer why some Wikipedia language editions dedicate more articles to their context than others, as it may depend on many factors. The proportion of articles dedicated to CCC is not related to the density of the population, nor to the number of editors or the territorial area. But it is surely an indicator of appreciation towards their culture and places. The fact that the proportion of articles dedicated to CCC remains stable over time in every Wikipedia language edition implies that editors are motivated to continuously create and represent the most significant places around them. This came as a surprise to me, as I expected it would decrease with the growth of each Wikipedia language edition. Why would editors continue to create articles about their culture after the main cities, political figures and historical events have already been documented?
At the beginning I was not sure whether to consider the large extent of CCC as an undesired bias. But my interpretation of the presence of these “local encyclopedias” drifted from acceptance to encouragement, especially when I realized that the proportion of pageviews was even higher than the proportion of articles itself8. Then I assumed that each Wikipedia could have a fundamental role in illuminating the context of each language to readers, and this is probably a key ingredient to explain the overall success and popularity of Wikipedia. One could say that the differences that every language edition presents are even more valuable to readers than to editors, which totally justifies the effort.
Even though I have not yet verified whether this higher reader interest towards CCC articles is applicable to all language editions, the hypothesis “context-encyclopedia-key-ingredient-to-success” is very plausible. In smaller Wikipedias with little traffic we see the inverse trend. For instance, in some African vernacular languages, the proportion of articles dedicated to their context is very low. Considering 39 Wikipedia language editions in Africa, the average proportion of articles dedicated to each cultural context is 11.1% (median 13.8%). Why is that so? Because these languages are often relegated to private use while English or French is used for education and official matters. Only Afrikaans – a language with a social situation similar to European languages – has 23.9% of content dedicated to its context. Hence, we can say that cultural context content creation and consumption is a good indicator of a healthy Wikipedia in a society.
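For readers who prefer to see the arithmetic, the proportion discussed here can be sketched in a few lines of Python. The sketch assumes that the selection step has already been done, that is, every article of a language edition carries a flag saying whether it belongs to the CCC; the selection algorithm itself is not reproduced, and the function name and toy data are hypothetical.

```python
def ccc_proportion(article_flags):
    """Percentage of articles in a language edition flagged as CCC.

    `article_flags` maps an article title to True (belongs to the cultural
    context content) or False. How the flags are obtained is outside the
    scope of this sketch.
    """
    if not article_flags:
        return 0.0
    return 100 * sum(article_flags.values()) / len(article_flags)


# Hypothetical toy edition with five articles, one of them flagged as CCC.
toy_edition = {
    "Article 1": True,
    "Article 2": False,
    "Article 3": False,
    "Article 4": False,
    "Article 5": False,
}

print(f"CCC proportion: {ccc_proportion(toy_edition):.1f}%")
```

Repeating the same computation on snapshots taken at different dates would show whether the proportion grows, shrinks or stays stable as an edition expands.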
The third and probably the most relevant aspect I saw during my research was that language editions do not cover one another’s cultural context content, i.e. they do not have sufficient cultural diversity in their content. In 2012, Bao, Hecht, Carton, Quaderi, Horn and Gergle9 found that there is a language gap between Wikipedia language editions, that is, every language edition has many articles with no equivalent version in other languages. Also, contrary to what my Wikimania friend thought (that English Wikipedia would be the only necessary language edition, a sort of “catch-all encyclopaedia”), bigger language editions do not cover the articles from smaller ones. Considering this, I wondered whether this language gap could be due to the cultural context. The results showed that, on average, 60% of the articles that have not been translated into any other language edition are related to the cultural context of the language10.
When CCC articles were shared across languages, it tended to be with those geographically closer or with those language editions which had the largest number of articles (especially the English, German and French Wikipedias). It surprised me that sometimes articles related to the context of small Wikipedias were not covered at all, even though one might think it would be an easier effort for the community of editors. Some Wikipedians told me that the dynamics of multilingualism tend towards translating from bigger language editions into smaller ones. Besides, one must also consider the difficulties in accessing content in an unknown language about an unknown territory. As a result, big Wikipedia language editions do not cover the diversity of knowledge available in smaller languages either.
The excess content about the Western world is part of this so-called systemic bias. To me, the large amount of content Wikipedias devote to their context-based institutions, entertainment and sports does not seem to be the problem – as it is popular and read. Instead, it is the lack of reciprocal content about their cultural contexts that impedes reaching a minimum of content about the world’s cultural diversity. Perhaps even more important is the struggle of these small encyclopaedias to represent their own cultural context. We have to work on both cases.
The debate on the role of Wikipedia in the future of languages and human knowledge
The first article in a non-English Wikipedia was in the Catalan Wikipedia. It was about Àbac (Abacus), an ancient calculating tool, and it was written by an editor from Andorra named Cdani, who asked Jimmy Wales, the co-founder of Wikipedia, to create a Catalan Wikipedia where Catalan editors could write in their native language so as “not to inflict his terrible English” on the English Wikipedia, which had been created two months earlier11. In fact, Wikipedia has always been global and the need for growth is still very present. At the recent Wikimania 2018 held in South Africa, Jimmy Wales reminded the community about the “desire to be in every language and every culture, on every continent and in every place”12 and celebrated the first thousand articles in the Zulu Wikipedia.
With the recognition of milestones being reached by small languages, the Wikimedia movement acknowledges that information and knowledge are determinants of wealth creation and social development for any society in general. For several years this has been one of the main directives of UNESCO, which claims that the inclusion of languages in the digital world is urgent, as the digital divide will only increase their marginalization13. In this sense, Wikipedia has set a long-term strategic direction aimed at knowledge equity by 2030. This is understood as putting “the focus of our efforts on the knowledge and communities that have been left out by structures of power and privilege”, by breaking down “the social, political and technical barriers preventing people from accessing and contributing to free knowledge”14.
Barriers such as the digital divide – lack of Internet – prevent millions of people from using Wikipedia. At the same time, the inclusion of new languages in the Wikipedia project is not as easy as encouraging their speakers to become editors, as they come across other obstacles as well. Van Dijk15 states that several factors have a major impact: the lack of language standardization (including a common grammar), the degree of editor literacy, the language status, and the attitude of speakers towards their language. This latter factor is especially delicate, as speakers should have the conviction that their language is worthy of such an endeavour. But when the speakers internalize marginalization and a subsidiary position, it becomes very difficult to envisage that history could have been different, to revitalize the language and to grow a Wikipedia.
In fact, I believe that the problem of little content in Wikipedias of less-resourced languages should not only be seen as a language problem, but also as a local knowledge problem, considering that language and knowledge are inextricable. I am most certain that a way to help speakers of endangered languages to enter or expand Wikipedia is to send them the clear message that their knowledge matters, and that it is what we need in order to reach the best depiction of human cultural diversity. Conceivably, the language problem cannot be tackled without tackling the recognition of their speakers’ knowledge, encouraging its representation, or at least the representation of its most relevant concepts (i.e. geographical places, traditions and leaders) from the speakers’ points of view - as we suggest in the Wikipedia Cultural Diversity Observatory.
During the first ten years, Wikipedia grew to include around 260–270 active language editions, and since then it has remained stable at around 300. This is an incredibly low number compared to the approximately 7,000 languages that reportedly exist on the planet16. Many linguists like Dalby17 foresee massive language death in the next decades. Kornai18 presents evidence of a massive die-off caused by the digital divide, and estimates that only 5% of all languages can obtain an online presence (i.e. around 350 languages). The manner in which they can defy this fate and survive remains an open question. But it seems obvious to me that Wikipedia is the best available strategy for these endangered languages, independently of whether they are fully revitalized or not.
When we fear language loss, we may precisely fear the disappearance of that aforementioned worldview, one that required some time to get established and refined. No matter whether it is a real entire worldview or not, a collaborative encyclopaedia provides the best chance to allow the speakers of any language to immortalize it. The use of Wikipedia and this local knowledge in education may be crucial in order to have a chance to pass it on to further generations, and in any case, Wikipedia characteristics such as its wide variety of topics, linked nature and extensive use of images constitute a corpus of knowledge essential to revitalize the language or to study its nuances at any future point. The difficulty lies in breaking all the barriers and encouraging speakers to edit articles.
Given that Wikipedia is the fifth most visited website on the Internet, I have no doubt that its communities and strategic direction will react and set a clear example of leadership, making the necessary efforts to take up the cultural diversity challenge. Through its commitment to knowledge equity, Wikipedia is in a position to take a step forward towards cultural diversity. It would be easy to subscribe and commit to the UNESCO declaration on Cultural Diversity (2001)19, which defines cultural diversity “as a source of exchange, innovation and creativity” [...] “as necessary for humankind as biodiversity is for nature”. This means that defending cultural diversity is not only a matter of respect for the heritage, but a pragmatic decision towards humanity’s progress. The UNESCO declaration adds that “[cultural diversity is] the common heritage of humanity and should be recognized and affirmed for the benefit of present and future generations”. Making a public commitment to this declaration, accompanied by several measures such as revising content policies, would most surely bring positive results.
Maturity levels model for cultural diversity in Wikipedia communities
Once we have agreed that Wikipedia must take an active role in preserving cultural diversity, we might ask ourselves what we can do now with the current communities. How could we align all the movement members towards improving cultural diversity in their content? One way we find particularly useful is to evaluate the maturity of each language community in terms of cultural diversity. A maturity model allows us to understand the situation and barriers an organization comes across when incorporating certain elements in view of succeeding at a particular aspect20 (Mettler, 2011). For cultural diversity in Wikipedia, we propose that each language community work on its a) discourse, b) organization (through events and tools), c) degree of awareness of the gaps (through metrics and visualizations), and d) strategy (by setting goals and priorities).
Figure 2 below shows a preliminary version of the maturity model. The different sorts of barriers and levels are based on discussions I held with the communities during international Wikipedia conferences, while the different incorporated elements in the pursuit of cultural diversity are my own suggestions. I named the levels: (1) Unintentional, (2) Spontaneous, (3) Organized, (4) Controlled and (5) Distributed.
The more a community moves towards the later levels, the more it is able to create a culturally diverse array of content (or closer to the sum of human knowledge in terms of cultural diversity) in its own language and even contribute to the content of other languages. Having a mature understanding of cultural diversity implies that, first, you represent your cultural context (e.g. cities, monuments, leaders, etc.) and, second, you share this content by exporting it across the other language editions, as well as cover their cultural context content.
At the first level, Unintentional, cultural diversity is not yet a goal and not even a topic of discussion. The few editors working on the language edition try to cover the very basic encyclopedic knowledge usually based on a western perspective. Cultural diversity is scarce considering the superficial knowledge a basic encyclopaedia provides: world capitals, most spoken languages, among others. Editors usually come across barriers such as lack of Internet, lack of translation tools or lack of self-recognition of the value of their own language and culture.
At the second level, Spontaneous, the community exists and in terms of cultural diversity editors start creating content about nearby places and people, as they consider it valuable to readers. Even though there is no strategy, they recognize the value of representing their own cultural context and of translating articles from other language editions - they incorporate certain elements of discourse. However, there are no community conversations on how editors should organize themselves to create content more efficiently (i.e. using lists or contests) and all contributions are spontaneous. They lack editors and an offline team to move further.
At the third level, Organized, a few people emerge within the community with an organizational mindset that allows them to propose topic-dedicated events. In terms of cultural diversity, some events are dedicated to visually representing their own heritage (e.g. Wiki Loves Monuments21), to spreading it across other languages (e.g. Catalan Culture Challenge22), and to covering the cultural context of other languages (e.g. Asian Month23). Members of communities which have reached this level sometimes have a big picture and are partially aware of the contents that are missing, but they lack measurements and tools to better organize themselves and prioritize their top value actions.
At the fourth level, Controlled, there are different new roles: event organizers, content experts and international relations. They are able to consider the big challenge of covering the cultural context content of other language editions, and they engage in all sorts of events to do it. An example could be the regional Wikimedia CEE Spring contest organized by eastern and central European languages. At this level, the use of metrics and data visualizations in order to be aware of the content coverage is incipient, but very useful in order to know the cultural context content of every language edition (% of articles) and the knowledge gaps. However, few editors access the metrics. With no regular measurement and no constant communication, the figures on cultural diversity and gaps might not trigger any further action.
At the fifth level, Distributed, cultural diversity is seen as a top priority. Communities count on different area experts (in the field of events, metrics, communication, etc.) and know how to establish reasonable goals and organize themselves to accomplish them. The degree of coverage of other cultural and geographical contexts is common knowledge across the community and editors are aware of the main knowledge gaps. Cultural diversity has its dedicated events and contests and it is also a recurring requirement for other contests based on general topics (e.g. Women, Art, Books, etc.). At this level, discourse, organization, indicators and strategy are at an advanced stage for the community to represent the existing world cultural diversity. The community has a strong culture in addressing knowledge gaps and every member is able to find the necessary events and resources to do it. The metrics assessing the extent of the gaps are constantly visible in the different types of community communications (e.g. newsletter, mailing list, etc.) that reach the entire group, and the use of tools to browse valuable articles is common in events.
According to the model above, maturity in communities progresses one level at a time. If, for instance, a community is at level 2 (i.e. Spontaneous), it will not be able to fast forward to level 4 (i.e. Controlled) without previously passing through level 3 (Organized), gaining the necessary community capacity. Each level requires revising the current processes with more skills and knowledge. While I am writing this, no community has reached the fifth level (and only a few are located on the fourth), because metrics and data visualizations are also being developed and are to be implemented by the end of 2019. I believe that the more awareness raised on content cultural diversity and the more usable the tools become, the easier it will be for communities to embrace these values and practices. In the end, cultural diversity is a core value of the global movement and the different elements of the model are aimed at improving current activities.
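The one-level-at-a-time rule can be written down as a small data structure. The following Python sketch is only an illustration of the five levels named in this essay and of the constraint that no level can be skipped; the class and function names are hypothetical and it does not attempt to assess a real community.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The five levels of the preliminary maturity model."""
    UNINTENTIONAL = 1
    SPONTANEOUS = 2
    ORGANIZED = 3
    CONTROLLED = 4
    DISTRIBUTED = 5


def can_move_directly(current, target):
    """A community can only move to the level right after its current one."""
    return target - current == 1


def levels_to_pass(current, target):
    """List every level a community must go through to reach `target`."""
    if target < current:
        raise ValueError("this sketch only models progress towards later levels")
    return [Maturity(value) for value in range(current + 1, target + 1)]


# A community at level 2 cannot jump straight to level 4.
print(can_move_directly(Maturity.SPONTANEOUS, Maturity.CONTROLLED))
# It has to pass through Organized first.
print([level.name for level in levels_to_pass(Maturity.SPONTANEOUS, Maturity.CONTROLLED)])
```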
Without metrics and tools, it is hard for communities to work on topics they may not be able to identify in a foreign language. Metrics may be useful to provide editors with specific points to address the cultural diversity or culture gap problem and have more impact in their contributions. In the near future I hope to obtain feedback from the communities and understand more thoroughly the barriers that separate one level from another. For instance, the use of a survey would be helpful in order to obtain data and refine the model, while at the same time disseminating it. The maturity model for cultural diversity is a working vision in order to help language communities make progress through specific and attainable steps.
Towards a stronger sense of a global community
Thirty years before the commercialization of the Internet and forty years before the birth of Wikipedia, media theorist Marshall McLuhan anticipated that the world would become a global village. Each place would be connected through technology and information would continuously flow without entailing cultural uniformity24. The Internet may not have yet lived up to such humanist ideals, but I truly believe Wikipedia has managed to create a fascinating space, where speakers of any language can present information from different points of view, and search for consensus through a shared representation of provisional knowledge.
As I am writing these lines, I believe cultural diversity remains an unopened box to most of the movement. The sum of human knowledge cannot be contained in one language edition. The sum of human knowledge depends on representing and sharing the content of every language with other languages; in other words, it depends on the content exchange between languages. Current research shows that large language editions like English, French or German cover a considerable amount of content relative to the cultural context of other languages, but this is not usually the general case, nor is it sufficient. We cannot be content when African languages do not reach even a minimal representation of their related cultural context, hence failing to provide a perspective on their leaders, places, food and traditions, among others.
All in all, I am confident that cultural diversity will become one of the main objectives in the future. Whenever I attend a Wikipedia meeting or event, I realize that we enjoy being part of a global community. Editors feel this sense of unity in diversity, and the very fact of recognizing the value of cultural diversity and fostering content exchange will strengthen the movement in many senses. I am not sure I can promise my grandfather or mother a specific extent to which Catalan will be used in the next century, the number of new Wikipedias in the next ten years, or the state of coverage of all cultural contexts by minor language editions. But I am positive that Wikipedia is the best possible way to spread human knowledge as there is nothing more Wikipedian than being culturally diverse.
Thanks to Robin Taylor, Laura Vincze, Joseph Reagle, David Laniado, Denny Vrandečić, Stephane Coillet-Matillon and Jake Orlowitz for their valuable suggestions on improving the article.
1 Ezra Pound, ABC of Reading (New Directions Publishing, 1960), 34.
2 Brent Hecht and Darren Gergle. "The Tower of Babel Meets Web 2.0: User-generated Content and its Applications in a Multilingual Context." Proceedings of the SIGCHI conference on human factors in computing systems. ACM (2010).
3 Paolo Massa and Federico Scrinzi. "Manypedia: Comparing language points of view of Wikipedia communities." Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ACM (2012).
4 Benjamin L. Whorf. Languages and logic. (Foundations of Cognitive Psychology, 1941), 244.
5 Brent Hecht and Darren Gergle. "Measuring self-focus bias in community-maintained knowledge repositories." Proceedings of the fourth international conference on Communities and technologies. ACM (2009).
6 Marc Miquel-Ribé and David Laniado. "Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions." Frontiers in Physics. 5. 12. (2018).
7 Marc Miquel-Ribé and David Laniado. "Cultural identities in wikipedias." Proceedings of the 7th 2016 International Conference on Social Media & Society. ACM (2016).
8 Marc Miquel-Ribé. Identity-based motivation in digital engagement: the influence of community and cultural identity on participation in Wikipedia. Dissertation. Universitat Pompeu Fabra (2017).
9 Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn and Darren Gergle. "Omnipedia: bridging the wikipedia language gap." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM (2012).
10 Marc Miquel-Ribé and David Laniado. "Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions." Frontiers in Physics. 5. 12. (2018).
15 Ziko Van Dijk. "Wikipedia and lesser-resourced languages." Language Problems and Language Planning 33.3 (2009): 234-250.
17 Andrew Dalby. Language in danger. (The Penguin Press: Allen Lane, 2002).
18 András Kornai. "Digital language death." PloS one 8.10 (2013).
20 Tobias Mettler. "Maturity assessment models: a design science research approach." International Journal of Society Systems Science (IJSSS) 3.1/2 (2011): 81-98.
24 Gerald Emmanuel Stearn. McLuhan: Hot & Cool (1968), 272.