Wikipedia's Not So Little Sister Is Finding Its Own Way

Lydia Pintscher

In 2012, Wikipedia had grown and achieved so much in over a decade of creating an encyclopedia. But it was also at a point where fundamental change was needed: The world around Wikipedia was changing and Wikimedia had to find ways to make its content more accessible and support its editors in maintaining an ever increasing body of content in over 250 languages. The vision of a world in which every single human being can freely share in the sum of all knowledge was not achievable in this scattered way.

Ever since 2005 at the very first Wikimania, Wikimedia’s annual conference, one idea kept coming up: to make Wikipedia semantic and thus make its content accessible to machines. Machine-readability would enable intelligent machines to answer questions based on the content and make the content easier to reuse and remix. For example, it was not possible to easily answer the question of which are the biggest cities with a female mayor, because the necessary data was distributed over many articles and not machine-readable. Denny Vrandečić and Markus Krötzsch kept working on this idea and created Semantic MediaWiki, learning a lot about how to represent knowledge in a wiki along the way. Others had also started extracting content from Wikipedia, with varying degrees of success, and making the information available in machine-readable form.

So when the first line of code for the software that came to power Wikidata was written in 2012, it was an idea whose time had come. Wikidata was to be a free and open knowledge base for Wikipedia, its sister projects and the world that helps give more people more access to more knowledge. Today, it provides the underlying data for a lot of technology you use and the Wikipedia articles you read every day.

Being able to influence the world around you is such an important and empowering thing and yet we are losing this ability a bit more everywhere every day. More and more in our daily lives depends on data, so let’s make sure it stays open, free and editable for everyone in a world where we put people before data. Wikipedia showed how it can be done and now its sister Wikidata joins to contribute a new set of strengths.

Growing Up

Wikidata always had bigger ambitions, but it started out by focusing on supporting Wikipedia. There were nearly 300 different language versions of Wikipedia, all covering overlapping (but not identical) topics without being able to share even basic data about these topics. Considering that most of these language versions had only a handful of editors, this was a problem. Small language versions were not able to keep up with the ever-changing world and, depending on which language you could read, a vast amount of Wikipedia content was inaccessible to you. Perhaps someone famous had died? That information was usually available quickly on the largest Wikipedias but took a long time to be added to the smaller ones, if they even had an article about the person. Wikidata was meant to help fix this problem by being a central place that stores general-purpose data (like the data found in “infoboxes” on Wikipedia, such as the number of inhabitants of a city or the names of the actors in a movie) related to the millions of concepts covered in Wikipedia articles.

To start this knowledge base, Wikidata began by solving a simple but long-standing problem for Wikipedians: the headache of links between different language versions of an article. Each article contained links to all other language versions covering the same topic, but this was a lot of redundancy and caused synchronisation issues. Wikidata’s first contribution was to store these links centrally and thereby eliminate needless duplication. With this first simple step, Wikidata has helped eliminate over 240 million lines of unnecessary wikitext from Wikipedia and at the same time created pages for millions of concepts on Wikidata, providing the basis for the next stage. Once the initial set of concepts was created and connected to Wikipedia articles, it was time for the actual data and the ability to make statements about the concepts (e.g. Berlin is the capital of Germany). Last but not least came the capability to use this data in Wikipedia articles. Now Wikipedia editors could enrich their infoboxes automatically with data coming from Wikidata.

Along the way a fantastic community maintaining that data developed, much faster than the development team could have dreamed of. This new community included new people who had never contributed to a Wikimedia project before and were now becoming interested because Wikidata was a good fit for them. It also included contributors from adjacent Wikimedia projects who were more interested in structuring information than writing encyclopedic articles and found their calling in Wikidata.

The number of concepts represented in Wikidata Items.

The number of editors on Wikidata since its start. The circles indicate the beginning and end of the mass-import of interwiki links.

Later Wikidata’s scope expanded to also support the other Wikimedia projects like Wikivoyage, Wikisource, Wikimedia Commons and so on since they can benefit from the same kind of centralized knowledge base as Wikipedia.

As it evolved, Wikidata became an attractive source for Wikimedia projects and those who used to data-scrape Wikipedia infoboxes. External websites, apps, and visualisations used this information as a basic ingredient: from a website for browsing artwork, to a book inventory manager, to history teaching tools, to digital personal assistants. Now, Wikidata is used in countless places without most users even being aware of it.

And most recently it became clear that we need to think beyond Wikidata and think of a large network of knowledge bases running the same software (Wikibase) to publish data in an open and collaborative way, called the Wikibase ecosystem. In this ecosystem many different institutions, activists and companies are opening up their data and making it accessible to the world by connecting it with Wikidata and with each other. Wikidata doesn’t need to be and shouldn’t be the only place where collaborative open data happens.

At the time of writing of this chapter Wikidata provides data about more than fifty-five million concepts. It includes data about such things as movies, people, scientific papers and genes. Additionally it provides links to over 4,000 external databases, projects and catalogs, making even more data accessible. This data is added and maintained by more than 20,000 people every month and used in over half of all articles in Wikimedia projects.

Helping People (and Machines) Come Together

Just like Wikipedia is not like any other encyclopedia, Wikidata is not like any other knowledge base. There are a number of things that set Wikidata apart. They are a result of striving to be a global knowledge base and covering a multitude of topics in a machine-readable way.

The most important differentiator is probably the acknowledgement that the world is complex and can’t easily be pressed into simple data. Did you know that there is a woman who married the Eiffel Tower? That the Earth is not a perfect sphere? A lot of technology today is trying to simplify the world by hiding necessary complexity and nuance. Conflicting worldviews need to be surfaced. Otherwise we take away people’s ability to talk about, understand, and ultimately resolve their differences. Wikidata is striving to change that by not trying to force one truth but by collecting different points of view with their sources and context intact. This additional context can, for example, include which official body disputes or supports which view on a territorial dispute. Without this focus on verifiability instead of truth and not trying to force agreement it would be impossible to bring together a community from different languages and cultures. For the same reason, Wikidata doesn’t have an enforced schema that restricts the data, but, rather, has a system of editor-defined constraints that highlight potential problems.
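The idea of constraints that highlight problems rather than reject data can be sketched in a few lines of Python. This is only an illustration: the data layout, the names ITEM_TYPES, CONSTRAINTS and check_constraints, and the notion of a single "expected value type" per property are assumptions made for the example, not Wikidata's actual constraint system.

```python
# Illustrative sketch only: all names and structures here are hypothetical.

# What kind of thing each item is.
ITEM_TYPES = {
    "Erika Eiffel": "human",
    "Eiffel Tower": "tower",
    "Germany": "country",
    "Berlin": "city",
}

# Statements are stored as (subject, property, value) regardless of constraints.
STATEMENTS = [
    ("Erika Eiffel", "spouse", "Eiffel Tower"),
    ("Germany", "capital", "Berlin"),
]

# Editor-defined expectations: property -> type the value is usually expected to have.
CONSTRAINTS = {
    "spouse": "human",
    "capital": "city",
}


def check_constraints(statements, constraints, item_types):
    """Return a warning for each statement whose value has an unexpected type.

    Nothing is rejected or deleted: the statements stay as they are and the
    warnings are merely surfaced so that an editor can take a look.
    """
    warnings = []
    for subject, prop, value in statements:
        expected = constraints.get(prop)
        actual = item_types.get(value)
        if expected is not None and actual != expected:
            warnings.append(
                f"{subject} -> {prop} -> {value}: expected a {expected}, found {actual}"
            )
    return warnings


warnings = check_constraints(STATEMENTS, CONSTRAINTS, ITEM_TYPES)
```

The unusual marriage is flagged for review but remains in the data, which is the point: a strict schema would have refused to store it at all.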

Being able to cover different points of view and nuance is not enough however for a truly global project. The data also needs to be accessible to everyone in their language without privileging any particular language by design. Because of this, every concept in Wikidata is identified by a unique ID instead of an English name. Q5, for instance, is the identifier for the concept of a human. It is then given labels in the different languages: “human” in English, “người” in Vietnamese and “ihminen” in Finnish. This way the underlying data is language-independent and everyone can see the data in their language when viewing or editing it. This of course does not eliminate the language issue but it goes a long way towards more equity in contributing to Wikimedia’s content.
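A minimal Python sketch of this separation between identifier and label follows. The dictionary layout and the function name label_for are assumptions for illustration, and the fallback to English when a label is missing is a simplification chosen for the example, not a description of how Wikidata selects labels.

```python
# Illustrative sketch only: the structure and names are hypothetical.

# The item ID is the stable key; labels hang off it, one per language code.
LABELS = {
    "Q5": {"en": "human", "vi": "người", "fi": "ihminen"},
}


def label_for(item_id, language, fallback="en"):
    """Return the label for item_id in the given language.

    If no label exists in that language, try the fallback language,
    and if that is missing too, show the bare ID.
    """
    labels = LABELS.get(item_id, {})
    return labels.get(language) or labels.get(fallback) or item_id


finnish = label_for("Q5", "fi")
vietnamese = label_for("Q5", "vi")
```

Because statements refer to Q5 rather than to the word "human", adding a label in a new language changes nothing in the underlying data.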

Besides fabulous people, Wikidata’s ultimate secret sauce is its connections. All concepts in Wikidata are connected to each other through statements. The statement “Iron Man -> member of -> Avengers” for example tells us that Iron Man is a member of the Avengers. That one connection alone does not tell us much yet. But if you take a number of other similar connections you can easily get a list of all Avengers. And then make a list of the movies they first appeared in and the actors they were portrayed by. A lot of simple individual connections taken together are powerful. If you add on top of that the wide range of topics Wikidata covers, it becomes even more powerful because you can make connections that have not been made before. How about a list of species named after politicians? Now possible, thanks to these simple connections! And those are just the connections inside Wikidata itself; Wikidata also connects to a large number of external databases, catalogs and projects that make even more data available. Since Wikidata has such a large number of links to external resources it can act as a hub, so that you, and even more importantly any machine, can find a vast amount of additional information based on a single piece of data. If the ISBN of a book is known, then knowing its entry in the relevant national library is just a hop away. There might not be a direct link from an artist’s entry in the Louvre’s catalog to their entry in the Rijksmuseum’s catalog, but with Wikidata this connection is easily made, opening up yet more options for discovering knowledge.

Wikidata links to more than 4,000 external databases, projects and catalogs, creating a vast network.
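How many simple statements add up to useful answers can be sketched with plain Python tuples. This is a toy model under stated assumptions: items are written as names instead of IDs for readability, the helper names subjects_with and values_of are invented for the example, and the handful of statements is sample data rather than an extract from Wikidata.

```python
# Illustrative sketch only: helper names and the sample data are hypothetical.

# Each statement is a (subject, property, value) connection.
STATEMENTS = [
    ("Iron Man", "member of", "Avengers"),
    ("Thor", "member of", "Avengers"),
    ("Batman", "member of", "Justice League"),
    ("Iron Man", "performer", "Robert Downey Jr."),
    ("Thor", "performer", "Chris Hemsworth"),
]


def subjects_with(statements, prop, value):
    """All subjects that have a statement 'subject -> prop -> value'."""
    return sorted(s for s, p, v in statements if p == prop and v == value)


def values_of(statements, subject, prop):
    """All values of the given property on the given subject."""
    return sorted(v for s, p, v in statements if s == subject and p == prop)


# Step 1: follow one kind of connection to list all Avengers.
avengers = subjects_with(STATEMENTS, "member of", "Avengers")

# Step 2: follow a second kind of connection from each result.
cast = {hero: values_of(STATEMENTS, hero, "performer") for hero in avengers}
```

Neither question was anticipated when the individual statements were entered; the answers fall out of chaining them.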

Impacting Wikipedia

Its close connection to Wikipedia made all the difference for Wikidata, especially at the start. Without the community, experience, mindshare and tools that Wikipedia provided, Wikidata would not be where it is today. It is also giving back of course, not just by significantly lowering maintenance burdens through centralisation of data but also in a number of more subtle and indirect ways.

Before Wikidata the different Wikimedia projects and language versions of each project worked in silos to a large degree. There was little collaboration on content across project and language boundaries. Wikimedia Commons had been around for a while as a central repository for media files that are shared between all Wikimedia projects, but by its nature it did not force a lot of collaboration. Because of this a large share of the editors identified first and foremost with their language version of Wikipedia and only a distant second, if at all, with the Wikimedia Movement as a whole. Statements like “The Wikipedia in this and that language is terrible” were not uncommon when Wikidata started. The thought of using content that is shared with these other Wikipedias that were perceived as inferior was deemed frightening. Equally, the thought that the large Wikipedias could gain anything from contributions by smaller projects was unthinkable. By helping people connect across language and project boundaries, Wikidata has helped to steer Wikipedia away from a silo mentality towards a truly global movement where every project is recognized and valued for its contribution to the sum of all knowledge.

Wikidata also helps Wikipedia by being a fundamental building block for technical innovation, big and small. Simple changes like the improved search box when linking to another article in VisualEditor become possible thanks to structured data in Wikidata. Now the selector shows you the short description from Wikidata and you can select the right article to link to without having to look it up. Wikidata also makes possible more fundamental changes like overhauling Wikimedia Commons in order to make images more discoverable for Wikipedia editors and others. Wikidata provides the data necessary to build better experiences for Wikipedia’s editors and readers.

Through the data in Wikidata we can also understand Wikipedia better. We can analyse much more easily what content is covered and what is missing. Take the gender gap. It was known for a long time that Wikipedia’s content is skewed towards covering men. The simple fact that there are more Wikipedia articles about men than women is not very helpful for a big community though, as the problem is too broad to motivate people or to make meaningful progress on. Wikidata allows us to see a more detailed picture and analyse the content by time period, country, profession of the person and other relevant characteristics. We can also compare the language versions of Wikipedia to see if any of them has a particularly narrow gender gap so we can learn from it. We can also see the geographic distribution of Wikipedia’s content and find blind spots on Wikipedia’s map of the world. The same can be done for any other content bias or gap that needs to be understood better. This way, Wikidata helps Wikipedia learn more about itself.

The gender gap on Wikipedia visualized per country of citizenship of the article’s subject. (tool by Envel Le Hir at denelezh.org)

Better understanding the knowledge that Wikipedia covers is a necessary first step towards countering biases and filling gaps. Wikidata can also help there by making it possible to generate automated worklists for a topic you care about. Interested in video games? You can make a list of all video games released in the last 10 years which are missing a publisher and start adding that data. How about party affiliations of politicians in your recent local election? Monuments in the city you last visited that are missing street addresses? All that is just a few clicks away, making it easier to contribute to collecting the sum of all human knowledge and making Wikipedia more complete.
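The logic behind such a worklist is a filter for items of a given kind that lack a given piece of data. The Python sketch below shows that logic on a small local list. Everything in it is an assumption for illustration: the items are made-up placeholders, the property names are simplified, and the function build_worklist is invented here. In practice such lists would be produced by querying Wikidata itself rather than filtering a hand-written list.

```python
# Illustrative sketch only: placeholder items and a hypothetical helper.

ITEMS = [
    {"label": "Game A", "instance of": "video game", "publication year": 2021,
     "publisher": "Studio X"},
    {"label": "Game B", "instance of": "video game", "publication year": 2019},
    {"label": "Game C", "instance of": "video game", "publication year": 2001},
    {"label": "Film D", "instance of": "film", "publication year": 2020},
]


def build_worklist(items, kind, missing_property, since_year):
    """Labels of items of the given kind, published in or after since_year,
    that have no value yet for missing_property."""
    return [
        item["label"]
        for item in items
        if item.get("instance of") == kind
        and item.get("publication year", 0) >= since_year
        and missing_property not in item
    ]


# Video games from 2015 onwards that still need a publisher.
todo = build_worklist(ITEMS, "video game", "publisher", since_year=2015)
```

Swapping the three arguments gives the other examples from the text, such as politicians missing a party affiliation or monuments missing a street address.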

And last but not least, Wikidata helps bring new contributors to Wikipedia. It opens up Wikimedia to new types of people, ones more interested in structuring information and connecting data points than writing long prose. And the small contributions that can be made on Wikidata lend themselves well to beginners who are initially overwhelmed by writing full articles. It is also a gateway for institutional contributors like galleries, libraries, archives and museums who want to make their content accessible.

Wikidata’s influence on Wikipedia far exceeds simply providing a few data points for infoboxes. It is a driver and supporter of change. Growing up with a big sister is not always easy. There’s the occasional disagreement and even fight but in the end you make up and stick together because you are the best team there could be. It is amazing to have someone to look up to. Wikidata is a project in its own right now, with its own reason for existence… but it will always be there to support Wikipedia.

Thank you, big sister! Wikidata owes you.