Mapping the Contents in Wikipedia

We have just returned from the CHI2009 conference on Human-Computer Interaction, where many of the topics focused on where and how people obtain their information, and how they make sense of it all. A recent research topic in our group is understanding how people use Wikipedia for their information needs. One question that has constantly come up in our discussions about Wikipedia is what exactly is in it. So far we have done most of our analyses on edit patterns, but not much analysis has gone into what people write about. What topics are the most well-represented? Which topic areas have the most conflict?

In one of our recent CHI2009 papers, we explored this issue. It turns out that Wikipedia has a feature called Categories, which people use to organize the content into a pseudo-hierarchy of topics. We devised a simple path-based algorithm for assigning articles to large top-level categories in an attempt to understand which topic areas are the most well-represented. The top level categories are:

Using our algorithm, the page “Albert Einstein” can be assigned to these top-level categories:

This mapping makes some intuitive sense. You can see the impact Albert Einstein has made in various areas of our society, such as science, philosophy, history, and religion. Using the same ideas and algorithm, we can now do this mapping for all of the pages in Wikipedia, and find out which top level categories have received the most representation. In other words, we can figure out the coverage of topic areas in Wikipedia.
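To make the idea of a path-based assignment concrete, here is a minimal sketch in Python. It is not the exact algorithm from the paper. It assumes that each category an article is tagged with "votes" for the top-level categories nearest to it in the category graph, with ties split evenly. The function names, the toy category graph, and the short label "Science" are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def nearest_top_level(start, parents, top_level):
    """Climb parent links level by level from `start` and return the set of
    top-level categories reached first (several if they tie in distance)."""
    seen = {start}
    frontier = [start]
    while frontier:
        hits = {c for c in frontier if c in top_level}
        if hits:
            return hits
        next_frontier = []
        for category in frontier:
            for parent in parents.get(category, ()):
                if parent not in seen:  # the category graph may contain cycles
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return set()

def assign_topics(article_categories, parents, top_level):
    """Assumed weighting: each article category casts one vote, split evenly
    among its nearest top-level categories; votes are normalized to sum to 1."""
    weights = defaultdict(float)
    for category in article_categories:
        hits = nearest_top_level(category, parents, top_level)
        for topic in hits:
            weights[topic] += 1.0 / len(hits)
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}

# Hypothetical toy graph, not real Wikipedia data.
TOY_PARENTS = {
    "Theoretical physicists": ["Physicists"],
    "Physicists": ["Science", "People"],
    "Philosophers of science": ["Philosophy"],
}
TOY_TOP_LEVEL = {"Science", "People", "Philosophy", "History", "Religion"}

if __name__ == "__main__":
    print(assign_topics(
        ["Theoretical physicists", "Philosophers of science"],
        TOY_PARENTS, TOY_TOP_LEVEL))
```

Summing these per-article weights over all pages would give one possible estimate of how much of Wikipedia falls under each top-level category.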


We can see that the highest coverage has gone to the top-level category of “culture and the arts” at 30%, followed by “people” at 15%, “geography” at 14%, “society and social science” at 12%, and “history” at 11%. What’s perhaps more interesting is understanding which of these categories have generated the most conflict! We used a previously developed measure called Conflict Revision Count (CRC), from our CHI2007 paper, to show which top level categories have the most conflict:

In this figure, the categories are listed clockwise from “People” in order of their total amount of conflict. This means that People received the most conflict in total, followed by Society and Social Sciences, and so on. However, the percentage shown for each topic is normalized by the number of article-assignments in that topic. So the metric here can be interpreted as the amount of conflict in each topic relative to the size of the topic, that is, how contentious the articles in that topic are.
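The normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the exact computation from the paper: it divides each topic's total conflict by its number of article-assignments and then rescales the results into shares that sum to 1. The function name and all of the numbers are made up for the example.

```python
def contentiousness_shares(conflict_by_topic, assignments_by_topic):
    """Divide each topic's total conflict by its number of article-assignments,
    then rescale so the per-topic values sum to 1."""
    per_assignment = {
        topic: conflict / assignments_by_topic[topic]
        for topic, conflict in conflict_by_topic.items()
        if assignments_by_topic.get(topic)  # skip topics with no assignments
    }
    total = sum(per_assignment.values())
    if not total:
        return {}
    return {topic: value / total for topic, value in per_assignment.items()}

# Made-up illustrative numbers, not data from the paper.
TOY_CONFLICT = {"People": 900, "Religion": 300, "Philosophy": 150}
TOY_ASSIGNMENTS = {"People": 300, "Religion": 20, "Philosophy": 10}

if __name__ == "__main__":
    for topic, share in contentiousness_shares(TOY_CONFLICT, TOY_ASSIGNMENTS).items():
        print(f"{topic}: {share:.1%}")
```

In this toy data, People has the largest total conflict but the smallest normalized share, because its conflict is spread over many more article-assignments.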

“Philosophy” and “religion” stand out as highly contentious despite having relatively few articles: each accounts for 28% of the normalized conflict, even though they make up only 1% and 2%, respectively, of the total distribution of topics shown above.

Digging into religion more closely, we see that “Atheism” has generated the most conflict, followed by “Prem Rawat” (the controversial guru and religious leader), “Islam”, and “Falun Gong”.

Wikipedia is the 8th ranked website in the world, so it is clear that a lot of people get their information from it. The surprising thing about Wikipedia is that it succeeded at all. Common sense would suggest that an encyclopedia in which anyone can edit anything they want would result in utter nonsense. What happened is exactly the opposite: many users and groups have gotten together to make sense of complex topics and debate with each other about what information is the most relevant and interesting to include. This helps us stay sane in this information world, because we now have cheap and always accessible content on even some of the most obscure topics you might be interested in. At lunch today, we were all wondering which countries have the lowest birth rate. Well, surprise!! Of course, there is a page for that, which we found using our iPhones.

The techniques we have developed here enable us to understand what content is available in Wikipedia and how various top level categories are covered, as well as the amount of controversy in each category.

There are of course many risks in using online content. However, we have been researching tools that might alleviate these concerns. For example, WikiDashboard is a tool that visualizes the social dynamics behind how a wiki article came into its current state. It shows the top editors of any Wikipedia page, and how much they have edited. It can also show the top articles that a user is interested in.

We are considering adding this capability to WikiDashboard, and would welcome your comments on the analysis and ideas here.

All web users can guide the content in Wikipedia by participating in it. If we realize that the existence of our society depends on healthy discourse between different segments of the population, then we will see Wikipedia not just as a source of conflict, but as a source of the healthy discussion that needs to occur in our world. By having these discussions in the open (with full social transparency), we can ensure all points of view are represented in this shared resource. Our responsibility is to ensure that the discussion and conflicts are healthy and productive.

Kittur, A., Chi, E. H., and Suh, B. 2009. What’s in Wikipedia?: Mapping Topics and Conflict Using Socially Annotated Category Structure. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (Boston, MA, USA, April 04 – 09, 2009). CHI ’09. ACM, New York, NY, 1509-1512.
