I’m training machine learning models to classify articles on Wikipedia. To do this well, I need a lot of training data. I’d originally intended to use Wikipedia categories to automatically classify articles. Unfortunately, I can’t assume articles in Category:Philosophy or one of its subcategories should all be “Philosophy” as it turns out categories aren’t used consistently enough to use them out-of-the-box.
A category cycle is when a category is its own ancestor. For example (as of this writing) Category:Quintephones is a subcategory of Category:Electronic musical instruments, which is a subcategory of Category:Quintephones. This specific example isn’t problematic, but if such a cycle exists which includes many very general topics, it can make classifying articles by their category difficult.
In principle, Wikipedia categories should contain no cycles – a parent category should be strictly broader than its children. In practice, this is not the case. These cycles cause problems when trying to use categories as information in machine learning models and for labeling training data.
There are too many relationships (“category A is in category B”) to consider one-by-one. I need some way to narrow down category relationships to the ones which cause the most problems. So, given a strongly-connected directed graph of Wikipedia category relationships I’d like to try to automatically identify the most problematic relationships (the ones which shouldn’t exist), manually examine the results, and then remove the relationships I can tell are bad.
The first tool we’ll be using is centrality. Roughly, centrality measures which entities are “most important” in a network of related things. The most important entities don’t necessarily have problematic relationships, but they are the ones most likely to. There are several types of centrality, each with their own strengths and weaknesses. We’ll be using several of them to analyze the various strongly-connected digraphs we found last time.
For this post, We’ll be studying the second-largest, which contains 77 categories mainly dealing with Californian geographical features. We’ll call this the California subgraph.
For directed graphs, there are actually two types of degree centrality – indegree and outdegree. For our case, these respectively correspond to:
- Indegree centrality: The number of subcategories of “A”.
- For example, “Science Fiction” and “Fantasy” are subcategories of “Fiction by Genre” as they are more specific forms of “Fiction”.
- Outdegree centrality: The number of pages “A” lists as a parent category.
- For example, “Fiction” is a parent category of “Fiction by Genre” as it includes more broad topics than just lists of fictional works, organized by genre.
Category:Shasta Cascade has three child categories in our graph (that is, only considering child categories in this list), so its indegree centrality is 3:
- Category:Butte County, California
- Category:Shasta County, California
- Category:Tehama County, California
The page has one parent category which is in our graph, Category:Sacramento Valley, so its outdegree centrality is 1.
Below is a graph plot of the categories. Each dot (or “node”) is a single category from the California subgraph, where arrows point from a child category to one of its parents. Larger dots represent categories with more subcategories.
The category with the highest indegree and outdegree centrality is Category:Sacramento Valley, with 9 parent categories in this subgraph and 11 child categories. Second place is Category:San Joaquin Valley. These are immediately, obviously the problem pages in this graph. While parts of these valleys are in various Californian counties, the fact that parts of them extend into the counties are not the “essential—defining—characteristics” of those topics. By this definition, they belong in none of the other categories in this subgraph. What happens if we remove the outgoing edges from these nodes?
After Removing the Problematic Relationships
Removing the outgoing edges from “Category:San Joaquin Valley” and “Category:Sacramento Valley” completely eliminates all cycles in the California subgraph. These seventeen edges are the minimum required to do so. The result looks like a much more sensible graph of category parent/child relationships.
Since this is an easy problem to fix, I’m going to go ahead and make these changes to the Sacramento Valley and San Joaquin Valley categories myself, and start a discussion on the talk pages pointing to this post for explanation.
I’m still going to explore other types of centrality measures to get a feel for their strengths and weaknesses – I’ll likely need to use multiple techniques to fix the largest graph, as it has nearly 12,000 categories.