I’ve been playing with Wikipedia corpora for a few months now, and it’s in a good enough state that I’ve even constructed a basic Bayesian classifier which classifies Wikipedia articles into the top 19 Library of Congress Classifications I care about. The problem I’m having now is one of training data – to be reasonably good, most machine learning models need lots of high-quality labeled data.
I need some way to label a large number (at minimum, thousands) of training examples from a wide variety of Wikipedia pages. Enter: Wikipedia categories.
Note: for all data in this post, I used the September 1st, 2021 dump of Wikipedia pages and articles.
“The central goal of the category system is to provide navigational links to Wikipedia pages in a hierarchy of categories“
“Category chains formed by parent–child relationships should never form closed loops; that is, no category should be contained as a subcategory of one of its own subcategories.”
Seen above, Star Wars is a “Science Fiction” story, so it falls in the “Science Fiction” category. The Science Fiction category lists all pages (and categories) which list themselves as being part of the “Science Fiction” category. So good so far.
Further, the Science Fiction category itself is a member of the Fiction by Genre category. This makes sense. Naively, we would think that this could form a reasonable hierarchy – any article in “Fiction by Genre” or one of its subcategories should fall under the Literature classification.
This is where the strangeness of crowdsourced data comes in. Given any category, we could keep following its parents until we reach one which is manually labeled. This is one potential path, starting from “Science Fiction” and successively choosing parent categories:
- Science Fiction
- Fiction by Genre
- The arts
- Human activities
- Human biology
- Branches of biology
- Subfields by academic discipline
- Categories by topic
- Categories by parameter
- Wikipedia categories
- Wikipedia categorization
- Wikipedia content administration
- Wikipedia administration
- Online encyclopedias
- Free content
- Copyright law
- Law by issue
- Categories by issue
- Categories by parameter
There are two immediate problems. First – we’ve entered a cycle! The Categories by parameter category is it’s own parent/child. Second – we’ve clearly gone way off topic. Star Wars should not be classified as a “Law” or “Biology” article, even though it falls into “Law” and “Biology” categories.
This is unfortunate, contradicts the intent of Wikipedia categories, and I’ll have to work around it. Before doing so, I’d like to study it and see how extensive this problem really is. Enter: Connected digraphs.
A directed graph, commonly “digraph” represents a set of elements which are connected with one-way relationships. Your biological family tree forms a digraph – you are the offspring of your parents; they are not your offspring (assuming no time-travel incest shenanigans). Digraphs aren’t only limited to one-way connections; relationships may go two ways. A common example of digraphs is that of a city’s downtown. Some intersections are connected with two-way roads, and others by one-way roads. For example:
In this city, we say that “Intersection A is linked to Intersection B”, but also “B is not linked to A” since the road from A to B is one-way. In a well-designed downtown, it should be possible to eventually get from any intersection to any other intersection – it should be “strongly connected”. In the downtown above, unfortunately, anyone who gets to F can never escape. Similarly, once someone gets to C they can’t get back to A, B, D, or E – they may only proceed to F and never return. We would say that the digraph representing the intersections of the downtown above is “weakly connected” as there is not a path from each intersection to every other intersection (without breaking the law).
If we were to turn the road connecting E to F into a two-way road, this would resolve the issue – everyone who got stuck at F would be free to get to the rest of this city. Such a digraph would be “strongly connected”.
To determine the scope of the problem with the Wikipedia category hierarchy, I want to find all such strongly connected graphs, as they will cause problems for any labeling scheme based on Wikipedia Categories.
<insert coding montage>
The Largest Strongly-Connected Wikipedia Category Digraph
I found it. This massive strongly-connected graph contains 11,895 of all 2,102,570 categories. That is, 0.57% of all categories belong to this single strongly-connected digraph. This single digraph contains as disparate topics as:
The problem is that since these topics are so broad, once a category (or article) gets classified into one of these, it immediately inherits all of them. Using any category in this group would add lots of noise to my model.
The second-largest is 77 geography-related categories in California.
The third-largest is 70 categories related to wars in the Middle East.
The fourth-largest is 53 geography-related categories in the southern US and central America.
There are several hundred smaller strongly-connected graphs, for a total of 356 distinct digraphs. Most of these are obvious mistakes, such as:
There’s three things I’m going to do with this information.
- Many of the smaller (but still sizeable) categories all fit into the same LCC classification. I can easily use these as training data.
- Rather than assume that Wikipedia articles are transitively of the same topic as their parent categories, I should reduce how much I weight that information based on how many steps it takes to get to a classified category.
- I can use centrality measures to find which categories are most problematic, and both fix those in my own data set and possibly recommend these changes to Wikipedia.