The Largest Connected Digraph of Wikipedia Categories

I’ve been playing with Wikipedia corpora for a few months now, and my pipeline is in a good enough state that I’ve even constructed a basic Bayesian classifier which sorts Wikipedia articles into the top 19 Library of Congress Classifications I care about. The problem I’m having now is one of training data – to be reasonably good, most machine learning models need lots of high-quality labeled data.

I need some way to label a large number (at minimum, thousands) of training examples from a wide variety of Wikipedia pages. Enter: Wikipedia categories.

Note: for all data in this post, I used the September 1st, 2021 dump of Wikipedia pages and articles.

Wikipedia Categories

Categories are listed at the bottom of many Wikipedia articles. They form a sort-of-hierarchy of pages. Per the Wikipedia:Categorization article, these sound like exactly what I’m looking for:

“The central goal of the category system is to provide navigational links to Wikipedia pages in a hierarchy of categories …”

“Category chains formed by parent–child relationships should never form closed loops; that is, no category should be contained as a subcategory of one of its own subcategories.”

As seen above, Star Wars is a science fiction story, so it falls in the “Science Fiction” category. The Science Fiction category lists all pages (and subcategories) which declare themselves part of the “Science Fiction” category. So far, so good.

Further, the Science Fiction category itself is a member of the Fiction by Genre category. This makes sense. Naively, we would think that this could form a reasonable hierarchy – any article in “Fiction by Genre” or one of its subcategories should fall under the Literature classification.

This is where the strangeness of crowdsourced data comes in. Given any category, we could keep following its parents until we reach one which is manually labeled. This is one potential path, starting from “Science Fiction” and successively choosing parent categories:

  • Science Fiction
  • Fiction by Genre
  • Fiction
  • The arts
  • Culture
  • Human activities
  • Humans
  • Human biology
  • Branches of biology
  • Subfields by academic discipline
  • Categories by topic
  • Categories by parameter
  • Wikipedia categories
  • Wikipedia categorization
  • Wikipedia content administration
  • Wikipedia administration
  • Wikipedia
  • Online encyclopedias
  • Free content
  • Copyright law
  • Law by issue
  • Categories by issue
  • Categories by parameter

There are two immediate problems. First – we’ve entered a cycle! The Categories by parameter category is its own ancestor (and descendant). Second – we’ve clearly gone way off topic. Star Wars should not be classified as a “Law” or “Biology” article, even though it falls under “Law” and “Biology” categories.

This is unfortunate, contradicts the intent of Wikipedia categories, and I’ll have to work around it. Before doing so, I’d like to study it and see how extensive this problem really is. Enter: Connected digraphs.

Connected Digraphs

A directed graph, commonly called a “digraph”, represents a set of elements connected by one-way relationships. Your biological family tree forms a digraph – you are the offspring of your parents; they are not your offspring (assuming no time-travel incest shenanigans). Digraphs aren’t limited to strictly one-way connections, though; a relationship may go both ways. A common example of a digraph is a city’s downtown: some intersections are connected by two-way roads, and others by one-way roads. For example:

In this city, we say that “Intersection A is linked to Intersection B”, but also “B is not linked to A” since the road from A to B is one-way. In a well-designed downtown, it should be possible to eventually get from any intersection to any other intersection – it should be “strongly connected”. In the downtown above, unfortunately, anyone who gets to F can never escape. Similarly, once someone gets to C they can’t get back to A, B, D, or E – they may only proceed to F and never return. We would say that the digraph representing the intersections of the downtown above is “weakly connected” as there is not a path from each intersection to every other intersection (without breaking the law).

If we were to turn the road connecting E to F into a two-way road, this would resolve the issue – everyone who got stuck at F would be free to get to the rest of this city. Such a digraph would be “strongly connected”.

To determine the scope of the problem with the Wikipedia category hierarchy, I want to find all of its strongly connected components – sets of categories in which every category can reach every other by following parent links – as these will cause problems for any labeling scheme based on Wikipedia categories.

<insert coding montage>
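For the curious, here is a minimal sketch of the core step: Tarjan’s algorithm finds every strongly connected component of a digraph in a single pass. This is an illustrative Go sketch, not the code in my repository – for the full two-million-node category graph, the recursion below would need to be made iterative.

package main

import "fmt"

// tarjanSCC returns the strongly connected components of a directed graph
// given as an adjacency list keyed by node ID (for me, category IDs).
func tarjanSCC(graph map[int][]int) [][]int {
    index := 0
    indices := map[int]int{}
    lowlink := map[int]int{}
    onStack := map[int]bool{}
    var stack []int
    var sccs [][]int

    var strongConnect func(v int)
    strongConnect = func(v int) {
        indices[v] = index
        lowlink[v] = index
        index++
        stack = append(stack, v)
        onStack[v] = true

        for _, w := range graph[v] {
            if _, seen := indices[w]; !seen {
                strongConnect(w)
                if lowlink[w] < lowlink[v] {
                    lowlink[v] = lowlink[w]
                }
            } else if onStack[w] && indices[w] < lowlink[v] {
                lowlink[v] = indices[w]
            }
        }

        // v is the root of a component: pop the stack down to v.
        if lowlink[v] == indices[v] {
            var scc []int
            for {
                w := stack[len(stack)-1]
                stack = stack[:len(stack)-1]
                onStack[w] = false
                scc = append(scc, w)
                if w == v {
                    break
                }
            }
            sccs = append(sccs, scc)
        }
    }

    for v := range graph {
        if _, seen := indices[v]; !seen {
            strongConnect(v)
        }
    }
    return sccs
}

func main() {
    // Toy graph: 1 -> 2 -> 3 -> 1 forms a cycle, and 4 hangs off of 3.
    graph := map[int][]int{1: {2}, 2: {3}, 3: {1, 4}, 4: {}}
    for _, scc := range tarjanSCC(graph) {
        if len(scc) > 1 {
            fmt.Println("cycle:", scc)
        }
    }
}

Any component with more than one member is a set of categories that are all (transitively) their own ancestors – exactly what I’m hunting for.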

The Largest Strongly-Connected Wikipedia Category Digraph

I found it. This massive strongly-connected graph contains 11,895 of all 2,102,570 categories. That is, 0.57% of all categories belong to this single strongly-connected digraph. This single digraph contains topics as disparate as:

The problem is that since these topics are so broad, once a category (or article) gets classified into one of these, it immediately inherits all of them. Using any category in this group would add lots of noise to my model.

The second-largest is 77 geography-related categories in California.

The third-largest is 70 categories related to wars in the Middle East.

The fourth-largest is 53 geography-related categories in the southern US and Central America.

There are several hundred smaller strongly-connected graphs, for a total of 356 distinct digraphs. Most of these are obvious mistakes.

Future Work

There are three things I’m going to do with this information.

  1. Many of the smaller (but still sizeable) categories all fit into the same LCC classification. I can easily use these as training data.
  2. Rather than assume that Wikipedia articles are transitively of the same topic as their parent categories, I should reduce how much I weight that information based on how many steps it takes to get to a classified category.
  3. I can use centrality measures to find which categories are most problematic, and both fix those in my own data set and possibly recommend these changes to Wikipedia.

What are the longest n-grams on Wikipedia?

TL;DR

I found the longest n-gram on Wikipedia which appears more than 1,000 times. It is 103 words long. It is:

The township is governed by a three-member board of trustees, who are elected in November of odd-numbered years to a four-year term beginning on the following January 1. Two are elected in the year after the presidential election and one is elected in the year before it. There is also an elected township fiscal officer, who serves a four-year term beginning on April 1 of the year after the election, which is held in November of the year before the presidential election. Vacancies in the fiscal officership or on the board of trustees are filled by the remaining trustees.

It appears 1,275 times (exactly, word-for-word), only in articles about small towns in Ohio. It’s part of the Ohio Revised Code, which, as far as I can tell, applies to all Ohio townships.

What is an n-gram?

An n-gram is a contiguous sequence of n words from a text, or from a corpus of texts. Finding n-grams which appear often is helpful for document classification, as repeated sequences of words often have a meaning different from what their individual words would suggest. An “Academy Award” isn’t just an award from an academy, but a specific award from the Academy of Motion Picture Arts and Sciences. Letting our models know during training that “Academy Award” may mean something different than “academy” and “award” can make it easier to identify fine-grained topics like this.

How I found this

After cleaning my Wikipedia corpus, I took several additional steps to normalize it:

  • I made everything lowercase
  • Sequences of numbers became “_num_”
  • A “_num_” followed by a percent symbol became “_percent_”
  • Certain recognized date formats became “_date_”, for example “10 January 1995”

That is, for the purposes of my analysis I consider numbers to be “semantically identical”. The phrase “he was 29 years old” doesn’t have a significantly different meaning from “he was 81 years old”. For first-pass machine learning models the noise and variance introduced by treating every number as a completely different “word” is rarely useful.

Beyond the above, I define a “word” to be any sequence of characters matching the regex “[\w']+”, and I consider two words to be the same if and only if their lowercase forms exactly match.
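Putting those rules together, a rough Go sketch of the normalization might look like the following. The exact set of date formats I recognize isn’t enumerated above, so the date pattern below is an illustrative guess, and the ordering of the replacements is my own.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// These patterns approximate the rules above. dateRE only covers forms like
// "10 January 1995"; the real list of recognized date formats is longer.
var (
    dateRE   = regexp.MustCompile(`(?i)\b\d{1,2} (january|february|march|april|may|june|july|august|september|october|november|december) \d{4}\b`)
    numberRE = regexp.MustCompile(`\d+(\.\d+)?`)
    wordRE   = regexp.MustCompile(`[\w']+`)
)

func normalize(text string) []string {
    text = strings.ToLower(text)
    text = dateRE.ReplaceAllString(text, "_date_")
    text = numberRE.ReplaceAllString(text, "_num_")
    // A number followed by a percent sign becomes _percent_.
    text = strings.ReplaceAll(text, "_num_%", "_percent_")
    return wordRE.FindAllString(text, -1)
}

func main() {
    fmt.Println(normalize("He was 29 years old on 10 January 1995, with 45% turnout."))
    // Output: [he was _num_ years old on _date_ with _percent_ turnout]
}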

To find the n-grams, I employed an iterative algorithm which takes as input all currently-known n-grams. On the first iteration it finds all 2-grams. On the second, it finds all 3- and 4-grams. With each iteration, the maximum n-gram length it can find doubles. You can find the successive commands I ran to do this in my GitHub repository.
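The repository has the exact commands; as an illustration of the doubling idea, here is a hedged sketch of a single iteration. It leans on the observation that if a long n-gram is frequent, its leading and trailing windows of the previous maximum length must be frequent too, so candidates whose windows aren’t already known can be skipped.

package main

import (
    "fmt"
    "strings"
)

// doubleNgrams runs one iteration of the doubling scheme described above:
// given n-grams already known to be frequent (up to length maxLen), it counts
// candidates up to length 2*maxLen, skipping any candidate whose leading or
// trailing maxLen-token window is not already known. This is my reading of
// the approach, not necessarily the repository's exact implementation.
func doubleNgrams(tokens []string, known map[string]bool, maxLen, minCount int) map[string]int {
    counts := map[string]int{}
    for n := maxLen + 1; n <= 2*maxLen; n++ {
        for i := 0; i+n <= len(tokens); i++ {
            gram := tokens[i : i+n]
            prefix := strings.Join(gram[:maxLen], " ")
            suffix := strings.Join(gram[n-maxLen:], " ")
            if !known[prefix] || !known[suffix] {
                continue
            }
            counts[strings.Join(gram, " ")]++
        }
    }
    // Drop candidates below the frequency threshold.
    for gram, c := range counts {
        if c < minCount {
            delete(counts, gram)
        }
    }
    return counts
}

func main() {
    tokens := strings.Fields("the cat sat on the mat the cat sat on the mat")
    // Pretend a first pass already found these frequent 2-grams.
    known := map[string]bool{
        "the cat": true, "cat sat": true, "sat on": true,
        "on the": true, "the mat": true,
    }
    fmt.Println(doubleNgrams(tokens, known, 2, 2))
}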

Other long n-grams

The longest n-gram appearing more than 10,000 times is:

There were _num_ households out of which _percent_ had children under the age of _num_ living with them, _percent_ were married couples living together, _percent_ had a female householder with no husband present, and _percent_ were non families. _percent_ of all households were made up of individuals and _percent_ had someone living alone who was _num_ years of age or older. The average household size was _num_ and the average family size was _num_.

The longest (and most-frequent) n-gram appearing more than 100,000 times is:

from _num_ to _num_

which is recognizable as a range of numbers expressed in natural language.

I mean, technically it’s “_num_ _num_ _num_ _num_” because sequences of numbers appear so often, but I don’t like it since it doesn’t have any “real words”.

The longest n-gram appearing more than 100,000 times which only contains alphabetic words is “a member of the”. Most n-grams of this length serve common structural purposes such as this.

In general, the most common n-grams of length longer than 5 are from templates, where article content has been imported from another source, or where it is very clear that the same author uses identical language in a long list of articles. For example:

4,966: ... is a moth in the family Crambidae it was described by ...
3,935: In the CDP the population was spread out, with _percent_ under the age of _num_ _percent_ ...
4,907: ... together _percent_ had a female householder with no ...
4,003: ... is a protein that in humans is encoded by the ...

Charts

I made graphs of the longest n-gram at each frequency, dropping entries where a longer n-gram appeared at least as often. This first graph includes the special handling I use for numbers, dates, and percentages.

This second graph is for the “more traditional” n-grams which only include alphabetic words.

Conclusion

This is going to let me get rid of a lot of noise when I begin training models on the Wikipedia corpus. In this case, looking for n-grams allowed me to find a lot of duplicated, stilted, not-really-human language which would have messed up my model’s ability to understand how words are used.

Rather than consider all uses of words like “elected” as the same, I can treat it as a special case when it appears in the 103-word monstrosity I mentioned at the beginning. Models are sensitive to frequency, and lots of cases of a word being used in an identical context incorrectly teach the model that this usage is common or expected. In reality, a human composed the passage once, and it was likely automatically duplicated (or manually copy+pasted) over a thousand times. There isn’t much to be learned from such usage. Instead, for cases like this the 103-word n-gram provides strong evidence that an article is about a very specific type of thing – a township in Ohio – which is exactly what I want my model to learn from this.

On a smaller scale, being able to treat n-grams like “the United States” as entities distinct from words like “states” and “united” means that the model won’t be tricked into confusing “United Airlines” or “States Records” with “the United States”.
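As a sketch of what treating an n-gram as its own entity could look like (my actual pipeline may differ), known n-grams can be merged into single tokens before training:

package main

import (
    "fmt"
    "strings"
)

// mergeNgram replaces each occurrence of a known n-gram with a single joined
// token, so "the united states" becomes one vocabulary entry instead of
// three. A sketch of the idea only; a real pipeline would apply this for
// every n-gram kept from the analysis above.
func mergeNgram(tokens []string, ngram []string) []string {
    var out []string
    for i := 0; i < len(tokens); {
        if i+len(ngram) <= len(tokens) && match(tokens[i:i+len(ngram)], ngram) {
            out = append(out, strings.Join(ngram, "_"))
            i += len(ngram)
        } else {
            out = append(out, tokens[i])
            i++
        }
    }
    return out
}

// match reports whether two equal-length token slices are identical.
func match(a, b []string) bool {
    for i := range a {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}

func main() {
    tokens := strings.Fields("she flew on united airlines to the united states")
    fmt.Println(mergeNgram(tokens, []string{"the", "united", "states"}))
    // [she flew on united airlines to the_united_states]
}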

This is the sort of analysis that is almost always worth doing on a corpus. While going through this work I iterated dozens of times on the code I use to clean the corpus, because each pass surfaced bits of formatting or common human mistakes that gave this analysis strange results. Now that I can get results like this, I can be confident that overall my data is clean enough for more complex work.

Cleaning the Wikipedia Corpus: Articles and Text

In Natural Language Processing, it is rare that a corpus only includes information we care about. Corpora often include lots of low-quality data that we can’t (or shouldn’t) learn much from. The cleaning procedure is specific to the analysis being performed, so rather than simply repeating what I’ve done (or using my code, available on GitHub), I recommend reading this post to understand the reasoning for including or excluding Wikipedia articles. Focus on the methods I employ, so you can modify them should you determine you have different needs.

My Goals

Before I can continue, I need to set out my goals for analyzing the Wikipedia corpus.

I want to build a classifier which analyzes Wikipedia “content” articles and determines the Library of Congress classification (LCC) of the article. For example, Cayetano Heredia University should be classified as LE, while Peter B. Bennett would be CT. I may eventually try to get numerical classifications (e.g. LE-68), but not until after I’ve got good general subject classifiers.

Wikipedia “content” articles, roughly, identify a topic and summarize it. This excludes things like Wikipedia user pages and categories, but includes “List of” articles. The linked article explains in exhaustive detail what is or is not an article and I’ll generally be following that.

I’m specifically interested in using “prose” text for analysis. That is, text similar to this blog post rather than metadata in XML tags or structured data, like tables and charts. Most of the text in Wikipedia is prose, and individual types of charts need their own parsers, so I’m not interested in that until I’ve made a lot of progress on the more general problem.

This is a “fun educational project” for me. I’ll be doing it as long as I’m having fun and learning new things. If I stop doing work for it, I may come back to it later.

And as with any open-ended project like this, I expect my goals will become more refined as I research into related topics.

Metadata to Exclude

Each Wikipedia article comes with a chunk of metadata:

<mediawiki>
  <page>
    <title>Abstract (law)</title>
    <ns>0</ns>
    <id>766</id>
    <revision>
      <id>982995708</id>
      <parentid>940438615</parentid>
      <timestamp>2020-10-11T16:47:04Z</timestamp>
      <contributor>
        <ip>REDACTED</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="3382" xml:space="preserve">

For my analyses I care about the title and content of articles, not who wrote them or when. I also care about the unique identifier, which can be more convenient for representing a map of all articles than the titles themselves.
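In Go, keeping only those fields is a matter of declaring a struct that simply omits everything else; encoding/xml drops elements it has no field for. The struct below is an illustrative sketch with field names of my own choosing, not necessarily what my repository uses.

package main

import (
    "encoding/xml"
    "fmt"
)

// Page keeps only the fields I care about from each <page> element; the
// contributor, timestamps, and revision history are dropped simply by not
// declaring fields for them.
type Page struct {
    Title string `xml:"title"`
    NS    int    `xml:"ns"`
    ID    int    `xml:"id"`
    Text  string `xml:"revision>text"`
}

func main() {
    raw := `<page>
  <title>Abstract (law)</title>
  <ns>0</ns>
  <id>766</id>
  <revision>
    <text>An abstract is ...</text>
  </revision>
</page>`

    var p Page
    if err := xml.Unmarshal([]byte(raw), &p); err != nil {
        panic(err)
    }
    fmt.Printf("%d %q (namespace %d)\n", p.ID, p.Title, p.NS)
}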

Articles to Exclude

The August 2021 dump I’m using has 21,409,406 pages.

There are ten types of articles I’m excluding outright. They are the “namespaced” articles, which either discuss the nature of Wikipedia (such as “What is an article?” above) or are really helper articles (each file uploaded to Wikipedia gets a page).

Article Namespace    Number of Articles
Category                      2,100,543
Wikipedia                     1,170,425
File                            915,410
Template                        592,437
Portal                           93,146
Draft                            59,584
Module                           12,598
MediaWiki                         2,212
TimedText                         1,352
Help                                957
Total                         4,948,664

Next, I’ll be ignoring any redirect articles (example). These have no content and are just an alias for an actual article. There are 10,111,832 redirect pages.

If I stopped here, this would leave me with 6,348,910 articles, the number of English Wikipedia articles at the time of the August 2021 dump.

Lastly, I’ll be discarding disambiguation pages (example). They don’t really fit into an LCC classification and have the potential to add a lot of noise to any models I’m training, even though they are traditionally considered “content” articles. This excludes a further 55,798 articles.
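As a rough illustration of these filters (not necessarily how my code does it), the dump itself carries enough information for all three checks: the namespace is in <ns>, redirect pages carry a <redirect/> element, and disambiguation pages are usually marked with a disambiguation template. The template check below is a heuristic of mine and will miss some variants.

package main

import (
    "encoding/xml"
    "fmt"
    "strings"
)

// Page extends the earlier sketch with the <redirect/> marker that the dump
// includes on redirect pages.
type Page struct {
    Title    string `xml:"title"`
    NS       int    `xml:"ns"`
    Redirect *struct {
        Title string `xml:"title,attr"`
    } `xml:"redirect"`
    Text string `xml:"revision>text"`
}

// keep reports whether a page survives the filters described above.
func keep(p Page) bool {
    if p.NS != 0 { // namespaced pages (Category, Template, File, ...) are excluded
        return false
    }
    if p.Redirect != nil { // redirect pages have no content of their own
        return false
    }
    // Heuristic: drop pages carrying a disambiguation template.
    if strings.Contains(strings.ToLower(p.Text), "{{disambiguation") {
        return false
    }
    return true
}

func main() {
    raw := `<page><title>Mercury</title><ns>0</ns>
  <revision><text>'''Mercury''' may refer to: {{disambiguation}}</text></revision></page>`
    var p Page
    if err := xml.Unmarshal([]byte(raw), &p); err != nil {
        panic(err)
    }
    fmt.Println(p.Title, "kept:", keep(p))
}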

These are the classes of articles I already know I want to exclude. In the future, I may identify other types of articles to exclude, but those will require more thorough analyses.

Text to Exclude

I consider the fact that a piece of text is a link to be metadata, not part of the prose. So

Cultural anthropology has a rich [[methodology]], including [[participant observation]]

becomes

Cultural anthropology has a rich methodology, including participant observation

Similarly, I’m ignoring the title of the article linked to, so [[Diffusion (anthropology)|diffused]] becomes diffused. This is because article titles often include more information than the word used in natural prose. In the future, it may be possible to use this additional information to create a parser which can disambiguate terms that have multiple uses (such as “diffused” in this example).
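A minimal sketch of this link handling in Go; real wikitext has more elaborate link forms (file links, interlanguage links, templates inside captions), so treat this as illustrative rather than complete.

package main

import (
    "fmt"
    "regexp"
)

// linkRE matches [[target]] and [[target|display text]]. Keeping only the
// last pipe-separated field yields the text a reader actually sees.
var linkRE = regexp.MustCompile(`\[\[(?:[^|\]]*\|)*([^|\]]*)\]\]`)

func stripLinks(wikitext string) string {
    return linkRE.ReplaceAllString(wikitext, "$1")
}

func main() {
    fmt.Println(stripLinks("Cultural anthropology has a rich [[methodology]], including [[participant observation]]"))
    fmt.Println(stripLinks("ideas [[Diffusion (anthropology)|diffused]] across regions"))
}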

Next, XML tags are their own special beast. Many formatting tags, such as <small> and <i>, can simply be stripped from the article – keeping them may eventually help with entity recognition, but that’s a later problem to solve. Tags like <p> and <br> indicate that the article has a line break that wasn’t literally entered as a newline, so I treat those as new paragraphs. Several XML tags – such as <math> and <hiero> – communicate information in prose even though their content isn’t easily parseable (I’ll cover exactly what I’m doing for those in a later post). Some tags indicate their contents simply aren’t prose, such as <gallery> or <timeline>.

Wikipedia has a lot of special table types for structured data. On a cursory analysis these are annoying to identify completely, so to be safe I’m automatically excluding text between curly braces. In wikitext these look similar to:

{| class="wikitable"
! Name
! Dates
! In opposition to:
|-
| [[Arnulf, Duke of Bavaria|Arnulf the Bad]] || 919–921 || [[Henry the Fowler]]

While these occasionally contain prose (for example, “List of episodes” articles), it is safe to ignore them for now as they do not make up the bulk of prose text on Wikipedia.
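A blunt way to implement this is to track brace depth and drop every character that sits inside braces; the same depth-counting approach works for the parenthetical text discussed below. A sketch:

package main

import (
    "fmt"
    "strings"
)

// stripBraces drops everything between curly braces, tracking nesting depth
// so that templates inside templates are removed along with their parents.
// This is the blunt approach described above; it loses the occasional bit of
// prose that lives inside a table.
func stripBraces(text string) string {
    var b strings.Builder
    depth := 0
    for _, r := range text {
        switch {
        case r == '{':
            depth++
        case r == '}':
            if depth > 0 {
                depth--
            }
        case depth == 0:
            b.WriteRune(r)
        }
    }
    return b.String()
}

func main() {
    fmt.Println(stripBraces("Arnulf the Bad {| class=\"wikitable\" |} opposed Henry the Fowler"))
}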

I’m excluding the Bibliography sections of all articles – while these sometimes have natural language, this is rare. Eventually it may be possible to use information like the cited journals to discern the topic of an article.

I’m currently debating whether to exclude parenthetical text (such as this). Simply removing the parentheses often breaks grammatical structure, and parentheticals can sometimes form their own prose-like thoughts:

Others, such as Claude Lévi-Strauss (who was influenced both by American cultural anthropology and by French Durkheimian sociology), have argued that apparently similar patterns of development reflect fundamental similarities in the structure of human thought (see structuralism).

Update: I will be discarding parenthetical statements. They’re noisy and if I don’t have enough data I can care about that problem later.

Using my code

If you’ve followed the instructions in my previous post, you can get your own cleaned corpus of Wikipedia by running:

go run cmd/clean-wikipedia/clean-wikipedia.go \
  path/to/input/dir \
  path/to/output/dir

The result should be about 20 GB, about 25% of the original XML extraction. By focusing the corpus down to only the information we care about, we greatly speed up processing, since machines need to read less information from disk.

Sample article before and after. Note that the cleaning process is not perfect, but I’ll be improving it as I find mistakes (or at least, ones which are worth my time to fix).

Further, since most of the text is reasonably-clean prose, you can immediately begin training models on it. There are some hiccups, but I’ll get to that in my next post, which will be on n-gram analyses.

How to download and extract Wikipedia

Note: This is specific to English Wikipedia. For other languages, you should be able to follow roughly the same steps.

Using dumps of Wikipedia is notoriously annoying to do without a guide. Here I’ve condensed down the steps as much as possible, and written a program to do the extraction.

1. Download the Dump

These can be a bit tricky to navigate to. Wikimedia recommends using a mirror such as https://wikimedia.bytemark.co.uk/enwiki/ which caches the dump. Each entry at that link represents the date of the Wikipedia dumps stored there (it may be a few months behind the official source, but is good enough for most uses).

You want to download these two files, “pages-articles-multistream” and “pages-articles-multistream-index”. Note that the date the dump was created is part of the file name.

While it is technically possible to directly extract pages-articles-multistream.xml.bz2 on its own, this is not recommended as it expands to a single ~100 GB file. This is a major pain to read and iterate through, and is difficult to efficiently parallelize. Instead, we use the index to cut out slices of the compressed file for easier use.
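Concretely, each line of the index file has the form offset:pageID:title, and each offset marks the start of an independent bzip2 stream holding roughly 100 pages. A hedged Go sketch of decompressing one such slice (the offsets in main are placeholders, not real values):

package main

import (
    "compress/bzip2"
    "fmt"
    "io"
    "os"
)

// readStream decompresses the single bzip2 stream that starts at byte offset
// `start` within the multistream dump and ends at `end`, the offset of the
// next stream (taken from the next distinct offset in the index file).
func readStream(dumpPath string, start, end int64) ([]byte, error) {
    f, err := os.Open(dumpPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    // Limit the reader to one compressed stream, then decompress just that.
    section := io.NewSectionReader(f, start, end-start)
    return io.ReadAll(bzip2.NewReader(section))
}

func main() {
    // Placeholder offsets; real values come from consecutive entries in
    // pages-articles-multistream-index.txt.
    data, err := readStream("pages-articles-multistream.xml.bz2", 0, 654321)
    if err != nil {
        panic(err)
    }
    fmt.Printf("decompressed %d bytes of <page> XML\n", len(data))
}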

2. Extract the Index

Using your favorite tool for extracting .bz2 files, extract the index (not the core dump file).

3. Extract Wikipedia

Install the Go programming language. Clone my repository, then point the extract-wikipedia program to:

  1. The location of the -multistream.xml.bz2 file
  2. The location of the -multistream-index.txt file
  3. Where to write the extracted XML files to

For example:

$ git clone https://github.com/willbeason/wikipedia.git
$ cd wikipedia
$ go run cmd/extract-wikipedia/extract-wikipedia.go \
  path/to/pages-articles-multistream.xml.bz2 \
  path/to/pages-articles-multistream-index.txt \
  path/to/output/dir

On my machine this takes about 30 minutes to run. The time it takes will depend mostly on the speed of your machine and whether you’re reading from/writing to an SSD. By default extract-wikipedia tries to use all available logical processors, so set `--parallel=1` (or some other low value) if you’re okay with the extraction taking much longer but still want to be able to do other things with your computer.

Note that the result is slightly altered from the compressed contents, as the individual slices are not valid XML documents on their own. To make the XML valid, extract-wikipedia takes a different action depending on the file:

  1. The first file (000000.txt). Example. extract-wikipedia appends </mediawiki> to the end of the file to close the dangling <mediawiki> at the beginning of the file.
  2. The last file. This is the numerically-last file in the numerically-last folder. This file has a dangling </mediawiki> tag at the end, so extract-wikipedia adds <mediawiki> at the top of the file.
  3. All other files. For easier processing, extract-wikipedia prepends <mediawiki> and appends </mediawiki> so that the file is interpreted as a single XML document rather than a series of documents.