Cleaning the Wikipedia Corpus: Articles and Text

In Natural Language Processing, it is rare that a corpus includes only information we care about. Corpora often contain lots of low-quality data that we can’t (or shouldn’t) learn much from. Cleaning is specific to the analysis being performed, so rather than simply repeating what I’ve done (or using my code, available on GitHub), I recommend reading this post to understand my reasoning for including or excluding Wikipedia articles. Focus on the methods I employ, so you can modify them should you determine you have different needs.

My Goals

Before I can continue, I need to set out my goals for analyzing the Wikipedia corpus.

I want to build a classifier which analyzes Wikipedia “content” articles and determines the Library of Congress classification (LCC) of the article. For example, Cayetano Heredia University should be classified as LE, while Peter B. Bennett would be CT. I may eventually try to get numerical classifications (e.g. LE-68), but not until after I’ve got good general subject classifiers.

Wikipedia “content” articles, roughly, identify a topic and summarize it. This excludes things like Wikipedia user pages and categories, but includes “List of” articles. The linked article explains in exhaustive detail what is or is not an article, and I’ll generally be following that definition.

I’m specifically interested in using “prose” text for analysis – that is, text similar to this blog post rather than metadata in XML tags or structured data like tables and charts. Most of the text in Wikipedia is prose, and each type of chart or table needs its own parser, so I’m not interested in structured data until I’ve made a lot of progress on the more general problem.

This is a “fun educational project” for me. I’ll be doing it as long as I’m having fun and learning new things. If I stop doing work for it, I may come back to it later.

And as with any open-ended project like this, I expect my goals will become more refined as I research related topics.

Metadata to Exclude

Each Wikipedia article comes with a chunk of metadata:

<mediawiki>
  <page>
    <title>Abstract (law)</title>
    <ns>0</ns>
    <id>766</id>
    <revision>
      <id>982995708</id>
      <parentid>940438615</parentid>
      <timestamp>2020-10-11T16:47:04Z</timestamp>
      <contributor>
        <ip>REDACTED</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="3382" xml:space="preserve">

For my analyses I care about the title and content of articles, not who wrote them or when. I also care about the unique identifier, which is often more convenient than the title for representing a map of all articles.
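
As a concrete sketch, pulling out just those fields with Go’s encoding/xml might look something like this (the package, type, and function names here are illustrative, not necessarily what my actual code uses):

package wikiclean

import (
    "encoding/xml"
    "io"
)

// Page holds the only fields I keep from each <page> element: the title,
// the namespace number, the page ID, and the raw wikitext of the revision.
type Page struct {
    Title string `xml:"title"`
    Ns    int    `xml:"ns"`
    ID    int    `xml:"id"`
    Text  string `xml:"revision>text"`
}

// readPages streams <page> elements out of a dump and sends the trimmed-down
// Page values to out, discarding contributor, timestamp, and format metadata.
func readPages(r io.Reader, out chan<- Page) error {
    dec := xml.NewDecoder(r)
    for {
        tok, err := dec.Token()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        start, ok := tok.(xml.StartElement)
        if !ok || start.Name.Local != "page" {
            continue
        }
        var p Page
        if err := dec.DecodeElement(&p, &start); err != nil {
            return err
        }
        out <- p
    }
}

Streaming with a Decoder rather than unmarshalling the whole file matters here: the uncompressed dump is far too large to hold in memory.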

Articles to Exclude

The August 2021 dump I’m using has 21,409,406 pages.

There are ten types of articles I’m excluding outright. They are the “namespaced” articles, which either discuss the nature of Wikipedia itself (such as the “What is an article?” page linked above) or are really helper pages (each file uploaded to Wikipedia gets its own page).

Article Namespace    Number of Articles
Category             2,100,543
Wikipedia            1,170,425
File                 915,410
Template             592,437
Portal               93,146
Draft                59,584
Module               12,598
MediaWiki            2,212
TimedText            1,352
Help                 957
Total                4,948,664
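
Conveniently, every one of those ten namespaces carries a nonzero <ns> value in the dump, while content articles live in namespace 0, so the filter can be a one-liner. This is a sketch building on the Page type above; my actual check may differ:

// keepNamespace reports whether a page is a main-namespace (content) page.
// All ten namespaces in the table above have a nonzero <ns> value.
func keepNamespace(p Page) bool {
    return p.Ns == 0
}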

Next, I’ll be ignoring any redirect articles (example). These have no content of their own and are just aliases for actual articles. There are 10,111,832 redirect pages.

If I stopped here, this would leave me with 6,348,910 articles, the number of English Wikipedia articles at the time of the August 2021 dump.

Lastly, I’ll be discarding disambiguation pages (example). Even though they are traditionally considered “content” articles, they don’t really fit into an LCC classification and have the potential to add a lot of noise to any models I train. This excludes a further 55,798 articles.
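
Both the redirect and disambiguation checks can be rough heuristics against the raw wikitext. A sketch, again building on the Page type above and needing the strings package (the template list is deliberately incomplete, and my real detection may be stricter):

// isRedirect reports whether a page just points at another article. Redirect
// pages begin their wikitext with a #REDIRECT directive; the dump also marks
// them with a <redirect/> element, which works just as well.
func isRedirect(p Page) bool {
    return strings.HasPrefix(strings.ToUpper(strings.TrimSpace(p.Text)), "#REDIRECT")
}

// isDisambiguation is a heuristic: disambiguation pages are tagged with one
// of a family of templates, only a couple of which are checked here.
func isDisambiguation(p Page) bool {
    text := strings.ToLower(p.Text)
    return strings.Contains(text, "{{disambig") || strings.Contains(text, "{{dab}}")
}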

These are the classes of articles I already know I want to exclude. In the future, I may identify other types of articles to drop, but those will require more thorough analysis.

Text to Exclude

I consider the fact that a piece of text is a link to be metadata, not part of the prose. So

Cultural anthropology has a rich [[methodology]], including [[participant observation]]

becomes

Cultural anthropology has a rich methodology, including participant observation

Similarly, I’m ignoring the title of the article being linked to, so [[Diffusion (anthropology)|diffused]] becomes diffused. This is because article titles often include more information than the word used in natural prose. In the future, it may be possible to use this additional information to build a parser which can disambiguate terms that have multiple uses (such as “diffused” in this example).
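
Both link forms reduce to the same rewrite: keep the display label when a pipe is present, otherwise keep the target. A regular-expression sketch that handles simple, non-nested links (nested constructs like image links need more care):

// wikiLink matches [[target]] and [[target|label]] links. Group 1 is the
// target; group 2, if present, is the display label.
var wikiLink = regexp.MustCompile(`\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]`)

// stripLinks replaces each link with its prose text: the label when one is
// given, otherwise the link target itself.
func stripLinks(text string) string {
    return wikiLink.ReplaceAllStringFunc(text, func(m string) string {
        parts := wikiLink.FindStringSubmatch(m)
        if parts[2] != "" {
            return parts[2]
        }
        return parts[1]
    })
}

Running this over the sentence above turns both [[methodology]] and [[Diffusion (anthropology)|diffused]] into plain words.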

Next, XML tags are their own special beast. Many formatting tags, such as <small> and <i>, can simply be stripped from the article while keeping their contents – keeping them might eventually help with entity recognition, but that’s a later problem to solve. Tags like <p> and <br> indicate that the article has a line break that wasn’t literally entered as a newline, so I need to treat those as new paragraphs. Several XML tags – such as <math> and <hiero> – communicate information in prose even though their content isn’t easily parseable (I’ll cover exactly what I’m doing for those in a later post). Some tags indicate their contents simply aren’t prose, such as <gallery> or <timeline>.
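
One way I think about this is as a small table of tag dispositions: strip the tag but keep its contents, treat it as a paragraph break, keep a placeholder for content that carries meaning but isn’t parseable prose, or drop the contents entirely. A sketch of that classification (the groupings just mirror the paragraph above):

// tagAction says what the cleaner does when it meets a given XML tag.
type tagAction int

const (
    stripTag     tagAction = iota // drop the tag, keep its contents (<small>, <i>)
    paragraphTag                  // treat as a paragraph break (<p>, <br>)
    placeholder                   // meaningful but not parseable prose (<math>, <hiero>)
    dropContents                  // not prose at all (<gallery>, <timeline>)
)

var tagActions = map[string]tagAction{
    "small":    stripTag,
    "i":        stripTag,
    "p":        paragraphTag,
    "br":       paragraphTag,
    "math":     placeholder,
    "hiero":    placeholder,
    "gallery":  dropContents,
    "timeline": dropContents,
}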

Wikipedia has a lot of special table types for structured data. From a cursory analysis they are annoying to identify exhaustively, so to be safe I’m excluding any text between curly braces. In wikitext these look similar to:

{| class="wikitable"
! Name
! Dates
! In opposition to:
|-
| [[Arnulf, Duke of Bavaria|Arnulf the Bad]] || 919–921 || [[Henry the Fowler]]

While these tables occasionally contain prose (for example, in “List of episodes” articles), it is safe to ignore them for now as they do not make up the bulk of prose text on Wikipedia.
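
Since templates ({{...}}) and tables ({|...|}) both open and close with curly braces and can nest, a depth counter is enough to drop everything between them. A sketch of the idea, not necessarily my exact implementation:

// dropDelimited removes everything between the given delimiters, including
// the delimiters themselves, and copes with nesting by counting depth.
func dropDelimited(text string, opener, closer rune) string {
    var out strings.Builder
    depth := 0
    for _, r := range text {
        switch {
        case r == opener:
            depth++
        case r == closer:
            if depth > 0 {
                depth--
            }
        case depth == 0:
            out.WriteRune(r)
        }
    }
    return out.String()
}

Dropping the wikitable above is then just dropDelimited(text, '{', '}').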

I’m excluding the Bibliography sections of all articles – while these sometimes have natural language, this is rare. Eventually it may be possible to use information like the cited journals to discern the topic of an article.
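
In wikitext a section heading looks like == Bibliography ==, so dropping the section means skipping from that heading until the next heading. A rough sketch (in practice I may need to match more heading names and respect heading levels):

// dropBibliography removes everything from a "== Bibliography ==" heading up
// to the next heading (or the end of the article).
func dropBibliography(text string) string {
    var kept []string
    skipping := false
    for _, line := range strings.Split(text, "\n") {
        trimmed := strings.TrimSpace(line)
        if strings.HasPrefix(trimmed, "==") {
            skipping = strings.EqualFold(strings.Trim(trimmed, "= "), "Bibliography")
        }
        if !skipping {
            kept = append(kept, line)
        }
    }
    return strings.Join(kept, "\n")
}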

I’m currently debating whether to exclude parenthetical text (such as this). Stripping just the parentheses while keeping their contents often breaks the grammatical structure of the surrounding sentence, yet parentheticals can sometimes form their own prose-like thoughts:

Others, such as Claude Lévi-Strauss (who was influenced both by American cultural anthropology and by French Durkheimian sociology), have argued that apparently similar patterns of development reflect fundamental similarities in the structure of human thought (see structuralism).

Update: I will be discarding parenthetical statements. They’re noisy, and if it turns out I don’t have enough data I can worry about that problem later.
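
Mechanically this is the same depth-counting pass as the curly braces above, just with different delimiters (assuming the dropDelimited helper from that sketch):

// cleanText composes the passes: templates and tables first, then parentheticals.
func cleanText(text string) string {
    text = dropDelimited(text, '{', '}') // templates and wikitables
    text = dropDelimited(text, '(', ')') // parenthetical statements
    return text
}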

Using my code

If you’ve followed the instructions in my previous post, you can get your own cleaned corpus of Wikipedia by running:

go run cmd/clean-wikipedia/clean-wikipedia.go \
  path/to/input/dir \
  path/to/output/dir

The result should be about 20GB, roughly 25% of the original XML extraction. By focusing the corpus down to only the information we care about, we greatly speed up processing, since the machine needs to read less data from disk.

Sample article before and after. Note that the cleaning process is not perfect, but I’ll be improving it as I find mistakes (or at least, ones which are worth my time to fix).

Further, since most of the text is reasonably-clean prose, you can immediately begin training models on it. There are some hiccups, but I’ll get to that in my next post, which will be on n-gram analyses.
