How to download and extract Wikipedia

Note: This is specific to English wikipedia. For non-English wikipedia downloads, you should be able to follow roughly the same steps.

Using dumps of Wikipedia is notoriously annoying to do without a guide. Here I’ve condensed down the steps as much as possible, and written a program to do the extraction.

1. Download the Dump

These can be a bit tricky to navigate to. Wikimedia recommends using a mirror such as https://wikimedia.bytemark.co.uk/enwiki/ which caches the dump. Each entry at that link represents the date of the Wikipedia dumps stored there (it may be a few months behind the official source, but is good enough for most uses).

You want to download these two files, “pages-articles-multistream” and “pages-articles-multistream-index”. Note that the date the dump was created is part of the file name.

While it is technically possible to directly extract pages-articles-multistream.xml.bz2 on its own, this is not recommended as it expands to a single ~100 GB file. This is a major pain to read and iterate through, and is difficult to efficiently parallelize. Instead, we use the index to cut out slices of the compressed file for easier use.

2. Extract the Index

Using your favorite tool for extracting .bz2 files, extract the index. (NOT the core dump file)

3. Extract Wikipedia

Install the Go programming language. Clone my repository, then point the extract-wikipedia program to:

The location of the -multistream.xml.bz2 file
The location of the -multistream-index.txt file
Where to write the extracted XML files to

For example:

$ git clone https://github.com/willbeason/wikipedia.git
$ cd wikipedia
$ go run cmd/extract-wikipedia/extract-wikipedia.go \
  path/to/pages-articles-multistream.xml.bz2 \
  path/to/pages-articles-multistream-index.txt \
  path/to/output/dir

On my machine this takes about 30 minutes to run. The time it takes will depend mostly on the speed of your machine and whether you’re reading from/writing-to an SSD. By default extract-wikipedia tries to use all available logical processors, so set `–parallel=1` (or some other low value) if you’re okay with the extraction taking much longer but still want to be able to do other things with your computer.

Note that the result is slightly altered from the compressed file, as the individual parts are not valid XML documents. To make the XML valid, extract-wikipedia takes a different action based on the file.

The first file (000000.txt). Example. extract-wikipedia appends </mediawiki> to the end of the file to close the dangling <mediawiki> at the beginning of the file.
The last file. This is the numerically-last file in the numerically-last folder. This file has a dangling </mediawiki> tag at the end, so extract-wikipedia adds <mediawiki> at the top of the file.
All other files. For easier processing, extract-wikipedia prepends <mediawiki> and appends </mediawiki> so that the file is interpreted as a single XML document rather than a series of documents.

Cyber Net It

Math, Programming, and the World

How to download and extract Wikipedia

1. Download the Dump

2. Extract the Index

3. Extract Wikipedia

Like this:

One thought on “How to download and extract Wikipedia”

Leave a ReplyCancel reply

1. Download the Dump

2. Extract the Index

3. Extract Wikipedia

Share this:

Like this:

One thought on “How to download and extract Wikipedia”

Leave a ReplyCancel reply

Discover more from Cyber Net It