Note: This guide is specific to English Wikipedia. For other language editions, you should be able to follow roughly the same steps.
Working with Wikipedia dumps is notoriously annoying without a guide. Here I’ve condensed the steps as much as possible, and written a program to do the extraction.
1. Download the Dump
These can be a bit tricky to navigate to. Wikimedia recommends using a mirror such as https://wikimedia.bytemark.co.uk/enwiki/, which caches the dump. Each entry at that link is a date on which a Wikipedia dump was created (the mirror may be a few months behind the official source, but is good enough for most uses).
You want to download these two files, “pages-articles-multistream” and “pages-articles-multistream-index”. Note that the date the dump was created is part of the file name.
While it is technically possible to directly extract pages-articles-multistream.xml.bz2 on its own, this is not recommended as it expands to a single ~100 GB file. This is a major pain to read and iterate through, and is difficult to efficiently parallelize. Instead, we use the index to cut out slices of the compressed file for easier use.
2. Extract the Index
Using your favorite tool for extracting .bz2 files, extract the index (NOT the core dump file).
3. Extract Wikipedia
extract-wikipedia takes three arguments, in order:
- The location of the compressed dump file
- The location of the extracted index file
- Where to write the extracted XML files to
$ git clone https://github.com/willbeason/wikipedia.git
$ cd wikipedia
$ go run cmd/extract-wikipedia/extract-wikipedia.go \
    path/to/pages-articles-multistream.xml.bz2 \
    path/to/pages-articles-multistream-index.txt \
    path/to/output/dir
On my machine this takes about 30 minutes to run. The time it takes will depend mostly on the speed of your machine and whether you’re reading from and writing to an SSD. By default extract-wikipedia tries to use all available logical processors, so set `--parallel=1` (or some other low value) if you’re okay with the extraction taking much longer but still want to be able to do other things with your computer.
Note that the result is slightly altered from the compressed file, as the individual parts are not valid XML documents on their own. To make each output file valid XML, extract-wikipedia takes a different action depending on the file:
- The first file (the numerically-first file in the numerically-first folder). extract-wikipedia appends </mediawiki> to the end of the file to close the dangling <mediawiki> at the beginning of the file.
- The last file. This is the numerically-last file in the numerically-last folder. This file has a dangling </mediawiki> tag at the end, so extract-wikipedia prepends <mediawiki> at the top of the file.
- All other files. For easier processing, extract-wikipedia both prepends <mediawiki> and appends </mediawiki> so that each file is interpreted as a single XML document rather than a series of documents.