A Scam by any other Name: Inscriptive Agency and Online Safety

My instinctual response to rage at a system is an urge to systematically deconstruct its power.

The Scam

March 17th, 2022

While it may have begun sooner, the “look who just died” scam entered the Internet’s consciousness on March 17th, 2022 with a Reddit post on r/answers. The user reported receiving a message on Facebook Messenger claiming that someone they knew had died, along with a link to a site trying to steal their Facebook password. Forum users initially responded with advice to contact their friend and let them know their account had been hacked, or to block the sender if they were a stranger.

March 20th, 2022

The post then sat silent for three days before another user reported the same occurrence, having received such a message from a coworker. Unlike the initial poster, they clicked on the link and entered their login information. They soon realized their account was compromised when they saw their Facebook Messenger account had sent the same message to nearly all of their Facebook friends.

March 30th, 2022

Ten days later, another user reported their account was compromised by the same scam. They realized that the scammer had sent the message not only to all of their Facebook friends, but to all of their Facebook contacts, including people they had unfriended. This was the first occurrence of a user increasing their account security in response to being scammed – they enabled 2-factor authentication and revoked session cookies for active logins to be sure. To this point, the scammers had no clear goal. Why did they want control of people’s Facebook accounts?

April 5th, 2022

Another user reported experiencing the scam. But this time, they did not escape with only injuries to their pride. The scammers collected the payment information from this user’s account and stole $2,000 from their debit cards. Thus began a deluge of other Facebook users flocking to the Reddit post, with dozens of other reports and over a hundred total comments until July 2022.

May 9, 2022

On May 9, 2022 the online safety and fact-checking website ThatsNonsense.com reported on the scam. In the article, they point out several features of the message that indicate it is illegitimate and that internet users should learn to look for. They also give advice on what to do if a user has already fallen victim to the scam.

(Image from snopes.com)

Days later, cable news began covering the scam, with WGAL – Lancaster/Harrisburg being one of the first to report on it. DuckDuckGo indexed an MSN news report from the same day, but the article has since been deleted and the link is now broken, most likely because MSN changed the structure of its site.

Interlude

Over the following months, dozens of news agencies reported on this scam, the money people lost, and how to spot it in the wild.


Evolution

The scam evolved into at least one other form in the coming months. Rather than reaching out via Facebook Messenger, the scammers took a screenshot of a paused video from the ABC News YouTube channel showing a wrecked car on the side of the road in Georgia, with debris scattered in the grass. They edited the metadata of the site to display a fake view count – implying that the video was highly viewed.

(screenshot by Will Beason)

To ensure the post would be prioritized by Facebook’s ranking algorithm, the scammers did several things known to make posts more likely to appear in people’s feeds.

  • They made sure that the generated link going to the malicious site had a thumbnail. Facebook prioritizes links which have valid thumbnails.
  • They re-shared their own post. Facebook prioritizes “shared” posts over originals, even when a user shares their own post.
  • They tagged exactly nine other people. This is a “sweet spot” where the notification of the tagging still shows up in people’s Facebook notifications, and it doesn’t trigger anti-spam measures. Further, notifications prompt people to view the post in order to get rid of the notification. While users may delete notifications without viewing the post, they have no means of knowing what they’re missing before clicking to view.
  • They made the post public so that the post could potentially appear in anyone’s feed – even people they have not “friended” on Facebook.

Together, these steps nearly guarantee that at least the nine tagged people will see the post. The other measures mean the post itself is likely to be highly ranked in all of their friends’ feeds and will likely be seen by many dozens of people. For those tagged, the friends who interact with them often are also likely to see the post, as interaction frequency is another feature Facebook heavily weights when ranking posts for user feeds.

Facebook’s Responsibility

This scam has persisted for over a year despite national news coverage, many hacked accounts, and the simple, predictable nature of these interactions. Of the 20 such posts above, all linked to the same domain and had an identical thumbnail. All of the posts in a similar period of time use identical wording. In the past few weeks, it has been “Just happened an accident”, but the wording has differed over time. It would not be technically difficult to detect these posts: I am able to easily find dozens of such posts by searching for the wording-of-the-day using Facebook’s already-extant search feature. Presumably, Facebook has means of searching posts by the linked domain and the number of people tagged, which would make identifying these posts even easier.

And yet, this scam goes on.

Reports

On August 11th, in a flurry of data collection (rage?), I conducted a search as described above, immediately finding 25 such scam posts. I went through the report process for each post:

  1. Three dots -> Report post
  2. Please select a problem -> Spam
  3. Submit

This creates an entry in the Facebook Support Inbox tied to my account.

The message assures me that Facebook will use this information to identify spam, and links to more information about how Facebook deals with spam content.

Response

Facebook regularly responds to reports of spam. These responses do not arrive on a regular timeframe – it isn’t clear how they are prioritized. I’ve had reports of this scam sit in my support inbox for months, while others are answered within days. In every case for this scam where I have gotten a response (~10 so far from the August 11th reports), Facebook has responded with the following.

Sometimes the message gives an option to request a “re-review” of the content. Other times, the only options available are to block the compromised account or to delete the message from my support inbox. There isn’t a clear pattern here.

Example where I was able to request a re-review:

In no cases in the several months I have reported these scams has Facebook reported that one of these posts was taken down. I’ve reported about 40 in total.
Note: It is possible that Facebook removes messages from the support inbox – I have not verified that all of my reports still have entries.

Wherefore art thou, scam?

It is clear to users that these are scam posts. On many of the posts I’ve reported, friends of the person warn others not to click the link and tell the person that their account has been compromised. Regular news coverage reports the impact on those who have lost money to the scam. Online safety sites have detailed information about the scam: how to spot it and how to recover.

But none of that matters because Facebook decides what is, or is not, a scam on Facebook.

~~ Fin ~~

Stopping here because I’m exhausted. I’m sure there’s neat implications to write about later. I hope you enjoyed my little weekend project.

Don’t count on language transformers citing their work

An example of a language model generating incorrect text and then attempting to cite the false claim. As of this writing, TLS 1.4 does not exist and Wikipedia makes no mention of TLS 1.4.

“Truth” for language transformers

Language transformers do not have a concept of whether text they’ve generated is “true” – they are designed to predict likely continuations of a piece of text. The algorithms behind this are sophisticated, but they aren’t trying to do anything but recreate textual patterns in the corpus they were trained on. What matters to the transformer is “Given the preceding text, what should come next?” To the transformer, it is immaterial whether the generated text is factual information or misinformation.

Consider these two prompts:

  1. “A recent study by the Harvard Medical School found that the new Novavax vaccine _”
  2. “My 2yo was never the same after she got her shot. And then she got sick anyway. Clearly, the vaccine _”

Suppose you saw two comments on the Internet, one beginning with each of the prompts. What would you expect to come next, based on what you have read on the Internet? (For this exercise, rely only on your memory.)

Even though you probably personally disagree with what one of the continuations says, you would not be wrong in providing contradictory continuations of these two prompts. You, taking the role of the language transformer, haven’t actually made these claims – in this exercise your goal was not to inform whoever entered the prompt; it was to continue the text. In the same way, the language transformer is not evaluating the text it is generating in this way – it isn’t “taking a side” or even, in some sense, “saying” these things.

Not-Really-Citations

Consider possible responses you may have had to the first prompt. You could have said “… was 85% effective in preventing COVID infection 2 weeks after the vaccine was administered” or “… was found to be more effective in people who had never had COVID before.” Even though neither of these is (necessarily) supported by research, they are plausible continuations. Conversely, you are unlikely to have said something like “… caused people to spontaneously grow two additional eyes.” To the language transformer, those first two continuations would be rated at a much higher likelihood than the third and so it would tend to generate responses like those.

One way an application could cite text generated by a Transformer.

Language transformers don’t reference documents at query time and then generate text based on those documents. They are trained long before, and attempt to generate text that is a continuation of whatever they’ve been supplied. After the transformer has generated text, separate algorithms analyze the text and look for the documents “most relevant” to it. Similarly, you could search the Internet for reliable sources that say similar things to how you continued the first prompt, and then cite them. You – and the transformer – aren’t really “citing” where you got this information; you made a guess and then looked for documents that seemed to support your guess.
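
To make this concrete, here is a minimal Go sketch of that “generate first, find citations afterward” flow. Everything here is invented for illustration – the similarity measure is a crude word-overlap score and the source list is hard-coded – but it shows the structural problem: the citer only measures textual overlap, not whether the source supports the claim.

package main

import (
    "fmt"
    "strings"
)

// wordSet lowercases a text and returns the set of words it contains.
func wordSet(text string) map[string]bool {
    set := make(map[string]bool)
    for _, w := range strings.Fields(strings.ToLower(text)) {
        set[strings.Trim(w, ".,;:()\"")] = true
    }
    return set
}

// jaccard measures the textual overlap between two word sets.
func jaccard(a, b map[string]bool) float64 {
    inter := 0
    for w := range a {
        if b[w] {
            inter++
        }
    }
    union := len(a) + len(b) - inter
    if union == 0 {
        return 0
    }
    return float64(inter) / float64(union)
}

// citeSentence picks the source passage most textually similar to a generated
// sentence. Nothing here checks whether the source supports every claim.
func citeSentence(sentence string, sources map[string]string) (string, float64) {
    s := wordSet(sentence)
    bestURL, bestScore := "", 0.0
    for url, passage := range sources {
        if score := jaccard(s, wordSet(passage)); score > bestScore {
            bestURL, bestScore = url, score
        }
    }
    return bestURL, bestScore
}

func main() {
    sources := map[string]string{
        "https://en.wikipedia.org/wiki/Transport_Layer_Security": "TLS is widely used in applications such as email, instant messaging, and voice over IP.",
    }
    generated := "TLS 1.4 is widely used in applications such as email, instant messaging, and voice over IP."
    url, score := citeSentence(generated, sources)
    fmt.Printf("cite %s (similarity %.2f)\n", url, score)
}

The generated sentence gets a confident-looking citation even though its claim about TLS 1.4 appears nowhere in the cited passage.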

It may be possible to attempt to determine which claims generated text makes, and then re-prompt the transformer if it generated incorrect information, but these are hard problems. Identifying the claims a text makes is not straightforward, and determining whether two texts say the same thing is difficult – even for humans. If the transformer generates “the vaccine was very effective” but the research says “the vaccine prevented 85% of cases in adults and 75% in children”, is the generated text “factual”? The citer may reference that research, but a reasonable person may disagree that a vaccine is “very effective” if it is only 75% effective in children.

Conclusion

Together, these factors mean that applications which use transformer-generated text will be unable to reliably cite sources which back up the text. In the example at the beginning of this post, there is no TLS version 1.4. While NeevaAI cites the TLS Wikipedia Page, as of this writing the text “1.4” does not appear anywhere on that page. Most likely, the transformer generated “is widely used in applications such as email, instant messaging, and voice over IP” because this text appears verbatim in the second sentence of the Wikipedia article. The citer seems to generate one citation per sentence, and so it likely chose Wikipedia as the source for that sentence because it found a high degree of textual similarity between that sentence and the text on Wikipedia. The citer didn’t catch that the generated sentence also makes a claim about TLS 1.4 that the Wikipedia article does not.

Rolling Advantage with m dice of n sides each

Matt Parker is wrong on the internet and someone must rise to correct him. Alas, today is my day. I shall double reverse-Derek the one and only Matt Parker.

In a recent video, Matt Parker explored a beautiful, geometric way of understanding a problem known to many RPG players. In many RPGs (such as Dungeons and Dragons), players roll a 20-sided die (also called a d20, for short) to determine the outcome of an action. In some cases, the dungeon master (who interprets the rules) may decide to give a player “advantage” on a roll, allowing them to roll two d20s (written as 2d20) instead of one and take the higher of the two. The question that Matt Parker explores is: what is the expected value of taking advantage – how much better is it than just rolling one d20?

He then extends his answer to a more general form:

And this is where he is wrong.

First, consider the case of rolling md1. That is, rolling some number of 1-sided dice. Matt’s formula incorrectly gives:

Obviously since a d1 can only ever return 1 (and so the above should simplify to 1 for all m), something is amiss.

Here I need to define some mathematical notation for clarity. Define a random variable +mdn as the result of taking advantage when rolling m dice, each with n sides.

We can similarly check the case of rolling md2 since this is easy enough to check by hand. The probability of +md2 evaluating to 1 is the probability of only rolling 1s, and so we get:
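
Writing that out: the only way for the maximum to be 1 is for every one of the m dice to show a 1, so

P(+m\,d2 = 1) = (1/2)^m, \qquad E(+m\,d2) = 1 \cdot (1/2)^m + 2 \cdot \left(1 - (1/2)^m\right) = 2 - (1/2)^m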

This is definitely not in line with Matt’s formula. So, what is the actual formula?

The below proof relies on the fact that the calculation for the expectation of a variable can be partitioned into mutually exclusive cases. That is:
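
In symbols, for mutually exclusive and exhaustive events A_1, …, A_k (this is the law of total expectation):

E(X) = \sum_{i=1}^{k} P(A_i)\, E(X \mid A_i)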

Now, suppose that we had some way of representing E(+mdn) and we want to find E(+mdn+1). That is, we want to increase the number of faces of our dice. As we can split our expectation calculation, we can see:
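
Splitting on whether any of the m dice shows the new face n+1:

E(+m\,d(n{+}1)) = \left(\frac{n}{n+1}\right)^{m} E(+m\,dn) + \left(1 - \left(\frac{n}{n+1}\right)^{m}\right)(n+1)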

The first term is the probability that we have no faces of value n+1 times the expected value of taking advantage with mdn. The second term is the probability of having at least one face of value n+1 times that value, as when taking advantage only the largest value is counted.

And now for md3. Trivially, the probability of obtaining zero 3s when rolling m d3s is (2/3)^m, so the iteration gives E(+md3) = (2/3)^m (2 - (1/2)^m) + (1 - (2/3)^m) \cdot 3 = 3 - (2/3)^m - (1/3)^m.

This pattern is suspicious, so let’s guess a general form and check whether the iteration above preserves it.

Therefore, the expected value of rolling advantage with mdn is:
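
E(+m\,dn) = n - \sum_{k=1}^{n-1} \left(\frac{k}{n}\right)^{m}

This form satisfies the iteration above, and it reduces to 1 for every md1 and to 2 - (1/2)^m for md2, matching the hand checks.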

Through symmetry, the expected value of rolling with disadvantage (taking the lowest die) is:
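
Writing -m\,dn for the disadvantage roll (my notation, by analogy with +m\,dn):

E(-m\,dn) = 1 + \sum_{k=1}^{n-1} \left(\frac{k}{n}\right)^{m} = n + 1 - E(+m\,dn)

The symmetry comes from replacing each face value v with n + 1 - v, which swaps maximums for minimums.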

Using Degree Centrality to Break Wikipedia Category Cycles

I’m training machine learning models to classify articles on Wikipedia. To do this well, I need a lot of training data. I’d originally intended to use Wikipedia categories to automatically classify articles. Unfortunately, I can’t assume articles in Category:Philosophy or one of its subcategories should all be “Philosophy” as it turns out categories aren’t used consistently enough to use them out-of-the-box.

Category Cycles

A category cycle is when a category is its own ancestor. For example (as of this writing) Category:Quintephones is a subcategory of Category:Electronic musical instruments, which is a subcategory of Category:Quintephones. This specific example isn’t problematic, but if such a cycle exists which includes many very general topics, it can make classifying articles by their category difficult.

In principle, Wikipedia categories should contain no cycles – a parent category should be strictly broader than its children. In practice, this is not the case. These cycles cause problems when trying to use categories as information in machine learning models and for labeling training data.

There are too many relationships (“category A is in category B”) to consider one-by-one. I need some way to narrow down category relationships to the ones which cause the most problems. So, given a strongly-connected directed graph of Wikipedia category relationships I’d like to try to automatically identify the most problematic relationships (the ones which shouldn’t exist), manually examine the results, and then remove the relationships I can tell are bad.

Centrality Measures

The first tool we’ll be using is centrality. Roughly, centrality measures which entities are “most important” in a network of related things. The most important entities don’t necessarily have problematic relationships, but they are the ones most likely to. There are several types of centrality, each with their own strengths and weaknesses. We’ll be using several of them to analyze the various strongly-connected digraphs we found last time.

For this post, we’ll be studying the second-largest, which contains 77 categories mainly dealing with Californian geographical features. We’ll call this the California subgraph.

Degree Centrality

For directed graphs, there are actually two types of degree centrality – indegree and outdegree. For our case, these respectively correspond to:

  • Indegree centrality: The number of subcategories of “A”.
    • For example, “Science Fiction” and “Fantasy” are subcategories of “Fiction by Genre” as they are more specific forms of “Fiction”.
  • Outdegree centrality: The number of pages “A” lists as a parent category.
    • For example, “Fiction” is a parent category of “Fiction by Genre” as it includes broader topics than just lists of fictional works organized by genre.
The subcategories and parent categories of “Category:Shasta Cascade”.

Category:Shasta Cascade has three child categories in our graph (that is, only considering child categories in this list), so its indegree centrality is 3:

The page has one parent category which is in our graph, Category:Sacramento Valley, so its outdegree centrality is 1.
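
As a toy illustration, computing both centralities from an edge list takes only a few lines of Go. The child categories named here are placeholders rather than the actual entries in the dump:

package main

import "fmt"

// Each edge points from a child category to one of its parent categories,
// matching the direction of the arrows in the plots below.
type edge struct{ child, parent string }

func main() {
    // A hand-written fragment standing in for the California subgraph; the
    // real edge list comes from the Wikipedia category dump.
    edges := []edge{
        {"Category:Example child A", "Category:Shasta Cascade"},
        {"Category:Example child B", "Category:Shasta Cascade"},
        {"Category:Example child C", "Category:Shasta Cascade"},
        {"Category:Shasta Cascade", "Category:Sacramento Valley"},
    }

    indegree := make(map[string]int)  // number of subcategories pointing at a category
    outdegree := make(map[string]int) // number of parent categories a category points at
    for _, e := range edges {
        indegree[e.parent]++
        outdegree[e.child]++
    }

    fmt.Println("indegree: ", indegree["Category:Shasta Cascade"])  // 3
    fmt.Println("outdegree:", outdegree["Category:Shasta Cascade"]) // 1
}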

Below is a graph plot of the categories. Each dot (or “node”) is a single category from the California subgraph, where arrows point from a child category to one of its parents. Larger dots represent categories with more subcategories.

Unfortunately there’s no great way to label individual nodes in a graph with this many nodes. The two large nodes are Category:San Joaquin Valley (left) and Category:Sacramento Valley (right).

The category with the highest indegree and outdegree centrality is Category:Sacramento Valley, with 9 parent categories in this subgraph and 11 child categories. Second place is Category:San Joaquin Valley. These are immediately, obviously the problem pages in this graph. While parts of these valleys are in various Californian counties, the fact that parts of them extend into those counties is not an “essential—defining—characteristic” of those topics. By this definition, they belong in none of the other categories in this subgraph. What happens if we remove the outgoing edges from these nodes?

After Removing the Problematic Relationships

The same category network as before, but with outgoing edges for “Category:San Joaquin Valley” and “Category:Sacramento Valley” removed. Note that there are no longer any cycles, and the most specific categories are easily identifiable, splayed into an outer ring.

Removing the outgoing edges from “Category:San Joaquin Valley” and “Category:Sacramento Valley” completely eliminates all cycles in the California subgraph. These seventeen edges are the minimum required to do so. The result looks like a much more sensible graph of category parent/child relationships.

Conclusion

Since this is an easy problem to fix, I’m going to go ahead and make these changes to the Sacramento Valley and San Joaquin Valley categories myself, and start a discussion on the talk pages pointing to this post for explanation.

I’m still going to explore other types of centrality measures to get a feel for their strengths and weaknesses – I’ll likely need to use multiple techniques to fix the largest graph, as it has nearly 12,000 categories.

The Largest Connected Digraph of Wikipedia Categories

I’ve been playing with Wikipedia corpora for a few months now, and it’s in a good enough state that I’ve even constructed a basic Bayesian classifier which classifies Wikipedia articles into the top 19 Library of Congress Classifications I care about. The problem I’m having now is one of training data – to be reasonably good, most machine learning models need lots of high-quality labeled data.

I need some way to label a large number (at minimum, thousands) of training examples from a wide variety of Wikipedia pages. Enter: Wikipedia categories.

Note: for all data in this post, I used the September 1st, 2021 dump of Wikipedia pages and articles.

Wikipedia Categories

Categories are listed at the bottom of many Wikipedia articles. They form a sort-of-hierarchy of pages. Per the Wikipedia:Categorization article, these sound like exactly what I’m looking for:

“The central goal of the category system is to provide navigational links to Wikipedia pages in a hierarchy of categories…”

“Category chains formed by parent–child relationships should never form closed loops; that is, no category should be contained as a subcategory of one of its own subcategories.”

Seen above, Star Wars is a “Science Fiction” story, so it falls in the “Science Fiction” category. The Science Fiction category lists all pages (and categories) which list themselves as being part of the “Science Fiction” category. So far, so good.

Further, the Science Fiction category itself is a member of the Fiction by Genre category. This makes sense. Naively, we would think that this could form a reasonable hierarchy – any article in “Fiction by Genre” or one of its subcategories should fall under the Literature classification.

This is where the strangeness of crowdsourced data comes in. Given any category, we could keep following its parents until we reach one which is manually labeled. This is one potential path, starting from “Science Fiction” and successively choosing parent categories:

  • Science Fiction
  • Fiction by Genre
  • Fiction
  • The arts
  • Culture
  • Human activities
  • Humans
  • Human biology
  • Branches of biology
  • Subfields by academic discipline
  • Categories by topic
  • Categories by parameter
  • Wikipedia categories
  • Wikipedia categorization
  • Wikipedia content administration
  • Wikipedia administration
  • Wikipedia
  • Online encyclopedias
  • Free content
  • Copyright law
  • Law by issue
  • Categories by issue
  • Categories by parameter

There are two immediate problems. First – we’ve entered a cycle! The Categories by parameter category is its own parent/child. Second – we’ve clearly gone way off topic. Star Wars should not be classified as a “Law” or “Biology” article, even though it falls into “Law” and “Biology” categories.

This is unfortunate, contradicts the intent of Wikipedia categories, and I’ll have to work around it. Before doing so, I’d like to study it and see how extensive this problem really is. Enter: Connected digraphs.

Connected Digraphs

A directed graph, commonly “digraph”, represents a set of elements which are connected with one-way relationships. Your biological family tree forms a digraph – you are the offspring of your parents; they are not your offspring (assuming no time-travel incest shenanigans). Digraphs aren’t limited to only one-way connections; relationships may go two ways. A common example of a digraph is a city’s downtown. Some intersections are connected with two-way roads, and others by one-way roads. For example:

In this city, we say that “Intersection A is linked to Intersection B”, but also “B is not linked to A” since the road from A to B is one-way. In a well-designed downtown, it should be possible to eventually get from any intersection to any other intersection – it should be “strongly connected”. In the downtown above, unfortunately, anyone who gets to F can never escape. Similarly, once someone gets to C they can’t get back to A, B, D, or E – they may only proceed to F and never return. We would say that the digraph representing the intersections of the downtown above is “weakly connected” as there is not a path from each intersection to every other intersection (without breaking the law).

If we were to turn the road connecting E to F into a two-way road, this would resolve the issue – everyone who got stuck at F would be free to get to the rest of this city. Such a digraph would be “strongly connected”.

To determine the scope of the problem with the Wikipedia category hierarchy, I want to find all such strongly connected graphs, as they will cause problems for any labeling scheme based on Wikipedia Categories.

<insert coding montage>
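
The heart of that montage is a strongly-connected-components pass. Here is a self-contained Go sketch using Kosaraju’s algorithm on a toy graph; the real run covers millions of edges and needs an iterative DFS to avoid blowing the stack, but the idea is the same:

package main

import "fmt"

// sccs finds the strongly connected components of a directed graph given as
// an adjacency list, using Kosaraju's two-pass algorithm.
func sccs(graph map[string][]string) [][]string {
    // Pass 1: record nodes in order of DFS completion on the original graph.
    visited := make(map[string]bool)
    var order []string
    var visit func(string)
    visit = func(n string) {
        if visited[n] {
            return
        }
        visited[n] = true
        for _, m := range graph[n] {
            visit(m)
        }
        order = append(order, n)
    }
    for n := range graph {
        visit(n)
    }

    // Build the transpose graph (every edge reversed).
    transpose := make(map[string][]string)
    for n, ms := range graph {
        for _, m := range ms {
            transpose[m] = append(transpose[m], n)
        }
    }

    // Pass 2: DFS on the transpose in reverse finishing order; each tree is one SCC.
    assigned := make(map[string]bool)
    var components [][]string
    var collect func(n string, comp *[]string)
    collect = func(n string, comp *[]string) {
        if assigned[n] {
            return
        }
        assigned[n] = true
        *comp = append(*comp, n)
        for _, m := range transpose[n] {
            collect(m, comp)
        }
    }
    for i := len(order) - 1; i >= 0; i-- {
        var comp []string
        collect(order[i], &comp)
        if len(comp) > 1 { // single nodes aren't cycles, so skip them
            components = append(components, comp)
        }
    }
    return components
}

func main() {
    // Edges point from a subcategory to its parent categories.
    g := map[string][]string{
        "Quintephones":                   {"Electronic musical instruments"},
        "Electronic musical instruments": {"Quintephones", "Musical instruments"},
        "Musical instruments":            {},
    }
    fmt.Println(sccs(g)) // one component containing the two mutually-nested categories
}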

The Largest Strongly-Connected Wikipedia Category Digraph

I found it. This massive strongly-connected graph contains 11,895 of all 2,102,570 categories. That is, 0.57% of all categories belong to this single strongly-connected digraph. This single digraph contains topics as disparate as:

The problem is that since these topics are so broad, once a category (or article) gets classified into one of these, it immediately inherits all of them. Using any category in this group would add lots of noise to my model.

The second-largest is 77 geography-related categories in California.

The third-largest is 70 categories related to wars in the Middle East.

The fourth-largest is 53 geography-related categories in the southern US and central America.

There are several hundred smaller strongly-connected graphs, for a total of 356 distinct digraphs. Most of these are obvious mistakes, such as:

Future Work

There are three things I’m going to do with this information.

  1. Many of the smaller (but still sizeable) categories all fit into the same LCC classification. I can easily use these as training data.
  2. Rather than assume that Wikipedia articles are transitively of the same topic as their parent categories, I should reduce how much I weight that information based on how many steps it takes to get to a classified category.
  3. I can use centrality measures to find which categories are most problematic, and both fix those in my own data set and possibly recommend these changes to Wikipedia.

What are the longest n-grams on Wikipedia?

TL;DR

I found the longest n-gram on Wikipedia which appears more than 1,000 times. It is 103 words long. It is:

The township is governed by a three-member board of trustees, who are elected in November of odd-numbered years to a four-year term beginning on the following January 1. Two are elected in the year after the presidential election and one is elected in the year before it. There is also an elected township fiscal officer, who serves a four-year term beginning on April 1 of the year after the election, which is held in November of the year before the presidential election. Vacancies in the fiscal officership or on the board of trustees are filled by the remaining trustees.

It appears 1,275 times (exactly, word-for-word), only in articles about small towns in Ohio. It’s part of the Ohio Revised Code which as far as I can tell applies to all Ohioan townships.

What is an n-gram?

An n-gram is a contiguous sequence of n words from a text, or from a corpus of texts. Finding frequently repeated n-grams is helpful for document classification, as repeated sequences of words often have a meaning different than their individual words would suggest. An “Academy Award” isn’t just an award from an academy, but a specific award from the Academy of Motion Picture Arts and Sciences. Letting our models know during training that “Academy Award” may mean something different than “academy” and “award” can make it easier to identify fine-grained topics like this.

How I found this

After cleaning my Wikipedia corpus, I took several additional steps to normalize it:

  • I made everything lowercase
  • Sequences of numbers became “_num_”
  • A “_num_” followed by a percent symbol is replaced with “_percent_”
  • Certain recognized date formats become “_date_” – for example, “10 January 1995”

That is, for the purposes of my analysis I consider numbers to be “semantically identical”. The phrase “he was 29 years old” doesn’t have a significantly different meaning from “he was 81 years old”. For first-pass machine learning models the noise and variance introduced by treating every number as a completely different “word” is rarely useful.

Beyond the above, I define a “word” to be any sequence of characters matching the regex “[\w']+”, and I define words to be “distinct” if, and only if, their lowercase forms exactly match.
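
Here is a rough Go sketch of that normalization and tokenization. The regular expressions below are simplified stand-ins for the real rules – my actual code recognizes more date and percent formats – but they show the shape of the pipeline:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

var (
    // A word is any run of word characters or apostrophes, per the definition above.
    wordRE = regexp.MustCompile(`[\w']+`)
    // Simplified stand-ins for the normalization rules described above.
    dateRE    = regexp.MustCompile(`\b\d{1,2} (january|february|march|april|may|june|july|august|september|october|november|december) \d{4}\b`)
    percentRE = regexp.MustCompile(`\b\d[\d,.]*\s*%`)
    numRE     = regexp.MustCompile(`\b\d[\d,.]*\b`)
)

// normalize lowercases text, collapses dates, percentages, and numbers into
// placeholder tokens, and splits the result into words.
func normalize(text string) []string {
    t := strings.ToLower(text)
    t = dateRE.ReplaceAllString(t, " _date_ ")
    t = percentRE.ReplaceAllString(t, " _percent_ ")
    t = numRE.ReplaceAllString(t, " _num_ ")
    return wordRE.FindAllString(t, -1)
}

func main() {
    fmt.Println(normalize("He was 29 years old on 10 January 1995, and 45% agreed."))
    // Output: [he was _num_ years old on _date_ and _percent_ agreed]
}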

To find the n-grams, I employed an iterative algorithm which takes as input all currently-known n-grams. On first iteration it only finds all 2-grams. The second iteration, it finds all 3-4-grams. Each iteration, the maximum n-gram length it finds doubles. You can find the successive commands I ran to do this in my GitHub repository.
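
A stripped-down sketch of one such doubling iteration in Go looks roughly like this. The threshold and the toy input are made up, and the real implementation streams over the corpus rather than holding all tokens in memory:

package main

import (
    "fmt"
    "strings"
)

// growNGrams performs one doubling iteration: given the frequent n-grams found
// so far (up to maxLen words each), it counts candidate n-grams of up to
// 2*maxLen words formed by two known n-grams appearing back-to-back, and keeps
// those occurring at least minCount times.
func growNGrams(tokens []string, known map[string]bool, maxLen, minCount int) map[string]int {
    counts := make(map[string]int)
    for i := range tokens {
        seen := make(map[string]bool) // avoid double-counting one occurrence via different splits
        for l1 := 1; l1 <= maxLen && i+l1 <= len(tokens); l1++ {
            left := strings.Join(tokens[i:i+l1], " ")
            if !known[left] {
                continue
            }
            for l2 := 1; l2 <= maxLen && i+l1+l2 <= len(tokens); l2++ {
                right := strings.Join(tokens[i+l1:i+l1+l2], " ")
                if !known[right] {
                    continue
                }
                cand := left + " " + right
                if l1+l2 > maxLen && !seen[cand] {
                    seen[cand] = true
                    counts[cand]++
                }
            }
        }
    }
    for cand, n := range counts {
        if n < minCount {
            delete(counts, cand)
        }
    }
    return counts
}

func main() {
    tokens := strings.Fields("the united states the united states of america the united states")
    known := make(map[string]bool)
    for _, t := range tokens { // every word starts out as a known 1-gram
        known[t] = true
    }
    maxLen := 1
    for iter := 1; iter <= 3; iter++ {
        found := growNGrams(tokens, known, maxLen, 2)
        fmt.Printf("iteration %d (lengths up to %d): %v\n", iter, 2*maxLen, found)
        for ngram := range found {
            known[ngram] = true
        }
        maxLen *= 2
    }
}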

Other long n-grams

The longest n-gram appearing more than 10,000 times is:

There were _num_ households out of which _percent_ had children under the age of _num_ living with them, _percent_ were married couples living together, _percent_ had a female householder with no husband present, and _percent_ were non families. _percent_ of all households were made up of individuals and _percent_ had someone living alone who was _num_ years of age or older. The average household size was _num_ and the average family size was _num_.

The longest (and most-frequent) n-gram appearing more than 100,000 times is:

from _num_ to _num_

which is recognizable as a range of numbers expressed in natural language.

I mean, technically it’s “_num_ _num_ _num_ _num_” because sequences of numbers appear so often, but I don’t like it since it doesn’t have any “real words”.

The longest n-gram appearing more than 100,000 times which only contains alphabetic words is “a member of the“. Most n-grams of this length serve common structural purposes such as this.

In general, the most common n-grams of length longer than 5 are from templates, where article content has been imported from another source, or where it is very clear that the same author uses identical language in a long list of articles. For example:

4,966: ... is a moth in the family Crambidae it was described by ...
3,935: In the CDP the population was spread out, with _percent_ under the age of _num_ _percent_ ...
4,907: ... together _percent_ had a female householder with no ...
4,003: ... is a protein that in humans is encoded by the ...

Charts

I made graphs of the longest n-gram of a given frequency. I’ve dropped lengths for n-grams which appeared less often than longer n-grams. This first graph includes the special handling I use for numbers, dates, and percentages.

This second graph is for the “more traditional” n-grams which only include alphabetic words.

Conclusion

This is going to let me get rid of a lot of noise when I begin training models on the Wikipedia corpus. In this case, looking for n-grams allowed me to find a lot of duplicated, stilted, not-really-human language which would have messed up my model’s ability to understand how words are used.

Rather than consider all uses of words like “elected” as the same, I can consider it a special case when it appears in the 103-word monstrosity I mentioned at the beginning. Models are sensitive to frequency, and lots of cases of a word being used in an identical context incorrectly teach the model that this usage is common or expected. In reality, a human composed the passage once, and it was likely automatically duplicated (or manually copy+pasted) over a thousand times. There isn’t much to be learned from such usage. Instead, for cases like this the 103-word n-gram provides strong evidence that an article is about a very specific type of thing – a township in Ohio – which is exactly what I want my model to learn from this.

On a smaller scale, being able to treat n-grams like “the United States” as entities distinct from words like “states” and “united” means that the model won’t be tricked into confusing “United Airlines” or “States Records” with “the United States”.

This is the sort of analysis it is almost always worth doing on a corpus – while going through this work I iterated dozens of times on the code I used to clean the corpus, as each time I found cases where there were bits of formatting or common human mistakes that caused this analysis to give strange results. Now that I can get results like this, I can be confident that overall my data is clean enough for more complex work.

Cleaning the Wikipedia Corpus: Articles and Text

In Natural Language Processing, it is rare that a corpus only includes information we care about. Often they include lots of low-quality data that we can’t (or shouldn’t) learn much from. The procedure is specific to the analysis being performed, so rather than simply repeat what I’ve done (or use my code, available on GitHub), I recommend reading this to understand the reasoning for including or excluding Wikipedia articles. Focus on the methods I employ, so you can modify them should you determine you have different needs.

My Goals

Before I can continue, I need to set my goals with analyzing the Wikipedia corpus.

I want to build a classifier which analyzes Wikipedia “content” articles and determines the Library of Congress classification (LCC) of the article. For example, Cayetano Heredia University should be classified as LE, while Peter B. Bennett would be CT. I may eventually try to get numerical classifications (e.g. LE-68), but not until after I’ve got good general subject classifiers.

Wikipedia “content” articles, roughly, identify a topic and summarize it. This excludes things like Wikipedia user pages and categories, but includes “List of” articles. The linked article explains in exhaustive detail what is or is not an article and I’ll generally be following that.

I’m specifically interested in using “prose” text for analysis. That is, text similar to this blog post rather than metadata in XML tags or structured data, like tables and charts. Most of the text in Wikipedia is prose, and individual types of charts need their own parsers, so I’m not interested in that until I’ve made a lot of progress on the more general problem.

This is a “fun educational project” for me. I’ll be doing it as long as I’m having fun and learning new things. If I stop doing work for it, I may come back to it later.

And as with any open-ended project like this, I expect my goals will become more refined as I research into related topics.

Metadata to Exclude

Each Wikipedia article comes with a chunk of metadata:

<mediawiki>
  <page>
    <title>Abstract (law)</title>
    <ns>0</ns>
    <id>766</id>
    <revision>
      <id>982995708</id>
      <parentid>940438615</parentid>
      <timestamp>2020-10-11T16:47:04Z</timestamp>
      <contributor>
        <ip>REDACTED</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="3382" xml:space="preserve">

For my analyses I care about the title and content of articles, not who wrote them or when. I also care about the unique identifier, which can be more convenient for representing a map of all articles than the titles themselves.

Articles to Exclude

The August 2021 dump I’m using has 21,409,406 pages.

There are ten types of articles I’m automatically excluding. They are the “Namespaced” articles, which either discuss the nature of Wikipedia itself (such as “What is an article?” above) or are really helper articles (each file uploaded to Wikipedia gets its own page).

Article Namespace    Number of Articles
Category             2,100,543
Wikipedia            1,170,425
File                 915,410
Template             592,437
Portal               93,146
Draft                59,584
Module               12,598
MediaWiki            2,212
TimedText            1,352
Help                 957
Total                4,948,664

Next, I’ll be ignoring any redirect articles (example). These have no content and are just an alias for an actual article. There are 10,111,832 redirect pages.

If I stopped here, this would leave me with 6,348,910 articles, the number of English Wikipedia articles at the time of the August 2021 dump.

Lastly, I’ll be discarding disambiguation pages (example). They don’t really fit into an LCC classification and have the potential to add a lot of noise to any models I’m training, even though they are traditionally considered “content” articles. This excludes a further 55,798 articles.

These are the classes of articles I already know I want to exclude. In the future, I may identify other types of articles, but these will require more thorough analyses.

Text to Exclude

I consider the fact that a piece of text is a link to be metadata, not part of the prose. So

Cultural anthropology has a rich [[methodology]], including [[participant observation]]

becomes

Cultural anthropology has a rich methodology, including participant observation

Similarly, I’m ignoring the article text linked to, so [[Diffusion (anthropology)|diffused]] becomes diffused. This is because article titles often include more information than the word used in natural prose. In the future, it may be possible to use this additional information to create a parse which can disambiguate terms which have multiple uses (such as “diffused” in this example).
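
In Go, a first pass at that link handling can be a single regular expression. This is a simplification – the real code also has to deal with file links, category links, and nested templates:

package main

import (
    "fmt"
    "regexp"
)

// linkRE matches [[target]] and [[target|display]] wiki links.
var linkRE = regexp.MustCompile(`\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]`)

// stripLinks keeps only the display text of each link, discarding the linked
// article title when the two differ.
func stripLinks(wikitext string) string {
    return linkRE.ReplaceAllString(wikitext, "$1")
}

func main() {
    fmt.Println(stripLinks("Cultural anthropology has a rich [[methodology]], including [[participant observation]]"))
    fmt.Println(stripLinks("ideas [[Diffusion (anthropology)|diffused]] across regions"))
}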

Next, XML tags are their own special beast. Many formatting tags, such as <small> and <i> can simply be ignored and stripped from the article – keeping them may eventually help with entity recognition but that’s a later problem to solve. Tags like <p> and <br> indicate that the article has a line break that wasn’t literally entered as a newline, so I need to treat those as new paragraphs. Several XML tags – such as <math> and <hiero> communicate information in prose even though their content isn’t easily parseable (I’ll cover exactly what I’m doing for those in a later post). Some tags indicate their contents simply aren’t prose, such as <gallery> or <timeline>.

Wikipedia has a lot of special table types for structured data. On a cursory analysis these are annoying to identify completely, so I’m automatically excluding text between curly braces to be safe. In code these look similar to:

{| class="wikitable"
! Name
! Dates
! In opposition to:
|-
| [[Arnulf, Duke of Bavaria|Arnulf the Bad]] || 919–921 || [[Henry the Fowler]]

While they occasionally have prose text (for example, List of Episode articles), it is safe to ignore them for now as they do not make up the bulk of prose text on Wikipedia.

I’m excluding the Bibliography sections of all articles – while these sometimes have natural language, this is rare. Eventually it may be possible to use information like the cited journals to discern the topic of an article.

I’m currently debating whether to exclude parenthetical text (such as this). Simply stripping the parentheses often breaks grammatical structure, and parenthetical statements can sometimes form their own prose-like thoughts:

Others, such as Claude Lévi-Strauss (who was influenced both by American cultural anthropology and by French Durkheimian sociology), have argued that apparently similar patterns of development reflect fundamental similarities in the structure of human thought (see structuralism).

Update: I will be discarding parenthetical statements. They’re noisy and if I don’t have enough data I can care about that problem later.

Using my code

If you’ve followed the instructions in my previous post, you can get your own cleaned corpus of Wikipedia by running:

go run cmd/clean-wikipedia/clean-wikipedia.go \
  path/to/input/dir \
  path/to/output/dir

The result should be about 20GB, about 25% of the original XML extraction. By focusing the corpus down to only the information we care about, we greatly speed up processing since machines need to read less information from disk.

Sample article before and after. Note that the cleaning process is not perfect, but I’ll be improving it as I find mistakes (or at least, ones which are worth my time to fix).

Further, since most of the text is reasonably-clean prose, you can immediately begin training models on it. There are some hiccups, but I’ll get to that in my next post, which will be on n-gram analyses.

How to download and extract Wikipedia

Note: This is specific to English Wikipedia. For non-English Wikipedia downloads, you should be able to follow roughly the same steps.

Using dumps of Wikipedia is notoriously annoying to do without a guide. Here I’ve condensed down the steps as much as possible, and written a program to do the extraction.

1. Download the Dump

These can be a bit tricky to navigate to. Wikimedia recommends using a mirror such as https://wikimedia.bytemark.co.uk/enwiki/ which caches the dump. Each entry at that link represents the date of the Wikipedia dumps stored there (it may be a few months behind the official source, but is good enough for most uses).

You want to download these two files, “pages-articles-multistream” and “pages-articles-multistream-index”. Note that the date the dump was created is part of the file name.

While it is technically possible to directly extract pages-articles-multistream.xml.bz2 on its own, this is not recommended as it expands to a single ~100 GB file. This is a major pain to read and iterate through, and is difficult to efficiently parallelize. Instead, we use the index to cut out slices of the compressed file for easier use.

2. Extract the Index

Using your favorite tool for extracting .bz2 files, extract the index. (NOT the core dump file)

3. Extract Wikipedia

Install the Go programming language. Clone my repository, then point the extract-wikipedia program to:

  1. The location of the -multistream.xml.bz2 file
  2. The location of the -multistream-index.txt file
  3. Where to write the extracted XML files to

For example:

$ git clone https://github.com/willbeason/wikipedia.git
$ cd wikipedia
$ go run cmd/extract-wikipedia/extract-wikipedia.go \
  path/to/pages-articles-multistream.xml.bz2 \
  path/to/pages-articles-multistream-index.txt \
  path/to/output/dir

On my machine this takes about 30 minutes to run. The time it takes will depend mostly on the speed of your machine and whether you’re reading from/writing to an SSD. By default extract-wikipedia tries to use all available logical processors, so set `--parallel=1` (or some other low value) if you’re okay with the extraction taking much longer but still want to be able to do other things with your computer.

Note that the result is slightly altered from the compressed file, as the individual parts are not valid XML documents. To make the XML valid, extract-wikipedia takes a different action based on the file.

  1. The first file (000000.txt). Example. extract-wikipedia appends </mediawiki> to the end of the file to close the dangling <mediawiki> at the beginning of the file.
  2. The last file. This is the numerically-last file in the numerically-last folder. This file has a dangling </mediawiki> tag at the end, so extract-wikipedia adds <mediawiki> at the top of the file.
  3. All other files. For easier processing, extract-wikipedia prepends <mediawiki> and appends </mediawiki> so that the file is interpreted as a single XML document rather than a series of documents.

Improve your Runge-Kutta-Nyström algorithms with this one weird trick

In my last post, I discussed Runge-Kutta-Nyström (RKN) methods, and a generic strategy for turning any Runge-Kutta method into an RKN method. The focus was more on the simplest way to do so, rather than the best way.

Today I found that there’s a way of improving cumulative error from O(h^3) to O(h^4), with very little additional compute required.

A Poincaré map of a Duffing Oscillator simulated with the algorithm in this post.

Let’s examine a simplification I glossed over – when calculating k_2, we know y’_n and y’_1. We can get an estimate for y_1 by integrating the polynomial which meets the conditions f(0)=y’_n and f(t_f)=y’_1. This is just the trapezoidal rule.

This takes two obvious pieces of information – we know the slope of y at the beginning and end of each interval for which we’re estimating the change in y. However, it ignores that we also know the acceleration of y. That is, we have a third initial condition:

Thus, we need a quadratic to represent the general function which meets all of these criteria:
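
Writing \Delta t for the length of the interval, the unique quadratic g with g(0) = y'_n, g'(0) = k_1, and g(\Delta t) = y'_1 is

g(t) = y'_n + k_1 t + \frac{y'_1 - y'_n - k_1 \Delta t}{\Delta t^2} t^2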

Integrating gives:
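
\int_0^{\Delta t} g(t)\,dt = \frac{\Delta t}{3}\left(2 y'_n + y'_1\right) + \frac{k_1 \Delta t^2}{6}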

So all we need to do is replace our use of the trapezoidal rule with this quadratic rule. Recall that by definition, k_1 = y”_0, so it appears in every y_i estimate.

We repeat this pattern to calculate k_3:

and k_4:

Then, we finish our estimation of y_n+1 and y’_n+1 as before:

The impact on the accuracy of RKN algorithms is staggering. Using the same differential equation as last post, I’ve re-run the calculation of cumulative error. In the below table, “Before” denotes the algorithm from the previous post, with the trapezoidal rule, and “After” is with the quadratic rule.

Steps   Before Error(y)   After Error(y)   Before Error(y')   After Error(y')
1       1.484 E-3         3.438 E-3        1.593 E-2          4.681 E-4
10      7.797 E-6         3.529 E-7        2.282 E-5          7.484 E-8
100     8.967 E-9         3.486 E-11       2.366 E-8          7.914 E-12
1000    9.086 E-12        2.664 E-15       2.375 E-11         5.773 E-15
Note that with errors on the order of 10^-15, we’re approaching the limit of the ability for 64-bit floating points to detect differences in numbers.

We’ve improved the cumulative error of the algorithm from O(h^3) to O(h^4) – an entire factor of h! The additional performance cost is tiny – at most an increase of 10% for trivial second-order differential equations – so this modification to the original algorithm should always be worth it.

You can recreate the results of this blog post by checking out v0.1.3 of my diffeq-go repository and running:

diffeq-go$ go run cmd/validate/validate.go

Introduction to Runge-Kutta-Nyström Methods

Alternative Title: How to turn any Runge-Kutta method into a Runge-Kutta-Nyström algorithm.

Update: If you’re going to use this algorithm, I highly recommend making the modifications in this post to reduce cumulative error by another factor of h.

Differential equations describe everything from how populations grow with time to how waves flow through the ocean. While a few have nice analytic expressions, many do not – think the three-body problem or the Navier-Stokes equations. This isn’t a matter of the algebra being difficult – for many cases it can be shown that no such “nice” solution exists. In these cases we turn to computers and numerical methods.

Runge-Kutta methods are a family of numerical methods for solving first-order differential equations:
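
y'(t) = f\left(t, y(t)\right), \qquad y(t_0) = y_0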

Poincaré map of a Duffing oscillator. My work on this simulation inspired this post.

They offer startling performance and quality gains over the Euler Method. A fourth-order Runge-Kutta method can easily produce solutions many orders of magnitude more precise, all while requiring a fraction of a percent of the computing power the Euler Method demands.

However, Runge-Kutta methods have several limitations for studying complex systems:

  1. They are limited to first-order differential equations.
  2. They produce solutions for differential equations of a single variable.

I’m going to tackle the first of these two problems with a different family of solvers – Runge-Kutta-Nyström (RKN) methods.

Runge-Kutta-Nyström methods have a single sentence on Wikipedia describing them – a mere 20-word side note in an article of over 3,000 words. There are several thousand articles on RKN methods on Google Scholar, many of them specialized and with little explanation of the basics – such as how to implement or use them. While specialized RKN methods are very useful in the use cases they were developed for, they are not illustrative.

Fortunately, we don’t need to perform (as much of) the work required to develop RKN methods – we can take any existing Runge-Kutta method and adapt it to be an RKN method.

Let’s begin with a well-known Runge-Kutta method – RK4. However, we’ll write it as if we were solving a differential equation for y” in terms of y’. We want to produce an algorithm which solves this equation:
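
y''(t) = f\left(t, y(t), y'(t)\right), \qquad y(t_0) = y_0, \quad y'(t_0) = y'_0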

The modification for the expression for k_1 is straightforward – expand it to include the initial value of y:
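
k_1 = f(t_n, y_n, y'_n)

(compared to k_1 = f(t_n, y_n) in plain RK4)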

Now we define our estimate for y’, called y’_1, which we’ll use to compute k_2, and we use the trapezoidal rule to extrapolate an estimate for y, also needed when computing k_2.

Similarly for k_3:

And finishing the pattern, we get

And then plug in as we would for RK4 to produce our results:

Importantly, note that we use the RK4-weighting on estimating both y’ and y – the trapezoidal rule was only for calculating our k-values.

That’s it. This RKN4 algorithm produces results with cumulative O(h^3) error instead of the normal O(h^4) we expect for RK4, but this is unsurprising as we’re integrating slope, which has step-wise O(h^4) error.
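
Putting the whole step together in Go, following the construction described above (this is my sketch of the scheme, not the code from the repository, so it may differ in details from the implementation used to produce the numbers below):

package main

import (
    "fmt"
    "math"
)

// Func2 is a second-order ODE in the form y'' = f(t, y, y').
type Func2 func(t, y, dy float64) float64

// rkn4Step advances (y, y') by one step of size h: RK4 stages for y'',
// trapezoidal estimates of y at each stage, and RK4 weighting for both
// y and y' at the end.
func rkn4Step(f Func2, t, y, dy, h float64) (float64, float64) {
    k1 := f(t, y, dy)

    dy1 := dy + 0.5*h*k1         // half-step estimate of y'
    y1 := y + 0.5*h*0.5*(dy+dy1) // trapezoidal estimate of y
    k2 := f(t+0.5*h, y1, dy1)

    dy2 := dy + 0.5*h*k2
    y2 := y + 0.5*h*0.5*(dy+dy2)
    k3 := f(t+0.5*h, y2, dy2)

    dy3 := dy + h*k3         // full-step estimate of y'
    y3 := y + h*0.5*(dy+dy3)
    k4 := f(t+h, y3, dy3)

    nextDy := dy + h/6*(k1+2*k2+2*k3+k4)
    nextY := y + h/6*(dy+2*dy1+2*dy2+dy3)
    return nextY, nextDy
}

func main() {
    // y'' = 0.5 y' + 0.5 y with y(0) = y'(0) = 1 has the exact solution y(t) = e^t.
    f := func(t, y, dy float64) float64 { return 0.5*dy + 0.5*y }
    y, dy, t := 1.0, 1.0, 0.0
    steps := 100
    h := 1.0 / float64(steps)
    for i := 0; i < steps; i++ {
        y, dy = rkn4Step(f, t, y, dy, h)
        t += h
    }
    fmt.Println("error in y: ", math.Abs(y-math.E))
    fmt.Println("error in y':", math.Abs(dy-math.E))
}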

The following table records the decrease in cumulative error while evaluating y”(t) = 0.5 y'(t) + 0.5 y(t) with initial conditions y'(0) = y(0) = 1 and t_f = 1.0.

Steps   Error(y)     Error(y')
1       1.484 E-3    1.593 E-2
10      7.797 E-6    2.282 E-5
100     8.967 E-9    2.366 E-8
1000    9.086 E-12   2.375 E-11

Performance-wise, for trivial differential equations RKN4 requires about 50% more compute time than a similar first-order equation evaluated with RK4. As RKN4 requires the same number of function evaluations as RK4, this difference is smaller for more involved equations.

I’ve published the code I developed for this blog post on GitHub. My results can be recreated by checking out tag v0.1.2 and running “go run cmd/validate/validate.go”.