What are the longest n-grams on Wikipedia?

TL;DR

I found the longest n-gram on Wikipedia which appears more than 1,000 times. It is 103 words long. It is:

The township is governed by a three-member board of trustees, who are elected in November of odd-numbered years to a four-year term beginning on the following January 1. Two are elected in the year after the presidential election and one is elected in the year before it. There is also an elected township fiscal officer, who serves a four-year term beginning on April 1 of the year after the election, which is held in November of the year before the presidential election. Vacancies in the fiscal officership or on the board of trustees are filled by the remaining trustees.

It appears 1,275 times (exactly, word-for-word), only in articles about small towns in Ohio. It’s part of the Ohio Revised Code which as far as I can tell applies to all Ohioan townships.

What is an n-gram?

An n-gram is a contiguous sequence of n words from a text, or from a corpus of texts. Finding frequent n-grams is helpful for document classification, as repeated sequences of words often have a meaning different from what their individual words would suggest. An “Academy Award” isn’t just an award from an academy, but a specific award from the Academy of Motion Picture Arts and Sciences. Letting our models know during training that “Academy Award” may mean something different than “academy” and “award” can make it easier to identify fine-grained topics like this.

How I found this

After cleaning my Wikipedia corpus, I took several additional steps to normalize it:

  • I made everything lowercase
  • Sequences of numbers became “_num_”
  • A “_num_” followed by a percent symbol is replaced with “_percent_”
  • Certain recognized date formats become “_date_”, for example “10 January 1995”

That is, for the purposes of my analysis I consider numbers to be “semantically identical”. The phrase “he was 29 years old” doesn’t have a significantly different meaning from “he was 81 years old”. For first-pass machine learning models the noise and variance introduced by treating every number as a completely different “word” is rarely useful.

Beyond the above, I define a “word” to be any sequence of characters matching the regex “[\w']+”, and I define words to be “distinct” if and only if their lowercase forms exactly match.

To find the n-grams, I employed an iterative algorithm which takes as input all currently-known n-grams. On the first iteration it finds all 2-grams; on the second, all 3- and 4-grams. Each iteration doubles the maximum n-gram length it finds. You can find the successive commands I ran to do this in my GitHub repository.

Other long n-grams

The longest n-gram appearing more than 10,000 times is:

There were _num_ households out of which _percent_ had children under the age of _num_ living with them, _percent_ were married couples living together, _percent_ had a female householder with no husband present, and _percent_ were non families. _percent_ of all households were made up of individuals and _percent_ had someone living alone who was _num_ years of age or older. The average household size was _num_ and the average family size was _num_.

The longest (and most-frequent) n-gram appearing more than 100,000 times is:

from _num_ to _num_

which is recognizable as a range of numbers expressed in natural language.

I mean, technically it’s “_num_ _num_ _num_ _num_” because sequences of numbers appear so often, but I don’t like it since it doesn’t have any “real words”.

The longest n-gram appearing more than 100,000 times which only contains alphabetic words is “a member of the”. Most n-grams of this length serve common structural purposes such as this.

In general, the most common n-grams of length longer than 5 are from templates, where article content has been imported from another source, or where it is very clear that the same author uses identical language in a long list of articles. For example:

4,966: ... is a moth in the family Crambidae it was described by ...
3,935: In the CDP the population was spread out, with _percent_ under the age of _num_ _percent_ ...
4,907: ... together _percent_ had a female householder with no ...
4,003: ... is a protein that in humans is encoded by the ...

Charts

I made graphs of the longest n-gram of a given frequency. I’ve dropped lengths for n-grams which appeared less often than longer n-grams. This first graph includes the special handling I use for numbers, dates, and percentages.

This second graph is for the “more traditional” n-grams which only include alphabetic words.

Conclusion

This is going to let me get rid of a lot of noise when I begin training models on the Wikipedia corpus. In this case, looking for n-grams allowed me to find a lot of duplicated, stilted, not-really-human language which would have messed up my model’s ability to understand how words are used.

Rather than consider all uses of words like “elected” as the same, I can consider it a special case when it appears in the 103-word monstrosity I mentioned at the beginning. Models are sensitive to frequency, and lots of cases of a word being used in an identical context incorrectly teach the model that this usage is common or expected. In reality, a human composed the passage once, and it was likely automatically duplicated (or manually copy+pasted) over a thousand times. There isn’t much to be learned from such usage. Instead, for cases like this the 103-word n-gram provides strong evidence that an article is about a very specific type of thing – a township in Ohio – which is exactly what I want my model to learn from this.

On a smaller scale, being able to treat n-grams like “the United States” as entities distinct from words like “states” and “united” means that the model won’t be tricked into confusing “United Airlines” or “States Records” with “the United States”.

This is the sort of analysis that is almost always worth doing on a corpus – while going through this work I iterated dozens of times on the code I used to clean the corpus, as each time I found bits of formatting or common human mistakes that caused this analysis to give strange results. Now that I can get results like this, I can be confident that overall my data is clean enough for more complex work.

Cleaning the Wikipedia Corpus: Articles and Text

In Natural Language Processing, it is rare that a corpus only includes information we care about. Corpora often include lots of low-quality data that we can’t (or shouldn’t) learn much from. The cleaning procedure is specific to the analysis being performed, so rather than simply repeating what I’ve done (or using my code, available on GitHub), I recommend reading this post to understand the reasoning for including or excluding Wikipedia articles. Focus on the methods I employ, so you can modify them should you determine you have different needs.

My Goals

Before I can continue, I need to set my goals with analyzing the Wikipedia corpus.

I want to build a classifier which analyzes Wikipedia “content” articles and determines the Library of Congress classification (LCC) of the article. For example, Cayetano Heredia University should be classified as LE, while Peter B. Bennett would be CT. I may eventually try to get numerical classifications (e.g. LE-68), but not until after I’ve got good general subject classifiers.

Wikipedia “content” articles, roughly, identify a topic and summarize it. This excludes things like Wikipedia user pages and categories, but includes “List of” articles. The linked article explains in exhaustive detail what is or is not an article and I’ll generally be following that.

I’m specifically interested in using “prose” text for analysis. That is, text similar to this blog post rather than metadata in XML tags or structured data, like tables and charts. Most of the text in Wikipedia is prose, and individual types of charts need their own parsers, so I’m not interested in that until I’ve made a lot of progress on the more general problem.

This is a “fun educational project” for me. I’ll be doing it as long as I’m having fun and learning new things. If I stop doing work for it, I may come back to it later.

And as with any open-ended project like this, I expect my goals will become more refined as I research related topics.

Metadata to Exclude

Each Wikipedia article comes with a chunk of metadata:

<mediawiki>
  <page>
    <title>Abstract (law)</title>
    <ns>0</ns>
    <id>766</id>
    <revision>
      <id>982995708</id>
      <parentid>940438615</parentid>
      <timestamp>2020-10-11T16:47:04Z</timestamp>
      <contributor>
        <ip>REDACTED</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="3382" xml:space="preserve">

For my analyses I care about the title and content of articles, not who wrote them or when. I also care about the unique identifier, which can be more convenient for representing a map of all articles than the titles themselves.

Articles to Exclude

The August 2021 dump I’m using has 21,409,406 pages.

There are ten types of pages I’m excluding outright. They are the “namespaced” pages, which either discuss the nature of Wikipedia (such as “What is an article?” above) or are really helper pages (each file uploaded to Wikipedia gets its own page).

Article Namespace    Number of Articles
Category             2,100,543
Wikipedia            1,170,425
File                 915,410
Template             592,437
Portal               93,146
Draft                59,584
Module               12,598
MediaWiki            2,212
TimedText            1,352
Help                 957
Total                4,948,664

Next, I’ll be ignoring any redirect articles (example). These have no content and are just an alias for an actual article. There are 10,111,832 redirect pages.

If I stopped here, this would leave me with 6,348,910 articles, the number of English Wikipedia articles at the time of the August 2021 dump.

Lastly, I’ll be discarding disambiguation pages (example). They don’t really fit into an LCC classification and have the potential to add a lot of noise to any models I’m training, even though they are traditionally considered “content” articles. This excludes a further 55,798 articles.

These are the classes of articles I already know I want to exclude. In the future, I may identify other types of articles, but these will require more thorough analyses.

Text to Exclude

I consider text that is a link to be metadata, not part of the prose. So

Cultural anthropology has a rich [[methodology]], including [[participant observation]]

becomes

Cultural anthropology has a rich methodology, including participant observation

Similarly, I’m ignoring the article text linked to, so [[Diffusion (anthropology)|diffused]] becomes diffused. This is because article titles often include more information than the word used in natural prose. In the future, it may be possible to use this additional information to create a parse which can disambiguate terms which have multiple uses (such as “diffused” in this example).

Next, XML tags are their own special beast. Many formatting tags, such as <small> and <i> can simply be ignored and stripped from the article – keeping them may eventually help with entity recognition but that’s a later problem to solve. Tags like <p> and <br> indicate that the article has a line break that wasn’t literally entered as a newline, so I need to treat those as new paragraphs. Several XML tags – such as <math> and <hiero> communicate information in prose even though their content isn’t easily parseable (I’ll cover exactly what I’m doing for those in a later post). Some tags indicate their contents simply aren’t prose, such as <gallery> or <timeline>.

Wikipedia has a lot of special table types for structured data. On a cursory analysis these are annoying to identify completely, so to be safe I’m automatically excluding all text between curly braces. In wikitext these look similar to:

{| class="wikitable"
! Name
! Dates
! In opposition to:
|-
| [[Arnulf, Duke of Bavaria|Arnulf the Bad]] || 919–921 || [[Henry the Fowler]]

While they occasionally contain prose text (for example, in “List of episodes” articles), it is safe to ignore them for now as they do not make up the bulk of prose text on Wikipedia.

I’m excluding the Bibliography sections of all articles – while these sometimes have natural language, this is rare. Eventually it may be possible to use information like the cited journals to discern the topic of an article.

I’m currently debating whether to exclude parenthetical text (such as this). Simply ignoring the parentheses often breaks grammatical structure, and parentheticals can sometimes form their own prose-like thoughts:

Others, such as Claude Lévi-Strauss (who was influenced both by American cultural anthropology and by French Durkheimian sociology), have argued that apparently similar patterns of development reflect fundamental similarities in the structure of human thought (see structuralism).

Update: I will be discarding parenthetical statements. They’re noisy and if I don’t have enough data I can care about that problem later.

Using my code

If you’ve followed the instructions in my previous post, you can get your own cleaned corpus of Wikipedia by running:

go run cmd/clean-wikipedia/clean-wikipedia.go \
  path/to/input/dir \
  path/to/output/dir

The result should be about 20GB, about 25% of the original XML extraction. By focusing the corpus down to only the information we care about, we greatly speed up processing, since machines need to read less information from disk.

Sample article before and after. Note that the cleaning process is not perfect, but I’ll be improving it as I find mistakes (or at least, ones which are worth my time to fix).

Further, since most of the text is reasonably-clean prose, you can immediately begin training models on it. There are some hiccups, but I’ll get to that in my next post, which will be on n-gram analyses.

How to download and extract Wikipedia

Note: This is specific to English Wikipedia. For non-English Wikipedia downloads, you should be able to follow roughly the same steps.

Using dumps of Wikipedia is notoriously annoying to do without a guide. Here I’ve condensed down the steps as much as possible, and written a program to do the extraction.

1. Download the Dump

These can be a bit tricky to navigate to. Wikimedia recommends using a mirror such as https://wikimedia.bytemark.co.uk/enwiki/ which caches the dump. Each entry at that link represents the date of the Wikipedia dumps stored there (it may be a few months behind the official source, but is good enough for most uses).

You want to download these two files, “pages-articles-multistream” and “pages-articles-multistream-index”. Note that the date the dump was created is part of the file name.

While it is technically possible to directly extract pages-articles-multistream.xml.bz2 on its own, this is not recommended as it expands to a single ~100 GB file. This is a major pain to read and iterate through, and is difficult to efficiently parallelize. Instead, we use the index to cut out slices of the compressed file for easier use.

2. Extract the Index

Using your favorite tool for extracting .bz2 files, extract the index (NOT the core dump file).

3. Extract Wikipedia

Install the Go programming language. Clone my repository, then point the extract-wikipedia program to:

  1. The location of the -multistream.xml.bz2 file
  2. The location of the -multistream-index.txt file
  3. Where to write the extracted XML files to

For example:

$ git clone https://github.com/willbeason/wikipedia.git
$ cd wikipedia
$ go run cmd/extract-wikipedia/extract-wikipedia.go \
  path/to/pages-articles-multistream.xml.bz2 \
  path/to/pages-articles-multistream-index.txt \
  path/to/output/dir

On my machine this takes about 30 minutes to run. The time it takes will depend mostly on the speed of your machine and whether you’re reading from/writing to an SSD. By default extract-wikipedia tries to use all available logical processors, so set `--parallel=1` (or some other low value) if you’re okay with the extraction taking much longer but still want to be able to do other things with your computer.

Note that the result is slightly altered from the compressed file, as the individual parts are not valid XML documents. To make the XML valid, extract-wikipedia takes a different action based on the file.

  1. The first file (000000.txt). Example. extract-wikipedia appends </mediawiki> to the end of the file to close the dangling <mediawiki> at the beginning of the file.
  2. The last file. This is the numerically-last file in the numerically-last folder. This file has a dangling </mediawiki> tag at the end, so extract-wikipedia adds <mediawiki> at the top of the file.
  3. All other files. For easier processing, extract-wikipedia prepends <mediawiki> and appends </mediawiki> so that the file is interpreted as a single XML document rather than a series of documents.

Improve your Runge-Kutta-Nyström algorithms with this one weird trick

In my last post, I discussed Runge-Kutta-Nyström (RKN) methods, and a generic strategy for turning any Runge-Kutta method into an RKN method. The focus was more on the simplest way to do so, rather than the best way.

Today I found that there’s a way of improving cumulative error from O(h^3) to O(h^4), with very little additional compute required.

A Poincaré map of a Duffing Oscillator simulated with the algorithm in this post.

Let’s examine a simplification I glossed over – when calculating k_2, we know y’_n and y’_1. We can get an estimate for y_1 by integrating the polynomial which meets the conditions f(0)=y’_n and f(t_f)=y’_1. This is just the trapezoidal rule.

This takes two obvious pieces of information – we know the slope of y at the beginning and end of each interval for which we’re estimating the change in y. However, it ignores that we also know the acceleration of y. That is, we have a third initial condition:

Thus, we need a quadratic to represent the general function which meets all of these criteria:

Integrating gives:

So all we need to do is replace our use of the trapezoidal rule with this quadratic rule. Recall that by definition, k_1 = y”_0, so it appears in every y_i estimate.
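The equation images did not survive here, so the following is my reconstruction (notation mine, with s the length of the substep being estimated). The quadratic meeting f(0) = y'_n, f'(0) = y''_n = k_1, and f(s) = y'_i, and its integral, are:

```latex
% Quadratic meeting f(0) = y'_n, f'(0) = y''_n = k_1, and f(s) = y'_i
% over a substep of length s:
f(t) = y'_n + k_1 t + \frac{y'_i - y'_n - k_1 s}{s^2}\, t^2
% Integrating from 0 to s gives the quadratic rule that replaces the
% trapezoidal estimate:
y_i \approx y_n + \frac{s}{3}\left(2 y'_n + y'_i\right) + \frac{s^2}{6}\, k_1
```

When y'_i is the simple linear extrapolation y'_n + k_1 s (as in the k_2 stage), the quadratic term vanishes and this reduces to the trapezoidal rule; the improvement appears from the k_3 stage onward.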

We repeat this pattern to calculate k_3:

and k_4:

Then, we finish our estimation of y_n+1 and y’_n+1 as before:

The impact on the accuracy of RKN algorithms is staggering. Using the same differential equation as last post, I’ve re-run the calculation of cumulative error. In the below table, “Before” denotes the algorithm from the previous post, with the trapezoidal rule, and “After” is with the quadratic rule.

Steps    Before Error(y)    After Error(y)    Before Error(y')    After Error(y')
1        1.484 E-3          3.438 E-3         1.593 E-2           4.681 E-4
10       7.797 E-6          3.529 E-7         2.282 E-5           7.484 E-8
100      8.967 E-9          3.486 E-11        2.366 E-8           7.914 E-12
1000     9.086 E-12         2.664 E-15        2.375 E-11          5.773 E-15
Note that with errors on the order of 10^-15, we’re approaching the limit of 64-bit floating-point precision.

We’ve improved the cumulative error of the algorithm from O(h^3) to O(h^4) – an entire factor of h! The additional performance cost is tiny – at most an increase of 10% for trivial second-order differential equations – so this modification to the original algorithm should always be worth it.

You can recreate the results of this blog post by checking out v0.1.3 of my diffeq-go repository and running:

diffeq-go$ go run cmd/validate/validate.go

Introduction to Runge-Kutta-Nyström Methods

Alternative Title: How to turn any Runge-Kutta method into a Runge-Kutta-Nyström algorithm.

Update: If you’re going to use this algorithm, I highly recommend making the modifications in this post to reduce cumulative error by another factor of h.

Differential equations describe everything from how populations grow with time to how waves flow through the ocean. While a few have nice analytic expressions, many do not – think the three-body problem or the Navier-Stokes equations. This isn’t a matter of the algebra being difficult – for many cases it can be shown that no such “nice” solution exists. In these cases we turn to computers and numerical methods.

Runge-Kutta methods are a family of numerical methods for solving first-order differential equations:

Poincaré map of a Duffing oscillator. My work on this simulation inspired this post.

They offer startling performance and quality gains over the Euler Method. A fourth-order Runge-Kutta method can easily produce solutions many orders of magnitude more precise, all while requiring a fraction of a percent of the computing power the Euler Method demands.

However, Runge-Kutta methods have several limitations for studying complex systems:

  1. They are limited to first-order differential equations.
  2. They produce solutions for differential equations of a single variable.

I’m going to tackle the first of these two problems with a different family of solvers – Runge-Kutta-Nyström (RKN) methods.

Runge-Kutta-Nyström methods have a single sentence on Wikipedia describing them – a mere 20-word side note in an article of over 3,000 words. There are several thousand articles on RKN methods on Google Scholar, many of them specialized and with little explanation of the basics – such as how to implement or use them. While specialized RKN methods are very useful in the use cases they were developed for, they are not illustrative.

Fortunately, we don’t need to perform (as much of) the work required to develop RKN methods – we can take any existing Runge-Kutta method and adapt it to be an RKN method.

Let’s begin with a well-known Runge-Kutta method – RK4. However, we’ll write it as if we were solving a differential equation for y” in terms of y’. We want to produce an algorithm which solves this equation:

The modification for the expression for k_1 is straightforward – expand it to include the initial value of y:

Now, we define the estimate for y’ which we’ll use to compute k_2 – call it y’_1 – and then use the trapezoidal rule to extrapolate an estimate of y for computing k_2.

Similarly for k_3:

And finishing the pattern, we get

And then plug in as we would for RK4 to produce our results:

Importantly, note that we use the RK4-weighting on estimating both y’ and y – the trapezoidal rule was only for calculating our k-values.
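The equations in this post were images that haven’t survived, so the following is my reconstruction of the full scheme for y” = f(t, y, y’) with step size h (notation mine); it is consistent with the first row of the error table below:

```latex
k_1 = f(t_n,\, y_n,\, y'_n) \\
y'_1 = y'_n + \tfrac{h}{2}k_1, \quad
y_1 = y_n + \tfrac{h}{4}\,(y'_n + y'_1), \quad
k_2 = f\!\left(t_n + \tfrac{h}{2},\, y_1,\, y'_1\right) \\
y'_2 = y'_n + \tfrac{h}{2}k_2, \quad
y_2 = y_n + \tfrac{h}{4}\,(y'_n + y'_2), \quad
k_3 = f\!\left(t_n + \tfrac{h}{2},\, y_2,\, y'_2\right) \\
y'_3 = y'_n + h\,k_3, \quad
y_3 = y_n + \tfrac{h}{2}\,(y'_n + y'_3), \quad
k_4 = f\!\left(t_n + h,\, y_3,\, y'_3\right) \\
y'_{n+1} = y'_n + \tfrac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4), \qquad
y_{n+1} = y_n + \tfrac{h}{6}\,(y'_n + 2y'_1 + 2y'_2 + y'_3)
```

Each y_i is the trapezoidal estimate over its substep, while the final updates apply the usual RK4 weights.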

That’s it. This RKN4 algorithm produces results with cumulative O(h^3) error instead of the normal O(h^4) we expect for RK4, but this is unsurprising as we’re integrating slope, which has step-wise O(h^4) error.

The following table records the decrease in cumulative error while evaluating y”(t) = 0.5 y'(t) + 0.5 y(t) with initial conditions y'(0) = y(0) = 1 and t_f = 1.0.

Steps    Error(y)       Error(y')
1        1.484 E-3      1.593 E-2
10       7.797 E-6      2.282 E-5
100      8.967 E-9      2.366 E-8
1000     9.086 E-12     2.375 E-11

Performance-wise, for trivial differential equations RKN4 requires about 50% more compute time than a similar first-order equation evaluated with RK4. As RKN4 requires the same number of function evaluations as RK4, this difference is smaller for more involved equations.

I’ve published the code I developed for this blog post on GitHub. My results can be recreated by checking out tag v0.1.2 and running “go run cmd/validate/validate.go”.

Fear of Introspective Art

How we see art is a reflection of ourselves. We can only feel things that are, in some way, part of ourselves. Much of art resonates not merely due to its beauty, but due to its ability to show us … us.

Sometimes that can be scary. Just out of personal interest — not for trolling — I am a member of several conservative meme/lifestyle groups. Occasionally an art piece pops up that deeply disturbs most members of the group (McJesus, for a recent example). The common, but not overwhelming, response is scorn. “That is horrible and blasphemous.” “Please take it down.” “They will regret [making that].” Very little is said why the piece is “horrible;” it is simply repeated that it is.

Jani Leinonen, creator of McJesus. Pirje Mykkänen, photographer • CC BY-SA 4.0

It’s not easy allowing ourselves such mental vulnerability. It feels like an attack – like they’re saying we’re bad. Like whoever made that art hates us. If I were a Christian, it would probably feel deeply unsettling to see Ronald McDonald nailed to a cross. To take something I took for granted and make it into what superficially feels like a mockery of who I am. For most people thought ends there. The mind shuts off and the shouting begins.

People suck at self-reflection. Especially when instigated by a faceless “other.” Part of this is growing a thicker skin. Did I really just want to yell at a sculpture? That’s ridiculous. Does some art make you feel silly? Alone and insignificant? Good. It’s doing its job. What is the art trying to say?

And what does it say about you that a piece makes you feel what it does?

WorldProc

Back in high school I created worlds. My parents had gotten me Vue, a procedural environment rendering program. Over a year I learned procedural texturing and developed my own algorithms for generating (what felt to me) like photo-realistic planets. I submitted my works to the Texas TSA art competition, won a few categories, and then mostly forgot about it when I began college.


One of my favorite high school pieces. I cheated on the clouds – it’s actually a cloud map from NOAA that I used as a cloud layer – not a real simulation.

A couple years ago I picked Vue back up and tried again. This time, I found the program too limiting – even though it has far more features now (including a Python scripting engine!). Mainly – I can’t run simulations on textures within Vue. At least, not efficiently.

A few weeks ago I returned with a different approach – this time I would write my own software for generating and simulating the worlds. And I’d use Go since it’s what we write Kubernetes in. I’ve done a fair bit of work, and now with my Google Copyright Waiver in-hand, I can open source the project.

At some point I’ll scope out what all features I want to implement, but these few should keep me busy for months. As I develop features, when it strikes my fancy I’ll write about the things I found interesting along the way.

  • Fractal noise generation
  • Hydrology (incl. erosion)
  • Weather (incl. clouds)
  • Climate
  • Tectonics

So here goes. I call it WorldProc.

Make Art Anyway

From the Art Corps: https://twitter.com/artscorps/status/1016449391844921344

Image stolen from the Art Corps twitter feed.

Maladies in the War of Art

I recently read The War of Art, a self-help book on overcoming the internal forces preventing us from doing our life’s work. It got a lot of things right. Foremost: for most people, our greatest enemy is ourselves.

It got one thing wrong: it says that many maladies are imagined – that they exist purely in our heads and are not real. The specific examples it uses – Seasonal Affective Disorder and Attention-Deficit/Hyperactivity Disorder – are real phenomena that impact people’s ability to be productive. Marginalizing those disorders isn’t useful; it does not make them less real, and pretending they aren’t real is futile. Instead, the below is how I approach those – and similar – psychological problems. It also applies to physical problems.

The only possible universes in which you produce the art you want to make are the ones where you have every malady in your path and you produce your art anyway. It’s not fair that you’re going through whatever you’re experiencing. Some people have it better than you. Some people don’t feel a weight on their chest every morning keeping them from even rising to take a shower. That sucks. The possibilities where you produce art and you don’t have your issues aren’t worth considering – they aren’t real and can never be.

In the words of a dear friend:

“Make art anyway.”

“Liberals Don’t Understand Gun Control”

On Enabling Negative Characterizations

We’ve done a terrible job debating gun control. We’ve allowed gun-control opponents to steal the show because we haven’t done the basic legwork to even know what we’re saying.


  • “I can’t imagine why anyone would need a semi-auto for self defense.”

  • “ARs should be banned.”

  • “The shooter used a high power rifle.”

If you’ve said one of these, you’ve expressed an opinion you (probably) haven’t researched. Anyone with a reasonable understanding of guns will spot this immediately. This discredits anything further you have to say on the topic – even if it is compelling statistics backed by studies. Alternatively, they’ll take you at your word and you’ll find yourself defending a ridiculous or untenable position.

If you’re going to debate gun control, respect your conversational partners enough to know what you’re saying. Imagine how you’d feel if a marriage equality opponent mixed up the terms “trans” and “gay” or constantly assumed all gay men are effeminate. In the realm of gun control, you are on their turf and must use their terminology to be taken seriously. Words have meaning and we can’t expect to have meaningful, serious discussions without agreeing on definitions.

Language shapes how we perceive the things it describes, and knowledge of less-known parts of it can become part of your identity. Failing to learn the basics before spouting an opinion says that you don’t take the other side seriously enough to even learn to speak with them and understand what they’re saying. It voids any possibility of common ground and makes you susceptible to making a public fool of yourself. It kills your ability to show empathy – one of the strongest tools for building compromise.

Gun control is a complex topic. Have the personal responsibility to learn about it before speaking. If there’s a requirement for constructive discourse, it’s being able to communicate. If someone uses a term you don’t know – ask – showing humility is a great way to build rapport with an opponent. I’ve included a list below of several terms you should know to be an informed citizen.

Commonly Misused Terms

Don’t use these unless you mean them. If you don’t care enough about the topic to learn these and use them properly, stay out of gun control debates – you’ll do far more harm to your side than good.

AR – ArmaLite Rifle, a series of rifles produced by ArmaLite, a small arms engineering company.

High Power Rifle – a type of scored shooting sport.

Semi-automatic – a mode of fire that chambers a new round with each pull of the trigger, but only fires one round per trigger pull.

Automatic – a mode of fire that chambers and fires rounds continuously as long as the trigger is held down. (Also: full-auto, fully-automatic)

Burst mode – a mode of fire like automatic, but stops after a predetermined number of rounds. Usually 2-5, depending on the firearm.

Selective-fire – a firearm which allows switching between two or more modes of fire (such as the above).

Assault rifle – a selective-fire rifle with a detachable magazine. Not necessarily automatic. Uses an intermediate cartridge.

Manual operation – firearms that require an additional action from the user to chamber a new round (not just pulling the trigger).

There are many more you should know, but these are the ones I see misused by gun control advocates most often.

Weapons of Math Destruction in Hiring

About a month ago I read Dr. Cathy O’Neil’s book, Weapons of Math Destruction. It is a call to arms for data scientists and anyone in the automation field to handle their data responsibly. She lays out a framework for a code of conduct for people designing systems that handle people’s data, and how to avoid – or at least become aware of – the harm they cause society. Below is a draft of a speech I’ve prepared to give for my Toastmasters group.

Imagining weapons of mass destruction is something of an American pastime. The dangers are obvious – millions dead and large-scale infrastructure overwhelmed or simply annihilated. These effects grip modern culture, seeping into irresponsibly speculative news and defining Michael Bay and Tom Cruise movies. Fun, but not a useful discussion to have in your daily life. Instead, I want to cover the automated systems already harming us – what Dr. Cathy O’Neil terms “Weapons of Math Destruction.” A Weapon of Math Destruction is any scalable system that uses a model with the potential to harm society. In her book, Weapons of Math Destruction, Dr. O’Neil defines three types of WMDs: invisible systems, opaque systems, and obviously harmful systems. I’m going to define those three types of WMDs, giving illustrative examples from her book and from my experience in the hiring automation field.

Invisible Systems

First, invisible systems. Who here has been given a personality test while being considered for a job? How many of those employers told you that your response could automatically disqualify you? That’s an example of an invisible system – you don’t even know it’s there. On the surface, sure, no one wants to work with jerks, so just screen them out. The problem is two-fold: first, since you’re not aware of the test’s role, you can’t appeal the decision and don’t know what went wrong. Second, “jerk” is not a well-defined term (and there aren’t any tests for it). In this case, employers misuse common personality tests (OCEAN, MBTI, etc.) for something they weren’t designed for: candidate filtering. In fact, these tests were designed to help teams work together by helping members understand what makes each individual tick.

This specific issue has a history dating back to discrimination against people with mental illnesses such as depression, PTSD, and anxiety. Unable to legally screen out people with mental illnesses directly, employers fall back on the poorly correlated results of personality tests. Systems like these have obvious potential for abuse. Since they’re invisible, there is no public accountability and no way to correct the harm these practices cause.

Opaque Systems

But what if a system is visible, but the owner doesn’t want to reveal how it works? This is an opaque system. Opaque systems are very prevalent in software, especially in artificial intelligence development. There are many startups that promise automated systems that match candidates to job openings, aiding or even eliminating the role of recruiters. On the surface this seems like a great idea – by making it easier for companies to hire people, it will be easier for people to get hired. What you’ll note is that none of these services reveal how they match candidates – it may be proprietary logic or a machine learning system that obfuscates the logic even from the company using it. Candidates who sign up for these systems know that there is a matching algorithm, but they aren’t let in on how it reasons about them. Since candidates don’t know about the strengths and limitations of these systems, they can’t tailor their resumes or profiles. This is made worse by the fact that recruiters usually have a limited understanding of the skills they’re hiring for, so they can reason about neither the system they’re using nor the skills they’re looking for. The system could be biased by race or gender, and the software’s developers may not even know.
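To make that last point concrete, here is a deliberately simplified sketch (my own toy illustration, not any real vendor’s system) of how a model trained on biased historical hiring data can discriminate through a proxy feature. The groups, zip codes, and hire rates are all made up; the protected attribute is never an input to the model, yet the scores still split along group lines:

```python
import random

random.seed(0)

# Hypothetical setup: two groups, A and B, who live in different
# zip codes 80% of the time. Zip code acts as a proxy for group.
def make_candidate(group):
    in_majority_zip = random.random() < 0.8
    zip_code = "11111" if (group == "A") == in_majority_zip else "22222"
    return {"group": group, "zip": zip_code}

# Fictional historical data in which group B was hired less often.
history = []
for _ in range(10_000):
    group = random.choice("AB")
    hired = random.random() < (0.6 if group == "A" else 0.3)
    history.append((make_candidate(group), hired))

# "Train" a trivial model: the hire rate per zip code. Note that
# group membership is deliberately excluded from the features.
rates = {}
for cand, hired in history:
    hits, total = rates.get(cand["zip"], (0, 0))
    rates[cand["zip"]] = (hits + hired, total + 1)

def score(cand):
    hits, total = rates[cand["zip"]]
    return hits / total

# The model still scores the two zip codes very differently,
# reproducing the historical bias via the proxy feature.
for zip_code in sorted(rates):
    print(zip_code, round(score({"zip": zip_code}), 2))
```

A real matching system is vastly more complex, but the mechanism is the same: excluding the protected attribute from the inputs does not prevent the model from learning it through correlated features, and an opaque system gives no one the chance to check.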

By choosing to make their candidate matching system opaque, these services discriminate arbitrarily against candidates who haven’t optimized their profiles for them, and there isn’t a way for candidates to learn how to improve their odds. The system is unappealable and outsiders can’t reason about how it behaves, so they are powerless.

Harmful Systems

But what if a system is visible and the implementers are transparent about how it works? You’re still not out of the woods. Many companies do a credit check before making a hiring decision. This system is visible and transparent – you know they are checking your credit history and that they have some minimum bar they’ll use to make a decision. Again, this initially seems like a good choice – if someone isn’t able to handle their finances responsibly, how can you expect them to be responsible with their job?

Financial irresponsibility isn’t the only way to end up with bad credit. Someone could steal your identity. A hospital might balance-bill you for tens of thousands over what your insurance covers. Maybe you’re still recovering from the financial crisis. But even if you are at fault for your financial history, systematically denying you a job will only make things worse for you and people like you. This is a simple feedback loop: people with bad credit get fewer and worse jobs, so their credit score gets worse. This is one of the many systems contributing to the cycle of poverty in the US.
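The feedback loop can be sketched in a few lines. This is my own toy simulation, not from the book; the cutoff and the yearly score changes are invented numbers chosen only to show the dynamic:

```python
# Toy model of the credit-score feedback loop: employers screen on
# the score, and being unemployed makes the score fall further.
def simulate(score, years=5, cutoff=600):
    for _ in range(years):
        employed = score >= cutoff       # hiring decision uses the score
        score += 15 if employed else -30 # employment -> bills paid on time
    return score

print(simulate(650))  # starts above the cutoff, climbs to 725
print(simulate(590))  # starts just below, spirals down to 440
```

Two people separated by a few points at the start end up hundreds of points apart: the screening rule doesn’t just measure the score, it drives it.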

Conclusion

The companies using and profiting from these systems – these WMDs – rarely examine their impact on the world, and are unlikely to share what they find if they do. As citizens we have to be aware that these types of automated systems exist and influence much of our lives. Invisible and opaque systems are unaccountable, and we have to push back, because we usually only learn that they exist and hurt people once they’ve reached a monstrous size.

As the designers of systems, we have to make sure we aren’t falling into these pitfalls. If you’re making something with the potential to improve the lives of millions and have any sort of professional ethic, how can you live not knowing whether it is actually having that impact? Do you really take pride in your system if you’re not willing to let someone independently verify its effects?

These WMDs are already hurting you and the people around you. I’ve only given examples in hiring, but imagine the collective effect of thousands of these systems across every industry – real estate, medicine, finance – each impacting millions of lives. You probably participate in several, and may even be building one for work. Algorithms to automate systems will only become more prevalent with time. We have to be ready, and responsible.