Problem 2.1. Deduplicating a Linked List with Maps

May 21, 2025May 21, 2025 / Will Beason / Leave a comment

This problem is also from Cracking the Coding Interview, from Chapter 2.

Write code to remove duplicates from an unsorted linked list. How would you solve this problem if a temporary buffer is not allowed?

This is one of the classics that I’ve been very interested to see if there are significant performance improvements to be made over the naive solutions. The restriction in the second sentence of the question eliminates the potential usage of interesting data structures, but still has interesting optimization properties so I’ll handle that in a separate post.

Initial Clarifications

Linked lists don’t inherently contain information at each node, but we can reasonably infer from the problem state that in this case “duplicates” refers to two nodes with the same information.

I’ll use integers as the data type. In practice, if the data held by each node were complex, or if it were otherwise time-consuming to determine equivalence, then it would be worth considering optimizing to minimize the number of comparisons. For now I’ll just use integers as the data type since that comparison takes very little time.

For the linked list itself we don’t need anything complex. I’ve added a convenience method to compute the length of a list as it’s used in many obvious optimizations. Linked List Code.

type Node[T comparable] struct {  
  Value T  
  Next  *Node[T]  
}  

func (n *Node[T]) Length() int {  
  length := 0  
  for ; n != nil; n = n.Next {  
   length++  
  }  
  return length  
}

The Naive Map Solution

First – and the expected solution in interviews – is one that uses a map to record the set of seen integers. In the benchmark I’m assuming effectively no duplicates to make this a sort of worst-case scenario (rand.Int to produce random positive int). Test and Benchmark Code.

// RemoveDuplicatesMap removes entries from a linked list with the same value.
// Uses a map to record seen values.  
func RemoveDuplicatesMap(head *Node[int]) {  
  if head == nil {  
   return  
  }  

  seen := make(map[int]bool, head.Length())  

  for ; head != nil; head = head.Next {  
   seen[head.Value] = true  

   for head.Next != nil && seen[head.Next.Value] {  
    head.Next = head.Next.Next  
   }  
  }  
}

The Manual Hash Set Solution

Go’s maps are far more robust than is required for this problem, potentially meaning they have expensive overhead handling things like general-purpose hash functions and collision-handling logic.

So what if we create our own implementation that doesn’t do any of that? In our case, since we know integers are bitwise-random, we can use a modulus as the hash function. If we limit the map size to only powers of 2, this is equivalent to the (much faster) & operation with a bit mask. The backing array holds the seen integers, and on hash collisions we increment to find the first available slot. Finally, we make the HashSet about twice as large as we need so collisions are infrequent (so 2048 keys for lists of size 1000, not 1024). HashSet Code. Solution Code.

// HashSet is a statically sized hash set of integers which uses
// the ending bits as a hash.  
type HashSet struct {  
  mask   int  
  values []int  
}  

// NewHashSet constructs a HashSet which can contain up to 2^size entries
// and initializes it with value.  
func NewHashSet(size, value int) *HashSet {  
  result := &HashSet{  
   mask:   1<<size - 1,  
   values: make([]int, 1<<size),  
  }  
  result.values[value&result.mask] = value  

  return result  
}  

// Insert adds the value to the set.
// Returns true if the value is already present.  
// Does not handle the case where the set is full and a new value is added.  
func (m *HashSet) Insert(value int) bool {  
  idx := value & m.mask  
  for m.values[idx] != 0 {  
   if m.values[idx] == value {  
    return true  
   }  
   idx = (idx + 1) & m.mask  
  }  
  m.values[idx] = value  

  return false  
}

Comparing the two via benchmark yields dramatic results – our manual implementation of a hash set is about 4x faster! Unsurprisingly (found via the “-benchmem” flag), HashSet only has one memory allocation to Map’s 33, and uses about 10% less memory despite preemptively allocating twice the number of keys.

	Map	HashSet
List Size	ns/op	ns/op	Change
100	5,610	1,495	73.35%
1,000	52,836	9,664	81.71%
10,000	562,210	131,242	76.66%

There’s some interesting sub-linear behavior around 1,000 elements for both solutions, but I don’t have a good explanation for it. It’s nowhere near a cache size boundary and the memory requirement is too large for a single page on my machine.

Conclusion

As is generally the case, a brittle manual solution wins out over a general-purpose one. You wouldn’t want to use this HashSet implementation in production code – you’d likely want additional guardrails (such as tracking collisions to know when to automatically increase in size), which would cost additional processing power. It is still neat to see a recreation of the idea of what hash maps are doing, and it’s surprising that using maps in Go incurs so much overhead. As always, don’t try something like this unless you’ve actually benchmarked and measured that it is a problem for your use case!

This was fun, and I may explore some alternative solutions to this problem that I messed around with in the future.

Optimization Problem 1.3: String Permutations

May 8, 2025 / Will Beason / Leave a comment

This next problem also comes from Cracking the Coding Interview.

Problem 3: Given two strings, write a method to decide if one is a permutation of the other.

Easy Solution

The expected solution here is, of course, straightforward as this is meant as an easy interview question. For each string, create a map from each character in the string to the number of times the character appears. Then, compare the maps.

Naturally, the AI autocomplete gives an incomplete solution.

// Solution3A uses a straightforward solution using character counts.
// Autocompleted solution.  
func Solution3A(left, right string) bool {  
  leftCounts := make(map[rune]int)  
  for _, r := range left {  
   leftCounts[r]++  
  }  

  rightCounts := make(map[rune]int)  
  for _, r := range right {  
   rightCounts[r]++  
  }  

  // Autocomplete begins here.  
  for r, lc := range leftCounts {  
   rc, ok := rightCounts[r]  
   if !ok || lc != rc {  
    return false  
   }  
  }  

  return true  
}

While at first glance it isn’t obvious what is wrong, a careful check should reveal that this code does not handle the case where there are characters present in right which are not present in left. We can verify this bug with tests:

func TestSolution3(t *testing.T) {  
  tt := []Problem3TestCase{  
   {name: "empty", left: "", right: "", want: true},  
   {name: "left empty", left: "", right: "a", want: false},  
   {name: "right empty", left: "a", right: "", want: false},  
  }  

  for _, solution := range solution3s {  
   for _, tc := range tt {  
    if got := solution.F(tc.left, tc.right); got != tc.want {  
     t.Errorf("%s(%q, %q) = %v, want %v", solution.Name, tc.left, tc.right,
       got, tc.want)  
    }  
   }  
  }  
}

The fix is to ensure this can’t happen by checking the character sets.

// If the lengths of the character counts are unequal, then a character is
// missing from one string.  
if len(leftCounts) != len(rightCounts) {  
  return false  
}

Together, verifying the key sets are of equal size, and checking the equivalence of all counts of keys one set are equivalent to the other is sufficient to ensure the strings are permutations. (We don’t need to do a symmetric check.)

There’s an even faster possibility, though.

// Strings of different lengths cannot have the same counts of characters.  
if len(left) != len(right) {  
  return false  
}

We can exit early when strings are of unequal lengths. This covers all cases for the key set size check, is faster, and exits before we create the maps so it is obviously better.

Benchmarking

But … how much better can we do? As with any optimization problem, we should define what “normal” input looks like before committing to this. As with Problem 1, we’ll make the initial simplifying assumption that all characters are ASCII.

We need to decide how we’re going to benchmark this.

It isn’t interesting to benchmark the case of strings of unequal lengths since it’s just a string length check and there’s nothing to optimize, so for our benchmarks .

The case of nearly equal strings will look significantly different from the case where strings are completely different. So we should have one benchmark for strings we know are permutations, and one where it they aren’t (or at least, it is unlikely).

Map iteration in Go is random, so sometimes nearly equal strings will still be identified as such immediately, and for others it may take until the last iteration. We can’t really do anything about this so long as solutions involve maps, but it is a source of randomness that we should be aware of.

As we’ve seen, autocomplete does tend to get things reasonably right. In this case autocomplete did successfully figure out that I wanted to generate slices of data, incorporating functions I’d written in the same package (RandomString and RandomASCII). It just – amusingly – didn’t create a source of randomness, instead invalidly passing nil.

lengths := []int{1, 2, 5, 10, 20, 50, 100}
// Begin autocomplete
data := make([]LengthData, len(lengths))  
for j, length := range lengths {  
  data[j] = LengthData{length: length, data: make([]string, n)}  
  for i := 0; i < n; i++ {  
   data[j].data[i] = RandomString(nil, length, RandomASCII)  
  }  
}

The naive solutions are actually surprisingly slow:

BenchmarkS3Different/Solution3A2/1-16         10935601           100.3 ns/op
BenchmarkS3Different/Solution3A2/2-16          9316386           121.7 ns/op
BenchmarkS3Different/Solution3A2/5-16          5554606           209.3 ns/op
BenchmarkS3Different/Solution3A2/10-16         1032976          1103 ns/op
BenchmarkS3Different/Solution3A2/20-16          431102          2719 ns/op
BenchmarkS3Different/Solution3A2/50-16          185488          6323 ns/op
BenchmarkS3Different/Solution3A2/100-16          84954         13117 ns/op
BenchmarkS3Permutation/Solution3A2/1-16       10739696           105.8 ns/op
BenchmarkS3Permutation/Solution3A2/2-16        8149945           147.3 ns/op
BenchmarkS3Permutation/Solution3A2/5-16        3648794           321.4 ns/op
BenchmarkS3Permutation/Solution3A2/10-16        919902          1307 ns/op
BenchmarkS3Permutation/Solution3A2/20-16        379916          3124 ns/op
BenchmarkS3Permutation/Solution3A2/50-16        142298          7322 ns/op
BenchmarkS3Permutation/Solution3A2/100-16        78962         14562 ns/op

We’re taking around 100 ns to process each character, which is about 10 MB/s. So this solution could easily be a performance bottleneck in a system.

There’s also a few surprising properties the benchmarks reveal:

There is a huge jump from 5 character strings to 10. This is likely due to needing to reallocate the array underlying the map when it grows beyond 8 keys.
There isn’t a huge difference between checking strings which are permutations from strings which are not. The largest difference is at strings of length 5, a 50% increase, but is otherwise 5%-10%.

The Array Solution

Given the massive success of arrays for Problem 1, does a similar optimization here also work?

func Solution3B(left, right string) bool {  
  if len(left) != len(right) {  
   return false  
  }  

  counts := make([]int, 128)  
  for _, r := range left {  
   // Keep track of character counts.  
   counts[r]++  
  }  
  for _, r := range right {  
   // If we see a character and there isn't at least one unaccounted for,
   // we can exit immediately.  
   if counts[r] < 1 {  
    return false  
   }  
   // One instance of this character in left is now accounted for in right.  
   counts[r]--  
  }  

  return true  
}

Instead of tracking the two sets of keys and counts in separate maps, we track counts in a single array that can handle any character from 7-bit ASCII. This has the advantage of not needing to use maps, and requires a single 1kb memory allocation.

With memory allocations, what really matters is the number of times you request memory rather than the amount, when making requests smaller than a – normally 4kb – page of memory. Further, we’re only referencing a single heap-allocated object – the counts slice – rather than two maps.

These results are much better – from 5x faster for small strings to about 100x faster for long strings.

BenchmarkS3Different/Solution3B/1-16         63929632            18.74 ns/op
BenchmarkS3Different/Solution3B/2-16         59162731            19.68 ns/op
BenchmarkS3Different/Solution3B/5-16         51334929            22.14 ns/op
BenchmarkS3Different/Solution3B/10-16        39485048            27.82 ns/op
BenchmarkS3Different/Solution3B/20-16        29531452            37.35 ns/op
BenchmarkS3Different/Solution3B/50-16        14044939            82.91 ns/op
BenchmarkS3Different/Solution3B/100-16        8684762           136.2 ns/op
BenchmarkS3Permutation/Solution3B/1-16       61517580            18.67 ns/op
BenchmarkS3Permutation/Solution3B/2-16       58104522            20.36 ns/op
BenchmarkS3Permutation/Solution3B/5-16       44546454            26.14 ns/op
BenchmarkS3Permutation/Solution3B/10-16      30614641            35.46 ns/op
BenchmarkS3Permutation/Solution3B/20-16      19843489            55.30 ns/op
BenchmarkS3Permutation/Solution3B/50-16       8272191           142.4 ns/op
BenchmarkS3Permutation/Solution3B/100-16      4962151           243.1 ns/op

These results are around 1-2.5 ns/character, so we’re at a processing speed that is significantly faster than data transfer or disk reading.

I tried some other possible optimizations – such as using a raw array instead of a slice for counts, but this didn’t cause any noticeable improvement. This is unsurprising as we allocate the same amount of memory for all cases. The cpu profile also doesn’t suggest any obvious avenues of improvement:

There aren’t obvious interesting changes to make without moving to Unicode, so this seems like a good place to stop.

Problem 1.1

May 2, 2025 / Will Beason / Leave a comment

I’m preparing to go back on the software job market, and to sharpen my skills I’m going back through books I initially worked through as a novice. I’ve got over a decade more experience now, and so I want to approach these problems as a seasoned developer – to try and find something more interesting than just solutions to problems that could be uncovered in half an hour.

To begin, I’m starting with Cracking the Coding Interview, by Gayle Laakmann McDowell.

“Problem 1.1 Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?”

The first part of this question is straightforward to do with a hash map. In fact, it’s so straightforward that generative AI can get you most of the way towards the answer.

// Solution1A is a straightforward attempt to solve the problem with a hashmap.
func Solution1A(s string) bool {  
  seen := make(map[rune]bool)  
  for _, r := range s {  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

Benchmarking

This wouldn’t be a fun exploration of the problem if we didn’t include some benchmarks. Suppose this uniqueness-detecting algorithm were in a location critical to performance. How well do we do, and can we do any better?

So we need some test data. Ideally test data for benchmarking reflects actual call patterns. Since the input data are strings, the main properties of the input that matter for performance are:

the distribution of string lengths,
the set of possible characters, and
the distribution of characters.

Possible length distributions:

All the same
Uniform in a range
Poisson distribution (English word lengths appear to have this distribution)
1. https://www.sciencedirect.com/science/article/abs/pii/0378375886901692

Possible character sets:

All-caps letters (26 possibilities)
Upper and lowercase letters (52 possibilities)
All printable ASCII (95 possibilities)
All Unicode characters as of Unicode 16.0 (155,063 possibilities)

Possible character distributions:

Uniform distribution
Pareto distribution (similar to actual text?)

Like with any good optimization problem, different choices here may result in different optimal solutions. Some may even completely eliminate certain solutions.

To start, we’ll go with:

All strings are of length 12
All characters are printable ASCII
All characters are equally likely

Twelve is a nice number of characters to begin with as approximately half of all strings will have a repeated character. Using the birthday problem formula, we can see:

95!
---        x 95^12 ~= 0.484
(95-12)!

For the benchmark we’ll generate 1,024 random strings, and cycle through evaluating them.

const (  
  n     = 1 << 10  
  nMask = n - 1  
)  

func BenchmarkSolution1(b *testing.B) {  
  rng := rand.New(rand.NewSource(time.Now().UnixNano()))  

  randomStrings := make([]string, n)  
  for i := 0; i < n; i++ {  
   randomStrings[i] = RandomString(rng, 12, RandomASCII)  
  }  

  for solution, f := range solutions {  
   b.Run(solution, func(b *testing.B) {  
    for i := 0; i < b.N; i++ {  
     f(randomStrings[i&nMask])  
    }  
   })  
  }  
}

Hash Map Solution

As it turns out, while our initial solution works, it’s pretty slow for what it does:

BenchmarkSolution1/Solution1_A
BenchmarkSolution1/Solution1_A-64              1958344           602.7 ns/op

So … why is it slow? For this, we turn to the handy builtin profiler that allows is automatically enabled during testing.

go test ./chapter1/ -run=NONE -bench=BenchmarkSolution1/Solution1A -cpuprofile=cpu.out

The big things to notice here are:

Only 89% of the time is Solution1A running (chapter1.Solution1A)
68% of the time is creating new entries in the map (runtime.mapassign_fast32)
16% of the time is accessing map entries (`runtime.mapaccess1_fast32)
2% of the time is unavoidable runtime calls (runtime.rand)

Note that the call to runtime.rand is not related to the benchmark setup code – you can see this for yourself if you remove the random string generation, the call is still present in the profile.

So in total, 84%/89%, or 94% of our time is spent dealing with the hash map. Even if we made all other parts of our logic free – taking zero time – at best we could save 36ns out of over 600ns. If there is an improvement, it’s either in how we’re using the hash map or just the fact that we are using a hash map.

For better ways of using the hash map – we could try reducing the size of the keys, or preallocate the map so it doesn’t ever need to be reallocated. In the first case, the idea is to reduce the size of the map’s keys, possibly saving time in memory allocation and in the runtime of the hash algorithm. For the latter we preallocate the map at size 12 (which it should never grow beyond, as strings have exactly 12 characters), saving that time. Implemented:

// Solution1A1 is a slight modification to Solution1A where runes are reduced to
// bytes as all are assumed to be ASCII.  
func Solution1A1(s string) bool {  
  seen := make(map[byte]bool)  
  for _, r := range s {  
   if seen[byte(r)] {  
    return false  
   }  
   seen[byte(r)] = true  
  }  
  return true  
}  

// Solution1A2 is a slight modification to Solution1A where maps are initialized
// to size 12 to avoid needing to reallocate the underlying array.  
func Solution1A2(s string) bool {  
  seen := make(map[rune]bool, 12)  
  for _, r := range s {  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

Trying these with our benchmark:

BenchmarkSolution1/Solution1A-64              2068266           617.7 ns/op
BenchmarkSolution1/Solution1A1-64             1332189           939.7 ns/op
BenchmarkSolution1/Solution1A2-64             2088864           572.0 ns/op

We can see that there is a notable regression for converting runes to bytes, and a slight improvement for avoiding reallocating the map. (I tried other values, such as 11, and saw no further improvement)

There could be other improvements using Go’s native hash map, but I’m not aware of them.

Array Solution

Notice that we have a fairly small number of possible characters – 95. Also notice that their byte values range from 32 to 126, all under 128. Thus, if we initialized a bit array of length 128, we could store in that whether we’ve seen a character corresponding to each value.

// Solution1B is like the above but uses a bit array.
func Solution1B(s string) bool {  
  seen := make([]bool, 1<<7)  
  for _, r := range s {  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

We can immediately see a drastic improvement:

BenchmarkSolution1/Solution1A-64              1892926           626.3 ns/op
BenchmarkSolution1/Solution1A1-64             1231104           948.8 ns/op
BenchmarkSolution1/Solution1A2-64             2023808           573.0 ns/op
BenchmarkSolution1/Solution1B-64             82965651            13.96 ns/op

Our bit array is about 50x faster than our hash map solution. Looking at the CPU profile, we can see that nearly all of the time is spent in the solution logic (0.94s of 0.96s) – any potential gains are likely in the code of that function.

But what gains are even possible?

One possibility is avoiding the heap allocation incurred by defining a slice instead of using an array directly.

// Solution1B1 is like the above but uses an array directly instead of a slice.
func Solution1B1(s string) bool {  
  var seen [1 << 7]bool  
  for _, r := range s {  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

This results in a slight (but reliable) slowdown:

BenchmarkSolution1/Solution1B-64             89289081            13.75 ns/op
BenchmarkSolution1/Solution1B1-64            84424354            13.90 ns/op

Another possibility is to avoid allocating the slice with each call, instead copying empty data into it.

var (  
  seenC  = make([]bool, 1<<7)  
  seenC2 = make([]bool, 1<<7)  
)  

// Solution1B2 is like the above but uses an array directly instead of a slice.
func Solution1B2(s string) bool {  
  copy(seenC, seenC2)  
  for _, r := range s {  
   if seenC[r] {  
    return false  
   }  
   seenC[r] = true  
  }  
  return true  
}

This option also results in a slowdown, larger than before:

BenchmarkSolution1/Solution1B-64             89082355            13.53 ns/op
BenchmarkSolution1/Solution1B1-64            88801987            13.77 ns/op
BenchmarkSolution1/Solution1B2-64            76817073            15.82 ns/op

Alternatively, we could even hard code the number of iterations (since we know every string is of length 12). The idea here is that it avoids a quirk of how Go iterates over strings.

// Solution1B3 is like 1B but iterates a static number of times.
func Solution1B3(s string) bool {  
  seen := make([]bool, 1<<7)  
  for i := 0; i < 12; i++ {  
   r := s[i]  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

This does result in a noticeable speedup, at the cost of making the code more brittle. It isn’t obvious to me why there is a speedup. It could be that creating the (index, rune) tuple is expensive, or it could have to do with how Go checks if it is done iterating.

BenchmarkSolution1/Solution1B-64             85514907            14.02 ns/op
BenchmarkSolution1/Solution1B1-64            83047393            14.11 ns/op
BenchmarkSolution1/Solution1B2-64            74333500            16.27 ns/op
BenchmarkSolution1/Solution1B3-64            97826607            12.03 ns/op

However, we don’t actually have to make the code brittle to realize the same gain.

// Solution1B4 is like 1B but uses the string length to set the number of iterations.
func Solution1B4(s string) bool {  
  seen := make([]bool, 1<<7)  
  for i := 0; i < len(s); i++ {  
   r := s[i]  
   if seen[r] {  
    return false  
   }  
   seen[r] = true  
  }  
  return true  
}

In fact, this is even faster than the hard coded length. I suspect there’s a compiler optimization, since it would seem like recalculating the length of s every iteration would take longer than just comparing to a compile-time constant.

BenchmarkSolution1/Solution1B-64             86110658            14.15 ns/op
BenchmarkSolution1/Solution1B1-64            86076312            14.01 ns/op
BenchmarkSolution1/Solution1B2-64            72450708            15.67 ns/op
BenchmarkSolution1/Solution1B3-64            99203356            11.65 ns/op
BenchmarkSolution1/Solution1B4-64            104699756           11.10 ns/op

That’s about as far as I want to take the bit array solution. What about the other part of the question – the part I’ve ignored until now?

No Additional Data Structures

“What if you cannot use additional data structures?“

For Go in particular, this is tricky as strings are immutable. This means that a natural solution – to sort the characters of the string and then check for duplicates – isn’t possible.

If you iterate through a string, you’ll notice that you get the 4-byte rune type rather than the 1-byte byte type you get by indexing a string. In this case, it doesn’t matter as for now we’re assuming all characters are ASCII. Which means we can just index the string, even though that would be problematic if the system we’re designing for might ever encounter non-ASCII.

// Solution1C avoids using other data structures by directly comarping pairs of characters.
func Solution1C(s string) bool {  
  for i := 0; i < len(s); i++ {  
   for j := i + 1; j < len(s); j++ {  
    if s[i] == s[j] {  
     return false  
    }  
   }  
  }  
  
  return true  
}

Surprisingly this solution isn’t terrible compared to the other solutions:

BenchmarkSolution1/Solution1B4-64 103588170 11.66 ns/op
BenchmarkSolution1/Solution1C-64 39652282 28.48 ns/op

In fact, this solution *always* beats the hash map solution, despite technically having O(n^2) complexity. At 100-character strings:

BenchmarkSolution1/Solution1A2-64 1279503 940.2 ns/op
BenchmarkSolution1/Solution1B4-64 84820020 13.66 ns/op
BenchmarkSolution1/Solution1C-64 31667222 37.01 ns/op

At 1,000-character strings:

BenchmarkSolution1/Solution1A2-64 1247774 958.0 ns/op
BenchmarkSolution1/Solution1B4-64 79115478 15.28 ns/op
BenchmarkSolution1/Solution1C-64 31556229 38.59 ns/op

This is because of a fundamental limit of our problem – there are at most 95 unique characters, and so despite being an O(n^2) solution, comparing every pair of characters is faster as we exit early.

Pareto Distribution

Suppose characters occur with a frequency following a Pareto distribution. There’s a lot of debate on how to model this properly, so here I’ll pick parameters that seem to fit a letter distribution chart. What matters more is seeing what happens when there’s a small number of characters that are chosen more often than others, but still a good number of the less frequent. If you’re interested in replicating, I’ve gone with X_m=3.0 and alpha=0.45, then offset the results by -3. The results aren’t too awful and result in a more long tail distribution than actual letters, which means any measurements here are likely to be overestimates.

Perhaps unsurprisingly, all solutions perform better in this case. Given that some characters are much more common, there’s a significant likelihood of a common character appearing as the first or second character, and then again later in the string – so generally the logic allowing the algorithms to exit early triggers sooner than if characters are distributed uniformly.

BenchmarkSolution1/Solution1A2-64             2115301           563.4 ns/op
BenchmarkSolution1/Solution1B4-64            106364142           11.02 ns/op
BenchmarkSolution1/Solution1C-64             38778128            29.50 ns/op
BenchmarkSolution1P/Solution1A2-64            3728235           327.9 ns/op
BenchmarkSolution1P/Solution1B4-64           193359403            5.491 ns/op
BenchmarkSolution1P/Solution1C-64            226532450            5.123 ns/op

As a surprise, the apparently O(n^2) solution is actually the fastest compared to the others when given data that better approximates actual text! This should reinforce both that:

the time-complexity of an algorithm should not be the sole determiner of which algorithm is best, and
measuring, especially with simulations of real world data, can change which algorithm is best for an application.

In Context

So we’re evaluating these strings for repeated characters, but where do these strings come from? At 5 or 11 ns/op, how likely is this the bottleneck?

Let’s hand wave away any sort of parsing (e.g. JSON parsing is very slow) – we’re either getting this as a raw byte stream from a connection, or we’re reading raw text from a file on disk.

Consider the slower option – uniformly distributed ASCII characters in strings of length 12, since that gives a nice 1 byte/ns processed. A quick conversion shows this is equivalent to about 1 GB/s. For comparison, a mid-range SSD reads about 500 MB/s, and a gigabit connection is only 125 MB/s. So at that speed, it’s no longer a significant bottleneck.

Considering the hash map solution – even in the more lenient case of the non-uniform distribution, we get 12 bytes / 328 ns, which is about 36.6 MB/s. This is slow enough that it could reasonably be a bottleneck if almost nothing else is happening in the application.

Conclusion

There was a lot here, just looking at this problem from the perspective of Go. The O(n^2) solution being the best for normal-looking data surprised me – it strikes me as the kind of finding that’s interesting to bring up in an interview setting.

I doubt there’s more to be gained from this particular perspective at the problem. Next time I may consider other languages (like Python or Rust), or consider what happens if the characters could be any valid Unicode 16.0. Python and Rust may allow trying the sorting algorithm approach, and Unicode likely would make the byte array solution invalid. But validating that is what benchmarking is for!

A C-minus Flame Cantrip

April 24, 2025 / Will Beason / Leave a comment

It took about ten days after the invention of magic for the world to end.

Not particularly fast as far as things usually go, but still too fast for the majority of dead humans to really understand what killed them. You know, one minute it’s “I’ll have a cappuccino with almond milk,” and the next it’s “Huh, the sky is so bright everything around me is spontaneously catching fire/melting/oh god my eyes.”

Molly Richards was an ordinary person until she just so happened to waggle her fingers in precisely the right way while spouting nonsense. Suddenly, the fire pit she’s looking at sparks and catches fire, and her astonished friends drop their beers. Some drunken experiments later, she’s figured out the exact phrasing and finger waggles, and her friends have joined her as the Earth’s first fire mages.

A quick drive, a couple YouTube shorts, a few hours, and now several million English speakers are also setting tiny fires in their living rooms. (And, thanks to the bilinguals, a rapidly growing number of non-English speakers). And then the physicists get involved.

Fuck physicists. Seriously fuuuuck them. They’re the first to really try to optimize magic, and they’re good at it. It doesn’t take long to admit maybe this whole thermodynamics thing is broken and start looking for exploits. How is waggling my fingers and saying gibberish lighting things on fire? How far away can I set something on fire? How hot does it get? Can we make it hotter? How does the spell know what to set on fire? Is a human required?

It turns out messing with the gluon binding energy – even briefly – to create thermal energy isn’t a great idea. Protons in the affected area decay into pions and positrons, and now physicists don’t need a particle accelerator to make antimatter. The attention of the majority of particle physicists on the planet leads to fast advances in widening the spell’s affected region and more efficient capture of antimatter into Penning-Malmberg taps.

Next thing you know, every government is in a negotiation trying to figure out (1) what the FUCK is going on, (2) how to negotiate some sort of mutually-assured destruction treaty to keep the world from ending. The more powerful nations are moderately successful at eliciting promises from everyone to play it safe in hopes of not immediately exploding the world. Naturally, more than a few of these governments are dirty liars, and not every one of their citizens is totally on board with the idea of “playing it safe.”

There’s a critical threshold for planetary species – if the required number of sentients cooperating to produce a biosphere-ending device becomes small enough, the end is all but inevitable. You cannot unring a bell.

The story repeats throughout the universe I’m watching for billions of species. Most reach an early level of spaceflight before utterly obliterating themselves. This particular Creator’s spell appears obscure enough that its discovery is a numbers game, and that number appears to be around a few billion sentients.

Surprisingly, some species actually do survive to become galactic powers. These are mostly non-hearing, but a few have sufficiently anomalous prehensile appendages unable to make the spell’s required dexterous movements. They grow through a graveyard of a universe, uncovering myriad long-dead civilizations that ended in catastrophic flame. At this stage, even when these species do uncover the fire spell – via deep investigation of the properties of the universe or chancing upon a soon-to-be-self-extinguished race – their presence is solid enough that they can survive early mistakes.

I sigh, end the universe, assign it a barely passing grade, and turn to the next one in the pile.

A Scam by any other Name: Inscriptive Agency and Online Safety

August 13, 2023August 13, 2023 / Will Beason / 1 Comment

My instinctual response to rage at a system is an urge to systematically deconstruct its power.

The Scam

March 17th, 2022

While it may have begun sooner, the “look who just died” scam entered the Internet’s consciousness on March 17th, 2022 with a Reddit post on r/answers. The user reported that they received a message on Facebook Messenger claiming someone they knew died with a link to a site trying to steal their Facebook password. Users of the forum responded initially with advice for them to contact their friend to let them know their account was hacked, or to block the person if they were a stranger.

March 20th, 2022

The post then sat silent for three days before another user reported the same occurrence, having received such a message from a coworker. Unlike the initial poster, they clicked on the link and entered their login information. They soon realized their account was compromised when they saw their Facebook Messenger account had sent the same message to nearly all of their Facebook friends.

March 30th, 2022

Ten days later, another user reported their account was compromised by the same scam. They realized that not only had the scammer sent the message to all of their Facebook friends, but all of their Facebook contacts, including people they had unfriended. This the first occurrence of a user increasing their account security in response to being scammed – they enabled 2-factor authentication and revoked session cookies for active logins to be sure.To this point, there was no clear goal of the scammers. Why did they want control of people’s Facebook accounts?

April 5th, 2022

Another user reported experiencing the scam. But this time, they did not escape with only injuries to their pride. The scammers collected the payment information from this user’s account and stole $2,000 from their debit cards.Thus began a deluge of other Facebook users flocking to the Reddit post, with dozens of other reports and over a hundred total comments until July 2022.

May 9, 2022

On May 9, 2022 the online safety and fact-checking website ThatsNonsense.com reported on the scam. In the article, they point out that the message has several features which internet users should learn to look for that indicate the message is illegitimate. They also give advice for actions to take in the case a user has become a victim of the scam themselves.

(Image from snopes.com)

Days later, cable news began covering the scam, with WGAL – Lancaster/Harrisburg being one of the first to report on it. DuckDuckGo has indexed an MSN news report on the same day, but the article has since been deleted, most likely due to MSN changing the structure of its site. Broken link.

Interlude

Over the following months, dozens of news agencies reported on this scam, the money people lost, and how to spot it in the wild.

Evolution

The scam evolved into at least one other form in the coming months. Rather than reaching out via Facebook Messenger, the scammers took a screenshot of a paused video from the ABC News YouTube channel showing a wrecked car on the side of the road in Georgia, with debris scattered in the grass. They edited the metadata of the site to display a fake view count – implying that the video was highly viewed.

(screenshot by Will Beason)

To ensure the post would be prioritized in people’s feeds in the algorithm, the scammers did several things known to make sure posts appear in people’s feeds.

They made sure that the generated link going to the malicious site had a thumbnail. Facebook prioritizes links which have valid thumbnails.
They re-shared their own post. Facebook prioritizes “shared” posts over originals, even when a user shares their own post.
They tagged exactly nine other people. This is a “sweet spot” where the notification of the tagging still shows up in people’s Facebook notifications, and it doesn’t trigger anti-spam measures. Further, notifications prompt people to view the post in order to get rid of the notification. While users may delete notifications without viewing the post, they have no means of knowing what they’re missing before clicking to view.
They made the post public so that the post could potentially appear in anyone’s feed – even people they have not “friended” on Facebook.

Together, this nearly guarantees at least nine other people (those tagged) will see the post. The other measures mean the post itself is likely to be highly-ranked in all of their friends’ feeds and will likely be seen by many dozens of people. For those tagged – the friends who interact with them often are likely to see the post as this is another feature Facebook heavily weights in ranking posts to show on user feeds.

Facebook’s Responsibility

This scam has persisted for over a year despite national news coverage, many hacked accounts, and the simple, predictable nature of these interactions. In the 20 such posts above, all of them linked to the same domain and had an identical thumbnail. All of the posts in a similar period of time use identical wording. In the past few weeks, it has been “Just happened an accident” but the wording has differed over time. It would not be technically difficult to detect these posts: I am able to easily find dozens of such posts by searching for the wording-of-the-day using Facebook’s already-extant search feature. Presumably, Facebook has means of searching posts by the domain linked to and the number of people tagged, which would make identifying these posts even easier.

And yet, this scam goes on.

Reports

On August 11th, in a flurry of data collection (rage?), I conducted a search as described above, immediately finding 25 such scam posts. I went through the report process for each post:

Three dots -> Report post
Please select a problem -> Spam
Submit

This creates an entry in the Facebook Support Inbox tied to my account.

The message assures that Facebook will use this information to identify spam, and more information about how Facebook deals with spam content.

Response

Facebook regularly responds to reports of spam. These do not happen on a regular timeframe – it isn’t clear how these are prioritized. I’ve had reports of this scam sit in my support inbox for months, and others are answered within days.In every case for this scam I have reported where I have gotten a response (~10 so far from August 11th reports), Facebook has responded with the following.

Sometimes the message gives an option to request a “re-review” of the content. Other times, the only options available are to block the compromised account or to delete the message from my support inbox. There isn’t a clear pattern here.

Example where I was able to request a re-review:

In no cases in the several months I have reported these scams has Facebook reported that one of these posts was taken down. I’ve reported about 40 in total.
Note: It is possible that Facebook removes messages from the support inbox – I have not verified that all of my reports still have entries.

Wherefore art thou, scam?

It is clear to users that these are scam posts. On many of these posts I’ve reported, friends of the person warn others not to click the link, and tell the person that their account has been compromised. Regular news coverage reports the impact on those who have lost money from the scam. Coverage on online safety sites have detailed information about the scam: how to spot it and how to recover.

But none of that matters because Facebook decides what is, or is not, a scam on Facebook.

~~ Fin ~~

Stopping here because I’m exhausted. I’m sure there’s neat implications to write about later. I hope you enjoyed my little weekend project.

Don’t count on language transformers citing their work

February 28, 2023March 28, 2023 / Will Beason / Leave a comment

*An example of a language model generating incorrect text and then attempting to cite the false claim. As of this writing, TLS 1.4 does not exist and Wikipedia makes no mention of TLS 1.4.*

“Truth” for language transformers

Language transformers do not have a concept of whether text they’ve generated is “true” – they are designed to predict likely continuations of a piece of text. The algorithms behind this are sophisticated, but they aren’t trying to do anything but recreate textual patterns in the corpus they were trained on. What matters to the transformer is “Given the preceding text, what should come next?” To the transformer, it is immaterial whether the generated text is factual information or misinformation.

Consider these two prompts:

“A recent study by the Harvard Medical School found that the new Novavax vaccine _”
“My 2yo was never the same after she got her shot. And then she got sick anyway. Clearly, the vaccine _”

Suppose you saw two comments on the Internet one beginning with each of the prompts. What would you expect to come next, based on what you have read on the Internet? (For this exercise, rely only on your memory)

Even though you probably personally disagree with what one of the continuations says, you would not be wrong in providing contradictory continuations of these two prompts. You, taking the role of the language transformer, haven’t actually made these claims – in this exercise your goal was not to inform whoever entered the prompt; it was to continue the text. In the same way, the language transformer is not evaluating the text it is generating in this way – it isn’t “taking a side” or even, in some sense, “saying” these things.

Not-Really-Citations

Consider possible responses you may have had to the first prompt. You could have said “… was 85% effective in preventing COVID infection 2 weeks after the vaccine was administered” or “… was found to be more effective in people who had never had COVID before.” Even though neither of these is (necessarily) supported by research, they are plausible continuations. Conversely, you are unlikely to have said something like “… caused people to spontaneously grow two additional eyes.” To the language transformer, those first two continuations would be rated at a much higher likelihood than the third and so it would tend to generate responses like those.

One way an application could cite text generated by a Transformer.

Language transformers don’t reference documents at query time and then generate text based on those documents. They are trained long before, and attempt to generate text that is a continuation of whatever they’ve been supplied. After the transformer has generated text, different algorithms analyze the text, and then look for the documents “most relevant” to the related text. Similarly – you could search the Internet for reliable sources that say similar things to how you continued the first prompt, and then cite them. You – and the transformer – aren’t really “citing” where you got this information, you made a guess and then looked for documents that seemed to support your guess.

It may be possible to attempt to determine claims generated text makes, and then re-prompt the transformer if it generated incorrect information, but these are hard problems. Identifying claims a text makes is not straightforward, and determining whether two texts say the same thing is difficult – even for humans. If the transformer generates “the vaccine was very effective” but the research says “the vaccine prevented 85% of cases in adults and 75% in children”, is the generated text “factual”? The citer may reference that research, but a reasonable person may disagree that a vaccine if “very effective” if it only has a 75% effectiveness in children.

Conclusion

Together, these factors mean that applications which use transformer-generated text will be unable to reliably cite sources which back up the text. In the example at the beginning of this post – there is no TLS version 1.4. While NeevaAI cites the TLS Wikipedia Page, as of this writing the text “1.4” does not appear anywhere on that page. Most likely, the transformer generated “is widely used in applications such as email, instant messaging, and voice over IP” because this text appears verbatim in the second sentence of the Wikipedia article. The citer seems to generate one citation per sentence, and so it likely chose Wikipedia as the source for that sentence because it found a high degree of textual similarity between that sentence and the text on Wikipedia. The citer didn’t catch that the generated sentence also makes a claim about TLS 1.4 but not the Wikipedia article.

Rolling Advantage with m dice of n sides each

July 14, 2022July 14, 2022 / Will Beason / 1 Comment

Matt Parker is wrong on the internet and someone must rise to correct him. Alas, today is my day. I shall double reverse-Derek the one and only Matt Parker.

In a recent video, Matt Parker explored a beautiful, geometric way of understanding a problem known to many RPG players. In many RPGs (such as Dungeons and Dragons), players roll a 20-sided die (also called d20, for short) to determine the outcome of an action. In some cases, the dungeon master (who interprets the rules) may decide to give a player “advantage” on a roll, allowing them two roll two d20s (written as 2d20) instead of one and taking the higher of the two. The question that Matt Parker explores is: what is the expected value of taking advantage – how much better is it than just rolling one d20?

He then extends his answer to a more general form:

And this is where he is wrong.

First, consider the case of rolling md1. That is, rolling some number of 1-sided dice. Matt’s formula incorrectly gives:

Obviously since a d1 can only ever return 1 (and so the above should simplify to 1 for all m), something is amiss.

Here I need to define some mathematical notation for clarity. Define a random variable +_md_n as the result of taking advantage when rolling m dice, each with n sides.

We can similarly check the case of rolling md2 since this is easy enough to check by hand. The probability of +_md₂ evaluating to 1 is the probability of only rolling 1s, and so we get:

This is definitely not in line with Matt’s formula. So, what is the actual formula?

The below proof relies on the fact that the calculation for the expectation of a variable can be partitioned into mutually exclusive cases. That is:

Now, suppose that we had some way of representing E(+_md_n) and we want to find E(+_md_n+1). That is, we want to increase the number of faces of our dice. As we can split our expectation, calculation, we can see:

The first term is the probability that we have no faces of value n+1 times the expected value of taking advantage with mdn. The second term is the probability of having at least one face of value n+1 times that value, as when taking advantage only the largest value is counted.

And now for md3. Trivially, the probability of obtaining zero 3s when rolling m d3s is .

This pattern is suspicious, so let’s try this general form and see if using the iteration pattern preserves it:

Therefore, the expected value of rolling advantage with mdn is:

Through symmetry, the expected value of rolling with disadvantage (taking the lowest die) is:

Using Degree Centrality to Break Wikipedia Category Cycles

September 11, 2021September 11, 2021 / Will Beason / Leave a comment

I’m training machine learning models to classify articles on Wikipedia. To do this well, I need a lot of training data. I’d originally intended to use Wikipedia categories to automatically classify articles. Unfortunately, I can’t assume articles in Category:Philosophy or one of its subcategories should all be “Philosophy” as it turns out categories aren’t used consistently enough to use them out-of-the-box.

Category Cycles

A category cycle is when a category is its own ancestor. For example (as of this writing) Category:Quintephones is a subcategory of Category:Electronic musical instruments, which is a subcategory of Category:Quintephones. This specific example isn’t problematic, but if such a cycle exists which includes many very general topics, it can make classifying articles by their category difficult.

In principle, Wikipedia categories should contain no cycles – a parent category should be strictly broader than its children. In practice, this is not the case. These cycles cause problems when trying to use categories as information in machine learning models and for labeling training data.

There are too many relationships (“category A is in category B”) to consider one-by-one. I need some way to narrow down category relationships to the ones which cause the most problems. So, given a strongly-connected directed graph of Wikipedia category relationships I’d like to try to automatically identify the most problematic relationships (the ones which shouldn’t exist), manually examine the results, and then remove the relationships I can tell are bad.

Centrality Measures

The first tool we’ll be using is centrality. Roughly, centrality measures which entities are “most important” in a network of related things. The most important entities don’t necessarily have problematic relationships, but they are the ones most likely to. There are several types of centrality, each with their own strengths and weaknesses. We’ll be using several of them to analyze the various strongly-connected digraphs we found last time.

For this post, We’ll be studying the second-largest, which contains 77 categories mainly dealing with Californian geographical features. We’ll call this the California subgraph.

Degree Centrality

For directed graphs, there are actually two types of degree centrality – indegree and outdegree. For our case, these respectively correspond to:

Indegree centrality: The number of subcategories of “A”.
- For example, “Science Fiction” and “Fantasy” are subcategories of “Fiction by Genre” as they are more specific forms of “Fiction”.
Outdegree centrality: The number of pages “A” lists as a parent category.
- For example, “Fiction” is a parent category of “Fiction by Genre” as it includes more broad topics than just lists of fictional works, organized by genre.

The subcategories and parent categories of “Category:Shasta Cascade”.

Category:Shasta Cascade has three child categories in our graph (that is, only considering child categories in this list), so its indegree centrality is 3:

The page has one parent category which is in our graph, Category:Sacramento Valley, so its outdegree centrality is 1.

Below is a graph plot of the categories. Each dot (or “node”) is a single category from the California subgraph, where arrows point from a child category to one of its parents. Larger dots represent categories with more subcategories.

Unfortunately there’s no great way to label individual nodes in a graph with this many nodes. The two large nodes are Category:San Joaquin Valley (left) and Category:Sacramento Valley (right).

The category with the highest indegree and outdegree centrality is Category:Sacramento Valley, with 9 parent categories in this subgraph and 11 child categories. Second place is Category:San Joaquin Valley. These are immediately, obviously the problem pages in this graph. While parts of these valleys are in various Californian counties, the fact that parts of them extend into the counties are not the “essential—defining—characteristics” of those topics. By this definition, they belong in none of the other categories in this subgraph. What happens if we remove the outgoing edges from these nodes?

After Removing the Problematic Relationships

The same category network as before, but with outgoing edges for “Category:San Joaquin Valley” and “Category:Sacramento Valley” removed. Note that there are no longer any cycles, and the most specific categories are easily identifiable, splayed into an outer ring.

Removing the outgoing edges from “Category:San Joaquin Valley” and “Category:Sacramento Valley” completely eliminates all cycles in the California subgraph. These seventeen edges are the minimum required to do so. The result looks like a much more sensible graph of category parent/child relationships.

Conclusion

Since this is an easy problem to fix, I’m going to go ahead and make these changes to the Sacramento Valley and San Joaquin Valley categories myself, and start a discussion on the talk pages pointing to this post for explanation.

I’m still going to explore other types of centrality measures to get a feel for their strengths and weaknesses – I’ll likely need to use multiple techniques to fix the largest graph, as it has nearly 12,000 categories.

The Largest Connected Digraph of Wikipedia Categories

September 10, 2021September 11, 2021 / Will Beason / 1 Comment

I’ve been playing with Wikipedia corpora for a few months now, and it’s in a good enough state that I’ve even constructed a basic Bayesian classifier which classifies Wikipedia articles into the top 19 Library of Congress Classifications I care about. The problem I’m having now is one of training data – to be reasonably good, most machine learning models need lots of high-quality labeled data.

I need some way to label a large number (at minimum, thousands) of training examples from a wide variety of Wikipedia pages. Enter: Wikipedia categories.

Note: for all data in this post, I used the September 1st, 2021 dump of Wikipedia pages and articles.

Wikipedia Categories

Categories are listed at the bottom of many Wikipedia articles. They form a sort-of-hierarchy of pages. Per the Wikipedia:Categorization article, these sound like exactly what I’m looking for:

“The central goal of the category system is to provide navigational links to Wikipedia pages in a hierarchy of categories“
“Category chains formed by parent–child relationships should never form closed loops; that is, no category should be contained as a subcategory of one of its own subcategories.”

Seen above, Star Wars is a “Science Fiction” story, so it falls in the “Science Fiction” category. The Science Fiction category lists all pages (and categories) which list themselves as being part of the “Science Fiction” category. So good so far.

Further, the Science Fiction category itself is a member of the Fiction by Genre category. This makes sense. Naively, we would think that this could form a reasonable hierarchy – any article in “Fiction by Genre” or one of its subcategories should fall under the Literature classification.

This is where the strangeness of crowdsourced data comes in. Given any category, we could keep following its parents until we reach one which is manually labeled. This is one potential path, starting from “Science Fiction” and successively choosing parent categories:

Science Fiction
Fiction by Genre
Fiction
The arts
Culture
Human activities
Humans
Human biology
Branches of biology
Subfields by academic discipline
Categories by topic
Categories by parameter
Wikipedia categories
Wikipedia categorization
Wikipedia content administration
Wikipedia administration
Wikipedia
Online encyclopedias
Free content
Copyright law
Law by issue
Categories by issue
Categories by parameter

There are two immediate problems. First – we’ve entered a cycle! The Categories by parameter category is it’s own parent/child. Second – we’ve clearly gone way off topic. Star Wars should not be classified as a “Law” or “Biology” article, even though it falls into “Law” and “Biology” categories.

This is unfortunate, contradicts the intent of Wikipedia categories, and I’ll have to work around it. Before doing so, I’d like to study it and see how extensive this problem really is. Enter: Connected digraphs.

Connected Digraphs

A directed graph, commonly “digraph” represents a set of elements which are connected with one-way relationships. Your biological family tree forms a digraph – you are the offspring of your parents; they are not your offspring (assuming no time-travel incest shenanigans). Digraphs aren’t only limited to one-way connections; relationships may go two ways. A common example of digraphs is that of a city’s downtown. Some intersections are connected with two-way roads, and others by one-way roads. For example:

In this city, we say that “Intersection A is linked to Intersection B”, but also “B is not linked to A” since the road from A to B is one-way. In a well-designed downtown, it should be possible to eventually get from any intersection to any other intersection – it should be “strongly connected”. In the downtown above, unfortunately, anyone who gets to F can never escape. Similarly, once someone gets to C they can’t get back to A, B, D, or E – they may only proceed to F and never return. We would say that the digraph representing the intersections of the downtown above is “weakly connected” as there is not a path from each intersection to every other intersection (without breaking the law).

If we were to turn the road connecting E to F into a two-way road, this would resolve the issue – everyone who got stuck at F would be free to get to the rest of this city. Such a digraph would be “strongly connected”.

To determine the scope of the problem with the Wikipedia category hierarchy, I want to find all such strongly connected graphs, as they will cause problems for any labeling scheme based on Wikipedia Categories.

The Largest Strongly-Connected Wikipedia Category Digraph

I found it. This massive strongly-connected graph contains 11,895 of all 2,102,570 categories. That is, 0.57% of all categories belong to this single strongly-connected digraph. This single digraph contains as disparate topics as:

The problem is that since these topics are so broad, once a category (or article) gets classified into one of these, it immediately inherits all of them. Using any category in this group would add lots of noise to my model.

The second-largest is 77 geography-related categories in California.

The third-largest is 70 categories related to wars in the Middle East.

The fourth-largest is 53 geography-related categories in the southern US and central America.

There are several hundred smaller strongly-connected graphs, for a total of 356 distinct digraphs. Most of these are obvious mistakes, such as:

Future Work

There’s three things I’m going to do with this information.

Many of the smaller (but still sizeable) categories all fit into the same LCC classification. I can easily use these as training data.
Rather than assume that Wikipedia articles are transitively of the same topic as their parent categories, I should reduce how much I weight that information based on how many steps it takes to get to a classified category.
I can use centrality measures to find which categories are most problematic, and both fix those in my own data set and possibly recommend these changes to Wikipedia.

What are the longest n-grams on Wikipedia?

August 8, 2021August 8, 2021 / Will Beason / Leave a comment

TL;DR

I found the longest n-gram on Wikipedia which appears more than 1,000 times. It is 103 words long. It is:

The township is governed by a three-member board of trustees, who are elected in November of odd-numbered years to a four-year term beginning on the following January 1. Two are elected in the year after the presidential election and one is elected in the year before it. There is also an elected township fiscal officer, who serves a four-year term beginning on April 1 of the year after the election, which is held in November of the year before the presidential election. Vacancies in the fiscal officership or on the board of trustees are filled by the remaining trustees.

It appears 1,275 times (exactly, word-for-word), only in articles about small towns in Ohio. It’s part of the Ohio Revised Code which as far as I can tell applies to all Ohioan townships.

What is an n-gram?

An n-gram is a sequence of words which appears often in a text, or in a corpus of texts. Finding these is helpful for document classification as repeated sequences of words often have a meaning different than their individual words would suggest. An “Academy Award” isn’t just an award from an academy, but a specific award from the Academy of Motion Picture Arts and Sciences. Letting our models know during training that “Academy Award” may mean something different than “academy” and “award” can make it easier to identify fine-grained topics like this.

How I found this

After cleaning my Wikipedia corpus, I took several additional steps to normalize it:

I made everything lowercase
Sequences of numbers became “_num_“
A “_num_” followed by a percent symbol is replaced with “_percent_“
Certain recognized date formats become “_date_“, for example “10 January 1995“

That is, for the purposes of my analysis I consider numbers to be “semantically identical”. The phrase “he was 29 years old” doesn’t have a significantly different meaning from “he was 81 years old”. For first-pass machine learning models the noise and variance introduced by treating every number as a completely different “word” is rarely useful.

Beyond the above, I define a “word” to be any sequence of characters matching the regex “[\w']+“, and I define words to be “distinct” if, and only if their lowercase forms exactly match.

To find the n-grams, I employed an iterative algorithm which takes as input all currently-known n-grams. On first iteration it only finds all 2-grams. The second iteration, it finds all 3-4-grams. Each iteration, the maximum n-gram length it finds doubles. You can find the successive commands I ran to do this in my GitHub repository.

Other long n-grams

The longest n-gram appearing more than 10,000 times is:

There were _num_ households out of which _percent_ had children under the age of _num_ living with them, _percent_ were married couples living together, _percent_ had a female householder with no husband present, and _percent_ were non families. _percent_ of all households were made up of individuals and _percent_ had someone living alone who was _num_ years of age or older. The average household size was _num_ and the average family size was _num_.

The longest (and most-frequent) n-gram appearing more than 100,000 times is:

from _num_ to _num_

which is recognizable as a range of numbers expressed in natural language.

I mean, technically it’s “_num_ _num_ _num_ _num_” because sequences of numbers appear so often, but I don’t like it since it doesn’t have any “real words”.

The longest n-gram appearing more than 100,000 times which only contains alphabetic words is “a member of the“. Most n-grams of this length serve common structural purposes such as this.

In general, the most common n-grams of length longer than 5 are from templates, where article content has been imported from another source, or where it is very clear that the same author uses identical language in a long list of articles. For example:

4,966: ... is a moth in the family Crambidae it was described by ...
3,935: In the CDP the population was spread out, with _percent_ under the age of _num_ _percent_ ...
4,907: ... together _percent_ had a female householder with no ...
4,003: ... is a protein that in humans is encoded by the ...

Charts

I made graphs of the longest n-gram of a given frequency. I’ve dropped lengths for n-grams which appeared less often than longer n-grams. This first graph includes the special handling I use for numbers, dates, and percentages.

This second graph is for the “more traditional” n-grams which only include alphabetic words.

Conclusion

This is going to let me get rid of a lot of noise when I begin training models on the Wikipedia corpus. In this case, looking for n-grams allowed me to find a lot of duplicated, stilted, not-really-human language which would have messed up my model’s ability to understand how words are used.

Rather than consider all uses of words like “elected” as the same, I can consider it a special case when it appears in the 103-word monstrosity I mentioned at the beginning. Models are sensitive to frequency, and lots of cases of a word being used in an identical context incorrectly teaches the model that this is common or expected. In reality, a human composed the passage once, and it was likely automatically duplicated (or manually copy+pasted) over a thousand times. There isn’t much to be learned from such usage. Instead, for cases like this the 103-word n-gram provides strong evidence that an article is about a very specific type of thing – a township in Iowa, which is exactly what I want my model to learn from this.

On a smaller scale, being able to treat n-grams like “the United States” as entities distinct from words like “states” and “united” means that the model won’t be tricked into confusing “United Airlines” or “States Records” with “the United States”.

This is the sort of analysis it is almost always worth doing on a corpus – while going through this work I iterated dozens of times on the code I used to clean the corpus, as each time I found cases where there were bits of formatting or common human mistakes that caused this analysis to give strange results. Now that I can get results like this, I can be confident that overall my data is clean enough for more complex work.