Problem 2.1. Deduplicating a Linked List with Maps

This problem is also from Cracking the Coding Interview, from Chapter 2.

Write code to remove duplicates from an unsorted linked list. How would you solve this problem if a temporary buffer is not allowed?

This is one of the classics, and I’ve long been curious whether there are significant performance improvements to be made over the naive solutions. The restriction in the second sentence of the question rules out the more interesting data structures, but the restricted version has interesting optimization properties of its own, so I’ll handle it in a separate post.

Initial Clarifications

The problem doesn’t specify what data each node holds, but we can reasonably infer from the problem statement that “duplicates” here means two nodes carrying the same value.

I’ll use integers as the data type, since comparing them takes very little time. In practice, if the data held by each node were complex, or if it were otherwise time-consuming to determine equivalence, it would be worth minimizing the number of comparisons.

For the linked list itself we don’t need anything complex. I’ve added a convenience method to compute the length of a list as it’s used in many obvious optimizations. Linked List Code.

type Node[T comparable] struct {
  Value T
  Next  *Node[T]
}

// Length walks the list and returns the number of nodes.
func (n *Node[T]) Length() int {
  length := 0
  for ; n != nil; n = n.Next {
    length++
  }
  return length
}
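As a quick sanity check of Length, here’s a small helper that builds a list from a slice. FromSlice is my own convenience function, not part of the linked code:

```go
package main

import "fmt"

// Node and Length as defined in the post.
type Node[T comparable] struct {
  Value T
  Next  *Node[T]
}

func (n *Node[T]) Length() int {
  length := 0
  for ; n != nil; n = n.Next {
    length++
  }
  return length
}

// FromSlice builds a linked list holding the slice's values in order.
func FromSlice[T comparable](values []T) *Node[T] {
  var head *Node[T]
  for i := len(values) - 1; i >= 0; i-- {
    head = &Node[T]{Value: values[i], Next: head}
  }
  return head
}

func main() {
  head := FromSlice([]int{1, 2, 2, 3})
  fmt.Println(head.Length()) // prints 4
}
```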

The Naive Map Solution

First – and the expected solution in interviews – is one that uses a map to record the set of seen integers. In the benchmark I’m assuming effectively no duplicates, to make this a sort of worst-case scenario (using rand.Int to produce random non-negative ints). Test and Benchmark Code.

// RemoveDuplicatesMap removes nodes with already-seen values from a linked list.
// Uses a map to record seen values.
func RemoveDuplicatesMap(head *Node[int]) {
  if head == nil {
    return
  }

  seen := make(map[int]bool, head.Length())

  for ; head != nil; head = head.Next {
    seen[head.Value] = true

    // Splice out any run of following nodes whose values have been seen.
    for head.Next != nil && seen[head.Next.Value] {
      head.Next = head.Next.Next
    }
  }
}
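To see the splice logic in action, here’s a small self-contained walkthrough; the hand-built list and the printing loop are mine, just for demonstration:

```go
package main

import "fmt"

// Node, Length, and RemoveDuplicatesMap as defined in the post.
type Node[T comparable] struct {
  Value T
  Next  *Node[T]
}

func (n *Node[T]) Length() int {
  length := 0
  for ; n != nil; n = n.Next {
    length++
  }
  return length
}

func RemoveDuplicatesMap(head *Node[int]) {
  if head == nil {
    return
  }
  seen := make(map[int]bool, head.Length())
  for ; head != nil; head = head.Next {
    seen[head.Value] = true
    for head.Next != nil && seen[head.Next.Value] {
      head.Next = head.Next.Next
    }
  }
}

func main() {
  // Build 3 -> 1 -> 3 -> 2 -> 1 by hand.
  head := &Node[int]{Value: 3, Next: &Node[int]{Value: 1, Next: &Node[int]{Value: 3, Next: &Node[int]{Value: 2, Next: &Node[int]{Value: 1}}}}}
  RemoveDuplicatesMap(head)
  for n := head; n != nil; n = n.Next {
    fmt.Print(n.Value, " ")
  }
  fmt.Println() // prints: 3 1 2
}
```

Note that the first occurrence of each value is kept and later duplicates are dropped, which preserves the original relative order of the surviving nodes.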

The Manual Hash Set Solution

Go’s maps are far more robust than this problem requires, which means they carry overhead for things like general-purpose hash functions and collision-handling logic.

So what if we create our own implementation that doesn’t do any of that? In our case, since we know the integers’ bits are uniformly random, we can use a modulus as the hash function. If we limit the map size to powers of 2, this is equivalent to the (much faster) & operation with a bit mask. The backing array holds the seen integers, and on hash collisions we increment to find the first available slot (linear probing). Finally, we make the HashSet about twice as large as we need so collisions are infrequent (so 2048 slots for lists of size 1000, not 1024). HashSet Code. Solution Code.

// HashSet is a statically sized hash set of integers which uses
// the low bits of each value as its hash.
type HashSet struct {
  mask   int
  values []int
}

// NewHashSet constructs a HashSet which can contain up to 2^size entries
// and initializes it with value.
func NewHashSet(size, value int) *HashSet {
  result := &HashSet{
    mask:   1<<size - 1,
    values: make([]int, 1<<size),
  }
  result.values[value&result.mask] = value

  return result
}

// Insert adds the value to the set.
// Returns true if the value is already present.
// Does not handle the case where the set is full and a new value is added,
// and treats 0 as the empty-slot marker, so 0 itself cannot be stored.
func (m *HashSet) Insert(value int) bool {
  idx := value & m.mask
  for m.values[idx] != 0 {
    if m.values[idx] == value {
      return true
    }
    // Linear probing: step to the next slot, wrapping around.
    idx = (idx + 1) & m.mask
  }
  m.values[idx] = value

  return false
}
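The post links to the full solution code rather than inlining it; here’s my best-guess sketch of how the dedup loop would use this HashSet. The function name and the power-of-two sizing loop are my assumptions, and the sketch inherits the HashSet’s limitation that 0 marks an empty slot:

```go
package main

import "fmt"

// Node, Length, HashSet, NewHashSet, and Insert as defined in the post.
type Node[T comparable] struct {
  Value T
  Next  *Node[T]
}

func (n *Node[T]) Length() int {
  length := 0
  for ; n != nil; n = n.Next {
    length++
  }
  return length
}

type HashSet struct {
  mask   int
  values []int
}

func NewHashSet(size, value int) *HashSet {
  result := &HashSet{
    mask:   1<<size - 1,
    values: make([]int, 1<<size),
  }
  result.values[value&result.mask] = value
  return result
}

func (m *HashSet) Insert(value int) bool {
  idx := value & m.mask
  for m.values[idx] != 0 {
    if m.values[idx] == value {
      return true
    }
    idx = (idx + 1) & m.mask
  }
  m.values[idx] = value
  return false
}

// RemoveDuplicatesHashSet is a hypothetical reconstruction of the linked
// solution: size the set to the smallest power of two at least twice the
// list length, seed it with the head's value, and splice out later nodes
// whose values are already present. Because 0 marks an empty slot,
// duplicates of the value 0 would not be removed.
func RemoveDuplicatesHashSet(head *Node[int]) {
  if head == nil {
    return
  }
  size := 1
  for 1<<size < 2*head.Length() {
    size++
  }
  seen := NewHashSet(size, head.Value)
  for ; head != nil; head = head.Next {
    // Insert reports true for already-seen values, so drop those nodes.
    for head.Next != nil && seen.Insert(head.Next.Value) {
      head.Next = head.Next.Next
    }
  }
}

func main() {
  head := &Node[int]{Value: 3, Next: &Node[int]{Value: 1, Next: &Node[int]{Value: 3, Next: &Node[int]{Value: 1}}}}
  RemoveDuplicatesHashSet(head)
  for n := head; n != nil; n = n.Next {
    fmt.Print(n.Value, " ")
  }
  fmt.Println() // prints: 3 1
}
```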

Comparing the two via benchmark yields dramatic results – our manual implementation of a hash set is about 4x faster! Unsurprisingly (found via the “-benchmem” flag), HashSet only has one memory allocation to Map’s 33, and uses about 10% less memory despite preemptively allocating twice the number of keys.

List Size    Map ns/op    HashSet ns/op    Change
100          5,610        1,495            73.35%
1,000        52,836       9,664            81.71%
10,000       562,210      131,242          76.66%
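For reference, the kind of harness that could produce numbers like these can be run from an ordinary program via testing.Benchmark. The setup below is my sketch, not the post’s linked benchmark code:

```go
package main

import (
  "fmt"
  "math/rand"
  "testing"
)

// Node, Length, and RemoveDuplicatesMap as defined in the post.
type Node[T comparable] struct {
  Value T
  Next  *Node[T]
}

func (n *Node[T]) Length() int {
  length := 0
  for ; n != nil; n = n.Next {
    length++
  }
  return length
}

func RemoveDuplicatesMap(head *Node[int]) {
  if head == nil {
    return
  }
  seen := make(map[int]bool, head.Length())
  for ; head != nil; head = head.Next {
    seen[head.Value] = true
    for head.Next != nil && seen[head.Next.Value] {
      head.Next = head.Next.Next
    }
  }
}

// randomList builds a list of n random non-negative ints, matching the
// "effectively no duplicates" setup described above.
func randomList(n int) *Node[int] {
  var head *Node[int]
  for i := 0; i < n; i++ {
    head = &Node[int]{Value: rand.Int(), Next: head}
  }
  return head
}

func main() {
  for _, size := range []int{100, 1000} {
    result := testing.Benchmark(func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        b.StopTimer() // exclude list construction from the measurement
        head := randomList(size)
        b.StartTimer()
        RemoveDuplicatesMap(head)
      }
    })
    fmt.Printf("size %6d: %8d ns/op\n", size, result.NsPerOp())
  }
}
```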

There’s some interesting sub-linear behavior around 1,000 elements for both solutions, but I don’t have a good explanation for it. It’s nowhere near a cache size boundary and the memory requirement is too large for a single page on my machine.

Conclusion

As is generally the case, a brittle manual solution wins out over a general-purpose one. You wouldn’t want to use this HashSet implementation in production code – you’d likely want additional guardrails (such as tracking collisions to know when to grow automatically), which would cost additional processing time. It’s still neat to recreate the core idea of what hash maps are doing, and it’s surprising that Go’s maps incur so much overhead. As always, don’t try something like this unless you’ve actually benchmarked and measured that it is a problem for your use case!

This was fun, and I may explore some alternative solutions to this problem that I messed around with in the future.
