Overnormalization

Why is it hard for computers to understand language?

This question plagues many a developer of NLP (Natural Language Processing) systems. While there are certain aspects of language we don’t know how to process yet, we often oversimplify language to make it easier for computers, at the expense of preserving its meaning. This isn’t a problem of processing power, but a conceptual limitation of the humans designing these machines. Systems that try to interpret English share many common mistakes, and these cause the sorts of problems that make users think computers will never really be able to interact on a human level.

An Aside: Normalization

Before processing text, most NLP systems normalize it in some way to make it easier for the computer to understand. This may include steps like correcting obvious spelling mistakes, lowercasing all letters, and removing superfluous spacing. The idea is that none of these modifications really changes the meaning of the text, and there’s no need to develop a machine that can (and thus has to) learn that extra spaces in the middle of a sentence rarely mean anything interesting.
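As a rough illustration, two of the normalization steps mentioned above (lowercasing and removing extra spacing) might look like the sketch below. The function name normalize is made up for this example, and spelling correction is left out because it needs a dictionary.

```python
import re

def normalize(text: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase the text."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.lower()

print(normalize("  The   quick  brown fox  "))  # the quick brown fox
```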

Overnormalization: Ignoring Capitalization

Making all text lowercase before processing makes sense. For example, there’s no significant difference between For (as it would appear at the start of a sentence) and for in the middle. By default, a machine would treat For and for as completely different words and not know that they are closely related. By lowercasing all text, we eliminate this class of mistake. It can also substantially reduce the number of distinct words the machine has to learn. This can be a massive help since most words rarely appear capitalized, and if the machine sees the capitalized form of a word for the first time in the wild (rather than in training), it will immediately connect it to the word it already understands. There might simply not be enough training data for a computer to learn that Lanthanide and lanthanide are the same word just from context. However, this simplification has unintended side effects.

Consider these three sentences and their intended meaning.

I saw it. = I saw [something referenced in another sentence].

I saw It. = I saw [the film It].

I saw IT. = I saw [the Information Technology department].

Any system that blindly lowercases everything will treat these sentences identically and appear comically inept. They have three completely different meanings, and as humans we can see the distinction immediately. Worse, low-level machine learning models trained on lowercased text are then used to feed more sophisticated models, which inherit the mistake.
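A quick way to see the collapse, using the three example sentences above (a toy check, not part of any real pipeline):

```python
sentences = ["I saw it.", "I saw It.", "I saw IT."]

# After blind lowercasing, all three sentences become the same string.
lowercased = {s.lower() for s in sentences}

print(len(set(sentences)), len(lowercased))  # 3 1
```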

A technique we use to teach computers relationships between words is word embedding, a type of model that can be thought of as a map containing every word: words closer to each other have “more similar” meanings than those far apart. The model “learns” from a bunch of sample sentences we feed it, usually hundreds of thousands. In this case, by lowercasing everything we told the machine that it, It, and IT all mean the same thing: they have the “same location”. This corrupts the computer’s understanding not only of those three words, but of anything even tangentially related to them. Words related to IT, like administrator and support, will incorrectly be considered similar to words near it, such as that and this. If that faulty word embedding is then used to train an even more complex model, the problems will compound. Consider that there are hundreds of examples where capitalization matters in English, and it becomes clear that there will be many bits of language the computer has trouble understanding.

Solution: Variable Granularity

We appear to have competing requirements:

  • We want our machine to ignore differences in capitalization when they don’t matter.
  • We want our machine to pay attention to differences in capitalization when they do matter.

I suggest adding a third requirement, one that suggests a solution.

  • We don’t want to have to tell our machine each case where capitalization matters.

We could literally enumerate all instances where people capitalized words in a non-standard way, but that isn’t practical and the system won’t automatically figure out new instances. If we could automatically detect when capitalization mattered, then the first two requirements become non-issues.

A word embedding needs about one hundred example word usages to “learn” what a word means and use it as an anchor point to understand similar words. While educated may appear in many hundreds of sentences, erudite may appear in only a few, but it will appear in contexts similar enough to educated that the machine will figure out that the words are very similar. We can leverage this limitation: if a word appears fewer than 100 times, we accept that the machine will sometimes make mistakes with it.

We can turn this threshold into a rule that determines whether a given capitalization of a word gets its own entry in the word embedding:

  • If the exact capitalization occurs 100 or more times, make an entry for it.
  • If the exact capitalization occurs fewer than 100 times, use the entry for the most common capitalization (or create one if it does not exist).

Word   Appearances   Entry
rest   5             rest
REST   3             rest
Rest   1             rest
reST   1             rest

Even though most of these do have significantly different meanings, there wouldn’t be enough information for the computer to figure out the difference. Now suppose we collect significantly more data.

Word   Appearances   Entry
rest   500           rest
REST   200           REST
Rest   10            rest
reST   5             rest


There is now ample data for the machine to see that rest and REST are used very differently. Both should have their own entries in the word embedding. Until Rest and reST have enough training examples, they will be grouped under a default, probably rest as it is the most common. While this correctly labels Rest as identical to rest, it still incorrectly groups reST with them.
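The threshold rule can be sketched in a few lines. This is only one possible reading of it: the function name assign_entries and the threshold parameter are invented for illustration, and ties for "most common capitalization" are broken by whichever form was counted first, which the rule itself does not specify.

```python
from collections import defaultdict

def assign_entries(counts, threshold=100):
    """Map each exact capitalization to the embedding entry it should use."""
    # Group surface forms that differ only in capitalization.
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[word.lower()][word] = n

    entries = {}
    for forms in groups.values():
        # The most common capitalization is the fallback entry (assumed tie-break: first seen).
        fallback = max(forms, key=forms.get)
        for word, n in forms.items():
            # Frequent forms get their own entry; rare forms share the fallback.
            entries[word] = word if n >= threshold else fallback
    return entries

small = {"rest": 5, "REST": 3, "Rest": 1, "reST": 1}
large = {"rest": 500, "REST": 200, "Rest": 10, "reST": 5}

print(assign_entries(small))
# {'rest': 'rest', 'REST': 'rest', 'Rest': 'rest', 'reST': 'rest'}
print(assign_entries(large))
# {'rest': 'rest', 'REST': 'REST', 'Rest': 'rest', 'reST': 'rest'}
```

With the small counts every form shares the rest entry; with the larger counts REST earns its own entry, matching the two tables.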

This method may initially seem to fail for highly common words:

Word   Appearances   Entry
the    10,000        the
The    1,000         The

In this case, the logic will unnecessarily create separate entries for the and The (I would argue there are meaningful distinctions, but that discussion would be its own post). This behavior only affects very frequent words, and because those words are so frequent, the machine will have enough information to learn that the two forms are very similar. Processing time will be several percent slower since this increases vocabulary size by several thousand words, but when we’re dealing with hundreds of thousands of unique words this isn’t a major issue.

The main limitation of this algorithm is that if there is little data for a given capitalization, the machine will automatically assign it the meaning of the most common capitalization. However, this is still better than the behavior of most current systems, which unconditionally assign all capitalizations the same meaning.
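To make the limitation concrete, here is a sketch of rewriting tokens with the entries from the larger rest table before training. The remap helper and the sample sentence are made up for this example, and passing unknown tokens through unchanged is an assumption.

```python
# Entries taken from the larger "rest" table above.
entries = {"rest": "rest", "REST": "REST", "Rest": "rest", "reST": "rest"}

def remap(tokens, entries):
    """Replace each token with its embedding entry; unknown tokens pass through."""
    return [entries.get(token, token) for token in tokens]

tokens = "Write the docs in reST and expose a REST API".split()
print(remap(tokens, entries))
# ['Write', 'the', 'docs', 'in', 'rest', 'and', 'expose', 'a', 'REST', 'API']
```

REST keeps its own meaning, while reST is folded into rest until enough examples of it have been collected.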

Conclusion

Understanding language is hard. Taking shortcuts can still produce cool results, but it introduces additional limitations for anything that depends on them. Approximations make things easier (remember the spherical cow from physics), but they always produce imperfect models. In the case of capitalization and language processing, getting rid of this approximation is relatively straightforward, and we can immediately realize benefits while paying only a small computational cost.

Either way, whether you are a layperson or someone consuming a product that promises “natural language understanding”, be aware that these approximations (and their associated problems) exist, and consider the harm that could be caused by neglecting them.
