Using N-Grams to generate language
A single word in a text is a unigram. A sequence of two words is a bigram. We can have trigrams and so on and so forth. So the sequence of words “may I have” is a trigram, the sequences “may I” and “I have” are bigrams, and each individual word “may”, “I”, and “have” is a unigram.
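If it helps to see this in code, here is a minimal sketch in plain Python (nothing model-specific, just list slicing) of how that phrase breaks into unigrams, bigrams, and trigrams:

```python
# A minimal sketch: slicing the phrase "may I have" into n-grams with plain Python.
words = "may I have".split()

unigrams = words                                   # ['may', 'I', 'have']
bigrams = list(zip(words, words[1:]))              # [('may', 'I'), ('I', 'have')]
trigrams = list(zip(words, words[1:], words[2:]))  # [('may', 'I', 'have')]

print(unigrams, bigrams, trigrams, sep="\n")
```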
So as an example, let’s say I have a language-generating model on my PC (we will understand how it works in part 2) and I feed it this text:
“Once, Aunt Petunia, tired of Harry coming back from the barbers looking as though he hadn’t been at all, had taken a pair of kitchen scissors and cut his hair so short he was almost bald except for his bangs, which she left “to hide that horrible scar.” Dudley had laughed himself silly at Harry, who spent a sleepless night imagining school the next day, where he was already laughed at for his baggy clothes and taped glasses. Next morning, however, he had gotten up to find his hair exactly as it had been before Aunt Petunia had sheared it off. He had been given a week in his cupboard for this, even though he had tried to explain that he couldn’t explain how it had grown back so quickly.”
Extract from: Harry Potter and the Sorcerer’s Stone
How many unigrams, bigrams, and trigrams does this text have? It has as many unigrams as there are words in the text, one fewer bigram than that, and one fewer trigram still. The bigrams are “Once Aunt”, “Aunt Petunia”, and so on. The trigrams are “tired of Harry”, “of Harry coming”, and so on and so forth.
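If you would rather count them than eyeball them, here is one simple way to do it in Python. The tokenization (lowercasing and keeping runs of word characters) is my own simplification for the sketch, not anything built into a particular language model:

```python
import re
from collections import Counter

# Assume `text` holds the full extract above (truncated here for brevity).
text = "Once, Aunt Petunia, tired of Harry coming back from the barbers looking as though he hadn't been at all, ..."

# A very simple tokenization: lowercase and keep runs of word characters.
words = re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """All n-grams (as tuples) in a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

print(len(ngrams(words, 1)), "unigrams")   # one per word
print(len(ngrams(words, 2)), "bigrams")    # one fewer than the number of words
print(len(ngrams(words, 3)), "trigrams")   # two fewer than the number of words
print(Counter(ngrams(words, 2)).most_common(3))   # the most frequent bigrams
```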
Now I use the language model to generate random text based on this text. To generate the random text, I first ask it to use unigrams, with the word “the” as the starting point, and this is what it generates:
['aunt', 'taken', 'quickly', 'day', 'he', 'harry', 'left', 'she', 'as']
This is completely nonsensical and we can’t use it for anything. Now I ask the language model to use bigrams instead. This is what it generated:
['next', 'morning', 'however', 'he', 'was', 'already', 'laughed', 'at', 'for']
This sounds more sensible. Remember, it is using “the” as the starting point. Can you pull out the bigrams from the main text that it used to generate the above sentence?
“Once, Aunt Petunia, tired of Harry coming back from the barbers looking as though he hadn’t been at all, had taken a pair of kitchen scissors and cut his hair so short he was (4) almost bald except for his bangs, which she left “to hide that horrible scar.” Dudley had laughed himself silly at Harry, who spent a sleepless night imagining school the next (1) day, where he was already (5) laughed at for (6) his baggy clothes and taped glasses. Next morning (2), however, he (3) had gotten up to find his hair exactly as it had been before Aunt Petunia had sheared it off. He had been given a week in his cupboard for this, even though he had tried to explain that he couldn’t explain how it had grown back so quickly.”
Extract from: Harry Potter and the Sorcerer’s Stone
And here is the same exercise with trigrams:
['barbers', 'looking', 'as', 'though', 'he', 'had', 'been', 'given', 'a']
You can follow the same exercise as we did for the bigrams by trying to locate the trigrams in the main text. You will find the sentences make a lot more sense, but they also sound very similar to the original. So the larger the N in an n-gram, the more similar, or plagiarised, the generated text may sound compared to the original.
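To make the mechanics less mysterious, here is a rough sketch of how a bigram-based generator of this sort could work. It is not the exact model used above, just one plausible implementation: from each word, walk to a randomly chosen word that followed it somewhere in the text.

```python
import random
import re
from collections import defaultdict

# Assume `text` holds the full extract above (truncated here for brevity).
text = "Once, Aunt Petunia, tired of Harry coming back from the barbers looking as though he hadn't been at all, ..."
words = re.findall(r"\w+", text.lower())

# Map each word to the list of words that follow it anywhere in the text.
successors = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    successors[current].append(nxt)

def generate(start, length=9):
    """Random-walk the bigram table: repeatedly pick a word seen after the current one."""
    out = [start]
    for _ in range(length):
        candidates = successors.get(out[-1])
        if not candidates:        # dead end: the current word never appears mid-text
            break
        out.append(random.choice(candidates))
    return out[1:]                # drop the seed word, like the outputs shown above

print(generate("the"))
```

Extending the same idea to trigrams just means keying the table on the previous two words instead of one, which is why the output hugs the original text more closely.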
I hope the above gives you some flavor of how an n-gram language generator works. But what is the actual probability calculation behind a sequence of words? Even if we don’t know how to code, we can use simple probability calculations to work out the probabilities of unigrams, bigrams, trigrams, and n-grams in a given text. The LM uses these probability distributions to guess the next word, and this becomes the basis of n-gram language generation.
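As a tiny preview of part 2: one common way to estimate a bigram probability is to divide the count of the word pair by the count of the first word, that is, P(next | previous) = count(previous next) / count(previous). Here is that arithmetic for one pair of words from our extract; the counts are taken from the text above:

```python
# Counts taken from the extract above: "aunt" appears 2 times,
# and the bigram "aunt petunia" appears 2 times.
count_aunt = 2
count_aunt_petunia = 2

# Maximum-likelihood bigram probability:
#   P("petunia" | "aunt") = count("aunt petunia") / count("aunt")
p_petunia_given_aunt = count_aunt_petunia / count_aunt
print(p_petunia_given_aunt)   # 1.0, because "aunt" is always followed by "petunia" here
```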
We will explore this in the next part. See you there!