Subword Secrets: The Impact and Limitations of BPE Tokenization

The picture is sourced from Google DeepMind.

Preface/Context

Dear readers,

Here I will explore the popular tokenization method Byte Pair Encoding (BPE). Since it is becoming all the rage, I thought it fitting to give a brief overview of what it does, but mostly to focus on one of the key limitations of this technique and the areas where that limitation matters quite a bit (computational biology dealing with genomics, among many others). Please consider this an informative segment, and I encourage all readers to think of a solution for it.

Understanding

So what is Byte Pair Encoding? Simply put, it is a tokenization method that converts the words in a text into tokens, which are in turn mapped to the vector representations that become inputs for the model one is aiming to train. Some of you learned readers might point out the following:

While there are many such tokenization methods, why use BPE at all?

K-mer Tokenization

To answer that, let's take one such tokenization method, K-mer tokenization, and build our understanding from there.

(Figure: K-mer tokenization.)

The image above illustrates K-mer tokenization, which breaks a sequence down into chunks of length k in one of two ways:

  1. Overlapping tokens
  2. Non-overlapping tokens

In the overlapping case, the tokens surrounding a given token (one before and one after) entirely leak that token, so masking it serves no purpose in the pre-training stage (if one is using masked language modeling).

In the non-overlapping case, the token-leakage problem is solved, but new problems come up. For example:

  1. Removing a single letter (say, from a given word) changes the tokenized sequence dramatically.

Intuitively, this amounts to saying the sentence “I am hungry” is dramatically different from “am hungry”, which doesn't make sense and therefore makes it very hard for the language model to learn effectively.
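To make both K-mer schemes concrete, here is a minimal sketch. The kmerize helper and the toy sequence are my own illustration, not part of any particular library:

def kmerize(sequence, k, overlapping=True):
    """Split a sequence into k-mers, either overlapping (stride 1) or non-overlapping (stride k)."""
    stride = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

seq = "ACGTACGT"

# Overlapping: neighbouring tokens share k-1 characters, so a masked
# token can be reconstructed from the tokens next to it (token leakage).
print(kmerize(seq, 3, overlapping=True))       # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']

# Non-overlapping: no leakage, but deleting one leading character shifts
# every boundary and yields a completely different token list.
print(kmerize(seq, 3, overlapping=False))      # ['ACG', 'TAC']
print(kmerize(seq[1:], 3, overlapping=False))  # ['CGT', 'ACG']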

Byte Pair Encoding Tokenization

BPE overcomes this problem in a rather unique way. Let's briefly understand how it works.

(Figure: Byte Pair Encoding tokenization.)

For our purposes, let’s define BPE tokenization.

Byte Pair Encoding (BPE) is a subword tokenization technique that iteratively merges the most frequent pair of adjacent characters or character sequences into a single new token. This process helps handle unknown (out-of-vocabulary) words more effectively in NLP tasks.
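As a toy illustration (the words and counts below are made up, and the end-of-word symbol is omitted for brevity): take a corpus containing “hug” ×10, “pug” ×5 and “hugs” ×5, split into characters as h u g, p u g and h u g s. The adjacent-pair frequencies are (h, u) = 15, (u, g) = 20, (p, u) = 5 and (g, s) = 5, so the first merge creates the token “ug”. After applying it, the most frequent remaining pair is (h, ug) = 15, which merges into “hug”, and so on until the requested number of merges is reached.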

The BPE algorithm can be summarized by the following pseudocode (not exhaustive):

1. Define a function BPE_Tokenize(Vocabulary, Num_Merges):
    a. Input: Vocabulary (a dictionary of word frequencies), Num_Merges (the number of merge operations to perform)
    b. Output: The final vocabulary with merged tokens and the merge rules

2. Initialize a dictionary to keep track of merge operations (Merge_Rules)

3. For each word in the Vocabulary:
    a. Split the word into characters (or subwords if already partially merged) with a special end-of-word symbol (like </w>) to distinguish between identical sequences within and at the end of words

4. Repeat Num_Merges times:
    a. Calculate the frequency of all adjacent pairs of characters/subwords in the Vocabulary
    b. Find the most frequent pair (Pair_To_Merge)
    c. If no pairs can be merged, exit the loop
    d. Record the merge operation in Merge_Rules
    e. For each word in the Vocabulary:
        i. Replace occurrences of Pair_To_Merge with a new token (a merged version of the pair)
    f. Update the Vocabulary with the merged tokens

5. Return the modified Vocabulary and Merge_Rules

6. Define a function Apply_BPE_Tokenization(Text, Merge_Rules):
    a. Input: Text (the text to tokenize), Merge_Rules (the merge operations from BPE_Tokenize)
    b. Output: Tokenized text according to BPE

7. For each merge rule in Merge_Rules:
    a. For each word in Text:
        i. If the merge rule applies, replace the corresponding pair of characters/subwords with the merged token

8. Return the tokenized Text
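The pseudocode above translates fairly directly into Python. The sketch below is a minimal, unoptimized illustration of the same steps; the function names and the toy corpus are my own, and production tokenizers (for example Hugging Face's tokenizers library) implement the same idea far more efficiently.

from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dictionary."""
    # Represent each word as a tuple of symbols, ending with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merge_rules = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merge_rules.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        vocab = {tuple(_merge(symbols, best)): freq for symbols, freq in vocab.items()}
    return vocab, merge_rules

def _merge(symbols, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def bpe_tokenize(word, merge_rules):
    """Tokenize a single word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merge_rules:
        symbols = _merge(symbols, pair)
    return symbols

# Toy corpus (counts are made up for illustration).
corpus = {"hug": 10, "pug": 5, "hugs": 5}
vocab, rules = bpe_train(corpus, num_merges=3)
print(rules)                        # the learned merge operations, in order
print(bpe_tokenize("pugs", rules))  # an unseen word still splits into known subwords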

Since BPE operates on subword frequencies, there are many advantages to using this technique. A non-exhaustive list:

  1. Sequences of longer length can be encoded succinctly, which makes the approach scalable.
  2. It overcomes token leakage.
  3. It efficiently handles unknown words (words that don't occur in the vocabulary); see the short check after this list.
  4. It improves efficiency on large datasets, since the vocabulary size can be reduced significantly.
  5. It handles domain-specific language (medical, legal, genomics texts, etc.) better.
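On point 3: because any string can ultimately fall back to characters or bytes, a BPE tokenizer rarely needs a dedicated “unknown” token for ordinary text. A quick check with an off-the-shelf BPE tokenizer (GPT-2's, chosen arbitrarily):

from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE: even a rare word decomposes into
# subword pieces that already exist in its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("electroencephalography"))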

Limitations

Now that we somewhat understand what BPE is and how it operates, let’s take a look at one of its limitations as I mentioned at the beginning of the blog.

For our purposes, I will consider the domain of genomics, adapting the work from the DNABERT-2 paper.

⚠️NOTE: No domain specific (genomics) knowledge is required nor assumed to understand the following.

In many cases in genomics, a smaller part of a DNA/RNA sequence is treated as a separate sequence in its own right, with a functionality of its own. Correlation and token consistency therefore become absolutely essential, so that the language model can recognize the relationship between the smaller segment and the original DNA/RNA sequence it is a sub-part of.

from transformers import AutoTokenizer

# Load the BPE tokenizer that ships with DNABERT-2.
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

dna_sequence = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
# part_of_dna is an exact sub-string of dna_sequence.
part_of_dna = "CGACACTTGGTTATCGATCTACGAG"

# Tokenize both sequences and extract the token ids.
dna_tokens = tokenizer(dna_sequence, return_tensors='pt')["input_ids"][0].tolist()
part_of_dna_tokens = tokenizer(part_of_dna, return_tensors='pt')["input_ids"][0].tolist()

print(f"DNA sequence tokens: \n{dna_tokens}")
print(f"Part of DNA sequence tokens: \n{part_of_dna_tokens}")

Output:

DNA sequence tokens: 
[1, 5, 194, 32, 757, 1239, 2092, 294, 24, …]

⚠️NOTE: The tokens 1 & 2 are special tokens corresponding to [CLS] and [SEP] tokens.

As you can see, only a few tokens overlap ([359, 88, 93]), even though the entire part_of_dna sequence is a sub-sequence of the original dna_sequence.
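If you want to check this yourself, a small follow-up to the snippet above (reusing dna_tokens and part_of_dna_tokens) makes the mismatch explicit:

# Strip the special tokens ([CLS] = 1, [SEP] = 2) before comparing.
core = [t for t in dna_tokens if t not in (1, 2)]
part_core = [t for t in part_of_dna_tokens if t not in (1, 2)]

# How many of the sub-sequence's tokens also occur in the full sequence's tokens?
shared = [t for t in part_core if t in core]
print(f"{len(shared)} of {len(part_core)} sub-sequence tokens appear in the full sequence")

# Is the sub-sequence's token list a contiguous slice of the full sequence's token list?
is_contiguous = any(core[i:i + len(part_core)] == part_core for i in range(len(core) - len(part_core) + 1))
print(f"Contiguous sub-list: {is_contiguous}")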

This is one of the key limitations of the BPE tokenization method. That said, there are a couple of approaches that can be built on top of BPE tokenization to effectively eliminate or reduce this problem.

I encourage all curious readers to think about this problem and why it happens; perhaps you will come up with a solution for it.

Footnote

I request all readers to explore this further and perhaps point out any errors I might have made. I am a human and a curious thinker after all, one among you. Any and all comments are welcome, and I hope to learn from you all. I will also attach the DNABERT-2 paper referenced in this blog; please feel free to read it.

Thanks for taking the time. Cheers.

