Tokenization in Modern NLP: From Words to Subwords
Tokenization is a fundamental step in every NLP system. While traditional approaches split text into words by spaces, modern models like GPT or BERT use more advanced techniques like Byte-Pair Encoding (BPE) and WordPiece. These algorithms can elegantly solve the out-of-vocabulary (OOV) word problem and efficiently represent extensive vocabularies.
Why Classical Tokenization Isn’t Enough
Imagine you’re training a model on English texts and encounter the word “unhappiness”. A classical word-based tokenizer would either add the whole word to the vocabulary (if it appears frequently enough) or mark it as an unknown token. Both options have drawbacks:
- Large vocabulary takes more memory and slows down training
- Unknown tokens cause information loss
- The model cannot learn morphology and word formation
Modern subword tokenization solves these problems by dividing words into smaller meaningful units.
Byte-Pair Encoding (BPE)
BPE originally emerged as a compression algorithm but found application in NLP. The algorithm works as follows:
```python
# Simple BPE implementation
def get_pairs(vocab):
    """Count all pairs of adjacent symbols, weighted by word frequency."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair in the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab
```
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pairs. For the word “unhappiness”, the process might look like this:
```
# Original state
"u n h a p p i n e s s"

# After several iterations
"un hap p i ness"

# Final tokenization
["un", "hap", "pi", "ness"]
```
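The iterative merging described above can be driven by a short training loop. The sketch below is self-contained (it restates the pair-counting and merging logic inline); the toy corpus and the number of merges are illustrative assumptions, not real training data:

```python
def train_bpe(vocab, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] = pairs.get((a, b), 0) + freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        bigram, merged = ' '.join(best), ''.join(best)
        vocab = {w.replace(bigram, merged): f for w, f in vocab.items()}
    return vocab

# Words stored as space-separated characters with corpus frequencies.
corpus = {'u n h a p p i n e s s': 3, 'h a p p y': 5, 'u n d o': 2}
print(train_bpe(corpus, num_merges=5))
# {'un happ i n e s s': 3, 'happy': 5, 'un d o': 2}
```

Note that frequent words ("happy") collapse into single tokens quickly, while rarer words stay split into subwords.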
WordPiece Algorithm
WordPiece, used in BERT, is similar to BPE but with one important difference: instead of selecting the most frequent pairs, it selects the pairs that maximize the likelihood of the training data. It also uses the special prefix “##” to mark subword tokens that do not start a word:
```
# WordPiece tokenization
"unhappiness" → ["un", "##hap", "##pi", "##ness"]
"playing" → ["play", "##ing"]
"hello" → ["hello"]  # frequent word remains whole
```
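At inference time, WordPiece segments a word by greedy longest-match-first search over the vocabulary. A minimal sketch, with a toy vocabulary chosen for illustration (real vocabularies have tens of thousands of entries):

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first segmentation, as used by BERT's WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        current = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if current is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        tokens.append(current)
        start = end
    return tokens

vocab = {'un', '##hap', '##pi', '##ness', 'play', '##ing', 'hello'}
print(wordpiece_tokenize('unhappiness', vocab))  # ['un', '##hap', '##pi', '##ness']
print(wordpiece_tokenize('playing', vocab))      # ['play', '##ing']
```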
Practical Implementation with Hugging Face
In practice, we usually use ready-made implementations. Hugging Face Transformers provides a simple API:
```python
from transformers import AutoTokenizer

# GPT-2 uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is important for NLP"
tokens = gpt2_tokenizer.tokenize(text)
print(tokens)
# ['Token', 'ization', 'Ġis', 'Ġimportant', 'Ġfor', 'ĠNL', 'P']
# ('Ġ' is how GPT-2's byte-level BPE marks a preceding space)

# BERT uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("unhappiness")
print(tokens)
# ['un', '##hap', '##pi', '##ness']
```
Training a Custom BPE Tokenizer
For specialized domains, we often need our own tokenizer. SentencePiece is a popular library for this purpose:
```python
import sentencepiece as spm

# Training a BPE model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='my_bpe',
    vocab_size=32000,
    model_type='bpe',
    character_coverage=0.9995,
    split_by_unicode_script=True,
    split_by_number=True,
)

# Loading and usage
sp = spm.SentencePieceProcessor(model_file='my_bpe.model')
tokens = sp.encode('Custom tokenizer for English data')
print(tokens)
print(sp.decode(tokens))
```
Optimization for Specific Languages
Different languages have characteristics that affect tokenization. Rich morphology, diacritics, and relatively free word order all require attention:
```
# Example: related word forms in a morphologically rich setting
text = "Programming, I program, I programmed"

# A poorly configured tokenizer might produce:
# ["Program", "ming", ",", " I", " program", ",", " I", " program", "med"]

# Better tokenization recognizes the shared root:
# ["program", "##ming", ",", " I", " program", ",", " I", " program", "##med"]
```
For better results with morphologically rich languages, we recommend:
- Higher character_coverage (0.9999) due to diacritics
- Pre-trained models for specific languages when available
- Preprocessing for diacritic normalization if the task allows
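The diacritic normalization mentioned in the last point can be done with Python's standard unicodedata module, for example:

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks: decompose (NFD), filter them out, recompose (NFC)."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

print(strip_diacritics('žluťoučký kůň'))  # zlutoucky kun
```

Apply this only when the task allows it: stripping diacritics loses information and can conflate distinct words.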
Performance and Memory Requirements
Choosing the vocabulary size is a trade-off between performance and quality. A larger vocabulary means:
- Shorter token sequences → faster inference
- Larger embedding matrix → higher memory requirements
- More parameters → slower training
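To make the memory point concrete, here is a back-of-envelope calculation assuming 768-dimensional float32 embeddings (the hidden size of BERT-base; both sizes are illustrative choices, not fixed requirements):

```python
def embedding_mb(vocab_size, dim=768, bytes_per_param=4):
    """Memory taken by the embedding matrix alone, in MiB."""
    return vocab_size * dim * bytes_per_param / 1024**2

print(f"{embedding_mb(32000):.1f} MB")   # 93.8 MB
print(f"{embedding_mb(100000):.1f} MB")  # 293.0 MB
```

Tripling the vocabulary roughly triples the embedding memory, and the output softmax layer usually pays the same cost again.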
```python
# Tokenization analysis
def analyze_tokenization(tokenizer, texts):
    total_tokens = 0
    total_chars = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        total_chars += len(text)
    compression_ratio = total_chars / total_tokens
    print(f"Compression ratio: {compression_ratio:.2f} chars/token")
    return compression_ratio
```
```python
# Comparing different tokenizers on a small sample corpus
sample_texts = ["Tokenization is important for NLP"]  # any list of strings
gpt2_ratio = analyze_tokenization(gpt2_tokenizer, sample_texts)
bert_ratio = analyze_tokenization(bert_tokenizer, sample_texts)
```
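Because analyze_tokenization only needs an object with a tokenize() method, the metric can be sanity-checked without downloading any model. The whitespace "tokenizer" below is a stand-in for illustration, not a real subword tokenizer, and the ratio computation restates the same metric so the snippet runs on its own:

```python
class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.split()

def compression_ratio(tokenizer, texts):
    # Same chars-per-token metric as analyze_tokenization, restated standalone.
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_chars / total_tokens

texts = ["Tokenization is important for NLP"]
print(f"{compression_ratio(WhitespaceTokenizer(), texts):.2f} chars/token")
# 6.60 chars/token (33 characters / 5 tokens)
```

Subword tokenizers typically land below whole-word splitting on this metric, since they trade longer sequences for a bounded vocabulary.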
Summary
Tokenization is a critical first step in every NLP pipeline. BPE and WordPiece algorithms elegantly solve the OOV word problem and enable efficient text representation. When choosing a tokenizer, consider target language specifics, data size, and performance requirements. For morphologically rich languages, we recommend using pre-trained models or carefully setting parameters when training custom tokenizers.