Tokenization in Modern NLP: From Words to Subwords
Tokenization is a fundamental step in every NLP system. While traditional approaches split text into words by spaces, modern models like GPT or BERT use more advanced techniques like Byte-Pair Encoding (BPE) and WordPiece. These algorithms can elegantly solve the out-of-vocabulary (OOV) word problem and efficiently represent extensive vocabularies.
Why Classical Tokenization Isn’t Enough
Imagine you’re training a model on English texts and encounter the word “unhappiness”. A classical word-based tokenizer would either add the whole word to the vocabulary (if it appears frequently enough) or mark it as an unknown token. Both options have drawbacks:
- Large vocabulary takes more memory and slows down training
- Unknown tokens cause information loss
- The model cannot learn morphology and word formation
Modern subword tokenization solves these problems by dividing words into smaller meaningful units.
Byte-Pair Encoding (BPE)
BPE originally emerged as a compression algorithm but found application in NLP. The algorithm works as follows:
```python
# Simple BPE implementation
def get_pairs(vocab):
    """Count all pairs of adjacent symbols, weighted by word frequency."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair in the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)
    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab
```
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pairs. For the word “unhappiness”, the process might look like this:
```
# Original state
"u n h a p p i n e s s"

# After several iterations
"un hap p i ness"

# Final tokenization
["un", "hap", "pi", "ness"]
```
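The iterative merging described above can be driven by a short training loop. The sketch below is self-contained (it restates the pair-counting and merging logic inline); the toy corpus and the number of merges are illustrative assumptions, not real training data:

```python
def train_bpe(vocab, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] = pairs.get((a, b), 0) + freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        bigram, merged = ' '.join(best), ''.join(best)
        vocab = {w.replace(bigram, merged): f for w, f in vocab.items()}
    return vocab

# Words stored as space-separated characters with corpus frequencies.
corpus = {'u n h a p p i n e s s': 3, 'h a p p y': 5, 'u n d o': 2}
print(train_bpe(corpus, num_merges=5))
# {'un happ i n e s s': 3, 'happy': 5, 'un d o': 2}
```

Note that frequent words ("happy") collapse into single tokens quickly, while rarer words stay split into subwords.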
WordPiece Algorithm
WordPiece, used in BERT, is similar to BPE but with one important difference: instead of selecting the most frequent pairs, it selects the pairs that maximize the likelihood of the training data. It also uses the special prefix “##” to mark subword tokens that do not start a word:
```
# WordPiece tokenization
"unhappiness" → ["un", "##hap", "##pi", "##ness"]
"playing" → ["play", "##ing"]
"hello" → ["hello"]  # frequent word remains whole
```
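At inference time, WordPiece segments a word by greedy longest-match-first search over the vocabulary. A minimal sketch, with a toy vocabulary chosen for illustration (real vocabularies have tens of thousands of entries):

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first segmentation, as used by BERT's WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        current = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if current is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        tokens.append(current)
        start = end
    return tokens

vocab = {'un', '##hap', '##pi', '##ness', 'play', '##ing', 'hello'}
print(wordpiece_tokenize('unhappiness', vocab))  # ['un', '##hap', '##pi', '##ness']
print(wordpiece_tokenize('playing', vocab))      # ['play', '##ing']
```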
Practical Implementation with Hugging Face
In practice, we usually use ready-made implementations. Hugging Face Transformers provides a simple API:
```python
from transformers import AutoTokenizer

# GPT-2 uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is important for NLP"
tokens = gpt2_tokenizer.tokenize(text)
print(tokens)
# ['Token', 'ization', 'Ġis', 'Ġimportant', 'Ġfor', 'ĠNL', 'P']
# ('Ġ' is how GPT-2's byte-level BPE marks a preceding space)

# BERT uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("unhappiness")
print(tokens)
# ['un', '##hap', '##pi', '##ness']
```
Training a Custom BPE Tokenizer
For specialized domains, we often need our own tokenizer. SentencePiece is a popular library for this purpose:
```python
import sentencepiece as spm

# Training a BPE model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='my_bpe',
    vocab_size=32000,
    model_type='bpe',
    character_coverage=0.9995,
    split_by_unicode_script=True,
    split_by_number=True,
)

# Loading and usage
sp = spm.SentencePieceProcessor(model_file='my_bpe.model')
tokens = sp.encode('Custom tokenizer for English data')
print(tokens)
print(sp.decode(tokens))
```
Optimization for Specific Languages
Different languages have characteristics that affect tokenization. Rich morphology, diacritics, and relatively free word order all require attention:
```
# Example: related word forms in a morphologically rich setting
text = "Programming, I program, I programmed"

# A poorly configured tokenizer might produce:
# ["Program", "ming", ",", " I", " program", ",", " I", " program", "med"]

# Better tokenization recognizes the shared root:
# ["program", "##ming", ",", " I", " program", ",", " I", " program", "##med"]
```
For better results with morphologically rich languages, we recommend:
- Higher character_coverage (0.9999) due to diacritics
- Pre-trained models for specific languages when available
- Preprocessing for diacritic normalization if the task allows
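The diacritic normalization mentioned in the last point can be done with Python's standard unicodedata module, for example:

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks: decompose (NFD), filter them out, recompose (NFC)."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

print(strip_diacritics('žluťoučký kůň'))  # zlutoucky kun
```

Apply this only when the task allows it: stripping diacritics loses information and can conflate distinct words.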
Performance and Memory Requirements
Choosing the vocabulary size is a trade-off between performance and quality. A larger vocabulary means:
- Shorter token sequences → faster inference
- Larger embedding matrix → higher memory requirements
- More parameters → slower training
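To make the memory point concrete, here is a back-of-envelope calculation assuming 768-dimensional float32 embeddings (the hidden size of BERT-base; both sizes are illustrative choices, not fixed requirements):

```python
def embedding_mb(vocab_size, dim=768, bytes_per_param=4):
    """Memory taken by the embedding matrix alone, in MiB."""
    return vocab_size * dim * bytes_per_param / 1024**2

print(f"{embedding_mb(32000):.1f} MB")   # 93.8 MB
print(f"{embedding_mb(100000):.1f} MB")  # 293.0 MB
```

Tripling the vocabulary roughly triples the embedding memory, and the output softmax layer usually pays the same cost again.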
```python
# Tokenization analysis
def analyze_tokenization(tokenizer, texts):
    total_tokens = 0
    total_chars = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        total_chars += len(text)
    compression_ratio = total_chars / total_tokens
    print(f"Compression ratio: {compression_ratio:.2f} chars/token")
    return compression_ratio
```
```python
# Comparing different tokenizers on a small sample corpus
sample_texts = ["Tokenization is important for NLP"]  # any list of strings
gpt2_ratio = analyze_tokenization(gpt2_tokenizer, sample_texts)
bert_ratio = analyze_tokenization(bert_tokenizer, sample_texts)
```
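Because analyze_tokenization only needs an object with a tokenize() method, the metric can be sanity-checked without downloading any model. The whitespace "tokenizer" below is a stand-in for illustration, not a real subword tokenizer, and the ratio computation restates the same metric so the snippet runs on its own:

```python
class WhitespaceTokenizer:
    def tokenize(self, text):
        return text.split()

def compression_ratio(tokenizer, texts):
    # Same chars-per-token metric as analyze_tokenization, restated standalone.
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_chars / total_tokens

texts = ["Tokenization is important for NLP"]
print(f"{compression_ratio(WhitespaceTokenizer(), texts):.2f} chars/token")
# 6.60 chars/token (33 characters / 5 tokens)
```

Subword tokenizers typically land below whole-word splitting on this metric, since they trade longer sequences for a bounded vocabulary.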
Summary
Tokenization is a critical first step in every NLP pipeline. BPE and WordPiece algorithms elegantly solve the OOV word problem and enable efficient text representation. When choosing a tokenizer, consider target language specifics, data size, and performance requirements. For morphologically rich languages, we recommend using pre-trained models or carefully setting parameters when training custom tokenizers.