Tokenization for Complex Scripts

This repository proposes a generic merge-based tokenization scheme that covers both concatenative and non-concatenative language structures. It enables better-fitting tokenization for complex scripts (such as SignWriting and Chinese) by decomposing words into smaller units and representing them in various graph structures.

Usage

Install:

pip install complex-tokenization

Train a tokenizer:

from complex_tokenization import BPETokenizer

tokenizer = BPETokenizer()
tokenizer.train(["the teacher teaches the thick thing"], num_merges=5)
print(tokenizer.get_merges())
# [(' ', 't'), ('h', 'e'), (' t', 'he'), (' t', 'e'), (' te', 'a')]
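
For readers new to merge-based tokenization, the training loop can be sketched in plain Python. This is a minimal illustration of classic BPE merge learning, not this package's implementation: the function name `learn_bpe_merges` is hypothetical, words are merged independently, and ties go to the pair seen first, so its output may differ from what `BPETokenizer` returns.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges; each word is merged independently."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["abab", "abc"], num_merges=2))
# [('a', 'b'), ('ab', 'ab')]
```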

Tokenizer Variants

All tokenizers accept units, pretokenizer, and variant-specific parameters:

from complex_tokenization import BPETokenizer, BNETokenizer, BoundlessBPETokenizer, SuperBPETokenizer

# BPE: standard byte-pair encoding (merge_size=2, word boundaries)
tok = BPETokenizer()

# BNE: byte-ngram encoding (merge up to n tokens at once)
tok = BNETokenizer(n=4)

# Boundless BPE: merges across word boundaries
tok = BoundlessBPETokenizer()

# Super BPE: intra-word merges first, then cross-word merges
tok = SuperBPETokenizer(disconnected_merges=50)
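
The difference between respecting and ignoring word boundaries comes down to which adjacent pairs are candidates for merging. The sketch below counts pairs both ways on a toy sentence. It is an assumption-laden simplification: it splits on single spaces rather than using the GPT pretokenization pattern, and `count_pairs` is a hypothetical helper, not part of this package.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent symbol pairs inside each sequence, never across sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


text = "to be or not to be"

# With word boundaries: every word is its own sequence, so no pair spans a space.
within_words = count_pairs([list(word) for word in text.split(" ")])

# Without word boundaries: the whole text is one sequence, so pairs may include spaces.
across_words = count_pairs([list(text)])

print(("o", " ") in within_words)   # False
print(across_words[("o", " ")])     # 2
```

A staged scheme in the spirit of the Super BPE variant would learn merges from the first kind of counts for a while and then switch to the second.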

Pretokenization

By default, text is split using the GPT pretokenization regex pattern. To override it, pass any HuggingFace PreTokenizer:

from complex_tokenization import BPETokenizer
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split, Whitespace

# Default: GPT regex pattern
tok = BPETokenizer()

# Whitespace splitting
tok = BPETokenizer(pretokenizer=Whitespace())

# Custom regex
tok = BPETokenizer(pretokenizer=Split(Regex(r"\w+|\S"), behavior="isolated"))
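
To see which spans the custom pattern above picks out, it can be tried with Python's standard `re` module. This is only an approximation: the HuggingFace `Regex` type uses its own regex engine, so edge cases may differ, and how `Split` treats the unmatched whitespace between spans is not modelled here.

```python
import re

# Same pattern as in the custom-regex example: runs of word characters,
# or any single non-whitespace character (such as punctuation).
pattern = re.compile(r"\w+|\S")

pieces = pattern.findall("Hello, world!")
print(pieces)
# ['Hello', ',', 'world', '!']
```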

Units

Units are the basic blocks we operate on. We implement three base units, plus language-specific decompositions via the script registry:

from complex_tokenization import BPETokenizer

# UTF-8 grapheme clusters (default) — one node sequence per cluster
tok = BPETokenizer(units="utf8_clusters")

# Raw UTF-8 bytes — one node per byte
tok = BPETokenizer(units="utf8")

# Characters — one node per Unicode character
tok = BPETokenizer(units="characters")
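
The three base units differ in how finely they cut the same string. The sketch below compares them using only the standard library. The cluster grouping is an approximation that attaches combining marks to the preceding base character; full grapheme cluster segmentation follows more rules, and `approx_clusters` is a hypothetical helper rather than this package's code.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Hebrew shin + qamats + shin dot (one visible letter), followed by Latin "a".
text = "\u05e9\u05b8\u05c1a"

as_bytes = list(text.encode("utf-8"))   # one item per UTF-8 byte
as_chars = list(text)                   # one item per Unicode character
as_clusters = approx_clusters(text)     # one item per (approximate) cluster
cluster_bytes = [c.encode("utf-8") for c in as_clusters]  # bytes grouped by cluster

print(len(as_bytes), len(as_chars), len(as_clusters))
# 7 4 2
```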

Language-Specific Units

Register script-specific handlers for structural decomposition:

from complex_tokenization import BPETokenizer
from complex_tokenization.languages.hebrew.decompose import decompose_cluster
from complex_tokenization.languages.chinese.graph import chinese_character_to_graph

tok = BPETokenizer()
tok.register_script("Hebrew", decompose_cluster)  # nikkud/dagesh as FullyConnectedGraph
tok.register_script("Han", chinese_character_to_graph)  # IDS tree decomposition
texts = ["שָׁלוֹם עוֹלָם", "你好世界"]  # placeholder corpus; replace with your own training texts
tok.train(texts, num_merges=100)
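
The idea of a script registry can be illustrated with a small dispatch table: detect the script of a cluster, then hand the cluster to whichever handler is registered for it. Everything in this sketch is hypothetical (`guess_script`, `split_hebrew`, `registry`, and the returned dictionary shape) and does not reflect the actual handler signature or graph classes used by this package. Script detection via Unicode character names is also a rough stand-in for a proper script property lookup.

```python
import unicodedata


def guess_script(cluster):
    """Guess a script label from the Unicode name of the first character."""
    name = unicodedata.name(cluster[0], "")
    if name.startswith("HEBREW"):
        return "Hebrew"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    return "Unknown"


def split_hebrew(cluster):
    """Separate base letters from combining marks (nikkud, dagesh, and similar)."""
    base = [c for c in cluster if not unicodedata.combining(c)]
    marks = [c for c in cluster if unicodedata.combining(c)]
    # Sorting the marks makes the result independent of their typed order.
    return {"base": "".join(base), "marks": sorted(marks)}


# Hypothetical registry: script label -> decomposition handler.
registry = {"Hebrew": split_hebrew}


def decompose(cluster):
    """Apply the registered handler for the cluster's script, if there is one."""
    handler = registry.get(guess_script(cluster))
    return handler(cluster) if handler else cluster


print(decompose("\u05e9\u05b8\u05c1"))  # shin with qamats and shin dot
print(decompose("a"))                   # no handler registered, returned unchanged
```

Treating the marks as an order-insensitive collection mirrors the idea of connecting them in a fully connected graph rather than a fixed sequence.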

Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025complex,
  title={Tokenization for Complex Scripts},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign-language-processing/complex-tokenization}},
  year={2025}
}
