unicodedata Module

The unicodedata module provides access to the Unicode Character Database (UCD), including character names, categories, normalization, and digit/decimal values.
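Results depend on which UCD version the interpreter was built against, so it can help to check it first. A minimal sketch using the module's `unidata_version` attribute (the variable name `major` is only illustrative):

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter,
# as a dotted string such as "15.0.0" (the exact value varies by Python build).
print(unicodedata.unidata_version)

# Illustrative: extract the major version for a coarse feature check.
major = int(unicodedata.unidata_version.split(".")[0])
print(major)
```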

Complexity Reference

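| Operation | Time | Space | Notes |
|---|---|---|---|
| name(ch) | O(1) | O(1) | Lookup by code point; raises ValueError if unnamed and no default is given |
| lookup(name) | O(1) | O(1) | Lookup by name; raises KeyError if the name is not found |
| category(ch) | O(1) | O(1) | General category |
| bidirectional(ch) | O(1) | O(1) | Bidi class |
| combining(ch) | O(1) | O(1) | Canonical combining class |
| decimal(ch) / digit(ch) / numeric(ch) | O(1) | O(1) | Numeric properties; raise ValueError if undefined and no default is given |
| normalize(form, s) | O(n) | O(n) | n = string length |
| is_normalized(form, s) | O(n) | O(1) | Checks normalization (Python 3.8+) |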
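The error behavior noted in the table can be wrapped in small helpers. The sketch below is one possible approach, not part of the module; `safe_name` and `safe_lookup` are hypothetical names introduced here for illustration.

```python
import unicodedata
from typing import Optional


def safe_name(ch: str) -> str:
    """Return the Unicode name of ch, or a U+XXXX placeholder if it has none."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        # Raised for code points without a name, such as control characters.
        return f"U+{ord(ch):04X}"


def safe_lookup(name: str) -> Optional[str]:
    """Return the character with the given name, or None if no such name exists."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


print(safe_name("A"))                         # LATIN CAPITAL LETTER A
print(safe_name("\n"))                        # U+000A
print(safe_lookup("NO SUCH CHARACTER NAME"))  # None
```

For `name()`, passing a default as the second argument avoids the exception without a helper; `lookup()` has no default parameter, so the `try`/`except` is needed there.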

Character Properties

import unicodedata

# Basic properties
ch = "é"
print(unicodedata.name(ch))       # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(ch))   # Ll
print(unicodedata.combining(ch))  # 0
print(unicodedata.bidirectional(ch))  # L

# Numeric properties
print(unicodedata.decimal("٢"))   # 2
print(unicodedata.digit("②"))     # 2
print(unicodedata.numeric("Ⅷ"))   # 8.0
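The three numeric functions are not interchangeable: a character can have a numeric value without having a digit or decimal value. The sketch below uses the optional default argument to show which properties are defined for a few characters; `numeric_profile` is a hypothetical helper, not a module function.

```python
import unicodedata


def numeric_profile(ch: str):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


print(numeric_profile("7"))       # (7, 7, 7.0)
print(numeric_profile("\u2461"))  # CIRCLED DIGIT TWO: (None, 2, 2.0)
print(numeric_profile("\u00bd"))  # VULGAR FRACTION ONE HALF: (None, None, 0.5)
print(numeric_profile("x"))       # (None, None, None)
```

Without the default argument, each of these calls raises ValueError when the property is undefined.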

Name Lookup

import unicodedata

# Lookup by name
ch = unicodedata.lookup("GREEK SMALL LETTER MU")  # "μ"

# Safe name lookup with default
name = unicodedata.name("Ω", "UNKNOWN")  # "GREEK CAPITAL LETTER OMEGA"
emoji = unicodedata.name("😀", None)      # "GRINNING FACE"; the default is not used
missing = unicodedata.name("\x00", None)  # None; control characters have no name
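`name()` and `lookup()` are inverses for named characters, and the same names work in `\N{...}` string escapes. A short sketch of the round trip (the helper `describe` is hypothetical, added only for illustration):

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME', using '<unnamed>' when no name exists."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


mu = unicodedata.lookup("GREEK SMALL LETTER MU")

# Round trip: looking up a character's own name returns the same character.
print(unicodedata.lookup(unicodedata.name(mu)) == mu)  # True

# The \N{...} escape resolves the same name at compile time.
print("\N{GREEK SMALL LETTER MU}" == mu)  # True

print(describe(mu))  # U+03BC GREEK SMALL LETTER MU
```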

Normalization

import unicodedata

text = "cafe\u0301"  # "e" + combining acute

# Normalize to NFC (composed) and NFD (decomposed); "NFKC" and "NFKD" are also accepted
nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

print(text == nfc)  # False
print(text == nfd)  # True

# Check normalization (is_normalized requires Python 3.8+)
print(unicodedata.is_normalized("NFC", text))  # False
print(unicodedata.is_normalized("NFD", text))  # True
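A common use of normalization is comparing strings that look identical but differ in code points. The sketch below assumes a hypothetical helper, `same_text`, that normalizes both sides before comparing; it also shows where the compatibility form NFKC differs from NFC.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"     # precomposed "é"
decomposed = "cafe\u0301"  # "e" + combining acute

print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: equal after NFC
print(len(composed), len(decomposed))   # 4 5

ligature = "\ufb01le"  # starts with LATIN SMALL LIGATURE FI

print(same_text(ligature, "file"))               # False: NFC keeps the ligature
print(same_text(ligature, "file", form="NFKC"))  # True: NFKC expands it to "fi"
```

NFKC and NFKD discard compatibility distinctions such as ligatures, so they suit search and matching better than storage of text that must round-trip unchanged.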