Optimize name table in unicodedata by excluding names derived by rule NR2 #144882

@serhiy-storchaka

For most ideographs, the Name property value is derived by concatenating a script-specific prefix string with the code point, expressed in uppercase hexadecimal using the usual 4- to 6-digit convention (see rule NR2 in chapter 4.8.1 of the Unicode 17.0.0 spec).

Thus, names for Hangul syllables and most Han and Tangut ideographic characters are not listed explicitly in UnicodeData.txt; unicodedata generates them algorithmically (see #80667). However, ideographic characters for scripts other than Han and Tangut, as well as Egyptian hieroglyphs, have their names listed explicitly in UnicodeData.txt, even when those names are derived by rule NR2. We can reduce the size of the name table by excluding names derived by rule NR2 and generating them with the existing code.
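To illustrate the idea, here is a rough Python sketch of the NR2 check (the real change would be in the C module and its table generator, so this is not the actual implementation). The helper names `nr2_name` and `is_nr2_derived`, the `PREFIXES` tuple, and the tiny `sample` table are all hypothetical and exist only for this example.

```python
def nr2_name(prefix: str, cp: int) -> str:
    """Build a name per rule NR2: prefix + uppercase hex code point (at least 4 digits)."""
    return f"{prefix}{cp:04X}"


def is_nr2_derived(cp: int, name: str, prefixes: tuple[str, ...]) -> bool:
    """Return True if `name` is exactly what rule NR2 would produce for `cp`."""
    return any(name == nr2_name(prefix, cp) for prefix in prefixes)


# Hypothetical subset of prefixes, chosen for illustration only.
PREFIXES = (
    "NUSHU CHARACTER-",
    "KHITAN SMALL SCRIPT CHARACTER-",
    "EGYPTIAN HIEROGLYPH-",
)

# A tiny stand-in for entries read from UnicodeData.txt.
sample = {
    0x1B170: "NUSHU CHARACTER-1B170",
    0x18B00: "KHITAN SMALL SCRIPT CHARACTER-18B00",
    0x13000: "EGYPTIAN HIEROGLYPH A001",  # not derived: no hex code point suffix
    0x0041: "LATIN CAPITAL LETTER A",
}

# Only names that cannot be regenerated by rule NR2 need to stay in the table.
explicit = {
    cp: name
    for cp, name in sample.items()
    if not is_nr2_derived(cp, name, PREFIXES)
}
```

In this sketch only the entries for U+13000 and U+0041 remain in `explicit`; the other two can be rebuilt from the prefix and the code point on lookup.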


Labels: 3.15, extension-modules, performance, topic-unicode, type-feature
