Fix Unicode normalization: replace NFKC with NFC (!941) · Graphics / Okular

Wendi Gan requested to merge gwdx/okular:fix-unicode-normalization into master Mar 05, 2024

Replace every use of NFKC Unicode normalization with NFC throughout the code.

NFKC can silently alter characters when text is copied or exported, because it folds compatibility characters into their plain equivalents. For example, under NFKC normalization the character ⑥ copied from a PDF is pasted as 6.
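
The difference comes from compatibility mappings: NFC only composes canonically equivalent sequences, while NFKC additionally replaces compatibility characters with their plain counterparts. A minimal sketch using Python's standard `unicodedata` module (an illustration only, not code from Okular; the variable names are made up for this example):

```python
import unicodedata

# Canonical equivalence: "e" followed by COMBINING ACUTE ACCENT (U+0301)
# is composed into the single code point U+00E9 by NFC.
decomposed_e = "e\u0301"
composed_e = unicodedata.normalize("NFC", decomposed_e)
print(len(decomposed_e), len(composed_e))  # 2 1

# Compatibility mapping: CIRCLED DIGIT SIX (U+2465) has only a
# compatibility decomposition, so NFC keeps it and NFKC folds it to "6".
circled_six = "\u2465"
nfc_six = unicodedata.normalize("NFC", circled_six)
nfkc_six = unicodedata.normalize("NFKC", circled_six)
print(nfc_six == circled_six)  # True
print(nfkc_six)                # 6
```

So NFC still unifies different encodings of the same character, but it does not change what the character is.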

Example output of NFC, NFD, NFKC, NFKD:

origin:         ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFC:            ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFD:            ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFKC:           :,!6            b':,!6'
NFKD:           :,!6            b':,!6'

Python script that generates the output above:

import unicodedata

s = '：，！⑥'

nfc = unicodedata.normalize('NFC', s)
nfd = unicodedata.normalize('NFD', s)
nfkc = unicodedata.normalize('NFKC', s)
nfkd = unicodedata.normalize('NFKD', s)

print(f"origin:\t\t{s}\t\t{s.encode('utf-8')}")
print(f"NFC:\t\t{nfc}\t\t{nfc.encode('utf-8')}")
print(f"NFD:\t\t{nfd}\t\t{nfd.encode('utf-8')}")
print(f"NFKC:\t\t{nfkc}\t\t{nfkc.encode('utf-8')}")
print(f"NFKD:\t\t{nfkd}\t\t{nfkd.encode('utf-8')}")

This change therefore switches every normalization call to NFC, which affects copying, exporting, text search, and any other operation that normalizes text. Whether every instance of NFKC really needs to be replaced with NFC still requires further discussion and verification.
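
One reason the blanket replacement deserves discussion is text search: NFKC lets a plain query match compatibility characters such as ligatures, while NFC does not. The sketch below illustrates that trade-off in Python. It is not Okular's search code, and `contains` and `page_text` are hypothetical names invented for this example.

```python
import unicodedata


def contains(haystack: str, needle: str, form: str) -> bool:
    """Return True if needle occurs in haystack after both are normalized to `form`."""
    normalized_haystack = unicodedata.normalize(form, haystack)
    normalized_needle = unicodedata.normalize(form, needle)
    return normalized_needle in normalized_haystack


# Hypothetical page text containing ⑥ (U+2465) and the "fi" ligature (U+FB01).
page_text = "Step \u2465: open the \ufb01le"

for form in ("NFC", "NFKC"):
    print(form, contains(page_text, "6", form), contains(page_text, "file", form))
# NFC False False
# NFKC True True
```

Under these assumptions, NFC preserves the original characters on copy but a search for `file` no longer finds text stored with a ligature, so search and copy/export may warrant different normalization forms.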

References:

Unicode Standard Annex #15: Unicode Normalization Forms
The Evince document viewer has encountered similar issues: issue 384 and issue 1085. The fix there replaces one G_NORMALIZE_NFKC with G_NORMALIZE_NFC.

BUG: 466521
CCBUG: 473495

