Fix Unicode normalization: replace NFKC with NFC
Replace every use of NFKC Unicode normalization with NFC throughout the code.
NFKC can inadvertently alter characters when text is copied or exported. For example, under NFKC normalization the character ⑥, copied from a PDF, is pasted as 6.
Example output of NFC, NFD, NFKC, NFKD:
origin: :,!⑥ b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFC: :,!⑥ b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFD: :,!⑥ b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFKC: :,!6 b':,!6'
NFKD: :,!6 b':,!6'
Python script to generate the above output:
import unicodedata
# Full-width colon, comma, and exclamation mark, followed by CIRCLED DIGIT SIX
s = ':,!⑥'
nfc = unicodedata.normalize('NFC', s)
nfd = unicodedata.normalize('NFD', s)
nfkc = unicodedata.normalize('NFKC', s)
nfkd = unicodedata.normalize('NFKD', s)
print(f"origin:\t\t{s}\t\t{s.encode('utf-8')}")
print(f"NFC:\t\t{nfc}\t\t{nfc.encode('utf-8')}")
print(f"NFD:\t\t{nfd}\t\t{nfd.encode('utf-8')}")
print(f"NFKC:\t\t{nfkc}\t\t{nfkc.encode('utf-8')}")
print(f"NFKD:\t\t{nfkd}\t\t{nfkd.encode('utf-8')}")
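To check which characters in a given string NFKC would rewrite while NFC leaves them untouched, the two forms can be compared character by character. The following is only a minimal sketch: the helper name compat_changes is hypothetical and not part of this change, and normalizing one character at a time ignores combining sequences, so it is meant for inspection rather than as a replacement for normalizing whole strings.

```python
import unicodedata


def compat_changes(text):
    """Return (char, NFKC form, decomposition) for characters NFKC rewrites but NFC keeps."""
    changes = []
    for ch in text:
        nfc = unicodedata.normalize('NFC', ch)
        nfkc = unicodedata.normalize('NFKC', ch)
        if nfc == ch and nfkc != ch:
            # decomposition() exposes the compatibility tag, e.g. '<circle> 0036'
            changes.append((ch, nfkc, unicodedata.decomposition(ch)))
    return changes


# 'a', FULLWIDTH COLON (U+FF1A), CIRCLED DIGIT SIX (U+2465)
for ch, nfkc, tag in compat_changes('a\uff1a\u2465'):
    print(f"U+{ord(ch):04X} {ch!r} -> {nfkc!r} ({tag})")
```

Characters reported by such a helper are the ones whose copied or exported form would differ between the old and the new behavior.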
This change therefore switches every normalization call to NFC, which affects copying, exporting, text search, and other operations. Whether every instance of NFKC actually needs to be replaced with NFC still requires further discussion and verification.
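One point for that discussion is how search behaves. The sketch below is an illustration only and does not reflect this project's search code; the contains helper and the sample string are hypothetical. It shows that a query for the ASCII digit matches ⑥ only under NFKC, while NFC still unifies canonically equivalent sequences such as e followed by a combining acute accent.

```python
import unicodedata


def contains(haystack, needle, form):
    """Substring test after normalizing both strings with the given form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))


# Sample text with CIRCLED DIGIT SIX (U+2465) and a precomposed e-acute (U+00E9)
page_text = 'Step \u2465: caf\u00e9'

# Compatibility match: only NFKC folds U+2465 to the ASCII digit 6.
print(contains(page_text, '6', 'NFKC'))  # True
print(contains(page_text, '6', 'NFC'))   # False

# Canonical match: NFC composes 'e' + COMBINING ACUTE ACCENT into U+00E9.
print(contains(page_text, 'cafe\u0301', 'NFC'))  # True
```

If matching 6 against ⑥ is still wanted when searching, one option would be to keep a compatibility form for search only and use NFC for copy and export; whether that split is appropriate here is part of what needs verifying.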
References:
- Unicode Standard Annex #15: Unicode Normalization Forms
- The Evince document viewer encountered similar issues (issue 384 and issue 1085) and replaced one G_NORMALIZE_NFKC with G_NORMALIZE_NFC.