
Fix Unicode normalization: replace NFKC with NFC

Wendi Gan requested to merge gwdx/okular:fix-unicode-normalization into master

Replace every use of NFKC Unicode normalization with NFC throughout the code.

The use of NFKC may inadvertently alter characters when they are copied or exported. For example, under NFKC normalization, the character ⑥ copied from a PDF would be pasted as 6.

Example output of NFC, NFD, NFKC, NFKD:

origin:         ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFC:            ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFD:            ：，！⑥         b'\xef\xbc\x9a\xef\xbc\x8c\xef\xbc\x81\xe2\x91\xa5'
NFKC:           :,!6            b':,!6'
NFKD:           :,!6            b':,!6'
Python script to generate the above output:
import unicodedata

# Fullwidth colon, comma, exclamation mark (U+FF1A, U+FF0C, U+FF01) and circled six (U+2465)
s = '：，！⑥'

nfc = unicodedata.normalize('NFC', s)
nfd = unicodedata.normalize('NFD', s)
nfkc = unicodedata.normalize('NFKC', s)
nfkd = unicodedata.normalize('NFKD', s)

print(f"origin:\t\t{s}\t\t{s.encode('utf-8')}")
print(f"NFC:\t\t{nfc}\t\t{nfc.encode('utf-8')}")
print(f"NFD:\t\t{nfd}\t\t{nfd.encode('utf-8')}")
print(f"NFKC:\t\t{nfkc}\t\t{nfkc.encode('utf-8')}")
print(f"NFKD:\t\t{nfkd}\t\t{nfkd.encode('utf-8')}")

This change therefore switches all normalization to NFC, which affects copying, exporting, text search, and other operations. Whether every use of NFKC actually needs to be replaced with NFC still requires further discussion and verification.
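
To make the effect on text search concrete, here is a minimal Python sketch. It is not Okular's search code: the helper name find_normalized and the sample page text are hypothetical, and the sketch only assumes that both the page text and the query are normalized with the same form before matching. Under that assumption, a query for "file" or "6" matches text containing the ligature U+FB01 or the circled six U+2465 under NFKC, but not under NFC.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str) -> int:
    """Normalize both strings with `form`, then return the index of
    needle in haystack, or -1 if it does not occur."""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)


# Hypothetical page text: U+FB01 (fi ligature) + "le " + U+2465 (circled six)
page_text = "\ufb01le \u2465"

for form in ("NFKC", "NFC"):
    print(
        form,
        find_normalized(page_text, "file", form),
        find_normalized(page_text, "6", form),
    )
```

With NFC, such a page would only match a query that contains the ligature or the circled digit itself, which may be one of the cases worth checking before every NFKC call is replaced.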

References:

BUG: 466521
CCBUG: 473495
