Fix 3 and 4 byte UTF-8 sequences (!17) · Merge requests · Utilities / Okteta

Ben Wagner requested to merge bungeman/okteta:fix_utf8 into 0.26 May 04, 2023

Code points encoded with three UTF-8 code units take four bits from the leading code unit. Code points that require four UTF-8 code units take three bits from the leading code unit. The tests on the leading code unit that determine how many trailing code units are required already account for this, but the code point calculation did not. Instead, the code point calculations for three and four code unit encodings used five bits from the leading code unit (which is correct only for encodings that use two code units). Fix this by correcting the masks so that each uses the right number of bits.
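To illustrate the masks described above, here is a minimal standalone sketch in C++. It is not the Okteta code: the names `sequenceLength`, `leadingMask`, and `decode` are hypothetical and chosen only for this example. The sketch assumes well-formed input and does not validate trailing code units.

```cpp
#include <cstdint>

// Hypothetical helper: total number of code units in a sequence,
// judged from the leading code unit. Returns 0 for an invalid lead.
constexpr int sequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: payload mask for the leading code unit.
// Two units: five bits; three units: four bits; four units: three bits.
constexpr std::uint32_t leadingMask(int length)
{
    return length == 2 ? 0x1F
         : length == 3 ? 0x0F
         : length == 4 ? 0x07
         : 0x7F; // single code unit (ASCII)
}

// Decode one sequence of `length` code units into a code point.
// Each trailing code unit contributes its low six bits.
constexpr std::uint32_t decode(const unsigned char* units, int length)
{
    std::uint32_t codePoint = units[0] & leadingMask(length);
    for (int i = 1; i < length; ++i) {
        codePoint = (codePoint << 6) | (units[i] & 0x3F);
    }
    return codePoint;
}
```

For example, the four code units `F0 9F 98 80` decode to U+1F600 with the three-bit mask `0x07`. In this sketch, a five-bit mask (`0x1F`) on that leading code unit would keep the marker bit `0x10` and give `0x41F600` instead.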

