Fix 3 and 4 byte UTF-8 sequences

Ben Wagner requested to merge bungeman/okteta:fix_utf8 into 0.26

A code point encoded with three UTF-8 code units takes four bits from the leading code unit. A code point encoded with four UTF-8 code units takes three bits from the leading code unit. The tests on the leading code unit that determine how many trailing code units are required were adjusted for this, but the code point calculation was not. Instead, the calculations for three and four code unit encodings took five bits from the leading code unit, which is correct only for encodings that use two code units. Fix this by correcting the masks so that each case uses the right number of bits.
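For reference, the bit layout described above can be sketched as follows. This is a minimal illustration, not the code changed by this merge request: the names `leadingPayload`, `decode3` and `decode4` are hypothetical, and validation of the trailing code units is left out on purpose.

```cpp
#include <cstdint>

// Hypothetical helper (not Okteta's actual code): extracts the payload bits
// carried by the leading code unit of a sequence of `length` code units.
constexpr std::uint32_t leadingPayload(std::uint8_t lead, int length)
{
    return length == 2 ? (lead & 0x1Fu)   // 110xxxxx: five payload bits
         : length == 3 ? (lead & 0x0Fu)   // 1110xxxx: four payload bits
         : length == 4 ? (lead & 0x07u)   // 11110xxx: three payload bits
         : lead;                          // 0xxxxxxx: single code unit
}

// Each trailing code unit (10xxxxxx) contributes six payload bits.
constexpr std::uint32_t decode3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    return (leadingPayload(b0, 3) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

constexpr std::uint32_t decode4(std::uint8_t b0, std::uint8_t b1,
                                std::uint8_t b2, std::uint8_t b3)
{
    return (leadingPayload(b0, 4) << 18) | ((b1 & 0x3Fu) << 12)
         | ((b2 & 0x3Fu) << 6) | (b3 & 0x3Fu);
}
```

With these masks, `decode3(0xE2, 0x82, 0xAC)` yields U+20AC and `decode4(0xF0, 0x9F, 0x98, 0x80)` yields U+1F600. A five-bit mask (`0x1F`) applied to the leading code unit `0xF0` would instead keep the marker bit `0x10` as if it were payload.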
