plaintextextractor: autodetect encoding for text (!133) · Merge requests · Frameworks / KFileMetaData

Sergey Katunin requested to merge sgakerru/kfilemetadata:handle-non-utf-plain-text into master Feb 29, 2024

Autodetect encoding feature

Add an encoding autodetection feature to plaintextextractor, inspired by a similar algorithm in KTextEditor.
Also add some test files: plain text files in win1251, gb18030 and euc-jp encodings, and test HTML files in UTF-16LE and win1251 encodings.
The newline character \n is now removed manually AFTER decoding, not before. This is necessary for multi-byte encodings (those with little-endian and big-endian byte orders), which encode a newline in two or more bytes, for example 00 0A (2 bytes), where 0A == \n. Deleting the 0A byte without also handling the accompanying 00 breaks decoding.

Take UTF-16LE as an example. QFile::readLine reads until it reaches the byte 0A, so the sequence 0A 00 is split: 0A lands at the end of the first line and 00 at the start of the second. QStringDecoder has an internal buffer, so it remembers the 0A from the previous line and, while decoding the second line, prepends \n to the decoded text of that line. As a result, the \n ends up at the beginning of the next line rather than at the end of the previous one.
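
The splitting described above can be reproduced outside Qt. The following is only an illustrative sketch in Python, not the code from this merge request: `io.BytesIO.readlines` stands in for the byte-level `readLine`, and Python's incremental decoder stands in for QStringDecoder's internal buffer. The sample text and all variable names are made up for the example.

```python
import codecs
import io

# Hypothetical sample: two lines with LF endings, encoded as UTF-16LE.
data = "ab\ncd\n".encode("utf-16-le")

# A byte-level readline stops right after the 0A byte, so the 00 that
# completes the newline code unit starts the next chunk.
raw_lines = io.BytesIO(data).readlines()

# Stripping 0A BEFORE decoding leaves the orphaned 00 bytes behind and
# shifts every following code unit by one byte.
broken = b"".join(chunk.rstrip(b"\n") for chunk in raw_lines).decode("utf-16-le")

# Decoding first with a stateful decoder keeps the pending 0A in its
# buffer, so the newline shows up at the START of the next decoded piece.
decoder = codecs.getincrementaldecoder("utf-16-le")()
decoded = [decoder.decode(chunk) for chunk in raw_lines]

# Removing the newline AFTER decoding recovers the intended lines.
lines = [piece.strip("\n") for piece in decoded if piece.strip("\n")]
```

Here `decoded` is `['ab', '\ncd', '\n']`: the newline belonging to the first line appears at the start of the second decoded piece, which is why it can only be removed safely after decoding.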

P.S. Everything now works correctly for both UTF-16LE and UTF-16BE with LF (\n) line endings, and for single-byte encodings with both LF (\n) and CRLF (\r\n) line endings. However, QIODevice::Text apparently does not handle CRLF correctly for UTF-16. To read UTF-16 files with CRLF line endings correctly, the file has to be opened without QIODevice::Text, and the different combinations of \r and \n have to be handled manually for UTF-16LE and UTF-16BE: for UTF-16BE, \r\n at the end of the line; for UTF-16LE, \r at the end of the first line and \n at the start of the second.
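
To make the two CRLF cases concrete, here is a rough Python sketch of what manual handling could look like when the raw bytes are read without any newline translation. It is an analogy under stated assumptions, not the Qt implementation: `decode_pieces` and `to_lines` are hypothetical helper names, and the line endings are normalized only after the whole text has been decoded.

```python
import codecs
import io


def decode_pieces(data: bytes, encoding: str) -> list[str]:
    """Decode byte-level lines one by one with a stateful decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in io.BytesIO(data).readlines()]
    pieces.append(decoder.decode(b"", final=True))  # flush the buffer
    return [piece for piece in pieces if piece]


def to_lines(pieces: list[str]) -> list[str]:
    """Join decoded pieces and normalize CRLF to LF after decoding."""
    lines = "".join(pieces).replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty entry after the final newline
    return lines


sample = "ab\r\ncd\r\n"  # hypothetical sample text with CRLF endings
le_pieces = decode_pieces(sample.encode("utf-16-le"), "utf-16-le")
be_pieces = decode_pieces(sample.encode("utf-16-be"), "utf-16-be")
```

With this sample, `le_pieces` is `['ab\r', '\ncd\r', '\n']` (the \r ends one piece and the \n starts the next), while `be_pieces` is `['ab\r\n', 'cd\r\n']`. In both cases `to_lines` yields `['ab', 'cd']`.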

Does this combination of UTF-16 and CRLF need proper handling, or is it not worth the effort?

Edited Mar 18, 2024 by Sergey Katunin
