plaintextextractor: autodetect encoding for text
Autodetect encoding feature
- Add an encoding autodetection feature to `plaintextextractor`, inspired by a similar algorithm in KTextEditor.
- Also added some test files: plain text files in the `win1251`, `gb18030`, and `euc-jp` encodings, and test HTML files in the `UTF-16LE` and `win1251` encodings.
- The newline character `\n` is now removed manually AFTER decoding, not before. This is necessary for multi-byte encodings (those with little-endian and big-endian byte orders), which encode a newline character in two or more bytes, for example `000A` (2 bytes), where `0A` == `\n`; deleting the byte `0A` without processing the accompanying `00` breaks decoding. Take `UTF-16LE`: with `readline`, the `\n` ends up at the beginning of the next line rather than at the end of the previous one, because `QFile.readline` reads until it reaches `\n`, so the sequence `0A00` is split, with `0A` at the end of the first line and `00` at the start of the second. But `QStringDecoder` has an internal buffer, so it remembers the `0A` from the previous line, and while decoding the second line it prepends `\n` to the decoded text of the second line.
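
The effect described in the last item can be reproduced without Qt. The following is only a minimal sketch in standard C++: `splitAtLfByte`, `Utf16LeDecoder`, and `decodeLines` are hypothetical names, not code from this change, and the decoder is a simplified stand-in for a stateful decoder (no surrogate pairs, no BOM handling).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split raw bytes after every 0x0A byte, mimicking a
// byte-oriented line reader that knows nothing about the text encoding.
std::vector<std::string> splitAtLfByte(const std::string &bytes)
{
    std::vector<std::string> chunks;
    std::string current;
    for (char b : bytes) {
        current.push_back(b);
        if (b == '\x0A') {
            chunks.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        chunks.push_back(current);
    return chunks;
}

// Hypothetical stateful UTF-16LE decoder (assumption: BMP characters only).
// An unpaired trailing byte is kept and completed by the next chunk.
class Utf16LeDecoder
{
public:
    std::u16string decode(const std::string &chunk)
    {
        std::u16string out;
        for (char b : chunk) {
            const auto byte = static_cast<unsigned char>(b);
            if (!m_hasPending) {
                m_pending = byte;      // low byte, wait for the high byte
                m_hasPending = true;
            } else {
                out.push_back(static_cast<char16_t>(m_pending | (byte << 8)));
                m_hasPending = false;
            }
        }
        return out;
    }

private:
    unsigned char m_pending = 0;
    bool m_hasPending = false;
};

// Decode each chunk first, then drop '\n' from the decoded text.
std::vector<std::u16string> decodeLines(const std::string &bytes)
{
    Utf16LeDecoder decoder;
    std::vector<std::u16string> lines;
    for (const std::string &chunk : splitAtLfByte(bytes)) {
        std::u16string cleaned;
        for (char16_t c : decoder.decode(chunk)) {
            if (c != u'\n')
                cleaned.push_back(c);
        }
        lines.push_back(cleaned);
    }
    return lines;
}
```

For the UTF-16LE bytes `61 00 0A 00 62 00` ("a", newline, "b"), the split produces `61 00 0A` and `00 62 00`. The second chunk decodes to `\n` followed by `b`, which is why the newline has to be stripped from the decoded text rather than from the raw bytes.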
P.S. Everything now works correctly for both `UTF-16LE` and `UTF-16BE` with LF (`\n`) line endings, and for single-byte encodings with both LF (`\n`) and CRLF (`\r\n`) line endings. But apparently `QIODevice::Text` handles CRLF incorrectly for `UTF-16`. That is, to read `UTF-16` files with CRLF line endings correctly, the file has to be read with `QIODevice::Text` turned off, and the different combinations of `\r` and `\n` have to be handled manually for `UTF-16LE` and `UTF-16BE`: for `UTF-16BE`, `\r\n` at the end of the line; for `UTF-16LE`, `\r` at the end of the first line and `\n` at the start of the second line.
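
One possible shape for that manual handling, given only as a sketch: the hypothetical helper `stripLineBreaks` below is applied to a chunk that has already been decoded, and it assumes the chunks were produced by splitting the raw bytes at `0A`. It is not part of this change and ignores edge cases such as characters whose encoded bytes happen to contain `0A`.

```cpp
#include <string>

// Hypothetical helper: remove line-break characters from an already
// decoded chunk of UTF-16 text.
std::u16string stripLineBreaks(std::u16string text)
{
    // UTF-16LE case: the '\n' of the previous line surfaces at the start
    // of the current chunk.
    if (!text.empty() && text.front() == u'\n')
        text.erase(0, 1);
    // UTF-16BE and single-byte case: the chunk ends with '\n' or "\r\n".
    if (!text.empty() && text.back() == u'\n')
        text.pop_back();
    // Either byte order: a trailing '\r' left over from CRLF.
    if (!text.empty() && text.back() == u'\r')
        text.pop_back();
    return text;
}
```

With this helper, a decoded `UTF-16BE` chunk such as `abc\r\n` and a decoded `UTF-16LE` pair such as `abc\r` followed by `\ndef\r` both reduce to the plain line contents.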
Does this combination of `UTF-16` and CRLF need proper handling, or is it not worth it?