win: Set activeCodePage to UTF-8

Alvin Wong requested to merge alvinwong/krita:alvin/utf8-code-page into master

This should enable the impex plugins for all file formats to access files whose paths contain Unicode characters outside the system code page on Windows.

Historically, Windows has used the ANSI code page (ACP) as the encoding for char strings, which can only represent a limited range of characters. Windows also provides wide versions of its APIs that take wchar_t, a 2-byte character type for strings encoded in UTF-16 (or UCS-2 in some cases). For libraries to support Unicode filenames on Windows, they have to go out of their way to implement it with the wide API. They don't do it consistently either -- some choose to provide wide variants of their API, while others choose to interpret char* paths as UTF-8 (which leads to confusion when the caller assumes the API takes the local 8-bit encoding).
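To make the "limited range of characters" point concrete, here is a minimal sketch using Python's built-in codecs as stand-ins for Windows code pages. The file name and the choice of cp1252 and cp950 as example ACPs are assumptions made for illustration only.

```python
def can_encode(name: str, code_page: str) -> bool:
    """Return True if every character of `name` is representable in `code_page`."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


file_name = "繁體中文.kra"  # hypothetical file name

print(can_encode(file_name, "cp1252"))  # Western European ACP: not representable
print(can_encode(file_name, "cp950"))   # Traditional Chinese (Big5) ACP: representable
print(can_encode(file_name, "utf-8"))   # UTF-8 covers all of Unicode

# The wide API sidesteps the problem with 2-byte UTF-16 code units.
print(len(file_name.encode("utf-16-le")))  # bytes used by the 8-character name
```

The same file name is therefore reachable through a char-based API on one machine and unreachable on another, depending only on the system ACP.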

Setting activeCodePage to UTF-8 in the application manifest changes the code page of our process to UTF-8. Effectively, all -A variants of WinAPI calls now accept UTF-8 strings instead of strings in the system ACP. By extension, the non-wide C and C++ functions for accessing files will now also accept UTF-8 file paths.

With regard to the impex plugins, this changes their behaviour around file paths as follows:

  • If the external library already accepts wchar_t *, there should be no change in behaviour.
  • If the external library accepts char * and treats it as UTF-8:
    • If we correctly use QString::toUtf8(), there should be no change in behaviour.
    • If we use QString::toLocal8Bit() or QFile::encodeName() by mistake, setting activeCodePage to UTF-8 makes the mistake harmless.
  • If the external library accepts char * and passes it directly to C or C++ library functions to open the file:
    • If we correctly use QString::toLocal8Bit() or QFile::encodeName(), the library previously could not open files whose names contain Unicode characters outside the system ACP, but will now be able to do so.
    • If we use QString::toUtf8() by mistake, setting activeCodePage to UTF-8 makes the mistake harmless.

As illustrated above, the result is a net improvement.
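The char * cases above can be checked with a small simulation. This is only a sketch under stated assumptions: Python codecs stand in for the Windows code pages, cp1252 is an assumed legacy system ACP, the path is hypothetical, unrepresentable characters are modelled as '?' substitution, and Qt's local 8-bit encoding is assumed to follow the process ACP. The wchar_t * case is not modelled because it is unaffected.

```python
LEGACY_ACP = "cp1252"          # assumption: a Western European system ACP
PATH = "C:/art/繁體中文.kra"    # hypothetical path with characters outside cp1252


def opens_correctly(path, caller, library, process_acp):
    """Check whether the library sees the path the caller meant.

    caller:  'utf8'  models QString::toUtf8(),
             'local' models QString::toLocal8Bit() / QFile::encodeName().
    library: 'utf8'  models a library that treats char * as UTF-8,
             'local' models a library that hands char * to C/C++ file functions.
    """
    def codec(kind):
        return "utf-8" if kind == "utf8" else process_acp

    raw = path.encode(codec(caller), errors="replace")
    seen = raw.decode(codec(library), errors="replace")
    return seen == path


def behaviour_table(path):
    """Map (library, caller) to (works before, works after) the change."""
    rows = {}
    for library in ("utf8", "local"):
        for caller in ("utf8", "local"):
            before = opens_correctly(path, caller, library, LEGACY_ACP)
            after = opens_correctly(path, caller, library, "utf-8")
            rows[(library, caller)] = (before, after)
    return rows


for (library, caller), (before, after) in behaviour_table(PATH).items():
    print(f"library reads {library:5}  caller sends {caller:5}  "
          f"before={before}  after={after}")
```

In this model every combination works once the process code page is UTF-8, and only the "library reads UTF-8, caller sends UTF-8" combination worked before, which matches the list above.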

Potential side effect: If a Python plugin expects to use the system ACP when interacting with an external process via IPC, this can cause an encoding mismatch.
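A rough sketch of how such a mismatch could surface, assuming a hypothetical external tool that writes its output in the system ACP (cp950 here) while the plugin decodes with the process code page. The function name and the message are made up for illustration.

```python
SYSTEM_ACP = "cp950"  # assumption: the external tool still writes in the system ACP


def read_reply(raw: bytes, assumed_encoding: str) -> str:
    """Decode bytes received from the external process, replacing invalid sequences."""
    return raw.decode(assumed_encoding, errors="replace")


message = "繁體中文"
raw = message.encode(SYSTEM_ACP)  # bytes the hypothetical tool writes to its pipe

print(read_reply(raw, SYSTEM_ACP) == message)  # decoding with the tool's encoding round-trips
print(read_reply(raw, "utf-8") == message)     # decoding as UTF-8 does not
```

A plugin in this situation would probably need to name the external tool's encoding explicitly rather than rely on the process default.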

Note that this only works starting from Windows 10 Version 1903.
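For reference, Windows 10 Version 1903 corresponds to build 18362. The check below is a minimal sketch of how a script might decide whether the setting can take effect; the function name is hypothetical and nothing like it is part of this merge request.

```python
MIN_BUILD = 18362  # Windows 10 Version 1903


def active_code_page_supported(major: int, build: int) -> bool:
    """Return True if the given Windows version is 10.0 build 18362 or newer."""
    return (major, build) >= (10, MIN_BUILD)


print(active_code_page_supported(10, 17763))  # Version 1809: too old
print(active_code_page_supported(10, 18362))  # Version 1903: supported
```

On Windows, the two arguments could be taken from the `major` and `build` fields of `sys.getwindowsversion()`.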

Reference: https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page


The system information now includes a section for locale. On Windows it shows something like this:

Locale

  Languages: zh_TW, en_US
  QLocale current: zh-TW
  QLocale system: zh-HK
  QTextCodec for locale: UTF-8
  Process ACP: 65001 (UTF-8)
  System locale default ACP: 950   (ANSI/OEM - 繁體中文 Big5)

"Process ACP: 65001" indicates that the activeCodePage option is applied. "QTextCodec for locale: UTF-8" means the Qt patch is working correctly.

Edited by Alvin Wong