Add Rtf extractor

Sergey Katunin requested to merge sgakerru/kfilemetadata:rtf-extractor into master

RTF format support has been implemented for extracting both text content and metadata (title, author, subject, etc.).

Main goal: Indexed search support (Baloo) for Rtf files.

Support for three types of RTF documents has been implemented:

a) The document is saved in UTF-8, and non-Latin text is escaped as \u????, a non-standard representation of UTF-16 (this is how LibreOffice saves it, for example).

b) The document is saved in UTF-8, and non-Latin text is escaped as \'?? in a non-Unicode encoding (like windows-1251). This is how Word or WordPad save it, for example.

c) The document is saved in a non-Unicode encoding (like windows-1251), and the text is stored directly as characters in that encoding (Cyrillic characters, in the windows-1251 case).
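The three cases above can be illustrated with a minimal standalone sketch. It is not the rtf-qt code: `cp1251ToUtf16` and `decodeRtfText` are hypothetical helpers, the code page is assumed to be windows-1251, the fallback count is assumed to be the RTF default (`\uc1`), and the windows-1251 table is deliberately partial. It relies on the fact that `\uN` carries a signed 16-bit decimal, so values above 32767 are written as negative numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper: map one windows-1251 byte to a UTF-16 code unit.
// Partial table: ASCII, the contiguous range 0xC0..0xFF (U+0410..U+044F),
// and the letters Ё/ё. Every other byte becomes U+FFFD.
char16_t cp1251ToUtf16(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return static_cast<char16_t>(0x0410 + (b - 0xC0));
    if (b == 0xA8) return 0x0401; // Ё
    if (b == 0xB8) return 0x0451; // ё
    return 0xFFFD;
}

// True if s[i..] starts with a \'hh escape (backslash, apostrophe, two hex digits).
bool isHexEscape(const std::string &s, std::size_t i)
{
    return i + 3 < s.size() && s[i] == '\\' && s[i + 1] == '\''
        && std::isxdigit(static_cast<unsigned char>(s[i + 2]))
        && std::isxdigit(static_cast<unsigned char>(s[i + 3]));
}

// True if s[i..] starts with a \uN escape, where N is a (possibly negative) decimal.
bool isUnicodeEscape(const std::string &s, std::size_t i)
{
    if (i + 2 >= s.size() || s[i] != '\\' || s[i + 1] != 'u') return false;
    std::size_t d = i + 2;
    if (s[d] == '-') ++d;
    return d < s.size() && std::isdigit(static_cast<unsigned char>(s[d]));
}

// Hypothetical helper: decode a fragment of RTF text into UTF-16.
std::u16string decodeRtfText(const std::string &s)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (isUnicodeEscape(s, i)) {
            // Type A: \uN, where N is a signed 16-bit decimal code unit.
            char *end = nullptr;
            const long n = std::strtol(s.c_str() + i + 2, &end, 10);
            out.push_back(static_cast<char16_t>(static_cast<std::uint16_t>(n)));
            i = static_cast<std::size_t>(end - s.c_str());
            if (i < s.size() && s[i] == ' ') ++i;            // control word delimiter
            // Skip one fallback character (assumes \uc1).
            if (isHexEscape(s, i)) i += 4;
            else if (i < s.size() && s[i] != '\\') ++i;
        } else if (isHexEscape(s, i)) {
            // Type B: \'hh is one byte in the document code page.
            const long b = std::strtol(s.substr(i + 2, 2).c_str(), nullptr, 16);
            out.push_back(cp1251ToUtf16(static_cast<unsigned char>(b)));
            i += 4;
        } else {
            // Type C (and plain ASCII): a raw byte in the document code page.
            out.push_back(cp1251ToUtf16(static_cast<unsigned char>(s[i])));
            ++i;
        }
    }
    return out;
}
```

Under these assumptions, `\u1055\'3f\u1088\'3f` (type A), `\'cf\'f0` (type B) and the raw bytes 0xCF 0xF0 (type C) all decode to the same two Cyrillic letters, U+041F and U+0440.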

Five autotests are included:

  1. LibreOffice Rtf with Latin text (Type A).
  2. LibreOffice Rtf with Cyrillic text (Type A).
  3. UTF-8 Rtf file with windows-1251 encoding for Latin text (Type B).
  4. UTF-8 Rtf file with windows-1251 encoding for Cyrillic text (Type B).
  5. Rtf file with windows-1251 encoding and Cyrillic text (Type C).

To achieve this, the following changes were made:

  1. The rtf-qt library has been moved from the Calligra project (which is part of KDE) to KFileMetadata. Later, we can try to reuse the rtf-qt library from here in Calligra and remove Calligra's own copy.

  2. The rtf-qt library has been adapted for Qt6:

    1. QTextCodec was removed in favor of QStringDecoder.
    2. QVariant::Type was replaced by QMetaType.
    3. QTextCharFormat: setFontFamilies is used instead of setFontFamily.

  3. Replaced all implicit conversions of ASCII strings to QString with explicit ones in the rtf-qt library, so that the library can be built as part of KFileMetadata, where implicit conversions are prohibited.

  4. Added rtfextractor and rtfextractortest for KFileMetadata, which is able to extract the contents of the document, as well as the following metadata:

    1. Author.
    2. Subject.
    3. Title.
    4. Description (comment).
    5. Keywords.
    6. Generator (the program that created the document).
    7. Date of creation.
    8. The number of pages.
    9. The number of words.

  5. Fixed metadata parsing in the rtf-qt library:

    1. Added support for escape sequences in metadata (title, author, comment, etc.), both Unicode (\u????) and non-Unicode, such as windows-1251 (\'??).
    2. Added handling of the \upr and \ud tags, so that the Unicode \ud variant is read instead of the fallback text, which comes out corrupted when read as Latin1. Without this change, some tags with non-Latin text, such as the header tag in a document saved by LibreOffice, were read as corrupted text.
    3. Added handling of empty tags for the number of pages and words (LibreOffice, for example, saves these tags without filling them in). Without this change, the page and word counts had incorrect values.
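Two of the parsing fixes above can be sketched outside of Qt. This is an illustration and not the code from this merge request: `balancedGroup`, `preferUnicodeGroup`, `ControlWord` and `parseControlWord` are hypothetical names. It assumes the layout that the RTF specification gives for `\upr`, namely `{\upr{keyword ansi_text}{\*\ud{keyword Unicode_text}}}`, and it assumes that an "empty tag" means a control word written without its numeric parameter.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Return the balanced {...} group starting at in[open] (braces included),
// or an empty string if the group is never closed.
std::string balancedGroup(const std::string &in, std::size_t open)
{
    int depth = 0;
    for (std::size_t i = open; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) { ++i; continue; } // skip \{ \} \\ etc.
        if (in[i] == '{') ++depth;
        else if (in[i] == '}' && --depth == 0) return in.substr(open, i - open + 1);
    }
    return {};
}

// For a {\upr ...} group, return the inner group stored under \*\ud
// (the Unicode variant). Any other input is returned unchanged.
std::string preferUnicodeGroup(const std::string &group)
{
    if (group.rfind("{\\upr", 0) != 0) return group;
    const std::size_t ud = group.find("{\\*\\ud");
    if (ud == std::string::npos) return group;
    const std::string udGroup = balancedGroup(group, ud);
    const std::size_t inner = udGroup.find('{', 1);
    if (inner == std::string::npos) return group;
    return balancedGroup(udGroup, inner);
}

struct ControlWord {
    std::string name;
    std::optional<int> value; // empty when the numeric parameter is missing
};

// Parse a control word such as \nofpages3 starting at in[pos].
std::optional<ControlWord> parseControlWord(const std::string &in, std::size_t pos = 0)
{
    if (pos >= in.size() || in[pos] != '\\') return std::nullopt;
    std::size_t i = pos + 1;
    ControlWord cw;
    while (i < in.size() && std::isalpha(static_cast<unsigned char>(in[i])))
        cw.name.push_back(in[i++]);
    if (cw.name.empty()) return std::nullopt;

    const bool negative = i + 1 < in.size() && in[i] == '-'
        && std::isdigit(static_cast<unsigned char>(in[i + 1]));
    if (negative) ++i;
    int n = 0;
    bool hasDigits = false;
    while (i < in.size() && std::isdigit(static_cast<unsigned char>(in[i]))) {
        n = n * 10 + (in[i++] - '0');
        hasDigits = true;
    }
    if (hasDigits) cw.value = negative ? -n : n; // otherwise the value stays unset
    return cw;
}
```

Keeping the missing parameter as an empty `std::optional` lets a caller skip the page or word count entirely instead of reporting a made-up number.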
