[TermGenerator] Skip all unprintable characters

Igor Poboiko requested to merge poboiko/baloo:malformed into master

Some extractors can produce text that includes Unicode control characters (e.g. Poppler can give us 0x0001 from some PDFs). TermGenerator then generates valid (yet meaningless) terms out of those characters, and they end up in the database. It should be safe to skip all unprintable characters to avoid that (surrogates are fine, though: they are dealt with later via the QString::normalized call).
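As a rough illustration of the idea (not the actual patch), here is a minimal sketch in plain standard C++ that drops control characters from a single, already-split word. The names `isControlChar` and `stripControlChars` are hypothetical, and the check only covers the C0/C1 control ranges; a Qt-based implementation would more likely rely on Qt's own character classification, which covers more categories.

```cpp
#include <string>

// Hypothetical helper: true for C0 controls (U+0000..U+001F), DEL (U+007F)
// and C1 controls (U+0080..U+009F). This is narrower than a full
// "is printable" check and is only meant as an illustration.
bool isControlChar(char16_t c)
{
    return c < 0x0020 || (c >= 0x007F && c <= 0x009F);
}

// Hypothetical helper: returns a copy of `word` without control characters.
// Surrogate code units (U+D800..U+DFFF) are outside the ranges above,
// so they pass through untouched.
std::u16string stripControlChars(const std::u16string& word)
{
    std::u16string cleaned;
    cleaned.reserve(word.size());
    for (char16_t c : word) {
        if (!isControlChar(c)) {
            cleaned.push_back(c);
        }
    }
    return cleaned;
}
```

With this sketch, a word such as "foo" followed by 0x0001 and "bar" comes out as "foobar", so no term containing 0x0001 can reach the database.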

Character 0x0001 is the worst offender, as DocTermsCodec uses it internally for compaction. A term that contains it collides with that internal marker, which leads to a corrupted database (some terms from DocTermsDB are not present in PostingDB).
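To show why such a collision is harmful in general, here is a simplified sketch of a codec that joins terms with a reserved byte. It is an assumption made for illustration only: the real DocTermsCodec format is not described here and is likely different, and `kSeparator`, `encodeTerms` and `decodeTerms` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Assumed reserved byte, used here as a plain term separator.
constexpr char kSeparator = '\x01';

// Hypothetical encoder: writes every term followed by the separator.
std::string encodeTerms(const std::vector<std::string>& terms)
{
    std::string data;
    for (const std::string& term : terms) {
        data += term;
        data += kSeparator;
    }
    return data;
}

// Hypothetical decoder: splits the data on the separator.
std::vector<std::string> decodeTerms(const std::string& data)
{
    std::vector<std::string> terms;
    std::string current;
    for (char c : data) {
        if (c == kSeparator) {
            terms.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    return terms;
}
```

In this toy format, a single term that contains the byte 0x01 decodes as two different terms, so the decoded list no longer matches what was indexed. That is the same kind of mismatch as terms appearing in one database but not in the other.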

The corruption is not hypothetical (although not critical): I've encountered a bunch of broken DB entries for some PDF files on my machine.