[WIP] Use MDB_DUPSORT for PostingDB

Igor Poboiko requested to merge poboiko/baloo:db_dupsort into master


This patch aims to fix two major issues.

  1. Currently Baloo has two DBs that largely overlap: PostingDB, which maps term -> {list of docid}, and PositionDB, which maps term -> {list of {docid, {list of positions}}}. The data in PostingDB is thus completely redundant.

I suppose the reasoning behind this was that PositionDB entries can be very large, and we would have to fetch and unpack them entirely even for simple queries, degrading search performance.

  2. Entries in PostingDB and PositionDB can grow very large for frequent terms. When a new document is indexed, the whole entry has to be fetched from the DB, modified, and written back (leading to heavy I/O). This is currently mitigated by batching transactions: Baloo indexes 40 files, keeps all the indexed data on the heap (see WriteTransaction::m_pendingOperations), and performs the read-modify-write cycle once the batch is finished. This can lead to the huge memory usage reported by some users: in the worst case, a batch consisting of 40 large plain-text files (~10 MB each, which is currently the upper threshold) can easily eat up gigabytes of memory.

(see also old discussion in https://phabricator.kde.org/T9805)

Proposed solution

LMDB has the DUPSORT feature, which allows storing multiple sorted values (docids) for each key (term). By switching to it, we don't need to fetch and unpack the entire list of documents + positions when performing a lookup; we can fetch just a single entry for a particular term+document pair. This allows merging PostingDB and PositionDB without degrading lookup performance.
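For illustration, the DUPSORT access pattern can be emulated with an in-memory sorted multimap: one key (term) maps to many sorted values (docids), and a specific (term, docid) pair can be probed without decoding a whole per-term blob. This is a standalone sketch with made-up names, not Baloo or LMDB code; the probe plays the role of `mdb_cursor_get(..., MDB_GET_BOTH)`.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of DUPSORT semantics: duplicate values per key, kept sorted.
using PostingDb = std::multimap<std::string, uint64_t>;

// Point lookup of a single term+docid pair; no need to load or
// decode the full posting list for the term.
bool hasPosting(const PostingDb &db, const std::string &term, uint64_t docId)
{
    auto range = db.equal_range(term);
    for (auto it = range.first; it != range.second; ++it) {
        if (it->second == docId) {
            return true;
        }
    }
    return false;
}
```

With a real DUPSORT cursor the probe is a single B-tree descent rather than a scan, but the interface idea is the same: the unit of retrieval is one (term, docid) entry, not the whole list.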

For the same reason we don't need to keep pending operations on the heap: the entries inside the merged PostingDB are now much smaller (each corresponds to a single term+docid pair) and can be updated on-the-fly during indexing. This lets us get rid of WriteTransaction entirely, reducing Baloo's memory and I/O usage as well.

Technical details

  1. PostingDB now maps term -> {docid, {list of positions}} using the DUPSORT feature. Due to a DUPSORT limitation (the maximum value size is ~500 bytes), large lists of positions have to be stored elsewhere; they go to the new PositionDB.

  2. Keys inside the new PositionDB are roughly md5sum(term+docid)+delta (see postingHash in postingdb.cpp), and values are overflow lists of positions. The single-byte parameter delta is used to resolve hash collisions and is stored inside PostingDB for such entries.

  3. WriteTransaction is gone; memory is now consumed mostly by LMDB itself (which claims to be good at this, as it relies on kernel mechanisms to keep track of dirty pages and sync them to disk when necessary). All DB operations were merged into Transaction and are performed on-the-fly.

  4. PositionCodec is gone: it was used to pack huge {list of {docid, {list of positions}}} entries into a QByteArray and is no longer needed.

  5. As positions are now stored in PostingDB, we can provide them via PostingIterator (fetching them only on demand, e.g. for phrase query lookups). Since we no longer need to fetch and unpack huge entries, PostingIterator is now based on an LMDB cursor (NB: this limits the lifespan of the iterator to that of the transaction!)

  6. As PostingDB now manages two sub-DBs (postingdb and positiondb), the tests had to be adjusted. SingleDBTest was renamed to DBTest and can now test an arbitrary number of sub-DBs instead of a single one.

  7. s_dbVersion inside migrator.cpp has to be increased, and all users will have to perform a full reindex after the update. That's one of the big drawbacks of changing the DB scheme, but I hope it's worth it.
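Points 1 and 2 above can be sketched roughly as follows: a position list small enough fits inline in PostingDB, while a larger one goes into PositionDB under a key derived from a hash of term+docid plus a one-byte delta that is bumped until the slot is free. This is only an illustration under assumptions: `std::hash` stands in for MD5, `kMaxInlineBytes` and `makeOverflowKey` are invented names, and the encoding is simplified to 4 bytes per position; the real layout lives in postingdb.cpp.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Assumed inline-size threshold; DUPSORT limits the value size to ~500 bytes.
constexpr size_t kMaxInlineBytes = 500;

// Does the encoded position list still fit inline in PostingDB?
// (Assumes a fixed 4 bytes per position for simplicity.)
bool fitsInline(const std::vector<uint32_t> &positions)
{
    return positions.size() * sizeof(uint32_t) <= kMaxInlineBytes;
}

// Overflow key: hash(term+docid) plus a one-byte delta, probed until a
// free slot is found (resolving hash collisions).
using OverflowKey = std::pair<uint64_t, uint8_t>; // (hash, delta)

std::optional<OverflowKey> makeOverflowKey(
    const std::string &term, uint64_t docId,
    const std::map<OverflowKey, std::vector<uint32_t>> &positionDb)
{
    uint64_t h = std::hash<std::string>{}(term + std::to_string(docId));
    for (int delta = 0; delta < 256; ++delta) {
        OverflowKey key{h, static_cast<uint8_t>(delta)};
        if (positionDb.find(key) == positionDb.end()) {
            return key; // free slot; this delta gets stored in PostingDB
        }
    }
    return std::nullopt; // all 256 deltas taken (practically never happens)
}
```

The delta stored alongside the PostingDB entry is what makes the overflow key recoverable later: on lookup, the same hash is computed and the stored delta is appended, so no probing is needed on the read path.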


I tried to cover all the new code with tests (adapting and modifying existing tests where possible). Obviously, this is quite a lot of new code and thus prone to new bugs, so additional testing and suggestions for new test cases would be appreciated. So far I've been running the new DB on my setup and haven't encountered any issues.


Database size

"Useful" database size decreased roughly by the size of PositionDB (as expected). The freelist share grew slightly (from 43% to 51%), but the total DB size still decreased.


Before:

File Size: 755,40 MiB
Used:      428,25 MiB

           PostingDB:     133,43 MiB    31.157 %
          PositionDB:     222,30 MiB    51.909 %
            DocTerms:      66,67 MiB    15.568 %
    DocFilenameTerms:       1,36 MiB     0.317 %
       DocXattrTerms:       4,00 KiB     0.001 %
              IdTree:     324,00 KiB     0.074 %
          IdFileName:       1,48 MiB     0.347 %
             DocTime:     836,00 KiB     0.191 %
             DocData:       1,30 MiB     0.304 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:     588,00 KiB     0.134 %


After:

File Size: 593,00 MiB
Used:      290,66 MiB

           PostingDB:     189,98 MiB    65.364 %
          PositionDB:      27,99 MiB     9.629 %
            DocTerms:      66,83 MiB    22.993 %
    DocFilenameTerms:       1,36 MiB     0.466 %
       DocXattrTerms:       4,00 KiB     0.001 %
              IdTree:     324,00 KiB     0.109 %
          IdFileName:       1,48 MiB     0.511 %
             DocTime:     836,00 KiB     0.281 %
             DocData:       1,30 MiB     0.448 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:     588,00 KiB     0.198 %
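The freelist shares quoted above can be recomputed from the two dumps, under the assumption that everything in the file not counted as "Used" is free pages. A quick check, with sizes in MiB:

```cpp
// Fraction of the DB file not occupied by live data
// (assumes file size minus used size == free pages).
double freelistShare(double fileSizeMiB, double usedMiB)
{
    return (fileSizeMiB - usedMiB) / fileSizeMiB;
}

// Before: (755.40 - 428.25) / 755.40 ~= 0.43
// After:  (593.00 - 290.66) / 593.00 ~= 0.51
```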

Massif analysis

Total number of instructions decreased significantly (6.6e12 vs. 3.9e12). Memory usage hasn't changed dramatically (peak usage 259.5 MiB vs. 185.6 MiB); most of the memory is consumed by LMDB itself, although in one snapshot WriteTransaction::m_pendingOperations alone ate 85.3 MiB.

In the synthetic worst-case scenario (a single transaction with 40 x 10-MiB files of plain-text garbage), Baloo previously consumed 3.5 GiB of memory (of which only ~800 MiB was consumed by LMDB). With this change, Baloo eats only ~700 MiB (~90% of which is consumed by LMDB).

IO usage

Analyzed via cat /proc/$(pidof baloo_file_extractor)/io. The amount of write I/O decreased by almost a factor of 2: from ~18.3 GiB down to ~9.7 GiB.

There are, however, claims that tracking memory-mapped I/O via procfs can be unreliable. I tested this with a small stand-alone program: read_bytes does indeed seem quite random (explaining the "0" in the "before" numbers), but write_bytes looks reasonable.


Before:

rchar: 9011275417
wchar: 18309277012
syscr: 69683499
syscw: 40091472
read_bytes: 0
write_bytes: 18264674304
cancelled_write_bytes: 0


After:

rchar: 9008848172
wchar: 9700639294
syscr: 67337771
syscw: 37699360
read_bytes: 60801024
write_bytes: 9658175488
cancelled_write_bytes: 0
