[WIP] Use MDB_DUPSORT for PostingDB
Rationale
This patch aims to fix two large issues.
- Currently, Baloo has two DBs that largely overlap: `PostingDB`, which maps `term -> {list of docids}`, and `PositionDB`, which maps `term -> {list of {docid, {list of positions}}}`. The data in `PostingDB` is thus completely redundant. I suppose the reasoning behind this was that `PositionDB` entries can potentially be very large, and we would have to fetch and unpack them entirely when performing simple queries, degrading search performance.
- Entries in `PostingDB` and `PositionDB` are potentially huge for frequent terms. When a new document is indexed, the whole entry has to be fetched from the DB, modified, and written back, leading to huge I/O usage. This is currently mitigated by batching the transactions: Baloo indexes 40 files, stores all the indexed data on the heap (see `WriteTransaction::m_pendingOperations`), and performs the read-modify-write cycle when the batch is finished. This can lead to the huge memory usage reported by some users: in the worst-case scenario, a batch consisting of 40 huge plain-text files (e.g. ~10 MiB each, which is currently the upper threshold) can easily eat gigabytes of memory.
(see also the old discussion in https://phabricator.kde.org/T9805)
Proposed solution
LMDB has the `MDB_DUPSORT` feature, which allows storing multiple sorted values (docids) for each key (term). By switching to it, we no longer need to fetch and unpack the entire list of documents and positions when performing a lookup: we can fetch a single entry for a particular term+docid pair. This allows merging `PostingDB` and `PositionDB` without degrading lookup performance.
For the same reason, we no longer need to store pending operations on the heap: the entries inside `PostingDB` are now much smaller (each corresponds to a single term+docid pair) and can be updated on the fly during indexing. With this, we can get rid of `WriteTransaction` entirely, reducing Baloo's memory and I/O usage as well.
Technical details
- `PostingDB` now maps `term -> {docid, {list of positions}}` using the `DUPSORT` feature. Due to a limitation of `DUPSORT` (the maximal value size is limited to ~500 bytes), huge lists of positions have to be stored elsewhere; they go into the new `PositionDB`.
- Keys inside the new `PositionDB` are roughly `md5sum(term+docid)+delta` (see `postingHash` in `postingdb.cpp`), and values are the overflow lists of positions. The single-byte parameter `delta` is used to resolve hash collisions and is stored inside `PostingDB` for such entries.
- `WriteTransaction` is gone; the memory is now mostly consumed by LMDB itself (which claims to be good at it, as it relies on kernel mechanisms to keep track of dirty pages and sync them to disk when necessary). All the DB operations were merged into `Transaction` and are performed on the fly.
- `PositionCodec` is gone; it was used to pack a huge `{list of {docid, {list of positions}}}` into a `QByteArray` and is not needed anymore.
- As positions are now stored in `PostingDB`, we can provide them in `PostingIterator` (fetching them only on demand, e.g. for phrase query lookups). As we no longer need to fetch and unpack huge entries, `PostingIterator` is now based on an LMDB cursor (NB: this limits the lifespan of the iterator to that of the transaction!).
- As `PostingDB` now manages two sub-DBs (`postingdb` and `positiondb`), the tests had to be adjusted. `SingleDBTest` was renamed to `DBTest` and can be used to test an arbitrary number of sub-DBs instead of a single one.
- `s_dbVersion` inside `migrator.cpp` has to be increased, and all users will have to perform a full reindex after the update. That is one of the big drawbacks of changing the DB scheme, but I hope it is worth it.
Testing
I tried to cover all the new code with tests (adapting and modifying existing tests where possible). Obviously, this is quite a lot of new code and thus prone to new bugs, so additional testing and suggestions for new test cases would be appreciated. So far I have been running the new DB on my setup and have not encountered any issues.
Benchmarking
Database size
"Useful" database size decreased roughly by the size of `PositionDB` (as expected). The freelist takes a slightly larger share (51% vs. 43% before), but the total DB size still decreased.
Before
```
File Size:          755,40 MiB
Used:               428,25 MiB
PostingDB:          133,43 MiB  31.157 %
PositionDB:         222,30 MiB  51.909 %
DocTerms:            66,67 MiB  15.568 %
DocFilenameTerms:     1,36 MiB   0.317 %
DocXattrTerms:        4,00 KiB   0.001 %
IdTree:             324,00 KiB   0.074 %
IdFileName:           1,48 MiB   0.347 %
DocTime:            836,00 KiB   0.191 %
DocData:              1,30 MiB   0.304 %
ContentIndexingDB:        0 B    0.000 %
FailedIdsDB:              0 B    0.000 %
MTimeDB:            588,00 KiB   0.134 %
```
After
```
File Size:          593,00 MiB
Used:               290,66 MiB
PostingDB:          189,98 MiB  65.364 %
PositionDB:          27,99 MiB   9.629 %
DocTerms:            66,83 MiB  22.993 %
DocFilenameTerms:     1,36 MiB   0.466 %
DocXattrTerms:        4,00 KiB   0.001 %
IdTree:             324,00 KiB   0.109 %
IdFileName:           1,48 MiB   0.511 %
DocTime:            836,00 KiB   0.281 %
DocData:              1,30 MiB   0.448 %
ContentIndexingDB:        0 B    0.000 %
FailedIdsDB:              0 B    0.000 %
MTimeDB:            588,00 KiB   0.198 %
```
Massif analysis
The total number of instructions decreased significantly (6.6e12 vs. 3.9e12).
Peak memory usage has not changed dramatically (259.5 MiB vs. 185.6 MiB); most of the memory is consumed by LMDB itself. In one snapshot, though, `WriteTransaction::m_pendingOperations` alone ate 85.3 MiB.
In the synthetic worst-case scenario (a single transaction with 40 x 10-MiB files of plain-text garbage), Baloo previously consumed 3.5 GiB of memory (of which only ~800 MiB was consumed by LMDB). With this change, Baloo eats only ~700 MiB (~90% of which is consumed by LMDB).
IO usage
Analyzed via `cat /proc/$(pidof baloo_file_extractor)/io`.
The amount of write I/O decreased almost by a factor of 2: from ~18.3 GiB down to ~9.7 GiB.
There are, however, claims that tracking memory-mapped I/O via procfs can be unreliable. I tested this with a small standalone program: `read_bytes` indeed seems quite random (explaining the "0" in the "before" numbers), but `write_bytes` looks reasonable.
Before
```
rchar: 9011275417
wchar: 18309277012
syscr: 69683499
syscw: 40091472
read_bytes: 0
write_bytes: 18264674304
cancelled_write_bytes: 0
```
After
```
rchar: 9008848172
wchar: 9700639294
syscr: 67337771
syscw: 37699360
read_bytes: 60801024
write_bytes: 9658175488
cancelled_write_bytes: 0
```