UDSEntry use two vectors to store fields value (!1261) · Merge requests · Frameworks / KIO

Méven Car requested to merge work/meven/uds-entry into master Apr 22, 2023

Use separate vectors to store number values and string values. sizeof UDSEntryPrivate grows from 4 to 7 bytes, but with each value in the entry, we save 1 byte previously unused (either saving sizeof QString or sizeof long long). Realistic entries have at least three entries which means we never increase memory consumption, but for any entry with more fields we save 1 byte per-field. This trickles down in lesser memory used, less allocations, slightly faster access, provided the two internal vectors are accessed in sequence, rather than alternating between number and string keys.

This shouldn't constitute a binary breakage since UDSEntry operator<< and operator>> are kept using the same format, allowing KIO to receive data from workers with a different implementation.

The format on the wire, could be optimized but probably best to wait for KF7...

!1593 (merged) has shown potential perf improvements.

# echo "0" | sudo tee /proc/sys/kernel/perf_event_paranoid // might need that to be allowed to use perf

After:

5900X


ninja && ./bin/udsentry_benchmark -perf -iterations 100

[2/2] Generating mo...
********* Start testing of UDSEntryBenchmark *********
Config: Using QtTest library 6.6.3, Qt 6.6.3 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 13.2.0), ubuntu 23.10
PASS   : UDSEntryBenchmark::initTestCase()
PASS   : UDSEntryBenchmark::createSmallEntries()
RESULT : UDSEntryBenchmark::createSmallEntries():
     1,159,602.63 nsecs per iteration (total: 115,960,264, iterations: 100)
     4,955,078.99 CPU cycles per iteration, 4,27 GHz (total: 495,507,899, iterations: 100)
     9,685,675.17 instructions per iteration, 1,955 instr/cycle (total: 968,567,518, iterations: 100)
     1,582,949.28 branch instructions per iteration, 1,37 G/sec (total: 158,294,928, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntries()
RESULT : UDSEntryBenchmark::createLargeEntries():
     1,265,403.48 nsecs per iteration (total: 126,540,349, iterations: 100)
     5,409,500.73 CPU cycles per iteration, 4,27 GHz (total: 540,950,073, iterations: 100)
     9,618,312.77 instructions per iteration, 1,778 instr/cycle (total: 961,831,278, iterations: 100)
     1,582,225.94 branch instructions per iteration, 1,25 G/sec (total: 158,222,595, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntriesOrderedInsert()
RESULT : UDSEntryBenchmark::createLargeEntriesOrderedInsert():
     839,336.48 nsecs per iteration (total: 83,933,649, iterations: 100)
     3,588,025.72 CPU cycles per iteration, 4,27 GHz (total: 358,802,572, iterations: 100)
     8,514,333.35 instructions per iteration, 2,373 instr/cycle (total: 851,433,336, iterations: 100)
     1,357,033.11 branch instructions per iteration, 1,62 G/sec (total: 135,703,311, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromSmallEntries()
RESULT : UDSEntryBenchmark::readFieldsFromSmallEntries():
     62,859,590.77 nsecs per iteration (total: 6,285,959,077, iterations: 100)
     268,718,709.69 CPU cycles per iteration, 4,27 GHz (total: 26,871,870,970, iterations: 100)
     581,958,970.88 instructions per iteration, 2,166 instr/cycle (total: 58,195,897,089, iterations: 100)
     96,994,734.48 branch instructions per iteration, 1,54 G/sec (total: 9,699,473,448, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromLargeEntries()
RESULT : UDSEntryBenchmark::readFieldsFromLargeEntries():
     832,654.18 nsecs per iteration (total: 83,265,419, iterations: 100)
     3,559,395.58 CPU cycles per iteration, 4,27 GHz (total: 355,939,558, iterations: 100)
     8,175,540.20 instructions per iteration, 2,297 instr/cycle (total: 817,554,020, iterations: 100)
     1,329,768.56 branch instructions per iteration, 1,6 G/sec (total: 132,976,856, iterations: 100)
PASS   : UDSEntryBenchmark::saveSmallEntries()
RESULT : UDSEntryBenchmark::saveSmallEntries():
     364,013.96 nsecs per iteration (total: 36,401,397, iterations: 100)
     1,556,021.93 CPU cycles per iteration, 4,27 GHz (total: 155,602,194, iterations: 100)
     5,541,680.74 instructions per iteration, 3,561 instr/cycle (total: 554,168,074, iterations: 100)
     1,142,040.22 branch instructions per iteration, 3,14 G/sec (total: 114,204,023, iterations: 100)
PASS   : UDSEntryBenchmark::saveLargeEntries()
RESULT : UDSEntryBenchmark::saveLargeEntries():
     182,727.39 nsecs per iteration (total: 18,272,739, iterations: 100)
     781,045.94 CPU cycles per iteration, 4,27 GHz (total: 78,104,595, iterations: 100)
     2,598,754.64 instructions per iteration, 3,327 instr/cycle (total: 259,875,464, iterations: 100)
     546,165.53 branch instructions per iteration, 2,99 G/sec (total: 54,616,553, iterations: 100)
PASS   : UDSEntryBenchmark::loadSmallEntries()
RESULT : UDSEntryBenchmark::loadSmallEntries():
     2,016,898.26 nsecs per iteration (total: 201,689,826, iterations: 100)
     8,621,364.68 CPU cycles per iteration, 4,27 GHz (total: 862,136,469, iterations: 100)
     19,689,953.07 instructions per iteration, 2,284 instr/cycle (total: 1,968,995,308, iterations: 100)
     3,513,732.75 branch instructions per iteration, 1,74 G/sec (total: 351,373,275, iterations: 100)
PASS   : UDSEntryBenchmark::loadLargeEntries()
RESULT : UDSEntryBenchmark::loadLargeEntries():
     1,453,370.78 nsecs per iteration (total: 145,337,078, iterations: 100)
     6,212,990.17 CPU cycles per iteration, 4,27 GHz (total: 621,299,018, iterations: 100)
     14,965,408.40 instructions per iteration, 2,409 instr/cycle (total: 1,496,540,840, iterations: 100)
     2,635,884.45 branch instructions per iteration, 1,81 G/sec (total: 263,588,445, iterations: 100)
PASS   : UDSEntryBenchmark::cleanupTestCase()
Totals: 11 passed, 0 failed, 0 skipped, 0 blacklisted, 9255ms
********* Finished testing of UDSEntryBenchmark *********

Laptop i7-8550U

ninja && ./bin/udsentry_benchmark -perf -iterations 100

[0/2] Re-checking globbed directories...
[89/89] Linking CXX executable bin/kioworkertest
********* Start testing of UDSEntryBenchmark *********
Config: Using QtTest library 6.7.0, Qt 6.7.0 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 13.2.1 20230801), arch unknown
PASS   : UDSEntryBenchmark::initTestCase()
PASS   : UDSEntryBenchmark::createSmallEntries()
RESULT : UDSEntryBenchmark::createSmallEntries():
     166,801.05 nsecs per iteration (total: 16,680,106, iterations: 100)
     537,359.15 CPU cycles per iteration, 3,22 GHz (total: 53,735,915, iterations: 100)
     1,141,490.02 instructions per iteration, 2,124 instr/cycle (total: 114,149,002, iterations: 100)
     248,958.19 branch instructions per iteration, 1,49 G/sec (total: 24,895,819, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntries()
RESULT : UDSEntryBenchmark::createLargeEntries():
     417,137.01 nsecs per iteration (total: 41,713,701, iterations: 100)
     1,402,748.21 CPU cycles per iteration, 3,36 GHz (total: 140,274,822, iterations: 100)
     1,191,994.30 instructions per iteration, 0,850 instr/cycle (total: 119,199,430, iterations: 100)
     232,768.70 branch instructions per iteration, 558 M/sec (total: 23,276,870, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntriesOrderedInsert()
RESULT : UDSEntryBenchmark::createLargeEntriesOrderedInsert():
     58,369.93 nsecs per iteration (total: 5,836,993, iterations: 100)
     163,292.70 CPU cycles per iteration, 2,8 GHz (total: 16,329,270, iterations: 100)
     254,943.33 instructions per iteration, 1,561 instr/cycle (total: 25,494,334, iterations: 100)
     55,565.65 branch instructions per iteration, 952 M/sec (total: 5,556,565, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromSmallEntries()
RESULT : UDSEntryBenchmark::readFieldsFromSmallEntries():
     8,088,293.83 nsecs per iteration (total: 808,829,383, iterations: 100)
     27,690,400.64 CPU cycles per iteration, 3,42 GHz (total: 2,769,040,064, iterations: 100)
     66,044,096.54 instructions per iteration, 2,385 instr/cycle (total: 6,604,409,655, iterations: 100)
     17,971,682.62 branch instructions per iteration, 2,22 G/sec (total: 1,797,168,263, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromLargeEntries()
RESULT : UDSEntryBenchmark::readFieldsFromLargeEntries():
     138,110.02 nsecs per iteration (total: 13,811,003, iterations: 100)
     454,456.53 CPU cycles per iteration, 3,29 GHz (total: 45,445,654, iterations: 100)
     1,103,337.28 instructions per iteration, 2,428 instr/cycle (total: 110,333,728, iterations: 100)
     289,391.50 branch instructions per iteration, 2,1 G/sec (total: 28,939,150, iterations: 100)
PASS   : UDSEntryBenchmark::saveSmallEntries()
RESULT : UDSEntryBenchmark::saveSmallEntries():
     413,747.84 nsecs per iteration (total: 41,374,785, iterations: 100)
     1,395,015.27 CPU cycles per iteration, 3,37 GHz (total: 139,501,527, iterations: 100)
     4,331,349.26 instructions per iteration, 3,105 instr/cycle (total: 433,134,927, iterations: 100)
     908,005.03 branch instructions per iteration, 2,19 G/sec (total: 90,800,503, iterations: 100)
PASS   : UDSEntryBenchmark::saveLargeEntries()
RESULT : UDSEntryBenchmark::saveLargeEntries():
     291,217.21 nsecs per iteration (total: 29,121,721, iterations: 100)
     828,960.60 CPU cycles per iteration, 2,85 GHz (total: 82,896,061, iterations: 100)
     2,225,414.35 instructions per iteration, 2,685 instr/cycle (total: 222,541,436, iterations: 100)
     471,001.08 branch instructions per iteration, 1,62 G/sec (total: 47,100,108, iterations: 100)
PASS   : UDSEntryBenchmark::loadSmallEntries()
RESULT : UDSEntryBenchmark::loadSmallEntries():
     874,512.19 nsecs per iteration (total: 87,451,220, iterations: 100)
     2,944,183.77 CPU cycles per iteration, 3,37 GHz (total: 294,418,378, iterations: 100)
     8,356,329.86 instructions per iteration, 2,838 instr/cycle (total: 835,632,986, iterations: 100)
     1,696,220.33 branch instructions per iteration, 1,94 G/sec (total: 169,622,033, iterations: 100)
PASS   : UDSEntryBenchmark::loadLargeEntries()
RESULT : UDSEntryBenchmark::loadLargeEntries():
     523,188.95 nsecs per iteration (total: 52,318,895, iterations: 100)
     1,802,280.59 CPU cycles per iteration, 3,44 GHz (total: 180,228,059, iterations: 100)
     4,638,708.87 instructions per iteration, 2,574 instr/cycle (total: 463,870,887, iterations: 100)
     966,316.32 branch instructions per iteration, 1,85 G/sec (total: 96,631,633, iterations: 100)
PASS   : UDSEntryBenchmark::cleanupTestCase()
Totals: 11 passed, 0 failed, 0 skipped, 0 blacklisted, 1717ms
********* Finished testing of UDSEntryBenchmark *********

Before:

5900X

ninja && ./bin/udsentry_benchmark -perf -iterations 100
[0/2] Re-checking globbed directories...
[2/2] Generating mo...
********* Start testing of UDSEntryBenchmark *********
Config: Using QtTest library 6.6.3, Qt 6.6.3 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 13.2.0), ubuntu 23.10
PASS   : UDSEntryBenchmark::initTestCase()
PASS   : UDSEntryBenchmark::createSmallEntries()
RESULT : UDSEntryBenchmark::createSmallEntries():
     1,193,799.26 nsecs per iteration (total: 119,379,926, iterations: 100)
     5,102,949.70 CPU cycles per iteration, 4,27 GHz (total: 510,294,971, iterations: 100)
     10,040,838.97 instructions per iteration, 1,968 instr/cycle (total: 1,004,083,897, iterations: 100)
     1,607,006.51 branch instructions per iteration, 1,35 G/sec (total: 160,700,651, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntries()
RESULT : UDSEntryBenchmark::createLargeEntries():
     1,047,783.89 nsecs per iteration (total: 104,778,389, iterations: 100)
     4,479,149.37 CPU cycles per iteration, 4,27 GHz (total: 447,914,937, iterations: 100)
     10,339,603.36 instructions per iteration, 2,308 instr/cycle (total: 1,033,960,337, iterations: 100)
     1,636,707.39 branch instructions per iteration, 1,56 G/sec (total: 163,670,740, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntriesOrderedInsert()
RESULT : UDSEntryBenchmark::createLargeEntriesOrderedInsert():
     1,078,283.10 nsecs per iteration (total: 107,828,310, iterations: 100)
     4,609,502.82 CPU cycles per iteration, 4,27 GHz (total: 460,950,282, iterations: 100)
     10,339,785.68 instructions per iteration, 2,243 instr/cycle (total: 1,033,978,569, iterations: 100)
     1,636,722.47 branch instructions per iteration, 1,52 G/sec (total: 163,672,248, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromSmallEntries()
RESULT : UDSEntryBenchmark::readFieldsFromSmallEntries():
     71,734,443.40 nsecs per iteration (total: 7,173,444,340, iterations: 100)
     306,658,822.22 CPU cycles per iteration, 4,27 GHz (total: 30,665,882,222, iterations: 100)
     651,080,571.29 instructions per iteration, 2,123 instr/cycle (total: 65,108,057,130, iterations: 100)
     107,599,038.31 branch instructions per iteration, 1,5 G/sec (total: 10,759,903,831, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromLargeEntries()
RESULT : UDSEntryBenchmark::readFieldsFromLargeEntries():
     928,909.26 nsecs per iteration (total: 92,890,926, iterations: 100)
     3,970,980.41 CPU cycles per iteration, 4,27 GHz (total: 397,098,042, iterations: 100)
     10,011,983.00 instructions per iteration, 2,521 instr/cycle (total: 1,001,198,301, iterations: 100)
     1,618,245.37 branch instructions per iteration, 1,74 G/sec (total: 161,824,538, iterations: 100)
PASS   : UDSEntryBenchmark::saveSmallEntries()
RESULT : UDSEntryBenchmark::saveSmallEntries():
     391,606.58 nsecs per iteration (total: 39,160,658, iterations: 100)
     1,673,987.23 CPU cycles per iteration, 4,27 GHz (total: 167,398,724, iterations: 100)
     5,492,048.00 instructions per iteration, 3,281 instr/cycle (total: 549,204,801, iterations: 100)
     1,140,941.55 branch instructions per iteration, 2,91 G/sec (total: 114,094,155, iterations: 100)
PASS   : UDSEntryBenchmark::saveLargeEntries()
RESULT : UDSEntryBenchmark::saveLargeEntries():
     162,880.67 nsecs per iteration (total: 16,288,068, iterations: 100)
     696,197.93 CPU cycles per iteration, 4,27 GHz (total: 69,619,794, iterations: 100)
     2,594,184.04 instructions per iteration, 3,726 instr/cycle (total: 259,418,404, iterations: 100)
     545,720.23 branch instructions per iteration, 3,35 G/sec (total: 54,572,024, iterations: 100)
PASS   : UDSEntryBenchmark::loadSmallEntries()
RESULT : UDSEntryBenchmark::loadSmallEntries():
     1,847,275.45 nsecs per iteration (total: 184,727,546, iterations: 100)
     7,896,793.91 CPU cycles per iteration, 4,27 GHz (total: 789,679,392, iterations: 100)
     18,671,291.28 instructions per iteration, 2,364 instr/cycle (total: 1,867,129,128, iterations: 100)
     3,328,595.33 branch instructions per iteration, 1,8 G/sec (total: 332,859,533, iterations: 100)
PASS   : UDSEntryBenchmark::loadLargeEntries()
RESULT : UDSEntryBenchmark::loadLargeEntries():
     1,562,225.62 nsecs per iteration (total: 156,222,563, iterations: 100)
     6,678,393.91 CPU cycles per iteration, 4,27 GHz (total: 667,839,391, iterations: 100)
     16,080,917.90 instructions per iteration, 2,408 instr/cycle (total: 1,608,091,790, iterations: 100)
     2,810,766.89 branch instructions per iteration, 1,8 G/sec (total: 281,076,689, iterations: 100)
PASS   : UDSEntryBenchmark::cleanupTestCase()
Totals: 11 passed, 0 failed, 0 skipped, 0 blacklisted, 10237ms
********* Finished testing of UDSEntryBenchmark *********

Laptop i7-8550U

 ninja && ./bin/udsentry_benchmark -perf -iterations 100

[0/2] Re-checking globbed directories...
[2/2] Generating mo...
********* Start testing of UDSEntryBenchmark *********
Config: Using QtTest library 6.7.0, Qt 6.7.0 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 13.2.1 20230801), arch unknown
PASS   : UDSEntryBenchmark::initTestCase()
PASS   : UDSEntryBenchmark::createSmallEntries()
RESULT : UDSEntryBenchmark::createSmallEntries():
     183,638.58 nsecs per iteration (total: 18,363,859, iterations: 100)
     528,858.76 CPU cycles per iteration, 2,88 GHz (total: 52,885,876, iterations: 100)
     1,104,844.93 instructions per iteration, 2,089 instr/cycle (total: 110,484,494, iterations: 100)
     212,139.39 branch instructions per iteration, 1,16 G/sec (total: 21,213,940, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntries()
RESULT : UDSEntryBenchmark::createLargeEntries():
     81,317.89 nsecs per iteration (total: 8,131,790, iterations: 100)
     229,731.54 CPU cycles per iteration, 2,83 GHz (total: 22,973,154, iterations: 100)
     291,120.51 instructions per iteration, 1,267 instr/cycle (total: 29,112,051, iterations: 100)
     53,566.62 branch instructions per iteration, 659 M/sec (total: 5,356,663, iterations: 100)
PASS   : UDSEntryBenchmark::createLargeEntriesOrderedInsert()
RESULT : UDSEntryBenchmark::createLargeEntriesOrderedInsert():
     953,192.30 nsecs per iteration (total: 95,319,230, iterations: 100)
     2,724,595.39 CPU cycles per iteration, 2,86 GHz (total: 272,459,539, iterations: 100)
     989,756.75 instructions per iteration, 0,363 instr/cycle (total: 98,975,675, iterations: 100)
     188,525.51 branch instructions per iteration, 198 M/sec (total: 18,852,551, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromSmallEntries()
RESULT : UDSEntryBenchmark::readFieldsFromSmallEntries():
     10,193,867.15 nsecs per iteration (total: 1,019,386,715, iterations: 100)
     31,687,357.92 CPU cycles per iteration, 3,11 GHz (total: 3,168,735,792, iterations: 100)
     71,854,155.25 instructions per iteration, 2,268 instr/cycle (total: 7,185,415,525, iterations: 100)
     19,273,659.62 branch instructions per iteration, 1,89 G/sec (total: 1,927,365,962, iterations: 100)
PASS   : UDSEntryBenchmark::readFieldsFromLargeEntries()
RESULT : UDSEntryBenchmark::readFieldsFromLargeEntries():
     220,510.70 nsecs per iteration (total: 22,051,070, iterations: 100)
     705,320.81 CPU cycles per iteration, 3,2 GHz (total: 70,532,081, iterations: 100)
     1,199,381.87 instructions per iteration, 1,700 instr/cycle (total: 119,938,188, iterations: 100)
     328,167.23 branch instructions per iteration, 1,49 G/sec (total: 32,816,724, iterations: 100)
PASS   : UDSEntryBenchmark::saveSmallEntries()
RESULT : UDSEntryBenchmark::saveSmallEntries():
     477,174.66 nsecs per iteration (total: 47,717,467, iterations: 100)
     1,476,174.48 CPU cycles per iteration, 3,09 GHz (total: 147,617,449, iterations: 100)
     4,337,759.12 instructions per iteration, 2,939 instr/cycle (total: 433,775,912, iterations: 100)
     917,090.82 branch instructions per iteration, 1,92 G/sec (total: 91,709,083, iterations: 100)
PASS   : UDSEntryBenchmark::saveLargeEntries()
RESULT : UDSEntryBenchmark::saveLargeEntries():
     253,218.85 nsecs per iteration (total: 25,321,886, iterations: 100)
     756,541.94 CPU cycles per iteration, 2,99 GHz (total: 75,654,195, iterations: 100)
     2,226,594.93 instructions per iteration, 2,943 instr/cycle (total: 222,659,494, iterations: 100)
     472,052.79 branch instructions per iteration, 1,86 G/sec (total: 47,205,280, iterations: 100)
PASS   : UDSEntryBenchmark::loadSmallEntries()
RESULT : UDSEntryBenchmark::loadSmallEntries():
     822,340.18 nsecs per iteration (total: 82,234,018, iterations: 100)
     2,763,434.85 CPU cycles per iteration, 3,36 GHz (total: 276,343,485, iterations: 100)
     8,075,257.50 instructions per iteration, 2,922 instr/cycle (total: 807,525,751, iterations: 100)
     1,609,203.72 branch instructions per iteration, 1,96 G/sec (total: 160,920,373, iterations: 100)
PASS   : UDSEntryBenchmark::loadLargeEntries()
RESULT : UDSEntryBenchmark::loadLargeEntries():
     571,344.43 nsecs per iteration (total: 57,134,443, iterations: 100)
     1,927,061.64 CPU cycles per iteration, 3,37 GHz (total: 192,706,165, iterations: 100)
     4,791,910.76 instructions per iteration, 2,487 instr/cycle (total: 479,191,077, iterations: 100)
     988,709.07 branch instructions per iteration, 1,73 G/sec (total: 98,870,908, iterations: 100)
PASS   : UDSEntryBenchmark::cleanupTestCase()
Totals: 11 passed, 0 failed, 0 skipped, 0 blacklisted, 2043ms
********* Finished testing of UDSEntryBenchmark *********

What the benchmark shows, is the typical UDSEntry creation is about the same. For big UDSentry, it is a bit worse except if the keys/values are inserted in type order then the new UDSEntry is faster.

Read access is always better, thanks to smaller vector to go through.

Saving is better for small entries and slightly worse for big entries due to keeping the same format when we could optimize it for the new data structure. Loading smallEntries is slightly worse due to the format not being optimized not compensating for the overhead of having two vectors to fill. The loadLargeEntries is better because the save inserts first strings then numbers, the load then can leverage better CPU cache access and compensate for the overhead.

The benchmark were not run using setting QT_NO_DEBUG, meaning the asserts were used.

Overall I think that's a good trade-off and its leaves small optimization on the table for KF7.

Edited Apr 19, 2024 by Méven Car

UDSEntry use two vectors to store fields value

Merge request reports