Optimize DUChainReferenceCounting's fast path with Q_UNLIKELY

In practice reference counting is enabled on any interval only in small
code fragments. shouldDoDUChainReferenceCounting() is called only in
assertions (which are disabled in Release mode) and in places that lock
a mutex if it returns true. The performance penalty of a wrong
Q_UNLIKELY prediction should be small in comparison to the mutex lock.

Comparison of the generated assembly code with and without Q_UNLIKELY
indicates that this macro helps the compiler order the code in a
way optimized for the more common path. Specifically, in the case of
~IndexedString() (in which DUChainReferenceCounting::shouldDo() is
inlined), Q_UNLIKELY eliminates a jump from the fast path: a `jne`
instruction replaces `je` and the fast-path code is moved up.
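
For illustration, a minimal self-contained sketch of the hinted loop
exit condition follows. It is not the actual KDevelop code: the names
SKETCH_UNLIKELY, Interval, RefCountingSketch, capacity and enable() are
hypothetical, and the macro is only a stand-in that assumes the
GCC/Clang __builtin_expect builtin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for Q_UNLIKELY so that the sketch builds without Qt.
// Assumption: GCC/Clang builtin; a plain pass-through elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_UNLIKELY(expr) __builtin_expect(!!(expr), false)
#else
#  define SKETCH_UNLIKELY(expr) (expr)
#endif

// Hypothetical half-open address range [start, start + size).
struct Interval {
    std::uintptr_t start = 0;
    std::size_t size = 0;

    bool contains(const void* item) const {
        const auto p = reinterpret_cast<std::uintptr_t>(item);
        return p >= start && p - start < size;
    }
};

// Hypothetical container of the intervals with enabled counting.
struct RefCountingSketch {
    static constexpr std::size_t capacity = 4;
    Interval intervals[capacity];
    std::size_t count = 0;

    void enable(const void* start, std::size_t size) {
        assert(count < capacity);
        intervals[count].start = reinterpret_cast<std::uintptr_t>(start);
        intervals[count].size = size;
        ++count;
    }

    bool shouldDo(const void* item) const {
        // count is almost always 0, so entering the loop body is the
        // unlikely case; the hint sits in the loop exit condition.
        for (std::size_t i = 0; SKETCH_UNLIKELY(i != count); ++i) {
            if (intervals[i].contains(item))
                return true;
        }
        return false;
    }
};
```

With no interval enabled, shouldDo() falls straight through to
`return false`, which is the path the hint asks the compiler to favor.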

The following table compares performance of the previous, current and
considered alternative implementations in the affected benchmarks. The
numbers denote milliseconds per iteration. These numbers are minimums,
not averages. The 9031ee87 commit message specifies the meaning of the
columns in the table legend and the methodology of benchmarking in the
paragraph that starts with
"Each number in this and the older commit message's table".

version\benchmark       qhash   create  destroy  shouldDo(-)  shouldDo(+)
previous commit         0.95    143     75       70           171
this commit             0.84    141     74       70           247
redundant if            0.84    141     74       70           247
inline variable         0.84    141     74       70           247
static data member      0.93    143     79       141          353
extern variable         0.93    143     79       141          353
internal linkage        1.1     145     100      565          742
non-inline f-l static   1.1     146     96       565          742
std::any_of             0.85    142     74       70           565
Versions:
redundant if            - a more verbose Q_UNLIKELY optimization:
                          if (Q_UNLIKELY(count != 0)) {
                              for (std::size_t i = 0; i != count; ++i) {
                                  ...
                          Milian suggested moving Q_UNLIKELY into the
                          loop exit condition to eliminate the redundant
                          `if`, but only after I had benchmarked all
                          alternative implementations with the `if`. The
                          benchmark performance and the generated
                          assembly code on the fast path remained the
                          same.
inline variable         - "redundant if", but with inline thread_local
                          variable in the header instead of instance()
static data member      - "redundant if", but with instance as a static
                          data member, not a static member function
extern variable         - "redundant if", but with extern thread_local
                          variable in the header instead of instance()
internal linkage        - "redundant if", but with static thread_local
                          variable in the cpp file instead of instance()
                          and shouldDoDUChainReferenceCounting() defined
                          in the cpp file
non-inline f-l static   - "redundant if", but with instance() defined in
                          the cpp file
std::any_of             - "redundant if", but with shouldDo()
                          reimplemented as follows:
    if (Q_LIKELY(count == 0)) return false;
    return std::any_of(intervals, intervals + count,
        [item](Interval interval) { return interval.contains(item); });

Now that the constructor of DUChainReferenceCounting is constexpr, the
performance of the inline variable is exactly the same as that of the
inline function with a function-local static variable inside. I think we
should keep the current version, because it would probably be faster if
the constructor becomes non-constexpr in the future. In addition, lazy
initialization of the thread_local instance variable is preferable and
should be enforced, because most threads never use
DUChainReferenceCounting.
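
The lazy initialization argument can be sketched as follows. This is
an illustration only, not the KDevelop implementation: CountingSketch,
constructionCount() and constructionsCausedByThread() are hypothetical
names, and the constructor is deliberately non-constexpr so that
constructions become observable.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Counts constructions across all threads (illustration only).
inline std::atomic<int>& constructionCount() {
    static std::atomic<int> n{0};
    return n;
}

struct CountingSketch {
    // Deliberately non-constexpr: dynamic initialization is required.
    CountingSketch() { ++constructionCount(); }
    std::size_t count = 0;
};

// Inline function with a function-local thread_local static: the
// object is constructed when a thread first calls instance().
inline CountingSketch& instance() {
    static thread_local CountingSketch s;
    return s;
}

// Runs one thread that either uses instance() or does not, and
// returns how many constructions that thread caused.
inline int constructionsCausedByThread(bool useInstance) {
    const int before = constructionCount().load();
    std::thread t([useInstance] {
        if (useInstance)
            instance().count = 1;
    });
    t.join();
    const int after = constructionCount().load();
    assert(after >= before);
    return after - before;
}
```

A thread that never calls instance() causes no construction, while a
thread that calls it causes exactly one.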