WIP: Optimized uint16 composite ops
Since I'm tired of fighting banding with 8-bit painting, I decided to try optimizing the 16-bit integer composite ops.
This implements COMPOSITE_ALPHA_DARKEN and COMPOSITE_OVER for uint16. Currently it basically duplicates the 8-bit code, like the 32-bit float implementation already did.
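For readers who want a baseline to compare the integer paths against, here is a minimal scalar sketch of a straight-alpha "over" in normalized float. It is the textbook formula, not Krita's actual code; `PixelF`, `lerpF` and `overReference` are hypothetical names, and the `alphaLocked` flag simply models "keep the destination alpha unchanged".

```cpp
// Hypothetical reference pixel: straight (non-premultiplied) RGBA in [0, 1].
struct PixelF {
    float r, g, b, a;
};

// Linear interpolation from a to b by t.
inline float lerpF(float a, float b, float t)
{
    return a + (b - a) * t;
}

// Textbook source-over for straight alpha (assumption: not Krita's code).
// opacity scales the source alpha; alphaLocked keeps dst.a untouched.
inline PixelF overReference(const PixelF &src, const PixelF &dst,
                            float opacity, bool alphaLocked)
{
    const float srcA = src.a * opacity;

    // Fully transparent source: nothing to blend.
    if (srcA <= 0.0f) {
        return dst;
    }

    if (alphaLocked) {
        // A fully transparent destination stays transparent when locked.
        if (dst.a <= 0.0f) {
            return dst;
        }
        return { lerpF(dst.r, src.r, srcA),
                 lerpF(dst.g, src.g, srcA),
                 lerpF(dst.b, src.b, srcA),
                 dst.a };
    }

    // outA > 0 here because srcA > 0.
    const float outA = srcA + dst.a * (1.0f - srcA);
    const float w = srcA / outA; // weight of the source color
    return { lerpF(dst.r, src.r, w),
             lerpF(dst.g, src.g, w),
             lerpF(dst.b, src.b, w),
             outA };
}
```

Scaling the float result by 65535 and rounding gives a value the uint16 output can be compared against within a small tolerance.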
Speaking of the float implementation, I also found out why its COMPOSITE_ALPHA_DARKEN was buggy, fixed it, and re-enabled it (although that improved the freehand stroke benchmark only marginally).
I think it's possible to unify the uint8 and uint16 code. I tried to move all necessary code differences into helper templates, but until I'm sure this code works correctly, I'll keep the two separate. One more thing: I also made COMPOSITE_OVER support alpha locking. It hardly adds any extra code and I think it's used fairly often, so why fall back to scalar code?
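To illustrate the kind of helper template that could carry the uint8/uint16 differences, here is a scalar sketch. `ChannelTraits` and `mulNormalized` are hypothetical names made up for this example, not the templates in this MR; the real code operates on SIMD vectors, but the bit-width-dependent constants would be factored out the same way.

```cpp
#include <cstdint>

// Hypothetical traits: everything that differs between the two depths.
template <typename T>
struct ChannelTraits;

template <>
struct ChannelTraits<std::uint8_t> {
    using wide_type = std::uint32_t;          // holds a * b + rounding term
    static constexpr wide_type unit = 0xFFu;  // value representing 1.0
    static constexpr int bits = 8;
};

template <>
struct ChannelTraits<std::uint16_t> {
    using wide_type = std::uint32_t;
    static constexpr wide_type unit = 0xFFFFu;
    static constexpr int bits = 16;
};

// Rounded a * b / unit without a division, shared by both depths.
// Uses the usual "add half, add the high part, shift" trick.
template <typename T>
inline T mulNormalized(T a, T b)
{
    using Tr = ChannelTraits<T>;
    using W = typename Tr::wide_type;
    const W t = W(a) * W(b) + (Tr::unit + 1) / 2;
    return T((t + (t >> Tr::bits)) >> Tr::bits);
}
```

For example, `mulNormalized<std::uint16_t>(0xFFFF, 0x1234)` yields `0x1234`, since multiplying by the unit value is the identity in both instantiations.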
Results on my "Haswell" i5 4670:
AlphaDarken | Legacy uint16 | AVX2 uint16 |
---|---|---|
Aligned Mask SrcRand DstRand | 56 msec | 7 msec |
DstUnalig Mask SrcRand DstRand | 56 msec | 8 msec |
SrcUnalig Mask SrcRand DstRand | 57 msec | 10 msec |
Unaligned Mask SrcRand DstRand | 56 msec | 10 msec |
Aligned NoMask SrcRand DstRand | 37 msec | 7 msec |
Aligned NoMask SrcZero DstRand | 36 msec | 1 msec |
Aligned NoMask SrcUnit DstRand | 35 msec | 7 msec |
Aligned NoMask SrcRand DstZero | 30 msec | 6 msec |
Aligned NoMask SrcZero DstZero | 24 msec | 1 msec |
Aligned NoMask SrcUnit DstZero | 29 msec | 6 msec |
Aligned NoMask SrcRand DstUnit | 33 msec | 7 msec |
Aligned NoMask SrcZero DstUnit | 33 msec | 1 msec |
Aligned NoMask SrcUnit DstUnit | 33 msec | 6 msec |

Over | Legacy uint16 | AVX2 uint16 |
---|---|---|
Aligned Mask SrcRand DstRand | 49 msec | 8 msec |
DstUnalig Mask SrcRand DstRand | 49 msec | 9 msec |
SrcUnalig Mask SrcRand DstRand | 49 msec | 11 msec |
Unaligned Mask SrcRand DstRand | 49 msec | 11 msec |
Aligned NoMask SrcRand DstRand | 41 msec | 8 msec |
Aligned NoMask SrcZero DstRand | 4 msec | 1 msec |
Aligned NoMask SrcUnit DstRand | 16 msec | 1 msec |
Aligned NoMask SrcRand DstZero | 24 msec | 5 msec |
Aligned NoMask SrcZero DstZero | 4 msec | 1 msec |
Aligned NoMask SrcUnit DstZero | 7 msec | 3 msec |
Aligned NoMask SrcRand DstUnit | 18 msec | 6 msec |
Aligned NoMask SrcZero DstUnit | 4 msec | 1 msec |
Aligned NoMask SrcUnit DstUnit | 6 msec | 3 msec |
I still need to add some tests to libs/ui/tests/FreehandStrokeBenchmark. In contrast to fp32, I did actually measure some nice improvements on my last attempt, although to be honest I'm still puzzled why I notice so little of it in Krita itself.
Test Plan
Well, I'm not actually sure about the necessary steps to ensure the code is numerically correct. I added a comparison test to KisCompositionBenchmark and the results seem to stay within the error margin, but I don't think that's proof enough that the rounding is correct.
Seeing a difference when painting is probably close to impossible unless it triggers some edge case where it over- or underflows, but I have no idea how to provoke those.
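One possible way to go beyond an error margin is to compare each integer building block against an exact reference over a dense sweep plus hand-picked boundary values. The sketch below does this for a shift-based rounded multiply; `mulShift`, `mulReference` and `countMismatches` are hypothetical names for illustration, and the candidate is an assumed stand-in for whatever the vector code computes per lane.

```cpp
#include <cstdint>
#include <initializer_list>

// Candidate under test (assumed stand-in for one SIMD lane):
// rounded a * b / 65535 using shifts only.
inline std::uint16_t mulShift(std::uint16_t a, std::uint16_t b)
{
    const std::uint32_t t = std::uint32_t(a) * b + 0x8000u;
    return std::uint16_t((t + (t >> 16)) >> 16);
}

// Exact round-to-nearest of a * b / 65535 in 64-bit integer math.
inline std::uint16_t mulReference(std::uint16_t a, std::uint16_t b)
{
    const std::uint64_t x = std::uint64_t(a) * b;
    return std::uint16_t((2 * x + 65535u) / (2 * 65535u));
}

// Counts inputs where candidate and reference disagree.
// Sweeps every strideA-th value of a against all values of b,
// then the full cross product of a few boundary values.
inline std::uint64_t countMismatches(unsigned strideA)
{
    if (strideA == 0) {
        strideA = 1; // avoid an endless loop
    }
    std::uint64_t mismatches = 0;
    for (std::uint32_t a = 0; a <= 0xFFFFu; a += strideA) {
        for (std::uint32_t b = 0; b <= 0xFFFFu; ++b) {
            if (mulShift(std::uint16_t(a), std::uint16_t(b)) !=
                mulReference(std::uint16_t(a), std::uint16_t(b))) {
                ++mismatches;
            }
        }
    }
    // Values where overflow or off-by-one errors tend to show up.
    for (std::uint16_t a : {0, 1, 0x7FFF, 0x8000, 0xFFFE, 0xFFFF}) {
        for (std::uint16_t b : {0, 1, 0x7FFF, 0x8000, 0xFFFE, 0xFFFF}) {
            if (mulShift(a, b) != mulReference(a, b)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

A stride of 1 covers all 2^32 input pairs and is still feasible as a one-off run; a larger stride keeps it cheap enough for a regular unit test. The same pattern would apply to blend and division helpers, where the boundary list should also include a zero divisor.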
Formalities Checklist
- I confirmed this builds.
- I confirmed Krita ran and the relevant functions work.
- I tested the relevant unit tests and can confirm they are not broken. (If not possible, don't hesitate to ask for help!)
- I made sure my commits build individually and have good descriptions as per KDE guidelines.
- I made sure my code conforms to the standards set in the HACKING file.
- I can confirm the code is licensed and attributed appropriately, and that unattributed code is mine, as per KDE Licensing Policy.