WIP: Optimized uint16 composite ops (!584)

The source project of this merge request has been removed.

Mathias Wein requested to merge (removed):optimized-uint16-compositeOps into master Nov 12, 2020

Since I'm tired of fighting banding with 8-bit painting, I decided to try optimizing the 16-bit integer composite ops.

This implements COMPOSITE_ALPHA_DARKEN and COMPOSITE_OVER for uint16. Currently it basically duplicates the 8-bit code, like the 32-bit float implementation already did.

Speaking of the float implementation, I also found out why its COMPOSITE_ALPHA_DARKEN was buggy, fixed it, and re-enabled it (which, however, improved the freehand stroke benchmark only marginally).

I think it's possible to unify the uint8 and uint16 code. I tried to move all necessary code differences into helper templates, but until I'm sure this code works correctly, I'll keep the two separate. One more thing: I also made COMPOSITE_OVER support alpha locking. It hardly adds any extra code and I think it's used fairly often, so why fall back to scalar code?
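To illustrate the helper-template idea, here is a minimal scalar sketch of how the uint8/uint16 differences could be isolated in a traits struct so that one "over" routine, including the alpha-lock branch, serves both depths. All names here (`ChannelTraits`, `mulUnit`, `divUnit`, `lerpUnit`, `compositeOver`) are hypothetical and not taken from this branch or from Krita; the pixel layout (three color channels followed by alpha, not premultiplied) is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical traits: everything that differs between channel depths.
template <typename T> struct ChannelTraits;

template <> struct ChannelTraits<std::uint8_t> {
    using Wide = std::uint32_t;            // holds a * b + rounding term
    static constexpr Wide unit = 0xFF;     // value representing 1.0
};

template <> struct ChannelTraits<std::uint16_t> {
    using Wide = std::uint64_t;            // generous headroom for 16-bit products
    static constexpr Wide unit = 0xFFFF;
};

// round(a * b / unit)
template <typename T>
T mulUnit(T a, T b) {
    using Tr = ChannelTraits<T>;
    using W = typename Tr::Wide;
    return static_cast<T>((W(a) * b + Tr::unit / 2) / Tr::unit);
}

// round(a * unit / b), valid for 0 < b and a <= b
template <typename T>
T divUnit(T a, T b) {
    using Tr = ChannelTraits<T>;
    using W = typename Tr::Wide;
    assert(b != 0 && a <= b);
    return static_cast<T>((W(a) * Tr::unit + b / 2) / b);
}

// a + round((b - a) * t / unit), in signed arithmetic so b < a cannot wrap
template <typename T>
T lerpUnit(T a, T b, T t) {
    const std::int64_t unit = static_cast<std::int64_t>(ChannelTraits<T>::unit);
    const std::int64_t d = (std::int64_t(b) - std::int64_t(a)) * t;
    const std::int64_t r = (d >= 0 ? d + unit / 2 : d - unit / 2) / unit;
    return static_cast<T>(a + r);
}

// Simplified "over" for one pixel: channels 0..2 are color, channel 3 is alpha.
template <typename T>
void compositeOver(const T* src, T* dst, T opacity, bool alphaLocked) {
    const T unit = static_cast<T>(ChannelTraits<T>::unit);
    const T srcAlpha = mulUnit<T>(src[3], opacity);
    if (srcAlpha == 0) return;             // fully transparent source: nothing to do

    T blend;
    if (alphaLocked || dst[3] == unit) {
        blend = srcAlpha;                  // destination alpha stays untouched
    } else if (dst[3] == 0) {
        dst[3] = srcAlpha;                 // empty destination: take the source as-is
        blend = unit;
    } else {
        const T newAlpha = static_cast<T>(
            dst[3] + mulUnit<T>(static_cast<T>(unit - dst[3]), srcAlpha));
        blend = divUnit<T>(srcAlpha, newAlpha);
        dst[3] = newAlpha;
    }
    for (int i = 0; i < 3; ++i)
        dst[i] = lerpUnit<T>(dst[i], src[i], blend);
}
```

The alpha-lock case only changes which blend factor is used and skips the alpha write, which is consistent with the observation that supporting it costs very little extra code. A vectorized version would apply the same selection per lane.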

Results on my "Haswell" i5 4670:

AlphaDarken	Legacy uint16	AVX2 uint16
Aligned Mask SrcRand DstRand	56 msec	7 msec
DstUnalig Mask SrcRand DstRand	56 msec	8 msec
SrcUnalig Mask SrcRand DstRand	57 msec	10 msec
Unaligned Mask SrcRand DstRand	56 msec	10 msec
Aligned NoMask SrcRand DstRand	37 msec	7 msec
Aligned NoMask SrcZero DstRand	36 msec	1 msec
Aligned NoMask SrcUnit DstRand	35 msec	7 msec
Aligned NoMask SrcRand DstZero	30 msec	6 msec
Aligned NoMask SrcZero DstZero	24 msec	1 msec
Aligned NoMask SrcUnit DstZero	29 msec	6 msec
Aligned NoMask SrcRand DstUnit	33 msec	7 msec
Aligned NoMask SrcZero DstUnit	33 msec	1 msec
Aligned NoMask SrcUnit DstUnit	33 msec	6 msec

Over	Legacy uint16	AVX2 uint16
Aligned Mask SrcRand DstRand	49 msec	8 msec
DstUnalig Mask SrcRand DstRand	49 msec	9 msec
SrcUnalig Mask SrcRand DstRand	49 msec	11 msec
Unaligned Mask SrcRand DstRand	49 msec	11 msec
Aligned NoMask SrcRand DstRand	41 msec	8 msec
Aligned NoMask SrcZero DstRand	4 msec	1 msec
Aligned NoMask SrcUnit DstRand	16 msec	1 msec
Aligned NoMask SrcRand DstZero	24 msec	5 msec
Aligned NoMask SrcZero DstZero	4 msec	1 msec
Aligned NoMask SrcUnit DstZero	7 msec	3 msec
Aligned NoMask SrcRand DstUnit	18 msec	6 msec
Aligned NoMask SrcZero DstUnit	4 msec	1 msec
Aligned NoMask SrcUnit DstUnit	6 msec	3 msec

I still need to add some tests to libs/ui/tests/FreehandStrokeBenchmark, but in contrast to fp32, I did actually measure some nice improvements on my last attempt, although to be honest, I'm still puzzled why I feel so little of it in Krita itself.

Test Plan

Well, I'm not actually sure about the necessary steps to ensure the code is numerically correct. I added a comparison test to KisCompositionBenchmark and the results seem to stay within the error margin, but I don't think that's proof enough that the rounding is correct.
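One way to get more confidence in the rounding than an error margin gives is to compare each integer primitive against an exact reference over a dense sample of its input domain. The sketch below does this for a 16-bit normalized multiply. It assumes a shift-based, division-free form (the 16-bit analogue of the well-known 8-bit trick); whether the code in this branch uses that form is not stated here, and `mulReference`, `mulFast`, and `countMismatches` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Exact reference: round(a * b / 65535) via 64-bit integer division.
inline std::uint16_t mulReference(std::uint16_t a, std::uint16_t b) {
    const std::uint64_t x = std::uint64_t(a) * b;
    return static_cast<std::uint16_t>((x + 32767u) / 65535u);
}

// Candidate under test: division-free variant using shifts.
// The largest intermediate, 65535 * 65535 + 0x8000 + 65535, still fits in 32 bits.
inline std::uint16_t mulFast(std::uint16_t a, std::uint16_t b) {
    const std::uint32_t t = std::uint32_t(a) * b + 0x8000u;
    return static_cast<std::uint16_t>((t + (t >> 16)) >> 16);
}

// Counts disagreements on a strided grid of inputs; step == 1 is exhaustive
// (2^32 pairs), larger steps trade coverage for run time.
inline std::uint64_t countMismatches(std::uint32_t step) {
    assert(step > 0);
    std::uint64_t mismatches = 0;
    for (std::uint32_t a = 0; a <= 0xFFFFu; a += step) {
        for (std::uint32_t b = 0; b <= 0xFFFFu; b += step) {
            const std::uint16_t x = static_cast<std::uint16_t>(a);
            const std::uint16_t y = static_cast<std::uint16_t>(b);
            if (mulFast(x, y) != mulReference(x, y)) ++mismatches;
        }
    }
    return mismatches;
}
```

A unit test could call `countMismatches` with a prime stride for a quick run and with `1` for a slow exhaustive run. The same pattern applies to a vectorized primitive by filling a buffer with the grid and comparing lane by lane against the scalar reference.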

Seeing a difference when painting is probably close to impossible unless it triggers some edge case where the arithmetic over- or underflows, but I have no idea how to provoke those.
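A possible way to provoke such edge cases outside of painting is to feed a blend primitive every combination of boundary values (0, 1, mid-range, and values at or near 65535) and compare against a floating-point reference. A wrapped-around result would then show up as an error in the tens of thousands rather than below one. The sketch below is only an illustration: `lerpU16`, `lerpReference`, and `maxBoundaryError` are hypothetical names, and the lerp shown is a stand-in for whatever primitive is actually under test.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Stand-in for the integer code under test: dst + (src - dst) * alpha / 65535,
// rounded to nearest. Signed 64-bit math keeps src < dst from wrapping around.
inline std::uint16_t lerpU16(std::uint16_t dst, std::uint16_t src, std::uint16_t alpha) {
    const std::int64_t d = (std::int64_t(src) - std::int64_t(dst)) * alpha;
    const std::int64_t r = (d >= 0 ? d + 32767 : d - 32767) / 65535;
    return static_cast<std::uint16_t>(dst + r);
}

// Floating-point reference for the same formula, without rounding.
inline double lerpReference(std::uint16_t dst, std::uint16_t src, std::uint16_t alpha) {
    return dst + (double(src) - double(dst)) * alpha / 65535.0;
}

// Largest absolute deviation from the reference over all boundary combinations.
inline double maxBoundaryError() {
    const std::uint16_t edge[] = {0, 1, 2, 32767, 32768, 65533, 65534, 65535};
    double worst = 0.0;
    for (std::uint16_t dst : edge) {
        for (std::uint16_t src : edge) {
            for (std::uint16_t alpha : edge) {
                const double err =
                    std::fabs(double(lerpU16(dst, src, alpha)) - lerpReference(dst, src, alpha));
                if (err > worst) worst = err;
            }
        }
    }
    return worst;
}
```

With correct rounding the deviation stays at or below 0.5, so a test can assert on that bound. The grid is small enough to extend with further suspicious values, and it complements a random-input comparison, which rarely hits exact boundaries.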

Formalities Checklist

I confirmed this builds.
I confirmed Krita ran and the relevant functions work.
I tested the relevant unit tests and can confirm they are not broken. (If not possible, don't hesitate to ask for help!)
I made sure my commits build individually and have good descriptions as per KDE guidelines.
I made sure my code conforms to the standards set in the HACKING file.
I can confirm the code is licensed and attributed appropriately, and that unattributed code is mine, as per KDE Licensing Policy.
