WIP: Optimized uint16 composite ops

Mathias Wein requested to merge (removed):optimized-uint16-compositeOps into master

Since I'm tired of fighting banding with 8-bit painting, I decided to try to optimize the 16-bit integer composite ops.

This implements COMPOSITE_ALPHA_DARKEN and COMPOSITE_OVER for uint16. Currently it basically duplicates the 8-bit code, like the 32-bit float implementation already did.

Speaking of the float implementation, I also found out why COMPOSITE_ALPHA_DARKEN was buggy there, fixed it, and re-enabled it (which, however, improved the freehand stroke benchmark only marginally).

I think it's possible to unify the uint8 and uint16 code. I tried to move all necessary code differences into helper templates, but until I'm sure this code works correctly, I'll keep it separate. One more thing: I also made COMPOSITE_OVER support alpha locking. It hardly adds any extra code and I think it's used fairly often, so why fall back to scalar code?
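To illustrate the helper-template idea, here is a minimal scalar sketch, not the AVX2 code from this merge request. `ChannelTraits`, `compositeOverPixel` and the RGBA channel order are hypothetical names and assumptions made up for this example. The blend formula is the usual non-premultiplied "over". With alpha locking, the destination alpha is left untouched and the source alpha is used directly as the blend factor.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: the only thing that differs between uint8 and uint16
// in this sketch is the unit value used for normalization.
template <typename T> struct ChannelTraits;
template <> struct ChannelTraits<std::uint8_t>  { static constexpr float unit() { return 255.0f; } };
template <> struct ChannelTraits<std::uint16_t> { static constexpr float unit() { return 65535.0f; } };

// Scalar "over" for one non-premultiplied RGBA pixel (alpha assumed at index 3).
// opacity is expected in [0, 1].
template <typename T, bool alphaLocked>
void compositeOverPixel(const T* src, T* dst, float opacity)
{
    const float unit = ChannelTraits<T>::unit();
    const float srcAlpha = (src[3] / unit) * opacity;
    const float dstAlpha = dst[3] / unit;

    if (srcAlpha <= 0.0f) {
        return; // fully transparent source: destination stays as it is
    }

    // Without alpha lock the result gets more opaque; with it, alpha is frozen.
    const float newAlpha = alphaLocked ? dstAlpha
                                       : srcAlpha + dstAlpha - srcAlpha * dstAlpha;
    const float blend = alphaLocked ? srcAlpha : srcAlpha / newAlpha;

    for (int c = 0; c < 3; ++c) {
        const float d = dst[c];
        const float s = src[c];
        dst[c] = static_cast<T>(std::lround(d + (s - d) * blend));
    }
    if (!alphaLocked) {
        dst[3] = static_cast<T>(std::lround(newAlpha * unit));
    }
}
```

In this sketch the alpha-lock case only swaps two expressions and skips one store, which matches the observation that it hardly adds any code.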

Results on my "Haswell" i5 4670:

AlphaDarken

Alignment  Mask    Src      Dst      Legacy uint16 (msec)  AVX2 uint16 (msec)
Aligned    Mask    SrcRand  DstRand  56                    7
DstUnalig  Mask    SrcRand  DstRand  56                    8
SrcUnalig  Mask    SrcRand  DstRand  57                    10
Unaligned  Mask    SrcRand  DstRand  56                    10
Aligned    NoMask  SrcRand  DstRand  37                    7
Aligned    NoMask  SrcZero  DstRand  36                    1
Aligned    NoMask  SrcUnit  DstRand  35                    7
Aligned    NoMask  SrcRand  DstZero  30                    6
Aligned    NoMask  SrcZero  DstZero  24                    1
Aligned    NoMask  SrcUnit  DstZero  29                    6
Aligned    NoMask  SrcRand  DstUnit  33                    7
Aligned    NoMask  SrcZero  DstUnit  33                    1
Aligned    NoMask  SrcUnit  DstUnit  33                    6

Over

Alignment  Mask    Src      Dst      Legacy uint16 (msec)  AVX2 uint16 (msec)
Aligned    Mask    SrcRand  DstRand  49                    8
DstUnalig  Mask    SrcRand  DstRand  49                    9
SrcUnalig  Mask    SrcRand  DstRand  49                    11
Unaligned  Mask    SrcRand  DstRand  49                    11
Aligned    NoMask  SrcRand  DstRand  41                    8
Aligned    NoMask  SrcZero  DstRand  4                     1
Aligned    NoMask  SrcUnit  DstRand  16                    1
Aligned    NoMask  SrcRand  DstZero  24                    5
Aligned    NoMask  SrcZero  DstZero  4                     1
Aligned    NoMask  SrcUnit  DstZero  7                     3
Aligned    NoMask  SrcRand  DstUnit  18                    6
Aligned    NoMask  SrcZero  DstUnit  4                     1
Aligned    NoMask  SrcUnit  DstUnit  6                     3

I still need to add some tests to libs/ui/tests/FreehandStrokeBenchmark. In contrast to fp32, though, I did actually measure some nice improvements on my last attempt, although to be honest, I'm still puzzled why I notice so little of it in Krita itself.

Test Plan

Well, I'm not actually sure about the necessary steps to ensure the code is numerically correct. I added a comparison test to KisCompositionBenchmark and the results seem to stay within the error margin, but I don't think that's proof enough that the rounding is correct.

Seeing a difference when painting is probably close to impossible unless it triggers some edge case where it over- or underflows, but I have no idea how to provoke those.
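One way to go after rounding and overflow edge cases directly, rather than through painting, is to compare a fast arithmetic helper against an exactly rounded reference over a grid of inputs that includes the extremes. The sketch below does this for a normalized uint16 multiply using a common shift-based formulation. It is an illustration only: the function names are hypothetical, and whether the AVX2 path in this merge request uses integer or float arithmetic for this step is not stated here.

```cpp
#include <cstdint>

// Fast rounded multiply of two normalized uint16 values: round(a * b / 65535),
// using the common "add half, then fold the high word back in" formulation.
// The largest intermediate, 65535 * 65535 + 0x8000, still fits in 32 bits.
inline std::uint16_t mulU16Fast(std::uint16_t a, std::uint16_t b)
{
    const std::uint32_t t = static_cast<std::uint32_t>(a) * b + 0x8000u;
    return static_cast<std::uint16_t>((t + (t >> 16)) >> 16);
}

// Exactly rounded reference in 64-bit arithmetic: floor(x / 65535 + 1/2).
inline std::uint16_t mulU16Reference(std::uint16_t a, std::uint16_t b)
{
    const std::uint64_t x = static_cast<std::uint64_t>(a) * b;
    return static_cast<std::uint16_t>((2u * x + 65535u) / (2u * 65535u));
}

// Counts disagreements on a strided grid (stride must be greater than zero).
// The values 0 and 65535 are always checked in addition to the grid points,
// so the extremes are covered regardless of the stride.
inline std::uint64_t countMismatches(std::uint32_t stride)
{
    std::uint64_t mismatches = 0;
    for (std::uint32_t a = 0; a <= 65535u; a += stride) {
        const std::uint16_t a16 = static_cast<std::uint16_t>(a);
        for (std::uint32_t b = 0; b <= 65535u; b += stride) {
            const std::uint16_t b16 = static_cast<std::uint16_t>(b);
            if (mulU16Fast(a16, b16) != mulU16Reference(a16, b16)) {
                ++mismatches;
            }
        }
        // Explicit extremes for the second operand.
        if (mulU16Fast(a16, 0) != mulU16Reference(a16, 0)) ++mismatches;
        if (mulU16Fast(a16, 65535) != mulU16Reference(a16, 65535)) ++mismatches;
    }
    return mismatches;
}
```

A stride of 1 checks all 2^32 input pairs, which is feasible as a one-off run. A prime stride such as 251 gives a quick smoke test. The same pattern (fast path versus a wide-integer or double-precision reference) could be applied per pixel to a whole composite op.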

Formalities Checklist

  • I confirmed this builds.
  • I confirmed Krita ran and the relevant functions work.
  • I tested the relevant unit tests and can confirm they are not broken. (If not possible, don't hesitate to ask for help!)
  • I made sure my commits build individually and have good descriptions as per KDE guidelines.
  • I made sure my code conforms to the standards set in the HACKING file.
  • I can confirm the code is licensed and attributed appropriately, and that unattributed code is mine, as per KDE Licensing Policy.
