Integrate SAM 2 (Segment Anything)
I looked into SAM 2, which now also handles video, and how it could be integrated into Kdenlive.
If I understood correctly, this should be implemented as a filter in MLT, since that is what Kdenlive uses for all of its effects. The official implementation of SAM 2 uses Python/PyTorch, whereas MLT is a C library.
Integration via C/MLT
1. Reimplement SAM 2 in GGML. There is a C implementation of SAM 1 that uses it, but I could not find anything for SAM 2.
2. Export the SAM 2 PyTorch model to ONNX.
   2.1. One option is to export the PyTorch model to ONNX format, as there is a C binding for ONNX Runtime. I found a project that did this for SAM 1 here.
   2.2. Instead of ONNX Runtime, one could use OpenCV DNN (I saw that MLT already has some OpenCV integration, used for tracking, so I wanted to mention it). There was an effort to do this for SAM 1 here.
3. Embed Python (the SAM 2 Python implementation) into C. To be honest, I don't think I understood the effort and viability of this approach. I couldn't find much on how this could be done in practice for anything bigger than a hello world. Here are some official docs on how to do it in general. So this is probably not really an option.
Regarding 2): I could not find any ONNX/C integration efforts for SAM 2 yet, but there is a successful effort at running the ONNX model from Python. So unlike the GGML reimplementation (option 1), you don't have to reimplement the model itself, but there is still quite a lot of additional logic to pre- and post-process the model's inputs and outputs, which would need to be rewritten in C/C++.
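To give a rough idea of the kind of pre- and post-processing logic meant here, below is a minimal sketch. It is written in Python for brevity (the C/MLT route would need the same steps in C/C++). The input size, normalization constants and threshold are assumptions modelled on common SAM-style pipelines, and all function names are hypothetical, so check them against the exported model.

```python
# Assumed values, modelled on common SAM-style pipelines; verify them against
# the exported model before relying on them.
MODEL_INPUT_SIZE = 1024               # assumed square input edge, in pixels
CHANNEL_MEAN = (0.485, 0.456, 0.406)  # assumed ImageNet RGB mean
CHANNEL_STD = (0.229, 0.224, 0.225)   # assumed ImageNet RGB std
MASK_THRESHOLD = 0.0                  # assumed: logits above this are foreground


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to normalized floats (pre-processing)."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CHANNEL_MEAN, CHANNEL_STD)
    )


def scale_point(x, y, frame_width, frame_height):
    """Convert a click in frame coordinates to model-input coordinates."""
    return (
        x * MODEL_INPUT_SIZE / frame_width,
        y * MODEL_INPUT_SIZE / frame_height,
    )


def logits_to_mask(logits):
    """Turn a 2D grid of mask logits into a binary mask (post-processing)."""
    return [
        [1 if value > MASK_THRESHOLD else 0 for value in row]
        for row in logits
    ]
```

For example, a click at the centre of a 1920x1080 frame would be passed to the model as `scale_point(960, 540, 1920, 1080)`, and the returned logits would go through `logits_to_mask` before being scaled back to the frame size and used as an alpha mask.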
Integration via Python/Kdenlive
It seems that Kdenlive relies entirely on MLT for effects. There is some Python integration, though, since that is how Whisper STT was integrated. Unfortunately, I wasn't able to work out how much effort it would take to support filters outside of MLT.
I experienced some issues with the official SAM 2 project on longer videos. Instead, I would go with this Python reimplementation, which fixes that issue and whose code quality and optimization look preferable. (It also has a great demo application/GUI; I would recommend giving it a shot if you want to play with SAM 2 locally.)
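For illustration, here is a minimal sketch of the general pattern this route implies, assuming the host application starts a Python script as a separate process and reads progress from its standard output. The worker source, the JSON message fields and `run_worker` are all hypothetical; this is not Kdenlive's actual code, and the segmentation step itself is left as a comment.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a SAM 2 worker script. A real one would load the
# model and write one mask per frame instead of only reporting progress.
WORKER_SOURCE = r"""
import json, sys
frame_count = int(sys.argv[1])
for index in range(frame_count):
    # ... segment frame `index` and write its mask to disk here ...
    print(json.dumps({"frame": index + 1, "total": frame_count}), flush=True)
"""


def run_worker(frame_count):
    """Run the worker in a separate Python process and collect its progress."""
    process = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE, str(frame_count)],
        stdout=subprocess.PIPE,
        text=True,
    )
    progress = []
    for line in process.stdout:
        message = json.loads(line)
        # Fraction of frames done; a GUI could feed this to a progress bar.
        progress.append(message["frame"] / message["total"])
    if process.wait() != 0:
        raise RuntimeError("worker exited with an error")
    return progress
```

The appeal of this design is that the heavy Python/PyTorch dependencies stay out of the host process; the cost is that the result is a precomputed mask sequence rather than a live MLT filter.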
Conclusion
So, if I understand correctly, the Integration via C/MLT approach would be preferable in theory (also because it could be used outside of Kdenlive), but it doesn't look like a viable option for now. There still seem to be compatibility issues with options 2.1 and 2.2 even for SAM 1, which is a simpler model than SAM 2 because it natively handles only images, not video.
The Python/Kdenlive approach seems far more realistic at this point, if the Kdenlive project is willing to bypass MLT and integrate it directly (not sure if you want that).
I hope this research is of some use.