Andreas, author of the Replicate model here -- though "author" feels wrong since I basically just stitched two amazing models together.
The thing that really strikes me is that open source ML is starting to behave like open source software. I was able to take a pretrained text-to-image model and combine it with a pretrained video frame interpolation model, and the two actually fit together! I didn't have to retrain or fine-tune or map between incompatible embedding spaces, because these models generalize to basically any image. I could treat them as modular building blocks.
It just makes your creative mind spin. What if you generate some speech with https://replicate.com/afiaka87/tortoise-tts, generate an image of an alien with Stable Diffusion, and then feed those two into https://replicate.com/wyhsirius/lia? Talking alien! Machine learning is starting to become really fun, even if you don't know anything about partial derivatives.
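That "models as building blocks" idea can be sketched as a tiny pipeline. This is a hypothetical illustration, not the models' real APIs: the input/output parameter names (`text`, `prompt`, `image`, `audio`) are assumptions, and the runner is stubbed out so the shape of the composition is visible without network calls (in practice you'd pass something like `replicate.run` from the Replicate Python client).

```python
# Hypothetical sketch of the "talking alien" pipeline described above.
# Model slugs come from the comment; the parameter names are assumptions,
# not the actual schemas of these models.

def run_pipeline(run, line_to_speak):
    """Chain three pretrained models: TTS -> text-to-image -> animation.

    `run(model, inputs)` is any callable that invokes a model and returns
    its output. With the real Replicate client you'd pass replicate.run.
    """
    speech = run("afiaka87/tortoise-tts", {"text": line_to_speak})
    alien = run("stability-ai/stable-diffusion",
                {"prompt": "portrait of a friendly alien"})
    # Feed the generated image and audio into the animation model.
    video = run("wyhsirius/lia", {"image": alien, "audio": speech})
    return video

# Stubbed runner so the sketch is self-contained and runnable offline.
def fake_run(model, inputs):
    return f"<output of {model}>"

print(run_pipeline(fake_run, "Greetings, earthlings"))
```

The point is that each stage only needs the previous stage's output as a plain image or audio file, which is exactly why no retraining or embedding-space mapping was required.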
For the moment at least I'm personally more interested in the image applications than video use cases, but even so this is just fantastic for helping to develop an intuition about how the diffusion mechanism works.
It's admirable that you're so modest regarding the antecedent work, but sometimes it's the "obvious in hindsight" compositional insights that really open up the possibility space. Top work!
It's a nifty piece of work. Often when you're trying to get an answer from a regression model or a neural net, you have to craft your inputs so carefully that you already sort of know, intuitively, what it will figure out. In a lot of quantitative cases, the thought process of refining the input is more valuable than the actual output.
This is simply very impressive... whether or not it was humbly stitched together, you were sort of the first to do it, so take pride.
The next real magic will be reading its net and figuring out how to get [vfx/film] effects from it... which, if I were you, would probably occupy 22 hours of my day right now.