You show the neural net many video snippets of a real lava lamp, and it tries to output videos that imitate the real thing. So it does output videos of lava lamps, but to do that convincingly it has to understand the physical behavior of a lava lamp at least a little bit.
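Roughly this kind of setup, as a toy sketch. This assumes a simple next-frame prediction objective in PyTorch; the class names (`RandomClips`, `FramePredictor`) are made up for illustration, not from any real system:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class RandomClips(Dataset):
    """Stand-in dataset: random frame pairs in place of real lava lamp clips."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        frame_t = torch.rand(3, 64, 64)   # a real loader would decode video
        frame_t1 = torch.rand(3, 64, 64)  # ...and return the *next* frame
        return frame_t, frame_t1


class FramePredictor(nn.Module):
    """Tiny conv net: given one frame, predict the next."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)


model = FramePredictor()
loader = DataLoader(RandomClips(), batch_size=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for frame_t, frame_t1 in loader:
    opt.zero_grad()
    # The loss only rewards pixel-level imitation; whatever "physics" the
    # net picks up is just whatever helps it predict the next frame.
    loss = loss_fn(model(frame_t), frame_t1)
    loss.backward()
    opt.step()
```

The point being: nothing in the objective asks for physics explicitly. Any internal model of blob dynamics emerges only because it lowers the prediction loss.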
It doesn't have to, no. But come on, that's not really what is meant here.
It does need to understand certain properties of owls, even if "an arrangement of pixels that looks like an eye near an arrangement of pixels that looks like a beak" is as far as it gets. Though, as noted in another thread, it's not necessarily the owl itself; you would need to do something like rendering a la Neural Radiance Fields (NeRF) to get closer to some perfect comprehension of an owl.
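For context, the core NeRF idea is a network that maps a 3D point plus viewing direction to color and density, with views rendered by compositing samples along camera rays. A toy sketch, leaving out positional encoding, hierarchical sampling, and everything else the real method needs to work well:

```python
import torch
import torch.nn as nn


class TinyNeRF(nn.Module):
    """Maps (x, y, z) position + view direction to RGB color and density."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 volume density
        )

    def forward(self, xyz, view_dir):
        out = self.net(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative density
        return rgb, sigma


def render_ray(model, origin, direction, n_samples=32, near=0.0, far=1.0):
    """Render one pixel by sampling along a ray and alpha-compositing."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction            # points along the ray
    dirs = direction.expand(n_samples, 3)
    rgb, sigma = model(pts, dirs)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)
    # Transmittance: how much light survives to each sample point.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha[:-1] + 1e-10]), dim=0
    )
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)       # composited pixel color


pixel = render_ray(TinyNeRF(), torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```

Because the network has to produce consistent colors and densities for actual 3D locations, it is forced toward something closer to a geometric model of the object, rather than just 2D pixel statistics.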
I don't think that's right. The proof of such learning would be that it can apply the knowledge to something other than a lava lamp, or at least to a modified lava lamp - something "out of the box".