It's clear that the next frontier is transitions in 3D space instead of image space. Language itself is very static, and action verbs alone are not enough to specify scene dynamics. I suppose we would need:
A. an enriched version of natural language that refines the dynamic processes that occur in a scene
B. a dataset of isolated processes labeled in the language described in A.
I've had a hard time finding ongoing work on A and B; perhaps it isn't much of a priority for research groups.
For 3D we would probably need something like Blender or similar, because at some point it's just easier to use 3D software to pinpoint where you want stuff to be than to try to use words.
Imagine opening Blender and typing
> A medium sized classroom, well lit, with two blackboards and many geography posters
And the AI just generates all the 3D meshes and places them appropriately.
Repeat that for any other props or characters you need. After that you can manually tweak the scene as you currently would (moving things, etc.).
Then you select a character, and to animate it you tell it
> The character calmly walks to the door and proceeds to open it
You could literally do a 100+ hour job in 5 minutes.
I think you're right about the possibilities here; I love the idea and have thought along similar lines myself. But to me the 3D element should probably be a format, not something locked into any specific app such as Blender. Maybe Pixar's USD is one format that could serve as the general 3D interchange layer for this kind of thing?
The Blender thing was just an example. Of course it would be possible to convert whatever the model outputs into whatever format the software you want to plug it into expects.