It is a limiting factor due to diminishing returns: a model trained on double the data will be maybe 10% better, if that!
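The diminishing returns roughly follow a power law (as in the scaling-law literature). A toy sketch with a made-up exponent, just to show the shape of the curve, not fitted to any real model:

```python
# Toy power-law scaling: loss ~ A * D^(-alpha).
# A and alpha are illustrative placeholders, not real fitted constants.
A, alpha = 10.0, 0.1

def loss(tokens: float) -> float:
    return A * tokens ** -alpha

base = loss(1e12)       # loss at 1T training tokens
doubled = loss(2e12)    # loss with double the data
improvement = (base - doubled) / base
print(f"{improvement:.1%}")  # → 6.7%
```

With an exponent like 0.1, doubling the data only shaves a few percent off the loss, which is the "10% better, if that" intuition.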
When it comes to multi-modality, training data is not as limited, because of the many possible combinations of language, images, video, sound, etc. Microsoft did some research on this, teaching spatial recognition to an LLM using synthetic images, with good results. [1]
When someone states that there is not enough training data, they usually mean code, mathematics, physics, logical reasoning, etc. On the open internet right now, there is not enough code to make a model 10x better, 100x better, and so on.
Synthetic data will be produced, of course; scarcity of data is the least worrying scarcity of all.