Hacker News

You don’t typically perform optimizations iteratively with feedback from the final test set. Instead, you split your training set into training and validation portions and iterate on those, leaving your true held-out test set completely unexamined all along.

You would do model comparisons, quality checks, ablation studies, goodness-of-fit tests, and so forth using only the training and validation portions.

Finally, you test the chosen models (in their fully optimized states) on the test set. If performance is not sufficient to solve the problem, then you do not deploy that solution. If you want to continue work, you must now collect enough data to constitute, at minimum, a fully new test set.
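The workflow above can be sketched in a few lines. This is a minimal illustration with a toy dataset and made-up candidate "models"; the 70/15/15 split ratios are an assumption for the example, not a universal rule:

```python
import random

# Toy dataset of (feature, label) pairs; labels are just parity here.
data = [(x, x % 2) for x in range(1000)]

rng = random.Random(42)
rng.shuffle(data)

n = len(data)
train = data[: int(0.70 * n)]
val = data[int(0.70 * n): int(0.85 * n)]
test = data[int(0.85 * n):]  # held out: never examined while iterating


def evaluate(model, split):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in split) / len(split)


# Iterate freely: compare candidate models using train + val only.
candidates = {
    "always_zero": lambda x: 0,
    "parity": lambda x: x % 2,
}
val_scores = {name: evaluate(m, val) for name, m in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

# Touch the test set exactly once, for the chosen model.
final_score = evaluate(candidates[best_name], test)
print(best_name, final_score)
```

The key discipline is structural: `test` is sliced off up front and referenced in exactly one place, after model selection is finished.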



I agree with the process you describe, but the traps are things like running a beauty contest (Kaggle?) of n models against the final test set...



