Hacker News

I'd argue the smaller model is actually doing a better job at "learning," in that it captures the key characteristics in its ASCII image, even if the result is poor.

The larger model likely already has it in its training corpus, so it's not a particularly good measure. I'd much rather see a model's capability at representing in ASCII something it's unlikely to have seen in its training data.

Maybe a pelican riding a bike, as ASCII, for both?


