
Yes, I'm assuming that once code is JIT'ed, it will be a while before it needs to be JIT'ed again. In the meantime, the caches will absorb some of the trampoline cost.

With regards to latency, if you have a dedicated JIT thread, you can read ahead and start JITing the targets of all the jumps/calls up to the next ret, in parallel with executing the first chunk. Like a super fancy prefetch. Heuristically, I think it's safe to assume that most code is local, but I could be wrong.

Also, if you load hwloc on x86-64, you can read the CPU topology at initialization time. If hyperthreading is present (which is fairly common), you can set the affinity of each JIT thread to the same physical core as the emulated CPU it serves. This minimizes cross-core coherence overhead, since you'll be reading/writing the same physical core's L1 almost every time. (This part is crazy, but you might even be able to get away without atomics, given how cache sharing works on a single core -- though I've never tried it and it might not be guaranteed into the future.)
