I'd question how much of that impact comes from specializing the collection code, and how much of it comes from allocation and locality. The temptation with void* collections is to malloc every element, which is lethal for performance; it is one of the very few non-bug problems I've come across whose fix actually yields a 10x performance improvement.
Memory allocation can indeed be amortized with a specialized allocator, but that's not what I had in mind. The context I usually operate in is numerical computation, where the cost of indirection is very high, especially when you could otherwise access memory in blocks by using a specialized allocator. That would be very hard to do well, for sure. It is well known that the allocator is one of the main weaknesses of the STL (and one of the reasons for the existence of the Electronic Arts STL).
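To make the indirection point concrete, here is a minimal sketch in C. The names (boxed_vec, sum_boxed, sum_flat) are made up for illustration and do not come from any real library. The "boxed" version stores each double behind its own malloc'ed pointer, the way a naive void* collection would; the flat version walks one contiguous block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical generic container: each element lives in its own malloc'ed box. */
typedef struct {
    void **items;
    size_t len;
} boxed_vec;

/* Copy n doubles into a boxed_vec, one allocation per element. */
static boxed_vec boxed_vec_from(const double *src, size_t n)
{
    boxed_vec v;
    v.items = malloc((n ? n : 1) * sizeof *v.items);
    v.len = n;
    assert(v.items != NULL); /* sketch only: real code should report the failure */
    for (size_t i = 0; i < n; i++) {
        double *box = malloc(sizeof *box);
        assert(box != NULL);
        *box = src[i];
        v.items[i] = box;
    }
    return v;
}

/* Every access chases a pointer to a box that may be anywhere on the heap. */
static double sum_boxed(const boxed_vec *v)
{
    double s = 0.0;
    for (size_t i = 0; i < v->len; i++)
        s += *(const double *)v->items[i];
    return s;
}

/* Same computation over one contiguous block: no indirection at all. */
static double sum_flat(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Release every box, then the pointer array itself. */
static void boxed_vec_free(boxed_vec *v)
{
    for (size_t i = 0; i < v->len; i++)
        free(v->items[i]);
    free(v->items);
    v->items = NULL;
    v->len = 0;
}
```

Both functions compute the same sum; what differs is the memory traffic. How large the gap is depends on the allocator, the element count and the cache hierarchy, so it needs to be measured rather than assumed.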
I am certainly not advocating doing this in general - I think the need for atomic support in generic collections is quite low (I have been investigating the issue recently in order to add fast and generic support for sparse matrices in scipy). I am pretty sure the specialized, macro-based ones used in FreeBSD (tree.h/queue.h) and Linux (rbtree, list) have been benchmarked to hell, though, and I would trust them more than most STL implementations.
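For readers who have not seen the macro style, here is a small sketch using the SLIST macros from <sys/queue.h>, assuming that header is available (the BSDs and glibc ship it). The struct and function names (node, int_list, int_list_push and so on) are my own, chosen for illustration. The link field is embedded in the element, so the list code is specialized for the type at compile time and needs no void* or separate node allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* The list link is embedded directly in the element (intrusive list). */
struct node {
    int value;
    SLIST_ENTRY(node) link;
};

/* Declares 'struct int_list', the head type for a list of struct node. */
SLIST_HEAD(int_list, node);

/* Allocate one node and insert it at the front of the list. */
static void int_list_push(struct int_list *head, int value)
{
    struct node *n = malloc(sizeof *n);
    assert(n != NULL); /* sketch only: real code should report the failure */
    n->value = value;
    SLIST_INSERT_HEAD(head, n, link);
}

/* Walk the list; the macro expands to a plain typed for loop. */
static long int_list_sum(struct int_list *head)
{
    long total = 0;
    struct node *n;
    SLIST_FOREACH(n, head, link)
        total += n->value;
    return total;
}

/* Pop and free nodes until the list is empty. */
static void int_list_clear(struct int_list *head)
{
    while (!SLIST_EMPTY(head)) {
        struct node *n = SLIST_FIRST(head);
        SLIST_REMOVE_HEAD(head, link);
        free(n);
    }
}
```

A caller declares a struct int_list, initializes it with SLIST_INIT, and then uses the functions above. This sketch still mallocs each node for simplicity; the same macros work just as well on nodes carved out of one preallocated array.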