Parallelization and SIMD are pretty easy for numerical code in CL now; I don't think that was the case 10 years ago. LPARALLEL is a nice, small library for multicore parallelism, and CL-MPI is good for MPI/cluster parallelism. The QVM [1], a pure-state/density-matrix simulator, can use either.
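
To give a feel for the LPARALLEL side, here's a minimal sketch. It is not taken from the QVM source; the function names, the worker count of 4, and the use of Quicklisp to load the library are my own assumptions for illustration. It only uses LPARALLEL's kernel setup plus pmap, preduce, and pdotimes.

```lisp
;; Assumption: Quicklisp is installed. Evaluate form by form at the REPL,
;; or LOAD this as a source file.
(ql:quickload :lparallel)

;; Assumption: 4 workers. Pick something close to your core count.
(setf lparallel:*kernel* (lparallel:make-kernel 4))

;; Hypothetical helper: parallel map followed by a parallel reduce.
(defun sum-of-squares (xs)
  "Square each element of the non-empty sequence XS in parallel, then sum."
  (lparallel:preduce #'+ (lparallel:pmap 'vector (lambda (x) (* x x)) xs)))

;; Hypothetical helper: a parallel loop over array indices.
(defun scale-in-place (v factor)
  "Multiply every element of the double-float vector V by FACTOR and return V."
  (declare (type (simple-array double-float (*)) v)
           (type double-float factor))
  ;; Each index is written by exactly one task, so no locking is needed.
  (lparallel:pdotimes (i (length v))
    (setf (aref v i) (* factor (aref v i))))
  v)
```

Calling (sum-of-squares #(1 2 3 4)) should give 30. When you're done, (lparallel:end-kernel) shuts the worker threads down. The type declarations in scale-in-place are there so the compiler can generate unboxed float arithmetic in the loop body; how much that helps depends on your implementation.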
[1] MPI version: https://github.com/quil-lang/qvm/tree/master/dqvm