I'll have to read the paper to see what makes it "deep"...
A cursory skim suggests that it is much faster than Stan, but I suppose the more significant question is if it provides the correct results. Stan might take longer, but I'm usually pretty confident that with some simple diagnostics I can see whether the results are what I really need.
One thing that looks cool is the tutorial for probabalistic PCA. That is a b of a thing to do in Stan. It really only works under some very limited conditions. Edward has this ability to combine in a KL divergence minimization in there. Not exactly sure how it works. I should look into it more. I don't really have a good sense of it just from reading the paper and a tutorial or two.
As someone who just implemented hierarchical probabilistic PCA in stan, I agree that it takes finesse, but it is no means impossible. Doing this sort of work efficiently in stan seems to require a some degree of understanding about how the sampler works. It also may require really thinking through your model. It saves you from deriving your own conditional distributions and writing a gibbs sampler, but you're going to have to do some analysis if you want to fit models of certain complexity.
KL divergence minimization (variational inference) is typically a weak approximation to the model you specified. I have seen it produce inferences on simulated which are just plain wrong. These "wrong" models are still often good predictors, so whether variational inference will work well for you depends on whether you care about making valid inferences or just doing prediction.
I would be very interested in seeing how you implemented the hierarchical PPCA.
My problem was that I couldn't identify the coefficients. So for instance, the first principal component could be [x, x, x, ...] or [-x, -x, -x, ...] and the result would be some bimodal distribution. So if you placed restrictions on the first PC it would work (like only positive), but those restrictions may not make sense for the next PCs.
Yes, multimodality is often a problem for mcmc clustering or dimensionality reduction. However, if you use the SVD method to estimate PCA you only have a bimodal distribution since SVD is identified up to the sign. Asymmetric initialization is usually enough to solve the problem.
A cursory skim suggests that it is much faster than Stan, but I suppose the more significant question is if it provides the correct results. Stan might take longer, but I'm usually pretty confident that with some simple diagnostics I can see whether the results are what I really need.