My first question when I saw this was "How do you handle network partitions?", since RabbitMQ's partition handling is, uh, suboptimal. I read bullet points until I found this:
> With our RabbitMQ servers, you won't have to deal with message loss in the event of a network partition.
Reading on, I found that your answer to partition tolerance is to avoid the possibility of partitions by not supporting clustering at all. So that kind of rules out high availability, practically speaking. Shovel and federation are poor options.
As someone who is actively looking for highly available AMQP without message loss, I have to say that I'm not going to pay someone else for a poor solution to the problem. A managed service has to solve the hard problems to be compelling. I can run my own single instance and hope it doesn't go down at a bad time, which is all you're offering.
I know this is all very negative, and I regret that, but I'm part of your target market and you need to know what your offering looks like from my perspective. A managed service can't sidestep the difficult problems of operating their core technology.
What you have described is not negative at all. It is just what it is with RabbitMQ. I can remember the first time (many years ago) I experienced a RabbitMQ node split, and I was so disappointed. But reading their documentation, I realised that this is simply how RabbitMQ handles partitions.
> A managed service has to solve the hard problems (high availability) to be compelling.
We hear you, and we are working very hard to come up with high-availability solutions/plans that do not have the same failure modes that currently exist in RabbitMQ.
But we believe that those problems need to be solved in RabbitMQ itself. So we are looking to see if we can come up with other failure handling modes, apart from the three outlined in [1]. We can't make any promises, however.
If you are already running your own node and are looking for HA solutions, then we might not be right for you, yet.
But we still believe we can offer you value by taking over the headache of running the node from you so that you can concentrate on other things.
Indeed. In my opinion RabbitMQ is essentially useless in clustered mode.
When Rabbit recovers from a network partition and has to decide between multiple potential master versions of a queue, it picks the largest one to become the new master, and discards the others. It's rather mind-boggling that it can't merge them instead; after all, if your application is capable of handling duplicate deliveries, then merging (which would potentially result in previously ACKed messages becoming visible again) would be a perfectly acceptable solution.
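For applications that can tolerate redelivery, the usual approach is an idempotent consumer that remembers the IDs of messages it has already processed and drops duplicates. A minimal sketch in plain Python; the in-memory set is a stand-in for a shared durable store (e.g. Redis), and all names here are illustrative, not part of any RabbitMQ API:

```python
class IdempotentConsumer:
    """Drops redeliveries by remembering processed message IDs.

    In production the seen-set would live in a durable, shared store
    (e.g. Redis SETNX with a TTL), not in process memory.
    """

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # stand-in for a shared, durable store

    def on_message(self, msg_id, body):
        if msg_id in self.seen:
            return False  # duplicate: acknowledge it, but do no work
        self.handler(body)
        self.seen.add(msg_id)  # record only after successful processing
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
consumer.on_message("m1", "hello")
consumer.on_message("m1", "hello")  # redelivery after partition recovery
consumer.on_message("m2", "world")
```

With this in place, a post-partition queue merge that resurfaces previously ACKed messages would cost only wasted deliveries, not correctness.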
The only way to make it non-lossy is to turn off HA recovery and manually handle network partitions, but it turns out that's not practically feasible, because there are no tools to work with Rabbit queues at a low level; the only way to recover is to discard one or more nodes.
We've also found Rabbit's clustering to be very flaky in general, besides the lack of partition tolerance. We recently had a Rabbit crash where one Rabbit node (not the machine itself) went down, and things got really stuck; the only way to recover was to stop all the nodes, then start them again. After we did that, all the queues were empty. We've also had instances where suddenly bindings go missing, or the bindings are there but attempting to declare them from a client fails with a "bindings already exist" error. And many other weird errors.
Over the last year or so, after enduring all of these issues, we've decided to ditch clustering altogether and run a single node. That's risky, but ironically it's a lot more stable than our previous three-node cluster.
In my opinion, Pivotal really needs to redesign RabbitMQ's clustering.
Has anyone successfully moved off Rabbit? ActiveMQ, NSQ? Disque [1] looked promising, but seems dead (last commit was 18 months ago) at this point.
For a stable RabbitMQ cluster, you want dedicated RabbitMQ hosts with sufficient CPU, disk & network throughput for your workload. Most RabbitMQ users don't know what their workload is, or what their hardware boundaries are. We, the RabbitMQ team, should make this easier - and we will, in due course.
A good default cluster is 3 x r4.large with 100GB GP2 for RABBITMQ_MNESIA_BASE & pause_minority. For queues that need HA, a good default is ha-mode: exactly, ha-params: 2, ha-sync-mode: automatic. As for the Erlang version, we recommend 19.3.6.2 which has important fixes relevant for RabbitMQ. Today we recommend RabbitMQ 3.6.11, and 3.6.12 as soon as it ships.
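For reference, those defaults map onto configuration roughly like this. This is a sketch: the policy name and queue-name pattern are illustrative, and the paths/flags are for the 3.6.x line discussed here:

```shell
# /etc/rabbitmq/rabbitmq.config would carry (classic Erlang-term syntax):
#   [{rabbit, [{cluster_partition_handling, pause_minority}]}].

# Mirror matching queues to exactly 2 nodes, syncing mirrors automatically.
# "^ha\." is a hypothetical name pattern; adjust it to your own queues.
rabbitmqctl set_policy ha-exactly-two "^ha\." \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --apply-to queues
```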
In the past 6 months, I have been focusing on RabbitMQ stability and operability on AWS, GCP & vSphere. Can you tell me more about your RabbitMQ deployment, lobster_johnson? This will help: https://s3-eu-west-1.amazonaws.com/rabbitmq-share/help-us-un...
I wouldn't mind moving this discussion to rabbitmq-users mailing list, so that it can benefit more in the RabbitMQ community.
Antirez said he plans on merging Disque into a Redis module, now that such a thing exists. I'm pretty excited, and would love to migrate off of RabbitMQ to Disque or whatever the module version is named, as we're already successfully running Redis instances.
As for merging queues after partition recovery, the RabbitMQ devs have been talking about implementing that for years. I understand it's a hard problem, or it would already be part of RabbitMQ, since it's the most obvious and desirable solution for applications that can handle duplicates.
We're doing the same thing wrt avoiding clustering and accepting the brief downtime when the single RabbitMQ instance fails.
The only reason we are not offering a clustering plan is that when you experience a network partition in such a setup, you will lose data. And we do not want to be in a situation where we are explaining to our customers that their data has been lost, not through any fault of ours, but because that's the way RabbitMQ works.
We would rather have single node plans where failure modes are much easier to deal with.
Yes. A bonus is that if you use it in combination with a health-check-capable proxy (HAProxy, Kubernetes), clients can be routed to any non-paused node automatically. In fact, you'll need that, since a paused node will close its ports.
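Since a paused node closes its listeners, a plain TCP health check is enough to take it out of rotation. A hypothetical HAProxy fragment, with addresses and tuning values purely as placeholders:

```
# Front a 3-node cluster; a paused minority node refuses connections,
# so the TCP check marks it down without any management-plugin probing.
listen rabbitmq
    bind *:5672
    mode tcp
    balance leastconn
    option tcplog
    server rmq1 10.0.1.10:5672 check inter 5s fall 2 rise 3
    server rmq2 10.0.1.11:5672 check inter 5s fall 2 rise 3
    server rmq3 10.0.1.12:5672 check inter 5s fall 2 rise 3
```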
That said, the devil is in the details; I'd be interested to know if RabbitMQ is capable of reliably detecting that it's in a minority. We've had issues where nodes are having issues talking to each other, but the problem is not consistent on both ends (e.g. A can talk to B, but B can't talk to A).
Hi. Even in pause_minority mode, when the node(s) that were in the minority rejoin the cluster, they will lose any messages they had and assume the messages of the majority.
I'm totally with you. I think the problem is AMQP itself. It just doesn't lend itself well to fully managed reliable message passing across network partitions because it was engineered (some say overengineered) to be "zero overhead" and have "delivery guarantees". If you need message passing across network partitions you really need to ask yourself if RabbitMQ is the right tool for the job. As services like Pubnub get cheaper and gain more acceptance from developers I think they will eliminate many of the things RabbitMQ is currently being used for. If you need message passing features that aren't either in RabbitMQ, Redis, or hosted services like Pubnub, then you're probably doing something sophisticated and probably want to build your infrastructure from the ground up.
Totally agreed. A single Rabbit node is brain-dead easy to run - the hardest part is getting the right Erlang packages installed. HA Rabbit is a problem I'd be willing to pay someone else to solve for me.
Why not support TLS/amqps at all pricing levels? That's a huge turn-off for me, especially since you only offer it at your highest pricing level. I'd also make that clear on your comparison page, as it seems like you support amqps at the $55 level but do not on your pricing page. Good luck! (seriously, no sarcasm)
> as it seems like you support amqps at the $55 level but do not on your pricing page
Hi, sorry about that. We'll fix the comparison page.
> Why not support TLS/amqps at all pricing levels?
When we started out, we offered AMQPS/TLS on all plans through certificates from Let's Encrypt [1]. However, certificate renewals became hard to manage, since we had to renew them on one machine and scp them to the respective RabbitMQ servers. This was too labour-intensive and not worth it.
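For what it's worth, that renew-then-copy step can be automated as a certbot deploy hook, which runs only when a certificate is actually renewed and exposes the new cert's directory as `$RENEWED_LINEAGE`. A rough sketch; hostnames, paths, and the script name are purely illustrative:

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/push-rabbitmq-certs.sh, wired up via:
#   certbot renew --deploy-hook /usr/local/bin/push-rabbitmq-certs.sh
# Pushes the freshly renewed cert to each RabbitMQ host, then asks
# the broker to drop its cached PEMs so the new cert takes effect.
for host in rmq1.example.com rmq2.example.com; do
  scp "$RENEWED_LINEAGE/fullchain.pem" "$RENEWED_LINEAGE/privkey.pem" \
      "deploy@$host:/etc/rabbitmq/tls/"
  ssh "deploy@$host" 'rabbitmqctl eval "ssl:clear_pem_cache()."'
done
```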
However, we still plan to roll out TLS for all plans at no extra cost.