Aurora (Postgres) looked like it would be great, and the performance for our use case was much better than RDS. But there was a showstopper for us with the replication: If the cluster leader fails, _every read replica will force a restart_ to elect a new leader, resulting in minimum ~30s complete cluster downtime. In our situation we have no problem with a short downtime in write availability, but the fact that we can't maintain read availability via replicas is a deal breaker.
I'm no DBA, but is that not the entire point of a cluster/replicas?
Sure, maybe writes won't be accepted or replicated until a new master is elected, but you can't even read from the cluster?
I'm curious at this point: what is it exactly that you gain from a cluster? Is it only having to wait 30s vs. however long it takes for the master to come back online?
That was the mind-boggling part to me. After testing failover and noticing the behavior we had a number of discussions with AWS reps and support engineers that confirmed it. And Aurora's failover documentation hints at it but downplays the severity.