Another day, another “Using PostgreSQL for…” thing it wasn’t designed for. This isn’t a good idea. What happens when the queue goes down and all messages are dead lettered? What happens when you end up with competing messages? This is not the way.
The other system you're using that isn't Postgres can also go down.
Many developers overcomplicate systems. In the pursuit of 100% uptime, if you're not extremely careful, you removed more 9s with complexity than you added with redundancy. And although hyperscalers pride themselves on their uptime (Amazon even achieved three nines last year!) in reality most customers of most businesses are fine if your system is down for ten minutes a month. It's not ideal and you should probably fix that, but it's not catastrophic either.
What I’ve found is that, particularly with internal customers, they’re fine with an hour a month, possibly several, as long as not all of your eggs are in one basket.
The centralization pushes make a situation where if I have a task to do that needs three tools to accomplish, and one of them goes down, they’re all down. So all I can do is go for coffee or an early lunch because I can’t sub in another task into this time slot. They’re all blocked by The System being down, instead of a system being down.
If CI is borked I can work on docs and catch up on emails. If the network is down or NAS is down and everything is on that NAS, then things are dire.
I wondered, but the lack of "the" before "DC" tipped me toward interpreting it as the place name, especially as AWS us-east-1 is in Northern Virginia. Thanks for clarifying!
Challenge: Design a fault tolerant event-driven architecture. Only rule, you aren’t allowed to use a database. At all. This is actually an interview question for a top employer. Answer this right and you get a salary that will change your life.
There are a ton of job/queue systems out there that are based on SQL DBs. GoodJob and SupaBase Queues are two examples.
It’s not usable for high scale processing but most applications just need a simple queue with low depth and low complexity. If you’re already managing PSQL and don’t want to add more management to your stack (and managed services aren’t an option), this pattern works just fine. Go back 10-15yrs and it was more common, especially in Ruby shops, as teams willing to adopt Kafka/Cassandra/etc were more rare.
I think the PG designers would be surprised by the claim that it wasn't designed for this. Database designers try very hard to support the widest possible range of uses.
If all queue actions are failing instantly, you probably want a separate throttle to not remove them from the Kafka queue, since you'd rather keep them there and resume processing them normally instead of from the DLQ when queue processing is working again. In fact, the rate limit implicitly enforced by adding failure records to the DLQ helps with this.
How so? There are queues that use SQL (or no-SQL) databases as the persistence layer. Your question is more specific to the implementation, not the database as persistence layer itself. And there are ways to address it.