Push-based outbox pattern with Postgres logical replication (event-driven.io)
126 points by GordonS on Oct 28, 2022 | 37 comments


Debezium supports the outbox pattern[1] out of the box. All your application needs to do is write to the outbox table, and Debezium will do the rest described in the post and publish your events to Kafka.

1: https://debezium.io/documentation/reference/stable/transform...
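A minimal sketch of the application side (Python with psycopg2; the table layout follows the default column names Debezium's outbox event router looks for, everything else is illustrative):

    import json
    import uuid

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    # The business write and the outbox write commit atomically in one
    # transaction; Debezium tails the WAL and publishes the outbox row.
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO orders (id, total) VALUES (%s, %s)", (42, 99.0))
        cur.execute(
            """INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload)
               VALUES (%s, %s, %s, %s, %s)""",
            (str(uuid.uuid4()), "order", "42", "OrderCreated",
             json.dumps({"orderId": 42, "total": 99.0})),
        )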


I’m working on a DB migration project from MySQL to a KV database, and Debezium saved the project. I don’t know how we would have done it without it. Really cool tech.


Interesting approach. I use a very similar methodology in Marmot (https://github.com/maxpert/marmot) to do CDC for SQLite via triggers; these changes are then published over NATS and replicated to other nodes. So it's row-level replication powered by a similar outbox, except it uses NATS and JetStream's quorum property to ensure reliable recovery.
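For flavor, a hypothetical sketch of trigger-based CDC in SQLite (Python stdlib; the schema here is made up, not Marmot's actual one):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE IF NOT EXISTS change_log (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            tbl TEXT, op TEXT, row_id INTEGER
        );
        CREATE TRIGGER IF NOT EXISTS users_ins AFTER INSERT ON users
        BEGIN
            INSERT INTO change_log (tbl, op, row_id)
            VALUES ('users', 'INSERT', NEW.id);
        END;
    """)
    with conn:
        conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    # a publisher would read change_log in seq order and push rows to NATS
    print(conn.execute("SELECT * FROM change_log ORDER BY seq").fetchall())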


Thought this was an interesting approach, rather than the more well-known `LISTEN`/`NOTIFY` pub/sub mechanism!


LISTEN/NOTIFY also has an 8000-byte payload limit.


The way I've always seen LISTEN/NOTIFY used is just to send a "hey, there are items available in the queue!" signal to consumers.

I take your point though, that WAL is less limiting in this regard.


Much more robust, with easier replayability.


Which way do you mean, using the WAL or LISTEN/NOTIFY?


The WAL is reliable, but LISTEN/NOTIFY isn't. You'll lose messages with LISTEN/NOTIFY if you aren't listening for them.


If you just use NOTIFY to tell listeners “there is work in the queue to grab”, can you still lose messages? I’ve always seen it implemented where workers have to go and claim items (transactionally), rather than just waiting for items to be pushed to them.


No, that way is safe. Just make sure that your workers also check for items in the queue when they start.
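A sketch of that safe pattern in Python with psycopg2 (the queue table, channel name, and handle() are illustrative): drain the queue on startup, then treat NOTIFY purely as a nudge and claim items transactionally with SKIP LOCKED.

    import select

    import psycopg2

    listen_conn = psycopg2.connect("dbname=app")
    listen_conn.autocommit = True
    listen_conn.cursor().execute("LISTEN queue_wakeup")

    work_conn = psycopg2.connect("dbname=app")

    def drain():
        # claim and delete one item per transaction; SKIP LOCKED lets
        # concurrent workers grab different rows without blocking
        while True:
            with work_conn, work_conn.cursor() as cur:
                cur.execute("""
                    DELETE FROM queue
                    WHERE id = (SELECT id FROM queue ORDER BY id
                                LIMIT 1 FOR UPDATE SKIP LOCKED)
                    RETURNING payload
                """)
                row = cur.fetchone()
                if row is None:
                    return
                handle(row[0])  # hypothetical handler; a raise rolls back the claim

    drain()  # catch up on anything enqueued while we were offline
    while True:
        if select.select([listen_conn], [], [], 60.0)[0]:
            listen_conn.poll()
            listen_conn.notifies.clear()  # payload is irrelevant, it's just a nudge
        drain()  # also runs on timeout, as a periodic safety net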


This is exactly what I meant - the WAL is your friend.


What happens if the message processing fails, e.g. your message bus broker is down -- is there a way of signalling to Postgres "hey, try me again later"?


I wrote a Haskell library to implement this as well [0]. It's a very useful pattern.

[0] https://hackage.haskell.org/package/postgresql-replicant


Always found the first method, keeping bookmarks to positions in a table, to be the best. I don’t think the 3 problems given are an issue personally:

- just loop at whatever interval of delay your users can accept

- querying a database 10 times a second is nothing, even 100 times; this wouldn’t cause resource limitations

- the scaling concern made no sense to me, seems a bit arbitrary

Cool to know this is an option though, but I much prefer not relying on database internals like the WAL; that hurts scaling more, IMO.

The only thing you need to worry about with the table method is out-of-order increment IDs, which are always possible in a transactional database. But there are many solutions for this.
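A minimal sketch of the bookmark loop in Python with psycopg2 (load_bookmark, save_bookmark, and publish are hypothetical helpers; note it is still exposed to the out-of-order ID problem):

    import time

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    last_seen = load_bookmark()  # hypothetical: persisted position

    while True:
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM outbox WHERE id > %s ORDER BY id",
                (last_seen,),
            )
            for row_id, payload in cur.fetchall():
                publish(payload)    # hypothetical publisher
                last_seen = row_id
        save_bookmark(last_seen)    # hypothetical: persist position
        time.sleep(0.1)             # ~10 polls per second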


I really like this solution and agree "bookmarks" are a lot better than the outbox, but it is hard to guard against races where the reader advances the bookmark past an ID that an active transaction has allocated but not yet committed, i.e. when transactions commit in a different order than IDs are allocated.

Perhaps this is what you mean by "out of order increment"... but what are the "many" solutions to this?

I struggled for a long time to find a good way of doing this in Microsoft SQL Server and am still not perfectly happy with the solution we found.


Do you have a few pointers to the solutions for out-of-order IDs? I'm thinking of keeping track of gaps (yet-unseen IDs) in another column and retrieving them in the next poll.


Not OP, but there is an approach here of using a dedicated loop worker to assign a post-commit ID sequence, i.e. using the outbox pattern once, simply to assign a post-commit ID.

https://github.com/vippsas/mssql-changefeed/blob/main/MOTIVA...

I wish DBs had this built in; it seems a critical feature of a DB these days, and the commit log already has very similar sequence numbers internally...


Or just use a proper event store.


1) Well, assuming data is stored in SQL as the transactional store, how do you move data safely from SQL to the event store? You at least need the outbox pattern. It is not clear to me that the outbox pattern is less hacky than listening to a SQL change feed.

E.g., with Azure Cosmos DB -- you would not use the outbox pattern there. You listen to the DB change feed -- since that is provided and is easily accessible.

2) In our case the schemas in SQL are mostly event based already (we prefer insert over update ...even if we do business logic/flexible querying on the data). So using an event store is mainly a duplication of the data.

An event store is a database too. What exactly is it about a database that makes it a "proper event store"?

I honestly think the focus on duplicating data you have in your DB into a separate event-store DB may be something of a fad that will pass in a while -- similar to NoSQL. It's needed for volumes larger than what a SQL DB can handle, but if you don't need such large volumes, why introduce the extra component? Event sourcing architecture is great, but such thinking at the architecture level is really orthogonal to the storage and transport chosen.


I wrote a blog post on this. We were using event sourcing and went with the locking+batching method described. For other tables we wanted to read and move to a data warehouse, we used the txid check.

A bit hacky, but on AWS RDS we didn’t want to deal with WAL intricacies.

https://mattjames.dev/auto-increment-ids-are-not-strictly-mo...
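A sketch of the txid check in Python with psycopg2, assuming each outbox row records txid_current() at insert time (column names illustrative): the reader only trusts rows below the oldest transaction still in flight, so a slow commit can't be skipped over.

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    # writer side, inside the business transaction:
    #   INSERT INTO outbox (payload, txid) VALUES (%s, txid_current());

    def fetch_safe(last_seen):
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT id, payload FROM outbox
                WHERE id > %s
                  AND txid < txid_snapshot_xmin(txid_current_snapshot())
                ORDER BY id
            """, (last_seen,))
            return cur.fetchall()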


How would this work from a fault-tolerance perspective? For example, the listening application happens to be offline while the database is inserting records. How would the application catch up?


If the listener is offline, Postgres will see that a subscriber is behind and mark more internal data as needing to be retained - this keeps certain maintenance tasks (e.g. VACUUM) from running. If VACUUM doesn't run for long enough, it can cause a catastrophic failure in your DB.

The application can catch up when restarted, if it retains the last WAL position. When it restarts, it can ask Postgres to start replaying from that point.
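A hedged sketch with psycopg2, assuming a logical replication slot named outbox_slot already exists (e.g. created with the wal2json output plugin); publish() is hypothetical:

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=app",
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()
    # the slot remembers the confirmed position, so a restarted consumer
    # resumes where it left off and Postgres retains WAL until then
    cur.start_replication(slot_name="outbox_slot", decode=True)

    def consume(msg):
        publish(msg.payload)  # hypothetical: push the decoded change downstream
        # acknowledge only after successful processing; unacked WAL is replayed
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)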


It would catch up by reading the WAL files when it restarts -- using this method, the WAL files remain on disk until they've been read.


Beware of subscribers being down: WAL files will fill up the disk, and then you'll lose messages.


Won't it just... write more WAL files?


Yes, until you run out of disk - it's happy to write as much as you can handle.

But disk isn't usually infinite.


Well of course, but I don't feel like this is a noteworthy limitation here; it applies to any form of queue that persists messages.

I had imagined the GP might have been insinuating something like: a configurable number of WAL files would be written, then they'd be overwritten once all were full.


I mean, running out of disk is a danger for any persistence; that's not specific to using the WAL.


What will I lose if, instead of this or Debezium, I simply have a column representing the status of the message as “pending” or “complete”, a job that periodically processes pending messages and marks them complete upon success, and another job that archives complete ones?
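For reference, a sketch of that approach in Python with psycopg2 (table/column names and publish() are illustrative); SKIP LOCKED lets several job instances run safely in parallel:

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def process_pending(batch_size=100):
        with conn, conn.cursor() as cur:  # one transaction per batch
            cur.execute("""
                SELECT id, payload FROM messages
                WHERE status = 'pending'
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            """, (batch_size,))
            for msg_id, payload in cur.fetchall():
                publish(payload)  # hypothetical; an exception rolls back the batch
                cur.execute(
                    "UPDATE messages SET status = 'complete' WHERE id = %s",
                    (msg_id,),
                )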


The author says the outbox pattern should "always" be used.

I agree with the goal -- one should definitely never publish an event externally before it is committed to the DB! But I think using a "post-commit sequence number" is even more powerful than the outbox pattern.

Sadly, few DBs seem to support this well with low latency. Cosmos DB has great support for a low-latency post-commit change feed.

A hack for MSSQL is here:

https://github.com/vippsas/mssql-changefeed/blob/main/MOTIVA...

More about this way of publishing events:

https://github.com/vippsas/zeroeventhub


What advantage would this offer over a durable, transactional queue like, say, Pulsar? I don't see much of an advantage, and depending on how much you're pushing through your queue, I could see the outbox table causing issues with your production system if you're using that DB to hold the data.


I've been using this pattern with Elixir (which is well suited to this type of workload) in production. I put together a little library (mostly ripped off from Supabase's realtime, but with some helpers):

https://github.com/cpursley/walex


Does using this approach prevent direct DML changes or migrations outside application logic?


Anyone know of any Rust libraries that implement this pattern?


The Materialize team maintains a fork of https://github.com/sfackler/rust-postgres with the changes required to consume the Postgres WAL: https://github.com/materializeInc/rust-postgres.

Here is a comment and link to some code which seems to work: https://github.com/sfackler/rust-postgres/issues/116#issueco...


Thank you!



