But all these solutions share a fairly major problem: deciding when a message leaves the queue. For a message queue, you want it to leave the queue when delivered. For a job queue, you want it to leave the queue when the work is completed, but not to be sent out again in the meantime. Now imagine you have processing jobs that take two weeks on 16 cores to complete.
That's one thing that makes this a hard problem. If you hold transactions open for two weeks ("Deleting this....") then autovacuum is going to be totally ineffective. So failure detection and recovery is an important (and probably the hardest) part of the problem.
PgQ looks to me like a solution to a message queue problem, not a work queue problem.
IIRC (it's been a while), PgQ doesn't keep transactions open while events are processed: events are fetched and marked as in process; they're then marked succeeded or failed to finish the event. The bulk copy of initial data which bootstraps replication is a long duration process: I don't believe transactions are held open this entire time.
If I understand you correctly, this handles your work queue case.
In my experience you have several critical issues:
1. What happens when a job silently fails?
2. What happens when a job takes a lot longer than expected to succeed?
If you solve the first with a timeout, the second leads to a job rerun. The best (only?) solution I have found is to have some awareness in the job queue of the fact that the job is currently being processed. In my previous work we used advisory locks for that.
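To make the advisory-lock idea concrete, here is a rough in-memory sketch in Python. It is not our actual code and it does not talk to a database: `Session` and `JobQueue` are hypothetical stand-ins I made up for illustration. The one property it models is that in PostgreSQL a session-level advisory lock (taken with something like `pg_try_advisory_lock`) is released when the session ends, so a job stays claimed for as long as the worker's connection is alive and becomes claimable again once it goes away.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for a database connection held by one worker."""
    name: str
    alive: bool = True  # set to False to simulate a worker dying silently


@dataclass
class JobQueue:
    """Toy queue: a job is 'in progress' while a live session holds its lock."""
    pending: list = field(default_factory=list)  # job ids not yet completed
    locks: dict = field(default_factory=dict)    # job id -> Session holding it

    def _held(self, job_id):
        # A lock only counts while its session is alive, mirroring how
        # session-level advisory locks vanish when the connection ends.
        holder = self.locks.get(job_id)
        return holder is not None and holder.alive

    def claim(self, session):
        """Lock and return the first pending job nobody is working on, else None."""
        for job_id in self.pending:
            if not self._held(job_id):
                self.locks[job_id] = session
                return job_id
        return None

    def complete(self, job_id, session):
        """Remove a finished job; only the live lock holder may do this."""
        if self.locks.get(job_id) is not session or not session.alive:
            raise RuntimeError("session does not hold the lock for this job")
        self.pending.remove(job_id)
        del self.locks[job_id]
```

With this model, a job that runs for two weeks is never handed out a second time, because no timeout is involved: the claim lasts exactly as long as the worker's session. A job whose worker dies is picked up by the next `claim` call without any explicit reset step. What it does not cover is a worker that hangs while keeping its connection open; that case still needs some other form of detection.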
It wasn't clear to me how closely you've looked at PgQ. Have you looked into the design (other than the README), or used it and found these failings? I'm certainly not going to be able to answer your questions off the top of my head given the time passed since I last used it.
Given your critiques of everything else out there (from what I gather from the rest of your comments in this thread), it seems like you've identified a possible business opportunity.
It's been a little while, but I actually read through its source code and Londiste's. It's possible I missed something, but I didn't see anything that would automatically reset messages if a connection goes away between receiving the message and marking it as completed.