For a product that's been around 12 years, I've been surprised at how minimally featured PagerDuty is.
Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.
PD schedule checking and trade negotiation becomes yet another thing in the long list of things I need to do when taking a day off. HR system request off, Department Outlook calendar update, PagerDuty coverage check, Outlook out-of-office status & auto-replies, Slack set away, update status AND pause notifications.
I suppose that's because as an on-call developer I am not the user. The user, management who bought the product, gets KPIs & pretty graphs, so they are happy.
My least favorite thing about PagerDuty is the phone call notification. I drive a car from 2001, and with a cheap bluetooth upgrade, I can do all of these with my voice while driving:
- Get directions to anywhere on the continent
- Send and receive texts to my friends
- Answer and take a call from a human
But if PagerDuty calls me, Stephen Hawking's speech synthesizer brusquely yells at me and demands I take my hands off the wheel and press a button on my phone to acknowledge the alert. No voice recognition, no ability to kick off an automated play. It's a time portal to 1997! Even the _banks_ have friendlier phone automation these days!
JIRA is one of those pieces of software that creates more problems than it solves. That's good for the industry -- it means to get an equivalent amount of work done, you need more people. More jobs competing for the same pool of workers means that you have to pay the workers more. That's good for everyone.
(If I were starting a company tomorrow, though, I'd use Linear. Nicest issue tracking tool I have ever seen. It has all the "big business" features like roadmaps and story points, with a lot of friendliness for the individual contributors -- dark mode, keyboard shortcuts, a dedicated triage UI. It's so nice to see someone finally get it right!)
> Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.
Do you shut down your service for Labor Day? I don't.
I do agree that trading on-call shifts is not very easy within the UI. Part of me dreams of being able to make enough advantaged trades to end up never on-call, like the padre who doubled his holdings in a WW2 POW camp: https://www.ft.com/content/c523efe6-9973-11e1-9a57-00144feab...
Depends on the service? I've maintained a service used for supply chain planning where an outage during business hours needed addressing, but an outage over a weekend or holiday wasn't important. Being able to set up company holidays ahead of time then just not think about it would have been a useful feature for an on-call tool in that scenario.
No, but you could, for example, allocate the holidays equally amongst the team, or as a special class of days with a different schedule like weekends.. rather than just screwing whoever happens to be on that week?
The fact that the product seems to have no concept of holidays when it's essentially a scheduler++ is a problem.
The problem with doing trades, which is the default easiest thing to do given the PagerDuty interface, is that when you come back from (or just before you go on) vacation you typically end up with an extra bonus on-call shift outside the cycle. Delightful!
All these things just sort of pile up into the "maybe it's just easier not to take a couple days off" category, which is not really a mistake on your employer's part.
Oh man, as the engineer who was more or less responsible for PD scheduling between 2013 - 2015, it really hurts to hear they still haven't solved this :/
Depends on the service and industry. Banking adjacent companies are often allowed downtime off US business hours. Even at big tech companies I've run internal services that had business hours support only (nonproduction sandboxes, non-business-impacting services, long running job services with SLOs measured in hours or days)
> Depends on the service and industry. Banking adjacent companies are often allowed downtime off US business hours.
Honestly as a consumer this pisses me off. I get home for the day, relax, eat dinner, and log into the bank to check my finances at 8pm and east coast banks throw up a "scheduled downtime for upgrades" notice.
Hey everyone, Matvey, ex-CEO of Amixr is here. Me and Ildar Iskhakov started this project three years ago because we used to be on-call ourselves and needed better tools. It was an amazing journey from 0 to 1. Tons of coding, first customers, fundraising, iterating, and finally the honor to join Grafana Labs and build Grafana OnCall! I'll be happy to answer your questions if you have any.
It's great to see more competition in this space. Generally speaking, what I miss in these "incident management" products is also an integrated, flawless way to handle incidents when they're happening. I'm talking about:
1. Quickly creating a proper chat
2. Quickly creating an incident document where you can pin chat messages and use it in the post-mortem. Ideally, pinning some graphs that you'd extract from your observability solutions
3. Having a status page to put a small description for non-technical stakeholders.
PagerDuty covers some of this. Monzo's Response [1] and now incident.io [2] try to cover it too. I'd like to have this experience end-to-end.
Monzo's solution does not seem to be actively maintained, does it?
+100 on the creation of incident chat rooms and pinning data to re-use in incident docs. There is nothing worse than copying the timeline events from one tool to a Google Doc.
This is one thing I really like about PagerDuty's incident response: I can pull incidents and Slack messages right into the Incident timeline. I usually end up copying and pasting it into a.. _sigh_ Jira ticket.
For now, we are focusing on rolling out Grafana OnCall in Grafana Cloud. It's a very common use case to have such a system outside of your infrastructure so it won't be affected by issues there. It should be alive even when everything goes wrong.
We've already received multiple questions about OSS and on-premises. We will roll out the cloud version first, see how it works, collect feedback and build (and share) future plans!
This looks really neat. We don't use Grafana today. We're running CloudWatch/insights and Squadcast for alerting, but deep integration with the monitoring tool looks cool. Is this usable with self-hosted or AWS managed Grafana?
Yep! The idea of Grafana OnCall is to help you group, deduplicate, route & deliver alerts from any source to Slack/SMS/phone. It could be CloudWatch, DataDog, self-hosted Alertmanager, or Grafana of course. The only requirement for the alert source is to be able to generate a webhook and send it to us.
Can Grafana OnCall itself be self-hosted and/or run as a part of Grafana itself? Your last response makes it sound like it's a separate product with integrations rather than an extension of Grafana. Is that correct?
It's 100% part of the Grafana Cloud, not a separate product. It's deeply integrated with the rest of Grafana.
At the same time, we've focused on making it useful for those who don't use Grafana for monitoring. Feel free to sign up for Grafana Cloud and use just OnCall if you want.
Product looks great but those API request limits are too low, because alerts rain when you are having incidents and rate limiting all of them is harmful. That's why other products have deduplication keys / aliases so you don't miss important ones.
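For anyone unfamiliar with the deduplication-key idea mentioned here, a minimal sketch follows. It is not any vendor's actual implementation: the `dedup_key` field name, the `group_by_dedup_key` helper, and the sample alerts are all made up for illustration. The point is only that alerts sharing a key collapse into one entry with a counter instead of each one paging separately.

```python
def group_by_dedup_key(alerts, key_field="dedup_key"):
    """Collapse alerts that share a key into one entry with a repeat count.

    `key_field` is a hypothetical field name; real products name it differently.
    """
    groups = {}
    for alert in alerts:
        key = alert[key_field]
        if key in groups:
            # Same underlying problem seen again: count it, don't page again.
            groups[key]["count"] += 1
        else:
            # First time this key is seen: this is the alert worth delivering.
            groups[key] = {"first_alert": alert, "count": 1}
    return groups


# Hypothetical alert storm: three alerts share one root cause.
storm = [
    {"dedup_key": "cms-index-missing", "source": "public www"},
    {"dedup_key": "cms-index-missing", "source": "public api"},
    {"dedup_key": "db-replica-lag", "source": "postgres"},
    {"dedup_key": "cms-index-missing", "source": "preview www"},
]
groups = group_by_dedup_key(storm)
for key, group in groups.items():
    print(key, group["count"])
```

With keys like these, four incoming alerts become two notifications, so a rate limit is much less likely to swallow the one that matters.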
I'd think that receiving even 1/5th the rate limit in a 5 minute window would be disorienting enough to render alerting effectively useless.
I'd question the configuration which fires that many alerts in that time frame, and suggest improving alert aggregations and dependencies to get the number down to one or a handful of meaningful alerts.
The overhead of maintaining those configurations all the time is usually too high to be worth it considering the benefit and likelihood of reaping it.
Also, in my experience with those systems, they only make sense to use very sparingly. Your monitoring becomes extremely fragile when your aggregations and dependencies get complicated enough that "what will our alerting system do when X happens?" results in a flow chart with 18 steps.
If you aren't careful, you can end up making your aggregations less useful than the raw alerts would be.
It would be great to have a dependency graph or labels in the alerts, so they are easily mapped to the things that can break and are important enough to be monitored.
We just had a short outage where an editor removed the index page in the cms which is central to the site. It's stupid that this is possible but we just operate the cms while we build and operate everything around it for our customer.
I think a large part of our alerts were triggered all at once, but the one thing they had in common was that the alerts all pointed to the index page in the cms. E.g. the public www alert for index, the public api alert for index, the preview www alert for index, the preview api alert for index....
I was once in a job where I was solo on call for tens of thousands of cores globally and at worst we had like 2000 alerts in a week. These limits seem quite high to me.
With Grafana OnCall’s automatic grouping of alerts within Slack, you can avoid alert storms and reduce the noise your teams are exposed to during an incident.
Seems like the same feature described using different terminology.
Technically Splunk On-Call. But I have a few pain points with it, and I miss PagerDuty.
If you want to see what teams you are on as the currently logged-in user, the only way to do it, as far as support told me, is to search for yourself and then check that result.
I see my teams listed under my user profile. Or if I go to the left side bar and click on my name, it says when I'm next on-call for various teams. But the UI looks different than last time I logged in a few weeks ago, so maybe something has changed.
I've been seeing them recommended more and more, and myself have been keeping a passive eye on BetterUptime (which has an on-call feature): https://betteruptime.com/incident-management
Their free and lower-priced tiers offer a lot of what others have on their top/most expensive tiers. Also, integrations with various alert sources are just easier in most cases. I spent I don't know how long trying to get OpsGenie to work before I gave up.
I may be biased as a co-founder of Spike.sh, but I think we have one of the best designed incident management products out there. We've focused on making it easy to create on-call schedules and overrides, and added templates for escalation, on-call and alert rules.
I use VictorOps (Now Splunk On-Call) currently and it does the job. Its shift override functionality is quite confusing to get your head around at first but makes sense after the first few times.
I've also used OpsGenie (Atlassian now) and really enjoyed it. The amount of integrations they have is staggering.
In the year I used it, I never personally noticed it going down. That being said, their SLA is only 99.9% delivery in any calendar month within 5 minutes. The penalty for missing that SLA is only 10% of that month's bill.
> Once an Incident is triggered, PagerDuty will deliver the First Responder Alert within the Notification Delivery Period for 99.9% of the notifications sent by PagerDuty for the Customer during any calendar month. The “Notification Delivery Period” is five (5) minutes and it is measured as the time it takes PagerDuty to deliver a First Responder Alert to telecommunication providers in accordance with the Service configuration and Contact Information.
> ...
> If PagerDuty fails to meet the SLA set forth herein, Customer may receive a service credit. Customer will be eligible for a credit toward future fees owed to PagerDuty for the PagerDuty Service. The Service Credit is calculated as ten percent (10%) of the fees paid for or attributable to the month when the alleged SLA breach occurred.
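To put rough numbers on the quoted terms, here is a back-of-the-envelope sketch. The 99.9% target and the 10% credit come from the text above; the function name, the notification counts, and the $500 monthly fee are invented for illustration, and the real contract has conditions this ignores.

```python
def sla_credit(total_notifications, late_notifications, monthly_fee,
               target=0.999, credit_rate=0.10):
    """Return the service credit owed for a month, under the quoted terms.

    target and credit_rate are taken from the quoted SLA; everything else
    is a hypothetical input.
    """
    if total_notifications == 0:
        return 0.0
    on_time_ratio = (total_notifications - late_notifications) / total_notifications
    # A credit is only due if fewer than 99.9% were delivered within 5 minutes.
    if on_time_ratio < target:
        return monthly_fee * credit_rate
    return 0.0


# Hypothetical month: 10,000 notifications on a $500 bill.
credit_at_threshold = sla_credit(10_000, 10, 500.0)  # exactly 99.9% on time
credit_when_missed = sla_credit(10_000, 11, 500.0)   # 99.89% on time
```

In this made-up month, 10 late notifications out of 10,000 still meet the target, and an 11th late one yields a credit of about $50, whatever those late pages actually cost you.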
It's very rare for them to go down. I think I can remember one major outage during business hours in the last few years at which point we just switched to manual monitoring for the few hours.
If that is within your outage model, you'd probably want a redundant on-call service I suppose, even if it's just escalating to a single known email or sms group.
Your service(s) going down and PagerDuty going fully down at the same time is very unlikely to happen. Even if it does, you're probably going to get called by customer support because users never go down ;)
I'm a Grafana fan and a current user of PagerDuty. Maybe there's more to the story but after reading the post I feel like using a calendar integration to manage on-call schedules is the wrong approach. Calendar events are a result of overlaying a rotation on a date range: they're the output, not the input. I'm sure the designers here have looked at how PD enables creating and editing rotations. Curious to know their views on it.
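To make the "output, not the input" point concrete, a small sketch: the rotation (an ordered list of people plus a shift length) is the input, and the dated shifts you would put on a calendar fall out of it. This is not how PagerDuty or Grafana OnCall model schedules; `expand_rotation`, the names, and the dates are hypothetical.

```python
from datetime import date, timedelta
from itertools import cycle


def expand_rotation(members, start, end, shift_days=7):
    """Overlay a rotation on [start, end) and return (member, from, to) shifts."""
    shifts = []
    current = start
    for member in cycle(members):
        if current >= end:
            break
        # Clip the last shift so it never runs past the requested range.
        shift_end = min(current + timedelta(days=shift_days), end)
        shifts.append((member, current, shift_end))
        current = shift_end
    return shifts


# Hypothetical three-person weekly rotation over four weeks.
shifts = expand_rotation(["alice", "bob", "carol"],
                         date(2021, 11, 1), date(2021, 11, 29))
for member, begins, ends in shifts:
    print(member, begins, ends)
```

Editing the rotation and regenerating the events is cheap. Going the other way, inferring the rotation from hand-edited calendar events, is where it gets messy.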
A few more screenshots of the "Scheduling" options would've been great...
We're (more or less) using OpsGenie's free tier, however their scheduling never really "clicked" with me... not sure if I'm special in that regard, however I find the UI/UX pretty... weird...
I'm not sure what this is competing with in its current incarnation.
I need corresponding mobile phone applications for any alert product I intend to use that can override DND/volume etc. on my phone so I can get woken up at night and respond to problems.