Grafana OnCall: an easy-to-use on-call management tool (grafana.com)
232 points by sciurus on Nov 9, 2021 | 74 comments


For a product that's been around 12 years, I've been surprised at how minimally featured PagerDuty is.

Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.

PD schedule checking and trade negotiation becomes yet another thing in the long list of things I need to do when taking a day off. HR system request off, Department Outlook calendar update, PagerDuty coverage check, Outlook out-of-office status & auto-replies, Slack set away, update status AND pause notifications.

I suppose that's because as an on-call developer I am not the user. The user, management who bought the product, gets KPIs & pretty graphs, so they are happy.


My least favorite thing about PagerDuty is the phone call notification. I drive a car from 2001, and with a cheap bluetooth upgrade, I can do all of these with my voice while driving:

- Get directions to anywhere on the continent

- Send and receive texts to my friends

- Answer and take a call from a human

But if PagerDuty calls me, Stephen Hawking's speech synthesizer brusquely yells at me and demands I take my hands off the wheel and press a button on my phone to acknowledge the alert. No voice recognition, no ability to kick off an automated play. It's a time portal to 1997! Even the _banks_ have friendlier phone automation these days!


Uh this is user error. Just make the DTMF tones with your voice.


Whoah there, Captain Crunch!


2600 Hz is a single tone.


Every delightful, successful developer product is eventually doomed to become JIRA.


A multi-billion USD success story?


That's one way to look at it.


JIRA is one of those pieces of software that creates more problems than it solves. That's good for the industry -- it means to get an equivalent amount of work done, you need more people. More jobs competing for the same pool of workers means that you have to pay the workers more. That's good for everyone.

(If I were starting a company tomorrow, though, I'd use Linear. Nicest issue tracking tool I have ever seen. It has all the "big business" features like roadmaps and story points, with a lot of friendliness for the individual contributors -- dark mode, keyboard shortcuts, a dedicated triage UI. It's so nice to see someone finally get it right!)


JIRA is shovelware.

In that it solves problems for one user (project managers) by shoveling it onto another user (developers or admin assistants).

It solves some problems, but it could definitely do a better job of decreasing overall tracking work required, instead of just moving it around.


> Stuff like national holiday awareness, integration to vacation calendars, a better UI for swapping days/overrides, etc.

Do you shut down your service for Labor Day? I don't.

I do agree that trading on-call shifts is not very easy within the UI. Part of me dreams of being able to make enough advantaged trades to end up never on-call, like the padre who doubled his holdings in a WW2 POW camp: https://www.ft.com/content/c523efe6-9973-11e1-9a57-00144feab...


Depends on the service? I've maintained a service used for supply chain planning where an outage during business hours needed addressing, but an outage over a weekend or holiday wasn't important. Being able to set up company holidays ahead of time then just not think about it would have been a useful feature for an on-call tool in that scenario.
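
A minimal sketch of that "set up holidays ahead of time" idea, assuming a 09:00-17:00 weekday support window and a made-up holiday list. The function name `should_page` and every value here are illustrative, not part of any on-call product.

```python
from datetime import date, datetime


def should_page(when, holidays, open_hour=9, close_hour=17):
    """Return True only during business hours on a working day.

    `holidays` is a set of dates entered ahead of time; the 9-17 window
    is an assumption made for this example.
    """
    if when.date() in holidays:
        return False
    if when.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return open_hour <= when.hour < close_hour


# Hypothetical company holidays.
holidays = {date(2021, 11, 25), date(2021, 12, 24)}

weekday_morning = should_page(datetime(2021, 11, 9, 10, 0), holidays)
saturday = should_page(datetime(2021, 11, 13, 10, 0), holidays)
holiday = should_page(datetime(2021, 11, 25, 10, 0), holidays)
```

An alert that arrives outside the window would be queued for the next working day rather than dropped.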


No, but you could, for example, allocate the holidays equally amongst the team, or as a special class of days with a different schedule like weekends.. rather than just screwing whoever happens to be on that week?

The fact that the product seems to have no concept of holidays when it's essentially a scheduler++ is a problem.

The problem with doing trades, which is the default easiest thing to do given the PagerDuty interface, is that when you come back from (or just before you go on) vacation, you typically end up with an extra bonus on-call shift outside the cycle. Delightful!

All these things just sort of pile up into the "maybe it's just easier not to take a couple days off" category, which is not really a mistake on your employer's part.
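
Allocating holidays equally could be as simple as a separate round-robin over the holiday dates. This is only a sketch with a hypothetical team and holiday list; it does not describe how PagerDuty or any other tool schedules.

```python
from collections import Counter
from datetime import date
from itertools import cycle


def allocate_holidays(team, holidays):
    """Assign each holiday to the next person in a fixed rotation.

    Holidays are sorted first so the result does not depend on input order.
    """
    rotation = cycle(team)
    return {day: next(rotation) for day in sorted(holidays)}


# Hypothetical team and holiday list, for illustration only.
team = ["alice", "bob", "carol"]
holidays = [
    date(2021, 11, 25),
    date(2021, 12, 24),
    date(2021, 12, 25),
    date(2021, 12, 31),
    date(2022, 1, 1),
    date(2022, 1, 17),
]

assignments = allocate_holidays(team, holidays)
load = Counter(assignments.values())  # holidays carried per person
```

With six holidays and three people, each person carries two, independent of who happens to be on the regular weekly rotation.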


Oh man, as the engineer who was more or less responsible for PD scheduling between 2013 - 2015, it really hurts to hear they still haven't solved this :/


> Do you shut down your service for Labor Day?

Depends on the service and industry. Banking adjacent companies are often allowed downtime off US business hours. Even at big tech companies I've run internal services that had business hours support only (nonproduction sandboxes, non-business-impacting services, long running job services with SLOs measured in hours or days)


> Depends on the service and industry. Banking adjacent companies are often allowed downtime off US business hours.

Honestly as a consumer this pisses me off. I get home for the day, relax, eat dinner, and log into the bank to check my finances at 8pm and east coast banks throw up a "scheduled downtime for upgrades" notice.


Some companies have international teams. National holiday awareness can be useful in that context.


Hey everyone, Matvey, ex-CEO of Amixr, here. Ildar Iskhakov and I started this project three years ago because we used to be on-call ourselves and needed better tools. It was an amazing journey from 0 to 1. Tons of coding, first customers, fundraising, iterating, and finally the honor to join Grafana Labs and build Grafana OnCall! I'll be happy to answer your questions if you have any.


It's great to see more competition in this space. Generally speaking, what I miss in these "incident management" products is also an integrated, flawless way to handle incidents when they're happening. I'm talking about:

1. Quickly creating a proper chat

2. Quickly creating an incident document where you can pin chat messages and use it in the post-mortem. Ideally, pinning some graphs that you'd extract from your observability solutions

3. Having a status page to put a small description for non-technical stakeholders.

PagerDuty covers some of this. Monzo's Response [1] and now incident.io [2] try to cover it too. I'd like to have this experience end-to-end.

1 - https://github.com/monzo/response

2 - https://incident.io/


Monzo's solution does not seem to be actively maintained, does it?

+100 on the creation of incident chat rooms and pinning data to re-use in incident docs. There is nothing worse than copying the timeline events from one tool to a Google Doc.


AFAIK, the creators created incident.io as a spin-off [1] :) Smart move, I must say.

1 - https://www.indexventures.com/perspectives/incidentio-raises...


This is one thing I really like about PagerDuty's incident response, I can pull incidents and Slack messages right into the Incident timeline. I usually end up copy and pasting it into a.. _sigh_ Jira ticket.


I use incident.io. Pretty happy with it. Very responsive team.


Hi! Thanks for sharing this news. Will this be available for on-premise installations, and when?


For now, we are focusing on rolling out Grafana OnCall in Grafana Cloud. It's a very common use case to have such a system outside of your infrastructure so it won't be affected by issues there. It should be alive even when everything goes wrong.

We've already received multiple questions about OSS and on-premises. We'll roll out the cloud version first, see how it works, collect feedback, and build (and share) future plans!


This looks really neat. We don't use Grafana today. We're running CloudWatch/insights and Squadcast for alerting, but deep integration with the monitoring tool looks cool. Is this usable with self-hosted or AWS managed Grafana?


Yep! The idea of Grafana OnCall is to help you group, deduplicate, route, and deliver alerts from any source to Slack, SMS, or phone. The source could be CloudWatch, DataDog, a self-hosted Alertmanager, or Grafana of course. The only requirement for the alert source is to be able to generate a webhook and send it to us.


Can Grafana OnCall itself be self-hosted and/or run as a part of Grafana itself? Your last response makes it sound like it's a separate product with integrations rather than an extension of Grafana. Is that correct?


It's 100% part of the Grafana Cloud, not a separate product. It's deeply integrated with the rest of Grafana.

At the same time, we've focused on making it useful for those who don't use Grafana for monitoring. Feel free to sign up for Grafana Cloud and use just OnCall if you want.


Is there automatic planning of upcoming shifts and compensation accounting? Or do you have to do that manually?


> Alerts from each integration: 300 per 5 minutes

> Alerts from the whole team: 500 per 5 minutes

> API requests per API key: 300 per 5 minutes

Product looks great but those API request limits are too low, because alerts rain when you are having incidents and rate limiting all of them is harmful. That's why other products have deduplication keys / aliases so you don't miss important ones.

https://grafana.com/docs/grafana-cloud/oncall/oncall-api-ref...


I'd think that receiving even 1/5th the rate limit in a 5 minute window would be disorienting enough to render alerting effectively useless.

I'd question the configuration which fires that many alerts in that time frame, and suggest improving alert aggregations and dependencies to get the number down to one or a handful of meaningful alerts.


The overhead of maintaining those configurations all the time is usually too high to be worth it considering the benefit and likelihood of reaping it.

Also, in my experience with those systems, they only make sense to use very sparingly. Your monitoring becomes extremely fragile when your aggregations and dependencies get complicated enough that "what will our alerting system do when X happens?" results in a flow chart with 18 steps.

If you aren't careful, you can end up making your aggregations less useful than the raw alerts would be.


It would be great to have a dependency graph or labels in the alerts, so they are easily mapped to the things that can break and are important enough to be monitored.

We just had a short outage where an editor removed the index page in the cms which is central to the site. It's stupid that this is possible but we just operate the cms while we build and operate everything around it for our customer.

I think a large part of our alerts were triggered all at once, but the one thing they had in common was that the alerts all pointed to the index page in the cms. E.g. the public www alert for index, the public api alert for index, the preview www alert for index, the preview api alert for index....


Problem is you get the same alert deduplicated hundreds of times. And with those limits, you miss others.


I was once in a job where I was solo on call for tens of thousands of cores globally and at worst we had like 2000 alerts in a week. These limits seem quite high to me.


> That's why other products have deduplication keys / aliases so you don't miss important ones.

Care to link to the docs? I'm interested.



Thanks for the links.

From the article:

> With Grafana OnCall’s automatic grouping of alerts within Slack, you can avoid alert storms and reduce the noise your teams are exposed to during an incident.

Seems like the same feature described using different terminology.


The output alerts feature looks largely the same, but the input API limits are the part in question.

What happens if you get 1000 API calls about "Alert 1" and 1 API call about "Alert 2"?

You want both alerts to page the on-call once, but will Alert 2 get through?
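
To make the question concrete, here is a toy model of the two possible orderings. It is a sketch only: the class, the key names, and the limit of 300 are assumptions for illustration, and nothing here describes how Grafana OnCall actually applies its limits.

```python
from collections import OrderedDict


class Intake:
    """Toy alert intake: deduplicate by key first, rate-limit second."""

    def __init__(self, limit):
        self.limit = limit           # max distinct alert groups per window
        self.groups = OrderedDict()  # dedup_key -> number of events seen
        self.dropped = 0

    def receive(self, dedup_key):
        if dedup_key in self.groups:
            # Repeat of a known alert: count it, but spend no quota.
            self.groups[dedup_key] += 1
            return "grouped"
        if len(self.groups) >= self.limit:
            self.dropped += 1
            return "rate_limited"
        self.groups[dedup_key] = 1
        return "paged"


def naive_receive_all(keys, limit):
    """Count every raw request against the limit, with no deduplication."""
    return ["accepted" if i < limit else "rate_limited" for i, _ in enumerate(keys)]


storm = ["alert-1"] * 1000 + ["alert-2"]

intake = Intake(limit=300)
results = [intake.receive(key) for key in storm]
naive = naive_receive_all(storm, limit=300)
```

With deduplication ahead of the limit, both alerts page once and nothing is dropped. With the limit applied to raw requests, "alert-2" arrives as request 1001 and is rejected.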


How else do you think they are gonna make money?


Is there really anybody else in the "Pager" category of SaaS products other than PagerDuty that have any traction?


I work on/for an open source solution that we based off of PagerDuty, called GoAlert: https://github.com/target/goalert


Target uses GoAlert across the enterprise for all on-call. Definitely enterprise capable!


Target having a GitHub org has bolstered my future use of Target pickup orders. Didn’t know Target open sourced tech, love to see it.


Thanks for your comment! We are working on doing more in open source to help pay back our use of open source. We can and will do more!


Yep. This is a great product. Has the features you need, is super reliable and easy to manage.


We use OpsGenie. Not sure how widely it’s used, but given it's Atlassian I’d guess a non-trivial amount.


DataDog also launched their own Incident Management tool, not sure how widely it's used: https://www.datadoghq.com/blog/incident-response-with-datado...


Technically Splunk On-Call. But I have a few pain points with it, and I miss PagerDuty.

If you want to see what teams you are on as the current logged-in user, the only way to do it, as far as support told me, is to search for yourself and then check that result.


I see my teams listed under my user profile. Or if I go to the left side bar and click on my name, it says when I'm next on-call for various teams. But the UI looks different than last time I logged in a few weeks ago, so maybe something has changed.

Disclaimer: Am an employee.


I've been seeing them recommended more and more, and myself have been keeping a passive eye on BetterUptime (which has an on-call feature): https://betteruptime.com/incident-management


There is xMatters: https://www.xmatters.com/

Disclaimer: I work at xMatters.


We started using Squadcast: https://squadcast.com

Their free and lower-priced tiers offer a lot of what others have on their top/most expensive tiers. Also, integrations with various alert sources are just easier in most cases. I spent I don't know how long trying to get OpsGenie to work before I gave up.



My team uses PagerTree. Easy to get started with, has the tools you need without being overcomplicated.


PD does two big things:

  1. Alerting: Phones you when your servers are down.
  2. Incident Management: Help coordinate a response across multiple people.

For the first, there's also:

  - OpsGenie (owned by Atlassian)
  - Squadcast
  - VictorOps (now Splunk On Call)
  - xMatters
  - PagerTree

For the second, there's a bunch of new contenders:

  - Datadog now has an IM product
  - Blameless
  - Rootly
  - Incident.io
  - FireHydrant


https://allma.io/ is also in the 2nd category and is a pretty awesome product so far, IMO.


You can also check out https://tellspin.app for a directly-in-Slack solution


Spike.sh - https://spike.sh

I may be biased as a co-founder of Spike.sh, but I think we have one of the best designed incident management products out there. We've focused on making it easy to create on-call schedules and overrides, and added templates for escalation, on-call and alert rules.


I use VictorOps (Now Splunk On-Call) currently and it does the job. Its shift override functionality is quite confusing to get your head around at first but makes sense after the first few times.

I've also used OpsGenie (Atlassian now) and really enjoyed it. The number of integrations they have is staggering.


There’s Splunk OnCall (formerly known as VictorOps). It’s a very decent solution.


Also, what happens if PagerDuty goes down?


Ideally, the services you use should handle that (detect a non-200 and fire off a backup method like a slack webhook or email.)

In reality, probably a lot of missed downtime events, and ops sleeping peacefully I guess.
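
A rough sketch of that fallback, with stand-in sender functions instead of real HTTP calls. The `== 200` check mirrors the "non-200" wording above; a real integration would accept whatever success codes its provider documents. All names here are hypothetical.

```python
def notify(alert, primary, backups):
    """Try the primary pager first, then each backup channel in order.

    Each sender is a callable that takes the alert and returns an HTTP
    status code. Returns the name of the channel that accepted the alert,
    or None if every channel failed.
    """
    for name, send in [("primary", primary), *backups]:
        try:
            status = send(alert)
        except OSError:
            # Network-level failure: treat it like a non-200 response.
            continue
        if status == 200:
            return name
    return None


# Stand-in senders for illustration; real ones would make HTTP requests.
def pager_down(alert):
    return 503


def slack_webhook(alert):
    return 200


def email_gateway(alert):
    raise ConnectionError("mail relay unreachable")


delivered_via = notify(
    "db-primary unreachable",
    primary=pager_down,
    backups=[("slack", slack_webhook), ("email", email_gateway)],
)
```

If `notify` returns None, every channel failed, which is the "ops sleeping peacefully" case.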


In the year I used it, I never personally noticed it going down. Although that being said, their SLA is only 99.9% delivery in any calendar month within 5 minutes. The penalty for missing that SLA is only 10% of that month's bill.

> Once an Incident is triggered, PagerDuty will deliver the First Responder Alert within the Notification Delivery Period for 99.9% of the notifications sent by PagerDuty for the Customer during any calendar month. The “Notification Delivery Period” is five (5) minutes and it is measured as the time it takes PagerDuty to deliver a First Responder Alert to telecommunication providers in accordance with the Service configuration and Contact Information.

> ...

> If PagerDuty fails to meet the SLA set forth herein, Customer may receive a service credit. Customer will be eligible for a credit toward future fees owed to PagerDuty for the PagerDuty Service. The Service Credit is calculated as ten percent (10%) of the fees paid for or attributable to the month when the alleged SLA breach occurred.

https://www.pagerduty.com/standard-service-level-agreement/
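
To put numbers on that, a quick sketch using the 99.9% target and 10% credit from the quoted SLA. The notification counts and the $500 monthly bill are made-up inputs; integer math avoids float rounding at the threshold.

```python
def sla_met(total_notifications, late_notifications):
    """True if at least 99.9% of notifications were delivered in the window."""
    on_time = total_notifications - late_notifications
    # Integer form of: on_time / total >= 0.999
    return on_time * 1000 >= 999 * total_notifications


def service_credit_cents(monthly_fee_cents):
    """Ten percent of the fees for the month of the breach, in cents."""
    return monthly_fee_cents * 10 // 100


# Hypothetical month: 10,000 notifications and a $500.00 bill.
ok_month = sla_met(10_000, late_notifications=10)
bad_month = sla_met(10_000, late_notifications=11)
credit = service_credit_cents(50_000)
```

So at 10,000 notifications a month, up to 10 can be late without breaching, and the 11th late one is worth a $50 credit on a $500 bill.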


It's very rare for them to go down. I think I can remember one major outage during business hours in the last few years at which point we just switched to manual monitoring for the few hours.

If that is within your outage model, you'd probably want a redundant on-call service I suppose, even if it's just escalating to a single known email or sms group.


Your service(s) going down and PagerDuty going fully down at the same time is very unlikely to happen. Even if it does, you're probably going to get called by customer support because users never go down;)


xMatters


Opsgenie?


I'm a grafana fan and a current user of PagerDuty. Maybe there's more to the story but after reading the post I feel like using a calendar integration to manage on-call schedules is the wrong approach. Calendar events are a result of overlaying a rotation on a date range: they're the output, not the input. I'm sure the designers here have looked at how PD enables creating and editing rotations. Curious to know their views on it.
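
The "output, not input" point can be sketched as a function that takes a rotation and a date range and produces the calendar events. The member names, the weekly shift length, and the dates are assumptions for illustration, not PagerDuty's or Grafana OnCall's model.

```python
from datetime import date, timedelta


def expand_rotation(members, start, shift_days, until):
    """Yield (shift_start, shift_end, person) for shifts starting before `until`."""
    shift = timedelta(days=shift_days)
    index = 0
    current = start
    while current < until:
        yield current, current + shift, members[index % len(members)]
        current += shift
        index += 1


# Hypothetical weekly rotation starting on a Monday.
rotation = ["alice", "bob", "carol"]
events = list(expand_rotation(rotation, date(2021, 11, 8), 7, date(2021, 12, 6)))
```

Editing the rotation (the input) and regenerating keeps every future event consistent, which hand-edited calendar entries do not guarantee.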


Shameless plug, if you're looking for a simple shift scheduling calendar connected to Slack, I built this: https://turnshift.app.

It's a team calendar to share recurring tasks as a team. Things like PR reviews, who's on support, or who's qualifying leads.

It has far fewer features than PagerDuty or Grafana OnCall, but it serves a bunch of customers well who are looking for a simple tool to manage team schedules.


A few more screenshots of the "Scheduling" options would've been great...

We're (more or less) using OpsGenie's free tier, however their scheduling never really "clicked" with me... not sure if I'm special in that regard, however I find the UI/UX pretty... weird...


I'm not sure what this is competing with in its current incarnation.

I need corresponding mobile phone applications for any alert product I intend to use that can override DND/volume etc. on my phone so I can get woken up at night and respond to problems.


But is it possible to send SMS messages or phone calls directly from Grafana OnCall? If yes, what is the pricing?



