Editorial Note: This exclusive blog post is written by Botmetric CTO Akash Bhunchal.
Every IT Ops engineer receives, on average, hundreds of alerts, perhaps even more, in their inbox or on Slack. Upon seeing an alert, the engineer spends up to 20 minutes tending to it: identifying the issue, analyzing it, running various commands to get diagnostics, and then resolving it. And when the identical alert resurfaces, the same mundane steps have to be executed all over again. ‘Sigh! Not again!’ is the usual reaction. As the day progresses, these alerts slowly snowball into extreme exhaustion, which is what we call alert fatigue.
Hear, hear! Are you a DevOps or IT Ops engineer who can relate to this situation? Our philosophy at Botmetric is that humans should solve new problems and machines can solve known problems. So we are on a journey towards NoOps, which essentially means: Say No to solving known Ops problems again and again manually! Automate them so that machines can take care of them.
To this end, Botmetric is delighted to announce the launch of the Incidents, Actions, & Triggers app as part of its Ops & Automation product. By using the Incidents & Triggers flow, DevOps and IT Ops engineers like you can automate remediation for known operational problems in your infrastructure and use intelligent automation to handle alerts and assuage alert fatigue.
Above all, Botmetric Incidents, Actions, & Triggers handles the complete life cycle of an alert, from collection to correlation to the actual execution of a diagnostic or remediation action on the impacted infrastructure. All of this happens automatically, without any human intervention, so that your precious engineering time is saved and you can spend more time on all things rip-roaring rather than all that humdrum.
Why Botmetric Built Incidents, Actions, & Triggers
Botmetric Incidents, Actions, and Triggers was built to solve five major problems:
1. Alert Deluge
Often, monitoring alerts and event notifications come in the form of emails and arrive in the hundreds, sometimes thousands, in full flamboyance, looking all red. Common reasons for this avalanche of alerts are improperly configured alerts and noisy hosts sending too many events.
2. Spotting critical alerts/problems amid the noise
Sometimes alerts are a precursor to an imminent problem. There is no easy way to quickly run analytics on these alerts and sieve the real issue from the noise. It’s critical that you prioritize alerts by severity, yet among the deluge of alerts, you will naturally be irritated by the time you figure out the most critical ones. Moreover, you will have spent precious time on every notification that pops up in your inbox or on Slack. Without some intelligence applied to the alerts, there is no shortcut to determining the problem or the critical alert quickly.
3. Manually solving same problems over and over
A known issue may also recur on the same machine, time and again. The resolution could be as simple as checking the specific service/system utilization and restarting a service. However, precious human and engineering hours are wasted doing this manually each and every time! In a worst-case scenario, it could be a false alert for which no action is actually needed, but you still have to read the alert, log in to the system, and verify the context to determine that it is not a useful event. Until Botmetric built the Incidents app in Ops & Automation, there was no automated way of debugging or resolving such problems.
4. Dealing with multiple monitoring systems, a necessary evil
Often multiple systems are used for monitoring, log management, APM, API/URL checks, etc. And every monitoring system has its own semantics for metrics and alerts. The same alert may appear differently in different monitoring systems. The need of the hour is to correlate these alerts across monitoring systems to give a unified view to the engineers and DevOps.
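To make the idea of a unified view concrete, here is a minimal sketch of mapping alerts from two monitoring systems onto one common shape. The source names (system_a, system_b), the payload field names, and the UnifiedAlert type are all hypothetical and chosen for illustration only; they are not the real payload formats of any monitoring product, nor Botmetric's internal model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedAlert:
    """One common shape for alerts, whatever system produced them."""
    source: str
    host: str
    metric: str
    value: float


# Hypothetical field names per source; real payloads will differ.
FIELD_MAPS = {
    "system_a": {"host": "hostname", "metric": "metric_name", "value": "current_value"},
    "system_b": {"host": "server", "metric": "check", "value": "reading"},
}


def normalize(source: str, payload: dict) -> UnifiedAlert:
    """Translate a source-specific payload into a UnifiedAlert."""
    fields = FIELD_MAPS[source]
    return UnifiedAlert(
        source=source,
        host=str(payload[fields["host"]]).lower(),
        metric=str(payload[fields["metric"]]).lower(),
        value=float(payload[fields["value"]]),
    )
```

Once both systems' alerts are normalized this way, the same high-CPU condition on the same host compares equal on host and metric, even though the two systems described it differently.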
5. Identifying recurring alerts
Monitoring systems send out multiple alerts and event notifications, one for each monitor/metric. However, more often than not, these alerts are related because they share the same underlying problem. For example, low disk space may lead to a process getting hung or a CPU spike. When such alerts arrive separately, it is difficult to correlate them and get to the root cause quickly. And if an alert or event is not useful, you shouldn’t be bothered with it at all.
How to Leverage Botmetric Incidents, Actions, & Triggers to Assuage Alert Fatigue
1. Integrate your monitoring systems with Botmetric
Integrate your existing monitoring systems, such as Datadog, New Relic, and CloudWatch, with Botmetric Ops and Automation, so that the Botmetric Agent can pull the alerts from there and collate them in one place for further machine analysis.
2. Set up Actions and Triggers for Incidents
Once the monitoring systems’ integrations are in place, you can start reviewing all the alerts and events. The alerts and events correlated for a particular host are classified and grouped together as Incidents by the Intelligent Correlation Engine (ICE). The primary goal is to reduce alert fatigue: with Incidents, you need to scout through far fewer items rather than browsing through hundreds or thousands of alerts.
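The grouping step can be pictured with a small sketch. This is not how ICE is implemented (the post does not describe its internals); it is the simplest possible stand-in, assuming each alert is a dict with a host key, and it groups purely by host.

```python
from collections import defaultdict


def group_into_incidents(alerts):
    """Group alerts by host; each host's list stands in for one Incident.

    `alerts` is assumed to be an iterable of dicts with a 'host' key.
    """
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["host"]].append(alert)
    return dict(incidents)
```

Three alerts spread over two hosts would collapse into two Incidents, which is the reduction in review effort the Incidents view is after.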
A correlated Incident can be resolved manually by the engineer through an acknowledgement after reviewing it; otherwise, Botmetric automatically resolves it after 48 hours of inactivity.
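The two resolution paths can be sketched as a single status check. The 48-hour window comes from the description above; the function name, the status labels, and the idea of tracking the time of the last alert are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# 48 hours of inactivity, as described above.
INACTIVITY_LIMIT = timedelta(hours=48)


def incident_status(last_alert_at: datetime, acknowledged: bool, now: datetime) -> str:
    """Return a hypothetical status label for an Incident."""
    if acknowledged:
        # The engineer reviewed and acknowledged the Incident.
        return "resolved"
    if now - last_alert_at >= INACTIVITY_LIMIT:
        # No new alerts for 48 hours or more.
        return "auto-resolved"
    return "open"
```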
3. Create Actions
Botmetric provides a way to execute a set of Actions for handling your alerts based on Triggers. You can add scripts or a program with a set of commands to handle your alert/event. For example, a script (check-thread-restart-apache.sh) can be written to handle APACHE alerts: it checks the number of Apache worker threads and restarts the Apache web server if the number of threads is greater than a desired threshold or MEMORY usage goes beyond 75% of the system capacity.
Or a script can be written to enable log rotation if DISK usage is close to 90% of the capacity and push the old logs to S3 for automatic archival.
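The decision logic of the Apache example can be sketched as follows. The script named above is a shell script whose contents are not shown here, so this is only an illustration of its logic in Python: the function names are hypothetical, the thread threshold is a parameter because no value is given, and the restart itself is passed in as a callable so the sketch does not assume any particular service manager or command.

```python
def should_restart_apache(worker_threads, max_threads, memory_used_pct,
                          memory_limit_pct=75.0):
    """True if threads exceed the chosen threshold or memory usage exceeds 75%."""
    return worker_threads > max_threads or memory_used_pct > memory_limit_pct


def run_action(worker_threads, max_threads, memory_used_pct, restart):
    """Call `restart` (supplied by the caller) only when a limit is exceeded."""
    if should_restart_apache(worker_threads, max_threads, memory_used_pct):
        restart()
        return "restarted"
    return "no action"
```

Keeping the check separate from the restart makes the same Action usable as a pure diagnostic: call should_restart_apache on its own to report the condition without touching the service.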
4. Deploy Triggers
A Trigger is an IF-THIS-THEN-THAT (IFTTT) job for handling alerts based on criteria you define; it deploys the diagnostic or remediation scripts that are created using Actions. For instance, you may want to automatically execute the above-mentioned script check-thread-restart-apache.sh when a high-CPU alert occurs and CPU usage is, say, more than 90%. Similarly, you can set up a Trigger to be executed automatically on a high disk usage alert. This eliminates the need for you to review alerts every time and lets intelligent automation handle all known operational problems.
You can also set up a job to trigger automation actions in response to alerts from different monitoring systems. You just need to define what action to take when a particular metric goes above a threshold on the selected hosts, using the Trigger workflow.
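As a rough picture of what such a Trigger definition carries, the sketch below models a Trigger as a metric, a threshold, a set of selected hosts, and the name of the Action to run. The Trigger class, its field names, and the alert dict shape are assumptions for illustration and not Botmetric's configuration format; rotate-logs.sh in the usage note is likewise a made-up script name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str        # IF this metric...
    threshold: float   # ...goes above this value...
    hosts: frozenset   # ...on one of the selected hosts...
    action: str        # ...THEN run this Action.


def matching_actions(alert, triggers):
    """Return the Actions of every Trigger whose criteria all match the alert."""
    return [
        t.action
        for t in triggers
        if alert["metric"] == t.metric
        and alert["value"] > t.threshold
        and alert["host"] in t.hosts
    ]
```

For example, with one Trigger mapping cpu above 90 on web-1 to check-thread-restart-apache.sh and another mapping disk above 90 to rotate-logs.sh, an alert for cpu at 95 on web-1 selects only the Apache script.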
5. Review the status of all the Alerts
Whenever Botmetric receives an alert, it checks whether one or more workflows match all the user-defined criteria for deploying a response to that alert. If any workflows match, the Action is executed on the impacted servers or cloud infrastructure without any engineer’s intervention. The status of each executed Action is available in the Execution History.
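The execute-then-record step can be sketched like this. The function name, the history record fields, and the status strings are hypothetical; the sketch only shows the general pattern of running an Action against a host and keeping a record of how it went, not the format of Botmetric's Execution History.

```python
from datetime import datetime, timezone


def execute_and_record(action_name, action, host, history):
    """Run `action(host)` and append the outcome to `history` (a list)."""
    try:
        action(host)
        status = "success"
    except Exception as exc:
        # A failed Action is recorded rather than raised, so it can be reviewed later.
        status = f"failed: {exc}"
    history.append({
        "action": action_name,
        "host": host,
        "status": status,
        "at": datetime.now(timezone.utc),
    })
    return status
```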
To Wrap Up:
As a DevOps engineer, take control of how you manage the deluge of alerts thrust upon you. Stop investing precious time and effort in identifying, analyzing, and running diagnostic commands to resolve known and recurring alerts. Start your journey towards NoOps with Botmetric Intelligent Incidents, Actions, & Triggers, and significantly reduce the time you spend resolving alerts.
You can access this app in the Botmetric Cloud Management Platform under Ops & Automation. Give it a 14-day try, experience what NoOps is, and write to us with your feedback. Until our next blog post, do stay tuned with us on Twitter, Facebook, or LinkedIn for other interesting news from us!