A new survey conducted by Transposit, a provider of an incident management platform, has revealed that many organisations are embracing artificial intelligence (AI) and automation solutions to improve their incident management capabilities, as they grapple with the mounting costs and frequency of service incidents.
The survey polled over 1,000 IT operations, DevOps, site reliability engineering (SRE), and platform engineering professionals in the US. Among the key findings, 66.5% of respondents observed an increase in the number of service incidents impacting their customers over the past year. This uptick in incidents translates directly into higher downtime costs, with 63% reporting losses of up to $499,999 per hour of downtime and 46.6% citing costs ranging from $100,000 to $2 million per hour.
In addition, 62% of respondents have seen incident resolution times increase over the past 12 months, with resolution taking up to 6 hours on average from initial alert to mitigation. Although 71% say their automation is satisfactory and 59% have defined processes, only 33% report having automated 11-25% of their workflows.
Despite 59.4% indicating they have defined incident management processes in place and 71.1% noting their current automation meets organizational needs, resolving these frequent incidents still proves challenging. Some 73.9% stated their reliability engineering teams struggle to resolve incidents because of brittle automation scripts, manual processes, and difficulty accessing specialized knowledge. Moreover, 79.8% said it takes up to 6 hours on average to fully resolve an incident, from initial alert to issue mitigation.
Complicating incident resolution further, 42.5% described their existing process as ineffective or only partially adopted across the organization. Team members struggle to consistently follow defined procedures due to confusing documentation, limited tool access, and over-reliance on individual knowledge rather than institutional knowledge. 71.3% also reported it takes up to 30 minutes just to assemble the appropriate team members when an incident occurs.
The research also highlighted roadblocks to increasing automation within incident management workflows. The most frequently cited obstacles were insufficient leadership buy-in (57%), inadequate knowledge sharing (54%) and process documentation (54%), and a lack of clarity on ideal automation targets (52%). However, organizations using SaaS-based tools were able to implement automations much faster than those relying on custom in-house solutions.
To enhance incident management and reduce mean time to resolution (MTTR), 72.1% of respondents expect their organization's tech stack to expand over the next 12 months. 60% are looking at implementing AI- or machine learning-based tools, with 53.1% eyeing new automation tools and applications. Hiring is also slated to rise for site reliability engineers and platform engineers, with 69.3% projecting increased headcount.
The findings revealed broad consensus that AI and human-in-the-loop automation are pivotal for improving data accuracy, accelerating incident response, automating repetitive tasks, and reducing noise through event correlation. 89.8% believe integrating generative AI capabilities into platforms could decrease the time required to create new automations.
However, 90.2% agreed automation should allow human judgment at critical decision points to ensure reliability, a 9.8% increase from the previous year's survey. Additionally, an overwhelming 96.3% think consolidating tools into a single, integrated platform would prove beneficial.
“The insights reveal the urgent need for adaptive, LLM-based automation that evolves as APIs, tools, and processes change,” commented Divanny Lamas, Transposit CEO. “Instead of relying solely on predefined rules, generative AI can play a transformative role in incident management by assimilating cues and context in real-time.”
As downtime costs continue to mount amid rising service incidents, the study highlights AI and automation as instrumental in equipping modern operations teams to minimize disruptions, resolve incidents swiftly, and substantially reduce MTTR through an integrated solution. With human judgment built in at key decision points, this approach can let teams focus on resolving intricate problems rather than on repetitive tasks.