Why do you need a new IT incident management framework?

Incidents and outages are an existential threat to businesses that build, operate and consume technology services. Businesses and customers rely heavily on these critical systems. When they fail, customer trust can be irreparably damaged, putting both the company's reputation and revenues at stake.

Your teams are already responding to incidents, but how are they doing it? How are they adapting as the technology landscape changes? Could they do it better?

With this new approach to incident management, we help you aim for the ideal state and change the incident narrative from one of blame to one of long-term learning.

We also provide real-world, right-sized patterns and examples that support incremental improvement and long-term behavioural change. These are pragmatic, tactical practices, with examples from some of the best practitioners and companies, for a complicated topic that is difficult to cover well.

ITIL is no longer sufficient for incident management.

The traditional ITIL-based incident management framework provided organisations with a structured way to classify, manage and resolve incidents. This framework, as well as adjacent processes such as problem management, has become the reference model for organisations to deal with the reality of incident management.

However, today's enterprise software systems are made up of hundreds of different systems and technologies that interact in surprising ways. As complexity has increased, the ITIL framework has not evolved to cope with the messy reality.

As a result, the traditional way of thinking about and dealing with incidents has become an operational debt and can prevent organisations from evolving. There is also a dearth of practical and accessible experience on how leading companies are dealing with the realities of incident management and response in this complex world.

The shortcomings of traditional incident, problem and service request management are as follows:

  • They focus on blame and finger-pointing as opposed to research, learning and improvement.
  • Incidents are treated as exceptions: both incident workflows and incident work are kept separate from the day-to-day work of the teams that build and run the software.
  • Lack of practice, which creates a reactive rather than a proactive stance.
  • Focus on a single "root cause" versus understanding multiple contributing factors and making broad-based improvements.
  • Insufficient protocol and response structure.
  • Lack of tools and practices that provide visibility into the incident response process.
  • Limited impact assessment and understanding.

What are the benefits of improved incident management?

Incidents cannot be avoided. But we can greatly reduce the frequency, duration and impact of incidents on both our customers and the employees who operate these systems.

The benefits of improving incident management and response are considerable and can lead to reduced impact on customers, increased customer confidence in the company, reduced stress on teams and employees, and increased revenue.

The general principles of improvement are as follows:

  • Move towards a "just culture" where incidents are used as an opportunity to learn.
  • Integrate workflows and understanding of incidents into normal operational behaviour throughout the service lifecycle.
  • Encourage ownership and accountability for customer outcomes.
  • Use incidents to expose the true behaviour of the IT system and process.
  • Recognise that complex systems fail in surprising ways.
  • Break down silos and build trust by encouraging collaboration and mutual learning.
  • Make incremental and continuous improvements.

A new incident management framework

The incident problem space is very broad. Our goal is to break it down, dispel the myths around it and create a framework that can gain depth and breadth over time as our industry learns more.

We propose this new incident management framework: Prepare, Respond, Review.

The figure below describes the cycle of incident response patterns (Preparedness, Response, Review), as well as common patterns within the pre-incident (preparedness), incident response (response) and post-incident (review) phases.

Preparedness: Pre-incident patterns

  • Making incidents visible and part of daily work
  • Well-defined roles for incidents
  • Well-defined incident response triggers
  • Well-defined on-call rotation and schedule
  • Recruitment and training of on-call staff
  • Incident command training and certification
  • Well-defined communication plan
  • Well-defined behavioural protocols

Response: Incident response patterns

  • Periodic CAN Reports (Conditions, Actions, Needs)
  • Shared Incident Status Document
  • Incident call recording

Review: Post-incident response patterns

  • Local incident reviews
  • Global incident reviews
  • Elements for improvement following the review
  • Incident Review Template
  • Incident impact assessment

Before delving into patterns, we believe it is essential for organisations to measure how well their teams are currently doing in incident management.

Incident management evaluation

Below is an incident response assessment: a collection of probing questions that you and your team can answer to evaluate your current incident response preparedness. Take these questions back to your team to see how well you are doing and where there is room for improvement.

Pre-incident assessment

Questions to ask your team

Making incidents visible and part of daily work

  • Do you have a shared backlog between operations and engineering teams that makes pre-work, response work and review work visible?
  • Do you have available capacity in operations and engineering for pre-work, response work and review work?
  • Do you share and discuss the backlog of incident work with stakeholders, including product management?
  • Do incidents have long handovers between first responders (help desk), secondary responders and tertiary responders?

Well-defined roles for incidents

  • Does your team have clear and specific roles to avoid overlaps, confusion and delays?
  • Who on your incident response team is responsible for driving resolution in a timely manner and keeping all members of the response team on track?
  • Do you conduct a post-mortem after every incident to help the team improve in areas that were missed during the incident?

Well-defined incident response triggers

  • Is your team overloaded with alerts and notifications?
  • How long does it take to engage responders who have the skills needed to resolve the problem?
  • Does everyone understand the business impact associated with reported outages?

Well-defined on-call rotation and schedule

  • Do your teams have a scheduled on-call rotation?
  • Does on-call rotation include developers?
  • Can other teams easily find the right person to contact during an incident if they need help?
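
As a minimal sketch of how a scheduled rotation could answer "who is on call right now?", the Python below assumes a weekly rotation over a fixed list of responders. The names, the start date, the weekly cadence and the `on_call_for` helper are all illustrative assumptions, not part of the framework.

```python
from datetime import date

# Hypothetical rotation: the names and the weekly cadence are assumptions.
ROTATION = ["ana", "bruno", "carla"]  # can mix developers and operations staff
ROTATION_START = date(2024, 1, 1)     # a Monday; the first shift begins here


def on_call_for(day: date, rotation=ROTATION, start=ROTATION_START) -> str:
    """Return who is on call on `day`, assuming shifts change every 7 days."""
    if day < start:
        raise ValueError("day is before the rotation start")
    weeks_elapsed = (day - start).days // 7
    # Wrap around to the first responder once everyone has had a shift.
    return rotation[weeks_elapsed % len(rotation)]
```

Publishing a lookup like this where other teams can reach it is one possible way to make the right contact easy to find during an incident.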

Incident command training and certification

  • What activities does your organisation undertake to ensure that each incident response is managed in a consistent and collaborative manner?
  • Do you have a structured training programme for your incident response leaders?
  • How do you communicate incident response roles and ensure that each responder is aware of those roles, including who is responsible for incident resolution?

Well-defined communication plan

  • Do you have a defined incident communication plan?
  • Does your communication plan describe the owner of the communication, frequency of communication, content, audience and delivery?
  • Does each service/application have its own specific communication plan?
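
The five elements named above (owner, frequency, content, audience and delivery) can be checked mechanically. The sketch below is one hypothetical way to record a per-service plan and list what is still undefined; the `CommunicationPlan` class, the `missing_elements` helper and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class CommunicationPlan:
    """Hypothetical record of a per-service incident communication plan."""
    service: str
    owner: str = ""      # who sends the updates
    frequency: str = ""  # how often updates go out
    content: str = ""    # what each update should contain
    audience: str = ""   # who receives the updates
    delivery: str = ""   # channel used to deliver them


def missing_elements(plan: CommunicationPlan) -> list[str]:
    """Return the names of plan elements that are still blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Example values are invented for illustration.
plan = CommunicationPlan(
    service="checkout",
    owner="incident commander",
    frequency="every 30 minutes",
    audience="customers",
)
print(missing_elements(plan))  # ['content', 'delivery']
```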

Active incident response assessment

Periodic CAN reports

  • Does your incident process have a well-defined status report/CAN for stakeholders?
  • Have you defined a regular cadence for sending status/CAN reports to stakeholders?
  • Do you have a dedicated scribe to manage the CAN reporting process?
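
To make the CAN format concrete, here is a minimal sketch of a report a scribe might fill in and send on a fixed cadence. The `CanReport` class, the `next_report_due` helper, the field descriptions and the 30-minute default are assumptions for illustration; the source names only the three headings (Conditions, Actions, Needs).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CanReport:
    """One status update in the Conditions, Actions, Needs format."""
    conditions: str  # assumed meaning: what is happening right now
    actions: str     # assumed meaning: what responders are doing about it
    needs: str       # assumed meaning: what the response team requires
    sent_at: datetime

    def render(self) -> str:
        """Format the report as plain text for stakeholders."""
        stamp = self.sent_at.strftime("%Y-%m-%d %H:%M")
        return (
            f"CAN report ({stamp})\n"
            f"Conditions: {self.conditions}\n"
            f"Actions: {self.actions}\n"
            f"Needs: {self.needs}"
        )


def next_report_due(last: CanReport, cadence_minutes: int = 30) -> datetime:
    """Return when the next report is due, given an assumed fixed cadence."""
    return last.sent_at + timedelta(minutes=cadence_minutes)
```

Keeping the three headings fixed means stakeholders can scan successive reports for changes without re-reading the whole update.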

Shared incident status document

  • Do all members of the incident response team actively record and share information?
  • How do new responders obtain relevant background information on the incident?

Incident call recording

  • Do you record your incident calls so that your incident response team has the ability to review the details of an incident?
  • Does your incident response team keep data that it can return to if the core incident team needs to revisit parts of the resolution to settle missing information, confusion or disagreements?
  • What information does your team have to review during post-mortem and incident review meetings?

Incident responder coordination

  • Are your incident responders coordinating to resolve incidents more quickly and develop domain knowledge?
  • Are your tickets handled by incident responders in real time or are they handled in a tiered approach?

Post-incident assessment

Local incident reviews

  • How long after an incident is resolved do your responders conduct a review?
  • Does your environment foster continuous improvement, learning and accountability by conducting no-blame incident review workshops?
  • Does your organisation capture and share improvements from incident review and documentation across the organisation?

Global incident reviews

  • How often does your organisation meet to review incidents and disseminate lessons learned to all teams?
  • Do you regularly ask actionable questions to foster a culture of open incident review?
  • Do you engage cross-functional teams and stakeholders to build resilience across the organisation?
  • During global incident reviews, are other teams asked to help with improvement items and broad-based improvement patterns?

Elements for improvement following the review

  • Do your teams identify actionable system improvements after an incident?
  • Are these improvements consistently tracked, prioritised and implemented?
  • Do your teams make trade-off and risk decisions on backlog improvements?

Incident Review Template

  • Do you have a work management system or central knowledge repository to store and share incident review information?
  • Does your incident response team have an incident review template?
  • Do you regularly evaluate how you collect incident information to identify necessary adjustments?

Incident impact assessment

  • Do your teams have a framework for assessing the impact of an incident?
  • Is the incident impact assessment part of the review process?
  • Do you take advantage of incident impact assessment to discover the true behaviour of the system?
  • Do you use the results of incident impact assessments to inform overall improvements?

Conclusion

We recognise that each environment has its own priorities and constraints, but like any good architecture, there tends to be a high percentage of consistency between organisations. The key outcomes around rapid incident identification and resolution are universal. These patterns have been developed to reflect those requirements, as well as to present some emerging patterns that have worked well for high performing teams.

The desired state for incident response should encompass some key characteristics:

  • The organisation must be able to identify the source of the incident quickly and promptly give the right people the information they need to resolve the problem.
  • Incident response teams should work collaboratively with the common goal of resolving the problem with transparency, clear communication and in a manner that can lead to continuous improvement.
  • Incidents should be reviewed with an emphasis on organisational learning and improvement actions, rather than assigning root cause and blame.

You can contact our Operation Management specialists without obligation so that we can analyse how to optimise your IT department.

2022-10-24T13:09:40-03:00