Five laws of incidents and problems

Incident and problem management exist to restore a service, fix the issue, work out why the issue or outage happened in the first place and then try to make sure it doesn’t happen again. All teams should work together, following the right priorities, to keep downtime to the business to a minimum on every incident. We have all seen the analogies used to explain incidents and problems.

eg. http://www.reddit.com/r/ITIL/comments/2d1zga/how_do_you_explain_the_difference_between/

https://itilbegood.com/2014/07/28/requests-incidents-problems-and-known-errors-in-a-nutshell/

However, where it gets a bit confusing is this: where does investigating an incident’s root cause and restoring the service cross over into problem root cause territory? Why should an engineer who is investigating an outage have to raise a problem if they already have the incident from the customer? Surely this all seems like a lot of paperwork for a few clicks.

Therefore, after a few years doing support, I wanted to put a stake in the ground. Everyone can shout me down, but at the end of the discussion / bloodbath we might have a solution. Of course it does depend on the organisation, but there seems to be some confusion about incidents and problems.

At the heart of the matter is this truth:

Between incidents and problems, you should be able to restore the service quickly, find the root cause, and either mitigate the cause or publish a workaround so that future incidents can be fixed quicker. The whole purpose is to provide fixes so that the business operation is minimally impacted. If there is an impact, the situation should be recovered and steps taken to mitigate or minimise the impact the next time it occurs.

OK, so let’s look at two incidents: in one, a customer can’t access their file shares; in the other, a customer calls in and says their Citrix sessions have hung, and two minutes later another person calls to say their Citrix sessions have also hung.

For the first one, the support engineer picks up the call and, after some troubleshooting, realises the customer’s password has expired. After a reset and a reboot, the customer is up and running. The way to mitigate it is to tell the customer to reset the password before it expires. So this incident has gone through restoring the service, finding the root cause and mitigating the issue.

Next, the engineer checks the Citrix sessions and finds that both customers are on the same server. The engineer cannot remote onto the server, so the server looks like it has crashed. There is a known error entry which tells the engineer to take the server out of the load balancer and reset the customers’ sessions; the customers will reconnect to another server, so service is restored. The engineer then reboots the server and, after the reboot, the server looks fine. However, would you put the server back into the live environment?

These two incidents illustrate the issue. The engineer on the first call was competent to go through all the steps and complete the incident. However, is the engineer competent to go through all the steps of troubleshooting the server? Maybe not; maybe a Citrix team needs to check out the server before it is put back into the production environment. This is where a problem should be raised. The incident can be closed or linked to the problem, but a problem should be raised because someone needs to work out why the server crashed while the production environment continues to function.

Law one, raising a problem comes down to the competency of the support team. Can they restore the service, find the root cause and mitigate it within an incident, or can they only restore the service and then raise a problem for a specialist team to find the root cause and mitigate the issue?

Next, time needs to be monitored on incidents. Engineers love to troubleshoot and fix issues, trying fix after fix to get to the bottom of things, but this may take an hour. Is this good for the business? If the engineer could put in a workaround in the first 5 minutes, leave the customer to get on with their day and raise a problem to investigate further without needing to bother the customer, then surely this is a better way of working from the business point of view?

Law two, where a workaround is available for an incident it should be implemented, and a problem should be raised to find the root cause at a later date. The priority is to restore the service to the business.
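As a rough illustration of laws one and two together, the decision an engineer faces on an open incident might be sketched like this. The `Incident` fields and the `next_steps` function are hypothetical names made up for this example; they are not taken from ITIL or from any ticketing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    summary: str
    workaround_available: bool      # law two: is there a known workaround?
    team_can_find_root_cause: bool  # law one: can the support team diagnose it?


def next_steps(incident: Incident) -> List[str]:
    """Return the ordered actions for an incident (illustrative only)."""
    steps = []
    if incident.workaround_available:
        # Law two: restore service first, investigate later under a problem.
        steps += ["apply workaround", "raise problem"]
    elif incident.team_can_find_root_cause:
        # Law one: the support team can do everything inside the incident.
        steps += ["restore service", "find root cause and mitigate"]
    else:
        # Law one: restore what you can, then hand over to a specialist team.
        steps += ["restore service", "raise problem"]
    steps.append("close or link incident")
    return steps
```

With this sketch, the expired password incident stays entirely within the incident, while the Citrix crash, which has a known error entry, gets its workaround applied and a problem raised.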

When to raise a problem should be a matter of governance. ITIL explains this in ITIL Service Operation, page 99 (service operation process – Incidents versus problems):

The rules for invoking problem management during an incident can vary and are at the discretion of individual organisations.

Therefore when to raise a problem is up to the organisation. In the example of the Citrix server, I would suggest a problem should be raised when many customers are impacted, when a key service or server is impacted, or to group incidents together to raise with 3rd party suppliers in supplier meetings. For example, the support teams notice that a few hard drives are failing within their first few months; these incidents could be grouped together to raise with the 3rd party supplier.

Law three, governance should write up rules on when a problem should be raised and communicate them clearly to the IT organisation.

eg A problem should be raised for all Citrix server crashes and assigned to the Citrix team
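Rules like this could be written down as data rather than prose, which makes them easier to communicate and to check. The following is only a sketch: the category labels, the threshold of ten customers and the "Problem manager" owner are all assumptions for illustration, and each organisation would set its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    category: str            # e.g. "citrix-server-crash" (hypothetical label)
    customers_affected: int
    key_service: bool


# Hypothetical rule table: categories that always need a problem, and the owner.
ALWAYS_RAISE = {"citrix-server-crash": "Citrix team"}
MANY_CUSTOMERS = 10  # assumed threshold for "many customers"


def problem_owner(incident: Incident) -> Optional[str]:
    """Return the team a problem should be assigned to, or None if no rule fires."""
    if incident.category in ALWAYS_RAISE:
        return ALWAYS_RAISE[incident.category]
    if incident.key_service or incident.customers_affected >= MANY_CUSTOMERS:
        return "Problem manager"  # placeholder owner when no team is named
    return None
```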

Incidents should be monitored for trends to check whether a problem could be raised to mitigate recurring incidents. Monitoring the incidents can also help check whether a workaround could be put in place for a long-running incident and a problem raised to find the root cause.

Law four, all incidents should be monitored for trend analysis and time to fix to see if a problem can be raised to mitigate the underlying issue.
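A very simple form of this monitoring could look like the sketch below, which flags incident categories that keep recurring or that took a long time to fix. The function name, the `(category, hours_to_fix)` record shape and both thresholds are assumptions for illustration; a real service desk tool would have its own reporting.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_categories(incidents: Iterable[Tuple[str, float]],
                         min_count: int = 3,
                         max_hours: float = 4.0) -> List[str]:
    """Return categories worth reviewing for a problem record.

    Each incident is a (category, hours_to_fix) pair. A category is flagged
    when it occurs at least `min_count` times, or when any single incident
    in it took longer than `max_hours` to fix.
    """
    incidents = list(incidents)
    counts = Counter(category for category, _ in incidents)
    flagged = {c for c, n in counts.items() if n >= min_count}
    flagged |= {c for c, hours in incidents if hours > max_hours}
    return sorted(flagged)


log = [
    ("hard-drive-failure", 1.0),
    ("hard-drive-failure", 0.5),
    ("hard-drive-failure", 2.0),
    ("password-expired", 0.1),
    ("citrix-session-hung", 6.0),
]
print(recurring_categories(log))
```

Here the hard drive failures are flagged because they recur, and the hung Citrix session because it took too long to fix; the single quick password reset is not flagged.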

Finally, once the root cause is found, through either an incident or a problem, one of two things should happen:

– Mitigate the issue.
– Add the issue to the known error database with a workaround / fix.

Law five, all root causes should be mitigated, or the fix time shortened by writing up a known error entry with a fix or workaround.

I believe that by following these laws engineers have scope to troubleshoot issues as they come in, whilst downtime to the business operation is minimised.
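To show how a known error entry shortens the fix time, here is a toy known error database. The class and field names are hypothetical, and the exact-match lookup on the symptom text is a deliberate simplification; a real KEDB would need proper search.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class KnownError:
    symptom: str
    workaround: str
    root_cause: Optional[str] = None  # may be unknown while the problem is open


class KnownErrorDatabase:
    """Toy KEDB keyed on the lower-cased symptom text (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, KnownError] = {}

    def add(self, entry: KnownError) -> None:
        self._entries[entry.symptom.lower()] = entry

    def lookup(self, symptom: str) -> Optional[KnownError]:
        return self._entries.get(symptom.lower())


kedb = KnownErrorDatabase()
kedb.add(KnownError(
    symptom="Citrix sessions hung",
    workaround="Take the server out of the load balancer and reset the sessions",
))
```

The next engineer who takes a call about hung Citrix sessions can look up the entry and apply the workaround straight away, rather than troubleshooting from scratch.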

What does everyone think?

Thank you for reading my post. This is my opportunity to blog about a subject I love but am still learning. These posts are my way of showing how I understand the subject; however, I would encourage you to leave comments. Did you agree or disagree with the post? Did I explain something poorly or incorrectly? Do you want me to blog about another subject within ITIL? All feedback helps me to understand more. Thank you.

3 thoughts on “Five laws of incidents and problems”

  1. You are still thinking of Incident and Problem as functions not processes, i.e. as teams: that a problem record is only required if we want to functionally (horizontally) escalate to a “problem team”.

    The problem record represents the thing we are dealing with, not the people dealing with it. It’s the same thing whether the service desk or Level Two resolve it.


    1. …if there was a fault causing an incident, and that cause was repaired, there should be a problem record regardless of who did it. The records don’t exist just to track current work, they also serve as historical data, which is crap if we aren’t opening a record for every problem.


  2. Hi itskeptic, I think my write up needs to be adjusted as I do not think problems should be functions but should always be processes.

    The points I was trying to make are that an incident should be used to restore the service, but the triggers for when a problem should be raised need to be defined. In my write up the triggers were time, competency, governance rules and trends. The problem record would then be drawn up and dealt with by a team, virtual team or an individual. In the example, a Citrix team would take up the problem record, check the server, find the root cause of the crash and mitigate it or add a known error entry to the KEDB. However, this could have been done by the engineer who restored the service; the important thing is that the service is restored under the incident, and the problem can be worked on after that.

    However, I disagree that in practice a problem record should be drawn up for every incident where the cause was found. I know this is the practice in the ITIL books, but if I have an incident which just needs a password reset then I am unlikely to want to increase my workload by closing an incident and then raising a problem just to record password reset as the root cause. This is why I thought of the first two laws, competency and time. This would allow support engineers some time, eg 4 hours, to work on the incident, which means relatively easily fixed issues could be closed off quickly without creating more paperwork than there needs to be. However, as in law five, all root causes found through incidents and problems should be added to the KEDB to reduce fix times for all future incidents.

    Thank you very much for commenting. I hope more people do, as I use this blog to write down how I understand various subjects or ideas, and I wish more people would comment so I can learn more.

