
Five laws of incidents and problems

Incident and problem management exist to restore a service, fix an issue, work out why the issue or outage happened in the first place, and then try to make sure it doesn’t happen again. All teams should work together so that, provided the right priorities are followed, every incident causes the minimum downtime to the business. We have all seen the analogies used to explain incidents and problems.

eg. http://www.reddit.com/r/ITIL/comments/2d1zga/how_do_you_explain_the_difference_between/

https://itilbegood.com/2014/07/28/requests-incidents-problems-and-known-errors-in-a-nutshell/

However, where it gets a bit confusing is this: where does investigating an incident’s root cause and restoring the service cross over into problem root cause territory? Why should an engineer investigating an outage have to raise a problem if they already have the incident from the customer? Surely this all seems like a lot of paperwork for a few clicks?

Therefore, after a few years doing support, I wanted to put a stake in the ground. Everyone can shout me down, but at the end of the discussion / bloodbath we might have a solution. Of course it does depend on the organisation, but there seems to be some confusion about incidents and problems.

At the heart of the matter is this truth:

Between incidents and problems, you should be able to restore the service quickly and find the root cause, with the cause of the incidents either mitigated or a work around published so future incidents can be fixed quicker. The whole purpose is to provide fixes so the business operation is minimally impacted. If there is an impact, the situation should be recovered and steps taken to mitigate or minimise the impact the next time it occurs.

OK, so let’s look at two incidents. In the first, a customer can’t access their file shares. In the second, a customer calls in to say their Citrix sessions have hung… and two minutes later another person calls to say their Citrix sessions have also hung.

For the first one, the support engineer picks up the call and, after some troubleshooting, realises the customer’s password has expired. After a reset and a reboot, the customer is up and running. The way to mitigate it is to tell the customer to reset the password before it expires. So this incident has gone through restoring the service, finding the root cause and mitigating the issue.

Next, the engineer checks the Citrix sessions and finds that both customers are on the same server. The engineer cannot remote onto the server, so it looks like the server has crashed. There is a known error entry which tells the engineer to take the server out of the load balancer and reset the customers’ sessions; the customers will reconnect to another server, so service is restored. The engineer then reboots the server and, after the reboot, the server looks fine. However, would you put the server back into the live environment?

These two incidents illustrate the issue. The engineer on the first call was competent to go through all the steps and complete the incident. However, is the engineer competent to go through all the steps of troubleshooting the server? Maybe not; maybe a Citrix team needs to check out the server before it is put back into the production environment. This is where a problem should be raised. The incident can be closed or linked to the problem, but a problem should be raised because someone needs to find out why the server crashed while the production environment continues to function.

Law one, raising a problem comes down to the competency of the support team. Can they restore the service, find the root cause and mitigate it within an incident, or can they only restore the service and then raise a problem for a specialist team to find the root cause and mitigate the issue?

Next, time needs to be monitored on incidents. Engineers love to troubleshoot and fix issues, trying fix after fix to get to the bottom of the issue, but this may take an hour. Is this good for the business? If the engineer could put in a work around in the first 5 minutes and leave the customer to get on with their day, then raise a problem to investigate the issue further without needing to bother the customer, surely this is a better way of working from the business point of view?

Law two, where a work around is available for an incident, it should be implemented and a problem raised to find the root cause at a later date. The priority is to restore the service to the business.

When to raise a problem should be a matter of governance. ITIL explains this in ITIL Service Operation, page 99 (service operation process – Incidents versus problems):

The rules for invoking problem management during an incident can vary and are at the discretion of individual organisations.

Therefore, when to raise a problem is up to the organisation. In the example of the Citrix server, I would suggest a problem should be raised when many customers are impacted or when a key service or server is impacted. Problems can also be used to group incidents together to raise with 3rd party suppliers in supplier meetings, eg the support teams notice a few hard drives failing in their first few months. These incidents could be grouped together to raise with the 3rd party supplier.

Law three, governance should write up rules on when a problem should be raised, and these rules should be clearly communicated to the IT organisation.

eg A problem should be raised for all Citrix server crashes and assigned to the Citrix team
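One way to make rules like this concrete is to write them down as a small decision function. The sketch below is only an illustration: the `Incident` fields, the threshold of five affected customers and the "Problem management" team name are assumptions of mine, not anything ITIL prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    # All fields are hypothetical; use whatever your service desk tool records.
    summary: str
    category: str            # e.g. "citrix-server-crash", "password-expired"
    customers_affected: int
    key_service: bool        # True if a key service or server is impacted


# Assumed governance rule: category -> team that owns the problem record.
ALWAYS_RAISE = {"citrix-server-crash": "Citrix team"}

# Assumed threshold for "impact to many customers".
MANY_CUSTOMERS = 5


def problem_owner(incident: Incident) -> Optional[str]:
    """Return the team a problem should be assigned to, or None if no problem is needed."""
    if incident.category in ALWAYS_RAISE:
        return ALWAYS_RAISE[incident.category]
    if incident.key_service or incident.customers_affected >= MANY_CUSTOMERS:
        return "Problem management"
    return None


crash = Incident("Citrix sessions hung", "citrix-server-crash", 2, True)
expired = Incident("Cannot access file shares", "password-expired", 1, False)
print(problem_owner(crash))    # Citrix team
print(problem_owner(expired))  # None
```

The point is not the code itself but that the rules are explicit enough for anyone on the service desk to apply them the same way.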

Incidents should be monitored for trends, to check whether a problem could be raised to mitigate recurring incidents. Monitoring incidents can also show where a work around could be put in place for a long running incident and a problem raised to find the root cause.

Law four, all incidents should be monitored for trend analysis and time to fix to see if a problem can be raised to mitigate the underlying issue.
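As a rough sketch of what this trend analysis might look like, the snippet below groups closed incidents by category and flags any category that recurs often or takes a long time to fix on average. The sample data, the category names and both thresholds are made up for illustration; governance would set the real ones.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical closed incidents: (category, minutes to fix).
incidents = [
    ("citrix-session-hung", 15),
    ("citrix-session-hung", 20),
    ("citrix-session-hung", 70),
    ("password-expired", 5),
    ("disk-failure", 90),
    ("disk-failure", 120),
]

# Assumed thresholds.
MIN_COUNT = 3          # "recurring" incidents
MAX_MEAN_MINUTES = 60  # "long running" incidents


def problem_candidates(records):
    """Return {category: (count, mean minutes)} for categories worth a problem record."""
    by_category = defaultdict(list)
    for category, minutes in records:
        by_category[category].append(minutes)

    candidates = {}
    for category, times in by_category.items():
        if len(times) >= MIN_COUNT or mean(times) > MAX_MEAN_MINUTES:
            candidates[category] = (len(times), mean(times))
    return candidates


for category, (count, avg) in sorted(problem_candidates(incidents).items()):
    print(f"{category}: {count} incidents, mean fix time {avg:.0f} min")
```

With this sample data the Citrix category is flagged because it recurs, and the disk failures are flagged because they take a long time to fix, which is the kind of group you might take to a supplier meeting.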

Finally, once the root cause is found, through either an incident or a problem, one of two things should happen:

– Mitigate the issue.
– Add the issue to the known error database with a workaround / fix.

Law five, all root causes should be mitigated or the fix time shortened by writing up a known error entry with a fix or work around.

I believe that by following these laws, engineers have scope to troubleshoot issues as they come in whilst the downtime to the business operation is minimised.

What does everyone think?

Thank you for reading my post. This is my opportunity to blog about a subject I love but am still learning. These posts are my way of showing how I understand the subject. However, I would encourage you to leave comments: did you agree / disagree with the post? Did I explain something badly or incorrectly? Do you want me to blog about another subject within ITIL? All feedback helps me to understand more. Thank you.

Requests, Incidents, Problems and Known Errors in a nutshell

Over the past few weeks I have noticed some discussion on the ITIL blogs about what incidents, problems and requests are and what the differences between them are. So here is my take:

Requests

These are requests made by the customer, eg please can you install x software or please can you replace the toner on the sales printer. These types of ‘can I haves’ should be logged as requests. They are separate from incidents, as they will have different SLAs and priorities associated with them. Installing a piece of software for one member of the sales team has a different priority from someone in the sales team not being able to access the network shares.

Incidents

These are for when something breaks or isn’t working, eg my PC won’t turn on, I can’t access any network shares or none of the print outs are coming out of the printer. These are different from requests as it normally means the customer or team cannot work, or a service is degraded so they can’t work as well. The person who picks up the incident will assign a priority, eg a whole office who can’t access the network might be a priority 1 incident and a customer who can’t print might be a priority 3 incident. These priorities should be documented with an SLA associated with each, so the business will know roughly how long an incident of this type will take to fix. Again, it is up to you and the business to work out these priorities and SLAs; ITIL is just a guide. The incident can be closed when it is fixed permanently or a work around has been put in place which restores the service back to normal.

Ahh, and this is where some will wheel out the old chestnut: is a password reset an incident or a request?

Answer

1) Why is this not automated? Plenty of tools allow customers to reset their password themselves without needing to log an incident or request.

2) It is up to you and how you want to define it. All you are trying to do is separate incidents from requests (which are sometimes not as high a priority as an incident), be able to produce stats on the two to show trends that help with incident and request management, and report to the business to show how great IT are.

Problems

What happens if all the person who picks up the incident can do is produce a work around, or they don’t know why the fix worked (eg reboot the PC and the problem goes away), or multiple customers are logging the same type of incident? In each case the issue still exists, but there is a sticky plaster holding everything together. Now problems come into play. A problem is something that a virtual problem team or an individual can use to look into the issue more deeply, hopefully finding the root cause and a permanent fix. A problem is also something that can be taken ‘off line’. The service has been restored and the incident closed, so the danger has passed, but the problem can be used to investigate over a longer period to find the real issue.

Known errors

Through your diligent problem management and investigation, the root cause is found. However, like most things in life, it is not an easy fix. The fix requires a new server or cabling, or the manufacturer of the component has acknowledged there is an issue but there is no driver update yet, so all you can do is stick with the work around. ITIL has rather cleverly thought of this scenario, and this is where known errors can be used.

e.g.

An incident was logged and a work around took two days to come up with, but the manufacturer needs to update a driver before a permanent fix can be implemented. If someone logs a similar issue, the wheel doesn’t need to be reinvented: a known error should have been created once the first incident’s work around was found, so it can be used to implement a fix / work around quickly for the second incident.

A known error and the known error database greatly reduce the fix times for subsequent, similar incidents which are awaiting permanent fixes, or where there are other reasons why a permanent fix can’t be implemented and a work around is as good as it is going to get.

Hopefully, it is now a little clearer what requests, incidents, problems and known errors are and what the differences between them are.
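To show the idea, here is a toy known error database held in memory. It is a sketch under my own assumptions: real service desk tools have their own search, the `KnownError` fields are hypothetical, and the keyword matching is deliberately naive. The Citrix work around is the one from the Five laws post above; the printer entry is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnownError:
    symptoms: set                        # keywords describing the symptom
    workaround: str
    permanent_fix: Optional[str] = None  # None while waiting, eg on a driver update


# Toy in-memory database for illustration only.
KEDB = [
    KnownError({"citrix", "session", "hung"},
               "Take the server out of the load balancer and reset the customer's sessions."),
    KnownError({"printer", "blank", "pages"},
               "Power-cycle the printer."),
]


def find_known_error(description: str, kedb: List[KnownError]) -> Optional[KnownError]:
    """Return the entry sharing the most keywords with the description, or None."""
    words = set(description.lower().split())
    best = max(kedb, key=lambda entry: len(entry.symptoms & words), default=None)
    if best is None or not (best.symptoms & words):
        return None
    return best


match = find_known_error("My Citrix session has hung again", KEDB)
if match is not None:
    print(match.workaround)
```

The second engineer to see the hung Citrix session gets the work around in seconds rather than spending two days rediscovering it.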


Interior design, the ITIL way.

Decorating with ITIL

What is ITIL? ITIL is a collection of 5 topics covering Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement, which should be used to form, implement, maintain and improve your ITIL strategy to improve your business to IT alignment….

That was boring. No, I believe ITIL is bigger than that and, at its heart, simpler than an all or nothing approach where ITIL must be implemented exactly how the manual says. Let me explain using an analogy.

Imagine IT as a house. It is a shell of a house; how are you going to decorate it? You are probably going to decorate it in the way that works best for you and the people who use your house. How will you know how to decorate your house? You need some ideas… look no further than the ITIL Interior Design book. In it, you will find loads of ideas on how to decorate your new house. The book covers all shapes of houses and is designed to give you ideas for your home. It gives ideas on how to design what you want to do, implement it, keep up the day-to-day maintenance and make improvements to your house. However, a word of warning: it’s not a step by step book. The book is there to give you ideas to research, so you can find out how best to use them in your house.

Using the book you can tailor design items to fit your needs, eg if a twenty foot incident management dining room table doesn’t fit into your house, then buy a six foot incident management dining room table, which works much better in your house but follows the design principles of the twenty foot dining table. How about a change management media centre: do you need top of the range, or a mid range one that suits your budget but gets similar results? In these two examples, incident management and change management, the essence of what they actually do stays the same but you need to mould it to fit your business.

The metric you want is not the concrete composite used to make the driveway; you want to know how much the amenities cost per year. In much the same way, you need to tailor the metrics used to report on ITIL to what is most useful to the business. Does reporting just how many changes are made each week mean as much as reporting how many changes were approved AND how many failed or were rolled back, possibly with the report also showing how many changes were service / customer impacting? This helps to show the business how successful, and possibly how competent, IT is at implementing change.
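A minimal sketch of that richer change report might look like this. The `Change` fields and the sample week are assumptions for illustration; in practice the data would come from your change management tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    # Hypothetical fields for one change record.
    approved: bool
    failed_or_rolled_back: bool = False
    customer_impacting: bool = False


def weekly_change_report(changes: List[Change]) -> dict:
    """Summarise a week of changes in terms the business cares about."""
    approved = [c for c in changes if c.approved]
    failed = [c for c in approved if c.failed_or_rolled_back]
    impacting = [c for c in approved if c.customer_impacting]
    success_rate = (len(approved) - len(failed)) / len(approved) if approved else 0.0
    return {
        "raised": len(changes),
        "approved": len(approved),
        "failed_or_rolled_back": len(failed),
        "customer_impacting": len(impacting),
        "success_rate": round(success_rate, 2),
    }


week = [
    Change(approved=True),
    Change(approved=True, failed_or_rolled_back=True, customer_impacting=True),
    Change(approved=True),
    Change(approved=False),
]
print(weekly_change_report(week))
```

"Four changes this week" says very little; "three approved, one rolled back and customer impacting" starts a much more useful conversation with the business.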

All these services can then be upgraded through continuous improvement when the budget allows or when it makes good business sense. In most houses, do the wallpaper, carpets and doors stay the same throughout the whole life of the house? No, these get upgraded and changed. Using the energy metric you can also see if you can save more money by changing suppliers or improving the heat insulation. All this is continuous service improvement, providing you with more value from your home.

For me, this is what ITIL is. It is about returning the best value to the business, and to do this you have to fit ITIL to what works best with the business, which may mean leaving some of ITIL out to start with and implementing it when the time is right. What ITIL, I believe, is trying to do is get a department which has traditionally been a law unto itself thinking more about the business. So many times I have heard IT complain, ‘Without IT there would be no business’. Well, without the business there would be no IT. After all, if the business didn’t make any money, IT wouldn’t have a budget. So using ITIL, I believe IT can repay the investment and provide the business with the best business aligned IT infrastructure it can, helping the business do even better and hopefully make more money.
