Five laws of incidents and problems

Incident and problem management exist to restore a service, fix an issue, work out why the issue or outage happened in the first place and then try to make sure it doesn’t happen again. All teams should work together to keep downtime to the business to a minimum on every incident, provided the right priorities are followed. We have all seen the analogies for incidents and problems.

e.g. http://www.reddit.com/r/ITIL/comments/2d1zga/how_do_you_explain_the_difference_between/

https://itilbegood.com/2014/07/28/requests-incidents-problems-and-known-errors-in-a-nutshell/

However, where it gets a bit confusing is this: where does investigating an incident’s root cause and restoring the service cross over into problem root cause territory? Why should an engineer investigating an outage have to raise a problem if they already have the incident from the customer? Surely this all seems like a lot of paperwork for a few clicks.

Therefore, after a few years doing support, I wanted to put a stake in the ground. Everyone can shout me down, but at the end of the discussion / bloodbath we might have a solution. Of course it does depend on the organisation, but there seems to be some confusion about incidents and problems.

At the heart of the matter is this truth:

Between incidents and problems, you should be able to restore the service quickly and find the root cause, with the cause of the incidents either mitigated or a workaround published so future incidents can be fixed more quickly. The whole purpose is to provide fixes so that the business operation is minimally impacted. If there is an impact, the situation should be recovered and steps taken to mitigate or minimise the impact the next time it occurs.

OK, so let’s look at two incidents. In one, a customer can’t access their file shares. In the other, a customer calls in and says their Citrix session has hung... and two minutes later another person calls to say their Citrix session has also hung.

For the first one, the support engineer picks up the call and, after some troubleshooting, realises the customer’s password has expired. After a reset and a reboot, the customer is up and running. The way to mitigate it is to tell the customer to reset the password before it expires. So this process has gone through restoring the service, finding the root cause and mitigating the issue.

Next, the engineer checks the Citrix sessions and finds that both customers are on the same server. The engineer cannot remote onto the server, so it looks like the server has crashed. There is a known error entry which tells the engineer to take the server out of the load balancer and reset the customers’ sessions; the customers will reconnect to another server, so service is restored. The engineer then reboots the server, and after the reboot the server looks fine. However, would you put the server back into the live environment?

These two incidents illustrate the issue. The engineer on the first call was competent to go through all the steps and complete the incident. However, is the engineer competent to go through all the steps of troubleshooting the server? Maybe not; maybe a Citrix team needs to be involved in checking out the server before it is put back into the production environment. This is where a problem should be raised. The incident can be closed or linked to the problem, but a problem should be raised because someone needs to work out why the server crashed while the production environment continues to function.

Law one: raising a problem comes down to the competency of the support team. Can they restore the service, find the root cause and mitigate it within an incident, or can they only restore the service and then raise a problem for a specialist team to find the root cause and mitigate the issue?

Next, time needs to be monitored on incidents. Engineers love to troubleshoot and fix issues, trying fix after fix to get to the bottom of things, and this may take an hour. But is this good for the business? If the engineer could put in a workaround in the first 5 minutes and leave the customer to get on with their day, then raise a problem to investigate the issue further without needing to bother the customer, surely this is a better way of working from the business point of view?

Law two: where a workaround exists for an incident, it should be implemented and a problem raised to find the root cause at a later date. The priority is to restore the service to the business.

When to raise a problem should be a matter of governance. ITIL explains this in ITIL Service Operation, page 99 (service operation process – Incidents versus problems):

The rules for invoking problem management during an incident can vary and are at the discretion of individual organisations.

Therefore, when to raise a problem is up to the organisation. In the example of the Citrix server, I would suggest a problem should be raised when many customers are impacted, when a key service or server is impacted, or to group incidents together to raise with 3rd party suppliers in supplier meetings. For example, the support teams might notice a few hard drives failing in their first few months; these incidents could be grouped together to raise with the 3rd party supplier.

Law three: governance should write up rules on when a problem should be raised, and these rules should be clearly communicated to the IT organisation.

e.g. A problem should be raised for all Citrix server crashes and assigned to the Citrix team.
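
As a rough illustration, governance rules like these could be written down as a simple check. This is only a sketch under assumptions: the field names, the "Problem management" queue and the threshold for "many customers" are all hypothetical, and a real organisation would set its own.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: the post only says "many customers".
MANY_CUSTOMERS = 10


@dataclass
class Incident:
    summary: str
    customers_affected: int
    key_service: bool = False          # is a key service or server impacted?
    citrix_server_crash: bool = False  # did a Citrix server crash?


def problem_assignment(incident: Incident) -> Optional[str]:
    """Return the team a problem should be assigned to, or None if no problem is needed."""
    # Example rule from the post: every Citrix server crash goes to the Citrix team.
    if incident.citrix_server_crash:
        return "Citrix team"
    # Suggested rules: wide impact or a key service warrants a problem.
    if incident.customers_affected >= MANY_CUSTOMERS or incident.key_service:
        return "Problem management"  # hypothetical queue name
    return None


password_reset = Incident("Expired password", customers_affected=1)
hung_sessions = Incident("Citrix sessions hung", customers_affected=2,
                         citrix_server_crash=True)

print(problem_assignment(password_reset))
print(problem_assignment(hung_sessions))
```

The point is not the code itself but that the rules are explicit enough for any engineer to apply the same way.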

Incidents should be monitored for trends, to check whether a problem could be raised to mitigate recurring incidents. Monitoring incidents can also show whether a workaround could be put in place for a long-running incident and a problem raised to find the root cause.

Law four: all incidents should be monitored for trends and time to fix, to see if a problem can be raised to mitigate the underlying issue.
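
A minimal sketch of what this monitoring could look like, assuming incidents are exported as (category, minutes to fix) pairs. The sample data and both thresholds are made up for illustration; governance would decide the real values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical thresholds; an organisation's governance would set the real ones.
MIN_REPEATS = 3           # this many incidents in one category counts as a trend
MAX_AVG_FIX_MINUTES = 60  # an average fix time above this suggests a workaround is needed

# Made-up sample data: (category, minutes to fix)
incidents = [
    ("hard drive failure", 45),
    ("hard drive failure", 50),
    ("hard drive failure", 40),
    ("password expired", 5),
    ("printer offline", 90),
]


def problem_candidates(incidents):
    """Return categories that recur or take too long to fix, with the reason."""
    fix_times = defaultdict(list)
    for category, minutes in incidents:
        fix_times[category].append(minutes)

    candidates = {}
    for category, times in fix_times.items():
        if len(times) >= MIN_REPEATS:
            candidates[category] = "recurring"
        elif mean(times) > MAX_AVG_FIX_MINUTES:
            candidates[category] = "slow to fix"
    return candidates


print(problem_candidates(incidents))
```

Here the hard drive failures would be flagged as a trend to raise with the supplier, and the printer incident as one where a workaround and a problem record might save time.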

Finally, once the root cause is found, through either an incident or a problem, one of two things should happen:

– Mitigate the issue.
– Add the issue to the known error database with a workaround / fix.

Law five: all root causes should be mitigated, or the fix time shortened by writing up a known error entry with a fix or workaround.

I believe that by following these laws, engineers have scope to troubleshoot issues as they come in while business downtime is minimised.

What does everyone think?

Thank you for reading my post. This is my opportunity to blog about a subject I love but am still learning. These posts are my way of showing how I understand the subject; however, I would encourage you to leave comments. Did you agree or disagree with the post? Did I explain something poorly or incorrectly? Do you want me to blog about another subject within ITIL? All feedback helps me to understand more. Thank you.

Is your service desk really a help desk in a fish costume?


Let me start with some honesty. I hate calling call centres. Every time I call I seem to lose a little bit of the will to live, especially when they call themselves all manner of different names. My recent favourite was ‘a customer experience agent’, when all I get is passed around, with everyone telling me someone else should be dealing with my call, or that they will check and give me a call back. If I had a pound for every time I was told that. A customer experience centre is still a bad call centre if I still get passed around and nobody really knows what to say or do.

I have just moved house and needed to register with British Gas (a UK gas and electricity utility company). I needed to do four things on the call:

– Give them all the details: name, address, occupation etc.
– Set up a direct debit/giro so every month the right amount for the bill would be taken out of my account automatically.
– Register to add points to my Nectar card; it’s a UK points card where I can get money off my supermarket shop.
– Register for Hive, an awesome new system where I can control my heating from my phone/tablet/computer, and Hive knows when I am coming home and leaving and switches my heating on or off accordingly.

Hmmm, so not too much that could go wrong. Immediately the person who picked up the call heard what I wanted, paused, stuttered and said she needed to put me through to someone else. Already I am losing the will and getting a little frustrated that I want to give them money but they are making it so hard. Maybe a little harsh, but this is normally how it starts, and it only gets worse.

Then another lady picked up the phone and she nailed it. Name and address... done. Direct debit... done. Nectar card points... done. Hive... errr, she had never done one of these before, so I held while she asked someone how to do it (this is not a problem; it is a new system so I thought I was probably a first). Then bang, I have an engineer coming out the following week.

Still with me, and wondering what this has to do with fish and ITIL? Well, recently I saw on a forum the title ‘How to change a help desk into a Service Desk’. The person had been tasked with changing a help desk into a service desk because that is what ITIL says and you can’t be ITIL compliant without it. I thought, that is like calling a call centre agent a customer experience agent. A name doesn’t change a thing; it is what you do that changes the customer’s perception of the team.

People should ask ‘How do I change my IT organisation into an ITIL IT organisation?’ rather than changing team names and thinking they are done.

I have worked in many support environments, all called many different names and some supposedly within an ITIL framework. However, I would say all were a help desk once you took away the nice names. A help desk, to me, was and is a team which takes all the calls about anything, tries to fix anything and, if they cannot, has to beg and plead with other support teams to help, as there are no internal support agreements for getting assistance. If the Help Desk cannot get help, they hold onto the ticket and try over a few days to resolve it. The customer thinks the help desk is a little hit and miss; one customer even once said, ‘Why don’t you just call yourself “desk” instead of “help desk”?’

How does this differ from a Service Desk? This really is a trick question, because if the organisation hasn’t changed to ITIL then the Service Desk is still just a Help Desk with a new name.

Please read my earlier blog posts to get an idea of what ITIL is about:

https://itilbegood.com/2014/04/07/what-is-itil/

https://itilbegood.com/2014/07/19/service-management-as-a-rugby-game/

If the IT organisation wants to be an ITIL organisation, they should:

Have a catalogue showing the business what services are supported, with the service desk knowing how these services are supported.

The services should be backed up with a configuration database showing how these services are configured. The aim here is to give support engineers access to the latest configuration of the service, with all its components, to make troubleshooting easier so the service is resumed quickly. This is not an exercise in creating a database and ticking a box; it has to be usable and up to date. How the information is presented should be decided after speaking to the stakeholders who will use it. The database is a read and write database, not a write database that nobody reads.

OLAs should be written to show the IT organisation’s internal resolution times for services, what is included, who can be involved in these fix times and how changes to these services should be implemented. If the OLA comes as a 30 page document, ask yourself: if you were the engineer who just got a call saying nobody can access their e-mails, would you know:

Where the OLA is held?
How the support teams should escalate the incident?
Who should be on a bridge call to help fix it?
How to use the configuration database to troubleshoot?
What support agreements with 3rd parties are in place?
Who should communicate the issue to the business, and what they should say?
If an emergency change or a normal change needs to be implemented, how it should be done and by whom?

All this while users are screaming at you to fix it. If you feel you couldn’t, fix the documentation to make it easier. One suggestion would be to create a SharePoint dashboard from the 30 page OLA document, which support engineers can look to for easy reference.

After the OLAs are created, SLAs can be written so the business knows how long an incident, request or outage should take to complete.

So, now the IT organisation knows:

What services are supported
How the services are configured
How the services are supported

Next, how does the business tell you they want something or that something is broken? This is done through requests and incidents. The Service Desk should categorise the requests / incidents and add a priority to them. The priority comes from the OLA. When the service is restored to normal, the incident is closed.

https://itilbegood.com/category/in-a-nutshell/

Overview of requests, incidents, problems : https://itilbegood.com/2014/07/28/requests-incidents-problems-and-known-errors-in-a-nutshell/

If an incident can’t be resolved definitively, so only a workaround can be used, or an incident has been closed but the root cause could not be found, open a problem. This can be worked on by an individual or a team of people to find the root cause and the fix for the incident.

As ITIL is all about constant improvement, there should be some sort of incident and problem management in place to analyse the incidents and problems and see if these can be reduced or handled better through additional training or better procedures.

If you want to make a change to a service, a group of people (defined in the OLA) should assess the change in a regular meeting, covering the proposed work, impact, back-out plan, timings (is this within a change window defined in the OLA?) and whether the business needs to be aware, either by the business being in the same change meeting or through a business communication, or both. This should minimise the impact on the business of any changes to services.

Finally, make sure there are some reports showing the business how IT is doing and the value provided. For example, a report showing the number of changes (successful / unsuccessful), incidents (closure rate / time, categories), problems (types, closure rate, resolutions) and SLAs (within SLA and, if not, what steps have been taken to rectify this).
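
As a small worked example, two of these figures (change success rate and incidents closed within SLA) could be calculated as below. The sample numbers and the 240 minute SLA target are invented purely for illustration.

```python
# Made-up sample data for one reporting period.
changes = [True, True, False, True]        # True = the change was successful
incident_minutes = [30, 240, 55, 600, 20]  # time taken to close each incident
SLA_MINUTES = 240                          # hypothetical SLA target


def percentage(part, whole):
    """Percentage rounded to one decimal place; 0.0 when there is no data."""
    return round(100 * part / whole, 1) if whole else 0.0


def service_report(changes, incident_minutes, sla_minutes):
    """Summarise change success and SLA compliance for the business."""
    within_sla = sum(1 for m in incident_minutes if m <= sla_minutes)
    return {
        "changes_total": len(changes),
        "changes_successful_pct": percentage(sum(changes), len(changes)),
        "incidents_total": len(incident_minutes),
        "incidents_within_sla_pct": percentage(within_sla, len(incident_minutes)),
    }


print(service_report(changes, incident_minutes, SLA_MINUTES))
```

With this sample data, 3 of 4 changes succeeded (75.0%) and 4 of 5 incidents were closed within the SLA (80.0%).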

Now the IT organisation knows:

How to log incidents and requests.
How to investigate incidents in more depth.
How to improve / spot trends with the incidents and problems process.
How to make changes to services in a controlled way.

Finally, the IT organisation should put in place some method of improvement. Can areas of IT be improved to provide better service or value to the business?

The IT organisation needs to provide these support structures to help the Service Desk; without them, the Service Desk is still a Help Desk.

Remember: [image: Help Desk]


Setting up my blog

Thank you

Thank you to everyone who has commented on my blog and posted all the great feedback. I really appreciate it, as I am just starting up this blog and it is still a bit scary to post my ideas and thoughts for everyone to see. So I am glad people are liking what they are reading. A few comments have asked what theme I have used, whether you need HTML skills and so on, so I thought I would answer them all in a post.

I had no prior knowledge of starting up blogs, nor any HTML knowledge. I just went on WordPress and started this blog.

The theme I use is Twenty Fourteen, and I have done nothing special to make the pages load quicker. A few people have commented that my blog loads really quickly; all I can say is thank you, WordPress.

I also did not pay for anything initially. I have paid for my .com address, but I started with a standard blog.wordpress.com address, and this is all I have paid for. I decided to have WordPress host my site and to register the domain with them, as they charge to redirect blogs to web site domains, so it would not have worked out any cheaper doing it any other way.

I have also set up Google Webmaster Tools, so hopefully as more people find me I will move up the rankings and maybe one day make it to the first page when you search ‘ITIL blogs’.

I find it a great way to publish my ideas and get feedback, whether that is people liking my ideas or, with the subject of ITIL, experts asking questions or challenging me, which can only improve my knowledge and understanding. It is scary posting your ideas; it’s sort of like running through the streets naked with everyone watching. But I looked around the ITIL blogs and thought ‘well, no one else is doing anything too special’ (there are exceptions: itskeptic, for one, is a great blog, and everyone in ITSM should read it), so if I post my ideas and understanding, I would at least be helping create a conversation to build knowledge and understanding for everyone. I have always been in support, so I always want to help people understand IT. IT shouldn’t be as hard as it sometimes seems. One of the key quotes I have tried to live by with this blog is:

If you can’t explain it simply, you don’t know it well enough – Albert Einstein.

I have then set up Twitter, Reddit, Flipboard, StumbleUpon and Delicious pages on which I publicise these blog postings; each has the logo to keep everything the same. The blog automatically posts each new post to my Twitter and LinkedIn accounts. I also add my Twitter account at the bottom of all postings, so hopefully people start following me on Twitter... hint 🙂

Hope this helps any new bloggers out. Ahhh, finally, the WordPress happiness engineers (not sure about the title) are great and will help you with any questions you have. They helped me put in the HTML coding for my social media icons. The icons used can be found at http://martz90.deviantart.com/art/Circle-Icons-Pack-371172325. Thanks Martz.


Answers on a postcard

I saw this on the @itilorg Twitter feed and thought it was pretty funny, and it could really make the foundation exam much more interesting.

Question 1

[image: picture from the @itilorg Twitter feed]

From the picture, what is this?

A) Incident

B) Problem

C) Change

D) Request

E) Really, really bad day


Do we still need ITIL?

This is a great blog post: http://optimalservicemanagement.com/blog/do-we-still-need-itil/. It reminds me of WHY you do ITIL: not because the ITIL book says so on page 32, but to provide value to the business.

If you just follow ITIL blindly then you will create a mess. Engage your brain and see how bits can work for you. Maybe some won’t work for you, or could in future, and that is OK. Just look to ‘adopt and adapt’ to make your IT organisation the best value for money it can be to the business.


This blog is all about making IT more user friendly by looking at ITIL, Service Management and everything else to make IT better. Please leave comments and tell me what you think. This is also an opportunity for me to write down my ideas and get feedback from everyone to help me understand the subject better.