We understand that any service interruption can cause inconvenience and stress to our customers. During such an incident, we aim to provide regular, timely and useful updates whilst our engineers address the situation.
Our internal processes cover role assignments, communications, logging of example issues and incident rooms. After an incident, we hold a debrief.
Here we explain the actions we may take, and why we might take them. These are subject to change as we continually look to improve our approach.
The activities below are visible to customers. We will do some or all of the following:
- Create a Status Page Incident - this is the default place to seek information about service availability. We recognise that some customers just want to know when a fix will be applied, while others like to have more detailed information. We try to set reasonable expectations and provide an appropriate level of information.
- Regularly update that page - usually every 20-30 minutes at first, then at longer intervals depending on circumstances
- Set up a form where customers can log their call issues to help us track them
- Occasionally ask for fresh examples of call issues following some remedial action or changed situation
- Provide more ad-hoc updates to customers who are part of our Community Slack workspace
- Prioritise support for those we can help most immediately
- Occasionally turn off the support phone lines to allow staff to focus on tickets, as this can be a far more effective way to address customer issues
- Provide a message on the support phone system acknowledging there is a problem and indicating that the Status Page is the best source of information
- Not usually address service complaints during incidents - we wish to focus on active support as a priority
- Provide expected time(s) to fix where practical
- Provide a post-mortem for major incidents
Some of the problems we face include:
- Not all reported problems are related to the incident - this can be difficult to determine if connected parts of the service have failed
- Some problems are related to the incident but are not the root cause - in other words, they are symptoms caused by the incident
- Providing information that might help some, but not all, affected customers - for example, we might suggest resetting a handset, but that would help only some customers; for others it would not help and might be further discouraging
We will continue to examine what we do during incidents to minimise downtime and maximise communication with our customers.