Fault-Tolerant Telecommunication System Patterns (Summary)

By Michael Adams, James Coplien, Robert Gamoke, Robert Hanmer, Fred Keeve, Keith Nicodemus
AT&T Bell Laboratories Copyright 1995 AT&T. All rights reserved.
Summary by Tom Evans, Abstractions Inc.

Minimize Human Intervention

History has shown that people cause the majority of problems in continuously running systems (wrong actions, wrong systems, wrong button).

Let the machine try to do everything itself, deferring to the human only as an act of desperation and last resort.

People Know Best

How do you balance automation with human authority and responsibility?

Assume that people know best, particularly the maintenance folks. Design the system to allow knowledgeable users to override the automatic controls.

Five Minutes of No Escalation Messages

Console messages roll in so fast that the human-machine interface is saturated with error reports, which may scroll off the screen or consume resources just to display them.

When taking the first action in a scenario that could lead to an excessive number of messages:
Display a [ONE] message.
Periodically display an update message.
If the abnormal condition ends, display a [normal] message.
Do not display a message for every change in state.
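
The message rules above can be sketched as a small reporter. This is only an illustration, not the authors' implementation: the class name, the message texts, and the 300-second update interval (taken from the pattern's title) are assumptions.

```python
class EscalationReporter:
    """Emit one message at onset, periodic updates, and one message when normal returns."""

    def __init__(self, update_interval=300.0, emit=print):
        self.update_interval = update_interval  # assumed: five minutes between updates
        self.emit = emit                        # where messages go (the console by default)
        self.active = False                     # True while the abnormal condition lasts
        self.changes = 0                        # state changes seen during this episode
        self.last_emit = 0.0                    # time of the most recent message

    def state_changed(self, now, abnormal):
        """Record one state change at time `now` (seconds)."""
        if abnormal and not self.active:
            # First action down the scenario: display one message.
            self.active = True
            self.changes = 1
            self.last_emit = now
            self.emit("ALARM: abnormal condition started")
        elif abnormal:
            # Count the change, but only display a periodic update.
            self.changes += 1
            if now - self.last_emit >= self.update_interval:
                self.emit(f"UPDATE: still abnormal, {self.changes} state changes so far")
                self.last_emit = now
        elif self.active:
            # The abnormal condition ended: display the normal message.
            self.active = False
            self.emit(f"NORMAL: condition ended after {self.changes} state changes")
```

Sixty state changes spread over ten minutes would produce three messages here (onset, one update, normal) instead of sixty.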

Riding Over Transients

How do you know whether a problem will work itself out or not?

Don't react immediately to detected conditions.
Make sure the condition really exists by checking several times, or use Leaky Bucket Counters to detect a critical number of occurrences in a specific time interval.
...just by waiting a while, give transients a chance to pass.
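
A minimal sketch of the "check several times" option follows. The function name, the three checks, and the half-second delay are assumptions chosen for illustration; `probe` stands for whatever test detects the condition.

```python
import time


def condition_persists(probe, checks=3, delay=0.5, sleep=time.sleep):
    """Return True only if `probe()` reports the condition on every one of `checks` samples."""
    for i in range(checks):
        if not probe():
            return False      # the condition cleared: treat it as a transient
        if i < checks - 1:
            sleep(delay)      # wait a while before looking again
    return True
```

A caller would start recovery only when `condition_persists(...)` returns True, so a fault that shows up on a single sample is ridden over.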

Leaky Bucket Counters

How do you deal with transient faults?

A failure type has a counter that is set to an initial value
The counter is decremented for each fault or event and incremented on a periodic basis.
React if the counter reaches the threshold
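
The counter described above might look like the sketch below. The names, the default values, and the cap that stops the periodic increment from exceeding the initial value are assumptions; the summary does not say how the threshold compares to the initial value, so the sketch assumes it lies below it.

```python
class LeakyBucketCounter:
    """Counter for one failure type: faults drain it, the passage of time refills it."""

    def __init__(self, initial=5, threshold=0):
        self.initial = initial      # starting value (assumed default)
        self.threshold = threshold  # react when the count falls this low (assumed default)
        self.count = initial

    def fault(self):
        """Record one fault; return True if the threshold has been reached."""
        self.count -= 1
        return self.count <= self.threshold

    def tick(self):
        """Periodic increment, capped at the initial value (the cap is an assumption)."""
        if self.count < self.initial:
            self.count += 1
```

With `initial=3`, three faults in quick succession trigger a reaction, while the same three faults separated by calls to `tick()` do not.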

SICO* First and Always

How do you make a system highly available and resilient in the face of hardware and software faults and transient errors?

* System Integrity Control Program (SICO) coordinates system integrity

Give system integrity the ability and power to re-initialize the system whenever system sanity is threatened by error conditions. The same system integrity should oversee both the initialization process and the normal application functionality so that initialization can be restarted if it runs into errors.
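
One way to picture a single overseer for both initialization and normal operation is the loop below. It is a sketch under stated assumptions, not a description of SICO itself: `SanityError`, `supervise`, and the `max_restarts` limit are hypothetical names and choices.

```python
class SanityError(Exception):
    """Hypothetical signal that system sanity is threatened."""


def supervise(initialize, run_application, max_restarts=3):
    """Run initialization, then the application; on SanityError start over from initialization."""
    restarts = 0
    while True:
        try:
            initialize()              # may itself fail and be restarted
            return run_application()  # normal application functionality
        except SanityError:
            restarts += 1
            if restarts > max_restarts:  # assumed limit so the sketch terminates
                raise
```

Because the same loop wraps both calls, an error raised during initialization is handled exactly like one raised during normal operation.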

Try All Hardware Combos

The Central Controller (CC) has several configurations. There are many possible paths through CC subsystems depending on the configuration. How do you select a workable configuration in light of a faulty subsystem?

Using the configuration tables in ROM or in the 'Boot' file, try the next available hardware configuration if the previous re-boot failed.
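
The table-driven retry can be sketched as a loop over candidate configurations. The function name and the shape of a table entry are assumptions; `try_boot` stands for an actual re-boot attempt with the given configuration.

```python
def boot_with_fallback(configurations, try_boot):
    """Try each configuration in table order; return the first that boots, else None."""
    for config in configurations:
        if try_boot(config):
            return config
    return None


# Hypothetical table: each entry names the subsystems that form one path through the CC.
CONFIG_TABLE = [
    ("cpu0", "mem0"),
    ("cpu0", "mem1"),
    ("cpu1", "mem0"),
    ("cpu1", "mem1"),
]
```

If `cpu0` is faulty, the first two entries fail to boot and the loop settles on `("cpu1", "mem0")`.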

Fool Me Once

Sometimes the fault is very intermittent (usually triggered by software, such as diagnostics). After a recovery completes, users expect the configuration state display to disappear and the system to be sane.

But if the system in fact trips on another fault, it may reboot itself and re-initiate the initialization sequence using the same configuration, which raises the probability that the system will loop in reboots and never attempt different configurations.

The first time the application tells PC that "all is well", believe it (30-second wait) and reset the configuration counter.

The second and subsequent times within a longer time window (30 minutes), ignore the request.
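
The believe-once rule can be sketched as follows. The class and method names are hypothetical, the 30-second wait is left out because the summary does not say how it is applied, and honoring a request again once the 30-minute window has expired is an assumption.

```python
class ConfigurationCounter:
    """Tracks which hardware configuration to try next across reboots."""

    def __init__(self, window=1800.0):
        self.window = window      # seconds; 30 minutes, as in the text
        self.index = 0            # position in the configuration table
        self.first_ok_at = None   # time of the last "all is well" that was believed

    def reboot_failed(self):
        """Advance to the next configuration after a failed reboot."""
        self.index += 1

    def all_is_well(self, now):
        """Handle an "all is well" report at time `now`; return True if it was believed."""
        if self.first_ok_at is not None and now - self.first_ok_at < self.window:
            return False          # second or later report inside the window: ignore it
        self.first_ok_at = now
        self.index = 0            # believed: reset the configuration counter
        return True
```

An intermittent fault that makes the application report "all is well" after every reboot can therefore reset the counter only once per window, so later reboots keep advancing through the table.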