Fault-Tolerant Telecommunication System Patterns (Summary)
By Michael Adams, James Coplien, Robert Gamoke, Robert Hanmer, Fred Keeve,
Keith Nicodemus
AT&T Bell Laboratories
Copyright ©1995 AT&T. All rights reserved.
History has shown that people cause the majority of problems in continuously running systems (wrong actions, wrong systems, wrong button).
Let the machine try to do everything itself, deferring to the human only as an act of desperation and last resort.
How do you balance automation with human authority and responsibility?
Assume that people know best, particularly the maintenance folks. Design the system to allow knowledgeable users to override the automatic controls.
Rolling in console messages: the human-machine interface is saturated with error reports, which may roll off the screen or consume resources merely through the intensity of the display activity.
When taking the first action down the scenario that could lead to an excessive number of messages:
How do you know whether a problem will work itself out or not?
Don't react immediately to detected conditions.
How do you deal with transient faults?
A failure type has a counter that is set to an initial value
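The summary gives only the starting point of the counter, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes each fault decrements the counter, a periodic tick restores one unit up to the initial value, and reaching zero means the fault is no longer treated as transient. The class name, method names, and values are hypothetical.

```python
class LeakyBucketCounter:
    """Per-failure-type counter (illustrative; update rules are assumptions)."""

    def __init__(self, initial_value: int) -> None:
        if initial_value <= 0:
            raise ValueError("initial_value must be positive")
        self.initial_value = initial_value
        self.value = initial_value  # the counter starts at its initial value

    def record_fault(self) -> bool:
        """Count one fault; return True once the counter is exhausted."""
        if self.value > 0:
            self.value -= 1
        # Assumption: zero means faults arrive faster than they "leak" away,
        # so the condition should no longer be ridden over as a transient.
        return self.value == 0

    def tick(self) -> None:
        """Periodic call: restore one unit, never above the initial value."""
        if self.value < self.initial_value:
            self.value += 1
```

With an initial value of 3, three faults in quick succession exhaust the counter, while the same three faults spread across several ticks do not.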
Making a system highly available and resilient in the face of hardware and software faults and transient errors.
* System Integrity Control Program (SICO) coordinates system integrity.
Give system integrity the ability and power to re-initialize the system whenever system sanity is threatened by error conditions. The same system integrity should oversee both the initialization process and the normal application functionality so that initialization can be restarted if it runs into errors.
The Central Controller (CC) has several configurations. There are many possible paths through CC subsystems depending on the configuration. How do you select a workable configuration in light of a faulty subsystem?
Based on ROM or 'Boot' file configuration tables, try the next available hardware configuration if the previous 're-boot' failed.
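Stepping through a configuration table can be sketched as below. This is an illustration only: the table contents, the class and method names, and the choice to wrap around to the first entry after the last one are all assumptions, and a real system would keep the counter somewhere that survives a reboot.

```python
from typing import Sequence


class ConfigurationSelector:
    """Walks a table of hardware configurations (hypothetical sketch)."""

    def __init__(self, table: Sequence[str]) -> None:
        if not table:
            raise ValueError("configuration table must not be empty")
        self.table = list(table)
        self.counter = 0  # index of the configuration to try next

    def current(self) -> str:
        """Return the configuration the next boot attempt should use."""
        return self.table[self.counter % len(self.table)]

    def reboot_failed(self) -> str:
        """Advance to the next configuration after a failed reboot."""
        self.counter += 1
        return self.current()

    def reset(self) -> None:
        """Return to the first configuration once the system is sane."""
        self.counter = 0
```

For a table of three entries, two failed reboots in a row move the selector to the third entry, and `reset()` returns it to the first.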
Sometimes the fault is very intermittent (usually triggered by software, such as diagnostics). After a recovery completes, users expect the configuration state display to disappear and the system to be sane. But if the system in fact trips on another fault, it may reboot itself and re-initiate the initialization sequence using the same configuration, which raises the probability that the system will loop in reboots and never attempt different configurations.
The first time the application tells PC that "all is well", believe it (30 second wait) and reset the configuration counter. The second and subsequent times within a longer time window (30 minutes), ignore the request.
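The rule of honoring only the first "all is well" report inside a window can be sketched as follows. This is a hedged illustration, not the system's actual logic: the class and method names are hypothetical, the window is assumed to start at the first honored report, ignored reports are assumed not to extend it, and the 30 second wait is not modeled. Time is passed in as a number of seconds so the behavior is easy to check.

```python
from typing import Optional


class AllIsWellFilter:
    """Honors the first "all is well" report per window (hypothetical sketch)."""

    def __init__(self, window_seconds: float = 30 * 60) -> None:
        self.window_seconds = window_seconds
        self._window_start: Optional[float] = None

    def all_is_well(self, now: float) -> bool:
        """Return True if the report is believed, False if it is ignored.

        A True result is the caller's cue to reset the configuration counter.
        """
        inside_window = (
            self._window_start is not None
            and now - self._window_start < self.window_seconds
        )
        if inside_window:
            return False  # second or later report inside the window: ignore
        self._window_start = now  # assumption: window opens at the honored report
        return True
```

A report at time 0 is believed, one a minute later is ignored, and one arriving after the 30 minute window has passed is believed again.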