Its basic design was magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy).[1]: 155
Eventually, these machines separated into three distinct categories. Most of the development in the first of these, so-called LLNM (Long Life, No Maintenance) computing, was done by NASA during the 1960s,[2] in preparation for Project Apollo and other research programs.[3]
Hyper-dependable computers, the second category, were pioneered mostly by aircraft manufacturers,[1]: 210 nuclear power companies, and the railroad industry in the United States.
Again, IBM developed the first computer of this kind for NASA, for guidance of Saturn V rockets, but later BNSF, Unisys, and General Electric built their own.[7]
In general, the early efforts at fault-tolerant design focused mainly on internal diagnosis, where a fault would indicate that something was failing so that a worker could replace it.[8]
Later efforts showed that, to be fully effective, the system had to be self-repairing and self-diagnosing: isolating a fault, switching in a redundant backup, and alerting the operator that a repair was needed.
This is known as N-model redundancy, where faults trigger automatic fail-safes and a warning to the operator, and it is still the most common form of level-one fault-tolerant design in use today.
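As a rough illustration of this fail-over pattern, the Python sketch below tries a primary unit and, on a fault, automatically switches to a backup while logging a warning for the operator. The names (`run_with_failover`, the simulated sensor units) are invented for the example and do not come from the sources above:

```python
import logging

logging.basicConfig(level=logging.WARNING)


def run_with_failover(units, *inputs):
    """Try each redundant unit in order; on a fault, fail over to the
    next unit and warn the operator that a repair is needed."""
    for i, unit in enumerate(units):
        try:
            return unit(*inputs)
        except Exception as fault:  # a real system would trap specific faults
            logging.warning(
                "unit %d failed (%s); failing over and flagging for repair",
                i, fault,
            )
    raise RuntimeError("all redundant units have failed")


def primary_sensor():
    raise IOError("sensor bus timeout")  # simulated fault in the primary


def backup_sensor():
    return 21.5  # the backup is still healthy


print(run_with_failover([primary_sensor, backup_sensor]))  # 21.5, warning logged
```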
Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results.
Historically, the trend has been to move away from N-model redundancy and toward M-out-of-N, as systems have grown more complex and it has become harder to ensure that the transition from a fault-negative to a fault-positive state does not disrupt operations.
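A minimal sketch of the voting idea behind both of the schemes above, assuming the redundant modules return directly comparable values; the function name `m_of_n_vote` is purely illustrative:

```python
from collections import Counter


def m_of_n_vote(results, m):
    """Accept a value only if at least m of the n redundant modules
    agree on it; otherwise raise, signalling an unresolved fault."""
    value, count = Counter(results).most_common(1)[0]
    if count >= m:
        return value
    raise RuntimeError(f"no value reached {m}-of-{len(results)} agreement")


# Triple modular redundancy, as in SAPO's memory, is the m = 2, n = 3 case:
print(m_of_n_vote([42, 42, 17], m=2))  # 42; the single faulty module is outvoted
```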
Tandem Computers built their entire business on such machines, which used single-point tolerance to create their NonStop systems with uptimes measured in years.
Additionally, some sites, including popular platforms such as Twitter (until December 2020), provide an optional lightweight front end that does not rely on JavaScript and has a minimal layout, to ensure wide accessibility and outreach, such as on game consoles with limited web browsing capabilities.
For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.
In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) to prevent data corruption after an error occurs.
In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely.
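The hedged sketch below illustrates both behaviors in one place: invalid records are skipped so the program keeps operating (robustness), while an unrecoverable I/O error leads to a controlled, data-preserving exit rather than an uncontrolled crash (fail-safe). The function and file names are invented for the example:

```python
import sys


def process_records(lines, out_path):
    """Skip invalid records instead of crashing (robustness); on an
    unrecoverable I/O error, exit gracefully with the output file
    already flushed and closed (fail-safe)."""
    try:
        with open(out_path, "w") as out:  # closed even if an error escapes
            for line in lines:
                try:
                    value = float(line)  # may raise ValueError on bad input
                except ValueError:
                    print(f"skipping invalid record: {line!r}", file=sys.stderr)
                    continue  # keep operating despite the bad record
                out.write(f"{value}\n")
    except OSError as err:
        # Graceful exit: report and terminate with a nonzero status
        # rather than crashing uncontrolled mid-write.
        sys.exit(f"fatal I/O error, exiting cleanly: {err}")


process_records(["1.5", "oops", "2.5"], "totals.txt")
```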
Resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.
The associated redundancy brings a number of penalties: increases in weight, size, power consumption, and cost, as well as in the time to design, verify, and test.
The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so the third test is passed.
However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (which can rust, stretch, jam, or snap) or hydraulic fluid (which can leak, boil and develop bubbles, or absorb water and thus lose effectiveness).
Therefore, no redundancy is built into the handbrake per se (it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and if it fails on a hill, it can suffice to use the footbrake to momentarily hold the vehicle still before driving off to find a flat piece of road on which to stop.
On cheaper, slower utility-class machines, even if the front wheel uses a hydraulic disc for extra braking force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum. This is thanks to the ease of connecting the footpedal to the wheel that way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work.
Considering the importance of high-value systems in transport, public utilities, and the military, the field of relevant research topics is very wide: it ranges from obvious subjects such as software modeling, reliability, and hardware design to more arcane elements such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more.
The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%.[21]
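The published technique operates on compiled code, so a faithful example is not possible in a few lines; the Python sketch below is only an analogue, showing the kind of validity check inserted before every access. The function name `checked_read` is invented for illustration; the point is that every access now pays for an extra test, which is where the slowdown comes from:

```python
def checked_read(buffer, index):
    """Validate the 'address' (here, a list index) before every access,
    mimicking the dynamic checks the rewriter inserts."""
    if not 0 <= index < len(buffer):
        raise IndexError(f"invalid access at index {index}")
    return buffer[index]


data = [10, 20, 30]
print(checked_read(data, 1))  # 20
# checked_read(data, 7)       # would raise instead of reading invalid memory
```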
Recovery shepherding is a lightweight technique that enables software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero.[22]
For 17 of 18 systematically collected real-world null-dereference and divide-by-zero errors, a prototype implementation enabled the application to continue executing and provide acceptable output and service to its users on the error-triggering inputs.
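Recovery shepherding itself works on binaries and tracks the "manufactured" values it injects; the decorator below is only a loose Python analogue of the core idea, substituting a default value for an otherwise fatal error so the program can keep serving requests. The decorator name and the chosen default are invented for illustration:

```python
import functools


def shepherd(default):
    """Loose analogue of recovery shepherding: convert an otherwise
    fatal error into a manufactured default value so execution can
    continue (AttributeError stands in for a null dereference here)."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except (ZeroDivisionError, AttributeError):
                return default  # manufactured value stands in for the result
        return inner
    return wrap


@shepherd(default=0.0)
def ratio(a, b):
    return a / b


print(ratio(6, 3))  # 2.0
print(ratio(6, 0))  # 0.0 instead of a crash; service continues
```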
For instance, the Western Electric crossbar systems had failure rates of two hours per forty years, and therefore were highly fault resistant.
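To put that figure in perspective, forty years is roughly 40 × 8,760 ≈ 350,000 hours, so two hours of downtime corresponds to an unavailability of about 2/350,000 ≈ 0.0006%, i.e. an availability of roughly 99.9994%.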