Troubleshooting

Troubleshooting is a form of problem solving, often applied to repair failed products or processes on a machine or a system.

It is a logical, systematic search for the source of a problem in order to solve it and make the product or process operational again.

Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

In general, troubleshooting is the identification or diagnosis of "trouble" in the management flow of a system caused by a failure of some kind.

(For example, selecting the "print" option from various computer applications is intended to result in a hard copy emerging from some specific device.)

There are two major elements required to enable a troubleshooting diagnosis to take place: a priori domain knowledge and search strategies.

Diagnosis of a problem transpires as a rapid recognition process in which symptoms evoke appropriate situation categories.

However, the strategy won’t work independently with truly novel problems, or where deeper understanding of whatever is taking place is sought.

Topography in this context means a description or an analysis of a structured entity, showing the relations among its elements.

Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.

Some computerized troubleshooting services (such as Primary, later renamed Manesar) immediately show the top 10 solutions with the highest probability of fixing the underlying problem.

The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms.
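As a rough illustration of how such a service might update its estimates, the sketch below keeps a count of fixes and attempts for each pairing of symptom set and solution, and ranks solutions by a smoothed success rate. The class name `SolutionRanker`, its methods, and the use of Laplace smoothing are all assumptions made for this example; the text does not describe how any actual service computes its probabilities.

```python
from collections import defaultdict


class SolutionRanker:
    """Hypothetical ranker: estimates P(fix | symptoms) from user reports."""

    def __init__(self):
        # (symptom set, solution) -> [times it fixed the problem, times tried]
        self._stats = defaultdict(lambda: [0, 0])

    def report(self, symptoms, solution, fixed):
        """Record one report of whether `solution` fixed `symptoms`."""
        entry = self._stats[(frozenset(symptoms), solution)]
        if fixed:
            entry[0] += 1
        entry[1] += 1

    def top(self, symptoms, k=10):
        """Return up to k (solution, estimated probability) pairs, best first."""
        key = frozenset(symptoms)
        scored = [
            # Laplace-smoothed estimate, so a single report cannot yield 0 or 1.
            (solution, (fixes + 1) / (tries + 2))
            for (sym, solution), (fixes, tries) in self._stats.items()
            if sym == key
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Each new report changes the counts, so the ranking returned by `top` shifts as evidence accumulates. A real service would presumably also need to match partially overlapping symptom sets, which this sketch does not attempt.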

[9][10] Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed.

[12] It is similar to the game of "twenty questions": Anyone can isolate one option out of a million by dividing the set of alternatives in half 20 times (because 2^10 = 1024 and 2^20 = 1,048,576).
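The halving argument can be sketched as a binary search. The function below is a minimal example, not a general troubleshooting tool: `find_first_bad` and the `is_bad` callback are hypothetical names, and the sketch assumes the candidates are ordered so that everything before the faulty point tests good and everything from it onward tests bad, with at least one bad item present.

```python
def find_first_bad(n, is_bad):
    """Locate the first bad item among items 0..n-1 by repeated halving.

    Assumes items are good up to some point and bad from then on, and that
    at least one item is bad. Returns (index, number of probes used).
    """
    lo, hi = 0, n - 1  # invariant: the first bad index lies in [lo, hi]
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_bad(mid):
            hi = mid      # first bad item is at mid or earlier
        else:
            lo = mid + 1  # first bad item is after mid
    return lo, probes
```

With one million candidates the loop runs at most 20 times, because each probe discards half of the remaining range and 2^20 exceeds 1,000,000.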

Often considerable effort and emphasis in troubleshooting are placed on reproducibility ... on finding a procedure to reliably induce the symptom to occur.

In electronics this often is the result of components that are thermally sensitive (since resistance of a circuit varies with the temperature of the conductors in it).
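To give a sense of scale for this temperature dependence, the sketch below uses the common linear approximation R = R0 * (1 + alpha * (T - T0)). The function name `resistance_at` is hypothetical, and the coefficient of roughly 0.0039 per degree Celsius is an assumed, approximate figure for copper near room temperature; real components may differ considerably.

```python
def resistance_at(r0, alpha, temp_c, ref_temp_c=20.0):
    """Linear approximation of resistance at temp_c (degrees Celsius).

    r0 is the resistance at ref_temp_c; alpha is the temperature
    coefficient of resistance per degree Celsius (assumed constant).
    """
    return r0 * (1.0 + alpha * (temp_c - ref_temp_c))


# Assumed approximate coefficient for copper near room temperature.
COPPER_ALPHA = 0.0039
```

Under these assumptions, a copper conductor measuring 100 ohms at 20 °C would measure about 119.5 ohms at 70 °C, a change large enough to push a marginal circuit out of tolerance.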

Compressed air can be used to cool specific spots on a circuit board and a heat gun can be used to raise the temperatures; thus troubleshooting of electronics systems frequently entails applying these tools in order to reproduce a problem.

In computer programming, race conditions often lead to intermittent symptoms that are extremely difficult to reproduce. Various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (analogous to "heating up" a component in a hardware circuit), while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
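The delay technique can be illustrated with a small sketch. It widens the window between reading and writing a shared counter so that a lost-update race, normally rare, occurs on almost every run; enabling a lock forces synchronization and removes the symptom. The function `run_counter` and its parameters are hypothetical names chosen for this example.

```python
import threading
import time


def run_counter(n_threads, delay, use_lock):
    """Have n_threads each add 1 to a shared counter; return the final value.

    `delay` is an artificial pause between the read and the write that
    widens the race window. With use_lock=True the read-modify-write is
    serialized, so no update can be lost.
    """
    counter = {"value": 0}
    lock = threading.Lock()

    def unsafe_increment():
        current = counter["value"]       # read
        time.sleep(delay)                # injected delay exposes the race
        counter["value"] = current + 1   # write, possibly using a stale read

    def increment():
        if use_lock:
            with lock:
                unsafe_increment()
        else:
            unsafe_increment()

    threads = [threading.Thread(target=increment) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]
```

Calling `run_counter(4, 0.05, use_lock=False)` will usually return fewer than 4 because several threads read the same stale value, while `run_counter(4, 0.05, use_lock=True)` returns 4. The exact unlocked result depends on thread scheduling, which is the nature of the problem being demonstrated.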

Nevertheless, sometimes troubleshooters must resort to statistical methods ... and can only find procedures to increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible.

In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the root cause has been found and that the problem is truly solved.
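One way to put a number on that confidence is sketched below, under strong simplifying assumptions that the text does not state: each trial is independent, and before the fix the symptom appeared with a known probability `p` per trial. If the fault were still present, the chance of seeing n clean trials in a row would be (1 - p)^n, so one can ask how many clean trials are needed to push that chance below a chosen threshold. The function name `clean_trials_needed` is hypothetical.

```python
import math


def clean_trials_needed(p, alpha=0.05):
    """Smallest n such that (1 - p)**n <= alpha.

    p     -- assumed per-trial probability of the symptom before the fix
    alpha -- accepted chance of n clean trials occurring with the fault
             still present
    Assumes independent trials, with 0 < p < 1 and 0 < alpha < 1.
    """
    if not (0.0 < p < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("p and alpha must lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - p))
```

For a symptom that appeared in about 10% of trials, 29 consecutive clean trials are needed before the chance of a false "fixed" verdict drops below 5%. For a symptom seen in 1% of trials the figure rises to 299, which is why increasing the symptom's rate of occurrence makes verification so much cheaper.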