Observability (software)

In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components.[1][2] To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it.
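
As a minimal sketch of the logging side of these techniques, the example below uses Python's standard logging module; the service name and field names are illustrative assumptions, not part of any particular observability product. It emits structured, machine-readable log records that a telemetry pipeline could later aggregate and query.

```python
import json
import logging

# Write one JSON object per line, a common format for log aggregation pipelines.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-service")  # hypothetical service name

def handle_request(order_id: str, user_id: str) -> None:
    # Each record carries named fields so it can be filtered and correlated
    # later, rather than being a free-form message string.
    logger.info(json.dumps({
        "event": "order_received",
        "order_id": order_id,
        "user_id": user_id,
    }))

handle_request("o-1042", "u-7")
```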

The term is borrowed from control theory, where the "observability" of a system measures how well its state can be determined from its outputs.

Similarly, software observability measures how well a system's state can be understood from the obtained telemetry (metrics, logs, traces, profiling).

The definition of observability varies by vendor:

- "a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre [...] without needing to ship new code"
- "software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application along with the hardware and network it runs on"
- "observability starts by shipping all your raw data to central service before you begin analysis"
- "the ability to measure a system's current state based on the data it generates, such as logs, metrics, and traces"
- "Observability is tooling or a technical solution that allows teams to actively debug their system."

Because metrics and their dimensions are typically defined in advance in the instrumentation code, when a previously unknown issue is encountered it is impossible to add new metrics without shipping new code.
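
For instance, with a conventional metrics library such as the Prometheus Python client (used here only as an illustration, not as prescribed tooling), a counter's label names are fixed when it is declared, so exploring a dimension that was not anticipated means changing the code and redeploying:

```python
from prometheus_client import Counter

# The metric's dimensions (label names) are fixed when the code is written;
# adding a new dimension such as "customer_id" later requires changing this
# declaration and shipping a new version of the service.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "status"],
)

def record_request(method: str, status: int) -> None:
    REQUESTS.labels(method=method, status=str(status)).inc()

record_request("GET", 200)
record_request("POST", 500)
```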

A distributed trace is an interrelated series of discrete events (also called spans) that track the progression of a single user request.
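
A minimal sketch of producing such spans with the OpenTelemetry Python SDK follows; the service, span, and attribute names are illustrative assumptions. The nested span shares the trace context of the outer request span, so both events belong to the same trace.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to stdout; a real deployment would send them to a
# tracing backend instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative name

# The outer span covers the whole user request; the nested span is one
# discrete step within it, linked by a shared trace ID.
with tracer.start_as_current_span("handle_checkout") as request_span:
    request_span.set_attribute("http.method", "POST")
    with tracer.start_as_current_span("charge_payment") as step_span:
        step_span.set_attribute("payment.provider", "example")
```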

In fast-changing systems, instrumentation itself is often the best possible documentation, since it combines intention (what are the dimensions that an engineer named and decided to collect?) with live, up-to-date information about the running system.[13]

Majors et al. suggest that the pillars of observability are high cardinality, high dimensionality, and explorability, arguing that runbooks and dashboards have little value because "modern systems rarely fail in precisely the same way twice".
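
To make high cardinality and high dimensionality concrete, the hypothetical sketch below emits one "wide event" per request as a JSON line; every field name is an assumption for illustration. Fields such as user_id and request_id take effectively unbounded value sets, and new dimensions can be added simply by passing more fields, which is what makes the data explorable after the fact.

```python
import json
import sys
import time
import uuid

def emit_wide_event(**fields) -> None:
    # One richly annotated event per unit of work; extra keyword arguments
    # become new dimensions without a fixed, predeclared schema.
    event = {"timestamp": time.time(), **fields}
    json.dump(event, sys.stdout)
    sys.stdout.write("\n")

emit_wide_event(
    service="checkout-service",    # low cardinality
    http_status=200,               # low cardinality
    user_id="u-48213",             # high cardinality
    request_id=str(uuid.uuid4()),  # effectively unique per event
    cart_value_cents=12999,
    duration_ms=87.4,
)
```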

Self-monitoring may be put in place in addition to high availability and redundancy to further reduce the risk of correlated failures.
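
One possible form of self-monitoring, sketched below under the assumption of an in-process watchdog (the names and threshold are hypothetical), is to track when the telemetry pipeline last processed an event and to flag prolonged silence through an alerting channel that does not depend on the pipeline being monitored.

```python
import time

HEARTBEAT_TIMEOUT_SECONDS = 120  # hypothetical threshold

class PipelineWatchdog:
    """Tracks when telemetry was last seen and flags prolonged silence."""

    def __init__(self, timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> None:
        self.timeout = timeout
        self.last_event_seen = time.monotonic()

    def record_event(self) -> None:
        # Call whenever the pipeline successfully processes an event.
        self.last_event_seen = time.monotonic()

    def is_healthy(self) -> bool:
        # Poll this periodically; route any alert through an independent
        # channel so a pipeline outage cannot also silence the alert.
        return (time.monotonic() - self.last_event_seen) < self.timeout

watchdog = PipelineWatchdog()
watchdog.record_event()
assert watchdog.is_healthy()
```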