Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program.
A simple way to perform data-flow analysis of programs is to set up data-flow equations for each node of the control-flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes, i.e., it reaches a fixpoint.[1][2][3][4]
Data-flow analysis is the process of collecting information about the way variables are defined and used in the program.
Applying the transfer function of each statement separately yields information at points inside a basic block.
Each particular type of data-flow analysis has its own specific transfer function and join operation.
If the control-flow graph does not contain cycles (i.e., there are no explicit or implicit loops in the procedure), solving the equations is straightforward.
The control-flow graph can then be topologically sorted; running in the order of this sort, the entry states can be computed at the start of each block, since all predecessors of that block have already been processed, so their exit states are available.
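This one-pass scheme for acyclic graphs can be sketched as follows. The diamond-shaped CFG, the block names, and the gen sets are invented for illustration; the sketch assumes a forward analysis whose join operation is set union.

```python
# One-pass solution for an acyclic CFG: process blocks in topological
# order, so every predecessor's out-state is ready when a block is visited.
# The graph b1 -> {b2, b3} -> b4 and the gen sets are hypothetical.
from graphlib import TopologicalSorter

preds = {"b1": [], "b2": ["b1"], "b3": ["b1"], "b4": ["b2", "b3"]}
gen = {"b1": {"a"}, "b2": {"b"}, "b3": {"c"}, "b4": set()}

out_state = {}
# TopologicalSorter yields each block only after all of its predecessors.
for b in TopologicalSorter(preds).static_order():
    # Entry state: join (here, union) of the predecessors' exit states.
    in_state = set().union(*(out_state[p] for p in preds[b])) if preds[b] else set()
    out_state[b] = in_state | gen[b]  # one application of the transfer function
```

Because the graph is acyclic, a single application of the transfer function per block already yields the final solution; no iteration is needed.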
The latter two steps are repeated until we reach the so-called fixpoint: the situation in which the in-states (and the out-states in consequence) do not change.
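When the CFG does contain cycles, the iteration described above can be sketched as a round-robin loop. The three-block graph (with a self-loop on b2) and the gen/kill sets below are invented for illustration, using a reaching-definitions-style forward analysis:

```python
# Round-robin iteration to a fixpoint on a hypothetical cyclic CFG.
# Transfer function per block: out = gen ∪ (in − kill); join is union.
preds = {"b1": [], "b2": ["b1", "b2"], "b3": ["b2"]}
gen  = {"b1": {"d1"}, "b2": {"d2"}, "b3": set()}
kill = {"b1": set(), "b2": {"d1"}, "b3": set()}

in_state  = {b: set() for b in preds}
out_state = {b: set() for b in preds}

changed = True
while changed:            # repeat until no state changes: the fixpoint
    changed = False
    for b in preds:       # recompute every node's in- and out-state
        new_in = set().union(*(out_state[p] for p in preds[b])) if preds[b] else set()
        new_out = gen[b] | (new_in - kill[b])
        if new_in != in_state[b] or new_out != out_state[b]:
            in_state[b], out_state[b] = new_in, new_out
            changed = True
```

Monotone transfer functions over a finite-height domain guarantee that this loop terminates, for the reasons given below.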
This can be guaranteed by imposing constraints on the combination of the value domain of the states, the transfer functions and the join operation.
The value domain should be a partial order with finite height (i.e., there are no infinite ascending chains).
The combination of the transfer function and the join operation should be monotonic with respect to this partial order.
The efficiency of iteratively solving data-flow equations is influenced by the order in which local nodes are visited.
The initial value of the in-states is important to obtain correct and accurate results.
If the minimum element represents totally conservative information, the results can be used safely even during the data-flow iteration.
If it represents the most accurate information, the fixpoint should be reached before the results can be applied.
The following are examples of properties of computer programs that can be calculated by data-flow analysis.
This is because data-flow analysis operates on the syntactic structure of the CFG without simulating the exact control flow of the program.
However, to still be useful in practice, a data-flow analysis algorithm is typically designed to calculate an upper or lower approximation, respectively, of the real program properties.
The result is typically used by dead code elimination to remove statements that assign to a variable whose value is not used afterwards.
Solving the data-flow equations starts with initializing all in-states and out-states to the empty set.
Its computed in-state differs from the previous one, so its predecessors b1 and b2 are inserted into the worklist and the process continues.
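The worklist process can be sketched for live-variable analysis, which is a backward analysis. The CFG below (a three-block loop) and the use/def sets are invented for illustration:

```python
# Worklist algorithm for live-variable analysis on a hypothetical CFG
# b1 -> b2 -> b3 -> b1. A block's in-state depends on its successors;
# when it changes, the block's predecessors become stale and are re-added.
preds = {"b1": ["b3"], "b2": ["b1"], "b3": ["b2"]}
succs = {"b1": ["b2"], "b2": ["b3"], "b3": ["b1"]}
use  = {"b1": set(), "b2": {"x"}, "b3": set()}  # read before any write
defs = {"b1": {"x"}, "b2": set(), "b3": set()}  # written in the block

live_in  = {b: set() for b in preds}   # all states start empty
live_out = {b: set() for b in preds}
worklist = list(preds)
while worklist:
    b = worklist.pop()
    new_out = set().union(*(live_in[s] for s in succs[b]))
    new_in  = use[b] | (new_out - defs[b])
    if new_in != live_in[b]:
        live_in[b], live_out[b] = new_in, new_out
        worklist.extend(preds[b])      # predecessors must be revisited
    else:
        live_out[b] = new_out
```

In this sketch `x` is live out of b1 (it is used in b2) but not live out of b2, so an assignment to `x` after its use in b2 would be a candidate for dead code elimination.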
Several modern compilers use static single-assignment form as the method for analysis of variable dependencies.[6]
In 2002, Markus Mohnen described a new method of data-flow analysis that does not require the explicit construction of a data-flow graph,[7] instead relying on abstract interpretation of the program and keeping a working set of program counters.
A combination of control flow analysis and data flow analysis has been shown to be useful and complementary in identifying cohesive source code regions implementing functionalities of a system (e.g., features, requirements or use cases).[8]
There are a variety of special classes of dataflow problems which have efficient or general solutions.
Using this representation, the join and transfer functions can be implemented as bitwise logical operations.
The transfer function for each block can be decomposed into so-called gen and kill sets.
Solutions to these problems provide context-sensitive and flow-sensitive dataflow analyses.[9][11]
There are several implementations of IFDS-based dataflow analyses for popular programming languages, e.g. in the Soot[12] and WALA[13] frameworks for Java analysis.