Shortest common supersequence

Given two sequences X = < x1,...,xm > and Y = < y1,...,yn >, a sequence U = < u1,...,uk > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

In the shortest common supersequence problem, two sequences X and Y are given, and the task is to find a shortest possible common supersequence of these sequences.

For two input sequences, an SCS can be formed from a longest common subsequence (LCS) easily.

By inserting the non-LCS symbols into Z while preserving their original order, we obtain a shortest common supersequence U

There is no similar relationship between shortest common supersequences and longest common subsequences of three or more input sequences.

(In particular, LCS and SCS are not dual problems.)

For the general case of an arbitrary number of input sequences, the problem is NP-hard.

[1] The closely related problem of finding a minimum-length string which is a superstring of a finite set of strings S = { s1,s2,...,sn } is also NP-hard.

[3] However, perhaps the simplest solution is to reformulate the problem as an instance of weighted set cover in such a way that the weight of the optimal solution to the set cover instance is less than twice the length of the shortest superstring S. One can then use the O(log(n))-approximation for weighted set-cover to obtain an O(log(n))-approximation for the shortest superstring (note that this is not a constant factor approximation).

The instance I of set cover is formulated as follows: The instance I can then be solved using an algorithm for weighted set cover, and the algorithm can output an arbitrary concatenation of the strings x for which the weighted set cover algorithm outputs P(x).