Frequent subtree mining

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.

[1] It is a more general form of the maximum agreement subtree problem.

[2] Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern.

[3] The problem of frequent subtree mining has been formally defined as:[1] In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.

A sub-tree is frequent if its support is not less than a user-specified threshold (often denoted as minsup).

Starting from the root of the tree, node labels are added to the string in depth-first search order.

-1 is added to the string whenever the search process backtracks from a child to its parent.

Two k-sub-trees are said to be in the same prefix equivalence class if the string representation of them are identical up to the (k-1)-th node.

In other words, all elements in a prefix equivalence class only differ by the last node.

For example, two trees with string representation A B -1 C -1 and A B -1 D -1 are in the prefix equivalence class A B with elements (C, 0) and (D,0).

where l and r are the minimum and maximum node index in the sub-tree rooted at A.

As such the index of any descendant of A must lie in the scope of A, which will be a very useful property when counting the support of sub-trees.

Frequent sub-tree patterns follow the anti-monotone property.

By utilizing this property, k-sub-trees candidates can be generated based on frequent (k-1)-sub-trees through prefix class extension.

This operation is repeated for any two ordered, but not necessarily distinct elements in C to construct the extended prefix classes of k-sub-trees.

TreeMiner performs depth first candidate generation using scope-list representation of sub-trees to facilitate faster support counting.

A k-sub-tree S can be representation by a triplet (t,m,s) where t is the tree id the sub-tree comes from, m is the prefix match label, and s the scope of the last node in S. Depending on how S occurs in different trees across the database, S can have different scope-list representation.

By keeping track of distinct tree ids used in the scope-list tests, the support of sub-trees can be calculated efficiently.

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining.

[1] Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining.

[4] Other domains in which frequent subtree mining is useful include computational biology,[5][6] RNA structure analysis,[6] pattern recognition,[7] bioinformatics,[8] and analysis of the KEGG GLYCAN database.

[7] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".