A BK-tree is a metric tree suggested by Walter Austin Burkhard and Robert M. Keller[1] specifically adapted to discrete metric spaces.
For simplicity, consider integer discrete metric
An arbitrary element a is selected as root node.
The root node may have zero or more subtrees.
The k-th subtree is recursively built of all elements b such that
BK-trees can be used for approximate string matching in a dictionary.
[2][example needed] This picture depicts the BK-tree for the set
of words {"book", "books", "cake", "boo", "boon", "cook", "cake", "cape", "cart"} obtained by using the Levenshtein distance The BK-tree is built so that: The insertion primitive is used to populate a BK-tree
Input: Output: Algorithm: Given a searched element
, the lookup primitive traverses the BK-tree to find the closest element of
The key idea is to restrict the exploration of
to nodes that can only improve the best candidate found so far by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).
Input: Output: Algorithm: Consider the example 8-node B-K Tree shown above and set
is initialized to contain the root of the tree, which is subsequently popped as the first value of
since the distance from "book" to "cool" is 2, and
as this is the best (i.e. smallest) distance found thus far.
Next each outgoing arc from the root is considered in turn: the arc from "book" to "books" has weight 1, and since
The next arc, from "book" to "cake," has weight 4, and since
Therefore, the subtree rooted at "cake" will be pruned from the search, as the word closest to "cool" cannot appear in that subtree.
To see why this pruning is correct, notice that a candidate word
appearing in "cake"s subtree having distance less than 2 to "cool" would violate the triangle inequality: the triangle inequality requires that for this set of three numbers (as sides of a triangle), no two can sum to less than the third, but here the distance from "cool" to "book" (which is 2) plus the distance from "cool" to
(which is less than 2) cannot reach or exceed the distance from "book" to "cake" (which is 4).
Therefore, it is safe to disregard the entire subtree rooted at "cake".
, the distance from "cool" to "books."
remains set at 2 and the single outgoing arc from the node containing "books" is considered.
, the distance from "cool" to "boo."
Each outgoing arc from "boo" is now considered; the arc from "boo" to "boon" has weight 1, and since
are considered in arbitrary order: suppose the node containing "cook" is popped first, improving
to distance 1, then the node containing "boon" is popped last, which has distance 2 from "cool" and therefore does not improve the best result.
Finally, "cook" is returned as the answer