SimRank

SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model.

SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects.

Although SimRank is widely adopted, it may output unreasonable similarity scores which are influenced by different factors, and can be solved in several ways, such as introducing an evidence weight factor,[1] inserting additional terms that are neglected by SimRank[2] or using PageRank-based alternatives.

[3] Many applications require a measure of "similarity" between objects.

One obvious example is the "find-similar-document" query, on traditional text corpora or the World-Wide Web.

More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.

In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences.

SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest.

A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information.

In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item.

Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.

The base case is that objects are maximally similar to themselves .

[4] It is important to note that SimRank is a general algorithm that determines only the similarity of structural context.

SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships.

For example, for Web pages SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora.

For recommendation systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.

), as well as similarities between users (e.g., same gender, same spending level).

Following the earlier motivation, a recursive equation is written for

be the column normalized adjacency matrix whose entry

is a lower bound on the actual SimRank score

It was shown in [4] that the values converge to limits satisfying the basic SimRank equation, the SimRank scores

The original SimRank proposal suggested choosing the decay factor

generally imply relatively low accuracy of iteratively computed SimRank scores.

For guaranteeing more accurate computation results, the latter paper suggests either using a smaller decay factor (in particular,

CoSimRank is a variant of SimRank with the advantage of also having a local formulation, i.e. CoSimRank can be computed for a single node pair.

To compute the similarity score of only a single node pair, let

Step two sums up the vector similarity of each iteration.

Both, matrix and local representation, compute the same similarity score.

CoSimRank can also be used to compute the similarity of sets of nodes, by modifying

Lizorkin et al.[5] proposed three optimization techniques for speeding up the computation of SimRank: In particular, the second observation of partial sums memoization plays a paramount role in greatly speeding up the computation of SimRank from