HITS algorithm

Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg.

The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages.

Journals such as Science and Nature are filled with numerous citations, making these magazines have very high impact factors.

In other words, it is better to receive citations from an important journal than from an unimportant one.

Counting the number of links to a page can give us a general estimate of its prominence on the Web, but a page with very few incoming links may also be prominent, if two of these links come from the home pages of sites like Yahoo!, Google, or MSN.

Because these sites are of very high importance but are also search engines, a page can be ranked much higher than its actual relevance.

In the HITS algorithm, the first step is to retrieve the most relevant pages to the search query.

The HITS computation is performed only on this focused subgraph.

According to Kleinberg the reason for constructing a base set is to ensure that most (or many) of the strongest authorities are included.

Authority and hub values are defined in terms of one another in a mutual recursion.

The algorithm performs a series of iterations, each consisting of two basic steps: The Hub score and Authority score for a node is calculated with the following algorithm: HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web.

In order to calculate the hub/authority scores of each node, repeated iterations of the Authority Update Rule and the Hub Update Rule are applied.

A k-step application of the Hub-Authority algorithm entails applying for k times first the Authority Update Rule and then the Hub Update Rule.

The final hub-authority scores of nodes are determined after infinite repetitions of the algorithm.

As directly and iteratively applying the Hub Update Rule and Authority Update Rule leads to diverging values, it is necessary to normalize the matrix after every iteration.

Thus the values obtained from this process will eventually converge.

The hub and authority values converge in the pseudocode above.

The code below does not converge, because it is necessary to limit the number of steps that the algorithm runs for.

Expanding the root set into a base set