Cosine similarity

Cosine similarity is the cosine of the angle between two non-zero vectors; that is, the dot product of the vectors divided by the product of their magnitudes. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle.

In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in $[0, 1]$.

For example, in information retrieval and text mining, each word is assigned a different coordinate and a document is represented by a vector whose components give the number of occurrences of each word in the document.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates need to be considered.

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:
$$\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\| \, \|\mathbf{B}\| \cos\theta$$
Given two n-dimensional vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as
$$\text{cosine similarity} = S_C(A, B) := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}},$$
where $A_i$ and $B_i$ are the $i$-th components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.
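As a concrete illustration, here is a minimal Python sketch of this formula (plain Python with no external dependencies; the function name cosine_similarity is chosen for illustration only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Example: vectors pointing in the same direction have similarity 1,
# orthogonal vectors have similarity 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```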

The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

Cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative.

If the attribute vectors are normalized by subtracting the vector means (e.g., $A - \bar{A}$), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient.
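A small numerical check of this equivalence, assuming NumPy is available (the data and variable names are illustrative):

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 7.0])

# Centre each vector by subtracting its mean, then take the ordinary cosine.
ac, bc = a - a.mean(), b - b.mean()
centered_cosine = ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

# Pearson correlation coefficient of the same data.
pearson = np.corrcoef(a, b)[0, 1]

print(np.isclose(centered_cosine, pearson))  # True
```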

To repair the triangle inequality property while maintaining the same ordering, one can convert to Euclidean distance $\sqrt{2(1 - S_C(A, B))}$.

Alternatively, the triangular inequality that does work for angular distances can be expressed directly in terms of the cosines; see below.

The normalized angle, referred to as angular distance, between any two vectors $A$ and $B$ is a formal distance metric and can be calculated from the cosine similarity.

When the vector elements may be positive or negative:
$$\text{angular distance} = D_\theta := \frac{\arccos(\text{cosine similarity})}{\pi} = \frac{\theta}{\pi}$$
$$\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{\theta}{\pi}$$
Or, if the vector elements are always positive:
$$\text{angular distance} = D_\theta := \frac{2 \cdot \arccos(\text{cosine similarity})}{\pi} = \frac{2\theta}{\pi}$$
$$\text{angular similarity} = S_\theta := 1 - \text{angular distance} = 1 - \frac{2\theta}{\pi}$$
Unfortunately, computing the inverse cosine (arccos) function is slow, making the use of the angular distance more computationally expensive than using the more common (but not metric) cosine distance above.
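A short Python sketch of these conversions (the helper names angular_distance and angular_similarity are illustrative; the input is clamped to [−1, 1] to guard against floating-point rounding):

```python
import math

def angular_distance(cos_sim, nonnegative_elements=False):
    """Normalized angle in [0, 1] derived from a cosine similarity value.

    If the underlying vectors can only have non-negative elements, the angle
    lies in [0, pi/2], so the normalization factor is pi/2 instead of pi.
    """
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard against rounding error
    scale = math.pi / 2 if nonnegative_elements else math.pi
    return math.acos(cos_sim) / scale

def angular_similarity(cos_sim, nonnegative_elements=False):
    return 1.0 - angular_distance(cos_sim, nonnegative_elements)

print(angular_distance(1.0))                               # 0.0 (same direction)
print(angular_distance(-1.0))                              # 1.0 (opposite direction)
print(angular_distance(0.0, nonnegative_elements=True))    # 1.0 (orthogonal)
```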

Another effective proxy for cosine distance can be obtained by $L_2$ normalisation of the vectors, followed by the application of normal Euclidean distance.

Then the Euclidean distance over the end-points of any two vectors is a proper metric which gives the same ordering as the cosine distance (a monotonic transformation of Euclidean distance; see below) for any comparison of vectors, and furthermore avoids the potentially expensive trigonometric operations required to yield a proper metric.
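The monotonic relationship follows from expanding the squared Euclidean distance between unit vectors:
$$\|\mathbf{A} - \mathbf{B}\|^2 = \|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - 2\,\mathbf{A} \cdot \mathbf{B} = 2\,(1 - \cos\theta) \quad \text{when } \|\mathbf{A}\| = \|\mathbf{B}\| = 1,$$
so the Euclidean distance between normalised vectors is $\sqrt{2(1 - \cos\theta)}$, an increasing function of the cosine distance $1 - \cos\theta$.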

Once the normalisation has occurred, the vector space can be used with the full range of techniques available to any Euclidean space, notably standard dimensionality reduction techniques.

This normalised form of the distance is often used within many deep learning algorithms.

In biology, there is a similar concept known as the Otsuka–Ochiai coefficient named after Yanosuke Otsuka (also spelled as Ōtsuka, Ootsuka or Otuka,[6] Japanese: 大塚 弥之助)[7] and Akira Ochiai (Japanese: 落合 明),[8] also known as the Ochiai–Barkman[9] or Ochiai coefficient,[10] which can be represented as:
$$K = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}$$
Here, $A$ and $B$ are sets, and $|A|$ is the number of elements in $A$.

If sets are represented as bit vectors, the Otsuka–Ochiai coefficient can be seen to be the same as the cosine similarity.[11]
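A small check of this equivalence, assuming NumPy (the example sets and variable names are illustrative):

```python
import numpy as np

A = {"cat", "dog", "fish"}
B = {"dog", "fish", "bird", "cow"}

# Otsuka–Ochiai coefficient on sets: |A ∩ B| / sqrt(|A| * |B|)
ochiai = len(A & B) / np.sqrt(len(A) * len(B))

# The same sets encoded as bit vectors over the combined vocabulary.
vocab = sorted(A | B)
a = np.array([1.0 if w in A else 0.0 for w in vocab])
b = np.array([1.0 if w in B else 0.0 for w in vocab])
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(ochiai, cosine))  # True
```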

In a recent book,[12] the coefficient is tentatively misattributed to another Japanese researcher with the family name Otsuka.

The confusion arises because in 1957 Akira Ochiai attributes the coefficient only to Otsuka (no first name mentioned)[8] by citing an article by Ikuso Hamai (Japanese: 浜井 生三),[13] who in turn cites the original 1936 article by Yanosuke Otsuka.[7]

The most noteworthy property of cosine similarity is that it reflects a relative, rather than absolute, comparison of the individual vector dimensions.

However, more recent metrics with a grounding in information theory, such as Jensen–Shannon, SED, and triangular divergence, have been shown to have improved semantics in at least some contexts.[17]

The ordinary triangle inequality for angles (i.e., arc lengths on a unit hypersphere) gives us that
$$|\angle{AC} - \angle{CB}| \leq \angle{AB} \leq \angle{AC} + \angle{CB}.$$
Because the cosine function decreases as an angle in [0, π] radians increases, the sense of these inequalities is reversed when we take the cosine of each value:
$$\cos(\angle{AC} - \angle{CB}) \geq \cos(\angle{AB}) \geq \cos(\angle{AC} + \angle{CB}).$$
Using the cosine addition and subtraction formulas, these two inequalities can be written in terms of the original cosines,
$$\cos(\angle{AC})\cos(\angle{CB}) + \sin(\angle{AC})\sin(\angle{CB}) \geq \cos(\angle{AB}) \geq \cos(\angle{AC})\cos(\angle{CB}) - \sin(\angle{AC})\sin(\angle{CB}).$$
This form of the triangle inequality can be used to bound the minimum and maximum similarity of two objects A and B if the similarities to a reference object C are already known.
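A minimal Python sketch of this bound (the function name similarity_bounds is illustrative; it assumes the inputs are genuine cosines of angles in [0, π], so the corresponding sines are non-negative):

```python
import math

def similarity_bounds(sim_ac, sim_cb):
    """Lower and upper bounds on cos(angle AB) given cos(angle AC) and cos(angle CB)."""
    sin_ac = math.sqrt(1.0 - sim_ac * sim_ac)
    sin_cb = math.sqrt(1.0 - sim_cb * sim_cb)
    lower = sim_ac * sim_cb - sin_ac * sin_cb
    upper = sim_ac * sim_cb + sin_ac * sin_cb
    return lower, upper

# If A and B are each very similar to a reference object C,
# they cannot be too dissimilar to each other.
print(similarity_bounds(0.95, 0.90))  # roughly (0.719, 0.991)
```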

A soft cosine measure between two vectors considers similarities between pairs of features, rather than treating the features as completely independent. For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive.

For calculating soft cosine, the matrix $s$ is used to indicate similarity between features:
$$\text{soft\_cosine}(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j} \, \sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}},$$
where $s_{ij}$ is the similarity between feature $i$ and feature $j$. If there is no similarity between features ($s_{ii} = 1$, $s_{ij} = 0$ for $i \neq j$), this expression reduces to the conventional cosine similarity.

The time complexity of this measure is quadratic in the number of features, which makes it applicable to real-world tasks.[21]

An efficient implementation of such soft cosine similarity is included in the Gensim open source library.
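For illustration, here is a direct NumPy transcription of the soft cosine formula above (this is a sketch, not the Gensim API; $s$ is assumed to be a feature-similarity matrix with ones on its diagonal):

```python
import numpy as np

def soft_cosine_similarity(a, b, s):
    """Soft cosine between vectors a and b given a feature-similarity matrix s.

    s[i, j] is the similarity between feature i and feature j (s[i, i] == 1).
    The double sums make this naive implementation quadratic in the number of features.
    """
    num = a @ s @ b
    den = np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b)
    return num / den

# Toy example: features 0 and 1 are somewhat similar (e.g. related words),
# feature 2 is unrelated to both.
s = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0])

print(soft_cosine_similarity(a, b, s))          # 0.8, higher than the plain cosine
print(soft_cosine_similarity(a, b, np.eye(3)))  # 0.5, the identity matrix recovers ordinary cosine
```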