Triplet loss

Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples.

It was conceived by Google researchers for their prominent FaceNet algorithm for face recognition.[1]

Triplet loss is designed to support metric learning: it helps train models to learn an embedding (a mapping to a feature space) in which similar data points lie closer together and dissimilar ones lie farther apart, enabling robust discrimination across varied conditions.

In the context of face recognition, data points correspond to images.

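The loss function is defined using triplets of training points of the form $(A, P, N)$. In each triplet, $A$ (the anchor) is a reference point of a particular identity, $P$ (the positive) is another point of the same identity as $A$, and $N$ (the negative) is a point of a different identity.

The embedding of a point $x$ in a finite-dimensional Euclidean space is denoted by $f(x)$. Training aims to ensure that every triplet $(A^{(i)}, P^{(i)}, N^{(i)})$ in the training set satisfies the triplet constraint

$$\|f(A^{(i)}) - f(P^{(i)})\|_2^2 + \alpha < \|f(A^{(i)}) - f(N^{(i)})\|_2^2$$

Here $\alpha$ is a hyperparameter called the margin, and its value must be set manually.

Thus, the full form of the function to be minimized is the following:

$$L = \sum_{i=1}^{m} \max\left(\|f(A^{(i)}) - f(P^{(i)})\|_2^2 - \|f(A^{(i)}) - f(N^{(i)})\|_2^2 + \alpha,\ 0\right)$$

where $m$ is the number of triplets.

A baseline for understanding the effectiveness of triplet loss is the contrastive loss,[2] which operates on pairs of samples (rather than triplets).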

Training with the contrastive loss pulls embeddings of similar pairs closer together, and pushes dissimilar pairs apart.

Its pairwise approach is greedy, as it considers each pair in isolation.
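As a rough illustration of the pairwise behaviour described above, the following sketch implements one common formulation of the contrastive loss for a single pair. The exact form used in the cited work may differ; the function names and the default `margin` value are assumptions made for this example.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(u, v, is_similar, margin=1.0):
    """Contrastive loss for one pair of embeddings (one common formulation).

    Similar pairs are penalized by their squared distance, which pulls
    them together. Dissimilar pairs are penalized only while they are
    closer than `margin`, which pushes them apart up to that distance.
    The default margin is illustrative, not taken from any paper.
    """
    dist = math.sqrt(squared_distance(u, v))
    if is_similar:
        return dist ** 2
    return max(margin - dist, 0.0) ** 2
```

Each call looks at a single pair only, which is the sense in which the pairwise approach treats pairs in isolation.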

Triplet loss innovates by considering relative distances.
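A minimal sketch of this relative-distance objective in plain Python, assuming squared Euclidean distances between embeddings and a margin hyperparameter. The function names and the default margin value are illustrative assumptions; a practical implementation would use a tensor library and operate on whole batches.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`; otherwise it grows with the
    size of the violation. The default margin is illustrative.
    """
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


def total_triplet_loss(triplets, margin=0.2):
    """Sum of the triplet loss over an iterable of embedding triplets."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets)
```

Only the difference between the two distances enters the loss, so a triplet that already satisfies the margin contributes nothing and is not optimized further.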

Training with triplet loss also requires choosing which triplets to optimize, a step known as triplet mining. This step adds a layer of complexity compared to contrastive loss.

A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset.

In general, the set of valid triplets of the form (anchor, positive, negative) is very large.

To speed up training convergence, it is essential to focus on challenging triplets.

In the FaceNet paper, several options were explored, eventually arriving at the following.

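For each anchor-positive pair, the algorithm considers only semi-hard negatives, that is, negatives that lie farther from the anchor than the positive but still within the margin, so that the triplet constraint is violated.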

It may appear puzzling that the mining process neglects "very hard" negatives (i.e., closer to the anchor than the positive).

Experiments conducted by the FaceNet designers found that using such negatives often leads to convergence to degenerate local minima.

Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings have been computed for all points in the batch.
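The online mining step can be sketched as follows, under stated assumptions: embeddings for the batch are already computed, distances are squared Euclidean, and a negative counts as semi-hard when it is farther from the anchor than the positive but by less than the margin. Picking one semi-hard candidate at random per anchor-positive pair is a choice made for this example, not necessarily what FaceNet does, and all names are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mine_semi_hard_triplets(embeddings, labels, margin=0.2, rng=None):
    """Return (anchor, positive, negative) index triplets for one batch.

    A negative n is treated as semi-hard for the pair (a, p) when
    d(a, p) < d(a, n) < d(a, p) + margin.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    size = len(embeddings)
    # Pairwise squared distances, computed once for the whole batch.
    dist = [
        [squared_distance(embeddings[i], embeddings[j]) for j in range(size)]
        for i in range(size)
    ]
    triplets = []
    for a in range(size):
        for p in range(size):
            if a == p or labels[a] != labels[p]:
                continue  # not a valid anchor-positive pair
            candidates = [
                n
                for n in range(size)
                if labels[n] != labels[a]
                and dist[a][p] < dist[a][n] < dist[a][p] + margin
            ]
            if candidates:
                # Assumption: choose one semi-hard negative at random.
                triplets.append((a, p, rng.choice(candidates)))
    return triplets
```

Anchor-positive pairs with no semi-hard negative in the batch are simply skipped in this sketch.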

While ideally the entire dataset could be used, this is impractical in general.

Batches are constructed by selecting a large number of sample points from the same category (around 40) and adding randomly selected negatives for them.
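A simplified sketch of this batch construction, assuming the dataset is available as a mapping from category to a list of samples. Using a single category per batch, the `num_negatives` parameter, and all names here are assumptions made for illustration and are not taken from the FaceNet paper.

```python
import random


def build_batch(samples_by_category, category, per_category=40,
                num_negatives=20, rng=None):
    """Return a list of (sample, category) pairs forming one batch.

    Up to `per_category` samples are drawn from `category`, followed by
    up to `num_negatives` samples drawn at random from all other
    categories. Both sizes are illustrative defaults.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    same = samples_by_category[category]
    positives = rng.sample(same, min(per_category, len(same)))
    # Pool every sample that belongs to a different category.
    others = [
        (sample, other)
        for other, items in samples_by_category.items()
        if other != category
        for sample in items
    ]
    negatives = rng.sample(others, min(num_negatives, len(others)))
    return [(sample, category) for sample in positives] + negatives
```

Having many same-category points in a batch gives the mining step plenty of anchor-positive pairs to work with, while the random negatives supply candidates for the negative slot.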

Triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities.

This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks.[3]

In natural language processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture.