Feature learning

Feature learning is motivated by the fact that ML tasks such as classification often require input that is mathematically and computationally convenient to process.

[6] It is also possible to use the distances to the clusters as features, perhaps after transforming them through a radial basis function (a technique that has been used to train RBF networks[15]).
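
As a rough sketch of this idea using scikit-learn (the cluster count, gamma value and toy data below are illustrative assumptions, not values from the cited work):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy unlabeled data: 200 samples with 10 raw features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# KMeans.transform returns the distance from each sample to each cluster centre.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
distances = kmeans.transform(X)                 # shape (200, 8)

# Pass the distances through a radial basis function so that samples close
# to a cluster centre get a large value for that feature.
gamma = 1.0 / X.shape[1]
rbf_features = np.exp(-gamma * distances ** 2)  # learned feature representation
```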

[16] In a comparative evaluation of unsupervised feature learning methods, Coates, Lee and Ng found that k-means clustering with an appropriate transformation outperforms the more recently invented auto-encoders and RBMs on an image classification task.

Furthermore, PCA can effectively reduce dimensionality only when the input data vectors are correlated (which results in a few dominant eigenvalues).
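
The following sketch illustrates this point on synthetic correlated data (the latent dimensionality and noise level are arbitrary choices for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated data: 50 observed dimensions driven by only 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(500, 50))

# Because the inputs are correlated, a few eigenvalues dominate and a
# 3-component projection retains almost all of the variance.
pca = PCA().fit(X)
print(pca.explained_variance_ratio_[:5])
X_reduced = PCA(n_components=3).fit_transform(X)   # shape (500, 3)
```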

Local linear embedding (LLE) is a nonlinear learning approach for generating low-dimensional neighbor-preserving representations from (unlabeled) high-dimensional input.

The reconstruction weights obtained in the first step capture the "intrinsic geometric properties" of a neighborhood in the input data.
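
A minimal usage example with scikit-learn's implementation (the swiss-roll data and neighborhood size are illustrative assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Nonlinear 3-D data that lies on a 2-D manifold.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# LLE first reconstructs each point from its nearest neighbours, then finds
# low-dimensional coordinates that preserve those reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_embedded = lle.fit_transform(X)               # shape (1000, 2)
```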

[21] Aharon et al. proposed algorithm K-SVD for learning a dictionary of elements that enables sparse representation.
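
scikit-learn does not implement K-SVD itself, but its DictionaryLearning estimator pursues the same goal of learning a dictionary over which inputs have sparse codes; the sketch below uses that substitute on arbitrary toy data:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy data standing in for, e.g., flattened 8x8 image patches.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))

# Learn an overcomplete dictionary of 100 atoms; each sample is then
# approximated by a sparse combination of at most 5 atoms.
dico = DictionaryLearning(n_components=100, transform_algorithm="omp",
                          transform_n_nonzero_coefs=5, random_state=0)
codes = dico.fit_transform(X)    # sparse codes, shape (200, 100)
atoms = dico.components_         # learned dictionary, shape (100, 64)
```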

[23] These architectures are often designed based on the assumption of distributed representation: observed data is generated by the interactions of many different factors on multiple levels.

In a deep learning architecture, the output of each intermediate layer can be viewed as a representation of the original input data.

The weights together with the connections define an energy function, based on which a joint distribution of visible and hidden nodes can be devised.
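
For a Bernoulli RBM with visible units v, hidden units h, weight matrix W and bias vectors a and b, the standard formulation (notation here follows common convention rather than a specific source) is

E(v, h) = -a^T v - b^T h - v^T W h,    P(v, h) = exp(-E(v, h)) / Z,

where Z is the partition function obtained by summing exp(-E) over all configurations of v and h.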

The weights can be trained by maximizing the probability of visible variables using Hinton's contrastive divergence (CD) algorithm.
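
Below is a minimal single-step contrastive divergence (CD-1) update written in NumPy; it is a sketch of the general scheme for binary units, not a reproduction of any particular published implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=None):
    """One CD-1 step for a Bernoulli RBM on a batch of visible vectors v0."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden activations conditioned on the data.
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one-step reconstruction of the visible units.
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)
    # Approximate gradient ascent on the log-likelihood of the visible data.
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b
```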

The idea is to add a regularization term to the objective function of the data likelihood, which penalizes the deviation of the expected hidden variables from a small constant.

RBMs have also been used to obtain disentangled representations of data, where interesting features map to separate hidden units.

A sufficiently large number of negative samples is typically necessary to prevent catastrophic collapse, in which all inputs are mapped to the same representation.

[9] In either case, the output representations can then be used as an initialization in many different problem settings where labeled data may be limited.

[11] Many self-supervised training schemes have been developed for use in representation learning of various modalities, often first showing successful application in text or image before being transferred to other data types.

[10] A limitation of word2vec is that only the pairwise co-occurrence structure of the data is used, and not the ordering or entire set of context words.

More recent transformer-based representation learning approaches attempt to solve this with word prediction tasks.
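
As an illustration (assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint), masked word prediction and the reuse of the resulting hidden states as contextual representations look roughly like this:

```python
from transformers import AutoModel, AutoTokenizer, pipeline

# Pretraining-style task: predict a masked word from its full context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Feature learning replaces manual feature [MASK]."))

# The same pretrained encoder yields contextual token representations.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Feature learning replaces manual feature engineering.",
                   return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state   # one vector per token
```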

[30] Other self-supervised techniques extend word embeddings by finding representations for larger text structures such as sentences or paragraphs in the input data.

[31] The domain of image representation learning has employed many different self-supervised training techniques, including transformation,[32] inpainting,[33] patch discrimination[34] and clustering.

[36] Many other self-supervised methods use Siamese networks, which generate different views of an image through various augmentations and then align them to have similar representations.

[37] SimCLR is a contrastive approach that uses negative examples to generate image representations with a ResNet CNN.
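
A compact PyTorch sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss used in SimCLR, with the encoder and projection head omitted; the batch size, embedding width and temperature are placeholder values:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss over two batches of projections of augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, d)
    sim = z @ z.T / temperature                    # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    n = z1.size(0)
    # The positive for each view is the other augmentation of the same image;
    # every other sample in the batch acts as a negative example.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example with random stand-ins for the projections of 32 augmented image pairs.
loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```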

[34] Bootstrap Your Own Latent (BYOL) removes the need for negative samples by encoding one of the views with a slow-moving average of the model parameters as they are being modified during training.
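
The slow-moving average amounts to an exponential moving average (EMA) of the online network's parameters; a sketch of that update follows (the decay rate is a typical but assumed value):

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """Update the target (momentum) network as an exponential moving average
    of the online network's parameters, as in BYOL-style training."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(tau).add_((1.0 - tau) * o_param)
```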

[40] Another approach is to maximize mutual information, a measure of similarity, between the representations of associated structures within the graph.

[9] Approaches usually rely on some natural or human-derived association between the modalities as an implicit label, for instance video clips of animals or objects with characteristic sounds,[46] or captions written to describe images.

[47] CLIP produces a joint image-text representation space by training to align image and text encodings from a large dataset of image-caption pairs using a contrastive loss.
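
A sketch of that symmetric contrastive objective for a batch of matched image-caption embeddings, with the image and text encoders omitted (the temperature is an assumed placeholder):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.T / temperature          # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own caption, and each caption its own image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```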

Since some distance functions are invariant under certain linear transformations, different sets of embedding vectors can represent the same or similar information.
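
For example, Euclidean distances between embeddings are unchanged by any orthogonal transformation of the embedding space, which the following NumPy check illustrates:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                  # 100 embedding vectors

# A random orthogonal matrix: a rotation/reflection of the embedding space.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
emb_rotated = emb @ Q

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The pairwise distances are identical, so both sets of embedding vectors
# carry the same relational information.
assert np.allclose(pairwise_distances(emb), pairwise_distances(emb_rotated))
```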

Diagram of the feature learning paradigm in ML for application to downstream tasks, which can be applied to either raw data such as images or text, or to an initial set of features of the data. Feature learning is intended to result in faster training or better performance in task-specific settings than if the data were input directly (compare transfer learning).[1]