Medoid

Medoids are also used in contexts where the centroid is not representative of the dataset, such as images, 3-D trajectories and gene expression[2] (where, although the data may be sparse, the medoid need not be).

A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset.
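
The defining computation is the same in both cases; only the feasible set differs: a medoid is constrained to be one of the data points. A minimal sketch (illustrative names, Euclidean distance assumed):

```python
import math

def medoid(points):
    """Return the data point minimizing total Euclidean distance to all others."""
    def total_dist(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_dist)

# Four corners of a unit square: every corner ties as a medoid (min() keeps
# the first), while the geometric median is the center (0.5, 0.5), which is
# not a point of the dataset.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
m = medoid(points)
```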

[3][4] However, there are many approaches that compute medoids either exactly or approximately in sub-quadratic time under various statistical models.

Correlated Sequential Halving[8] also leverages multi-armed bandit techniques, improving upon Meddit.

By exploiting the correlation structure in the problem, the algorithm provably yields a drastic improvement (usually around 1–2 orders of magnitude) in both the number of distance computations needed and wall-clock time.
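
The full bandit-based algorithms are beyond a short example, but the underlying idea they build on — estimating each point's average distance from a small random reference sample rather than computing all pairwise distances — can be sketched as follows. This is a plain uniform-subsampling estimator, not Correlated Sequential Halving or Meddit themselves; the 1-D data and function names are illustrative:

```python
import random

def approx_medoid(points, sample_size=40, seed=0):
    """Estimate each point's total distance from a random reference sample,
    using O(n * sample_size) distance evaluations instead of O(n^2)."""
    rng = random.Random(seed)
    refs = rng.sample(points, min(sample_size, len(points)))
    # The point minimizing the estimated total distance is returned as the
    # approximate medoid; accuracy grows with sample_size.
    return min(points, key=lambda p: sum(abs(p - r) for r in refs))

data = [float(i) for i in range(101)]  # the exact medoid of 0..100 is 50.0
result = approx_medoid(data)           # close to 50.0 with high probability
```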

Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses.

This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems.

This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text.
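
As a toy illustration of this idea (not a production summarizer; whole-sentence inputs and a Jaccard bag-of-words distance are simplifying assumptions), the most central sentence of a passage can be selected as its medoid:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two bags of words (as sets)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def medoid_sentence(sentences):
    """Return the sentence with the smallest total distance to all others."""
    bags = [s.lower().split() for s in sentences]
    def total(i):
        return sum(jaccard_distance(bags[i], bags[j]) for j in range(len(bags)))
    return sentences[min(range(len(sentences)), key=total)]

doc = [
    "Medoids are representative points of a dataset.",
    "A medoid is the representative point minimizing total distance.",
    "Completely unrelated sentence about the weather today.",
]
summary = medoid_sentence(doc)  # the sentence sharing the most vocabulary
```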

Medoid-based clustering can be applied to group text data based on similar sentiment patterns.

By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation.

[13] When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively.

Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed.
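
For instance, cosine similarity and Euclidean distance can rank the same pair of document vectors differently, because cosine similarity ignores vector magnitude. A sketch with toy 2-D vectors (real text embeddings have far more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    return math.dist(u, v)

# Two vectors pointing in the same direction but with different magnitudes:
# cosine similarity treats them as identical, Euclidean distance does not.
a, b = (1.0, 2.0), (2.0, 4.0)
cos = cosine_similarity(a, b)   # 1.0: same direction
euc = euclidean_distance(a, b)  # sqrt(5): nonzero gap
```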

[15] Medoids can be employed to analyze and understand the vector space representations generated by large language models (LLMs), such as BERT, GPT, or RoBERTa.

By applying medoid-based clustering on the embeddings produced by these models for words, phrases, or sentences, researchers can explore the semantic relationships captured by LLMs.

This approach can help identify clusters of semantically similar entities, providing insights into the structure and organization of the high-dimensional embedding spaces generated by these models.
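
A minimal sketch of such clustering, using toy 2-D vectors in place of real LLM embeddings (which typically have hundreds of dimensions) and a naive alternating k-medoids loop rather than a full PAM implementation:

```python
import math

def k_medoids(points, k, iters=10):
    """Naive alternating k-medoids (no PAM swap phase)."""
    medoids = list(points[:k])  # naive initialization: first k points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, medoids[i]))].append(p)
        # Update step: each medoid becomes the in-cluster point with the
        # smallest total distance to the rest of its cluster.
        new = [min(c, key=lambda p: sum(math.dist(p, q) for q in c)) if c else medoids[i]
               for i, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids

# Two well-separated groups of "embeddings"; the recovered medoids are
# actual data points, one per group.
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
centers = k_medoids(emb, 2)
```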

[16] Active learning involves choosing data points from a training pool that will maximize model performance.

Medoids can play a crucial role in data selection and active learning with LLMs.

Medoid-based clustering can be used to identify representative and diverse samples from a large text dataset, which can then be employed to fine-tune LLMs more efficiently or to create better training sets.
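
One simple diversity-seeking stand-in for this selection step is greedy farthest-first traversal, sketched below on toy 2-D points. This is a deterministic simplification used here for illustration, not medoid-based clustering itself; in practice its picks could seed a k-medoids run whose medoids become the representative samples:

```python
import math

def farthest_first(points, k):
    """Greedy farthest-first traversal: repeatedly pick the point farthest
    from everything already chosen, yielding a diverse subset."""
    chosen = [points[0]]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Three picks from a pool with three natural regions: one pick per region.
pool = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 4.2), (9.0, 0.0)]
picks = farthest_first(pool, 3)
```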

[18] This approach can help in understanding the model's decision-making process, identifying potential biases, and uncovering the underlying structure of the LLM-generated embeddings.

As the discussion around interpretability and safety of LLMs continues to grow, medoids may serve as a valuable tool toward these goals.

As a versatile clustering method, medoids can be applied to a variety of real-world problems in numerous fields, ranging from biology and medicine to advertising, marketing, and social networks.

In gene expression analysis,[19] researchers use technologies such as microarrays and RNA sequencing to measure the expression levels of numerous genes in biological samples, which yields high-dimensional data that can be complex and difficult to analyze.

Medoids offer a potential solution: clustering genes by their expression profiles enables researchers to discover co-expressed groups of genes that can provide valuable insights into the molecular mechanisms of biological processes and diseases.

One popular approach to applying medoids in social network analysis is to compute a distance or similarity metric between pairs of nodes based on their properties.

Medoids can also be employed for market segmentation,[21] an analytical procedure that involves grouping customers based on their purchasing behavior, demographic traits, and other attributes.

Clustering customers into segments using medoids allows companies to tailor their advertising and marketing strategies to the needs of each group.

The medoids serve as representative points within each cluster, encapsulating the primary characteristics of the customers in that group.

High dimensionality does not only affect distance metrics, however: time complexity also increases with the number of features.

Depending on how the medoids are initialized, k-medoids may converge to different local optima, producing different clusters and quality measures,[27] so k-medoids may need to be run multiple times with different initializations, substantially increasing run time.

One way to counterbalance this is to use k-medoids++,[28] an alternative to k-medoids analogous to its k-means counterpart, k-means++, which chooses the initial medoids according to a probability distribution, as a form of "informed randomness" or educated guess.
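
The seeding idea can be sketched as follows: each subsequent medoid is drawn with probability proportional to its squared distance from the nearest medoid chosen so far, so well-separated points are favored. This is an illustrative adaptation of k-means++-style seeding to medoids, not a reference implementation of [28]:

```python
import math
import random

def plusplus_init(points, k, seed=0):
    """D^2-weighted seeding: spread out the k initial medoids."""
    rng = random.Random(seed)
    medoids = [rng.choice(points)]
    while len(medoids) < k:
        # Weight each point by its squared distance to the nearest chosen
        # medoid; already-chosen points get weight 0 and cannot repeat.
        weights = [min(math.dist(p, m) for m in medoids) ** 2 for p in points]
        medoids.append(rng.choices(points, weights=weights, k=1)[0])
    return medoids

# Two distant groups: the two seeds almost surely land in different groups.
data = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0), (10.1, 10.0)]
seeds = plusplus_init(data, 2)
```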

This example shows how cosine similarity compares the angles between object vectors to determine how similar the items are. Note that most text embeddings have at least a few hundred dimensions rather than just two.
This Jaccard similarity formula can easily be applied to text.
This example shows how Euclidean distance measures the distance between objects to determine how similar the items are. Note that most text embeddings have at least a few hundred dimensions rather than just two.
This is an example of how text can be grouped with similar items when embedded, based on location, i.e., grouped by Euclidean distance. If the items were grouped by a different similarity measure, such as cosine similarity, the medoids might differ.
Example of a normalized Jaccard dissimilarity graph using Nuclear Profiles. Each NP is compared to every other NP in the table, and the corresponding dissimilarity is entered into the cell for that pair of NPs. Higher numbers indicate higher dissimilarity, while lower numbers indicate higher similarity. Most labels are excluded due to size constraints. The diagonal is marked as "0" so as not to skew the data.
The initial dataset that will be used throughout this section. A grey dot indicates an object that is unassigned to any cluster.
The first center selection. The large points are centers, and the colors separate each object by its cluster.
The initial clusters.
The medoid selection.
The final clusters.