Medoid

Medoids are also used in contexts where the centroid is not representative of the dataset, such as images, 3-D trajectories and gene expression[2] (where, although the data may be sparse, the medoid need not be).

A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset.
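
The defining computation is the same in both cases; only the feasible set differs: a medoid is constrained to be one of the data points. A minimal sketch (illustrative names, Euclidean distance assumed):

```python
import math

def medoid(points):
    """Return the data point minimizing total Euclidean distance to all others."""
    def total_dist(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_dist)

# Four corners of a unit square: every corner ties as a medoid (min() keeps
# the first), while the geometric median is the center (0.5, 0.5), which is
# not a point of the dataset.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
m = medoid(points)
```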

[3][4] However, there are many approaches that compute medoids either exactly or approximately in sub-quadratic time under various statistical models.

Correlated Sequential Halving[8] also leverages multi-armed bandit techniques, improving upon Meddit.

By exploiting the correlation structure in the problem, the algorithm provably yields a drastic improvement (usually around 1–2 orders of magnitude) in both the number of distance computations needed and wall-clock time.
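
The full bandit-based algorithms are beyond a short example, but the underlying idea they build on — estimating each point's average distance from a small random reference sample rather than computing all pairwise distances — can be sketched as follows. This is a plain uniform-subsampling estimator, not Correlated Sequential Halving or Meddit themselves; the 1-D data and function names are illustrative:

```python
import random

def approx_medoid(points, sample_size=40, seed=0):
    """Estimate each point's total distance from a random reference sample,
    using O(n * sample_size) distance evaluations instead of O(n^2)."""
    rng = random.Random(seed)
    refs = rng.sample(points, min(sample_size, len(points)))
    # The point minimizing the estimated total distance is returned as the
    # approximate medoid; accuracy grows with sample_size.
    return min(points, key=lambda p: sum(abs(p - r) for r in refs))

data = [float(i) for i in range(101)]  # the exact medoid of 0..100 is 50.0
result = approx_medoid(data)           # close to 50.0 with high probability
```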

Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses.

This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems.

This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text.
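
As a toy illustration of this idea (not a production summarizer; whole-sentence inputs and a Jaccard bag-of-words distance are simplifying assumptions), the most central sentence of a passage can be selected as its medoid:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two bags of words (as sets)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def medoid_sentence(sentences):
    """Return the sentence with the smallest total distance to all others."""
    bags = [s.lower().split() for s in sentences]
    def total(i):
        return sum(jaccard_distance(bags[i], bags[j]) for j in range(len(bags)))
    return sentences[min(range(len(sentences)), key=total)]

doc = [
    "Medoids are representative points of a dataset.",
    "A medoid is the representative point minimizing total distance.",
    "Completely unrelated sentence about the weather today.",
]
summary = medoid_sentence(doc)  # the sentence sharing the most vocabulary
```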

Medoid-based clustering can be applied to group text data based on similar sentiment patterns.

By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation.

[13] When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively.

Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed.
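
For instance, cosine similarity and Euclidean distance can rank the same pair of document vectors differently, because cosine similarity ignores vector magnitude. A sketch with toy 2-D vectors (real text embeddings have far more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    return math.dist(u, v)

# Two vectors pointing in the same direction but with different magnitudes:
# cosine similarity treats them as identical, Euclidean distance does not.
a, b = (1.0, 2.0), (2.0, 4.0)
cos = cosine_similarity(a, b)   # 1.0: same direction
euc = euclidean_distance(a, b)  # sqrt(5): nonzero gap
```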

[15] Medoids can be employed to analyze and understand the vector space representations generated by large language models (LLMs), such as BERT, GPT, or RoBERTa.

By applying medoid-based clustering on the embeddings produced by these models for words, phrases, or sentences, researchers can explore the semantic relationships captured by LLMs.

This approach can help identify clusters of semantically similar entities, providing insights into the structure and organization of the high-dimensional embedding spaces generated by these models.
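
A minimal sketch of such clustering, using toy 2-D vectors in place of real LLM embeddings (which typically have hundreds of dimensions) and a naive alternating k-medoids loop rather than a full PAM implementation:

```python
import math

def k_medoids(points, k, iters=10):
    """Naive alternating k-medoids (no PAM swap phase)."""
    medoids = list(points[:k])  # naive initialization: first k points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, medoids[i]))].append(p)
        # Update step: each medoid becomes the in-cluster point with the
        # smallest total distance to the rest of its cluster.
        new = [min(c, key=lambda p: sum(math.dist(p, q) for q in c)) if c else medoids[i]
               for i, c in enumerate(clusters)]
        if new == medoids:
            break
        medoids = new
    return medoids

# Two well-separated groups of "embeddings"; the recovered medoids are
# actual data points, one per group.
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
centers = k_medoids(emb, 2)
```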

[16] Active learning involves choosing data points from a training pool that will maximize model performance.

Medoids can play a crucial role in data selection and active learning with LLMs.

Medoid-based clustering can be used to identify representative and diverse samples from a large text dataset, which can then be employed to fine-tune LLMs more efficiently or to create better training sets.
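
One simple diversity-seeking stand-in for this selection step is greedy farthest-first traversal, sketched below on toy 2-D points. This is a deterministic simplification used here for illustration, not medoid-based clustering itself; in practice its picks could seed a k-medoids run whose medoids become the representative samples:

```python
import math

def farthest_first(points, k):
    """Greedy farthest-first traversal: repeatedly pick the point farthest
    from everything already chosen, yielding a diverse subset."""
    chosen = [points[0]]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Three picks from a pool with three natural regions: one pick per region.
pool = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 4.2), (9.0, 0.0)]
picks = farthest_first(pool, 3)
```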

[18] This approach can help in understanding the model's decision-making process, identifying potential biases, and uncovering the underlying structure of the LLM-generated embeddings.

As the discussion around interpretability and safety of LLMs continues to grow, medoids may serve as a valuable tool toward these goals.

As a versatile clustering method, medoids can be applied to a variety of real-world problems in numerous fields, ranging from biology and medicine to advertising, marketing, and social networks.

In gene expression analysis,[19] researchers use technologies such as microarrays and RNA sequencing to measure the expression levels of numerous genes in biological samples, which yields high-dimensional data that can be complex and difficult to analyze.

Medoids offer a potential solution: clustering genes by their expression profiles enables researchers to discover co-expressed groups of genes that can provide valuable insights into the molecular mechanisms of biological processes and diseases.

One popular approach to applying medoids in social network analysis is to compute a distance or similarity metric between pairs of nodes based on their properties.

Medoids can also be employed for market segmentation,[21] an analytical procedure that involves grouping customers based on their purchasing behavior, demographic traits, and other attributes.

Clustering customers into segments using medoids allows companies to tailor their advertising and marketing strategies to the needs of each group.

The medoids serve as representative points within each cluster, encapsulating the primary characteristics of the customers in that group.

High dimensionality does not only affect distance metrics, however: time complexity also increases with the number of features.

Depending on how the medoids are initialized, k-medoids may converge to different local optima, producing different clusters and quality measures,[27] so k-medoids may need to be run multiple times with different initializations, substantially increasing run time.

One way to counterbalance this is to use k-medoids++,[28] an alternative to k-medoids analogous to its k-means counterpart, k-means++, which chooses the initial medoids according to a probability distribution, as a form of "informed randomness" or educated guess.
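
The seeding idea can be sketched as follows: each subsequent medoid is drawn with probability proportional to its squared distance from the nearest medoid chosen so far, so well-separated points are favored. This is an illustrative adaptation of k-means++-style seeding to medoids, not a reference implementation of [28]:

```python
import math
import random

def plusplus_init(points, k, seed=0):
    """D^2-weighted seeding: spread out the k initial medoids."""
    rng = random.Random(seed)
    medoids = [rng.choice(points)]
    while len(medoids) < k:
        # Weight each point by its squared distance to the nearest chosen
        # medoid; already-chosen points get weight 0 and cannot repeat.
        weights = [min(math.dist(p, m) for m in medoids) ** 2 for p in points]
        medoids.append(rng.choices(points, weights=weights, k=1)[0])
    return medoids

# Two distant groups: the two seeds almost surely land in different groups.
data = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0), (10.1, 10.0)]
seeds = plusplus_init(data, 2)
```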

This example shows how cosine similarity compares the angles between object vectors to determine how similar the items are. Note that most text embeddings have at least a few hundred dimensions rather than just two.
This Jaccard similarity formula can easily be applied to text.
This example shows how Euclidean distance measures the distance between objects to determine how similar the items are. Note that most text embeddings have at least a few hundred dimensions rather than just two.
This is an example of how text can be grouped with similar items when embedded, based on location, i.e., grouped by Euclidean distance. If the items were grouped by a different similarity measure, such as cosine similarity, the medoids might differ.
Example of a normalized Jaccard dissimilarity graph using Nuclear Profiles. Each NP is compared to every other NP in the table, and the corresponding dissimilarity is entered into the cell for that pair of NPs. Higher numbers indicate higher dissimilarity, while lower numbers indicate higher similarity. Most labels are excluded due to size constraints. The diagonal is marked as "0" so as not to skew the data.
The initial dataset that will be used throughout this section. A grey dot indicates an object that is unassigned to any cluster.
The first center selection. The large points are centers, and the colors separate each object by its cluster.
The initial clusters.
The medoid selection.
The final clusters.