Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection.[1] Paraphrasing is also useful in the evaluation of machine translation,[2] as well as in semantic parsing[3] and the generation[4] of new samples to expand existing corpora.[5]
Barzilay and Lee[5] proposed a method to generate paraphrases using monolingual parallel corpora, namely news articles covering the same event on the same day.
Training consists of using multiple-sequence alignment to generate sentence-level paraphrases from an unannotated corpus.
Pairings between patterns are then found by comparing similar variable words between different corpora.
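As a toy illustration of the pattern idea (not the cited multiple-sequence-alignment procedure), the sketch below word-aligns two invented sentences describing the same event and abstracts the positions where they disagree into variable slots; the example sentences and the use of Python's difflib are assumptions made purely for illustration.

```python
# Toy sketch: align two sentences about the same event and turn the
# positions where they differ into variable slots of a shared pattern.
import difflib

a = "rebels killed 12 soldiers in the northern province".split()
b = "rebels killed 7 soldiers in the eastern province".split()

pattern = []
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag == "equal":
        pattern.extend(a[i1:i2])       # shared words stay literal
    else:
        pattern.append("<SLOT>")       # differing words become a slot
print(" ".join(pattern))  # rebels killed <SLOT> soldiers in the <SLOT> province
```

Patterns extracted this way can then be paired by checking which slot fillers they share, in the spirit of the pairing step described above.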
There has been success in using long short-term memory (LSTM) models to generate paraphrases.[7] In short, the model consists of an encoder and a decoder component, both implemented using variations of a stacked residual LSTM.
The encoding LSTM reads the input sentence and produces a final hidden vector that represents it; the decoding LSTM takes this hidden vector as input and generates a new sentence, terminating in an end-of-sentence token.
The encoder and decoder are trained to take a phrase and reproduce the one-hot distribution of a corresponding paraphrase by minimizing perplexity using simple stochastic gradient descent.
New paraphrases are generated by inputting a new phrase to the encoder and passing the output to the decoder.
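A minimal sketch of this encoder-decoder setup is given below, assuming PyTorch; the layer sizes, vocabulary handling, and greedy decoding loop are illustrative assumptions, and the stacked residual connections of the cited model are omitted.

```python
import torch
import torch.nn as nn

class Seq2SeqParaphraser(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, eos_id=1):
        super().__init__()
        self.eos_id = eos_id
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # distribution over the vocabulary

    def forward(self, src_ids, tgt_ids):
        # Teacher-forced training pass: encode the source phrase, decode the
        # reference paraphrase, and return per-token logits for the loss.
        _, state = self.encoder(self.embed(src_ids))
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)

    @torch.no_grad()
    def generate(self, src_ids, start_id=2, max_len=30):
        # Greedy decoding: the encoder's final hidden state seeds the decoder,
        # which emits one token at a time until the end-of-sentence token.
        _, state = self.encoder(self.embed(src_ids))
        token = torch.full((src_ids.size(0), 1), start_id, dtype=torch.long)
        generated = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.embed(token), state)
            token = self.out(dec_out).argmax(dim=-1)
            generated.append(token)
            if (token == self.eos_id).all():
                break
        return torch.cat(generated, dim=1)
```

In training, the per-token logits would be fed to a cross-entropy loss (whose exponential is the perplexity being minimized) and optimized with stochastic gradient descent, matching the description above.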
With the introduction of Transformer models, paraphrase generation approaches improved their ability to generate text by scaling neural network parameters and heavily parallelizing training through feed-forward layers.[8] These models are so fluent in generating text that human experts cannot reliably identify whether an example was human-authored or machine-generated.[9] Transformer-based paraphrase generation relies on autoencoding, autoregressive, or sequence-to-sequence (seq2seq) methods.
Autoencoder models predict word-replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and seq2seq models generate new text based on the source, predicting one word at a time.[10][11] More advanced efforts also exist to make paraphrasing controllable according to predefined quality dimensions, such as semantic preservation or lexical diversity.
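As an example of the sequence-to-sequence route, the sketch below uses the Hugging Face transformers library; the checkpoint name and the "paraphrase:" prompt format are placeholders standing in for any seq2seq model that has been fine-tuned on paraphrase pairs.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical fine-tuned checkpoint; substitute any seq2seq paraphrase model.
checkpoint = "your-org/t5-paraphrase"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# The source sentence is encoded, and the decoder generates paraphrases
# autoregressively, one token at a time; beam search returns several candidates.
inputs = tokenizer("paraphrase: The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=3, max_new_tokens=32)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```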
An early approach to paraphrase recognition uses recursive autoencoders. The main concept is to produce a vector representation of a sentence and its components by recursively applying an autoencoder.
Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion.
The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.
For example, given two sentences of length 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings.
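That counting can be reproduced with a toy recursive encoder. The sketch below illustrates only the recursion (the vector dimension and merge network are assumptions, and the reconstruction objective used in training is omitted); it is not the cited implementation.

```python
import torch
import torch.nn as nn

dim = 50
merge = nn.Linear(2 * dim, dim)  # encoder step: combines two child vectors into a parent

def recursive_encode(word_vectors):
    """Return every vector in the recursion tree, starting with the word embeddings."""
    level = list(word_vectors)
    all_vectors = list(level)
    while len(level) > 1:
        # With an odd number of inputs, the first vector is forwarded as-is
        # to the next level of recursion, as described above.
        next_level = [level[0]] if len(level) % 2 == 1 else []
        start = len(level) % 2
        for i in range(start, len(level) - 1, 2):
            parent = torch.tanh(merge(torch.cat([level[i], level[i + 1]])))
            next_level.append(parent)
            all_vectors.append(parent)
        level = next_level
    return all_vectors

sent_a = [torch.randn(dim) for _ in range(4)]
sent_b = [torch.randn(dim) for _ in range(3)]
print(len(recursive_encode(sent_a)), len(recursive_encode(sent_b)))  # 7 5
```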
The encoder and decoder can be implemented through the use of a recursive neural network (RNN) or an LSTM.
Another approach represents each sentence as a skip-thought vector, a sentence embedding trained to predict the surrounding sentences. A simple logistic regression can then be trained to good performance, using the absolute difference and component-wise product of two skip-thought vectors as input.
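A sketch of that classifier is shown below; the random arrays stand in for real skip-thought encodings (which this snippet does not compute), and the vector dimensionality and use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim = 200, 300                     # toy sizes; real skip-thought vectors are larger
v1 = rng.normal(size=(n_pairs, dim))        # vector for the first sentence of each pair
v2 = rng.normal(size=(n_pairs, dim))        # vector for the second sentence
labels = rng.integers(0, 2, size=n_pairs)   # 1 = paraphrase, 0 = not

# Features: absolute difference and component-wise product of the two vectors.
features = np.hstack([np.abs(v1 - v2), v1 * v2])
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print(classifier.score(features, labels))
```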
Transformers have also been applied to recognition: models such as BERT can be adapted with a binary classification layer and trained end-to-end on identification tasks.[16][17] Transformers achieve strong results when transferring between domains and paraphrasing techniques, compared with more traditional machine-learning methods such as logistic regression.
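A sketch of the BERT-with-classification-head setup using the Hugging Face transformers library is shown below; the untuned bert-base-uncased checkpoint only illustrates the wiring, and fine-tuning on labeled sentence pairs is omitted.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized binary classification head on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The two sentences are packed into a single input; after fine-tuning, the
# logits would score whether the pair is a paraphrase.
inputs = tokenizer("How old are you?", "What is your age?", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))
```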
Other successful methods based on the Transformer architecture include adversarial learning and meta-learning.
Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics, such as accuracy, F1 score, or ROC curves, apply relatively well.
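For instance, given a held-out set of labeled sentence pairs, the usual scikit-learn metrics apply directly; the predictions and scores below are toy values.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold paraphrase labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # model's hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # model's predicted probabilities

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))
```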
ParaMetric rates paraphrasing systems by comparing their automatic phrase alignments against manually created alignments. A notable drawback to ParaMetric is the large and exhaustive set of manual alignments that must initially be created before a rating can be produced.
Automated approaches to evaluation prove challenging, as evaluation is essentially a problem as difficult as paraphrase recognition itself.
Machine translation metrics such as BLEU have been used; however, paraphrases often have several lexically different but equally valid realizations, which hurts BLEU and similar n-gram overlap metrics.
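The toy example below shows this effect using NLTK's sentence-level BLEU: the candidate preserves the meaning of the reference but shares few n-grams with it, so the score is low. The sentences are invented for illustration.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "the results were published last year".split()
candidate = "the findings appeared in print a year ago".split()  # valid paraphrase

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], candidate, smoothing_function=smooth))  # low despite equivalent meaning
```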
PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated using N-gram overlap in a pivot language.[23]
The most consistently reliable approaches to paraphrase detection have all used the Transformer architecture, and all have relied on large amounts of pre-training on more general data before fine-tuning on the question pairs.