Adversarial stylometry

All adversarial stylometry shares the core idea of faithfully paraphrasing the source text so that the meaning is unchanged but the stylistic signals are obscured.

[1] Brennan & Greenstadt (2009) introduced the first corpus of adversarially authored texts specifically for evaluating stylometric methods;[3] other corpora include the International Imitation Hemingway Competition, the Faux Faulkner contest, and the hoax blog A Gay Girl in Damascus.

[4] Rao & Rohatgi (2000) suggest that short, unattributed documents (i.e., anonymous posts) are not at risk of stylometric identification, but that pseudonymous authors who have produced corpora of thousands of words without practicing adversarial stylometry may be vulnerable.

[11] Gröndahl & Asokan (2020a) say that the general validity of the hypothesis underlying stylometry—that authors have invariant, content-independent 'style fingerprints'—is uncertain, but "the deanonymisation attack is a real privacy concern".

[12] Those interested in practicing adversarial stylometry and stylistic deception include whistleblowers avoiding retribution;[13] journalists and activists;[10] perpetrators of frauds and hoaxes;[14] authors of fake reviews;[15] literary forgers;[16] criminals disguising their identity from investigators;[17] and, generally, anyone with a desire for anonymity or pseudonymity.

[2] Wang, Juola & Riddell (2022) found that gross errors introduced by Google Translate were rare, though more common when several intermediate translations were used. However, occasional simple or short sentences and misspellings in the source text appeared verbatim in the output, potentially providing an identifying signal.
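The risk of source sentences surviving unchanged can be checked mechanically before a rewritten text is published. The following is a minimal sketch, not a method taken from the cited study: the function names `split_sentences` and `verbatim_carryover` are hypothetical, and the sentence splitter is a deliberately naive assumption that breaks only on terminal punctuation followed by whitespace.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption for illustration):
    breaks after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def verbatim_carryover(source, output):
    """Return the source sentences that reappear unchanged in the output.

    Any sentence returned here passed through the rewriting step
    untouched and so still carries the author's original wording.
    """
    output_sentences = set(split_sentences(output))
    return [s for s in split_sentences(source) if s in output_sentences]
```

An author could run such a check on the input and output of a round-trip translation and manually rework any sentence it reports. It detects only exact matches; near-verbatim sentences and preserved misspellings inside otherwise changed sentences would need a finer comparison.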

[2][32] Stylometric signals vary in how simply they can be adversarially masked; an author may easily change their vocabulary by conscious choice, but altering the pattern of grammar or the letter frequency in their text may be harder to achieve, though Juola & Vescovi (2011) report that imitation typically succeeds at masking more characteristics than obfuscation.
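To illustrate the kinds of signal described above, the sketch below computes two simple profiles of a text: relative letter frequencies and the rates of a few function words. It is an illustrative assumption rather than any published feature set; the `FUNCTION_WORDS` list is hypothetical and far smaller than those used in practice, and `profile_distance` is a plain L1 distance chosen for simplicity.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny function-word list; real stylometric
# feature sets are much larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "but")


def letter_frequencies(text):
    """Relative frequency of each ASCII letter, ignoring case."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    if not letters:
        return {}
    counts = Counter(letters)
    return {c: counts[c] / len(letters) for c in string.ascii_lowercase}


def function_word_rates(text):
    """Share of all words taken up by each listed function word."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]
    if not words:
        return {}
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in FUNCTION_WORDS}


def profile_distance(p, q):
    """L1 distance between two profiles; missing keys count as zero."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Comparing the profiles of a text before and after rewriting gives a rough indication of which signals actually moved: a change of vocabulary alters the word-level profile directly, while the letter-frequency profile tends to shift only as a side effect.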

[36][37] How best to mask stylometric characteristics in practice, and which tasks to perform manually, which with tool assistance, and which fully automatically, remains an open field of research, especially for short documents with limited potential variability.

[40] Manual stylistic modulation requires significant effort and scales poorly; tool assistance can reduce the burden to varying degrees.

[34] Further, an author who chooses a method must rely on their threat model and trust that it is valid: that unknown analyses able to detect remaining stylistic signals cannot or will not be performed, or that the masking successfully transfers;[50] a stylometrist who knows how the author attempted to mask their style, however, may be able to exploit some weakness in the method and render it unsafe.

[53] Rewriting an input text to defeat stylometry, as opposed to consciously removing stylistic characteristics during composition, poses challenges in retaining textual meaning.

[11] For sensibility, if a text is so ungrammatical as to be incomprehensible, or so ill-formed that it cannot fit into its genre, then the method has failed, but compromises short of that point may be useful.

[44] If inconspicuity is partially lost, then there is the possibility that more expensive and less scalable analyses will be performed (e.g., consulting a forensic linguist) to confirm suspicions or gather further evidence.

[58] However, Gröndahl & Asokan (2020a) assess the existing evidence as insufficient to prove that adversarial stylometry is always detectable, since only a limited range of methods has been studied.

[63] Kacmarcik & Gamon (2006) observe that low-dimensional stylometric models which operate on small numbers of features are less resistant to adversarial stylometry.

[65] Potthast, Hagen & Stein (2016) reported that even simple automated methods of adversarial stylometry caused major difficulties for state-of-the-art authorship identification systems, though at a significant cost to soundness and sensibility.