Perceiver

The Perceiver uses a small set of latent units that form an attention bottleneck through which the inputs must pass.
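
A minimal sketch of this bottleneck in PyTorch (the names `num_latents`, `latent_dim`, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class LatentBottleneck(nn.Module):
    """A small, fixed set of latents cross-attends to an arbitrarily large input array."""
    def __init__(self, num_latents=256, latent_dim=512, input_dim=512, num_heads=8):
        super().__init__()
        # Learned latent array: its size is chosen independently of the input size.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=input_dim, vdim=input_dim,
            num_heads=num_heads, batch_first=True)

    def forward(self, inputs):  # inputs: (batch, M, input_dim), M can be very large
        batch = inputs.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Queries come from the latents, keys/values from the inputs:
        # every input element must pass through this narrow attention bottleneck.
        out, _ = self.cross_attn(latents, inputs, inputs)
        return out  # (batch, num_latents, latent_dim)
```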

[1] It associates position- and modality-specific features with every input element (e.g. every pixel or audio sample).
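
One way to build such per-element features is with Fourier position encodings, sketched below; the band count, frequency range, and concatenation layout here are assumptions for illustration:

```python
import torch

def fourier_position_features(positions, num_bands=16, max_freq=10.0):
    """Map normalized positions in [-1, 1] to sin/cos Fourier features.

    positions: (num_elements, num_dims), e.g. (H*W, 2) for pixel coordinates.
    The returned features are concatenated to each input element (pixel, audio sample, ...).
    """
    freqs = torch.linspace(1.0, max_freq / 2.0, num_bands)       # (num_bands,)
    scaled = positions.unsqueeze(-1) * freqs * torch.pi          # (num_elements, num_dims, num_bands)
    feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1)      # (num_elements, num_dims, 2*num_bands)
    return torch.cat([positions, feats.flatten(1)], dim=-1)

# Example: position features for a 4x4 image grid, later concatenated with RGB values.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 4), torch.linspace(-1, 1, 4), indexing="ij")
grid = torch.stack([ys, xs], dim=-1).reshape(-1, 2)              # (16, 2)
pos_feats = fourier_position_features(grid)                      # (16, 2 + 2*2*16)
```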

[1] The Perceiver uses cross-attention to give its layers linear complexity in the input size and to decouple network depth from that size.
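
A back-of-the-envelope comparison makes this scaling concrete; the image size, latent count, and depth below are illustrative, not values from the paper:

```python
# Attention score counts for a 224x224 RGB image treated as M = 50,176 input elements.
M = 224 * 224   # input elements
N = 512         # latent units (fixed, independent of M)
L = 8           # latent self-attention blocks; depth no longer multiplies the M^2 term

self_attention_on_inputs = M * M     # ~2.5e9 scores per layer, quadratic in M
perceiver_cross_attention = M * N    # ~2.6e7 scores, linear in M
perceiver_latent_stack = L * N * N   # ~2.1e6 scores, independent of M

print(self_attention_on_inputs, perceiver_cross_attention, perceiver_latent_stack)
```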

Perceiver IO can flexibly query the model's latent space to produce outputs of arbitrary size and semantics.
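
A sketch of that querying step, continuing the PyTorch assumptions above (the construction of `output_queries` is hypothetical):

```python
import torch
import torch.nn as nn

class LatentQueryDecoder(nn.Module):
    """Output queries cross-attend to the latents; the number of queries sets the output size."""
    def __init__(self, query_dim=256, latent_dim=512, out_dim=3, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, kdim=latent_dim, vdim=latent_dim,
            num_heads=num_heads, batch_first=True)
        self.proj = nn.Linear(query_dim, out_dim)

    def forward(self, output_queries, latents):
        # output_queries: (batch, O, query_dim) -- one query per desired output element
        # latents:        (batch, N, latent_dim) -- the encoder's bottleneck
        attended, _ = self.cross_attn(output_queries, latents, latents)
        return self.proj(attended)  # (batch, O, out_dim): O is arbitrary

# E.g. O = H*W queries built from pixel position features yields a dense per-pixel output
# such as an optical flow field; O = num_classes queries yields classification logits.
```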

It achieves strong results on tasks with structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-tasking.

Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.