Overlapping markup

In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner.

Contiguous overlap can always be represented as a linear document with milestones (typically co-indexed start- and end-markers), without the need for fragmenting a (logical) component into multiple physical ones.

[9] With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.

[11] SGML, which early versions of HTML were based on, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any.

This feature has been poorly supported by tools and has seen very little actual use; using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor.

[16] To illustrate these approaches, the markup of the sentences and lines of a fragment of Richard III by William Shakespeare is used as a running example.

The advantage of this approach is that each document is simple and can be processed with existing tools. Its drawbacks are that redundant content must be maintained and that it can be difficult to cross-reference between different views.

[18][19] Schmidt (2012, 3.5 Variation) recommends this approach for encoding multiple variants of a single text, accepting the duplication of the parts that do not vary rather than attempting to create a structure that represents all of the variation present; further, he suggests that this alignment be performed automatically, and that misalignment is rare in practice.

Milestones can be used to embed a non-privileged structure within a hierarchical language. In their basic form they can only represent contiguous overlap.

Generic XML tools can of course parse the milestone elements, but they do not understand their special meaning and so cannot easily process or validate the non-privileged structure.
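The extra work that milestones push onto the application can be sketched as follows. This is a minimal illustration, not a standard encoding: the element names `s`, `lstart` and `lend`, the co-indexing attribute `n`, and the two-line fragment itself are assumptions chosen for the example. Sentences are the privileged hierarchy; lines are recovered by hand from the empty start- and end-markers.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: <s> elements form the privileged hierarchy,
# while empty <lstart n="..."/> and <lend n="..."/> elements are
# co-indexed milestones marking where each verse line begins and ends.
DOC = (
    '<text>'
    '<lstart n="1"/><s>So much for that.</s> '
    '<s>The silent hours steal on,<lend n="1"/> '
    '<lstart n="2"/>And flaky darkness breaks within the east.</s>'
    '<lend n="2"/>'
    '</text>'
)


def sentences(root):
    """The privileged structure is directly available from the tree."""
    return ["".join(s.itertext()) for s in root.iter("s")]


def recover_lines(root):
    """Rebuild the non-privileged structure by walking in document order."""
    open_lines = {}  # milestone index -> text pieces collected so far
    lines = {}       # milestone index -> completed line text

    def feed(text):
        # Every currently open line receives the text that follows.
        if text:
            for parts in open_lines.values():
                parts.append(text)

    def walk(elem):
        if elem.tag == "lstart":
            open_lines[elem.get("n")] = []
        elif elem.tag == "lend":
            lines[elem.get("n")] = "".join(open_lines.pop(elem.get("n")))
        feed(elem.text)
        for child in elem:
            walk(child)
            feed(child.tail)

    walk(root)
    return lines


root = ET.fromstring(DOC)
print(sentences(root))
print(recover_lines(root))
```

The parser delivers the sentences for free, whereas the lines exist only because `recover_lines` knows what the two marker elements mean; a schema validator would not notice a missing or mismatched `lend`.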

[30] Join-based representations can introduce the possibility of cycles between elements; detecting and rejecting these adds complexity to implementations.
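One simple kind of cycle can be sketched under the assumption that each fragment carries at most one pointer to the fragment that continues it. The mapping `next_of` and the fragment identifiers are hypothetical; real join mechanisms may allow richer links that need a more general graph check.

```python
def find_cycle(next_of):
    """Return a list of fragment ids forming a cycle, or None.

    `next_of` maps a fragment id to the id of the fragment that
    continues it, or to None when the chain ends there.
    """
    for start in next_of:
        seen = []
        node = start
        while node is not None:
            if node in seen:
                # The chain has looped back onto an earlier fragment.
                return seen[seen.index(node):]
            seen.append(node)
            node = next_of.get(node)
    return None


# A well-formed chain of three fragments making up one logical element.
ok = {"a1": "a2", "a2": "a3", "a3": None}
# A malformed chain in which the last fragment points back to the first.
bad = {"a1": "a2", "a2": "a3", "a3": "a1"}

print(find_cycle(ok))
print(find_cycle(bad))
```

An implementation would run a check of this kind before assembling logical elements from their fragments, and reject the document if a cycle is reported.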

[33] It has been claimed that separating markup and text can result in overall simplification and increased maintainability,[34] and by 2017, "[t]he current state of the art to [represent] (...) linguistically annotated data is to use a graph-based representation serialized as standoff XML as a pivot format",[35] i.e., that standoff was the most widely accepted approach to address the overlapping markup challenge.
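The separation of text and markup can be sketched in a few lines, assuming annotations are stored as character-offset spans over an unmodified string. The layer names, identifiers and the fragment are illustrative assumptions; actual standoff formats typically serialize such spans as XML that points into a separate text file.

```python
# The primary text is stored once and never modified by annotation.
TEXT = ("So much for that. The silent hours steal on, "
        "And flaky darkness breaks within the east.")


def span(fragment):
    """Locate a fragment in TEXT and return (start, end) offsets."""
    start = TEXT.index(fragment)
    return (start, start + len(fragment))


# Each annotation is (layer, id, (start, end)); layers are independent,
# so neither sentences nor lines are privileged.
ANNOTATIONS = [
    ("sentence", "s1", span("So much for that.")),
    ("sentence", "s2", span("The silent hours steal on, "
                            "And flaky darkness breaks within the east.")),
    ("line", "l1", span("So much for that. The silent hours steal on,")),
    ("line", "l2", span("And flaky darkness breaks within the east.")),
]


def properly_overlap(a, b):
    """True if the spans intersect and neither contains the other."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1


def overlapping_pairs(annotations):
    """List id pairs whose spans overlap non-hierarchically."""
    pairs = []
    for i, (_, id_a, span_a) in enumerate(annotations):
        for _, id_b, span_b in annotations[i + 1:]:
            if properly_overlap(span_a, span_b):
                pairs.append((id_a, id_b))
    return pairs


print(overlapping_pairs(ANNOTATIONS))
```

Because the annotations only point into the text, adding a further layer does not disturb the existing ones, and overlap becomes a property that can be queried rather than a well-formedness problem.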

[40] Standoff formalisms are not natively supported by database management systems, so by 2017 it was suggested to "use ... standoff XML as a pivot format (...) and relational data bases for querying."[35] In practical applications, this requires complicated architectures and/or labor-intensive transformation between the pivot format and the internal representation.

[41] This has been a motivation to develop corpus management systems on the basis of graph databases and to use established graph-based formalisms as pivot formats.

Standoff annotations can thus be more adequately represented as generalised directed multigraphs and use formalisms and technologies developed for this purpose, most notably those based on the Resource Description Framework (RDF).

RDF is semantically equivalent to graph-based data models underlying standoff markup; it does not require special-purpose technology for storing, parsing and querying.
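The graph view can be sketched with plain Python tuples standing in for RDF triples. This is only an illustration of the data model, not an RDF toolkit: the namespace `http://example.org/` and the property names `layer`, `beginsIn`, `endsIn` and `overlaps` are hypothetical and do not come from any published vocabulary.

```python
EX = "http://example.org/"  # hypothetical namespace for this sketch

# Each triple is (subject, predicate, object). Two nodes may be linked
# by several differently labelled edges, which makes this a multigraph.
TRIPLES = [
    (EX + "s2", EX + "layer", "sentence"),
    (EX + "l1", EX + "layer", "line"),
    (EX + "l2", EX + "layer", "line"),
    (EX + "s2", EX + "beginsIn", EX + "l1"),
    (EX + "s2", EX + "overlaps", EX + "l1"),
    (EX + "s2", EX + "endsIn", EX + "l2"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# All nodes on the line layer.
print([t[0] for t in match(p=EX + "layer", o="line")])
# Every edge from sentence s2 to line l1.
print(match(s=EX + "s2", o=EX + "l1"))
```

In practice the same triples would be written in an RDF serialization and queried with a standard RDF query language instead of the hand-written `match` function used here.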

Multiple interlinked RDF files representing a document or a corpus constitute an example of Linguistic Linked Open Data.