Canonicalization

This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.

Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to the simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal vulnerability.

Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations.

In this context, canonicalization is the process of translating every string character to its single valid byte sequence.

[3] Most search engines support the Canonical link element as a hint to which URL should be treated as the true version.

As indicated by John Mueller of Google, having other directives in a page, like the robots noindex element can give search engines conflicting signals about how to handle canonicalization [4] Example: All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.

Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

A simple example would be the following two snippets of XML: The first example contains extra spaces in the closing tag of the first node.

A full summary of canonicalization changes is listed below: In morphology and lexicography, a lemma is the canonical form of a set of words.