Semantic heterogeneity

Semantic heterogeneity arises when database schemas or datasets for the same domain are developed by independent parties, resulting in differences in the meaning and interpretation of data values.

Decomposing the various sources of semantic heterogeneities provides a basis for understanding how to map and transform data to overcome these differences.
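As a rough illustration of mapping and transforming data, the following minimal Python sketch reconciles two hypothetical datasets that describe the same domain. The field names, the temperature units, and the target schema are all assumptions made for this example; they do not come from any particular classification scheme.

```python
# Two hypothetical sources describing the same domain (weather readings),
# developed independently. Field names and units are assumed for illustration.
source_a = [{"station": "Berlin", "temp_c": 21.5}]          # Celsius
source_b = [{"site_name": "Berlin", "temperature": 70.7}]   # assumed Fahrenheit


def from_source_a(record):
    """Map a source A record onto the shared (hypothetical) target schema."""
    return {"station": record["station"], "temp_celsius": record["temp_c"]}


def from_source_b(record):
    """Rename the synonymous field and convert the unit for source B."""
    celsius = (record["temperature"] - 32) * 5 / 9
    return {"station": record["site_name"], "temp_celsius": round(celsius, 1)}


# After mapping, both sources share one naming convention and one unit.
unified = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
```

Each mapping function addresses one identified difference (a synonymous field name, a unit mismatch), which is why separating the sources of heterogeneity first makes the transformation easier to write.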

One of the first known classification schemes applied to data semantics is from William Kent more than two decades ago.[2] Kent's approach dealt more with structural mapping issues than with differences in meaning, which he suggested data dictionaries could potentially resolve.

[5] This table shows the combined 40 possible sources of semantic heterogeneities across sources:

Language
Encoding
For example, ASCII v UTF-8
Ambiguous sentence references, such as "I'm glad I'm a man, and so is Lola" (Lola by Ray Davies and the Kinks)
Synonyms
Acronyms
Homonyms
When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state)
When two individuals are asserted as being the same when they are actually distinct (for example, John F. Kennedy the president v John F. Kennedy the aircraft carrier)
Domain
Data representation
Confusion often arises in the use of literals v URIs v object types
Data
A common problem, more acute with closed world approaches than with open world ones

A different approach toward classifying semantics and integration approaches is taken by Sheth et al.[6] Under their concept, they split semantics into three forms: implicit, formal and powerful.
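Returning to the list of sources above, the case of two distinct individuals being asserted as the same can be sketched in a few lines of Python. The records, the "kind" field, and both matching functions are hypothetical and serve only to show how matching on a shared label alone can conflate different things.

```python
# Hypothetical records from two independently built datasets.
people = [{"label": "John F. Kennedy", "kind": "person"}]
ships = [{"label": "John F. Kennedy", "kind": "aircraft carrier"}]


def match_by_label(left, right):
    """Naive matching: treat records with the same label as the same individual."""
    return [(a, b) for a in left for b in right if a["label"] == b["label"]]


def match_by_label_and_kind(left, right):
    """Stricter matching: also require that both records describe the same kind of thing."""
    return [
        (a, b)
        for a in left
        for b in right
        if a["label"] == b["label"] and a["kind"] == b["kind"]
    ]


naive = match_by_label(people, ships)             # pairs the president with the ship
checked = match_by_label_and_kind(people, ships)  # finds no match
```

Checking one extra attribute is only a stand-in here; in practice, distinguishing individuals generally depends on whatever identifying information the sources actually provide.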