In social systems, data friction consumes energy and produces turbulence and heat – that is, conflicts, disagreements, and inexact, unruly processes.
[16] After the Second World War, large scientific projects increasingly relied on knowledge infrastructures to collect, process, and analyze large amounts of data.
Punch-card systems were first used experimentally on climate data in the 1920s and were applied on a large scale in the following decade: "In one of the first Depression-era government make-work projects, Civil Works Administration workers punched some 2 million ship log observations for the period 1880–1933."
The first initiative to create an electronic bibliographic database of open-access data was the Educational Resources Information Center (ERIC) in 1966.
[19] Knowledge infrastructures were also set up in space engineering (with NASA/RECON), library search (with OCLC Worldcat), and the social sciences: "The 1960s and 1970s saw the establishment of over a dozen services and professional associations to coordinate quantitative data collection".
[20] Early discourses and policy frameworks on open scientific data emerged immediately in the wake of the creation of the first large knowledge infrastructures.
[25] Christine Borgman does not recall any significant policy debates after 1966 over the meaning, the production, and the circulation of scientific data, save for a few specific fields (like climatology).
Conversely, the Worm Community System could only be browsed on specific terminals shared across scientific institutions: "To take on board the custom-designed, powerful WCS (with its convenient interface) is to suffer inconvenience at the intersection of work habits, computer use, and lab resources (…) The World-Wide Web, on the other hand, can be accessed from a broad variety of terminals and connections, and Internet computer support is readily available at most academic institutions and through relatively inexpensive commercial services."
The development and generalization of the World Wide Web lifted numerous technical barriers and frictions that had constrained the free circulation of data.
[43] As they fully acknowledge the complexity of data management, the principles do not claim to introduce a set of rigid recommendations but rather "degrees of FAIRness" that can be adjusted depending on organizational costs as well as external restrictions regarding copyright or privacy.
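To illustrate the idea of graded rather than binary compliance, the following sketch (not an official FAIR assessment tool, and not drawn from the cited sources) scores a dataset record against a few simplified criteria, so that a dataset whose access is restricted for privacy or copyright reasons can still reach a partial degree of FAIRness. All field names, criteria, and weights are assumptions made for this example.

```python
# Illustrative sketch only: a crude, equal-weight scoring of a dataset record
# against four simplified FAIR-inspired criteria. Real FAIR assessments use
# richer metrics; this only shows why compliance can be a matter of degree.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    identifier: str | None = None         # e.g. a DOI (Findable)
    metadata_standard: str | None = None   # e.g. "DataCite" (Interoperable)
    access_url: str | None = None          # a retrievable location (Accessible)
    license: str | None = None             # e.g. "CC0-1.0" (Reusable)

def fairness_score(record: DatasetRecord) -> float:
    """Return a 0.0-1.0 score; each satisfied criterion counts equally."""
    checks = [
        record.identifier is not None,
        record.metadata_standard is not None,
        record.access_url is not None,   # may legitimately stay empty (privacy, copyright)
        record.license is not None,
    ]
    return sum(checks) / len(checks)

# A record whose access is restricted (no public URL) still reaches 0.75
# rather than failing outright, one way to read "degrees of FAIRness".
restricted = DatasetRecord(
    identifier="10.1234/example",   # hypothetical DOI, for illustration only
    metadata_standard="DataCite",
    license="CC-BY-4.0",
)
print(fairness_score(restricted))
```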
[52][53] In 2021, Colavizza et al. identified three categories or levels of access. Supplementary data files appeared in the early phase of the transition to digital scientific publishing.
[57] The mutability of computer memory was especially challenging: in contrast with printed publications, digital data could not be expected to remain stable in the long run.
Data management has to find a middle ground between continuous enhancement and some form of stability: "the concept of a fluid, changeable, continually improving data archive means that study cleaning and other processing must be carried to such a point that changes will not significantly affect prior analyses".[58] Structured bibliographic metadata for databases has been a debated topic since the 1960s.
[57] In 1977, the American Standard for Bibliographic Reference adopted a definition of "data file" with a strong focus on the materiality and the mutability of the dataset: neither dates nor authors were indicated, but the medium or "Packaging Method" had to be specified.
Permanent digital object identifiers (DOIs) were introduced for scientific articles to avoid broken links as website structures continually evolve.
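As a minimal sketch of how such persistent identifiers work in practice, the snippet below resolves a DOI through the public doi.org resolver and returns whatever landing URL it currently redirects to; the DOI itself stays stable even when that URL changes. It assumes network access, and the specific DOI used (the 2016 paper introducing the FAIR principles) is only an illustration, not a reference taken from this section.

```python
# Sketch of DOI resolution via the public doi.org resolver: the persistent
# identifier is stable, while the publisher URL behind it may change over time.
import urllib.request

def resolve_doi(doi: str) -> str:
    """Follow the doi.org redirect chain and return the final URL."""
    request = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"User-Agent": "doi-resolution-sketch/0.1"},  # some publishers reject empty agents
    )
    with urllib.request.urlopen(request) as response:
        return response.geturl()

if __name__ == "__main__":
    # Illustrative DOI (Wilkinson et al. 2016, FAIR principles); any valid DOI works.
    print(resolve_doi("10.1038/sdata.2016.18"))
```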
As digital tools have become widespread, the infrastructures, practices, and common representations of research communities have increasingly relied on shared understandings of what data is and what can be done with it.
[70] Standard definitions of open data used by a wide range of public and private actors have been partly elaborated by researchers around concrete scientific issues.
[95] Yet research data is often not an original creation entirely produced by one or several authors, but rather a "collection of facts, typically collated using automated or semiautomated instruments or scientific equipment".
[95] This principle largely predates the contemporary policy debate over scientific data, as the earliest court cases ruling in favor of compilation rights go back to the 19th century.
In the United States, compilation rights were defined in the Copyright Act of 1976, which explicitly mentions datasets: "a work formed by the collection and assembling of pre-existing materials or of data" (§ 101).
[96] In its 1991 decision Feist Publications, Inc. v. Rural Telephone Service Co., the Supreme Court clarified the extent and limits of database copyright: the "assembling" must be demonstrably original, while the "raw facts" contained in the compilation remain unprotected.
[96] Even in jurisdictions where the application of copyright to data outputs remains unsettled and partly theoretical, it has nevertheless created significant legal uncertainty.
[97] Criteria for the originality of compilations have been harmonized across European Union member states by the 1996 Database Directive and by several major cases settled by the European Court of Justice, such as Infopaq International A/S v Danske Dagblades Forening and Football Dataco Ltd et al. v Yahoo!
In 2014, the European Medicines Agency introduced important changes to the sharing of clinical trial data, in order to prevent the release of personal details and commercially relevant information.
In 2003, the Berlin Declaration called for a universal waiver of reuse rights on scientific contributions that explicitly included "raw data and metadata".
[112] In accordance with the principles of the Berlin Declaration, it is not a license but a waiver, as the producer of the data "overtly, fully, permanently, irrevocably and unconditionally waives, abandons, and surrenders all of Affirmer's Copyright and Related Rights".
[65] Data loss has also been singled out as a significant issue in major journals like Nature or Science.[123] Surveys of research practices have consistently shown that storage norms, infrastructures, and workflows remain unsatisfactory in most disciplines.
[35] A 2017–2018 survey of 1,372 researchers contacted through the American Geophysical Union showed that only "a quarter and a fifth of the respondents" reported good data storage practices.
Moreover, accessibility did not decay significantly for older publications: "URLs and DOIs make the data and code associated with papers more likely to be available over time".
Following Ostrom, Cameron Neylon underlines that open infrastructures are characterized not only by the management of a pool of shared resources but also by the elaboration of joint governance and norms.