Data grid

[6] Information about the locations of files and mappings between the LFNs and PFNs may be stored in a metadata or replica catalogue.
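The mapping between logical and physical file names can be pictured as a one-to-many lookup table. The following is a minimal in-memory sketch, not the interface of any particular catalogue service; the class name `ReplicaCatalogue`, its methods, and the example host names and URL scheme are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ReplicaCatalogue:
    """Toy in-memory catalogue mapping one LFN to many PFNs (illustrative only)."""

    def __init__(self):
        # logical file name -> set of physical file names (replica locations)
        self._replicas = defaultdict(set)

    def register(self, lfn, pfn):
        """Record that a physical copy of `lfn` exists at `pfn`."""
        self._replicas[lfn].add(pfn)

    def unregister(self, lfn, pfn):
        """Forget one physical copy; drop the LFN when no replicas remain."""
        self._replicas[lfn].discard(pfn)
        if not self._replicas[lfn]:
            del self._replicas[lfn]

    def lookup(self, lfn):
        """Return all known physical locations of `lfn`, sorted for stable output."""
        return sorted(self._replicas.get(lfn, ()))


# Hypothetical names: one logical file with replicas at two sites.
catalogue = ReplicaCatalogue()
catalogue.register("lfn:/experiment/run42.dat", "gsiftp://site-a.example.org/data/run42.dat")
catalogue.register("lfn:/experiment/run42.dat", "gsiftp://site-b.example.org/store/run42.dat")
print(catalogue.lookup("lfn:/experiment/run42.dat"))
```

A user or service asks for the logical name only; the catalogue returns every physical location, and the caller is free to pick whichever replica is closest or least loaded.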

[9] Several recovery methods are possible, ranging from restarting the entire transmission from the beginning of the data to resuming from the point where the transfer was interrupted.

As an example, GridFTP provides fault tolerance by resending data from the last acknowledged byte rather than restarting the entire transfer from the beginning.
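The idea of resuming from the last acknowledged byte can be illustrated with a small simulation. This sketch does not use GridFTP or reproduce its protocol; the function `send_from_offset`, the exception `TransferInterrupted`, and the `fail_after_chunks` parameter are hypothetical names, and the network is simulated by appending to an in-memory buffer.

```python
class TransferInterrupted(Exception):
    """Raised by the simulated link when the connection drops."""


def send_from_offset(source, received, chunk_size=4, fail_after_chunks=None):
    """Append the missing tail of `source` to `received`, chunk by chunk.

    `len(received)` plays the role of the last acknowledged byte offset, so a
    retry continues from there instead of from byte 0.
    Returns the number of bytes sent during this attempt.
    """
    offset = len(received)
    sent = 0
    chunks = 0
    while offset < len(source):
        if fail_after_chunks is not None and chunks == fail_after_chunks:
            raise TransferInterrupted(f"link dropped at offset {offset}")
        chunk = source[offset:offset + chunk_size]
        received.extend(chunk)  # receiver stores and acknowledges the chunk
        offset += len(chunk)
        sent += len(chunk)
        chunks += 1
    return sent


data = b"0123456789abcdef"
received = bytearray()
try:
    # First attempt: the simulated link drops after two chunks.
    send_from_offset(data, received, fail_after_chunks=2)
except TransferInterrupted as exc:
    print("interrupted:", exc)

# Second attempt: only the unacknowledged tail is sent.
resent = send_from_offset(data, received)
print("bytes resent:", resent, "complete:", bytes(received) == data)
```

The alternative recovery method, restarting from the beginning, would correspond to clearing `received` before the retry, at the cost of sending the already-acknowledged bytes a second time.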

The data transport service also provides the low-level access and connections between hosts that file transfer requires.

To meet the needs for scalability, fast access and user collaboration, most data grids support replication of datasets to points within the distributed storage architecture.

[16] All replicas are then cataloged, or added to a directory, according to their location in the data grid so that users can query for them.

In a flat topology, how updates take place depends entirely on the peer relationships between nodes.

There are a number of ways the replication management system can handle the creation and placement of replicas to best serve the user community.

[17] Numerous strategies have been proposed and tested for managing the placement of dataset replicas within the data grid to meet user requirements.

The best strategy to use depends on the type of data grid and on the access requirements of the user community.

When the number of hits for a specific dataset exceeds the replication threshold, a replica is created on the server that directly services the user’s client.

This improves system performance in terms of response time and number of replicas, and it helps balance load across the data grid.
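The threshold rule can be sketched as a per-server hit counter. This is an illustrative simplification rather than any published algorithm; the class `ThresholdReplicator`, its attributes, and the server and dataset names are hypothetical, and replica creation is modelled as adding an entry to a set.

```python
from collections import Counter


class ThresholdReplicator:
    """Create a replica on a server once its requests for a dataset exceed a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.hits = Counter()   # (server, dataset) -> number of remote requests
        self.replicas = set()   # (server, dataset) pairs that hold a local replica

    def record_request(self, server, dataset):
        """Count one request; return True if this request triggered a new replica."""
        key = (server, dataset)
        if key in self.replicas:
            return False        # already served from the local replica
        self.hits[key] += 1
        if self.hits[key] > self.threshold:
            self.replicas.add(key)  # stand-in for the actual copy operation
            return True
        return False


replicator = ThresholdReplicator(threshold=2)
for attempt in range(1, 5):
    created = replicator.record_request("edge-server-1", "dataset-A")
    print(attempt, created)
```

Counting hits per server, rather than globally, is what places the replica on the server that directly services the requesting client.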

This method can also use dynamic algorithms that determine whether the cost of creating the replica is truly worth the expected gains given the location.
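A cost-versus-gain check of this kind might look like the sketch below. The cost model is an assumption made for illustration, not one taken from the literature: it compares a one-time creation cost against the transfer cost saved over the expected number of future requests. The function `worth_replicating` and all of its parameters are hypothetical.

```python
def worth_replicating(size_gb, expected_requests,
                      remote_cost_per_gb, local_cost_per_gb,
                      creation_cost_per_gb):
    """Return True if the expected savings exceed the one-time creation cost.

    Assumed model: every future request served locally saves
    (remote_cost_per_gb - local_cost_per_gb) per gigabyte transferred.
    """
    creation_cost = size_gb * creation_cost_per_gb
    saving_per_request = size_gb * (remote_cost_per_gb - local_cost_per_gb)
    expected_gain = expected_requests * saving_per_request
    return expected_gain > creation_cost


# A heavily requested dataset versus one that is requested only once.
print(worth_replicating(10, 5, 2.0, 0.5, 3.0))
print(worth_replicating(10, 1, 2.0, 0.5, 3.0))
```

In a real system the inputs would themselves be estimates, for example request rates predicted from access history and costs derived from measured bandwidth at the candidate location.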

A candidate server may have sufficient storage space but be servicing many clients for access to stored files.

[21] Characteristics of data grid systems such as large scale and heterogeneity require specific methods of task scheduling and resource allocation.

Another characteristic of data grids is their dynamic nature: nodes connect and disconnect continuously, and local load imbalances arise during the execution of tasks.

As a result, many data grids use execution-time adaptation techniques that allow the system to react to dynamic changes: balancing the load, replacing disconnected nodes, taking advantage of newly connected nodes, and recovering task execution after faults.

In such a case, the RMSs in the federation will employ an architecture that allows for interoperability based on an agreed-upon set of protocols for actions related to storage resources.

[23] Data grids have been designed with multiple topologies in mind to meet the needs of the scientific community.

One project that uses this data grid topology is the Network for Earthquake Engineering Simulation (NEES) in the United States.

Hierarchical topology lends itself to collaboration where there is a single source for the data and it needs to be distributed to multiple locations around the world.

One organization that would benefit from this topology is CERN, which runs the Large Hadron Collider, an instrument that generates enormous amounts of data.

[10] More recent research requirements for data grids have been driven by the Large Hadron Collider (LHC) at CERN, the Laser Interferometer Gravitational Wave Observatory (LIGO), and the Sloan Digital Sky Survey (SDSS).

The data grid is an evolving technology that continues to change and grow to meet the needs of an expanding community.

One of the earliest programs begun to make data grids a reality was funded by the Defense Advanced Research Projects Agency (DARPA) in 1997 at the University of Chicago.

[32] The research spawned by DARPA has continued toward the creation of open-source tools that make data grids possible.