Once computer networks started to proliferate, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments.
This makes cloud computing particularly well suited to supporting different types of applications that require large-scale distributed processing.
This data-intensive computing needs a high-performance file system that can share data between virtual machines (VMs).
Because it relies on a single server, the NFS protocol suffers from potentially low availability and poor scalability.
In a cloud computing environment, failure is the norm,[13][14] and chunkservers may be upgraded, replaced, and added to the system.
The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if free space on the first server falls below a certain threshold.
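A minimal sketch of such a threshold-based rebalancing pass, assuming a fixed 20% free-space threshold, a uniform chunk size, and a plain dictionary of server states (none of which are taken from GFS or HDFS internals), might look like this:

```python
# Minimal sketch of a master's periodic rebalancing pass (illustrative only;
# thresholds and data structures are assumptions, not GFS/HDFS internals).

FREE_THRESHOLD = 0.20   # rebalance a server once its free space drops below 20%
CHUNK_FRACTION = 0.05   # assume every chunk occupies 5% of a server's capacity

def rebalance(servers):
    """servers: dict of server id -> {'free': float, 'chunks': list of chunk ids}."""
    moves = []
    for src, state in servers.items():
        while state['free'] < FREE_THRESHOLD and state['chunks']:
            # Pick the destination with the most free space.
            dst = max(servers, key=lambda s: servers[s]['free'])
            if dst == src or servers[dst]['free'] <= state['free']:
                break  # no better placement available
            chunk = state['chunks'].pop()
            servers[dst]['chunks'].append(chunk)
            state['free'] += CHUNK_FRACTION
            servers[dst]['free'] -= CHUNK_FRACTION
            moves.append((chunk, src, dst))
    return moves

servers = {
    'cs1': {'free': 0.10, 'chunks': ['c1', 'c2', 'c3']},
    'cs2': {'free': 0.60, 'chunks': ['c4']},
}
print(rebalance(servers))  # [('c3', 'cs1', 'cs2'), ('c2', 'cs1', 'cs2')]
```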
[15] However, this centralized approach can turn the master servers into a bottleneck: a large number of file accesses adds to their already heavy load, which they may become unable to manage.
[16] To get a large number of chunkservers to work in collaboration, and to solve the problem of load balancing in distributed file systems, several approaches have been proposed, such as reallocating file chunks so that they are distributed as uniformly as possible while keeping the movement cost as low as possible.
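One way to picture such a reallocation is a greedy pass that moves chunks only from over-full to under-full servers, so the final distribution is roughly uniform and only the necessary chunks are moved. The sketch below illustrates that idea under simple assumptions (equal-sized chunks, one chunk list per server); it is not any published algorithm.

```python
# Illustrative greedy reallocation: even out chunk counts across servers
# while moving as few chunks as possible (not a published algorithm).

def reallocate(chunk_map):
    """chunk_map: dict of server id -> list of chunk ids. Returns a list of moves."""
    total = sum(len(chunks) for chunks in chunk_map.values())
    target = total // len(chunk_map)   # every server should hold roughly this many chunks
    overloaded = [s for s in chunk_map if len(chunk_map[s]) > target + 1]
    underloaded = [s for s in chunk_map if len(chunk_map[s]) < target]
    moves = []
    for src in overloaded:
        for dst in underloaded:
            while len(chunk_map[src]) > target and len(chunk_map[dst]) < target:
                chunk = chunk_map[src].pop()
                chunk_map[dst].append(chunk)
                moves.append((chunk, src, dst))
    return moves

chunk_map = {'cs1': ['c1', 'c2', 'c3', 'c4', 'c5'], 'cs2': ['c6'], 'cs3': []}
print(reallocate(chunk_map))   # chunks flow only from cs1 to cs2 and cs3
```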
It provides fault-tolerant, high-performance data storage to a large number of clients accessing it simultaneously.
MapReduce, which runs on top of GFS, allows users to create programs and run them on multiple machines without thinking about parallelization and load-balancing issues.
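The division of labour that MapReduce hides from the user can be illustrated with the classic word-count example. The sketch below is a single-process toy: the map, shuffle, and reduce functions are hypothetical stand-ins, and the real framework's distribution across machines, network shuffling, and fault tolerance are deliberately omitted.

```python
# Toy word count in the MapReduce style (single process; the real framework
# distributes map and reduce tasks across machines and handles failures).
from collections import defaultdict

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word in the input.
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```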
When the log becomes too large, a checkpoint is made and the main-memory data is stored in a B-tree structure to facilitate mapping it back into main memory.
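A stripped-down version of that log-plus-checkpoint pattern is sketched below; the record-count threshold, the plain dictionary standing in for the B-tree, and the pickle-based snapshot are all simplifying assumptions, not details of the GFS master.

```python
# Simplified sketch of an operation log with periodic checkpoints.
# A real master uses a compact B-tree-like structure and an append-only log
# on stable storage; the in-memory structures here are illustrative assumptions.
import pickle

CHECKPOINT_AFTER = 1000   # assumed threshold: checkpoint once the log has this many records

class MetadataStore:
    def __init__(self):
        self.state = {}        # file name -> metadata (stand-in for the B-tree)
        self.log = []          # mutations applied since the last checkpoint
        self.checkpoint = b""  # serialized snapshot of the state

    def apply(self, op, path, value=None):
        # Record the mutation first, then apply it to the in-memory state.
        self.log.append((op, path, value))
        if op == "create":
            self.state[path] = value
        elif op == "delete":
            self.state.pop(path, None)
        if len(self.log) >= CHECKPOINT_AFTER:
            self.make_checkpoint()

    def make_checkpoint(self):
        # Snapshot the whole state and truncate the log, so recovery only
        # replays a short log on top of the latest checkpoint.
        self.checkpoint = pickle.dumps(self.state)
        self.log.clear()

    def recover(self):
        state = pickle.loads(self.checkpoint) if self.checkpoint else {}
        for op, path, value in self.log:
            if op == "create":
                state[path] = value
            elif op == "delete":
                state.pop(path, None)
        return state
```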
[25] HDFS, developed by the Apache Software Foundation, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes).
Among the most notable differences is that MapR-FS is a fully read/write file system with metadata for files and directories distributed across the namespace, so there is no NameNode.
[37] It addresses the challenges of dealing with huge files and directories, coordinating the activity of thousands of disks, providing parallel access to metadata on a massive scale, handling both scientific and general-purpose workloads, authenticating and encrypting at large scale, and scaling up or down dynamically due to frequent device decommissioning, device failures, and cluster expansions.
[38] BeeGFS is the high-performance parallel file system from the Fraunhofer Competence Centre for High Performance Computing.
The distributed metadata architecture of BeeGFS has been designed to provide the scalability and flexibility needed to run HPC and similar applications with high I/O demands.
Operations such as open, close, read, write, send, and receive need to be fast to ensure good performance.
For example, each read or write request accesses disk storage, which introduces seek, rotational, and network latencies.
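As a rough back-of-the-envelope illustration, with placeholder values rather than measurements from any particular system, these latency components add up quickly for small random requests:

```python
# Back-of-the-envelope latency for one small remote read (illustrative values only).
seek_ms     = 4.0   # average disk seek
rotation_ms = 2.0   # half a rotation at 15,000 rpm
transfer_ms = 0.1   # reading a few kilobytes off the platter
network_ms  = 0.5   # round trip inside a datacenter

total_ms = seek_ms + rotation_ms + transfer_ms + network_ms
print(f"~{total_ms:.1f} ms per request, i.e. roughly {1000 / total_ms:.0f} requests/s per disk")
```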
The environment becomes insecure if the service provider can locate consumers' data in the cloud, access and retrieve it, and understand its meaning.[49] The geographic location of data helps determine its privacy and confidentiality.
For example, clients in Europe may be unwilling to use datacenters located in the United States, because doing so weakens the guarantee of data confidentiality.
A variety of solutions exists, such as encrypting only sensitive data[52] and supporting only certain operations, in order to simplify computation.
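The idea of encrypting only the sensitive fields of a record before handing it to the provider can be sketched with a symmetric cipher. The example below uses the Fernet API from the Python cryptography package; the record layout and the choice of sensitive fields are hypothetical.

```python
# Sketch: encrypt only the fields marked sensitive before uploading a record
# to cloud storage (field names are hypothetical; requires the `cryptography` package).
from cryptography.fernet import Fernet

SENSITIVE_FIELDS = {"ssn", "salary"}   # assumed to need confidentiality
key = Fernet.generate_key()            # kept by the client, never sent to the provider
cipher = Fernet(key)

def protect(record):
    # Encrypt sensitive values; leave the rest in clear so the provider can still process them.
    return {
        field: cipher.encrypt(value.encode()) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }

def recover(record):
    return {
        field: cipher.decrypt(value).decode() if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }

record = {"name": "Alice", "department": "HR", "ssn": "123-45-6789", "salary": "52000"}
stored = protect(record)        # only ssn and salary are ciphertext
assert recover(stored) == record
```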
Such integrity means that data has to be stored correctly on cloud servers and that, in case of failures or incorrect computation, problems have to be detected.
[54] Integrity is easy to achieve using cryptography (typically through message authentication codes, or MACs, on data blocks).
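For example, a client can attach a MAC to every block before uploading it and recompute the tag on retrieval; in the sketch below, the 64 KiB block size and the in-memory key are illustrative assumptions.

```python
# Sketch: per-block MACs for integrity checking (standard library only; the
# 64 KiB block size and the in-memory key are illustrative assumptions).
import hmac, hashlib

BLOCK_SIZE = 64 * 1024
key = b"client-side secret key"   # never shared with the storage provider

def mac_blocks(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [(block, hmac.new(key, block, hashlib.sha256).digest()) for block in blocks]

def verify(block, tag):
    # Constant-time comparison; a mismatch signals corruption or tampering.
    return hmac.compare_digest(hmac.new(key, block, hashlib.sha256).digest(), tag)

stored = mac_blocks(b"some file content" * 10_000)
assert all(verify(block, tag) for block, tag in stored)
```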
For instance, Skute[61] is a mechanism based on key/value storage that allows efficient dynamic data allocation.
Placement takes into account the trustworthiness of servers (which depends on technical factors, such as hardware components, and non-technical ones, such as the economic and political situation of a country) and their diversity, that is, the geographical distance between the servers that hold the replicas.
[66] Replication is an effective way to ensure data availability, but it is costly in terms of storage space.
DiskReduce[67] is a modified version of HDFS, based on RAID technology (RAID-5 and RAID-6), that allows asynchronous encoding of replicated data.
[69] Other approaches that use different mechanisms to guarantee availability include the Reed-Solomon code of Microsoft Azure and RaidNode for HDFS.
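The common idea behind these coding approaches is to store parity instead of extra full copies. The sketch below shows the simplest case, a RAID-5-style XOR parity over a stripe of equal-sized blocks, which suffices to rebuild any single lost block; Reed-Solomon codes generalize this to tolerate several losses.

```python
# Illustrative RAID-5-style parity: one XOR parity block per stripe lets any
# single missing block be rebuilt (real systems use stronger Reed-Solomon codes).

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A stripe of three equal-sized data blocks plus their parity block.
data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
parity = xor_blocks(data)

# Lose one data block and rebuild it from the survivors plus the parity.
lost_index = 1
survivors = [b for i, b in enumerate(data) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost_index]
```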
US government spending on cloud computing is expected to grow at a compound annual growth rate (CAGR) of 40%, reaching 7 billion dollars by 2015.
The cost of a server is determined by the quality of its hardware, its storage capacity, and its query-processing and communication overhead.
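A toy cost model along those lines simply sums the contributing factors; every price and weight below is a made-up placeholder, meant only to show how the factors combine.

```python
# Toy server cost model (all prices and weights are made-up placeholders;
# the point is only that hardware, storage, and load factors are summed).

def monthly_server_cost(hardware_tier_usd, storage_tb, queries_per_s, gb_transferred):
    STORAGE_USD_PER_TB = 20.0
    QUERY_USD_PER_KQPS = 5.0       # query-processing overhead
    NETWORK_USD_PER_GB = 0.05      # communication overhead
    return (hardware_tier_usd
            + storage_tb * STORAGE_USD_PER_TB
            + (queries_per_s / 1000) * QUERY_USD_PER_KQPS
            + gb_transferred * NETWORK_USD_PER_GB)

print(monthly_server_cost(hardware_tier_usd=300, storage_tb=10,
                          queries_per_s=2000, gb_transferred=500))   # 535.0
```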