Apache Hadoop

Apache Hadoop (/həˈduːp/) is a collection of open-source software utilities for reliable, scalable, distributed computing.

It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.[4][5]
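As an illustration of the MapReduce programming model, the following is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class names are illustrative and not taken from this article.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in an input line, emit the pair (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the per-word counts emitted by all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job; input and output paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}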

All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Processing data on the nodes that store it (data locality) allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003.[23]

The first design document for the Hadoop Distributed File System was written by Dhruba Borthakur in 2007.

For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack, specifically of the network switch, where a worker node is located.

Hadoop applications can use this information to execute code on the node where the data is, and, failing that, on the same rack/switch to reduce backbone traffic.
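For instance, a client or scheduler can ask HDFS which hosts and racks store each block of a file through the public FileSystem API. The following Java sketch assumes a configured HDFS client and uses a hypothetical file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // default file system (HDFS if so configured)

        Path file = new Path("/data/example.txt");  // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block, listing the hosts (and rack paths) holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s racks=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()),
                    String.join(",", block.getTopologyPaths()));
        }
        fs.close();
    }
}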

The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between nodes in the cluster.

Some consider HDFS to instead be a data store due to its lack of POSIX compliance,[36] but it does provide shell commands and Java application programming interface (API) methods that are similar to those of other file systems.

The DataNode, also known as the slave node, stores the actual data in HDFS and is responsible for serving read and write requests from clients.

HDFS stores large files (typically in the range of gigabytes to terabytes[40]) across multiple machines.

HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals of a Hadoop application.

The trade-off of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as append.[41]
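A minimal Java sketch of the append operation is shown below; it assumes an HDFS deployment with append enabled and uses a hypothetical file path.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/events.log");  // hypothetical existing file

        // append() adds data to the end of an existing file instead of rewriting it,
        // an operation outside the strict POSIX model that HDFS supports.
        try (FSDataOutputStream out = fs.append(log)) {
            out.write("new event record\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}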

In May 2012, high-availability capabilities were added to HDFS,[42] letting the main metadata server, called the NameNode, manually fail over onto a backup.

Checkpoint images of the NameNode's metadata can be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure.

HDFS Federation, a new addition, aims to tackle the limitation of a single NameNode serving all metadata to a certain extent by allowing multiple namespaces to be served by separate NameNodes.[43]

HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations.

File access can be achieved through the native Java API, the Thrift API (which generates a client in a number of languages, e.g. C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, the HDFS-UI web application over HTTP, or via third-party network client libraries.[44]
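As a brief illustration of the native Java API mentioned above, the following sketch reads a file from HDFS line by line; the path used is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");  // hypothetical file

        // open() returns a stream over the file's blocks, fetched from the DataNodes that hold them.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}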

HDFS is designed for portability across various hardware platforms and for compatibility with a variety of underlying operating systems.[45]

Due to its widespread integration into enterprise-level infrastructure, monitoring HDFS performance at scale has become an increasingly important issue.

Monitoring end-to-end performance requires tracking metrics from datanodes, namenodes, and the underlying operating system.
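One common way to collect such metrics is the JSON-over-HTTP JMX servlet that Hadoop daemons expose. The sketch below assumes a hypothetical NameNode host and the Hadoop 3 default web port (9870), and fetches a single MBean from the NameNode; a real monitoring system would parse the JSON and export the values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NameNodeMetrics {
    public static void main(String[] args) throws Exception {
        // The qry parameter narrows the response to one MBean (here: FSNamesystem state).
        String url = "http://namenode.example.com:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        // The endpoint returns the MBean's attributes as JSON.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}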

To reduce network traffic, Hadoop needs to know which servers are closest to the data, information that Hadoop-specific file system bridges can provide.

However, some commercial distributions of Hadoop ship with an alternative file system as the default – specifically those from IBM and MapR.

The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

Hadoop 3 supports containers that work on the same principle as Docker, which reduces the time spent on application development.

Hadoop can also be used to complement a real-time system, for example in a lambda architecture together with Apache Storm, Flink, or Spark Streaming.

Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo! web search query.[67]

The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise.[70]

The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.

[Figure: A multi-node Hadoop cluster]