The rapid growth of the Internet and World Wide Web has made vast amounts of information available online.[1]
In addition, business and government organizations create large amounts of both structured and unstructured information, which need to be processed, analyzed, and linked.
Parallel processing of data-intensive applications typically involves partitioning or subdividing the data into multiple segments which can be processed independently using the same executable application program in parallel on an appropriate computing platform, then reassembling the results to produce the completed output data.
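As an illustrative sketch, not tied to any particular framework, this pattern can be expressed with Python's standard multiprocessing module: the input is split into segments, each segment is processed by the same function in parallel, and the partial results are reassembled. The segment count and the per-segment word-count function here are hypothetical placeholders for whatever the real application does.

```python
from multiprocessing import Pool

def process_segment(segment):
    """The same executable logic is applied to every segment.
    Counting words here is a stand-in for any per-segment work."""
    return sum(len(line.split()) for line in segment)

def partition(data, n_segments):
    """Subdivide the data into roughly equal, independent segments."""
    step = max(1, len(data) // n_segments)
    return [data[i:i + step] for i in range(0, len(data), step)]

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    segments = partition(lines, n_segments=4)
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_segment, segments)  # parallel step
    total = sum(partial_results)  # reassemble the completed output
    print(total)
```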
Pacific Northwest National Labs defined data-intensive computing as “capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies”.[15]
Several solutions have emerged including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop used by Yahoo, Facebook, and others.
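The core of the MapReduce model can be sketched in plain Python without the Hadoop APIs: a map function emits key/value pairs, the pairs are shuffled (grouped by key), and a reduce function combines each group. This is a toy illustration of the programming model, not Google's or Hadoop's actual implementation.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word, 1

def shuffle(mapped_pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all counts for one word.
    return key, sum(values)

documents = ["data intensive computing", "computing at scale", "data at rest"]
mapped = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # e.g. {'data': 2, 'computing': 2, 'at': 2, ...}
```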
Hadoop now encompasses multiple subprojects in addition to the base core, which consists of the MapReduce engine and the HDFS distributed filesystem.
These additional subprojects provide enhanced application processing capabilities to the base Hadoop implementation and currently include Avro, Pig, HBase, ZooKeeper, Hive, and Chukwa.
Hadoop provides a distributed scheduling and execution environment for MapReduce jobs.
Hadoop includes a distributed file system called HDFS, which is analogous to GFS in Google's MapReduce implementation.
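Applications typically interact with HDFS through the Java API or the standard hdfs dfs command-line tool. Below is a minimal sketch that shells out to that tool from Python, assuming a running Hadoop installation with hdfs on the PATH; the file and directory paths shown are hypothetical.

```python
import subprocess

def hdfs(*args):
    """Invoke the standard `hdfs dfs` CLI; raises on a non-zero exit code."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Hypothetical paths, for illustration only.
hdfs("-mkdir", "-p", "/user/demo/input")            # create a directory
hdfs("-put", "local_data.txt", "/user/demo/input")  # copy a local file in
hdfs("-cat", "/user/demo/input/local_data.txt")     # stream the file back
```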
The additional subprojects include HBase, a distributed column-oriented database that provides random-access read/write capabilities; Hive, a data warehouse system built on top of Hadoop that provides SQL-like query capabilities for data summarization, ad hoc queries, and analysis of large datasets; and Pig, a high-level data-flow programming language and execution framework for data-intensive computing.
Pig was developed at Yahoo! to provide a specific language notation for data analysis applications, to improve programmer productivity, and to reduce development cycles when using the Hadoop MapReduce environment.
The Pig language provides built-in operations for loading, storing, filtering, grouping, de-duplicating, ordering, aggregating, and joining data.
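To give a feel for that style of data flow (load, filter, group, aggregate), here is a rough plain-Python analogue. Real Pig scripts are written in Pig Latin; the records and threshold below are invented, and the code only mirrors the shape of such a pipeline.

```python
from collections import defaultdict

# LOAD: hypothetical (user, url, time_spent) records.
visits = [
    ("alice", "example.com/a", 12),
    ("bob", "example.com/b", 45),
    ("alice", "example.com/c", 30),
]

# FILTER: keep visits longer than 15 units.
long_visits = [v for v in visits if v[2] > 15]

# GROUP BY user.
by_user = defaultdict(list)
for user, url, t in long_visits:
    by_user[user].append(t)

# AGGREGATE: total time per user (FOREACH ... GENERATE SUM, in Pig terms).
totals = {user: sum(times) for user, times in by_user.items()}
print(totals)  # {'bob': 45, 'alice': 30}
```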
To address both the batch and online aspects of data-intensive computing applications, HPCC (High-Performance Computing Cluster) includes two distinct cluster environments, each of which can be optimized independently for its parallel data-processing purpose.
The Thor cluster serves as a batch-oriented data refinery for processing, cleansing, and linking massive volumes of raw data. The Roxie platform provides an online, high-performance structured query and analysis system (or data warehouse) that delivers the parallel data-access requirements of online applications through Web services interfaces, supporting thousands of simultaneous queries and users with sub-second response times.
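From a client's perspective, the concurrent-access pattern against such a web services interface might look like the following sketch. The endpoint URL, query name, and JSON payload are hypothetical placeholders; the real interface and request format depend on the deployed Roxie query.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# Hypothetical Roxie web-service endpoint and payload shape.
ENDPOINT = "http://roxie.example.com:8002/query/demo_search"

def run_query(search_term):
    payload = json.dumps({"term": search_term}).encode()
    req = request.Request(ENDPOINT, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=2) as resp:  # sub-second budget
        return json.load(resp)

# Issue many queries concurrently, as an online application would.
terms = [f"term-{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(run_query, terms))
```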
Both Thor and Roxie systems utilize the same ECL programming language for implementing applications, increasing programmer productivity.