|
Grid Data Farm for Petascale
High-performance, data-intensive computing and networking technology
has become a vital part of large-scale scientific research projects in
areas such as high energy physics, astronomy, space exploration, and
human genome research. One example is the Large Hadron Collider project
at CERN, where four major experiment groups will generate on the order
of a petabyte of raw data each year from four large underground particle
detectors, with data acquisition starting in 2006. Grid technology will
play an essential role in constructing worldwide data analysis
environments in which thousands of physicists will collaborate and
compete in particle physics data analysis at the energy frontier. A
multi-tier "Regional Centers" worldwide computing model has been studied
by the MONARC project. It consists of a Tier-0 center at CERN, multiple
Tier-1 centers and tens of Tier-2 centers in participating countries,
and many Tier-3 centers in universities and institutes.
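To put the petabyte-per-year figure above in perspective, the short sketch below converts an annual data volume into the average network rate and daily volume needed to move it. The round figure of 1 PB per year (decimal, 10^15 bytes) and the helper names are assumptions made for illustration; they are not taken from the project itself.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years
PETABYTE = 1e15                     # decimal petabyte, in bytes (assumption)


def sustained_rate_mbps(bytes_per_year: float) -> float:
    """Average rate in megabits per second needed to move bytes_per_year."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e6


def daily_volume_tb(bytes_per_year: float) -> float:
    """Average volume per day, in decimal terabytes."""
    return bytes_per_year / 365 / 1e12


if __name__ == "__main__":
    print(f"{sustained_rate_mbps(PETABYTE):.0f} Mbps")   # 254 Mbps
    print(f"{daily_volume_tb(PETABYTE):.2f} TB/day")     # 2.74 TB/day
```

Under these assumptions, one petabyte per year corresponds to a sustained rate of a few hundred megabits per second, around the clock.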
Grid Data Farm is a petascale data-intensive computing project
initiated in Japan. The underlying hardware will be a PC cluster on the
scale of thousands of nodes, each node providing nearly a terabyte of
storage. Incoming data from CERN, arriving continuously at approximately
600 Mbps, will be systematically stored and subjected to intensive
processing. The Grid Data Farm will provide the following features for
collider data processing, and will also serve as a framework for other
types of data-intensive scientific applications:
- Global distributed filesystem for petabyte-scale data
- Parallel I/O and parallel processing for fast data analysis
- Worldwide group-oriented authentication and access control
- Thousands-node, wide-area resource management and scheduling
- Multi-tier data sharing and efficient access
- Program sharing and management
- System monitoring and administration
- Fault tolerance, dynamic reconfiguration, and automated data
  regeneration or re-computation
The major components of the Grid Data Farm are the Gfarm client, the
Gfarm server, and the Gfarm distributed filesystem with Gfarm parallel
I/O. The Gfarm filesystem consists of a PC cluster on the scale of
thousands of nodes, each with a local disk and possibly distributed over
the Grid. Petascale data are distributed across the disks in the Gfarm
filesystem, which is managed by the Meta Data Management System and the
Gfarm filesystem daemon. The Meta Data Management System provides a
mapping from logical file names to the distributed physical components,
and also stores metadata such as a replica catalog and the history
needed to reproduce the data. The Gfarm filesystem
daemon provides remote file operations with access control, as well as
remote program loading and resource monitoring. Large-scale distributed
data are accessed through Gfarm parallel I/O and processed in parallel.
The Grid Data Farm middleware is based on Grid-based RPC (GridRPC), an
extended variant of the Ninf system, and on other lower-level Grid
service middleware such as Globus. This makes it easy to register
analysis software for large-scale data processing. Load balancing, job
scheduling, fault tolerance, and data maintenance are handled
transparently or semi-transparently by the system, using simple GUIs or
a simple shell front end. More sophisticated client interaction is
possible using GridRPC.
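The metadata mapping described above, from a logical file name to physical fragments on individual nodes together with a replica catalog, can be illustrated with a small sketch. This is a toy model under stated assumptions, not the Gfarm API: the class names `Fragment` and `MetadataCatalog`, the methods `register`, `locate`, and `schedule`, and the host and path names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """One physical piece of a logical file, stored on a node's local disk."""
    index: int   # position of the fragment within the logical file
    host: str    # node holding this copy (hypothetical host name)
    path: str    # path on that node's local disk (hypothetical)


@dataclass
class MetadataCatalog:
    """Toy stand-in for a metadata service with a replica catalog."""
    # replicas[logical_name][fragment_index] -> list of Fragment copies
    replicas: dict = field(default_factory=dict)

    def register(self, logical_name: str, fragment: Fragment) -> None:
        """Record one physical copy of a fragment of a logical file."""
        by_index = self.replicas.setdefault(logical_name, {})
        by_index.setdefault(fragment.index, []).append(fragment)

    def locate(self, logical_name: str) -> dict:
        """Return {fragment_index: [copies]} for a logical file."""
        return self.replicas[logical_name]

    def schedule(self, logical_name: str, down_hosts=frozenset()) -> dict:
        """Choose, per fragment, the first copy on a host that is not down.

        Returns {fragment_index: host}, so each fragment can be processed
        in parallel on the node that already stores it.
        """
        plan = {}
        for index, copies in sorted(self.locate(logical_name).items()):
            live = [c for c in copies if c.host not in down_hosts]
            if not live:
                raise LookupError(
                    f"fragment {index} of {logical_name} has no live replica")
            plan[index] = live[0].host
        return plan


if __name__ == "__main__":
    catalog = MetadataCatalog()
    catalog.register("lhc/run1.dat", Fragment(0, "node001", "/data/run1.0"))
    catalog.register("lhc/run1.dat", Fragment(1, "node002", "/data/run1.1"))
    catalog.register("lhc/run1.dat", Fragment(1, "node003", "/data/run1.1"))
    # node002 is treated as failed, so fragment 1 falls back to its replica.
    print(catalog.schedule("lhc/run1.dat", down_hosts={"node002"}))
    # {0: 'node001', 1: 'node003'}
```

Keeping the replica list per fragment is what lets the scheduler route around a failed node without the client changing the logical file name it asks for. A real system would also need the recorded history to regenerate a fragment whose replicas are all lost, which this sketch only reports as an error.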
|