The Teraflow Testbed
The Teraflow Testbed (TFT)
is an infrastructure designed to use new 10 Gb/s network
protocols and data services for long haul, high performance networks. It
consists of distributed nodes over three continents that can transmit,
process, and mine very high volume data flows, or what we call teraflows.
The TFT targets 10 Gb/s wide area networks. The nodes are integrated using
advanced 10 Gb/s photonic networks and rely on both Layer 2 optical
switching and Layer 3 routers. Currently, nodes are located at CERN,
Switzerland; Amsterdam, The Netherlands; Tokyo, Japan; Chicago-National Center
for Data Mining (NCDM), USA; and Chicago-StarLight, USA. Additional nodes are
expected to be added at other locations in the near future.
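To give a sense of scale, the following back-of-the-envelope sketch estimates how long one terabyte takes to cross a 10 Gb/s link. The function name and the `efficiency` parameter (the fraction of line rate a transfer actually achieves) are illustrative assumptions, not TFT measurements.

```python
def transfer_seconds(num_bytes: float, link_bits_per_sec: float,
                     efficiency: float = 1.0) -> float:
    """Seconds needed to move num_bytes over a link.

    efficiency is the assumed fraction of the line rate that the
    transfer sustains (1.0 means the full line rate).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return num_bytes * 8 / (link_bits_per_sec * efficiency)


TERABYTE = 10**12        # bytes, decimal terabyte
TEN_GBPS = 10 * 10**9    # bits per second

# At full line rate, one terabyte needs 800 seconds (about 13 minutes).
print(transfer_seconds(TERABYTE, TEN_GBPS))

# At an assumed 50% of line rate, the same transfer takes twice as long.
print(transfer_seconds(TERABYTE, TEN_GBPS, efficiency=0.5))
```

The gap between line rate and achieved rate is what makes protocol design matter on long haul links.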
Using the TFT, NCDM is developing new network protocols and innovative
web-based data integration and data mining services that scale to teraflows.
We are also designing a new class of applications that move not only the
queries and computations, but also the data when required. In addition, the
TFT will be used to run these protocols and services over both traditional
routed networks and lambda grids.
The TFT is designed to support high-throughput end-to-end transfers of data from a disk
at one of the nodes to a disk at any of the other nodes. The TFT is also
designed to perform data integration of two high volume data streams. Finally,
the TFT is designed to allow parallelism of network flows within a single
node, as well as striping across multiple nodes.
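As an illustration of the striping idea, the sketch below deals fixed-size blocks of a data set round-robin onto several flows and then reassembles them. It is a minimal in-memory model under assumed names (`stripe`, `reassemble`, `block_size`); it does not reproduce the TFT's actual transport protocols and does no network I/O.

```python
from typing import List


def stripe(data: bytes, num_flows: int, block_size: int) -> List[List[bytes]]:
    """Deal fixed-size blocks of data round-robin onto num_flows flows."""
    if num_flows < 1 or block_size < 1:
        raise ValueError("num_flows and block_size must be positive")
    flows: List[List[bytes]] = [[] for _ in range(num_flows)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        flows[block_index % num_flows].append(data[offset:offset + block_size])
    return flows


def reassemble(flows: List[List[bytes]]) -> bytes:
    """Interleave the blocks back into their original round-robin order."""
    blocks = []
    longest = max((len(flow) for flow in flows), default=0)
    for round_number in range(longest):
        for flow in flows:
            if round_number < len(flow):
                blocks.append(flow[round_number])
    return b"".join(blocks)


payload = bytes(range(10))
flows = stripe(payload, num_flows=3, block_size=2)
assert reassemble(flows) == payload
```

Each flow could be carried by a separate connection within one node or by a separate node; the receiver only needs to know the flow count and block order to restore the data.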
The Terra Wide Data Mining Network (TWDM)
The Terra Wide Data Mining Network (TWDM) is
an infrastructure built on top of DataSpace for the remote analysis,
distributed mining, and real time exploration of scientific, engineering,
business, and other complex data. Terra mining applications are designed to
exploit the capabilities provided by emerging domestic and international high
performance optical networks so that Gigabyte and Terabyte data sets can be
remotely explored in real time.
With traditional approaches to predictive modeling, data is
collected, statistical models are built, predictions are made, and the
models are validated as additional data becomes available. For many
problems, ranging from controlling the outbreak of an infectious disease
to preventing terrorism, data arrives discretely and it is useful to
update predictions as soon as new data is available.
DataSpace supports an approach to predictive modeling that works
with "events," abstractions representing new pieces of
information that are assumed to arrive in a real time stream. Events
are aggregated to build "profiles," which are the inputs to predictive
models. DataSpace supports an open standard called the Data
Transformation Markup Language (DTML) for updating profiles with new
events in real time or near real time.
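The event and profile pattern can be sketched in plain Python as follows. This is not DTML and does not use any DataSpace API; the `Event` and `Profile` classes, their fields, and the `on_event` function are hypothetical names chosen only to show how a profile can be updated incrementally as each event arrives.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    """One new piece of information about an entity (hypothetical fields)."""
    entity_id: str
    value: float


@dataclass
class Profile:
    """Running summary of all events seen so far for one entity."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

    def update(self, event: Event) -> None:
        # Constant-time update: no past events need to be stored or rescanned.
        self.count += 1
        self.total += event.value
        self.maximum = max(self.maximum, event.value)


# One profile per entity, created on first use.
profiles = defaultdict(Profile)


def on_event(event: Event) -> Profile:
    """Fold a newly arrived event into its entity's profile and return it."""
    profile = profiles[event.entity_id]
    profile.update(event)
    return profile
```

A predictive model would read the returned profile (here its count, mean, and maximum) as input immediately after each update, rather than waiting for a batch of data to accumulate.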
Initially, the TWDM Network will consist of five geographically
distributed Terra Nodes linked by optical networks through StarLight in
Chicago. These sites include StarLight itself, the Laboratory for
Advanced Computing at UIC, SARA in Amsterdam, and Dalhousie University
in Halifax. Additional sites will be added in 2002, including Imperial
College in London.
The Terabyte Challenge Network
DataSpace was developed using the Terabyte Challenge Network, which
is an open, distributed network for DataSpace tools, services, and
protocols. It consists of ten sites distributed over three continents
connected by high performance links. It has been instrumented for
network measurements and provides a platform for experimental discovery
of scientific, engineering, business, and e-business data. The network
includes a variety of distributed data mining applications, including
the analysis of climate data, astronomical data, network data, web
data, and business data.
The Terabyte Challenge network consists of local clusters of
workstations which are connected to form wide area clusters of clusters
or Meta-Clusters. The Meta-Cluster is maintained by the National
Scalable Cluster Project (NSCP). The NSCP-1 Meta-Cluster was completed
in 1996 and linked geographically distributed clusters using the
commodity internet. The NSCP-2 Meta-Cluster was completed in 1998 and
used OC-3 networks to link the clusters. The NSCP-1 and NSCP-2
Meta-Clusters have been used by a variety of scientists and engineers
working on applications in high energy physics, computational chemistry,
nonlinear simulation, bioinformatics, medical imaging, network traffic
analysis, digital libraries of video data, economic data, business data
and e-business data.
Currently, the Terabyte Challenge Network consists of approximately
100 nodes and 2 terabytes of disk geographically distributed among the
participating sites, and connected by laboratory, campus, and national
high performance networks. The Terabyte Challenge involves scientists
and engineers from a number of organizations, including the University
of Illinois at Chicago, University of Pennsylvania, University of
California at Davis, Imperial College, Australian National University,
and Magnify.
The Global Discovery Network (GDN)
The Global Discovery Network is a collaboration between the
Laboratory for Advanced Computing/National Center for Data Mining and
the Discovery Net. The Discovery Net is a newly announced project at
Imperial College, London, led by Dr. Yike Guo, to develop an E-Science
platform for scientific discovery from high throughput informatics. The
new Global Discovery Network will link the Discovery Net to the Terra
Wide Data Mining Network to create a combined global network with a
critical mass of data.
The Global Discovery Network is the first global high performance
network for remote data analysis and distributed data mining and holds
the promise of providing scientists and engineers easier ways to work
with distributed data.
The GDN is seeking proposals from industry to be the first supported
applications on the network. Please contact us for more information.