National Center for Data Mining SABUL FAQ

Q. What is high performance data transport?

Data transport refers to the process of moving data from one location on the internet to another. Moving data at high rates requires not only high performance links, but also appropriate protocols for transporting data. The product of the bandwidth and the round trip time (RTT) for a packet to go to the target and return is called the bandwidth delay product or BDP. Many data protocols are not effective when transporting data over links with high BDPs.

This note discusses two protocols, SABUL and UDT, which can be used for high performance data transport, even for links with high BDPs. These protocols can be used by applications directly, as part of high performance web services, or as part of specialized software designed for what are called lambda grids.
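As a rough illustration of the BDP, the sketch below computes it for a 1 Gb/s link with a 110 ms round trip time, the Chicago to Amsterdam figures quoted later in this FAQ. The function name bdp_bytes is ours and is not part of SABUL or UDT.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth delay product in bytes.

    bandwidth_bps: link bandwidth in bits per second
    rtt_ms: round trip time in milliseconds
    """
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return bandwidth_bps * rtt_ms // 8000


# 1 Gb/s link, 110 ms RTT (illustrative values taken from this FAQ).
bdp = bdp_bytes(1_000_000_000, 110)
print(f"BDP = {bdp} bytes ({bdp / 1e6:.2f} MB)")
```

In other words, on such a link roughly 13.75 MB of data must be in flight at any moment to keep the link full.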

Q. What is SABUL?

SABUL is a protocol for moving data very efficiently over long haul, high performance networks. SABUL is also the name of an open source library implementing SABUL.

Q. How fast is SABUL?

When moving data from a single computer using a 1 Gb/s NIC and connected with GigE, SABUL can move data at over 950 Mb/s between nodes located in Chicago and Amsterdam over high performance networks with 110 ms RTT. SABUL can also be used on a cluster of such computers. For example, using two three-node clusters, SABUL moved data between Chicago and Amsterdam at over 2.8 Gb/s during the iGrid 2002 Conference.

Q. Can you give me some details of why SABUL is faster than competing protocols?

It has long been recognized that current implementations of TCP as generally deployed do not provide good performance for applications on networks with a high bandwidth delay product.

One approach to improving TCP performance for data intensive applications is to adjust the TCP window size to be the product of the bandwidth and the RTT delay of the network. This approach requires tuning the network and in practice can be quite difficult.
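To see why the window size matters, recall that a TCP sender can have at most one window of unacknowledged data in flight per round trip, so its throughput is bounded by the window size divided by the RTT. The sketch below applies that bound to a 65,535-byte window (the largest window TCP can advertise without the window scaling option) on a path with 110 ms RTT. The function name and the choice of values are ours, for illustration only.

```python
def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput (bits/s) for a fixed window and RTT."""
    # At most one full window can be delivered per round trip.
    return window_bytes * 8 / rtt_s


# Untuned 65,535-byte window over a 110 ms round trip.
limit = max_throughput_bps(65535, 0.110)
print(f"window-limited throughput: {limit / 1e6:.2f} Mb/s")
```

Under these assumptions the connection is limited to roughly 4.8 Mb/s regardless of the link speed, which is why the window must be raised to about the bandwidth delay product before a long haul gigabit link can be filled.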

Another approach to overcoming the limitations of TCP is to stripe the data over several standard TCP network connections. In contrast to the first approach, this can be done at the data middleware or application level. The performance of striped TCP begins to level off as the number of sockets increases, sometimes after only 25-75 sockets, effectively limiting its speed to approximately 300-350 Mb/s for networks with large bandwidth delay products, such as networks across the Atlantic.
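The idea of striping can be sketched without any networking: the sender cuts the data into fixed-size blocks and deals them out round-robin to several streams, and the receiver interleaves the streams back together. This is only a minimal sketch under our own assumptions (round-robin order, fixed block size, hypothetical function names); a real implementation would send each stream over its own TCP connection.

```python
def stripe(data, n_streams, block_size):
    """Deal fixed-size blocks of data out round-robin to n_streams lists."""
    streams = [[] for _ in range(n_streams)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        streams[block_index % n_streams].append(data[offset:offset + block_size])
    return streams


def unstripe(streams):
    """Reassemble the original bytes from round-robin striped streams."""
    n_streams = len(streams)
    total_blocks = sum(len(s) for s in streams)
    # Block k was placed on stream k % n at position k // n.
    return b"".join(streams[k % n_streams][k // n_streams]
                    for k in range(total_blocks))


payload = bytes(range(256)) * 10
streams = stripe(payload, n_streams=4, block_size=100)
restored = unstripe(streams)
print(restored == payload)
```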

Another approach is to improve the design of TCP. There has recently been significant progress in this area, including HighSpeed TCP, Scalable TCP and the FAST protocol.

Another approach is to design completely new algorithms, such as XCP.

The approach we took was to use a UDP-based channel to send data at high rates and a separate TCP-based channel to make the protocol reliable. The TCP channel is also used by the protocol for rate control and congestion control. The rate control mechanism is used to discover bandwidth quickly, for fast loss recovery, and to support intra-protocol fairness. The flow control mechanism is used to reduce packet loss and oscillation and to avoid congestion collapse. The library that implements this is called SABUL.

This is our third version of SABUL over the past three years. Over this time we have improved the implementation of SABUL so that we can usually achieve over 900 Mb/s per node using a 1 Gb/s NIC for networks with high BDPs. By using clusters, with each node directly connected to the router with a 1 Gb/s link, we can scale this to several Gb/s, even over long haul networks with high BDPs.
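The split between a UDP data channel and a TCP control channel can be illustrated with a toy example. The sketch below is not the SABUL implementation or its wire format: the 4-byte sequence header, the "list of missing sequence numbers" control message, and the names transfer and recv_exact are all assumptions made for illustration, and it omits rate and congestion control entirely. Both endpoints run in one process over the loopback interface, and packet loss is simulated by skipping chosen packets on the first attempt.

```python
import socket
import struct


def recv_exact(sock, n):
    """Read exactly n bytes from a stream (TCP) socket."""
    buf = b""
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("control channel closed")
        buf += part
    return buf


def transfer(chunks, drop_first_try=frozenset()):
    """Send chunks over UDP; request resends over a TCP control channel."""
    loop = "127.0.0.1"
    data_rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    data_rx.bind((loop, 0))
    data_rx.settimeout(0.1)
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((loop, 0))
    listener.listen(1)
    data_tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ctrl_tx = socket.create_connection(listener.getsockname())
    ctrl_rx, _ = listener.accept()

    received = {}
    pending = list(range(len(chunks)))
    first_try = True
    try:
        while pending:
            # Sender: push every pending chunk over the UDP data channel.
            for seq in pending:
                if first_try and seq in drop_first_try:
                    continue  # simulated packet loss
                header = struct.pack("!I", seq)
                data_tx.sendto(header + chunks[seq], data_rx.getsockname())
            first_try = False
            # Receiver: drain the data channel until it goes quiet.
            try:
                while True:
                    packet, _ = data_rx.recvfrom(65535)
                    received[struct.unpack("!I", packet[:4])[0]] = packet[4:]
            except socket.timeout:
                pass
            # Receiver: report missing sequence numbers over TCP.
            missing = [s for s in range(len(chunks)) if s not in received]
            report = struct.pack("!I", len(missing))
            report += struct.pack("!%dI" % len(missing), *missing)
            ctrl_rx.sendall(report)
            # Sender: read the report and schedule retransmissions.
            count = struct.unpack("!I", recv_exact(ctrl_tx, 4))[0]
            body = recv_exact(ctrl_tx, 4 * count)
            pending = list(struct.unpack("!%dI" % count, body))
    finally:
        for s in (data_rx, data_tx, listener, ctrl_tx, ctrl_rx):
            s.close()
    return b"".join(received[i] for i in range(len(chunks)))


chunks = [b"alpha", b"beta", b"gamma", b"delta"]
print(transfer(chunks, drop_first_try={1, 3}))
```

The point of the design is that bulk data never waits on TCP's window, while the small control messages get TCP's reliable, ordered delivery for free.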

Q. What is UDT?

Starting in 2003, we began developing a new version of SABUL called UDP-based Data Transport Protocol or UDT, which uses UDP for both the control and data channels. An open challenge is to design protocols for high performance data transport so that they are friendly both to other flows using the same protocol (intra-protocol fairness) and to other flows employing different protocols, such as TCP (TCP friendliness). In both simulation and experimental studies using UDT, we have found UDT to be both fair to dozens of other UDT flows and friendly to hundreds of concurrent TCP flows.

Q. How fast is UDT?

Recently, using our striped version of UDT, we were able to transport 1.4 TB of data between Amsterdam and Chicago in 30 minutes, reaching a peak speed of 6.8 Gb/s and sustaining an average speed of 6.2 Gb/s. In addition, in testing by the University of Amsterdam, single flows of UDT over long haul 10 GE networks with high BDP have been able to transmit data as fast as 5.4 Gb/s.
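As a quick consistency check on these figures, the average rate follows directly from the volume and the elapsed time. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second); the function name is ours.

```python
def average_rate_gbps(total_bytes, seconds):
    """Average transfer rate in Gb/s (decimal units)."""
    return total_bytes * 8 / seconds / 1e9


# 1.4 TB moved in 30 minutes.
rate = average_rate_gbps(1.4e12, 30 * 60)
print(f"average rate: {rate:.2f} Gb/s")
```

This gives about 6.2 Gb/s, in line with the sustained average reported above.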

Q. What are Photonic Data Services?

Photonic data services (PDS) are a layered series of protocols designed to work with data over lambda grids, that is, advanced computational infrastructures that enable applications to set up, monitor and tear down their photonic paths (lambdas) on demand. PDS consists of the following three layers:

  • Photonic path services layer: these are services to set up, tear down, and check the status of photonic paths, providing high performance photonic circuits. These services were developed by iCAIR at Northwestern University. With photonic path services, applications can request specialized photonic paths as they are needed.

  • Network protocol layer: this is the layer where SABUL/UDT and other high performance network transport protocols live. As mentioned above, SABUL and UDT employ separate data and control channels. Both employ UDP for the data channel. SABUL uses TCP for the control channel, while UDT uses UDP for the control channel. SABUL and UDT were developed by the Laboratory for Advanced Computing and National Center for Data Mining at UIC.

  • Data services layer: in prior work, we developed a specialized protocol to create data webs called the DataSpace Transfer Protocol or DSTP. DSTP is compatible with web services, but also has specialized functionality to work with data (it supports keys, metadata and data), can sample data, select rows and columns of data, etc. DSTP directly supports functionality for remote data analysis and distributed data mining. DSTP was developed by the Laboratory for Advanced Computing and National Center for Data Mining at UIC.

The speed and bandwidth come from the network protocol layer. The functionality for data comes from the data services layer. The flexibility and ability to do this on a per application basis come from the photonic path services layer. The goal of photonic data services is to take the lambdas to the data.

Q. What is Open DMIX?

Open DMIX is a set of standards-based web services for mining, integrating and exploring remote and distributed data. Open DMIX uses standard SOAP/XML based web services for working with small amounts of remote and distributed data. Open DMIX also supports high performance web services that use UDT and specialized streaming formats so that large remote and distributed data sets can also be mined, integrated and explored with web services.

Q. What is the impact of these protocols and services?

Open DMIX is designed so that working with remote data over the web is as easy as working with remote documents. Using Open DMIX and similar services, remote gigabyte and terabyte size data sets can be explored as if they were local.

Q. What are some of the advantages of using SABUL/UDT?

  1. SABUL/UDT are fast, even on networks with high BDPs.

  2. UDT is fair to other UDT flows and friendly to TCP flows.

  3. SABUL/UDT do not require any changes to existing network infrastructure, extensive networking tuning, or changes to operating systems. SABUL/UDT can be deployed directly as an application library, as part of Photonic Data Services on lambda grids, or as part of Open DMIX on data grids and data webs.

  4. SABUL/UDT are open source and the source code is freely available at SourceForge (sourceforge.net/projects/dataspace).

Q. What is the status of these protocols?

The current release of SABUL is version 2.3. The current release of DSTP is version 3.0.

More recently we began working on UDT and Open DMIX. The current release of UDT is 1.1. The first release of Open DMIX is planned for 1Q04.

All of these are available via the SourceForge project dataspace at www.sourceforge.net/projects/dataspace.

The path services are currently integrated with SABUL for a single administrative domain in an experimental release of SABUL, which is not publicly available. A public release is planned for sometime in early 2004.

Q. How can I get more technical information?

Technical information is available from the UIC NCDM web site www.ncdm.uic.edu

Technical information about SABUL and UDT is available from the following publications:

  • H. Sivakumar, R. L. Grossman, M. Mazzucco, Y. Pan, Q. Zhang, Simple Available Bandwidth Utilization Library for High-Speed Wide Area Networks, Journal of Supercomputing, 2003, to appear.

  • Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Dave Lillethun, Jorge Levera, Joe Mambretti, Marco Mazzucco, and Jeremy Weinberger, Experimental Studies Using Photonic Data Services at iGrid 2002, FGCS, 2003.

  • Yunhong Gu and Robert Grossman, Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks, submitted for publication.

  • Yunhong Gu and Robert Grossman, End-to-End Congestion Control for High Performance Data Transfer, submitted for publication.

Q. Who funded the research?

Early versions of SABUL were funded by NSF and DOE. During the past two years, the work has been supported by the NSF.

telephone (312) 996-0305
e-mail staff@teraflowtestbed.net
address 700 SEO MC 249, 851 S. Morgan St. Chicago, IL. 60607