Sector: simplifying distributed computing

Sector is system infrastructure software that provides distributed data storage, access, and analysis/processing. It automatically manages large volumes of data across servers or clusters, including those connected over wide area high speed networks. Sector provides simple tools and APIs to access and process the data. Data and server locations are transparent to users: the whole Sector network appears to them as a single networked supercomputer. Sector can be categorized as a cloud computing system, and its functionality is comparable to that of Hadoop.

Sector can be viewed as a distributed file system: users can upload, download, read, and write files much as they would on a local file system. Sector provides persistent data storage by automatically replicating files over multiple servers. In addition, Sector places files according to the system topology (it is location-aware) in order to improve access performance and data safety. As a result, Sector can also serve as a content distribution network; it is used, for example, to distribute the 13TB Sloan Digital Sky Survey data.
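The location-aware placement described above can be pictured with a minimal sketch. This is not Sector's actual placement algorithm, which this page does not describe; the server names, the (site, rack) topology labels, and the choose_replica_servers helper are all hypothetical. The sketch simply prefers servers at sites that do not yet hold a replica, so that copies are spread across the topology.

```python
from typing import Dict, List, Tuple

# Hypothetical topology: server name -> (site, rack). Not Sector's real format.
Topology = Dict[str, Tuple[str, str]]


def choose_replica_servers(topology: Topology, num_replicas: int) -> List[str]:
    """Pick servers for replicas, spreading them over distinct sites first."""
    chosen: List[str] = []
    used_sites = set()
    # First pass: take at most one server per site (sorted for determinism).
    for server in sorted(topology):
        site, _rack = topology[server]
        if len(chosen) < num_replicas and site not in used_sites:
            chosen.append(server)
            used_sites.add(site)
    # Second pass: if there are fewer sites than replicas, fill from the rest.
    for server in sorted(topology):
        if len(chosen) < num_replicas and server not in chosen:
            chosen.append(server)
    return chosen


if __name__ == "__main__":
    topology = {
        "chicago-1": ("chicago", "r1"),
        "chicago-2": ("chicago", "r2"),
        "tokyo-1": ("tokyo", "r1"),
        "geneva-1": ("geneva", "r1"),
    }
    print(choose_replica_servers(topology, 3))
```

Spreading replicas over sites first is one plausible way to serve both goals named above: a reader is more likely to find a nearby copy, and the loss of one site does not remove every copy.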

Users can write distributed applications with the Sector client API. Because Sector provides uniform data access across the system, applications do not need to locate or move data themselves. Beyond that, Sector can automatically locate and schedule processors to run user-defined data processing functions in a data-parallel fashion, so there is no need to write code for explicit communication, scheduling, or fault tolerance. This significantly simplifies the development of certain data-intensive distributed applications.
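To illustrate the data-parallel model, here is a minimal sketch written in Python rather than with the Sector client API, which this page does not show. The run_data_parallel helper and the count_words function are hypothetical names, and a local thread pool is an assumption standing in for the processors that Sector would locate and schedule across servers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_data_parallel(udf: Callable[[T], R], segments: List[T], workers: int = 4) -> List[R]:
    """Apply a user-defined function to every data segment independently.

    A local thread pool is an assumption standing in for remote processors;
    the caller writes no communication or scheduling code.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input segments.
        return list(pool.map(udf, segments))


def count_words(segment: str) -> int:
    """Example user-defined function: count whitespace-separated words."""
    return len(segment.split())


if __name__ == "__main__":
    segments = ["sloan digital sky survey", "wide area networks", "sector"]
    print(run_data_parallel(count_words, segments))
```

The user supplies only the per-segment function; everything else is handled by the runtime. The sketch does not attempt fault tolerance, which a real system would add by rerunning failed segments elsewhere.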

The following table briefly compares Sector with Hadoop.

Storage Unit
  Hadoop: Blocks. Better granularity, better disk usage; may reduce performance due to block lookup and movement; may waste disk space for small files.
  Sector: Files. Good performance for lookup and wide area data transfer. Robust (no permanent metadata required). Requires users' knowledge to split files; may waste disk space when disks are near full.

Data Replication
  Hadoop: Real time. Emphasizes data reliability, but slow.
  Sector: Periodic. Favors fast IO with less reliability (but still provides long term replicas).

Programming Model
  Hadoop: MapReduce.
  Sector: Stream processing paradigm. MapReduce support is planned for the near future.

Programming Language
  Hadoop: System written in Java. Native programming language is Java, but supports any executable through Hadoop Streaming.
  Sector: System written in C++. Native programming language is C++, but any program can be called by Sphere for data processing.

Data Transfer and Message Passing
  Hadoop: TCP. Inefficient over wide area; sometimes requires parameter tuning.
  Sector: UDP/UDT. High performance, firewall friendly, more secure, and tuning-free.
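
The periodic replication listed in the comparison can be pictured as a scan that runs on a timer and reports which files have too few copies. The sketch below is an illustration only, not Sector's implementation; the find_under_replicated helper, the snapshot format, the file paths, and the server names are all hypothetical.

```python
from typing import Dict, List


def find_under_replicated(replicas: Dict[str, List[str]], target: int) -> Dict[str, int]:
    """Return, for each file below the target, how many replicas are missing."""
    missing = {}
    for path, servers in replicas.items():
        # Count distinct servers so a duplicate entry is not counted twice.
        shortfall = target - len(set(servers))
        if shortfall > 0:
            missing[path] = shortfall
    return missing


if __name__ == "__main__":
    # Hypothetical snapshot of where each file currently lives.
    snapshot = {
        "/sdss/a.fits": ["s1", "s2", "s3"],
        "/sdss/b.fits": ["s1"],
        "/sdss/c.fits": ["s2", "s2"],
    }
    # A periodic task would run this scan on a timer and then copy files.
    print(find_under_replicated(snapshot, target=2))
```

Because the scan runs separately from the write path, a write can complete before extra copies exist, which matches the trade-off stated in the comparison: faster IO, with replicas catching up later.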

Sector for SDSS

Sector has already been used to distribute the Sloan Digital Sky Survey (SDSS) data, 13TB in total. The SDSS-Sector server network runs over the 10GE wide area Teraflow Network, and astronomers around the world use Sector to access the data sets. For more information, please visit sdss.ncdm.uic.edu.


UDT

UDT is another project that we have developed. Sector uses UDT for high speed data transfer between Sector servers and between Sector servers and clients. UDT is an application level data transfer protocol built on top of UDP. It is reliable, connection oriented, and can be used in shared network environments. For more information about UDT, please visit udt.sf.net.

©2008 National Center for Data Mining