
TECHNICAL PUBLICATIONS:

ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters

 
Hadoop clusters are the technology of choice for big data analytics, and the performance of these clusters is critical. Currently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capabilities of those nodes. We propose a new scheduler, which we call the ThroughputScheduler, that reduces overall job completion time on a heterogeneous cluster by actively assigning tasks to server nodes based on server capabilities and the resource requirements of job tasks. Server capabilities are learned by running probe jobs on the cluster. An efficient active learning scheme based on Bayes' rule is derived to learn the resource requirements of Hadoop tasks online. An empirical evaluation on a simple problem demonstrates that the ThroughputScheduler can reduce total job completion time by almost 20% compared with the Hadoop Fair scheduler and by 40% compared with the Hadoop FIFO scheduler. The ThroughputScheduler also reduces average mapping time by 33% compared with both existing schedulers.
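
As a rough illustration of capability-aware assignment, the sketch below matches pending jobs to a node using a simple additive cost model. It is not the paper's algorithm: the cost model, the relative-cost rule, and all names (Node, Job, cpu_rate, io_rate, cpu_work, io_work, pick_job_for_node) are assumptions made for this example, and the capabilities and requirements are given as fixed numbers rather than learned from probe jobs or by Bayesian updating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpu_rate: float  # assumed: units of CPU work the node completes per second
    io_rate: float   # assumed: units of I/O work the node completes per second


@dataclass(frozen=True)
class Job:
    name: str
    cpu_work: float  # assumed: CPU work required by one task of this job
    io_work: float   # assumed: I/O work required by one task of this job


def estimated_task_time(job, node):
    """Assumed additive cost model: CPU time plus I/O time for one task."""
    return job.cpu_work / node.cpu_rate + job.io_work / node.io_rate


def pick_job_for_node(node, pending_jobs, all_nodes):
    """Return the pending job for which `node` is relatively best suited.

    Suitability is the task time on this node divided by the mean task
    time of the same job across all nodes; lower is better.
    """
    def relative_cost(job):
        here = estimated_task_time(job, node)
        mean = sum(estimated_task_time(job, n) for n in all_nodes) / len(all_nodes)
        return here / mean

    return min(pending_jobs, key=relative_cost)


# Hypothetical two-node cluster and two-job queue.
fast_cpu = Node("fast_cpu", cpu_rate=4.0, io_rate=1.0)
fast_io = Node("fast_io", cpu_rate=1.0, io_rate=4.0)
nodes = [fast_cpu, fast_io]

cpu_heavy = Job("cpu_heavy", cpu_work=8.0, io_work=1.0)
io_heavy = Job("io_heavy", cpu_work=1.0, io_work=8.0)
jobs = [cpu_heavy, io_heavy]

choice_for_fast_cpu = pick_job_for_node(fast_cpu, jobs, nodes)
choice_for_fast_io = pick_job_for_node(fast_io, jobs, nodes)
```

With these made-up numbers, the CPU-heavy job goes to the node with the faster CPU and the I/O-heavy job to the node with the faster disk. A scheduler that ignores node capabilities could pair them the other way around.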
 
citation

Gupta, S.; Fritz, C.; Price, R.; Hoover, R.; de Kleer, J.; Witteveen, C. ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters. International Conference on Autonomic Computing (ICAC '13); 2013 June 26-28; San Jose, CA, USA.