ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters

Hadoop clusters are the technology of choice for big data analytics, and their performance is critical. Currently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capabilities of those nodes. We propose a new scheduler, which we call the ThroughputScheduler, that reduces overall job completion time on a heterogeneous cluster by actively assigning tasks to server nodes based on server capabilities and the resource requirements of each job's tasks. Server capabilities are learned by running probe jobs on the cluster. An efficient active learning scheme based on Bayes' rule is derived to learn the resource requirements of Hadoop tasks online. An empirical evaluation on a simple problem demonstrates that the ThroughputScheduler can reduce total job completion time by almost 20% compared to the Hadoop Fair scheduler and by 40% compared to the Hadoop FIFO scheduler. The ThroughputScheduler also reduces average mapping time by 33% compared to both existing schedulers.
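The idea of matching task requirements to server capabilities can be illustrated with a minimal sketch. It is not the paper's model: the two-resource split (CPU and I/O), the additive time estimate, and all names (`Node`, `Job`, `estimated_task_time`, `best_node`) are assumptions made for illustration only, and the capability and requirement values are given by hand here rather than learned from probe jobs or online.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical server description; rates are work units per second."""
    name: str
    cpu_rate: float
    io_rate: float


@dataclass(frozen=True)
class Job:
    """Hypothetical per-task resource requirements, in work units."""
    name: str
    cpu_work: float
    io_work: float


def estimated_task_time(job: Job, node: Node) -> float:
    # Assumption: task time is the sum of CPU time and I/O time,
    # each being required work divided by the node's rate.
    return job.cpu_work / node.cpu_rate + job.io_work / node.io_rate


def best_node(job: Job, free_nodes: list[Node]) -> Node:
    # Pick the free node with the lowest estimated time for this job's task.
    return min(free_nodes, key=lambda n: estimated_task_time(job, n))


# Two illustrative nodes with opposite strengths.
fast_cpu = Node("fast_cpu", cpu_rate=4.0, io_rate=1.0)
fast_io = Node("fast_io", cpu_rate=1.0, io_rate=4.0)

cpu_heavy = Job("cpu_heavy", cpu_work=8.0, io_work=1.0)
io_heavy = Job("io_heavy", cpu_work=1.0, io_work=8.0)

print(best_node(cpu_heavy, [fast_cpu, fast_io]).name)
print(best_node(io_heavy, [fast_cpu, fast_io]).name)
```

A scheduler that ignores capabilities could place the CPU-heavy task on the I/O-optimized node; under the assumed estimate above, that placement takes 8.25 time units instead of 3.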

Gupta, S.; Fritz, C.; Price, R.; Hoover, R.; de Kleer, J.; Witteveen, C. ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters. International Conference on Autonomic Computing (ICAC '13); 2013 June 26-28; San Jose, CA USA.