ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters
Details
June 26-28, 2013; San Jose, CA, USA. Date of talk: June 27, 2013
Hadoop clusters are the technology of choice for big data analytics, and their performance is critical. Presently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capabilities of those nodes. We propose a new scheduler, which we call ThroughputScheduler, that reduces overall job completion time on a heterogeneous cluster by actively assigning tasks to server nodes based on server capabilities and the resource requirements of the tasks. Server capabilities are learned by running probe jobs on the cluster, and an efficient Bayesian active learning scheme is derived to learn the resource requirements of Hadoop tasks online. An empirical evaluation on a simple problem demonstrates that ThroughputScheduler can reduce total job completion time by almost 20% over the Hadoop Fair Scheduler and 40% over the Hadoop FIFO scheduler. ThroughputScheduler also reduces average mapping time by 33% compared to both existing schedulers.
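To make the capability-aware assignment idea concrete, here is a minimal sketch of matching each task's resource requirements to the learned capabilities of free nodes. It is an illustration only, not the scheduler's actual implementation: the class and method names, the additive CPU-plus-IO cost model, and the greedy one-task-per-free-slot assignment are all assumptions for the example.

import java.util.*;

/** Hypothetical sketch: assign each pending task to the free node with the
 *  lowest estimated completion time, given per-node capabilities (learned
 *  from probe jobs) and per-task resource requirements (learned online). */
public class ThroughputSchedulerSketch {

    /** Node capability: work units the node can process per second. */
    record Node(String id, double cpuRate, double ioRate) {}

    /** Task requirement: estimated CPU and IO work units. */
    record Task(String id, double cpuWork, double ioWork) {}

    /** Estimated completion time of a task on a node (assumed additive cost model). */
    static double estimate(Task t, Node n) {
        return t.cpuWork() / n.cpuRate() + t.ioWork() / n.ioRate();
    }

    /** Greedily assign each task to the free node minimizing its estimated time. */
    static Map<String, String> schedule(List<Task> tasks, List<Node> freeNodes) {
        Map<String, String> assignment = new LinkedHashMap<>();
        List<Node> available = new ArrayList<>(freeNodes);
        for (Task t : tasks) {
            if (available.isEmpty()) break;   // no free slots left
            Node best = Collections.min(available,
                    Comparator.comparingDouble((Node n) -> estimate(t, n)));
            assignment.put(t.id(), best.id());
            available.remove(best);           // one task per free slot
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("fast-cpu", 100.0, 20.0),
                new Node("fast-io", 40.0, 80.0));
        List<Task> tasks = List.of(
                new Task("cpu-bound-map", 500.0, 50.0),
                new Task("io-bound-map", 100.0, 400.0));
        System.out.println(schedule(tasks, nodes));
        // prints {cpu-bound-map=fast-cpu, io-bound-map=fast-io}
    }
}

In this toy run the CPU-bound map task lands on the CPU-rich node and the IO-bound map task on the IO-rich node, which is the kind of heterogeneity-aware placement the talk's scheduler aims for.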