ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters

Details

2013 June 26-28; San Jose, CA USA. Date of Talk: 6/27/2013

ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters

Hadoop clusters are the technology of choice for big data analytics, and the performance of these clusters is critical. Presently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capabilities of the nodes. We propose a new scheduler, which we call the Throughput Scheduler, that reduces overall job completion time on a heterogeneous cluster by actively assigning tasks to server nodes based on server capabilities and the resource requirements of each job's tasks. Server capabilities are learned by running probe jobs on the cluster. An efficient active learning scheme based on Bayes' rule is derived to learn the resource requirements of Hadoop tasks online. An empirical evaluation on a simple problem demonstrates that the Throughput Scheduler can reduce total job completion time by almost 20% compared to the Hadoop Fair Scheduler and by 40% compared to the Hadoop FIFO scheduler. The Throughput Scheduler also reduces average mapping time by 33% compared to both existing schedulers.
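
The idea of matching tasks to nodes by capability can be illustrated with a small sketch. This is not the scheduler from the talk: the two-resource cost model, the relative-time heuristic, and all names (Node, Job, estimated_task_time, pick_job) are assumptions made for illustration, and the capability and requirement values are supplied by hand rather than learned from probe jobs or Bayesian updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpu_rate: float  # hypothetical: units of CPU work per second
    io_rate: float   # hypothetical: units of I/O work per second


@dataclass(frozen=True)
class Job:
    name: str
    cpu_work: float  # hypothetical: CPU work per task
    io_work: float   # hypothetical: I/O work per task


def estimated_task_time(job: Job, node: Node) -> float:
    """Assumed cost model: CPU time plus I/O time, with no overlap."""
    return job.cpu_work / node.cpu_rate + job.io_work / node.io_rate


def pick_job(free_node: Node, pending_jobs: list, all_nodes: list) -> Job:
    """Pick the pending job that the free node runs fastest relative to
    the cluster-wide mean, so each node favors work it is well suited to."""
    def relative_time(job: Job) -> float:
        mean_time = sum(estimated_task_time(job, n) for n in all_nodes) / len(all_nodes)
        return estimated_task_time(job, free_node) / mean_time

    return min(pending_jobs, key=relative_time)


# Illustrative cluster: one CPU-strong node and one disk-strong node.
nodes = [
    Node("fast-cpu", cpu_rate=4.0, io_rate=1.0),
    Node("fast-disk", cpu_rate=1.0, io_rate=4.0),
]
jobs = [
    Job("cpu-heavy", cpu_work=8.0, io_work=1.0),
    Job("io-heavy", cpu_work=1.0, io_work=8.0),
]

for node in nodes:
    print(node.name, "->", pick_job(node, jobs, nodes).name)
```

With these made-up numbers, the CPU-strong node is handed the CPU-heavy job and the disk-strong node the I/O-heavy job, which is the kind of pairing a capability-unaware scheduler would reach only by chance.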
