arXiv Analytics

arXiv:1902.00465 [cs.LG]

TF-Replicator: Distributed Machine Learning for Researchers

Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, Fabio Viola, Dan Belov

Published 2019-02-01 (Version 1)

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).
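The synchronous data-parallel regime the abstract refers to can be sketched in a few lines of plain Python. This is a hypothetical toy illustration, not TF-Replicator's actual API: each "replica" computes a gradient on its own data shard, gradients are averaged by a stand-in for the cross-replica all-reduce collective, and every replica applies the same update. The functions `grad`, `all_reduce_mean`, and `train` are illustrative names invented here.

```python
def grad(w, shard):
    # Gradient of mean squared error for a 1-D linear model y_hat = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Stand-in for the cross-replica all-reduce collective that a real
    # framework would run over the network or accelerator interconnect.
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.1, steps=50):
    for _ in range(steps):
        local_grads = [grad(w, shard) for shard in shards]  # per-replica step
        g = all_reduce_mean(local_grads)                    # synchronize replicas
        w -= lr * g                                         # identical update everywhere
    return w

# Two "replicas", each holding a shard of data generated from y = 3 * x,
# so training should recover a weight close to 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

Because every replica sees the same averaged gradient, all copies of the model stay in lockstep; abstracting that averaging step away is precisely what lets the same single-machine-style code run across many devices.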

Related articles:
arXiv:1912.09789 [cs.LG] (Published 2019-12-20)
A Survey on Distributed Machine Learning
arXiv:1310.5426 [cs.LG] (Published 2013-10-21, updated 2013-10-25)
MLI: An API for Distributed Machine Learning
arXiv:1810.03264 [cs.LG] (Published 2018-10-08)
Toward Understanding the Impact of Staleness in Distributed Machine Learning