Project Adam: Building an Efficient and Scalable Deep Learning Training System

Author: 世间五彩我执纯白 | Published 2016-04-06 15:51

1. Introduction

main contributions:

  • Optimizing and balancing both computation and communication through whole system co-design
  • Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
  • Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

2. System architecture

  • The shared model is updated asynchronously through a parameter server (a minimal sketch of this pattern follows below)
  • Adam is a general-purpose system, because SGD can train any DNN model that learns via backpropagation
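
A minimal, illustrative sketch of this asynchronous pattern (the class and function names are assumptions, not Adam's actual code): each worker pulls a possibly stale copy of the global weights, computes a gradient on its own sample, and pushes the update to the parameter server, which applies it immediately without coordinating with the other workers.

```python
# Minimal sketch of asynchronous data-parallel SGD through a parameter server.
# All names (ParameterServer, worker_loop, compute_gradient) are illustrative.
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)   # global shared model
        self.lr = lr

    def push(self, grad):
        # Updates are applied as they arrive; there is no barrier across
        # workers, so stale or overlapping updates are simply tolerated.
        self.w -= self.lr * grad

    def pull(self):
        return self.w.copy()     # workers fetch a possibly stale snapshot

def compute_gradient(w, x, y):
    # Toy least-squares gradient, a stand-in for backprop through a DNN.
    return 2 * x * (np.dot(w, x) - y)

def worker_loop(ps, data, steps=100):
    for _ in range(steps):
        x, y = data[np.random.randint(len(data))]
        w = ps.pull()                        # read current global weights
        ps.push(compute_gradient(w, x, y))   # send the update asynchronously

ps = ParameterServer(dim=4)
data = [(np.random.randn(4), np.random.randn()) for _ in range(64)]
threads = [threading.Thread(target=worker_loop, args=(ps, data)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```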

2.1. Fast data server

  • A number of machines act as data serving machines that pre-process and serve training data, reducing the load on the model training machines (a rough producer/consumer sketch follows)
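
A rough producer/consumer sketch of offloading data preparation (names, tensor shapes, and the queue size are illustrative assumptions): a background thread keeps a bounded queue filled with prepared batches, so the training loop only dequeues data that is already ready.

```python
# Sketch of moving data preparation onto a background producer so training
# threads never block on image loading and transformation. Illustrative only.
import threading
import queue
import numpy as np

batch_queue = queue.Queue(maxsize=8)   # bounded buffer of ready batches

def data_server(num_batches, batch_size=32):
    for _ in range(num_batches):
        # Stand-in for fetching images and applying transformations.
        batch = np.random.rand(batch_size, 3, 32, 32).astype(np.float32)
        batch_queue.put(batch)         # blocks only if trainers fall behind

def trainer(num_batches):
    for _ in range(num_batches):
        batch = batch_queue.get()      # ready-to-use batch, no prep cost here
        _ = batch.mean()               # stand-in for a training step

producer = threading.Thread(target=data_server, args=(16,))
consumer = threading.Thread(target=trainer, args=(16,))
producer.start(); consumer.start()
producer.join(); consumer.join()
```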

2.2. Model training

  • We partition our models vertically across the model worker machines; vertical partitioning minimizes cross-machine communication for the convolutional layers
  • Each machine runs multiple threads that share a single model replica and update the local shared model weights without locks
  • Other optimizations: pass a pointer rather than copying data, and lay out data for cache locality
  • Mitigating stragglers: to keep fast machines from waiting on data from slow machines, threads are allowed to process multiple images in parallel, and an epoch is considered finished once a specified fraction of the images has been processed
  • Communication with the parameter server uses two strategies. For convolutional layers, where weight sharing keeps the parameter volume small, workers accumulate updates locally and periodically send them to the PS, which adds them directly to the global parameters. For fully connected layers, whose weight matrices are much larger, workers instead send the activation and error gradient vectors to the PS, and the matrix multiplication is performed on the PS itself, which reduces communication cost (see the sketch after this list).
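
The fully connected layer trick can be made concrete with a small numeric sketch (dimensions and variable names are made up for illustration): the weight update of an FC layer is the outer product of the output-side error gradient and the input-side activation, so it suffices to transmit the two vectors and let the parameter server form the product locally.

```python
# Sketch: for a fully connected layer, the weight update is the outer product
# of the error gradient (output side) and the activation (input side), so it
# is enough to send the two vectors. Dimensions and names are illustrative.
import numpy as np

n_in, n_out = 2048, 2048
lr = 0.01

W = np.zeros((n_out, n_in))              # global FC weights held on the PS
activation = np.random.randn(n_in)       # worker-side input activations
error_grad = np.random.randn(n_out)      # worker-side output error gradient

# Naive protocol: the worker builds and ships the full update matrix.
full_update = np.outer(error_grad, activation)     # n_out * n_in values

# Vector protocol: the worker sends only the two vectors ...
values_sent = activation.size + error_grad.size    # n_in + n_out values
# ... and the PS computes the outer product locally and applies it.
W -= lr * np.outer(error_grad, activation)

print("values sent, full matrix:", full_update.size)   # ~4.2 million
print("values sent, vectors only:", values_sent)       # 4096
```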

2.3. Global parameter server

  • Hash-based storage: parameter shards are hashed into storage buckets that are distributed equally among the parameter server machines (a sketch follows this list)
  • Batched updates: all updates in a batch are applied to one block of parameters before moving on to the next block in the shard
  • Lock-free operation: lock-free data structures are used for queues and hash tables, and memory allocation is lock-free as well
  • Inconsistency: DNN models are capable of learning even in the presence of small amounts of lost updates, and they also tolerate a small number of delayed updates
  • Fault tolerance: each parameter shard is kept in three copies; the primary uses a two-phase commit protocol when sending updates to the secondary machines (a toy sketch of the commit flow closes this section)
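
A toy sketch of the bucket placement (the hash function, bucket count, and round-robin assignment are assumptions for illustration, not the paper's exact scheme): shard ids are hashed into a fixed set of buckets, and the buckets are spread evenly over the parameter server machines.

```python
# Toy sketch of distributing parameter shards over PS machines via hashing.
# Hash scheme, bucket count, and machine count are illustrative assumptions.
import hashlib

NUM_BUCKETS = 64
NUM_PS_MACHINES = 4

def bucket_of(shard_id: str) -> int:
    # Stable hash of the shard id into a fixed number of storage buckets.
    h = int(hashlib.md5(shard_id.encode()).hexdigest(), 16)
    return h % NUM_BUCKETS

def machine_of(bucket: int) -> int:
    # Buckets are assigned to PS machines round-robin, so every machine
    # owns an equal share of the buckets.
    return bucket % NUM_PS_MACHINES

for shard in ["conv1/W", "conv2/W", "fc6/W", "fc7/W", "softmax/W"]:
    b = bucket_of(shard)
    print(shard, "-> bucket", b, "-> PS machine", machine_of(b))
```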

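A toy sketch of the replication path from the fault tolerance bullet (class and message names are illustrative; this is a generic two-phase commit, not necessarily Adam's exact protocol): the primary first asks every secondary to stage the update, and commits it everywhere only after all secondaries have acknowledged.

```python
# Toy sketch of pushing a parameter update from the primary replica to its
# secondaries with a two-phase commit. All names are illustrative assumptions.

class Secondary:
    def __init__(self):
        self.committed = {}   # shard_id -> committed parameter value
        self.pending = {}     # updates staged but not yet committed

    def prepare(self, shard_id, update):
        self.pending[shard_id] = update
        return True           # vote "yes"; a real replica could refuse

    def commit(self, shard_id):
        base = self.committed.get(shard_id, 0.0)
        self.committed[shard_id] = base + self.pending.pop(shard_id)

    def abort(self, shard_id):
        self.pending.pop(shard_id, None)

class Primary:
    def __init__(self, secondaries):
        self.secondaries = secondaries

    def replicate(self, shard_id, update):
        # Phase 1: ask every secondary to stage the update.
        votes = [s.prepare(shard_id, update) for s in self.secondaries]
        # Phase 2: commit only if all voted yes, otherwise abort everywhere.
        if all(votes):
            for s in self.secondaries:
                s.commit(shard_id)
            return True
        for s in self.secondaries:
            s.abort(shard_id)
        return False

secondaries = [Secondary(), Secondary()]
primary = Primary(secondaries)
primary.replicate("fc6/W[block0]", 0.25)
print([s.committed for s in secondaries])
```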