distributed training with SyncReplicasOptimizer got stuck after a number of iterations · Issue #20342 · tensorflow/tensorflow · GitHub
distributed training with SyncReplicasOptimizer got stuck after a number of iterations #20342

Description

@codescv

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Python version: 3.6.5
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA
  • Exact command to reproduce: NA

Describe the problem

I am running distributed training using SyncReplicasOptimizer. After about 10k iterations, the workers get stuck and CPU usage drops to 0 percent.

The arguments for SyncReplicasOptimizer:
replicas_to_aggregate = 60, total_num_replicas = 64 (I have 64 workers)

It may also be worth noting that the hang occurs after 27 of the workers have finished their training data.
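As a quick sanity check on these numbers, the sketch below compares the workers still running with replicas_to_aggregate. It is only arithmetic on the values reported above. It does not model how SyncReplicasOptimizer hands out tokens, and it is an assumption, not a confirmed diagnosis, that the shortfall is related to the hang. The function names are hypothetical helpers, not TensorFlow APIs.

```python
# Values taken from the report above.
TOTAL_NUM_REPLICAS = 64      # total_num_replicas (number of workers)
REPLICAS_TO_AGGREGATE = 60   # replicas_to_aggregate
FINISHED_WORKERS = 27        # workers that had run out of training data


def live_workers(total, finished):
    """Number of workers that are still running."""
    return total - finished


def shortfall(total, finished, to_aggregate):
    """How far the live worker count falls below replicas_to_aggregate.

    Assumption for illustration only: each live worker is counted once.
    The real optimizer's token queue is not modelled here.
    """
    return max(to_aggregate - live_workers(total, finished), 0)


if __name__ == "__main__":
    live = live_workers(TOTAL_NUM_REPLICAS, FINISHED_WORKERS)
    gap = shortfall(TOTAL_NUM_REPLICAS, FINISHED_WORKERS, REPLICAS_TO_AGGREGATE)
    print("live workers:", live)
    print("below replicas_to_aggregate by:", gap)
```

With the reported values, 37 workers are still running, which is 23 fewer than replicas_to_aggregate. On this simple count, at most 4 workers could finish before the live count drops below 60.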

Connecting to one of the stuck worker processes using gdb I get the following backtraces:

#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f5813609de4 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f58136095b1 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f5813606af4 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f5813607015 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f58111a4b23 in tensorflow::(anonymous namespace)::WaitForNotification(tensorflow::CallOptions*, long long, tensorflow::Notification*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f58111a54ab in tensorflow::LocalMaster::RunStep(tensorflow::CallOptions*, tensorflow::RunStepRequestWrapper*, tensorflow::MutableRunStepResponseWrapper*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f5811187256 in tensorflow::GrpcSession::RunProto(tensorflow::CallOptions*, tensorflow::MutableRunStepRequestWrapper*, tensorflow::MutableRunStepResponseWrapper*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f581118767d in tensorflow::GrpcSession::RunHelper(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, std::string const&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f5811187ceb in tensorflow::GrpcSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f5811468dba in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f58114699b6 in TF_SessionRun () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f5811119256 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f581111939a in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f58110d5b3e in _wrap_TF_SessionRun_wrapper () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
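The gdb backtrace shows only the native frames: the worker is blocked inside a session run call, waiting on a notification. A Python-level view of the same stuck process can complement it. Below is a minimal sketch using the standard-library faulthandler module, assuming Linux workers and that SIGUSR1 is not otherwise used by the training script. The helper name install_stack_dump_handler is hypothetical and not part of TensorFlow.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(signum=signal.SIGUSR1, stream=sys.stderr):
    """Dump the Python stacks of all threads to `stream` when `signum` arrives.

    faulthandler writes from a C-level signal handler, so the dump is
    produced even while the main thread is blocked inside native code.
    `stream` must stay open for as long as the handler is registered.
    """
    faulthandler.register(signum, file=stream, all_threads=True, chain=False)
    return signum


if __name__ == "__main__":
    # Call once near the start of each worker process.
    install_stack_dump_handler()
    print("send SIGUSR1 to this process to dump Python stacks to stderr")
```

With the handler installed, sending SIGUSR1 to a stuck worker (for example with kill -USR1 and the worker's process ID) prints the Python traceback of every thread to stderr, which shows which Python call the worker is waiting in.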

Any ideas? Thanks!

Labels

comp:ops, stale, stat:awaiting response, type:bug
