System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Yes, I made custom distributed inception code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 14.04
TensorFlow installed from (source or binary): Unmodified source with RDMA Verbs enabled
TensorFlow version (use command below): 1.3.0-rc1
Python version: 2.7.12
Bazel version (if compiling from source): 0.5.1
CUDA/cuDNN version: 8.0/5.1.5
GPU model and memory: NVIDIA TITAN Xp PCIe 12GB (4 per node)
The code works for normal grpc, but stuck between some nodes, not between all nodes.
I've tested all nodes with ib_write_bw and ibv_rc_pingpong, communication between all of the nodes works fine.
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Yes, I made custom distributed inception code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 14.04
TensorFlow installed from (source or binary): Unmodified source with RDMA Verbs enabled
TensorFlow version (use command below): 1.3.0-rc1
Python version: 2.7.12
Bazel version (if compiling from source): 0.5.1
CUDA/cuDNN version: 8.0/5.1.5
GPU model and memory: NVIDIA TITAN Xp PCIe 12GB (4 per node)
The code works for normal grpc, but stuck between some nodes, not between all nodes.
I've tested all nodes with ib_write_bw and ibv_rc_pingpong, communication between all of the nodes works fine.