RDMA+verbs stuck in some nodes · Issue #12035 · tensorflow/tensorflow · GitHub
Skip to content

RDMA+verbs stuck in some nodes #12035

Description

@sj6077

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Yes, I made custom distributed inception code
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 14.04
TensorFlow installed from (source or binary): Unmodified source with RDMA Verbs enabled
TensorFlow version (use command below): 1.3.0-rc1
Python version: 2.7.12
Bazel version (if compiling from source): 0.5.1
CUDA/cuDNN version: 8.0/5.1.5
GPU model and memory: NVIDIA TITAN Xp PCIe 12GB (4 per node)

The code works for normal grpc, but stuck between some nodes, not between all nodes.
I've tested all nodes with ib_write_bw and ibv_rc_pingpong, communication between all of the nodes works fine.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions