[ELECTRA / TF2] One GPU's GPU-Util periodically falls to 0% during multi-GPU training with Horovod #1276

@goreng2

Related to Model/Framework(s)
ELECTRA / TF2

Describe the bug
First, there is no problem when using 4x A6000 GPUs (without NVLink).
But with 2x A100 GPUs (with NVLink), the GPU-Util of one GPU (in my case GPU #1, not #0) periodically falls to 0% during multi-GPU training in the nvcr.io/nvidia/tensorflow:YY.MM-tf2-py3 Docker image.

There is no problem with single-GPU training, so I think the GPU hardware itself is fine.
I also swapped the two GPUs with each other, but the problem remained (the physical GPU is different, but GPU #1 still does not work properly).
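
To put numbers on the periodic drop described above, per-GPU utilization can be sampled at a fixed interval while training runs. The following is a minimal sketch, not part of the original report: the function names (`parse_utilization`, `find_idle_gpus`, `monitor`) are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports a numeric utilization for every GPU.

```python
import subprocess
import time

# nvidia-smi query that prints one "index, utilization" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Parse 'index, util' lines into {gpu_index: utilization_percent}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = int(util)
    return result


def find_idle_gpus(sample, threshold=0):
    """Return sorted GPU indices whose utilization is at or below threshold."""
    return sorted(i for i, u in sample.items() if u <= threshold)


def monitor(interval_s=1.0, samples=60):
    """Print a timestamped utilization sample every interval_s seconds."""
    for _ in range(samples):
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        sample = parse_utilization(out)
        print(time.strftime("%H:%M:%S"), sample, "idle:", find_idle_gpus(sample))
        time.sleep(interval_s)
```

Calling `monitor()` in a second shell during training would show how often, and for how long, GPU #1 is reported as idle, which makes the A100 and A6000 runs easier to compare.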

I also suspected a CUDA version mismatch (driver = 11.4, Docker = 11.7 (nvcr.io/nvidia/tensorflow:22.04-tf2-py3)).
So I tested Docker images with CUDA 11.4 (21.07) and 11.3 (21.06), but the problem remained too.
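
The version comparison above can be made explicit by reading the driver and CUDA versions from the `nvidia-smi` header line and comparing them with the CUDA version of each container. This is only an illustrative sketch: the helper names are hypothetical, and the container CUDA versions passed in are the ones stated in this report, not values queried from the images.

```python
import re


def parse_smi_header(header_line):
    """Extract (driver_version, cuda_version) from an nvidia-smi header line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", header_line
    )
    if match is None:
        raise ValueError("not an nvidia-smi header line")
    return match.group(1), match.group(2)


def version_tuple(version):
    """Turn '11.4' into (11, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def container_is_newer(driver_cuda, container_cuda):
    """True if the container's CUDA version is newer than the driver's."""
    return version_tuple(container_cuda) > version_tuple(driver_cuda)
```

With the header from the environment section, `parse_smi_header` gives the driver-side CUDA version, and `container_is_newer` flags only the 22.04 image (11.7 as reported) as newer than the driver; the 21.07 and 21.06 images are not, yet the problem occurs with all three.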

To Reproduce
Same as the Quick Start Guide.

Expected behavior
GPU-Util stays steadily at 100% on all GPUs, as it does with the A6000 GPUs.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): I tested nvcr.io/nvidia/tensorflow:22.04-tf2-py3, nvcr.io/nvidia/tensorflow:21.07-tf2-py3, and nvcr.io/nvidia/tensorflow:21.06-tf2-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 2x NVIDIA A100 80GB
  • CUDA driver version (e.g. 418.67): 470.103.01
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   70C    P0   120W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   55C    P0    87W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Labels: bug
