Description
Related to Model/Framework(s)
ELECTRA / TF2
Describe the bug
First, there is no problem when training on 4x A6000 GPUs (no NVLink).
But on 2x A100 GPUs (with NVLink), the utilization of one GPU (in my case GPU #1, not #0) periodically drops to 0% during multi-GPU training on the nvcr.io/nvidia/tensorflow:YY.MM-tf2-py3 docker images.
Single-GPU training works fine, so I believe the GPU hardware itself is healthy.
I also swapped the two GPUs, but the problem remained: the physical card is now different, yet the GPU in slot #1 still does not work properly.
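For reference, this is roughly how the drop can be observed. A minimal monitoring sketch using the NVML Python bindings (pynvml), assuming they are installed in the container (e.g. via `pip install nvidia-ml-py3`); the 1-second polling interval is an arbitrary choice, not from the original run:

```python
# Sketch: log per-GPU utilization over time to catch the periodic 0% drops.
# Assumes the pynvml bindings are available in the container.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print(time.strftime("%H:%M:%S"),
              "  ".join(f"GPU{i}: {u:3d}%" for i, u in enumerate(utils)))
        time.sleep(1)  # 1-second polling interval (arbitrary)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```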
I also suspected a CUDA version mismatch (driver = 11.4, container = 11.7 in nvcr.io/nvidia/tensorflow:22.04-tf2-py3),
so I tested containers with CUDA 11.4 (21.07) and CUDA 11.3 (21.06), but the problem remained.
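As a sanity check on the version question, a small diagnostic sketch (my own, not part of the repository) that prints the host driver version and the CUDA/cuDNN versions TensorFlow was built against inside the container:

```python
# Sketch: compare the host driver version with the container's CUDA build.
# Uses only standard TF and NVML APIs.
import tensorflow as tf
import pynvml

pynvml.nvmlInit()
driver = pynvml.nvmlSystemGetDriverVersion()
build = tf.sysconfig.get_build_info()

print("NVIDIA driver:", driver)
print("TF built with CUDA:", build.get("cuda_version"))
print("TF built with cuDNN:", build.get("cudnn_version"))
pynvml.nvmlShutdown()
```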
To Reproduce
Follow the same steps as in the Quick Start Guide.
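To help isolate the problem, here is a standalone sketch that is independent of the ELECTRA scripts: a tiny two-GPU synchronous training loop with tf.distribute.MirroredStrategy (my own substitute test, not the repository's launch command); model size, batch size, and step counts are arbitrary:

```python
# Standalone isolation test (not the ELECTRA Quick Start): a small two-GPU
# synchronous Keras training run, just to see whether the per-GPU utilization
# drop also shows up outside the ELECTRA code.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # default cross-device op is NCCL all-reduce
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4096, activation="relu", input_shape=(1024,)),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = np.random.rand(8192, 1024).astype("float32")
y = np.random.randint(0, 10, size=(8192,))

# Watch nvidia-smi (or the pynvml logger above) while this runs.
model.fit(x, y, batch_size=256, epochs=20, verbose=2)
```

If utilization on GPU #1 also collapses here, the issue is more likely below the ELECTRA code (NCCL / NVLink / driver) than in the model scripts themselves.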
Expected behavior
Utilization on all GPUs should stay steadily near 100%, as it does when training on the A6000 GPUs.
Environment
Please provide at least:
- Container version (e.g. pytorch:19.05-py3): I tested nvcr.io/nvidia/tensorflow:22.04-tf2-py3, nvcr.io/nvidia/tensorflow:21.07-tf2-py3, nvcr.io/nvidia/tensorflow:21.06-tf2-py3
- GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB): 2x NVIDIA A100 80GB
- CUDA driver version (e.g. 418.67): 470.103.01
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:31:00.0 Off | 0 |
| N/A 70C P0 120W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:4B:00.0 Off | 0 |
| N/A 55C P0 87W / 300W | 0MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+