[ELECTRA / TF2] One GPU-Util falls 0% periodically when using multi-GPU Training with horovod

Related to **Model/Framework(s)** 
**ELECTRA / TF2**

**Describe the bug**
First, there is no problem with 4x **A6000** GPUs (no NVLink).
But with 2x **A100** GPUs (with NVLink), the GPU-Util of one GPU (in my case GPU #1, not #0) periodically falls to 0% during multi-GPU training in the `nvcr.io/nvidia/tensorflow:YY.MM-tf2-py3` docker image.

There is no problem with single-GPU training, so I think all of the GPU hardware is fine.
I also swapped the two GPUs with each other, but the problem remained (the physical GPU is different, but GPU #1 still does not work properly).
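
To show when the drop happens, the per-GPU utilization can be logged while training runs. Below is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and reports numeric utilization values; the function names and the 5% idle threshold are illustrative choices, not part of the ELECTRA scripts.

```python
import subprocess
import time

# nvidia-smi query that prints one "index, utilization" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(text):
    """Parse lines like '1, 0' into {gpu_index: utilization_percent}."""
    util = {}
    for line in text.strip().splitlines():
        index, percent = (field.strip() for field in line.split(","))
        util[int(index)] = int(percent)
    return util


def idle_gpus(util, threshold=5):
    """Return indices of GPUs at or below the (assumed) idle threshold."""
    return sorted(i for i, p in util.items() if p <= threshold)


def poll(interval_s=1.0, samples=60):
    """Print a timestamped utilization sample every interval_s seconds."""
    for _ in range(samples):
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        util = parse_utilization(out)
        print(time.strftime("%H:%M:%S"), util, "idle:", idle_gpus(util))
        time.sleep(interval_s)
```

Calling `poll()` in a second shell during multi-GPU training should list GPU #1 under `idle:` each time its utilization drops.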

I also suspected a CUDA version mismatch (driver = 11.4, docker = 11.7 with `nvcr.io/nvidia/tensorflow:22.04-tf2-py3`).
So I tested docker images with CUDA 11.4 (`21.07`) and 11.3 (`21.06`), but the problem remained there too.
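
The version comparison above can be sketched as follows. This is only an illustration: the container-to-CUDA mapping repeats the versions stated in this issue and has not been checked against the container release notes, and the helper names are hypothetical.

```python
import re

# CUDA toolkit version per container tag, as reported in this issue
# (assumption: not verified against the NGC release notes).
CONTAINER_CUDA = {"22.04": "11.7", "21.07": "11.4", "21.06": "11.3"}


def parse_smi_header(header):
    """Extract (driver_version, cuda_version) from the nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", header)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", header)
    if not driver or not cuda:
        raise ValueError("not an nvidia-smi banner line")
    return driver.group(1), cuda.group(1)


def version_tuple(version):
    """Turn '11.4' into (11, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def toolkit_newer_than_driver(driver_cuda, toolkit_cuda):
    """True if the container toolkit is newer than the CUDA version the driver reports."""
    return version_tuple(toolkit_cuda) > version_tuple(driver_cuda)


def report(header):
    """Print, for each tested container, whether its toolkit exceeds the driver's CUDA version."""
    _, driver_cuda = parse_smi_header(header)
    for tag, toolkit in CONTAINER_CUDA.items():
        newer = toolkit_newer_than_driver(driver_cuda, toolkit)
        print(f"{tag}: toolkit {toolkit}, driver CUDA {driver_cuda}, newer={newer}")
```

Passing the banner line from the `nvidia-smi` output to `report()` marks only `22.04` as newer than the driver, which matches the observation that the problem also appears with the containers that are not newer.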

**To Reproduce**
Same as the [Quick Start Guide](https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/LanguageModeling/ELECTRA/README.md#quick-start-guide)

**Expected behavior**
GPU-Util of all GPUs stays steadily at 100%, as it does with the A6000 GPUs

**Environment**
Please provide at least:
* Container version (e.g. pytorch:19.05-py3): I tested `nvcr.io/nvidia/tensorflow:22.04-tf2-py3`, `nvcr.io/nvidia/tensorflow:21.07-tf2-py3`, and `nvcr.io/nvidia/tensorflow:21.06-tf2-py3`
* GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): `2x NVIDIA A100 80GB`
* CUDA driver version (e.g. 418.67): `470.103.01`
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   70C    P0   120W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   55C    P0    87W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

[ELECTRA / TF2] One GPU-Util falls 0% periodically when using multi-GPU Training with horovod #1276
