Description
Related to Model/Framework(s)
ResNet/MXNet
Describe the bug
On a Tesla T4 GPU, increasing the batch size from 192 to 256 with 2 iterations triggers a cudaMalloc out-of-memory error; the trace is below, and a memory-check sketch follows it. FYI: batch size 192 with 500 iterations works fine.
Requirement: run a batch size of 10k with 10 iterations on 100,000 training images.
root@061df1cec673:/workspace/rn50# python3 benchmark.py -n 1 -b 256 --data-root /data/imagenet/train-val-recordio-passthrough/tmp --dtype float16 -o benchmark_report_fp16.json -i 2 -e 1 --mode train
[1,0]:2020-09-17 11:55:59,675:INFO: Start with arguments Namespace(amp=False, arch='resnetv15', batch_size=256, batchnorm_eps=1e-05, batchnorm_layout='NHWC', batchnorm_mom=0.9, begin_epoch=0, benchmark_iters=2, brightness=0, contrast=0, conv_layout='NHWC', dali_fuse_decoder=1, dali_nvjpeg_memory_padding=64, dali_prefetch_queue=2, dali_separ_val=False, dali_threads=3, dali_validation_threads=10, data_backend='dali-gpu', data_mxnet_threads=40, data_pred=None, data_train='/data/imagenet/train-val-recordio-passthrough/tmp/train.rec', data_train_idx='/data/imagenet/train-val-recordio-passthrough/tmp/train.idx', data_val='/data/imagenet/train-val-recordio-passthrough/tmp/val.rec', data_val_idx='/data/imagenet/train-val-recordio-passthrough/tmp/val.idx', data_val_resize=256, disp_batches=20, dtype='float16', fuse_bn_add_relu=1, fuse_bn_relu=1, gpus=[0], image_shape=[4, 224, 224], input_layout='NCHW', kv_store='horovod', label_smoothing=0.1, load=None, log='log.log', lr=0.256, lr_factor=0.256, lr_schedule='cosine', lr_steps=[], max_crop_size=-1, max_random_area=1, max_random_aspect_ratio=1.33, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_crop_size=-1, min_random_area=0.05, min_random_aspect_ratio=0.75, min_random_scale=1, mixup=0, mode='train', model_prefix='model', mom=0.875, no_metrics=True, num_classes=1000, num_epochs=1, num_examples=1281167, num_groups=32, num_layers=50, optimizer='sgd', pca_noise=0, pooling_layout='NHWC', random_crop=0, random_mirror=1, random_resized_crop=1, report='benchmark_report_fp16.json-1,256', rgb_mean=[123.68, 116.779, 103.939], rgb_std=[58.393, 57.12, 57.375], saturation=0, save_frequency=-1, seed=None, test_io=False, test_io_mode='train', warmup_epochs=5, wd=3.0517578125e-05)
[1,0]:2020-09-17 11:56:05,036:WARNING: DALI iterator does not support resetting while epoch is not finished. Ignoring...
[1,0]:2020-09-17 11:56:05,037:INFO: Starting epoch 0
[1,0]:[11:56:05] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:120: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO NET/IB : No device found.
[1,0]:NCCL version 2.4.7+cuda10.1
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO Setting affinity for GPU 0 to aaaaaa,aaaaaaaa,aaaaaaaa
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
[1,0]:061df1cec673:37265:37480 [0] NCCL INFO comm 0x7fd90c270d30 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
[1,0]:Traceback (most recent call last):
[1,0]: File "train.py", line 70, in <module>
[1,0]: fit.fit(args, model, data_loader)
[1,0]: File "/workspace/rn50/fit.py", line 518, in fit
[1,0]: mx.nd.waitall()
[1,0]: File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 166, in waitall
[1,0]: check_call(_LIB.MXNDArrayWaitAll())
[1,0]: File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
[1,0]: raise MXNetError(py_str(_LIB.MXGetLastError()))
[1,0]:mxnet.base.MXNetError: [11:56:09] src/storage/./pooled_storage_manager.h:157: cudaMalloc failed: out of memory
[1,0]:Stack trace:
[1,0]: [bt] (0) /usr/local/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7fdab7c3d783]
[1,0]: [bt] (1) /usr/local/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x21c) [0x7fdaba6ca2ec]
[1,0]: [bt] (2) /usr/local/lib/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x5a) [0x7fdaba6cc73a]
[1,0]: [bt] (3) /usr/local/lib/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x804) [0x7fdab7d3cb74]
[1,0]: [bt] (4) /usr/local/lib/libmxnet.so(mxnet::exec::StorageFallbackOpExecutor::PreFCompute(bool)+0x137a) [0x7fdab9e52a1a]
[1,0]: [bt] (5) /usr/local/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x34) [0x7fdab9e53154]
[1,0]: [bt] (6) /usr/local/lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::imperative::CreateEngineOp(mxnet::Context const&, std::vector<std::shared_ptr<mxnet::exec::OpExecutor>, std::allocator<std::shared_ptr<mxnet::exec::OpExecutor> > > const&)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x15e) [0x7fdab9f6e5be]
[1,0]: [bt] (7) /usr/local/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5b5) [0x7fdaba6ae155]
[1,0]: [bt] (8) /usr/local/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x176) [0x7fdaba6c40f6]
[1,0]:
[1,0]:
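For reference (not part of the original report), the snippet below is a minimal way to check how much device memory is actually free on the T4 and how large one input batch of the failing configuration is. It assumes the container's MXNet exposes mx.context.gpu_memory_info (available from MXNet 1.4 onwards):

import mxnet as mx

# Free vs. total device memory on GPU 0, in GiB.
free_b, total_b = mx.context.gpu_memory_info(0)
print("GPU 0: %.2f GiB free of %.2f GiB total" % (free_b / 2**30, total_b / 2**30))

# Back-of-the-envelope size of one input batch for the failing run:
# batch 256, NHWC, image_shape=[4, 224, 224], float16 (2 bytes per element).
batch, h, w, c, bytes_per_elem = 256, 224, 224, 4, 2
input_mib = batch * h * w * c * bytes_per_elem / 2**20
print("One input batch is ~%.0f MiB" % input_mib)
# The input itself is small (~98 MiB); intermediate activations, cuDNN autotuning
# workspace and the DALI GPU pipeline account for most of the footprint.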
To Reproduce
Steps to reproduce the behavior:
- python3 benchmark.py -n 1 -b 256 --data-root /data/imagenet/train-val-recordio-passthrough/tmp --dtype float16 -o benchmark_report_fp16.json -i 2 -e 1 --mode train
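When reproducing, it may also help to rule out the GPU memory pool itself (the GPUPooledStorageManager::Alloc frame in the trace). The sketch below is only an illustration using MXNet's documented MXNET_GPU_MEM_POOL_TYPE environment variable; it is not part of the benchmark script:

import os

# Must be set before MXNet makes its first GPU allocation.
os.environ['MXNET_GPU_MEM_POOL_TYPE'] = 'Round'  # round-to-power-of-two pooling instead of the default naive pool

import mxnet as mx

# Try one allocation of the failing batch shape under the alternate pool setting.
x = mx.nd.zeros((256, 224, 224, 4), dtype='float16', ctx=mx.gpu(0))
mx.nd.waitall()
print('allocation succeeded:', x.shape)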
Expected behavior
The goal is to run a batch size of 10k with 10 iterations on 100,000 training images.
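A 10k batch will not fit on a single 16 GB T4 in one forward/backward pass, so one common approximation is gradient accumulation: run micro-batches that do fit and apply a single optimizer update per effective 10k batch. The Gluon sketch below is purely illustrative (random data, float32, hypothetical micro_batch/accum_steps values) and is not the benchmark script's code path:

import mxnet as mx
from mxnet import autograd, gluon

# Hypothetical sketch: reach an *effective* batch of 10k on a single T4 by
# accumulating gradients over micro-batches that fit in memory (192 works per the report).
ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.hybridize()

# Accumulate gradients across backward passes instead of overwriting them.
for p in net.collect_params().values():
    if p.grad_req != 'null':      # leave BatchNorm running stats alone
        p.grad_req = 'add'

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.256, 'momentum': 0.875})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

micro_batch = 192                      # assumption: known-good size from the report
accum_steps = 10000 // micro_batch     # ~52 micro-batches per optimizer update

for _ in range(accum_steps):
    data = mx.nd.random.uniform(shape=(micro_batch, 3, 224, 224), ctx=ctx)
    label = mx.nd.random.randint(0, 1000, shape=(micro_batch,), ctx=ctx).astype('float32')
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()

# One update over the whole accumulated (effective) batch, then reset the gradients.
trainer.step(micro_batch * accum_steps)
for p in net.collect_params().values():
    p.zero_grad()
mx.nd.waitall()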
Environment
Please provide at least:
- Container version: MXNet 19.07-py3 NGC container
- GPUs in the system: Tesla T4
- CUDA driver version: 10.1