NVIDIA TITAN RTX Deep Learning Benchmarks 2019 – Performance improvements with XLA, AMP and NVLink in TensorFlow

14 May 2019

Introduction


In this post, we benchmark the latest software and hardware features for deep learning and measure how each one affects training performance.

The software features are AMP (automatic mixed precision) and XLA (Accelerated Linear Algebra).

We also compare performance with and without the NVLink bridge connected, and look at how batch size affects throughput.
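The post does not say how XLA and AMP were switched on and off. As a minimal sketch, assuming the environment-variable switches that TensorFlow 1.14 reads at start-up (`TF_XLA_FLAGS=--tf_xla_auto_jit=2` for XLA auto-clustering and `TF_ENABLE_AUTO_MIXED_PRECISION=1` for the AMP graph rewrite), a launcher could prepare the environment for each run as follows. The helper name `benchmark_env` is hypothetical, and the original benchmark may have used script flags instead.

```python
import os


def benchmark_env(use_xla, use_amp, base_env=None):
    """Return a copy of the environment with the XLA/AMP switches applied.

    Assumption: the training script is run with TensorFlow 1.14, which
    honours these two environment variables.
    """
    env = dict(os.environ if base_env is None else base_env)

    if use_xla:
        # Ask TensorFlow to auto-cluster eligible ops for XLA JIT compilation.
        env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    else:
        env.pop("TF_XLA_FLAGS", None)

    if use_amp:
        # Enable the automatic mixed precision graph rewrite.
        env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    else:
        env.pop("TF_ENABLE_AUTO_MIXED_PRECISION", None)

    return env
```

The returned dictionary can be passed to `subprocess.run(command, env=env)` so that each configuration runs in a clean process.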


Hardware


Test bench:

BIZON G7000 8 GPU deep learning server
More details: https://bizon-tech.com/us/bizon-g7000

Tech specs:
  • CPU: Dual 18-Core 2.30 GHz Intel Xeon Gold 6140 (latest generation).
  • Graphics cards: 4 x NVIDIA TITAN RTX
  • 2 x NVLink bridges
  • RAM: 256 GB DDR4 2666 MHz ECC.
  • Storage: 1 TB PCIe SSD.


Software

  • NVIDIA driver: 418.56
  • CUDA: 10.1
  • Model: ResNet-50 v1.5
  • Data: Synthetic
  • TensorFlow: 1.14
  • XLA, AMP, FP32 & FP16
  • Batch size: 64 and 128
  • Num iterations: 100
  • Mode: Training_benchmark
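
The settings above span several combinations. As a rough sketch, the runs could be enumerated like this, assuming XLA and AMP are toggled independently (the post only says "on / off") and that NVLink is changed by physically connecting or removing the bridge between runs. The function name `build_run_matrix` and the dictionary keys are hypothetical.

```python
from itertools import product


def build_run_matrix(batch_sizes=(64, 128), num_iterations=100):
    """List every benchmark configuration as a plain dictionary.

    Assumption: XLA and AMP are independent switches. NVLink is a
    hardware state, recorded here only so results can be labelled.
    """
    runs = []
    for nvlink, batch_size, xla, amp in product(
        (False, True), batch_sizes, (False, True), (False, True)
    ):
        runs.append(
            {
                "nvlink": nvlink,
                "batch_size": batch_size,
                "xla": xla,
                "amp": amp,
                "num_iterations": num_iterations,
            }
        )
    return runs
```

With the defaults this yields 16 configurations: two NVLink states, two batch sizes, and four XLA/AMP combinations.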


Test 1: NVIDIA TITAN RTX Deep Learning Benchmarks:
No NVLink. Batch size: 64. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, no NVLink, batch size 64

Test 2: NVIDIA TITAN RTX Deep Learning Benchmarks:
2 x NVLink. Batch size: 64. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, NVLink, batch size 64

Test 3: NVIDIA TITAN RTX Deep Learning Benchmarks:
Overall chart. No NVLink vs. NVLink. Batch size: 64. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, overall chart, batch size 64

Test 4: NVIDIA TITAN RTX Deep Learning Benchmarks:
No NVLink. Batch size: 128. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, no NVLink, batch size 128

Test 5: NVIDIA TITAN RTX Deep Learning Benchmarks:
2 x NVLink. Batch size: 128. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, NVLink, batch size 128

Test 6: NVIDIA TITAN RTX Deep Learning Benchmarks:
Overall chart. No NVLink vs. NVLink. Batch size: 128. XLA, AMP: on / off



Chart: NVIDIA TITAN RTX deep learning benchmarks, overall chart, batch size 128

Test 7: NVIDIA TITAN RTX Deep Learning Benchmarks:
VRAM usage. Batch size: 64 and 128. XLA, AMP: on / off


Chart: NVIDIA TITAN RTX deep learning benchmarks, VRAM usage test
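
The post does not describe how VRAM usage was recorded. One possible approach, sketched here under that caveat, is to poll `nvidia-smi` during training and parse its CSV output. The function names are hypothetical. Note that TensorFlow reserves most of the GPU memory by default, so the figure reported by `nvidia-smi` reflects reserved memory unless memory growth is enabled in the session configuration.

```python
import subprocess

# Query per-GPU memory in MiB, without header row or unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Turn lines such as '0, 1234' into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def query_memory_used():
    """Run nvidia-smi once and return the memory used on each GPU in MiB."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_memory_used(result.stdout)
```

Calling `query_memory_used()` at intervals while a benchmark runs and keeping the maximum per GPU gives a peak-usage figure for each configuration.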

Conclusion

  • NVLink significantly increases performance. We recommend adding the NVLink bridge when placing your order.
  • A larger batch size increases performance. The NVIDIA RTX 2080 Ti cannot fit the larger batch sizes, while the TITAN RTX supports batch sizes up to 128, because of the difference in VRAM (11 GB vs. 24 GB).
  • XLA significantly increases throughput (img/sec).
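
To put a number on comparisons like these, throughput in img/sec from two runs can be turned into a speedup factor and a percentage gain. The helper names below are hypothetical, and the sketch only shows the arithmetic.

```python
def speedup(baseline_img_per_sec, improved_img_per_sec):
    """Ratio of improved throughput to baseline throughput."""
    if baseline_img_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return improved_img_per_sec / baseline_img_per_sec


def percent_gain(baseline_img_per_sec, improved_img_per_sec):
    """Throughput gain over the baseline, expressed in percent."""
    return (speedup(baseline_img_per_sec, improved_img_per_sec) - 1.0) * 100.0
```

For example, with placeholder values that are not measurements from this post, `speedup(100.0, 150.0)` returns `1.5` and `percent_gain(100.0, 150.0)` returns `50.0`.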

All of the software and hardware features covered in this article are available on all BIZON deep learning workstations and servers. BIZON Stack comes with the latest version of TensorFlow, with XLA and AMP support.