STAC-AI™ LANG6 Benchmark Results on Supermicro SuperServer SYS-222C-TN with 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs

Type: Audited

Specs: STAC-AI™ LANG6

 

STAC recently completed a STAC-AI™ LANG6 (Inference-Only) benchmark audit on a Supermicro SuperServer SYS-222C-TN server hosting 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs and managed by Red Hat OpenShift.

Stack Under Test (SUT):

  • STAC-AI™ LANG6 (Inference-Only) Pack for NVIDIA TensorRT-LLM (Rev D)
  • NVIDIA TensorRT-LLM 1.2.0rc2 with PyTorch backend 
  • NVIDIA TensorRT 10.13.3.9
  • NVIDIA Model Optimizer (nvidia-modelopt) 0.37.0 for NVFP4 quantization (see the illustrative quantization sketch after this list)
  • PyTorch 2.9.0a0 (NVIDIA PyTorch container 25.10) 
  • Red Hat Enterprise Linux CoreOS 9.6
  • Red Hat OpenShift Container Platform 4.20 
  • Supermicro SuperServer SYS-222C-TN (2U CloudDC with DC-MHS) 
    • 32 x 64GiB DDR5 DIMMs @ 5200 MT/s (2TiB total)
    • 2 x Intel® Xeon® 6730P CPUs
  • 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs, each with 96GiB of memory
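
For context on the quantization step listed above, the following is a minimal sketch of post-training NVFP4 quantization with the NVIDIA Model Optimizer (modelopt) PyTorch API. It is illustrative only and is not the audited configuration: the model identifier, calibration prompts, and the NVFP4_DEFAULT_CFG selection are assumptions, and the exact settings used in the SUT are documented in the benchmark report.

    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the model to be quantized (model name assumed for illustration).
    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Small set of representative prompts for calibration (placeholders only).
    calib_prompts = [
        "Summarize the risk factors disclosed in this filing ...",
        "What revenue did the company report for the most recent fiscal year? ...",
    ]

    # Forward loop: Model Optimizer runs this to collect activation statistics.
    def forward_loop(m):
        for prompt in calib_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

    # Post-training quantization to NVFP4 (config name assumed; check your
    # modelopt release for the exact NVFP4 configuration object).
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

    # The quantized model would then be exported and served through
    # NVIDIA TensorRT-LLM; those steps are omitted here.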

 

Key Results Summary:

EDGAR4a Batch mode

  • The system achieved 32.9 inferences/s and 5,549 words/s on Llama-3.1-8B EDGAR4a

EDGAR4a Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 7.50 to 30.0 inferences/s, with:
    • 95p reaction time increasing by 2.44x, from 0.131 s to 0.320 s
    • 95p response time increasing by 4.93x, from 2.96 s to 14.6 s
  • At 30.0 inferences/s, the system still operated at about 91% of the 32.9 inferences/s batch-mode rate

EDGAR5a Batch mode

  • The system achieved 0.345 inferences/s and 139 words/s on Llama-3.1-8B EDGAR5a

EDGAR5a Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 0.0800 to 0.320 inferences/s, with:
    • 95p reaction time increasing by 2.96x, from 9.82 s to 29.1 s
    • 95p response time increasing by 4.58x, from 27.5 s to 126 s
  • At 0.320 inferences/s, the system still operated at about 93% of the 0.345 inferences/s batch-mode rate

EDGAR4b Batch mode

  • The system achieved 5.28 inferences/s and 834 words/s on Llama-3.1-70B EDGAR4b

EDGAR4b Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 1.25 to 5.00 inferences/s, with:
    • 95p reaction time increasing by 2.47x, from 0.916 s to 2.26 s
    • 95p response time increasing by 2.80x, from 16.0 s to 44.8 s
  • At 5.00 inferences/s, the system still operated at about 95% of the 5.28 inferences/s batch-mode rate
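
As a quick check, the batch-rate retention figures quoted above follow directly from the reported numbers, taking the ratio of the highest interactive arrival rate to the batch-mode inference rate:

    # Ratio of highest sustained interactive arrival rate to batch-mode rate,
    # using the figures reported in this summary.
    for workload, interactive_rate, batch_rate in [
        ("EDGAR4a, Llama-3.1-8B", 30.0, 32.9),
        ("EDGAR5a, Llama-3.1-8B", 0.320, 0.345),
        ("EDGAR4b, Llama-3.1-70B", 5.00, 5.28),
    ]:
        print(f"{workload}: {interactive_rate / batch_rate:.0%} of batch-mode rate")
    # Prints approximately 91%, 93%, and 95%, matching the figures above.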

 

The benchmark report is available to all STAC Observer members. STAC Insights subscribers gain access to detailed visualizations, configuration data, benchmark code, and the ability to run these tests in their own labs. Please log in to access the reports. For subscription options, contact us.

The STAC-AI Working Group focuses on benchmarking artificial intelligence (AI) technologies in finance. This includes deep learning, large language models (LLMs), and other AI-driven approaches that help firms unlock new efficiencies and insights.