- SUT ID: SMCI260303
- STAC-AI
STAC-AI™ LANG6 Benchmark Results on Supermicro SuperServer SYS-222C-TN with 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs
Type: Audited
Specs: STAC-AI™ LANG6
STAC recently completed a STAC-AI™ LANG6 (Inference-Only) benchmark audit on a Supermicro SuperServer SYS-222C-TN hosting 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs managed by Red Hat OpenShift.
Stack Under Test (SUT):
- STAC-AI™ LANG6 (Inference-Only) Pack for NVIDIA TensorRT-LLM (Rev D)
- NVIDIA TensorRT-LLM 1.2.0rc2 with PyTorch backend
- NVIDIA TensorRT 10.13.3.9
- NVIDIA Model Optimizer (nvidia-modelopt) 0.37.0 for NVFP4 quantization
- PyTorch 2.9.0a0 (NVIDIA PyTorch container 25.10)
- Red Hat Enterprise Linux CoreOS 9.6
- Red Hat OpenShift Container Platform 4.20
- Supermicro SuperServer SYS-222C-TN (2U CloudDC with DC-MHS)
- 32 x 64GiB DDR5 DIMMs @ 5200 MT/s (2TiB total)
- 2 x Intel® Xeon® 6730P CPUs
- 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs, each with 96GiB of memory
Key Results Summary:
EDGAR4a Batch mode
- The system achieved 32.9 inferences/s and 5,549 words/s on Llama-3.1-8B EDGAR4a
EDGAR4a Interactive mode
- The system achieved a 4.00x increase in arrival rate, from 7.50 to 30.0 inferences/s with:
- increased 95p reaction time by 2.44x, from 0.131 s to 0.320 s
- increased 95p response time by 4.93x, from 2.96 s to 14.6 s
- At 30.0 inferences/s, the system still operated at about 91% of the 32.9 inferences/s batch-mode rate
EDGAR5a Batch mode
- The system achieved 0.345 inferences/s and 139 words/s on Llama-3.1-8B EDGAR5a
EDGAR5a Interactive mode
- The system achieved a 4.00x increase in arrival rate, from 0.0800 to 0.320 inferences/s with:
- increased 95p reaction time by 2.96x, from 9.82 s to 29.1 s
- increased 95p response time by 4.58x, from 27.5 s to 126 s
- At 0.320 inferences/s, the system still operated at about 93% of the 0.345 inferences/s batch-mode rate
EDGAR4b Batch mode
- The system achieved 5.28 inferences/s and 834 words/s on Llama-3.1-70B EDGAR4b
EDGAR4b Interactive mode
- The system achieved a 4.00x increase in arrival rate, from 1.25 to 5.00 inferences/s with:
- increased 95p reaction time by 2.47x, from 0.916 s to 2.26 s
- increased 95p response time by 2.80x, from 16.0 s to 44.8 s
- At 5.00 inferences/s, the system still operated at about 95% of the 5.28 inferences/s batch-mode rate
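The multipliers and batch-rate percentages quoted above can be rechecked from the published figures. The following is a minimal sketch, not part of the STAC report or the benchmark pack: the dictionary layout and the names `RESULTS`, `ratio`, and `summarize` are hypothetical, and only the numbers are taken from the summary.

```python
# Figures copied from the Key Results Summary; the structure and names
# below are illustrative assumptions, not part of the STAC-AI tooling.
RESULTS = {
    "EDGAR4a": {"batch": 32.9, "arrival": (7.50, 30.0),
                "reaction_95p": (0.131, 0.320), "response_95p": (2.96, 14.6)},
    "EDGAR5a": {"batch": 0.345, "arrival": (0.0800, 0.320),
                "reaction_95p": (9.82, 29.1), "response_95p": (27.5, 126.0)},
    "EDGAR4b": {"batch": 5.28, "arrival": (1.25, 5.00),
                "reaction_95p": (0.916, 2.26), "response_95p": (16.0, 44.8)},
}


def ratio(pair):
    """Return the multiplier from the lower to the higher measurement."""
    low, high = pair
    return high / low


def summarize(r):
    """Derive the multipliers and the share of the batch-mode rate."""
    return {
        "arrival_x": ratio(r["arrival"]),
        "reaction_x": ratio(r["reaction_95p"]),
        "response_x": ratio(r["response_95p"]),
        # Highest interactive arrival rate as a percentage of batch throughput.
        "pct_of_batch": 100 * r["arrival"][1] / r["batch"],
    }


for name, r in RESULTS.items():
    s = summarize(r)
    print(f"{name}: arrival {s['arrival_x']:.2f}x, "
          f"reaction {s['reaction_x']:.2f}x, "
          f"response {s['response_x']:.2f}x, "
          f"{s['pct_of_batch']:.0f}% of batch rate")
```

Because the inputs are the rounded values shown on this page, a multiplier recomputed this way could in principle differ in the last digit from one derived from unrounded measurements.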
The benchmark report is available to all STAC Observer members. STAC Insights subscribers gain access to detailed visualizations, configuration data, benchmark code, and the ability to run these tests in their own labs. Please log in to access the reports. For subscription options, contact us.
