STAC-AI™ LANG6 Benchmark Results on Supermicro SuperServer SYS-222C-TN with 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs

Type: Audited

Specs: STAC-AI™ LANG6

 

STAC recently completed a STAC-AI™ LANG6 (Inference-Only) benchmark audit on a Supermicro SuperServer SYS-222C-TN server hosting 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs and managed by Red Hat OpenShift.

Stack Under Test (SUT):

  • STAC-AI™ LANG6 (Inference-Only) Pack for NVIDIA TensorRT-LLM (Rev D)
  • NVIDIA TensorRT-LLM 1.2.0rc2 with PyTorch backend 
  • NVIDIA TensorRT 10.13.3.9
  • NVIDIA Model Optimizer (nvidia-modelopt) 0.37.0 for NVFP4 quantization (see the illustrative quantization sketch after this list)
  • PyTorch 2.9.0a0 (NVIDIA PyTorch container 25.10) 
  • Red Hat Enterprise Linux CoreOS 9.6
  • Red Hat OpenShift Container Platform 4.20 
  • Supermicro SuperServer SYS-222C-TN (2U CloudDC with DC-MHS) 
    • 32 x 64GiB DDR5 DIMMs @ 5200 MT/s (2TiB total)
    • 2 x Intel® Xeon® 6730P CPUs
  • 2x NVIDIA RTX PRO 6000 Blackwell Series GPUs, each with 96GiB of memory
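
For context on the quantization step listed above, the following is a minimal sketch of post-training NVFP4 quantization with the NVIDIA Model Optimizer (modelopt) PyTorch API. It is illustrative only and is not the audited configuration: the model identifier, calibration prompts, and the NVFP4_DEFAULT_CFG selection are assumptions, and the exact settings used in the SUT are documented in the benchmark report.

    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the model to be quantized (model name assumed for illustration).
    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Small set of representative prompts for calibration (placeholders only).
    calib_prompts = [
        "Summarize the risk factors disclosed in this filing ...",
        "What revenue did the company report for the most recent fiscal year? ...",
    ]

    # Forward loop: Model Optimizer runs this to collect activation statistics.
    def forward_loop(m):
        for prompt in calib_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)

    # Post-training quantization to NVFP4 (config name assumed; check your
    # modelopt release for the exact NVFP4 configuration object).
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

    # The quantized model would then be exported and served through
    # NVIDIA TensorRT-LLM; those steps are omitted here.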

 

Key Results Summary:

EDGAR4a Batch mode

  • The system achieved 32.9 inferences/s and 5,549 words/s on Llama-3.1-8B EDGAR4a

EDGAR4a Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 7.50 to 30.0 inferences/s, with:
    • 95p reaction time increasing by 2.44x, from 0.131 s to 0.320 s
    • 95p response time increasing by 4.93x, from 2.96 s to 14.6 s
  • At 30.0 inferences/s, the system still operated at about 91% of the 32.9 inferences/s batch-mode rate

EDGAR5a Batch mode

  • The system achieved 0.345 inferences/s and 139 words/s on Llama-3.1-8B EDGAR5a

EDGAR5a Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 0.0800 to 0.320 inferences/s, with:
    • 95p reaction time increasing by 2.96x, from 9.82 s to 29.1 s
    • 95p response time increasing by 4.58x, from 27.5 s to 126 s
  • At 0.320 inferences/s, the system still operated at about 93% of the 0.345 inferences/s batch-mode rate

EDGAR4b Batch mode

  • The system achieved 5.28 inferences/s and 834 words/s on Llama-3.1-70B EDGAR4b

EDGAR4b Interactive mode

  • The system sustained a 4.00x increase in arrival rate, from 1.25 to 5.00 inferences/s, with:
    • 95p reaction time increasing by 2.47x, from 0.916 s to 2.26 s
    • 95p response time increasing by 2.80x, from 16.0 s to 44.8 s
  • At 5.00 inferences/s, the system still operated at about 95% of the 5.28 inferences/s batch-mode rate
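
As a quick check, the batch-rate retention figures quoted above follow directly from the reported numbers, taking the ratio of the highest interactive arrival rate to the batch-mode inference rate:

    # Ratio of highest sustained interactive arrival rate to batch-mode rate,
    # using the figures reported in this summary.
    for workload, interactive_rate, batch_rate in [
        ("EDGAR4a, Llama-3.1-8B", 30.0, 32.9),
        ("EDGAR5a, Llama-3.1-8B", 0.320, 0.345),
        ("EDGAR4b, Llama-3.1-70B", 5.00, 5.28),
    ]:
        print(f"{workload}: {interactive_rate / batch_rate:.0%} of batch-mode rate")
    # Prints approximately 91%, 93%, and 95%, matching the figures above.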

 

The benchmark report is available to all STAC Observer members. STAC Insights subscribers gain access to detailed visualizations, configuration data, benchmark code, and the ability to run these tests in their own labs. Please log in to access the reports. For subscription options, contact us.

The STAC-AI Working Group focuses on benchmarking artificial intelligence (AI) technologies in finance. This includes deep learning, large language models (LLMs), and other AI-driven approaches that help firms unlock new efficiencies and insights.