Two traits define the NVIDIA DGX Spark: 128GB of unified memory in a $4,000 desktop unit, and built-in 200Gb datacenter-grade networking. The high-speed fabric sets it apart from ordinary workstations, enabling the kind of multi-node clustering once exclusive to rack-mounted servers. This review benchmarks distributed inference on two-node 200GbE clusters built from Dell, GIGABYTE, and HP Spark variants, across a range of models and workloads. It also examines pipeline parallelism (PP), an alternative model-splitting strategy that outperforms NVIDIA's default tensor parallelism (TP) in most batched scenarios.
200Gb Network Fabric
Each Spark carries two QSFP56 cages backed by an integrated ConnectX-7 SmartNIC. Because the NIC is limited by PCIe Gen5 x4 bandwidth, usable network speed caps at 200Gb, and a single port is enough to reach full bandwidth; the second port adds topology flexibility. Three configurations are common: a direct Spark-to-Spark 200Gb link, a switch-free ring topology via the dual 100Gb ports, and hybrid clustering with NVMe-oF for high-speed storage access. NVIDIA sells single-unit desktops, validated two-node clusters, and newly released four-node setups. The dual-Spark configuration is the most practical for production-style inference and the focus of this test.
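Whatever the topology, the practical question is how much of the 200Gb fabric a workload can actually use. Before benchmarking, a plain TCP throughput probe between the two nodes is a quick sanity check; below is a minimal sketch with hypothetical fabric-side addresses. A single TCP stream will fall well short of line rate, so treat the result as a floor; RDMA-level tooling is needed to approach the full 200Gb figure.

```python
# Minimal point-to-point TCP throughput probe between two Spark nodes.
# Hypothetical addresses; run with --serve on one node, plain on the other.
# One TCP stream will not saturate a 200Gb link -- this is a sanity check,
# not a substitute for RDMA-level measurement.
import socket
import sys
import time

HOST, PORT = "192.168.100.2", 5201   # hypothetical fabric-side IP of node 2
CHUNK = 4 * 1024 * 1024              # 4 MiB send buffer
SECONDS = 10

def serve() -> None:
    # Receiver side: count bytes until the sender closes the connection.
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            total = 0
            while data := conn.recv(CHUNK):
                total += len(data)
            print(f"received {total / 1e9:.2f} GB")

def send() -> None:
    # Sender side: push zero-filled buffers for a fixed wall-clock window.
    buf = b"\x00" * CHUNK
    with socket.create_connection((HOST, PORT)) as s:
        sent, start = 0, time.time()
        while time.time() - start < SECONDS:
            s.sendall(buf)
            sent += CHUNK
        elapsed = time.time() - start
    print(f"{sent * 8 / elapsed / 1e9:.1f} Gb/s over one TCP stream")

if __name__ == "__main__":
    serve() if "--serve" in sys.argv else send()
```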
Rationale for Spark Clustering
The primary benefit is expanded model capacity: two linked Sparks can run 120B-parameter models that exceed a single unit's memory. Just as importantly, the platform works as an affordable educational tool. NVIDIA positions Spark as a way for newcomers to learn AI workflows, with official guides covering model deployment, fine-tuning, and PyTorch/JAX development; a dual-node cluster extends that curriculum to multi-node parallelism and network bottleneck analysis without costly datacenter hardware. Notably, Spark is not optimized for production inference. It is constrained by memory bandwidth and inter-node latency, and its 200GbE link is far slower than internal PCIe connections. Larger clusters suffer severe performance degradation and low token throughput, limiting them to educational use rather than commercial serving.
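The capacity argument is easy to check with back-of-envelope arithmetic: weight storage alone is parameter count times bytes per parameter, before any KV cache or activation overhead. A quick sketch, with illustrative precisions only; actual footprints vary with quantization format and runtime overhead:

```python
# Back-of-envelope weight footprint for a 120B-parameter model versus
# one Spark's 128GB unified memory. Illustrative only: real deployments
# add KV cache, activations, and runtime overhead on top of weights.
PARAMS = 120e9
CAPACITY_GB = 128

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    nodes = -(-weights_gb // CAPACITY_GB)  # ceiling division
    print(f"{name}: {weights_gb:.0f} GB of weights -> >= {nodes:.0f} node(s)")

# FP16: 240 GB -> two nodes for the weights alone. Lower precisions fit one
# node on paper, but KV cache at long context can still push past 128GB.
```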
Performance Testing: PP vs TP
Parallelism Strategy Selection
NVIDIA defaults to TP, which splits every transformer layer across the two GPUs and requires frequent all-reduce exchanges. PP instead divides the model by layer, so activations cross between nodes only once per forward pass. On a 200GbE link, PP therefore minimizes cross-node communication. For large models at high batch sizes, PP vastly outperforms TP; TP excels only in single-request, low-latency chat scenarios.
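The gap has a simple communication-volume explanation: with TP=2, cross-node traffic scales with layer count, while with PP=2 only the activations at the single stage boundary cross the wire. A rough per-token estimate, using illustrative model dimensions (hypothetical, not the exact GPT-OSS-120B geometry):

```python
# Rough per-token cross-node traffic: TP=2 vs PP=2.
# Dimensions are illustrative, not the exact GPT-OSS-120B geometry.
HIDDEN = 8192              # hypothetical hidden size
LAYERS = 80                # hypothetical layer count
BYTES = 2                  # BF16 activations
ALLREDUCES_PER_LAYER = 2   # one after attention, one after the MLP block

# TP=2: each all-reduce over a [1, HIDDEN] activation moves roughly the
# tensor's size across the link per rank (reduce-scatter + all-gather).
tp_bytes = LAYERS * ALLREDUCES_PER_LAYER * HIDDEN * BYTES

# PP=2: one point-to-point transfer of the boundary activation per token.
pp_bytes = HIDDEN * BYTES

print(f"TP=2: ~{tp_bytes / 1e6:.1f} MB/token   PP=2: ~{pp_bytes / 1e3:.1f} KB/token")
print(f"ratio: ~{tp_bytes / pp_bytes:.0f}x more cross-node traffic under TP")
```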
Tests on GPT-OSS-120B confirm this gap. At batch size 128, PP reaches 554.69 tok/s in balanced workloads, 2.20× faster than TP, and 310.63 tok/s versus 164.99 tok/s in prefill-heavy tasks; TP leads only at batch size 1. For small models like Llama-3.1-8B, the per-layer computation is light enough that TP dominates most batch sizes, with PP overtaking it only at high concurrency.
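The review does not name its serving stack, but the trade-off is easy to reproduce in any engine that exposes both knobs. As a hedged sketch, vLLM's offline API accepts tensor_parallel_size and pipeline_parallel_size; multi-node runs additionally require a Ray cluster spanning both Sparks, and the setup details vary by version:

```python
# Hedged sketch: comparing TP and PP placements with vLLM's offline API.
# Assumes a Ray cluster already spans both Spark nodes; that setup is
# version-dependent and omitted here. Run one engine at a time.
from vllm import LLM, SamplingParams

prompts = ["Explain pipeline parallelism in one paragraph."] * 64  # batch of 64

# Tensor parallelism: every layer split across the two nodes,
# with all-reduce traffic on each layer.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)

# Pipeline parallelism: the first half of the layers on node 1, the rest
# on node 2, with a single activation hand-off per microbatch.
# llm = LLM(model="openai/gpt-oss-120b", pipeline_parallel_size=2)

outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
print(f"generated {sum(len(o.outputs[0].token_ids) for o in outputs)} tokens")
```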
Multi-Model Benchmark Results (PP=2)
GPT-OSS Series
For GPT-OSS-120B, HP topped peak throughput in balanced (504.88 tok/s) and prefill-heavy (441.63 tok/s) workloads; GIGABYTE led decode-heavy tests (494.37 tok/s). For GPT-OSS-20B, Dell dominated balanced (976.77 tok/s) and prefill-heavy (852.39 tok/s) scenarios, while GIGABYTE led decode tasks (945.55 tok/s).
Llama 3.1 8B Variants
In BF16 precision, Dell led balanced (689.53 tok/s) and decode-heavy (581.43 tok/s) workloads; GIGABYTE won prefill-heavy tests (539.27 tok/s). FP4 optimization boosted throughput sharply: GIGABYTE led balanced (1458.86 tok/s) and prefill-heavy (954.23 tok/s) tasks. For FP8, Dell maintained narrow leads in balanced (1105.42 tok/s) and decode-heavy (862.33 tok/s) scenarios.
Mistral & Qwen Models
Mistral Small 3.1 24B showed minimal gaps between vendors: GIGABYTE peaked at 255.09 tok/s in balanced workloads. For Qwen3 Coder 30B (A3B Base), GIGABYTE led prefill-heavy tasks (1862.40 tok/s) while Dell excelled in decode scenarios; under FP8 quantization, GIGABYTE topped prefill-heavy throughput (3088.62 tok/s) and Dell led decode tasks (705.77 tok/s).
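The three workload labels used above and in the table below map to input/output sequence-length (ISL/OSL) mixes. The exact lengths used in these tests are not stated, so the values in this sketch are hypothetical; it shows how such scenarios are typically parameterized and how a tok/s figure is derived:

```python
# Hypothetical ISL/OSL mixes for the three scenario labels; the review
# does not state its exact lengths, so these values are illustrative.
SCENARIOS = {
    "Equal ISL/OSL": (1024, 1024),  # balanced prefill and decode
    "Prefill Heavy": (4096, 256),   # long prompts, short completions
    "Decode Heavy":  (256, 4096),   # short prompts, long completions
}
BATCH = 64  # matching the table's BS = 64

def throughput(osl_tokens: int, batch: int, wall_seconds: float) -> float:
    """Output tokens per second across the whole batch."""
    return batch * osl_tokens / wall_seconds

for name, (isl, osl) in SCENARIOS.items():
    # wall_seconds would come from timing a real engine run at this mix
    print(f"{name}: ISL={isl}, OSL={osl}, "
          f"e.g. 60s wall -> {throughput(osl, BATCH, 60.0):.0f} tok/s")
```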
Dual Spark Systems Peak Output Summary
| Model | Scenario (BS = 64) | Dell Peak Output | GIGABYTE Peak Output | HP Peak Output |
|---|---|---|---|---|
| GPT-OSS-120B | Equal ISL/OSL | 463.97 tok/s | 497.26 tok/s | 504.88 tok/s |
| GPT-OSS-120B | Prefill Heavy | 419.56 tok/s | 417.34 tok/s | 441.63 tok/s |
| GPT-OSS-120B | Decode Heavy | 451.18 tok/s | 494.37 tok/s | 474.85 tok/s |
| GPT-OSS-20B | Equal ISL/OSL | 976.77 tok/s | 952.31 tok/s | 915.72 tok/s |
| GPT-OSS-20B | Prefill Heavy | 852.39 tok/s | 802.37 tok/s | 757.05 tok/s |
| GPT-OSS-20B | Decode Heavy | 938.65 tok/s | 945.55 tok/s | 865.78 tok/s |
| Llama-3.1-8B-Instruct | Equal ISL/OSL | 689.53 tok/s | 687.48 tok/s | 618.87 tok/s |
| Llama-3.1-8B-Instruct | Prefill Heavy | 515.45 tok/s | 539.27 tok/s | 463.39 tok/s |
| Llama-3.1-8B-Instruct | Decode Heavy | 581.43 tok/s | 576.91 tok/s | 531.07 tok/s |
| Llama-3.1-8B-FP4 | Equal ISL/OSL | 1427.39 tok/s | 1458.86 tok/s | 1413.51 tok/s |
| Llama-3.1-8B-FP4 | Prefill Heavy | 884.22 tok/s | 954.23 tok/s | 843.57 tok/s |
| Llama-3.1-8B-FP4 | Decode Heavy | 1008.98 tok/s | 1007.23 tok/s | 943.73 tok/s |
| Llama-3.1-8B-FP8 | Equal ISL/OSL | 1105.42 tok/s | 1089.85 tok/s | 1076.68 tok/s |
| Llama-3.1-8B-FP8 | Prefill Heavy | 759.50 tok/s | 827.40 tok/s | 725.51 tok/s |
| Llama-3.1-8B-FP8 | Decode Heavy | 862.33 tok/s | 855.81 tok/s | 800.78 tok/s |
| Mistral-Small-3.1-24B | Equal ISL/OSL | 249.77 tok/s | 255.09 tok/s | 239.09 tok/s |
| Mistral-Small-3.1-24B | Prefill Heavy | 216.01 tok/s | 214.38 tok/s | 197.92 tok/s |
| Mistral-Small-3.1-24B | Decode Heavy | 238.44 tok/s | 237.97 tok/s | 221.41 tok/s |
Conclusion
Dell, GIGABYTE, and HP Spark units deliver negligible performance gaps, with only minor workload-specific leads. Purchase decisions should therefore prioritize chassis design, thermal performance, warranty, and after-sales support over trivial benchmark differences. Parallelism strategy matters far more than OEM choice: PP outperforms TP for batched inference, while TP suits single-stream, low-latency interaction, so NVIDIA's TP default aligns with Spark's positioning as an interactive learning device rather than production infrastructure. A dual-node Spark cluster remains an affordable teaching platform for distributed AI. Future tests will cover larger clusters and end-to-end small-model training once the lab's 800Gb switch is deployed.
Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang / Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com / www.storagesserver.com
Business Focus: ICT Product Distribution / System Integration & Services / Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World.” Your Trusted ICT Product Service Provider!