
NVIDIA DGX Spark Cluster Review: Distributed Inference on Dell, GIGABYTE, and HP


May 15, 2026
Two defining traits set the NVIDIA DGX Spark apart: 128GB of unified memory in a $4,000 desktop unit, and built-in 200Gb datacenter-grade networking. The high-speed fabric is what distinguishes it from ordinary workstations, enabling the multi-node clustering once exclusive to rack-mounted servers. This review benchmarks distributed inference on Dell, GIGABYTE, and HP Spark variants in two-node 200GbE clusters across a range of models and workloads. It also analyzes pipeline parallelism (PP), an alternative model-splitting strategy that in these tests outperforms NVIDIA's default tensor parallelism (TP) for batched workloads.


200Gb Network Fabric


Each Spark is fitted with two QSFP56 cages driven by an integrated ConnectX-7 SmartNIC. Limited by PCIe Gen5 x4 bandwidth, usable network speed caps at 200Gb, so a single port already saturates the link; the second port adds topology flexibility. Three common configurations are available: a direct Spark-to-Spark 200Gb link, a switch-free ring topology over dual 100Gb ports, and hybrid clustering with NVMe-oF for high-speed storage access. NVIDIA sells single-unit desktops, validated two-node clusters, and newly released four-node setups. The dual-Spark configuration is the most practical for production-style inference and is the focus of this test.
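For a direct Spark-to-Spark link, communication libraries must be pointed at the ConnectX-7 port rather than the management NIC. A minimal configuration sketch follows; the interface name, device name, and addresses are assumptions, not values from this review:

```shell
# Hedged sketch: pin NCCL traffic to the 200GbE ConnectX-7 port on each node.
# Check your actual interface name with `ip -br addr`; ConnectX-7 typically
# registers as an mlx5 device.
export NCCL_SOCKET_IFNAME=enp1s0f0   # hypothetical 200GbE interface name
export NCCL_IB_HCA=mlx5_0            # hypothetical RDMA device name

# Quick point-to-point bandwidth sanity check between the two Sparks:
# node A:  iperf3 -s
# node B:  iperf3 -c <node-A-200GbE-IP> -P 8
```

With multiple parallel streams (`-P 8`), iperf3 gives a rough upper bound on what the link can sustain before any inference framework overhead.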


Rationale for Spark Clustering


The primary benefit is expanded model capacity: two linked Sparks can run 120B-parameter models that exceed a single unit's memory. More importantly, the platform serves as an affordable educational tool. NVIDIA positions Spark as a way for beginners to learn AI workflows, with official guides covering model deployment, fine-tuning, and PyTorch/JAX development. A dual-node cluster additionally teaches multi-node parallelism and network bottleneck analysis without costly datacenter hardware. Notably, Spark is not optimized for production inference: constrained by memory bandwidth and inter-node latency (its 200GbE link is slower than internal PCIe connections), larger clusters suffer severe performance degradation and low token throughput, limiting them to educational use rather than commercial serving.

Performance Testing: PP vs TP


Parallelism Strategy Selection


NVIDIA defaults to TP, which splits each transformer layer across two GPUs and requires frequent all-reduce exchanges. By contrast, PP divides the model by layer, transferring activations between nodes only once per forward pass. On 200GbE links, PP therefore minimizes cross-node communication. For large models at high batch sizes, PP vastly outperforms TP; TP excels only in single-request, low-latency chat scenarios.
Tests on GPT-OSS-120B confirm this gap. At batch size 128, PP reaches 554.69 tok/s in balanced workloads (2.20× faster than TP) and 310.63 tok/s versus TP's 164.99 tok/s in prefill-heavy tasks. TP leads only at batch size 1. For small models such as Llama-3.1-8B, TP dominates at most batch sizes because per-layer computation is lightweight, with PP overtaking TP only at high concurrency.
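The communication asymmetry can be sketched with a back-of-envelope model. The per-layer all-reduce count and single pipeline boundary below are simplifying assumptions, not measurements from this review; the model shape (32 layers, hidden size 4096) is Llama-3.1-8B's published configuration:

```python
# Rough model of per-token cross-GPU traffic for a 2-way split, BF16 activations.

def tp_bytes_per_token(hidden: int, layers: int,
                       bytes_per_elem: int = 2,
                       allreduces_per_layer: int = 2) -> int:
    """Tensor parallelism: every layer exchanges hidden-sized activations,
    assumed here to happen twice per layer (attention + MLP all-reduce)."""
    return layers * allreduces_per_layer * hidden * bytes_per_elem

def pp_bytes_per_token(hidden: int, bytes_per_elem: int = 2,
                       boundaries: int = 1) -> int:
    """Pipeline parallelism: activations cross the node boundary once."""
    return boundaries * hidden * bytes_per_elem

tp = tp_bytes_per_token(hidden=4096, layers=32)  # 524288 bytes/token
pp = pp_bytes_per_token(hidden=4096)             # 8192 bytes/token
print(f"TP moves {tp // pp}x more data per token than PP")  # 64x under these assumptions
```

The ratio grows with layer count, which is consistent with PP's advantage on the slow 200GbE fabric widening for deep 120B-class models.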

Multi-Model Benchmark Results (PP=2)


GPT-OSS Series


For GPT-OSS-120B, HP topped peak throughput in balanced (504.88 tok/s) and prefill-heavy (441.63 tok/s) workloads; GIGABYTE led decode-heavy tests (494.37 tok/s). For GPT-OSS-20B, Dell dominated balanced (976.77 tok/s) and prefill-heavy (852.39 tok/s) scenarios, while GIGABYTE led decode tasks (945.55 tok/s).

Llama 3.1 8B Variants


In BF16 precision, Dell led balanced (689.53 tok/s) and decode-heavy (581.43 tok/s) workloads; GIGABYTE won prefill-heavy tests (539.27 tok/s). FP4 optimization boosted throughput sharply: GIGABYTE led balanced (1458.86 tok/s) and prefill-heavy (954.23 tok/s) tasks. For FP8, Dell maintained narrow leads in balanced (1105.42 tok/s) and decode-heavy (862.33 tok/s) scenarios.
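The quantization gains can be quantified directly from the Dell unit's balanced-workload numbers reported here:

```python
# Llama-3.1-8B speedup from quantization (Dell unit, equal ISL/OSL), values
# taken from the results in this review.
bf16, fp8, fp4 = 689.53, 1105.42, 1427.39  # tok/s

print(f"FP8 vs BF16: {fp8 / bf16:.2f}x")   # 1.60x
print(f"FP4 vs BF16: {fp4 / bf16:.2f}x")   # 2.07x
```

The sub-linear scaling (FP4 halves the bits of FP8 but yields only ~1.3× more throughput over it) suggests the workload is not purely memory-bandwidth bound at these batch sizes.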

Mistral & Qwen Models


Mistral Small 3.1 24B saw minimal gaps: GIGABYTE peaked at 255.09 tok/s in balanced workloads. For Qwen3 Coder 30B (A3B Base), GIGABYTE led prefill-heavy tasks (1862.40 tok/s); Dell excelled in decode scenarios. Under FP8 quantization, GIGABYTE topped prefill-heavy throughput (3088.62 tok/s), while Dell led decode tasks (705.77 tok/s).

Dual Spark Systems Peak Output Summary


| Model | Scenario (BS = 64) | Dell Peak Output | GIGABYTE Peak Output | HP Peak Output |
|---|---|---|---|---|
| GPT-OSS-120B | Equal ISL/OSL | 463.97 tok/s | 497.26 tok/s | 504.88 tok/s |
| GPT-OSS-120B | Prefill Heavy | 419.56 tok/s | 417.34 tok/s | 441.63 tok/s |
| GPT-OSS-120B | Decode Heavy | 451.18 tok/s | 494.37 tok/s | 474.85 tok/s |
| GPT-OSS-20B | Equal ISL/OSL | 976.77 tok/s | 952.31 tok/s | 915.72 tok/s |
| GPT-OSS-20B | Prefill Heavy | 852.39 tok/s | 802.37 tok/s | 757.05 tok/s |
| GPT-OSS-20B | Decode Heavy | 938.65 tok/s | 945.55 tok/s | 865.78 tok/s |
| Llama-3.1-8B-Instruct | Equal ISL/OSL | 689.53 tok/s | 687.48 tok/s | 618.87 tok/s |
| Llama-3.1-8B-Instruct | Prefill Heavy | 515.45 tok/s | 539.27 tok/s | 463.39 tok/s |
| Llama-3.1-8B-Instruct | Decode Heavy | 581.43 tok/s | 576.91 tok/s | 531.07 tok/s |
| Llama-3.1-8B-FP4 | Equal ISL/OSL | 1427.39 tok/s | 1458.86 tok/s | 1413.51 tok/s |
| Llama-3.1-8B-FP4 | Prefill Heavy | 884.22 tok/s | 954.23 tok/s | 843.57 tok/s |
| Llama-3.1-8B-FP4 | Decode Heavy | 1008.98 tok/s | 1007.23 tok/s | 943.73 tok/s |
| Llama-3.1-8B-FP8 | Equal ISL/OSL | 1105.42 tok/s | 1089.85 tok/s | 1076.68 tok/s |
| Llama-3.1-8B-FP8 | Prefill Heavy | 759.50 tok/s | 827.40 tok/s | 725.51 tok/s |
| Llama-3.1-8B-FP8 | Decode Heavy | 862.33 tok/s | 855.81 tok/s | 800.78 tok/s |
| Mistral-Small-3.1-24B | Equal ISL/OSL | 249.77 tok/s | 255.09 tok/s | 239.09 tok/s |
| Mistral-Small-3.1-24B | Prefill Heavy | 216.01 tok/s | 214.38 tok/s | 197.92 tok/s |
| Mistral-Small-3.1-24B | Decode Heavy | 238.44 tok/s | 237.97 tok/s | 221.41 tok/s |


Conclusion


Dell, GIGABYTE, and HP Spark units deliver near-identical performance, with only minor batch-specific leads. Purchase decisions should therefore prioritize chassis design, thermal performance, warranty, and after-sales support over trivial benchmark differences. Parallelism strategy exerts far greater impact than OEM variation: PP outperforms TP for batched inference, while TP suits single-stream, low-latency interaction. NVIDIA's TP recommendation aligns with Spark's positioning as an interactive learning device rather than production infrastructure, and a dual-node Spark cluster makes an affordable teaching platform for distributed AI. Future tests will cover larger clusters and end-to-end small-model training once an 800Gb switch is deployed in our lab.
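The scale of the vendor gaps can be checked against the peak-output numbers reported above:

```python
# Relative vendor spread (max vs min peak throughput) across the three OEMs.
def spread_pct(*toks: float) -> float:
    """Percentage gap between the fastest and slowest vendor for one row."""
    return round((max(toks) - min(toks)) / min(toks) * 100, 1)

# Values copied from this review's peak-output results (tok/s):
print(spread_pct(463.97, 497.26, 504.88))  # GPT-OSS-120B, equal ISL/OSL -> 8.8
print(spread_pct(249.77, 255.09, 239.09))  # Mistral-Small-3.1-24B, equal -> 6.7
```

Single-digit spreads across identical silicon are plausibly down to thermals and firmware rather than any fundamental design difference, which supports weighting chassis and support factors in a purchase decision.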

Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang/Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com / www.storagesserver.com
Business Focus:
ICT Product Distribution/System Integration & Services/Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World”
Your Trusted ICT Product Service Provider!