
Scaling AI Checkpoints: The Impact of High-Capacity SSDs on Model Training

March 13, 2026
Checkpointing is essential to AI model training, as it ensures resilience, operational efficiency, and the ability to resume or fine-tune training from saved states. However, the demands of modern AI workloads—characterized by increasingly complex models and expansive training datasets—are pushing storage systems to their absolute limits.
 
 
The Role of Checkpoints in AI Workflows
Checkpointing in AI training is a vital process that involves periodically saving the complete state of a model during its training cycle. This state encompasses the model’s weights and parameters, optimizer states, learning rate schedules, and training metadata. By creating a comprehensive snapshot of the training process at specific intervals, checkpointing guarantees training continuity and enables recovery in the event of interruptions.
 
Checkpoints are typically captured at iteration-based intervals (e.g., every one thousand training steps). Modern large language model (LLM) training—which can span weeks or even months and consume massive computational resources—relies heavily on these checkpoints as a safety net against potential failures. For example, training a GPT-4-class model can generate checkpoints ranging from several hundred gigabytes to multiple terabytes, depending on the model size and training configuration.
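As a concrete illustration, the snapshot a checkpoint captures can be sketched in a few lines of framework-agnostic Python, using pickle as a stand-in for torch.save(); the field names here are illustrative, not an actual PyTorch checkpoint format:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, weights, optimizer_state, lr_schedule):
    """Snapshot everything needed to resume training from `step`."""
    state = {
        "step": step,                  # training progress
        "weights": weights,            # model parameters
        "optimizer": optimizer_state,  # e.g. Adam moment estimates
        "lr_schedule": lr_schedule,    # learning-rate scheduler state
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore the saved state; training resumes from state['step']."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy usage: save at step 1000, then restore as if resuming after a failure.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt_step1000.pkl")
save_checkpoint(ckpt, 1000, {"w": [0.1, 0.2]}, {"m": [0.0, 0.0]}, {"lr": 3e-4})
resumed = load_checkpoint(ckpt)
```

Restoring this dictionary is all that is needed to continue from step 1000 rather than step 0, which is exactly the resilience property the paragraph above describes.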
 
 
Training Process Generated by DALL-E
The primary purpose of checkpointing goes beyond mere backup functionality. It serves as a critical mechanism for training resilience, allowing training to resume from the last saved state rather than restarting from scratch in cases of system failures, power outages, or hardware issues. Additionally, checkpoints are invaluable for model analysis: they enable researchers to examine the model’s evolution at different training stages and potentially roll back to previous states if performance degradation is detected.
 
From a storage perspective, the write patterns during checkpointing are particularly noteworthy. When a checkpoint is triggered, the system must write enormous volumes of data in a burst pattern. This creates a distinct I/O profile: periods of relatively low storage activity during training computations, followed by intense, high-bandwidth write operations during checkpointing. These write operations are typically sequential and can benefit significantly from storage systems optimized for high-bandwidth sequential writes.
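This burst pattern can be mimicked with a toy sketch: large, aligned sequential writes followed by an fsync, scaled down here to 64 MiB (real checkpoints are orders of magnitude larger):

```python
import os
import tempfile
import time

CHUNK = 1 << 20      # 1 MiB sequential writes, typical of checkpoint bursts
TOTAL = 64 * CHUNK   # 64 MiB stand-in for a multi-terabyte checkpoint

path = os.path.join(tempfile.mkdtemp(), "burst.bin")
buf = bytes(CHUNK)

start = time.perf_counter()
with open(path, "wb") as f:
    for _ in range(TOTAL // CHUNK):
        f.write(buf)          # large sequential writes, no seeks
    f.flush()
    os.fsync(f.fileno())      # force data to the device, as a checkpoint must
elapsed = time.perf_counter() - start

print(f"wrote {TOTAL >> 20} MiB sequentially in {elapsed:.3f}s")
```

The fsync matters: a checkpoint is only a safety net once the data is durably on the device, which is why sustained sequential write bandwidth, not just burst cache speed, determines checkpoint time.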
 
Different parallelism strategies in distributed training can have a substantial impact on checkpointing behavior. These strategies influence when checkpointing occurs during training and which portion of the model is saved. In modern distributed training setups, multiple GPUs may simultaneously write different parts of the same layer, creating complex I/O patterns. This parallel writing capability is key to efficiency but requires careful coordination and robust storage systems that can handle concurrent write operations while maintaining data consistency. Any bottleneck in this process can lead to widespread training delays.
 
Slow checkpointing can create significant training bottlenecks, as the entire training process must pause while the checkpoint is written to storage. For instance, in a large-scale training setup, if checkpointing takes 30 minutes every few hours, this could result in several hours of accumulated downtime over the entire training period. This directly impacts training efficiency and increases operational costs—especially in cloud environments where computing resources are billed by the hour.
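To make the accumulated cost concrete, here is the arithmetic under assumed numbers matching the scenario above: a 30-minute checkpoint every 4 hours over a hypothetical 30-day run:

```python
# Hypothetical scenario: 30-minute checkpoints every 4 hours, 30-day run.
checkpoint_minutes = 30
interval_hours = 4
run_days = 30

checkpoints = run_days * 24 // interval_hours           # number of pauses
downtime_hours = checkpoints * checkpoint_minutes / 60  # total time paused
overhead = downtime_hours / (run_days * 24)             # fraction of wall time

print(f"{checkpoints} checkpoints, {downtime_hours:.0f} h of downtime "
      f"({overhead:.1%} of the run)")
```

At this cadence the run spends 90 hours, roughly one eighth of its wall-clock time, paused for checkpoint writes, all of it billed compute in a cloud environment.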
 
Faster checkpointing also allows teams to create checkpoints more frequently, reducing the maximum potential data loss in the event of failures. This enables more aggressive training approaches and improved experimental iteration cycles. Furthermore, rapid checkpoint loading times facilitate quicker experimentation with different training configurations and model architectures, as researchers can more easily restore from previous states to test alternative approaches.
 
The storage system’s ability to efficiently handle these checkpoint operations becomes a pivotal factor in the overall training infrastructure. High-performance storage solutions that can manage both the burst write patterns of checkpointing and the sustained read/write operations of training can significantly reduce the total time and cost of training large language models. Thus, the storage subsystem’s performance characteristics—particularly its ability to handle large sequential writes and maintain consistent high bandwidth—are crucial considerations when designing LLM training infrastructure.
 
For this report, we sought to evaluate SSD performance for AI checkpointing, assessing the benefits of the latest Gen5 SSDs when checkpoint speed is critical, compared to the largest QLC SSDs on the market—which can store vast numbers of checkpoints if that is more beneficial for the model being trained.
 
Checkpoint Performance – Benchmarking with DLIO
To evaluate the Solidigm SSDs’ real-world performance in AI training environments, we used the Deep Learning I/O (DLIO) benchmark. Developed by Argonne National Laboratory, DLIO is specifically designed to test I/O patterns in deep learning workloads, providing insight into how storage systems handle checkpointing, data ingestion, and model training challenges.
 
 
Using DLIO, we aimed to measure the drive’s throughput, latency, and reliability under intensive checkpointing scenarios. While this testing was conducted on the 61.44TB D5-P5336, initial performance data indicates that the Solidigm D5-P5336 122TB version offers a similar performance profile. We also included results from a TLC-based D7-PS1010 to demonstrate the advantages of PCIe Gen5 in this test. We selected these two drives to showcase both perspectives on checkpoints: one focusing on the fastest possible checkpoint time, and the other on storing the maximum number of checkpoints on a single SSD.
 

The platform chosen for this work was our Dell PowerEdge R760 running Ubuntu 22.04.2 LTS. We used DLIO benchmark version 2.0 from the August 13, 2024 release. Our system configuration is outlined below:

  • 2 x Intel Xeon Gold 6430 (32-Core, 2.1GHz)
  • 16 x 64GB DDR5-4400
  • 480GB Dell BOSS SSD
  • Serial Cables Gen5 JBOF
    • 7.68TB Solidigm D7-PS1010
    • 61.44TB Solidigm D5-P5336

To ensure our benchmarking reflected real-world scenarios, we based our testing on the Llama 3.1 405B model architecture, implementing checkpointing through torch.save() to capture model parameters, optimizer states, and layer states. Our setup simulated an 8-GPU system using a hybrid parallelism strategy: 4-way tensor parallelism combined with 2-way pipeline parallelism across the eight GPUs. This configuration resulted in checkpoint sizes of 1,636 GB, representative of modern large language model training requirements.
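As a rough sanity check on that figure, assume an average of about 4 bytes persisted per parameter across the sharded weights and optimizer state (an assumption for illustration; the exact breakdown depends on precision and sharding strategy):

```python
params = 405e9        # Llama 3.1 405B parameter count
bytes_per_param = 4   # assumed average bytes persisted per parameter
                      # (weights + sharded optimizer state; the exact mix
                      # depends on precision and parallelism strategy)

size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.0f} GB per checkpoint")
```

That back-of-the-envelope figure of about 1,620 GB lands within roughly 1% of the 1,636 GB checkpoints generated in our configuration.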


Our testing process for the DLIO checkpoint workload consisted of filling each drive to a similar utilization level. For the 61.44TB Solidigm D5-P5336, each pass included 33 checkpoint intervals, totaling 54TB. The smaller 7.68TB D7-PS1010 comfortably fit three checkpoint intervals, with a total footprint of 4.9TB. One additional checkpoint could fit into the D7-PS1010, although it brought its utilization slightly higher than we wanted.


The DLIO checkpoint workload yielded interesting results when we compared the Gen4 QLC-based 61.44TB D5-P5336 to the Gen5 TLC-based 7.68TB D7-PS1010. During the first pass, as the drives filled, we observed the widest performance gap between the two SSD models: the faster Gen5 PS1010 completed each checkpoint in an average of 464 seconds, versus 623 seconds for the Gen4 P5336. In passes two and three, the gap narrowed to 579 and 587 seconds for the PS1010 and 676 and 680 seconds for the P5336.

For businesses looking to have the smallest possible gap in checkpointing intervals, the TLC-based Gen5 PS1010 offers an advantage in the fastest completion time. If the goal is to retain many checkpoints cost-effectively, the QLC-based Gen4 P5336 can do just that. We measured a difference in average checkpoint times of less than 17% between both drives during passes two and three.
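That figure follows directly from the steady-state numbers above:

```python
# Average checkpoint times (seconds) from passes two and three.
ps1010 = (579 + 587) / 2  # Gen5 TLC D7-PS1010
p5336 = (676 + 680) / 2   # Gen4 QLC D5-P5336

gap = (p5336 - ps1010) / ps1010
print(f"P5336 averages {gap:.1%} longer per checkpoint than the PS1010")
```

The steady-state gap works out to roughly 16%, consistent with the sub-17% difference cited, a modest premium in time for a large gain in retained-checkpoint capacity.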

GPUDirect Storage Bandwidth

While DLIO shows flash performance in an AI workflow, the workload is wholly write-based until a checkpoint is restored. To paint a fuller picture of the Solidigm D7-PS1010 and D5-P5336 in AI workloads, we included read bandwidth measurements using GDSIO.

How GPUDirect Storage Works

Traditionally, when a GPU processes data stored on an NVMe drive, the data must first travel through the CPU and system memory before reaching the GPU. This introduces bottlenecks: the CPU becomes a middleman, adding latency and consuming valuable system resources. GPUDirect Storage eliminates this inefficiency by enabling the GPU to access data directly from the storage device via the PCIe bus. This direct path reduces data-movement overhead, allowing faster and more efficient transfers.

AI workloads, especially those involving deep learning, are highly data-intensive. Training large neural networks requires processing terabytes of data, and any delay in data transfer can lead to underutilized GPUs and longer training times. GPUDirect Storage addresses this challenge by ensuring that data is delivered to the GPU as quickly as possible, minimizing idle time and maximizing computational efficiency.

Like the DLIO test, the goal is to better understand and characterize the differences between high-speed Gen5 SSDs and high-capacity QLC drives. Not every AI workload is the same, and each drive offers distinct advantages, depending on the need.

Testing Configuration Matrix

We systematically tested every combination of the following parameters with an NVIDIA L4 in our test platform:

  • Block Sizes: 1M, 128K, 64K, 16K, 8K
  • Thread Counts: 128, 64, 32, 16, 8, 4, 1
  • Job Counts: 16
  • Batch Sizes: 16

Our first look was at the QLC-based D5-P5336, which topped out at 4.2GiB/s using a 1M transfer size at an IO depth of 128. Block size had a substantial effect on bandwidth, which climbed steadily from 8K up to 1M. The benefit of additional IO depth began to taper off at 32, beyond which the workloads leveled off.


Next, we looked at the Gen5 PS1010, which scaled up to 6.2GiB/s at a 1M block size and an IO depth of 128. Across the board, it outperformed the Gen4-based P5336, with particular workloads demonstrating a substantial uplift. One notable area of improvement was the 128K block size, where at IO depths of 64 and 128 the PS1010 offered double the read bandwidth of the P5336.
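To put those peak bandwidths in practical terms, here is the time to stream a hypothetical 10 TiB of training data or checkpoint restores at each drive’s measured peak (the dataset size is an assumption for illustration):

```python
# Measured peak GDS read bandwidths (GiB/s) at 1M transfer size, QD128.
p5336_bw = 4.2    # Gen4 QLC D5-P5336
ps1010_bw = 6.2   # Gen5 TLC D7-PS1010

dataset_tib = 10  # hypothetical volume of data streamed to the GPU
dataset_gib = dataset_tib * 1024

for name, bw in [("D5-P5336", p5336_bw), ("D7-PS1010", ps1010_bw)]:
    minutes = dataset_gib / bw / 60
    print(f"{name}: {minutes:.1f} minutes to stream {dataset_tib} TiB")
```

At these rates the Gen5 drive shaves roughly a third off the streaming time, about 13 minutes per 10 TiB, which compounds quickly across repeated restores and epochs.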


It’s important to note that both SSDs were tested with the NVIDIA L4. While the Gen4 D5-P5336 is at or near its top end, higher-end NVIDIA GPUs such as the H100 demonstrated higher performance with the D7-PS1010. A drive’s speed is the ultimate deciding factor for some customers, while others prioritize overall density; Solidigm provides solutions for both with its QLC and TLC SSD offerings.

Conclusion

As the scale and complexity of AI training continue to surge, the underlying storage infrastructure must not only keep pace but also set the tempo. Our tests with two distinctly different SSDs highlight the importance of aligning storage solutions with specific training priorities—whether that means minimizing checkpoint latency or maximizing checkpoint density for cost-effective scalability.
 
In our evaluation, we tested the Solidigm D5-P5336 (61.44TB) and the D7-PS1010 (7.68TB) under realistic AI training conditions, leveraging the DLIO benchmark and an extensive hybrid-parallel LLM checkpointing workflow. We captured metrics reflecting checkpoint write performance across multiple test runs as the drives filled, underscoring the performance differences in completion times between the Gen4 QLC-based D5-P5336 and the Gen5 TLC-based D7-PS1010.
 
 
While the D7-PS1010 delivered the fastest possible checkpoint writes, the D5-P5336 demonstrated compelling cost-effectiveness and capacity advantages, with only a modest performance trade-off. We further examined GPUDirect Storage (GDS) read bandwidth using GDSIO with an NVIDIA L4 GPU. Our findings showed the Solidigm D5-P5336 delivered up to 4.2GiB/s of read bandwidth at a 1M transfer size, while the D7-PS1010 provided a substantial uplift to 6.2GiB/s. Performance would be even more impressive when leveraging a more powerful GPU, such as the NVIDIA L40S or H100/H200.
 
Looking ahead, the unprecedented capacity of the Solidigm D5-P5336 122TB SSD is poised to reshape AI training and deployment. As model sizes and checkpointing requirements continue to grow, these high-capacity drives unlock new levels of efficiency and flexibility, enabling training strategies that were previously unattainable. Solidigm’s leadership in high-capacity SSD solutions empowers organizations to store more data and checkpoints on fewer drives, while helping future-proof their infrastructures against the next wave of AI complexity.
 
Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang/Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com / www.storagesserver.com

Business Focus:
ICT Product Distribution/System Integration & Services/Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World.” Your Trusted ICT Product Service Provider!