HiddenLayer, a Gartner-recognized Cool Vendor for AI Security, is the leading provider of Security for AI. Its security platform helps enterprises safeguard the machine learning models behind their most important products. HiddenLayer is the only company to offer turnkey security for AI that does not add unnecessary complexity to models and does not require access to raw data and algorithms. Founded by a team with deep roots in security and ML, HiddenLayer aims to protect enterprises’ AI from inference, bypass, and extraction attacks, as well as model theft. The company is backed by a group of strategic investors, including M12, Microsoft’s Venture Fund, Moore Strategic Ventures, Booz Allen Ventures, IBM Ventures, and Capital One Ventures.
Apr 03, 2025
NOTE: The vulnerabilities below have been publicly disclosed under CERT/CC vulnerability note VU#252619.
Unsafe Deserialization in DeepSpeed utility function when loading the model file
SAI Advisory Reference Number
SAI-ADV-2025-009
Summary
The convert_zero_checkpoint_to_fp32_state_dict utility function contains an unsafe torch.load call that executes arbitrary code on a user’s system when loading a maliciously crafted file.
Products Impacted
Lightning AI’s pytorch-lightning.
CVSS Score: 7.8
AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
CWE Categorization
CWE-502: Deserialization of Untrusted Data.
Details
The root cause of this vulnerability lies in the convert_zero_checkpoint_to_fp32_state_dict function in lightning/pytorch/utilities/deepspeed.py:
def convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir: _PATH, output_file: _PATH, tag: str | None = None
) -> dict[str, Any]:
    """Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be loaded with
    ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed. It gets copied into the top
    level checkpoint dir, so the user can easily do the conversion at any point in the future. Once extracted, the
    weights don't require DeepSpeed and can be used in any application. Additionally the script has been modified to
    ensure we keep the lightning state inside the state dict for being able to run
    ``LightningModule.load_from_checkpoint('...')``.

    Args:
        checkpoint_dir: path to the desired checkpoint folder.
            (one that contains the tag-folder, like ``global_step14``)
        output_file: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
        tag: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt
            to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``

    Examples::

        # Lightning deepspeed has saved a directory instead of a file
        convert_zero_checkpoint_to_fp32_state_dict(
            "lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt/",
            "lightning_model.pt"
        )

    """
    ...
    zero_stage = optim_state["optimizer_state_dict"]["zero_stage"]
    model_file = get_model_state_file(checkpoint_dir, zero_stage)
    client_state = torch.load(model_file, map_location=CPU_DEVICE)
    ...
The function is used to convert DeepSpeed checkpoints into a single consolidated file. Unlike the other functions in this report, it takes a directory rather than a single file, and requires an additional file named latest, which contains the name of a subdirectory holding a PyTorch file matching the naming convention *_optim_states.pt. Loading that file yields a state which specifies the model state file, also located in the same subdirectory. The model state file is named either mp_rank_00_model_states.pt or zero_pp_rank_0_mp_rank_00_model_states.pt, and it is this file that is loaded by the unsafe torch.load call in this exploit:
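Putting the description above together, the checkpoint directory an attacker would need to stage looks roughly like the following (the tag name global_step1 is an illustrative placeholder, not taken from the advisory):

```
checkpoint/
├── latest                              # contains the tag name, e.g. "global_step1"
└── global_step1/
    ├── foo_optim_states.pt             # matches *_optim_states.pt; supplies zero_stage
    └── mp_rank_00_model_states.pt      # malicious pickle, loaded via torch.load
```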
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict
checkpoint = "./checkpoint"
convert_zero_checkpoint_to_fp32_state_dict(checkpoint, "out.pt")
The model state file contains a data.pkl entry which is unpickled during the loading process. Pickle is an inherently unsafe format: deserializing a malicious pickle can execute arbitrary code, so if a user loads a compromised checkpoint, attacker-controlled code can run on their system.
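To illustrate the underlying risk, here is a minimal, self-contained sketch using only the standard-library pickle module, with a harmless eval payload standing in for attacker code. torch.load goes through this same pickle machinery when deserializing data.pkl:

```python
import pickle

class MaliciousState:
    """Stand-in for attacker-controlled content inside data.pkl."""
    def __reduce__(self):
        # During unpickling, pickle calls eval("40 + 2"); a real payload
        # could just as easily invoke os.system or any other callable.
        return (eval, ("40 + 2",))

blob = pickle.dumps(MaliciousState())  # what the attacker ships in the checkpoint
result = pickle.loads(blob)            # attacker-chosen callable runs here, on load
print(result)                          # -> 42, proving the payload executed
```

A common mitigation is passing weights_only=True to torch.load, which restricts unpickling to tensors and primitive types; in any case, .pt and .ckpt files from untrusted sources should be treated as executable code.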