PyTorch Lightning
Vulnerability Report

NOTE: The vulnerabilities below have been publicly disclosed under CERT/CC vulnerability note VU#252619.

Unsafe Deserialization in DeepSpeed utility function when loading the model file

SAI Advisory Reference Number

SAI-ADV-2025-009

Summary

The convert_zero_checkpoint_to_fp32_state_dict utility function calls torch.load unsafely, which allows arbitrary code to execute on a user’s system when the function loads a maliciously crafted checkpoint file.

Products Impacted

Lightning AI’s pytorch-lightning.

CVSS Score: 7.8

AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H

CWE Categorization

CWE-502: Deserialization of Untrusted Data.

Details

The cause of this vulnerability is in the convert_zero_checkpoint_to_fp32_state_dict function from lightning/pytorch/utilities/deepspeed.py:

def convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir: _PATH, output_file: _PATH, tag: str | None = None
) -> dict[str, Any]:
    """Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be loaded with
    ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed. It gets copied into the top
    level checkpoint dir, so the user can easily do the conversion at any point in the future. Once extracted, the
    weights don't require DeepSpeed and can be used in any application. Additionally the script has been modified to
    ensure we keep the lightning state inside the state dict for being able to run
    ``LightningModule.load_from_checkpoint('...')```.

    Args:
        checkpoint_dir: path to the desired checkpoint folder.
            (one that contains the tag-folder, like ``global_step14``)
        output_file: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
        tag: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt
            to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``

    Examples::

        # Lightning deepspeed has saved a directory instead of a file
        convert_zero_checkpoint_to_fp32_state_dict(
            "lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt/",
            "lightning_model.pt"
        )

    """
...
    zero_stage = optim_state["optimizer_state_dict"]["zero_stage"]
    model_file = get_model_state_file(checkpoint_dir, zero_stage)
    client_state = torch.load(model_file, map_location=CPU_DEVICE)
...

The function converts a checkpoint into a single consolidated file. Unlike the other vulnerable functions in this report, it takes a directory rather than a single file. When no tag is passed, the directory must contain a file named latest, which holds the name of a subdirectory containing a PyTorch file that follows the naming convention *_optim_states.pt. The state loaded from this file determines which model state file is used. The model state file is located in the same subdirectory and is named either mp_rank_00_model_states.pt or zero_pp_rank_0_mp_rank_00_model_states.pt. It is the file loaded in this exploit.

from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

checkpoint = "./checkpoint"
convert_zero_checkpoint_to_fp32_state_dict(checkpoint, "out.pt")
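
The directory layout described above can be inspected without deserializing anything. The following is a minimal sketch using only the Python standard library. The helper name describe_zero_checkpoint and the shape of its return value are hypothetical and are not part of Lightning or DeepSpeed. It reads the latest file and lists the file names that match the conventions given in this report.

```python
from pathlib import Path


def describe_zero_checkpoint(checkpoint_dir):
    """List the files a conversion would touch, without loading any of them.

    Hypothetical helper for illustration only; it never calls torch.load.
    """
    root = Path(checkpoint_dir)
    latest = root / "latest"
    if not latest.is_file():
        raise FileNotFoundError(f"no 'latest' file in {root}")

    # The 'latest' file holds the tag, i.e. the name of the subdirectory.
    tag = latest.read_text().strip()
    tag_dir = root / tag

    # Match the naming conventions described in the report.
    optim_files = sorted(p.name for p in tag_dir.glob("*_optim_states.pt"))
    model_files = sorted(p.name for p in tag_dir.glob("*_model_states.pt"))
    return {"tag": tag, "optim_states": optim_files, "model_states": model_files}
```

A listing like this only shows which files would be read. It says nothing about whether their contents are safe to load.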

The PyTorch file contains a data.pkl file, which is unpickled during the loading process. Pickle is an inherently unsafe format: loading a pickle can cause arbitrary code to be executed. If a user loads a compromised checkpoint, attacker-controlled code can run on their system.
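
The mechanism can be illustrated with the standard library alone. The sketch below is not the exploit from this report: it uses a harmless payload that only sets an environment variable in the current process. The names MARKER, Payload, AllowListUnpickler, safe_loads, and demo are hypothetical. The allow-list unpickler follows the restricted-globals pattern from the Python pickle documentation and is shown as one possible hardening approach, not as the fix adopted by Lightning.

```python
import io
import os
import pickle

MARKER = "PICKLE_DEMO_RAN"  # hypothetical env var used as a harmless side effect


class Payload:
    def __reduce__(self):
        # On load, pickle calls eval(...) with this string. Here it only sets
        # an environment variable; a real attacker could run anything.
        code = f"__import__('os').environ.__setitem__('{MARKER}', '1')"
        return (eval, (code,))


class AllowListUnpickler(pickle.Unpickler):
    """Refuse every global that is not explicitly allowed."""

    ALLOWED = {("collections", "OrderedDict")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data):
    return AllowListUnpickler(io.BytesIO(data)).load()


def demo():
    """Return (ran_on_plain_load, blocked_by_allow_list, clean_after_block)."""
    os.environ.pop(MARKER, None)
    blob = pickle.dumps(Payload())

    pickle.loads(blob)  # plain load: the payload runs
    ran = os.environ.get(MARKER) == "1"
    os.environ.pop(MARKER, None)

    try:
        safe_loads(blob)  # restricted load: eval is not on the allow list
        blocked = False
    except pickle.UnpicklingError:
        blocked = True
    return ran, blocked, MARKER not in os.environ
```

PyTorch's torch.load accepts a weights_only=True argument that restricts unpickling in a similar spirit. Whether that setting is compatible with a given DeepSpeed checkpoint is not established by this report.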

Researcher: Kasimir Schulz, Director, Security Research, HiddenLayer