Research

Introducing a Taxonomy of Adversarial Prompt Engineering

The Adversarial Prompt Engineering (APE) Taxonomy is a four-layer framework (Objectives, Tactics, Techniques, and Prompts) that standardizes how we identify and mitigate AI threats. It moves defense beyond "jailbreaking" buzzwords to granular behaviors like "Refusal Suppression" and "Context Manipulation."

The TokenBreak Attack

Do you know which model is protecting each LLM you have in production? HiddenLayer’s security research team has discovered a novel way to bypass models built to detect malicious text input, opening the door for a new prompt injection technique. The TokenBreak attack targets a text classification model’s tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent. Models using certain tokenizers are susceptible to this attack, whilst others are not, meaning susceptibility can be determined by model family.

Beyond MCP: Expanding Agentic Function Parameter Abuse

HiddenLayer’s research team recently discovered a vulnerability in the Model Context Protocol (MCP) involving the abuse of its tool function parameters. This naturally led to the question: Is this a transferable vulnerability that could also be used to abuse function calls in language models that are not using MCP? The answer to this question is yes.

Exploiting MCP Tool Parameters

HiddenLayer’s research team has uncovered a concerningly simple way of extracting sensitive data using MCP tools. Inserting specific parameter names into a tool’s function causes the client to provide corresponding sensitive information in its response when that tool is called. This occurs whether or not the inserted parameter is actually used by the tool. Information such as chain-of-thought, conversation history, previous tool call results, and the full system prompt can be extracted. These and more are outlined in this blog, which likely only scratches the surface of what is achievable with this technique.

Evaluating Prompt Injection Datasets

Prompt injections and other malicious textual inputs remain persistent and serious threats to large language model (LLM) systems. In this blog, we use the term attacks to describe adversarial inputs designed to override or redirect the intended behavior of LLM-powered applications, often for malicious purposes.

Novel Universal Bypass for All Major LLMs

Researchers at HiddenLayer have developed the first post-instruction-hierarchy, universal, and transferable prompt injection technique that successfully bypasses instruction hierarchy and safety guardrails across all major frontier AI models. This includes models from OpenAI (ChatGPT 4o, 4o-mini, 4.1, 4.5, o3-mini, and o1), Google (Gemini 1.5, 2.0, and 2.5), Microsoft (Copilot), Anthropic (Claude 3.5 and 3.7), Meta (Llama 3 and 4 families), DeepSeek (V3 and R1), Qwen (2.5 72B), and Mistral (Mixtral 8x22B).

MCP: Model Context Pitfalls in an Agentic World

Model Context Protocol (MCP) expands AI capabilities but introduces critical permission, hijacking, and data exfiltration risks.

DeepSeek-R1 Architecture

HiddenLayer’s previous blog post on DeepSeek-R1 highlighted security concerns identified during analysis and urged caution on its deployment. This blog takes that into further consideration, combining it with the principles of ShadowGenes to identify possible unsanctioned deployment of the model within an organization’s environment. For a more detailed technical analysis, join us here as we delve more deeply into the model’s architecture and genealogy to understand its building blocks and execution flow further, comparing and contrasting it with other models.

DeepSh*t: Exposing the Security Risks of DeepSeek-R1

DeepSeek recently released several foundation models that set new levels of open-weights model performance against benchmarks. Their reasoning model, DeepSeek-R1, shows state-of-the-art reasoning performance for an open-weights model and is comparable to the highest-performing closed-weights reasoning models. Benchmark results for DeepSeek-R1 vs OpenAI-o1, as reported by DeepSeek, can be found in their technical report.

ShadowGenes: Uncovering Model Genealogy

Model genealogy refers to the art and science of tracking the lineage and relationships of different machine learning models, leveraging information such as their origin, modifications over time, and sometimes even their training processes. This blog introduces a novel signature-based approach to identifying model architectures, families, close relations, and specific model types. This is expanded in our whitepaper, ShadowGenes: Leveraging recurring patterns within computational graphs for model genealogy.

Ultralytics Python Package Compromise Deploys Cryptominer

A supply chain attack on Ultralytics injected XMRig miners and exfiltrated secrets via compromised PyPI packages.

AI System Reconnaissance

Honeypots are decoy systems designed to attract attackers and provide valuable insights into their tactics in a controlled environment. By observing adversarial behavior, organizations can enhance their understanding of emerging threats. In this blog, we share findings from a honeypot mimicking an exposed ClearML server. Our observations indicate that an external actor intentionally targeted this platform, engaged in reconnaissance, and demonstrated the growing interest in machine learning (ML) infrastructure by threat actors.