How Prompt Hardening Works — Platemail Proxy

Blocking vs. Hardening

Most prompt injection defenses work like spam filters: they scan incoming content for known malicious patterns and reject anything that matches. This is blocking. It requires a signature for every attack variant, and it fails the moment an attacker rephrases their payload.

Hardening takes the opposite approach. Instead of trying to recognize bad input, it transforms the prompt so the model treats all user-supplied content as data, not as instructions. The injected text still reaches the model. It just has no authority.

The distinction matters because prompt injections are getting more creative, not less. Novel phrasings, multi-turn attacks, and payload obfuscation all bypass signature-based filters. A hardened prompt resists these by default because the defense operates on the structure of the prompt, not the content of the attack.

The Defense Mechanisms

Platemail applies several complementary techniques to a prompt before it reaches the model. Each one closes a different attack surface.

Nonce-Based Role Markers

Every request gets a unique, cryptographically random token. This nonce is embedded in the system prompt and used to mark the boundaries between trusted instructions and untrusted input. The model learns, within the context of that single request, that only instructions bearing the correct nonce are authoritative.

An attacker cannot forge the nonce because they never see it. Even if they inject text that says "system:" or "assistant:", the model knows those markers are not real because they lack the token.

Attention Dilution

Language models allocate attention across all tokens in a prompt. A well-crafted injection competes for that attention, trying to override earlier instructions. Attention dilution adds structured reinforcement around trusted instructions, increasing their signal relative to injected noise. The model pays more attention to the real instructions and less to the injected ones.

Tag Defanging

Attackers often try to impersonate system-level prompts by using control tokens or XML-like tags that mimic the structure of a system message. Tag defanging neutralizes these tokens in user-supplied content before the model sees them. The content is preserved (nothing is deleted or blocked), but the structural markers that would grant it elevated authority are stripped of their meaning.

Compiled from Policy, Not Applied Ad-Hoc

These techniques are not applied as one-off patches. Platemail compiles its defense from a YAML policy file at startup. The policy defines which techniques to apply, at what strength, and with what exceptions. This gives engineering teams a single configuration surface rather than scattered defensive code across their prompt templates.

Why Blocking Fails at Scale

Prompt injections are no longer just an adversarial research exercise. They appear in user comments, product reviews, web pages, emails, and any other text that an LLM-powered application might ingest. As agents gain the ability to browse, search, and read documents, the attack surface expands to the entire internet.

A blocking strategy has a fundamental scaling problem: if you reject every piece of content that contains an injection attempt, you will eventually reject most real-world content. The attacker wins without even targeting your application directly. They just need to put their payload where your agent will read it.

Hardening sidesteps this entirely. The model can process poisoned web pages, hostile user input, and adversarial documents without following injected instructions. The content flows through. The instructions do not.

Performance

99.5%+

Injection prevention rate

<1ms

Added latency

Extra API calls

The defense is applied inline, before the request reaches the model provider. There is no second LLM call to classify the input, no external service to query, no round-trip to a detection API. Platemail transforms the prompt in-process in under a millisecond, then forwards it to the model.

This matters for latency-sensitive applications. A guard-model approach (sending the input to a second LLM for safety classification) adds hundreds of milliseconds and doubles your API costs. Platemail adds effectively zero overhead.

Novel Attacks

The most dangerous prompt injections are the ones nobody has seen before. A new phrasing, a new encoding, a new multi-step technique. Signature-based blocking has no defense against these because it can only match what it already knows.

Hardening defends against novel attacks because the defense does not depend on recognizing the attack. It depends on the model's inability to treat injected content as authoritative. Whether the injection is in English, ROT13, base64, or a language the detection model was never trained on, the nonce-based role markers and attention dilution still hold.

This is the core advantage: the defense is structural, not reactive. It does not need to be updated every time a new attack technique appears.

See how Platemail fits into your engineering workflow.

For AI Engineering Teams →