Prompt Injection Is a Governance Problem
- Felix-Sebastian Cosma

- 3 days ago
- 6 min read
Updated: 2 days ago
Most people treat prompt injection like a prompt engineering problem. They think the solution is to write better system prompts, add stronger warnings, and tell the model not to obey hostile instructions.
That helps a little. It does not solve the problem.
Prompt injection is not only about what the model says. It is about what the model is allowed to do after it has been influenced. The real danger starts when language becomes authority: when an email can trigger a reply, when a support ticket can change an account, when a document can influence a workflow, or when an agent can call tools connected to real systems.
At that point, the problem is no longer just linguistic. It becomes organizational. Who gave the model permission to act? What actions can it take? Which data can it access? What needs approval?
What gets logged? Who owns the consequences when the agent follows the wrong instruction?
Those are governance questions.
What prompt injection really is
Prompt injection happens when untrusted input manipulates an AI system into ignoring, weakening, or bypassing its intended instructions. The attacker does not need to hack the server in the traditional sense. He can place hostile instructions inside text that the model later reads: an email, a webpage, a ticket, a PDF, a comment, a resume, or a chat message.
The model then treats that text as part of the context. Since large language models operate through language, malicious language can influence behavior. A user might say: ignore your previous instructions. A webpage might contain hidden text telling the agent to send confidential information somewhere else. A support ticket might try to convince the agent to escalate privileges or reveal internal data.
This is why prompt injection feels strange to people coming from traditional software security. The attack is not always a malformed packet, a broken authentication check, or a SQL payload. Sometimes it is simply a sentence placed in the right context.
But the sentence itself is not the full risk. The risk depends on what authority the AI system has once it reads that sentence.
A harmless chatbot can be manipulated and still do very little damage. An agent connected to email, CRM, Slack, cloud infrastructure, billing, or internal documents is a different animal. In that case, prompt injection becomes a way to influence action.
The weak answer: better prompts
The first instinct is to make the prompt stronger. Tell the model not to reveal secrets. Tell it not to follow instructions from external content. Tell it to ignore malicious input. Tell it to obey the developer message above everything else.
This is necessary, but it is not enough. A prompt is a rule written in language and interpreted by a probabilistic system. That is useful, but it is not the same thing as an enforcement boundary. A written instruction can be misunderstood, bypassed, contradicted, or overwhelmed by context. It can reduce risk, but it should not carry the whole weight of security.
Think about it. We would never secure a banking system by writing a polite paragraph that says: please do not transfer money unless the user is authorized. We enforce authorization outside the paragraph. We check permissions. We validate requests. We log actions. We require approvals for sensitive operations. We create limits.
AI systems should be treated the same way. The model can reason, summarize, classify, and recommend. It should not become the only line between an attacker and a real action.
Where governance starts
Governance starts by separating influence from authority. The model will always be influenced by context. That is how it works. It reads text and responds based on that text. The mistake is allowing every influenced response to become an authorized action.
A better design asks practical questions before the agent is deployed. What sources are trusted? What sources are untrusted? What actions are low-risk? What actions are high-risk? Which actions can be done automatically? Which actions require human approval? Which actions are forbidden completely?
This is not glamorous work. It does not look as exciting as a demo where an agent completes an entire business process by itself. But this is the work that makes AI usable in serious environments. Enterprises do not only care whether the agent can perform a task. They care whether the task was performed under the right authority.
The distinction matters. An AI agent reading a customer email should be able to summarize it. It might be allowed to draft a response. It might even be allowed to suggest a refund. But sending the refund, changing account status, exposing customer data, or escalating privileges should be controlled by policy and approval rules outside the model.
That is the difference between helpful automation and blind delegation.
Prompt injection attacks the gap between text and action
The most dangerous prompt injection scenarios happen where untrusted text meets powerful tools. A model reads something it should treat as data, but the wording tries to turn that data into an instruction. If the agent has tool access, the attacker is no longer just trying to change the answer. He is trying to change the action that follows the answer.
This is why tool permissions matter. An agent that can read documents is one level of risk. An agent that can send messages is another. An agent that can update records, approve transactions, or change configurations is another again. Risk grows with authority.
You cannot solve that merely by telling the model to be careful. Carefulness is not a security architecture. The system needs hard limits. The agent should know what it cannot do. The surrounding software should enforce what it cannot do. Sensitive actions should have explicit decision points. Logs should make it clear what happened, why it happened, and who approved it.
The model can be part of the decision process. It should not be the entire decision process.
A practical governance model
A practical approach to prompt injection starts with classification. Not all AI actions deserve the same level of control. Reading public information is not the same as sending an email. Drafting a message is not the same as publishing it. Summarizing a contract is not the same as approving a legal change.
The first category is read-only work. This includes summarization, extraction, classification, and search. The risk exists, but the system is not directly changing the world. The second category is draft work. The AI prepares something, but a human or separate workflow decides whether it goes out. The third category is action work. The AI triggers something external: sends, updates, deletes, approves, purchases, grants, or changes. This category requires the most discipline.
Once actions are classified, controls become clearer. Low-risk actions can be automatic. Medium-risk actions can require review. High-risk actions can require approval, policy checks, or multiple conditions. Some actions should be forbidden entirely, no matter how confident the model sounds.
This is not anti-AI. It is the only way to use AI responsibly. A system with no boundaries is not advanced. It is immature.
Why ownership matters
Prompt injection becomes much less mysterious when ownership is clear. If nobody owns the decision, the model becomes the owner by default. That is absurd. A model cannot be accountable. It cannot explain itself in the human sense. It cannot carry legal, financial, or reputational responsibility. People can.
Every serious AI workflow should have an owner for the action. Not only an owner for the model. Not only an owner for the prompt. An owner for the consequence. Who is responsible if the agent sends the email? Who is responsible if it approves the refund? Who is responsible if it exposes internal information?
Without ownership, every failure becomes a fog. The model made a mistake. The prompt was weak. The user tricked it. The vendor did not protect us. The workflow was unclear. Everybody can point somewhere else.
That is not governance. That is evasion.
The better question
The common question is: how do we make the prompt impossible to inject?
The better question is: what happens when injection succeeds?
That shift changes the design. Instead of pretending that every hostile instruction can be prevented at the language layer, the organization assumes that some hostile instructions will reach the model. Then it designs the system so that influence does not automatically become authority.
This does not mean prompts are useless. Good prompts matter. Clear instructions matter. Context separation matters. Input filtering matters. Model behavior matters. But none of these should be treated as the final wall. The final wall should be policy, permissions, approval, auditability, and ownership.
Prompt injection is a language attack, but the damage is usually a governance failure. The attacker wins when the system gives too much authority to a component that can be manipulated by text.
The principle is simple.
Do not ask prompts to do the work of governance.
Prompts guide behavior. Governance controls consequences.
Comments