Prompt Injection and Guardrails
Why untrusted text in an LLM's prompt is dangerous, how injection hijacks the model, and the guardrails that actually contain it.
- Why the Model Can't Tell Instructions From Data: To an LLM, your system instructions and any text you paste into the context arrive as one undifferentiated stream of tokens; there is no privileged channel that says 'this part is the rules.'
- How Injection Actually Works: Direct injection comes from the user; indirect injection hides in content the model fetches (a web page, a document, an email). Both aim to hijack actions or exfiltrate data.
- Guardrails That Hold: You can't stop the model from being fooled, so you contain it with separate trust levels, least-privilege tools, validated output, and a human in the loop for anything irreversible.
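
The containment idea in the last bullet can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete defense: the tool names, the `Tool` record, the `<untrusted_content>` wrapper, and the JSON shape of a tool call are all hypothetical and chosen only for the example. The checks run in ordinary code outside the model, so they still apply when injected text has already fooled it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record; the fields are assumptions for this sketch."""
    name: str
    irreversible: bool  # True if the action cannot be undone


# Least privilege: only tools listed here can ever be called.
ALLOWED_TOOLS = {
    "search_docs": Tool("search_docs", irreversible=False),
    "send_email": Tool("send_email", irreversible=True),
}


def wrap_untrusted(text: str) -> str:
    """Mark fetched content as data (a separate trust level).

    This is only a hint to the model, not a security boundary:
    the model may still follow instructions hidden inside the text.
    """
    return f"<untrusted_content>\n{text}\n</untrusted_content>"


def check_tool_call(raw_model_output: str, human_approved: bool = False) -> dict:
    """Validate a model-proposed tool call before anything executes.

    Returns the parsed call, or raises ValueError / PermissionError.
    """
    # Validated output: the model's text must parse into the expected shape.
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise ValueError("expected exactly the keys 'tool' and 'args'")
    if not isinstance(call["tool"], str) or not isinstance(call["args"], dict):
        raise ValueError("'tool' must be a string and 'args' an object")

    # Least privilege: reject anything outside the allow-list.
    tool = ALLOWED_TOOLS.get(call["tool"])
    if tool is None:
        raise PermissionError(f"tool not allowed: {call['tool']}")

    # Human in the loop: irreversible actions need explicit approval.
    if tool.irreversible and not human_approved:
        raise PermissionError(f"{tool.name} is irreversible and needs human approval")
    return call
```

In this sketch, a proposed `search_docs` call passes, a `send_email` call raises `PermissionError` until a person approves it, and a call to any unlisted tool is rejected no matter what the fetched page told the model to do.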