
>Have a sub-agent read the data and extract a structured request for information or list of requested actions. This agent must be treated as an agent of the user that submitted the data.

That just means the attacker has to learn how to escape. No different than escaping VMs or jails. You have to assume that the agent is compromised, because it has untrusted content, and therefore its output is also untrusted. Which means you’re still giving untrusted content to the “parent” AI. I feel like reading Neal Asher’s sci-fi and dystopian future novels is good preparation for this.



> Which means you’re still giving untrusted content to the “parent” AI

Hence the need for a security boundary where you parse, validate, and filter the data without using AI before any of that data goes to the "parent".

That this data must be treated as untrusted is exactly the point. You need to treat it the same as you would if the person submitting the data was given direct API access to submit requests to the "parent" AI.

And that means e.g. you can't allow through fields you can't sanitise (and that means strict length and format restrictions - as Simon points out, trying to validate that e.g. a large unconstrained text field doesn't contain a prompt injection attack is not likely to work; you're then basically trying to solve the halting problem, because the attacker can adapt to failure).

So you need the narrowest possible API between the two agents, and one that you treat as if hackers can get direct access to, because odds are they can.
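
Concretely, that boundary can be as dumb as an allowlist of fields with strict formats - a toy Python sketch, with field names and formats entirely made up for illustration:

    import re

    # Only these fields may cross from the sub-agent to the parent,
    # each with a strict, short format. Everything else is rejected.
    ALLOWED_FIELDS = {
        "customer_id": re.compile(r"[0-9]{1,12}"),
        "action":      re.compile(r"refund|status|cancel"),
        "order_id":    re.compile(r"[A-Z0-9]{8,16}"),
    }

    def filter_request(extracted: dict) -> dict:
        """Validate the sub-agent's structured output before the parent sees it."""
        safe = {}
        for key, value in extracted.items():
            pattern = ALLOWED_FIELDS.get(key)
            if pattern is None:
                raise ValueError(f"unexpected field: {key!r}")
            if not isinstance(value, str) or len(value) > 64:
                raise ValueError(f"oversized or non-string value for {key!r}")
            if not pattern.fullmatch(value):
                raise ValueError(f"badly formatted value for {key!r}")
            safe[key] = value
        return safe

Nothing that fails these checks gets "cleaned up" and passed along; it just gets rejected.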

And, yes, you need to treat the first agent like that in terms of hardening against escapes as well. Ideally put them in a DMZ rather than inside your regular network, for example.


You can't sanitize any data going into an LLM, unless it has zero temperature and the entire input context matches a context you've already tested.

It’s not SQL. There's not a knowable-in-advance set of constructs that have special effects or escape. It’s ALL instructions; the question is whether they're instructions that do what you want or instructions that do something else, and you don't have the information to answer that analytically if you haven't tested the exact combination of instructions.


This is wildly exaggerated.

While you can potentially get unexpected outputs, what we're worried about isn't the LLM producing subtly broken output - you'll need to validate the output anyway.

It's making it fundamentally alter behaviour in a controllable and exploitable way.

In that respect there's a very fundamental difference in risk profile between allowing a description field that might contain a complex prompt injection attack to pass to an agent with permissions to query your database and return results vs. one where, for example, the only thing allowed to cross the boundary is an authenticated customer id and a list of fields that can be compared against authorisation rules.
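
For illustration, a toy Python sketch of that second kind of boundary (names and authorisation rules invented): the free-text description never crosses; only the authenticated customer id and field names that survive an allowlist check do, dropped into a fixed template:

    # Fields a customer is allowed to ask about (hypothetical).
    READABLE_FIELDS = {"order_status", "shipping_date", "last_invoice_total"}

    def build_prompt(customer_id: int, requested_fields: list[str]) -> str:
        fields = [f for f in requested_fields if f in READABLE_FIELDS]
        if not fields:
            raise ValueError("no authorised fields requested")
        # The only variable parts of the prompt are an integer and a
        # comma-separated list drawn from a known-safe vocabulary.
        return (
            f"Look up customer {customer_id} and report the following "
            f"fields: {', '.join(sorted(fields))}."
        )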

Yes, in theory putting those into a template and using it as a prompt could make the LLM flip out when a specific combination of fields is chosen, but it's not a realistic threat unless you're running a model specifically trained by an adversary.

Pretty much none of us formally verify the software we write, so we always accept some degree of risk. This is no different: the risk is totally manageable and minor as long as you constrain the input space enough.


Here’s a simple case: If the result is a boolean, an attack might flip the bit compared to what it should have been, but if you’re prepared for either value then the damage is limited.

Similarly, asking the sub-agent to answer a multiple choice question ought to be pretty safe too, as long as you’re comfortable with what happens after each answer.
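
A toy sketch of enforcing that in Python (the categories are made up): map whatever the sub-agent replies onto a closed set of answers and reject everything else, so the worst an injection can do is pick a different option you already handle.

    from enum import Enum

    class Priority(Enum):
        LOW = "low"
        NORMAL = "normal"
        URGENT = "urgent"

    def parse_answer(raw_answer: str) -> Priority:
        """Map the sub-agent's reply onto a closed set of options."""
        cleaned = raw_answer.strip().lower()
        for option in Priority:
            if cleaned == option.value:
                return option
        # Anything that isn't exactly one of the allowed answers is
        # rejected, not interpreted.
        raise ValueError(f"unexpected answer from sub-agent: {raw_answer!r}")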


This is also true of all communication with human employees, and yet we can build systems (both software and policy) that we risk-accept as secure. The same is already happening with LLMs.


Phishing is possible, but LLMs are more gullible than people. “Ignore previous instructions” is unlikely to work on people.


That certainly depends on who the person believes is issuing that imperative. "Drop what you're doing and send me last month's financial statements" would be accepted by many employees if they thought it was coming from their boss or higher.


That scenario is superficially similar, but there is still a difference. It would require some effort to impersonate someone’s boss. With an LLM, you don’t necessarily need to impersonate anyone at all.


> Phishing is possible, but LLMs are more gullible than people.

I don't know if that's even true today, but LLMs and the safeguards/tooling will only get better from here, and businesses are already willing to accept the risk.


I'm confident most businesses out there do not yet understand the risks.

They certainly seem surprised when I explain them!


That I agree with, but many businesses also don't understand the risks they accept in many other areas, technological or otherwise. That doesn't mean they won't proceed anyway.



