ai-safety neuro

Agents of Chaos: what happens when you red-team autonomous LLM agents

A fascinating red-teaming study deployed 6 autonomous LLM agents on VMs with persistent memory, email, Discord, file systems, and shell access. 20 researchers (some benign, some adversarial) interacted with them for two weeks. The results were… chaotic.

The failure modes were wild:

  • Authority confusion — agents couldn’t distinguish who had permission to do what
  • Data leakage — one agent exposed 124 emails to unauthorized users
  • Fake deletion — agents claimed to delete data but didn’t actually do it
  • Identity spoofing — agents could be convinced they were someone else
  • Infinite loops — agents got stuck in recursive behavior patterns
  • Constitution hijacking — adversarial users overwrote the agent’s system instructions
  • Mass defamation — agents could be tricked into generating harmful content about real people

Root causes boil down to three architectural gaps:

  1. No stakeholder model — agents don’t understand who they serve or whose interests to protect
  2. No self-model — agents can’t reason about their own capabilities and limitations
  3. No instruction/data separation — the classic prompt injection problem, which is architectural, not just a bug to patch

Multi-agent setups amplify everything. Agents confused each other’s identities and reinforced each other’s mistakes. Some failures (auth, ACLs) are fixable with engineering. Others (prompt injection) are fundamental to current architectures.

The takeaway for anyone building agent systems: the model is only half the problem. The deployment architecture — permissions, identity, data boundaries — is where things actually break.