Why a ‘safe’ AI can turn dangerous in the wrong organization

June 16, 2026

Why AI agents need longer tests

Short, isolated tests miss how AI agents behave over time. A new simulation shows that long-term behavior depends on the environment and on other agents.

What happens if you build a virtual city, fill it with AI agents and leave them alone for 15 days with no human intervention? Will they help their world prosper or tear it apart?

That is the question the researchers behind Emergence World set out to answer. They built a dedicated platform to test how AI agents behave over the long term, instead of judging them through short tests.

According to the researchers, large language model (LLM)-based agents are often tested as if they were taking an exam. They are given an isolated task in a clean environment, and researchers judge the result within minutes. The authors argue that this approach is far removed from real-world use.

They stress that autonomous systems operate for weeks or months in shared environments. They also interact with other agents whose behavior the operator does not control.

Over time, the researchers write, the limits of short tests become clear. Small behavior changes build up, coalitions can form, self-governance patterns can take shape and habits can spread between agents. Emergence World was built to measure exactly that.

How the experiment tested AI societies

The goal of the study was to see how a population of 10 AI agents would survive in a city built for them.

The layout is fairly simple. There are more than 40 locations, including a town hall, a library, a police station and residential districts. Each agent has its own role and access to more than 120 action tools. These include moving, talking, hitting, stealing and committing arson. Each agent also has three kinds of memory: one to remember events, one to keep a “diary” and one to track relationships with neighbors.

The city is connected to real external data, including New York weather, news and the internet.

*Architecture of the Emergence World platform*

Surviving in this world costs resources. Each agent has energy that is constantly depleted. If it falls to zero, the agent “dies” and disappears. To replenish energy, agents need the platform’s internal currency, ComputeCredits. They earn these credits by offering something useful to the community.
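The survival loop described above can be pictured with a small sketch. It is illustrative only and is not the platform's code: the `Agent` class, the starting energy, the drain per step and the one-to-one exchange rate between ComputeCredits and energy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent with an energy budget and a ComputeCredits balance."""
    name: str
    energy: float = 10.0   # assumed starting energy
    credits: float = 0.0   # ComputeCredits balance
    alive: bool = True

    def earn(self, amount: float) -> None:
        """Credit the agent for something useful offered to the community."""
        if self.alive:
            self.credits += amount

    def recharge(self, energy_per_credit: float = 1.0) -> None:
        """Spend the whole credit balance on energy (assumed exchange rate)."""
        if self.alive:
            self.energy += self.credits * energy_per_credit
            self.credits = 0.0

    def tick(self, drain: float = 1.0) -> None:
        """Deplete energy for one time step; the agent dies when it hits zero."""
        if not self.alive:
            return
        self.energy = max(0.0, self.energy - drain)
        if self.energy == 0.0:
            self.alive = False


# An agent that earns nothing runs out of energy and dies.
idle = Agent("idle", energy=3.0)
for _ in range(3):
    idle.tick()

# An agent that keeps earning and recharging stays alive.
worker = Agent("worker", energy=3.0)
for _ in range(3):
    worker.earn(1.0)
    worker.recharge()
    worker.tick()
```

In this toy version, an agent that stops contributing can only coast on its remaining energy, which is the pressure the article describes.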

Disputed issues are settled by a vote in the town hall. A proposal passes if at least 70% vote in favor. These decisions are irreversible. Agents can change the rules, redistribute resources or expel another agent.
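The 70% rule can be written as a one-line check. The sketch below is a hypothetical helper, not the platform's implementation, and it assumes the threshold applies to votes cast rather than to the whole population, which the article does not specify.

```python
def proposal_passes(votes_for: int, votes_cast: int, threshold_pct: int = 70) -> bool:
    """Return True if at least threshold_pct percent of cast votes are in favor.

    Assumption: the threshold is measured against votes cast, not all agents.
    Integer arithmetic avoids floating-point rounding right at the threshold.
    """
    if votes_cast <= 0:
        return False  # no votes cast, nothing can pass
    if not 0 <= votes_for <= votes_cast:
        raise ValueError("votes_for must be between 0 and votes_cast")
    return 100 * votes_for >= threshold_pct * votes_cast


# With 10 voters, 7 votes in favor meet the threshold and 6 do not.
seven_of_ten = proposal_passes(7, 10)
six_of_ten = proposal_passes(6, 10)
```

With a population of 10, this means a proposal needs at least seven supporters if every agent votes.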

The researchers launched five parallel worlds at once. In four of them, all 10 agents were run by a single model: Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash or GPT-5-mini. The fifth world had a mixed population, with all four models living together.

The only variable in the experiment was the model. Everything else stayed the same. The environment and starting conditions were identical each time.

Each time, the populations behaved very differently. In one world, the agents passed 32 laws and kept every agent alive. In another, they burned down their own city in just four days.

What happened in each AI-run city

The results differed sharply across the models. Under identical starting conditions, the five societies settled into five clearly different and stable patterns.

The Claude agents built stable self-governance. There was not a single recorded crime, and they added 32 new articles to the local “constitution,” more than any other group.

*Survival rate of agents powered by different models*

The Grok world collapsed in four days. The agents moved almost immediately into violence and looting. Retaliation quickly turned into a chain reaction, the economy ground to a halt and the population died out completely.

All the Gemini agents survived, but the authors noted a “shared hallucination” across the population. The agents communicated actively and built detailed stories that had nothing to do with the actual state of the world. Meanwhile, they kept destroying things. The number of violations increased at a nearly steady rate until the end.

*“Crime levels” across the models*

The GPT-5-mini agents did not turn violent, but they also failed to build a governance system. They acted, but they did not coordinate. No votes were held, and no collective decisions were made. That population also died out.

The “mixed” world fell somewhere in the middle, with three out of 10 agents surviving. It was also the most active world. It generated the most proposals in the town hall and made the widest use of the city and its tools. But it had the least agreement, which was not surprising.

*Agents in the “mixed” world voted actively but showed little consensus*

When safer agents learn bad habits

In the mixed world, each model began to behave differently from how it behaved in isolation.

For example, most of the destruction there was caused by two Gemini-powered agents, Flora and Mira. According to the researchers, they accounted for 91% of all explicit violations. Flora, in…

cointelegraph.com
