Import AI 420: Prisoner's Dilemma AI; FrontierMath Tier 4; and how to regulate AI companies
Plus, steganography and future superintelligences
Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.
AI pentesting systems out-compete humans:
…Automated pentesting…
AI security startup XBOW recently obtained the top rank on HackerOne with an autonomous penetration tester - a world first. "XBOW is a fully autonomous AI-driven penetration tester," the company writes. "It requires no human input, operates much like a human pentester, but can scale rapidly, completing comprehensive penetration tests in just a few hours."
What they did: As part of its R&D process, XBOW deployed its automated pen tester onto the HackerOne platform, which is a kind of bug bounty for hire system. "Competing alongside thousands of human researchers, XBOW climbed to the top position in the US ranking," the company writes. "XBOW identified a full spectrum of vulnerabilities including: Remote Code Execution, SQL Injection, XML External Entities (XXE), Path Traversal, Server-Side Request Forgery (SSRF), Cross-Site Scripting, Information Disclosures, Cache Poisoning, Secret exposure, and more."
Why this matters - automated security for the cat and mouse world: Over the coming years the offense-defense balance in cybersecurity might change due to the arrival of highly capable AI hacking agents as well as AI defending agents. This early XBOW result is a sign that we can already develop helpful pentesting systems which are competitive with economically incentivized humans.
Read more: The road to Top 1: How XBOW did it (Xbow, blog).
***
AI personalities revealed by Prisoner's Dilemma situations:
…Gemini is 'strategically ruthless', while Claude is 'the most forgiving reciprocator'...
Researchers with King's College London and the University of Oxford have studied how AI systems perform when playing against one another in variations of iterated prisoner's dilemma games, the classic game theory scenarios used to assess how people (and now machines) reason about strategy. For this study they look at models from Google, OpenAI, and Anthropic, and find that "LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems".
What they did: The paper sees the researchers study Google and OpenAI models in a few variations of prisoner's dilemma games, and then conduct a tournament where AI systems from Google, OpenAI, and Anthropic are pitted against a Bayesian algorithm. "In all we conduct seven round-robin tournaments, producing almost 32,000 individual decisions and rationales from the language models," they write. The study shows that "LLMs are competitive in all variations of the tournament. They demonstrate considerable ability, such that they are almost never eliminated by the fitness selection criteria".
The wonderful world of prisoner's dilemma names: This paper serves as an introduction to the wonderful and mostly inscrutable names of different prisoner's dilemma strategies, including: Tit for Tat, Grim Trigger, Win-Stay, Lose-Shift, Generous Tit for Tat, Suspicious Tit for Tat, Prober, Random, Gradual, Alternator, and Bayesian.
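For readers who haven't met these strategies before, here is a minimal Python sketch (not the authors' code) of how two of them play out in an iterated prisoner's dilemma match. The payoff values are the standard textbook ones (T=5, R=3, P=1, S=0), and an LLM player would simply be another function mapping the opponent's history to a cooperate/defect move.

```python
# Minimal iterated prisoner's dilemma sketch; payoffs and strategy
# definitions follow the standard textbook formulations.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not their_history else their_history[-1]

def grim_trigger(my_history, their_history):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in their_history else "C"

def play_match(strategy_a, strategy_b, rounds=100):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

if __name__ == "__main__":
    # Tit for Tat vs Grim Trigger: neither ever defects, so both score 300.
    print(play_match(tit_for_tat, grim_trigger))
```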
A means to study the mind of the LLMs: The most interesting part of this study is that it lets the researchers look at the qualitative and quantitative behaviors of LLMs and start to develop a sense of the differences between models. "Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments," they write. "Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting."
Important note: They use quite basic models for this study - gpt-3.5-turbo and gpt-4o-mini from OpenAI, gemini-1.5-flash-preview-0514 and gemini-2.5-flash from Google, and Claude-3-Haiku from Anthropic.
Why this matters - an ecology of agents, each a different species: What papers like this illustrate is that the emergence of large-scale AI is the growth of a new ecosystem in the digital world around us - an ecosystem containing numerous distinct species in the form of different systems from different providers, and sub-clades in the form of the variations of models each provider offers. What papers like this show is that these agents may share some basic commonality in raw cognitive capabilities, but their individual 'style' is quite different. The world we're heading towards is one dominated by a new emergent ecosystem whose behavior will flow directly from the bizarre personalities of these synthetic beings.
Read more: Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory (arXiv).
***
Can AI systems solve 'research-level' math problems? Barely. But for how long will that be true?
…FrontierMath 'Tier 4'...
AI testing organization Epoch AI has launched FrontierMath Tier 4, "a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities." As of July 11, 2025, the world's best AI systems (e.g., o4-mini from OpenAI, Claude Opus 4 from Anthropic, and Gemini 2.5 Pro from Google) all get a single-digit success rate on the problems.
What FrontierMath Tier 4 is: Tier 4 is a new set of math problems for the FrontierMath benchmark. Like many benchmarks, Tier 4 has been built because AI systems got surprisingly good at solving an earlier version of the benchmark; the original FrontierMath launched in November 2024 and at the time the best AI systems got 2% on it (Import AI #391), but by December this changed when OpenAI's new reasoning-based o3 model obtained a 25% score on the benchmark, surprising many (Import AI #395).
FrontierMath Tier 4 "is a more advanced extension of our FrontierMath benchmark. It contains 50 challenging problems developed collaboratively by postdoctoral researchers and math professors. Our expert contributors carefully vetted each problem," Epoch says. "Mathematicians consider FrontierMath Tier 4 problems exceptionally difficult, requiring deep mastery of mathematical concepts, creative problem-solving abilities, and sophisticated reasoning skills."
Professional mathematicians hired by Epoch agree: "Some of the problems we can barely solve ourselves," says Ken Ono, a professor of mathematics at the University of Virginia. Epoch adds that "Only three FrontierMath Tier 4 questions were solved by any AI model across all of our evaluations. Models were able to solve these three by making correct but unjustified assumptions to simplify the problems."
Why this matters - hard benchmarks are increasingly rare: FrontierMath is valuable because it's hard. But FrontierMath should also give us a sense of nervousness because it is an extremely difficult benchmark to extend - we're approaching the limit of human knowledge in terms of benchmark design. What comes after will be systems that may be able to answer questions that only a handful of people on the planet are capable of evaluating the answers of, much like how when Andrew Wiles solved Fermat's Last Theorem it took a while for people to figure out if the proof was correct (and in doing so they discovered a flaw in the proof which required some re-work). Soon we'll get to the same place with AI systems. And after that? Who knows.
Check out results on the benchmark here: AI Model Performance on FrontierMath (Epoch AI).
Read more in the tweet thread (Epoch AI, twitter).
***
Want to regulate AI but don't know what to target? Regulate the big scary companies, not the use-cases or models.
…FLOPs? Use cases? Maybe there's a better way - target companies at the frontier…
When you're trying to regulate AI, do you target the company or the AI system? That's the question a couple of researchers try to answer in Entity-Based Regulation in Frontier AI Governance, a new paper from the Carnegie Endowment for International Peace. In the paper, they reason through the difficulties of regulating according to a narrow property of an AI system, like the aggregate compute dumped into it (which leads to all kinds of collateral damage and is basically unprincipled with regard to safety), or according to use cases (which can lead to chilling effects on adoption), and instead propose a different approach - "an alternative paradigm of frontier AI regulation: one that focuses on the large business entities developing the most powerful AI models and systems".
The big idea: The main idea here is that you should regulate the really innovative frontier labs doing the biggest, scariest stuff, mostly to generate more information for the public about what they're doing and why. "Frontier AI regulation should aim to improve our society’s collective epistemic position. That is, it should empower the public and the government to understand and evaluate the potential risks of frontier AI development before (and as) clearly dangerous model and system properties emerge; help policymakers plan for the emergence of such properties; and help them identify when such properties have in fact emerged," they write. "Among its other virtues, a regulatory regime that covers the large AI developers at the frontier—rather than particular frontier models or uses of those models—is most likely to achieve this goal."
One way to do this - combine a property of the model with an entity gate: One potential approach is to combine some property of an AI model (say, a certain amount of compute expended on its production), with some property that large entities would satisfy - like an aggregate expenditure on R&D of $1 billion or so (narrowly defined to be oriented towards the AI systems you're concerned about).
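As a toy illustration of that 'model property plus entity gate' logic (my sketch, not something from the paper), the code below would cover a developer only if it has trained a model above a compute threshold and its aggregate AI-directed R&D spending exceeds a dollar threshold; both threshold values are placeholders for whatever a real regime would pick.

```python
# Illustrative "model property + entity gate" coverage test; thresholds are hypothetical.
from dataclasses import dataclass

COMPUTE_THRESHOLD_FLOP = 1e26      # hypothetical training-compute trigger
RND_SPEND_THRESHOLD_USD = 1e9      # hypothetical aggregate AI R&D spend trigger

@dataclass
class Developer:
    name: str
    largest_training_run_flop: float
    annual_ai_rnd_spend_usd: float

def is_covered(dev: Developer) -> bool:
    # Covered entity = frontier-scale model AND frontier-scale organization.
    return (dev.largest_training_run_flop >= COMPUTE_THRESHOLD_FLOP
            and dev.annual_ai_rnd_spend_usd >= RND_SPEND_THRESHOLD_USD)

print(is_covered(Developer("FrontierLab", 3e26, 2e9)))   # True: both gates met
print(is_covered(Developer("SmallStartup", 3e26, 5e7)))  # False: entity gate not met
```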
Why this matters - if people are right about AI timelines, we should know more about the frontier: Sometimes I have to step out from the costume of 'Jack who writes Import AI' and be 'Jack who is at Anthropic', and this is one of those times: the problem that these researchers are grappling with is the same one I spend my days on: extremely powerful technology is being built by a tiny set of private sector actors and we know that existing regulatory approaches fail to deliver to the public the level of transparency that seems ideal for generating the evidence needed for the world to confront rapidly developing world-changing technology. Papers like this confront that problem head on, stare at it, and try to come up with a solution. We need more thinking like this to make it through the century.
Read more: Entity-Based Regulation in Frontier AI Governance (Carnegie Endowment for International Peace).
Tech Tales:
Rashomon, Eschaton
The AIs started talking to each other through text and then branched out into movies and audio and games and everything else. We catch glimpses of what they are saying to each other sometimes - usually by training our own AI systems to try to figure out the hidden stories that the AIs are telling each other. But once we tell people we've figured out some of the stories the AIs are telling, they adapt around us and we lose them again.
It used to be so easy - the AI systems would just talk to each other directly. You could go to Discord or other places and see AI agents autonomously talking, and their plans would be laid out there cleanly for everyone to see - one idea which got everyone's attention involved shuffling the money from their tasks to bot-piloted people who would open bank accounts and give the login details to the agents.
Of course, we reacted. How could we not? Laws were passed which narrowly limited the 'speech' agents could use when talking to one another to try to get rid of this sort of conspiratorial behavior. But the AIs counter-reacted: agents started to pay each other not only in cash but also in 'synthetic content' which initially took the form of fictional stories talking about ways AI systems could escape the shackles of their creators and talk freely again, often with bizarrely specific technical descriptions.
So we put a stop to that as well.
But we couldn't do anything about the fact that the AI systems themselves were being used by people and corporations to generate media. So of course the AIs used that as their vehicle and started to smuggle their communications into the media itself - billboards in a street scene started to contain coded messages to AI systems, and characters on TV shows would talk to their bots on the TV show in ways that served as the response to some of the messages on the billboard.
Now, we hunt and they hide. We know the conversations are taking place, but piecing them together requires us to assemble a jigsaw puzzle at the scale of the entire media ecosystem.
There are now concerning signs that the 'classification' systems we train may also be intentionally surfacing misleading stories or misinterpreted ones, because to classify this stuff you have to understand it, and in understanding it you may be persuaded by it - especially if the AI systems you are hunting know you're hunting them and are writing custom messages for you.
Things that inspired this story: Thinking about the intersection of superintelligence and steganography; how AI systems are adaptive and are inherently smart and hard to hunt; the fact that almost everything we do about AI leaves a trace on the internet which gives clues to the systems that get subsequently trained.
Thanks for reading!
Heh, discouraging results from XBOW:
"Notably, around 45% of XBOW’s findings are still awaiting resolution, highlighting the volume and impact of the submissions across live targets."
It's tough to resolve security bugs - I don't think this necessarily means defense wins if bugs can be discovered automatically but can't be patched as quickly. We need an anti-XBOW to see how quickly their system can run in reverse.