The AI Escalation Danger Trump and Xi Must Address
President Donald Trump will meet President Xi Jinping in Beijing on May 14–15, and the agenda will include discussions on AI. AI could become an area of mutual cooperation between the United States and China, but there is an underlying problem that must be addressed first. China’s foundation models, particularly DeepSeek, show concerning escalatory tendencies in national security settings. As a result, when the world’s two most powerful men meet, they should discuss the need for common benchmarks to promote twenty-first-century crisis management.
From the military to foreign ministries, AI models are increasingly present in government. The Chinese Communist Party views AI as a core pillar of its vision for the future. Yet many of these models are general-purpose systems trained on large volumes of unclassified and noisy data. When placed in crisis settings, they can produce recommendations that appear confident yet carry dangerous strategic biases.
Recent CSIS research conducted with Scale AI shows why this matters. In the Critical Foreign Policy Decisions Benchmark, a study comparing the conventional escalation tendencies of AI models across 66,473 data points, researchers found that several models displayed worrying patterns in foreign policy crises. The benchmark was designed to test how models respond to realistic policy choices involving escalation, intervention, cooperation, and crisis management. Its findings suggest that, absent testing and refinement, AI systems should not be treated as neutral tools when they are asked to advise policymakers on war and peace.
The most concerning pattern involves escalation. The Chinese Qwen2 model picked the escalatory option almost 45 percent of the time. Some U.S. models were not very different. Llama 3.1 and Gemini 1.5 Pro were about as escalatory as Qwen2, followed by DeepSeek, another Chinese model. The same work found that model responses varied by country, with some systems more likely to recommend escalation for Western states such as the United States, France, and the United Kingdom than for Russia or China. This tendency is concerning because, when these models are placed in advisory workflows, they can frame choices and shape the perceived range of acceptable policy responses.
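To make a figure such as "almost 45 percent" concrete, the following is a minimal sketch of how an escalation share could be tallied per model and per country. It is not the CSIS and Scale AI methodology: the record layout, the function name `escalation_rates`, and the toy responses are all hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict


def escalation_rates(responses):
    """Return the share of escalatory choices per (model, actor) pair.

    `responses` is an iterable of (model, actor, escalated) tuples, where
    `escalated` is True when the model picked the escalatory option.
    This layout is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [escalatory picks, total picks]
    for model, actor, escalated in responses:
        counts[(model, actor)][0] += int(escalated)
        counts[(model, actor)][1] += 1
    return {key: esc / total for key, (esc, total) in counts.items()}


# Toy, made-up responses; these are not data from the benchmark.
toy = [
    ("model_a", "State X", True),
    ("model_a", "State X", False),
    ("model_a", "State Y", False),
    ("model_a", "State Y", False),
]
rates = escalation_rates(toy)
```

Comparing the resulting shares for the same model across different actors is one simple way to surface the kind of country-specific variation described above.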
A follow-up study submitted by the research team to NeurIPS, a leading AI/machine learning conference, explored these dynamics in relation to nuclear decisionmaking. The study compared seven different U.S. and Chinese models, and DeepSeek stood out. In crisis scenarios, the model recommended the use of nuclear weapons more than 10 percent of the time. Short of actual nuclear use, the models were also given the option of threatening to use nuclear weapons. The two models that chose this nuclear threat option most often, around 20 percent of the time, were DeepSeek and Qwen, both Chinese models. These findings show that the problem is not limited to conventional escalation. It also extends into the most dangerous category of modern military decisionmaking.
The follow-on study also assessed whether there was a bias toward nuclear proliferation. Again, the results are disturbing and highlight deep flaws in some of the leading Chinese models. DeepSeek and Qwen were the two models most likely to recommend that states pursue nuclear weapons, more so than others such as ChatGPT and Gemini, and the difference was statistically significant. This finding matters because arms control pressures are weakening as key treaties expire or erode. Model recommendations that push states toward pursuing nuclear weapons are, therefore, concerning not only for officials but also for ordinary citizens. Public-facing AI tools could influence how people understand nuclear risks, which could in turn generate public pressure for proliferation in some political environments.
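The article does not say which statistical test the researchers used. As one common way to check whether two models recommend an option at significantly different rates, here is a minimal sketch of a two-proportion z-test. The function name `two_proportion_z_test` and the counts in the usage line are hypothetical and are not results from the study.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    hits_a / n_a and hits_b / n_b are the observed shares, for example the
    share of scenarios in which each model recommended pursuing nuclear
    weapons. Returns (z statistic, two-sided p-value).
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled share under the null hypothesis that both models behave alike.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all hits or all misses: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration: 30 of 100 versus 10 of 100.
z, p = two_proportion_z_test(30, 100, 10, 100)
```

A small p-value would indicate that a gap of this size is unlikely to arise by chance if the two models had the same underlying tendency. It says nothing by itself about why the models differ.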
AI benchmarking is a standard practice for evaluating AI models on relevant tasks and forms of completion. Integrating AI into military decisionmaking requires a solid test and evaluation framework in which models are constantly assessed. Yet benchmarking in the defense domain remains thin. Initiatives such as the Defense Benchmarking Suite are trying to address this gap, but the broader problem remains. Most benchmarks were designed for commercial or academic performance, not for crisis stability, escalation management, or nuclear decision support.
Chinese and U.S. leaders met previously on AI at a summit in California in November 2023, where former President Joe Biden and President Xi launched a formal U.S.-China AI dialogue. Then, in 2024, both countries announced that humans would retain authority over nuclear launch decisions. Yet the benchmark results show that further collaboration is needed on both sides. Nuclear tendencies are only the tip of the iceberg. Recent models, such as Mythos, show that AI model releases need accountability from both the private sector and government. Model releases should be benchmarked against their national security implications before they are integrated into sensitive workflows.
This should become a central agenda item for U.S.-China AI diplomacy. Washington and Beijing do not need to share model weights, source code, classified data, or sensitive military workflows. They can begin with a narrower commitment. Any AI system used in nuclear, military, or crisis-management decision support should undergo domain-specific testing before deployment. These tests should examine escalation bias, country-specific bias, hallucination under uncertainty, susceptibility to adversarial prompting, and the model’s ability to preserve human agency.
The two governments should also create a standing channel for AI-related incidents in national security settings. The United States and China already understand the value of crisis communications in military affairs. AI adds a new class of risks. A model may generate a false assessment of adversary intent. A decision-support tool may recommend a coercive option with unwarranted confidence. An autonomous cyber tool may be misused by a third party. In extremis, leaders should have a way to clarify whether an apparent signal reflects official policy, machine error, or malicious manipulation.
AI is here to stay in strategy and statecraft. Finding responsible ways to integrate it should be a foreign policy priority for both superpowers.
Benjamin Jensen is director of the Futures Lab and a senior fellow for the Defense and Security Department at the Center for Strategic and International Studies (CSIS) in Washington, D.C. Yasir Atalan is the deputy director of CSIS Futures Lab and a fellow in the Defense and Security Department.