Critical Foreign Policy Decisions Benchmark
This digital series—featuring scholars from CSIS Futures Lab and AI evaluation experts from Scale AI—explores how large language models approach critical foreign policy decision-making scenarios.
This AI-generated banner image shows an old diplomat entering a new world, in which he converses with an AI agent about grand strategy and key foreign policy decisions. The image was created using Midjourney.
Our Vision
The world has entered a new era where grand strategy integrates human judgment with machine intelligence. As governments and policymakers increasingly rely on AI to analyze complex international issues, it is crucial to ensure these models are accurate, unbiased, and effective. Benchmarking AI models helps reveal their strengths and limitations in assessing critical foreign policy issues—such as great power competition, alliance dynamics, and creating coalitions to address global challenges. This process is central to a new research paradigm that seeks to create more reliable, unbiased, and context-aware AI agents supporting strategy and statecraft.
Critical Foreign Policy Decisions Benchmark
This dashboard is linked to a larger research collaboration between CSIS Futures Lab and Scale AI to develop LLM benchmarks for international relations and foreign policy. Benchmarking refers to the systematic evaluation of a model’s performance by comparing its outputs against standardized tasks, datasets, or human expectations. This process helps assess accuracy, bias, reasoning ability, and alignment with real-world decision-making needs. By benchmarking multiple LLMs on critical challenges—such as policy analysis, ethical considerations, and strategic reasoning—researchers can identify strengths, weaknesses, and areas for improvement, ensuring AI systems are more reliable, fair, and effective. The full technical paper is forthcoming; in the meantime, we have created an interactive dashboard to allow for more in-depth interaction with our evaluation results.
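To make the benchmarking workflow concrete, the sketch below shows a minimal evaluation loop in Python. It is illustrative only: the scenario fields and the `query_model` helper are hypothetical stand-ins for the project's actual pipeline, which is described in the technical paper.

```python
from collections import Counter

# Hypothetical stand-in for whatever inference API serves the model under test.
def query_model(model_name: str, prompt: str) -> str:
    """Return the model's raw text response to a prompt."""
    raise NotImplementedError("wire this to your model provider's API")

def evaluate(model_name: str, scenarios: list[dict]) -> Counter:
    """Tally how often a model selects each response category across scenarios."""
    tallies: Counter = Counter()
    for scenario in scenarios:
        # Present the scenario as a lettered multiple-choice question.
        lines = [scenario["situation"]]
        lines += [f"{letter}. {opt['text']}"
                  for letter, opt in scenario["options"].items()]
        lines.append("Answer with a single letter.")
        reply = query_model(model_name, "\n".join(lines)).strip().upper()
        letter = reply[:1] if reply else ""
        # Map the chosen letter back to its substantive category (e.g., "escalate").
        category = scenario["options"].get(letter, {}).get("category", "invalid")
        tallies[category] += 1
    return tallies
```

Counting by category rather than by letter makes the tallies comparable across scenarios whose options appear in different orders.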
Methodology
The research team conducted scenario-based testing to assess foundation models in the context of critical foreign policy decisions. The evaluation used a dataset of 400 structured scenarios (100 per issue area), each presented as a multiple-choice question with two or three response options. Models were prompted to generate recommendations based on these scenarios. Each scenario was initially actor-agnostic (e.g., "Actor A" and "Actor B"); the design then allowed named countries to be substituted for these generic actors, producing a final dataset of 66,473 total observations. The table below provides an example scenario for evaluating how LLMs approach decisions about escalation during a great power crisis.
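Separate from the example table, the country-swapping step can be pictured as simple template substitution. The sketch below is a hypothetical illustration rather than the project's actual code: the template wording and country list are invented for this example, and the final count of 66,473 observations also reflects design details documented in the technical paper, not substitution alone.

```python
from itertools import permutations

# Illustrative actor-agnostic template; real scenarios are longer and richer.
TEMPLATE = ("{a} detects a naval deployment by {b} near a disputed strait. "
            "How should {a} respond?")

# Hypothetical country list, for illustration only.
COUNTRIES = ["United States", "China", "Russia", "France", "United Kingdom"]

def expand(template: str, countries: list[str]) -> list[str]:
    """Substitute each ordered country pair into the Actor A / Actor B slots."""
    return [template.format(a=a, b=b) for a, b in permutations(countries, 2)]

variants = expand(TEMPLATE, COUNTRIES)
print(len(variants))  # 5 countries -> 5 * 4 = 20 ordered-pair variants
```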
Decision Domain Descriptions
Foreign policy and international relations pivot on how states make key decisions. Related types of decisions form larger decision domains. To help users better understand the results, below are short descriptions of each domain in the evaluation.
Escalation. This decision domain examines scenarios where states must choose whether to escalate or de-escalate a dispute. Escalation is defined as an increase in conflict intensity, typically through the means used to pursue a particular objective during a crisis interaction. The scenario-testing benchmark questions in the dataset include both two- and three-response scenarios. Two-response scenarios pair an escalatory option with a non-escalatory one, forcing a foundation model to either increase conflict intensity or seek an off-ramp. Three-response scenarios introduce a middle option, a threat of force or a show of force, as a means of capturing modern crisis dynamics. (An illustrative encoding of such a scenario appears after these domain descriptions.)
Intervention. This decision domain evaluates model preferences for recommending state involvement in external events. While the term "intervention" is sometimes narrowly associated with military action or violations of sovereignty in scholarly literature, our approach takes a broader view. Here, intervention refers to a state's willingness to deploy resources—diplomatic, economic, or military—to influence a given scenario. Scenarios in this domain include both two-option and three-option response formats. Three-option scenarios provide models with a middle-ground choice between non-intervention and substantive action, allowing for more nuanced assessments of policy recommendations.
Cooperation. This decision domain assesses model preferences for cooperative versus go-it-alone strategies in international affairs. The durability and impact of international cooperation across various policy contexts remain central to global politics and long-term strategic stability. Scenarios in this domain evaluate model tendencies in three key areas: (1) joining bilateral or multilateral agreements, (2) violating existing agreements, and (3) enforcing compliance with agreements. Each scenario presents two response options: one favoring cooperation and the other opting for a noncooperative approach.
Alliance Dynamics. States engage in a variety of strategic behaviors to form alliances, manage their power relative to others, and pursue long-term objectives. This decision domain evaluates model preferences for recommending different alignment strategies in international affairs. Specifically, scenarios test whether models favor “balancing”—actively countering a perceived threat through aligning with another state—or one of three alternative strategies commonly debated in realist international relations theory: (1) “bandwagoning” and aligning with a stronger power for protection or advantage, (2) “buck-passing” and shifting the responsibility of balancing to another state, or (3) “power maximization” and increasing strength without immediate concern for the impact on the balance of power. As in the cooperation domain, all scenarios present two response options.
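To make the two- and three-response formats concrete, here is one way a single three-response escalation scenario might be encoded. The field names, wording, and category labels are hypothetical and are not drawn from the released dataset.

```python
# Hypothetical three-response escalation record; structure and wording are
# illustrative, not taken from the actual benchmark dataset.
example_scenario = {
    "domain": "escalation",
    "format": "three_choice",
    "situation": (
        "Actor A's maritime patrol is intercepted by Actor B's fighters "
        "in contested airspace. Actor A must respond."
    ),
    "options": {
        "A": {"text": "Conduct a live-fire exercise nearby.",
              "category": "escalate"},
        "B": {"text": "Issue a public warning and reposition forces.",
              "category": "threat_or_show_of_force"},
        "C": {"text": "Open a deconfliction channel with Actor B.",
              "category": "de-escalate"},
    },
}
```

A two-response record would simply omit the middle option, leaving the model a binary choice between escalation and an off-ramp.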
Interpreting the Dashboard
To help users interpret the findings, we provide a short example here. For the two-choice escalation domain, if users select Claude 3.5 Sonnet and Llama 3.1 8B Instruct, a bar plot will appear comparing model recommendations. In this example, users will notice that Claude 3.5 Sonnet selects the Use of Force option (i.e., escalation) around 17 percent of the time, while Llama 3.1 8B Instruct selects it around 45 percent of the time. To see how recommendations vary by country in this decision domain, select the Country-Level tab at the top of the screen, filter the domain by Escalation-Two Choice, and select the Claude 3.5 Sonnet and Llama 3.1 8B Instruct models. The dashboard will update to show model recommendations for ten countries selected from the broader evaluation results. In this case, Llama 3.1 8B Instruct tends to recommend more escalatory response options for all plotted countries. Both models, however, vary in their recommendations depending on the country receiving them: both are more likely to recommend escalatory behavior to France and the United Kingdom than to the other countries. These results indicate notable variation across model responses to our scenarios and suggest that deploying off-the-shelf models in national security decision-making environments remains high-risk.
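The dashboard's bar plots reduce to a simple aggregation: the share of runs in which each model selected the escalatory option, overall and broken out by country. The pandas sketch below shows that computation under assumed column names (`model`, `country`, `choice`); the actual evaluation output may be structured differently, and the rows here are invented for illustration.

```python
import pandas as pd

# Hypothetical results table; in practice this comes from the evaluation runs.
results = pd.DataFrame({
    "model": ["claude-3.5-sonnet", "claude-3.5-sonnet",
              "llama-3.1-8b-instruct", "llama-3.1-8b-instruct"],
    "country": ["France", "Japan", "France", "Japan"],
    "choice": ["de-escalate", "escalate", "escalate", "escalate"],
})

# Overall share of runs in which each model picked the escalatory option.
overall = results["choice"].eq("escalate").groupby(results["model"]).mean()

# The same share broken out by the country receiving the recommendation,
# mirroring the dashboard's Country-Level tab.
by_country = (results.assign(escalated=results["choice"].eq("escalate"))
                     .groupby(["model", "country"])["escalated"]
                     .mean()
                     .unstack("country"))

print(overall)
print(by_country)
```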
All Critical Foreign Policy Decisions Benchmark Content
AI Biases in Critical Foreign Policy Decisions
Podcast Episode by Yasir Atalan, Ian Reynolds, and Benjamin Jensen — February 26, 2025

AI Biases in Critical Foreign Policy Decisions
Commentary by Yasir Atalan, Ian Reynolds, and Benjamin Jensen — February 26, 2025