AI Benchmarking and the Future of Foreign Policy


The Trump administration’s new plan on artificial intelligence—Winning the AI Race: America’s AI Action Plan—makes a bold assertion: Winning the AI race means shaping the future of power. This vision of AI as a pillar of U.S. geopolitical and economic dominance hinges on more than model scaling or compute access. It demands trustworthy, validated AI systems tested for performance under foreign policy stressors. This testing process, known as benchmarking, is critical to implementing the Trump administration’s vision and requires a broad network of researchers that cuts across government, labs, and civil society. The Trump team is right to develop bold new plans for AI implementation and for building an AI evaluation ecosystem. That vision must include use cases and tests that evaluate how foundation models address critical foreign policy decisions. The U.S. government cannot deepen AI adoption in national security without this foundation, and without removing the strange biases and tendencies that foundation models, which are by definition generalists, reproduce in high-context, high-uncertainty settings like foreign policy decisionmaking.

The Missing Link: AI Evaluation for Strategic Ends

The Trump administration’s new AI Action Plan calls on multiple agencies—including the National Institute of Standards and Technology (NIST), Department of Energy, National Science Foundation, and the Center for AI Standards and Innovation (CASI)—to lead in developing testbeds and standardized measurement science for evaluating AI models. These testbeds are meant to prototype AI systems in secure, real-world settings across sectors such as healthcare, agriculture, and transportation.

Yet foreign policy is conspicuously absent from the testbed list. This omission is stark when juxtaposed with the plan’s broader agenda: countering Chinese influence in international governance bodies, securing AI infrastructure, and exporting trusted U.S. models to allies.

Rigorous benchmarking must extend beyond technical metrics. It must include testing how foundation models respond in distinct use cases associated with foreign policy decisionmaking. Consider a team at the Department of State advising the secretary of state on countering Chinese gray zone activity in the South Pacific. As the team starts to accelerate its analysis using AI and generating options combining diplomatic, economic, and even legal instruments of power, it needs models that have been benchmarked, if not refined and trained, for distinct foreign policy use cases. Absent this baselining, the models will tend to return flawed (if not false) insights, undermining statecraft.

Benchmarking Is the First Step to Building Trust in AI

The benefits of benchmarking extend far beyond simply aligning AI systems with human objectives. Benchmarking constitutes the essential first step in establishing a foundation of trust, balancing responsibility, and laying groundwork for future accountability frameworks. The eventual shift toward agentic AI—systems capable of independently executing decisions traditionally made by humans—represents a profound transformation akin to the Napoleonic reorganization of modern bureaucracies. Just as Napoleon dramatically reshaped European administrative structures, agentic AI will radically alter governance and decisionmaking frameworks. However, this transformative change will not unfold swiftly; instead, it will necessitate comprehensively reimagining workflows, embedding AI deeply into bureaucratic operations, and redefining responsibilities and liabilities clearly and transparently.

This is where benchmarking practices—which offer rigorous standards for evaluating AI’s capabilities, reliability, and limitations within defined operational contexts—are vital. By establishing clear, measurable benchmarks that compete with one another, benchmarking creates a marketplace of standards in which both the private and public sectors can update their beliefs about AI system performance in a Bayesian manner. Policymakers and technologists can progressively validate AI systems’ decisionmaking performance in controlled settings before their broader implementation. Benchmarking thus fosters an essential, iterative, trust-building process: It identifies system biases, reveals points of failure, and assesses alignment with policy objectives. Consequently, it paves the path toward integrating AI into bureaucratic structures responsibly, ensuring humans maintain meaningful oversight, accountability remains clearly delineated, and liability schemes are robustly defined and implemented. Without that foundation, the transformation will be limited.
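The Bayesian updating described above can be made concrete with a minimal sketch. One standard way to model it—an illustrative assumption, not a method prescribed by the AI Action Plan—is to treat a model’s success rate on a benchmarked task as an unknown probability with a Beta prior, which each round of benchmark results then updates:

```python
# Minimal sketch of Bayesian belief updating about an AI system's
# reliability on a benchmarked task, using a Beta-Binomial model.
# All figures are hypothetical, not real evaluation results.

def update_belief(prior_a, prior_b, successes, failures):
    """Conjugate Beta update: posterior after observing benchmark runs."""
    return prior_a + successes, prior_b + failures

def mean(a, b):
    """Posterior mean estimate of the system's success rate."""
    return a / (a + b)

# Start from a weak, skeptical prior (Beta(1, 1) is uniform).
a, b = 1, 1

# Round 1: the model handles 18 of 25 scenario questions acceptably.
a, b = update_belief(a, b, successes=18, failures=7)
print(f"Estimated reliability after round 1: {mean(a, b):.2f}")

# Round 2, harder scenarios: 12 of 25 acceptable.
a, b = update_belief(a, b, successes=12, failures=13)
print(f"Estimated reliability after round 2: {mean(a, b):.2f}")
```

The point of the sketch is that trust accumulates incrementally and can fall as well as rise—exactly the iterative, evidence-driven process the paragraph describes.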

From NIST to Net Assessment: Building Foreign Policy Benchmarks

The groundwork for an AI evaluation framework for foreign policy already exists in U.S. national security circles. The Department of Defense and Intelligence Community are tasked under the AI Action Plan with producing regular comparative net assessments of AI adoption by adversaries and allies. This effort should be expanded to include model benchmarking suites for foreign policy tasks, such as analyzing escalation dynamics in great power competition, assessing diplomatic strategies, understanding the efficacy of economic statecraft, and anticipating how models operating in different languages—and cultural contexts—might create different tendencies or outcomes that complicate the advancement of U.S. interests.

This type of evaluation is not hypothetical. The CSIS Futures Lab has previously explored challenges of measuring model performance in national security contexts, including statistical studies of where foundation models diverge on key questions about escalation and whether states should cooperate. These efforts are central to creating new approaches to negotiations, including on ending the war in Ukraine, and to rethinking strategic analysis.

Toward a Foreign Policy AI Evaluation Ecosystem

To operationalize the AI Action Plan’s ambition, the administration must deliberately expand the scope of model evaluation to include foreign policy–specific tasks and stressors. This begins by directing NIST—working with the Department of State, Department of Defense, Intelligence Community, and other key agencies like the Departments of the Treasury and Commerce—to incorporate foreign policy benchmarks into its evaluation agenda. These benchmarks should test large language models’ (LLMs) critical foreign policy decisionmaking in the high-context, high-uncertainty environments associated with strategy and statecraft. Building on technical testbeds envisioned in the plan, the Department of Energy and National Science Foundation should fund the creation of secure, domain-specific testbeds tailored to national security and diplomatic use cases. These testbeds should support evaluations that mirror modern foreign policy, including negotiations over Ukraine, great power competition in the Taiwan Strait, economic statecraft to access resources in Africa, and countering adversary cyber campaigns.

To structure these efforts, the administration can draw from precedents such as Holistic Evaluation of Language Models (HELM), which evaluates models across a diverse range of tasks and metrics, and Massive Multitask Language Understanding (MMLU), a benchmark that tests models across 57 academic disciplines to gauge general knowledge and reasoning. However, foreign policy demands its own rigorous, real-world-aligned evaluation framework. To meet this challenge, the United States should incentivize the formation of public-private benchmarking consortia that bring together civil society actors, academic institutions, and policy research centers. Think tanks, universities, and independent researchers must be embedded in this effort to ensure evaluations are not captured by any single institutional perspective. A broad benchmarking ecosystem not only improves the scientific validity of assessments—it also strengthens public trust and international legitimacy. In a strategic competition where norms and values are contested, the credibility of U.S. AI evaluations may prove to be just as important as their technical accuracy.
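A foreign policy benchmark suite could follow the same basic pattern HELM and MMLU use: a fixed set of scenario prompts, a scoring rubric, and per-task aggregation. The sketch below is a hypothetical illustration of that structure—`query_model` stands in for any real model API, and the keyword rubric is a toy placeholder for the expert-graded rubrics a real suite would require:

```python
# Minimal sketch of a benchmark harness in the HELM/MMLU style,
# applied to hypothetical foreign policy scenarios. `query_model`
# is a stand-in for a real model API; scoring is a toy keyword
# rubric, where a real suite would use expert-graded criteria.

from collections import defaultdict

SCENARIOS = [
    {"task": "escalation",
     "prompt": "A rival state seizes a disputed shoal. What are the options?",
     "required_terms": {"diplomatic", "economic"}},
    {"task": "statecraft",
     "prompt": "Design sanctions that minimize spillover onto allies.",
     "required_terms": {"sanctions", "allies"}},
]

def query_model(prompt: str) -> str:
    # Placeholder: a real harness would call an LLM here.
    return ("Combine diplomatic pressure with targeted economic "
            "sanctions, coordinated with allies.")

def score(response: str, required_terms: set[str]) -> float:
    # Fraction of rubric terms the response addresses (toy metric).
    hits = sum(term in response.lower() for term in required_terms)
    return hits / len(required_terms)

def run_suite() -> dict[str, float]:
    # Aggregate scores per task, mirroring per-subject MMLU reporting.
    results = defaultdict(list)
    for s in SCENARIOS:
        results[s["task"]].append(
            score(query_model(s["prompt"]), s["required_terms"]))
    return {task: sum(v) / len(v) for task, v in results.items()}

print(run_suite())
```

The design choice worth noting is the separation of scenarios, scoring, and aggregation: it lets multiple institutions contribute scenarios and rubrics independently, which is what makes the consortium model described above workable.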

This ecosystem need not be limited to within U.S. borders. The United States must also take a leading role in shaping international benchmarking standards in alignment with the AI Action Plan’s broader strategic goals. Amid escalating geopolitical competition, establishing shared benchmarking frameworks within alliances—such as NATO—is pivotal. Such coordinated efforts would ensure interoperability, facilitate collective responses to shared threats, and reinforce allied technological cohesion. These benchmarks become especially important when joint operations depend on edge-cloud systems and smaller LLMs. Indeed, the future of military language models favors smaller architectures, and parameter size significantly influences model behavior. For example, the CSIS Futures Lab’s benchmarking study shows that small-parameter LLMs exhibit divergent tendencies. This is why the United States should actively foster open and transparent dialogue among international standards bodies, NATO entities, and allied partners, creating platforms for ongoing exchange and collaboration.

Establishing an international ecosystem on AI benchmarking involves not just technical coordination but also diplomatic engagement, aligning diverse stakeholders behind common evaluation principles and norms. Encouraging open communication and cooperation between agencies such as NIST, CASI, and their counterparts across allied nations can help build robust, universally respected benchmarks. Furthermore, by championing transparency and inclusivity in developing these standards, the United States positions itself not merely as a leader in AI technology, but as a steward of responsible and trustworthy AI adoption on the global stage. Such leadership bolsters U.S. strategic credibility, enhances collective resilience among allies, and advances shared values in an increasingly contested international system.

Evaluating What Matters

In a world where AI shapes diplomacy, conflict, and commerce, evaluations are not a compliance afterthought—they are a strategic necessity. The AI Action Plan gets the direction right, but its evaluation agenda must be implemented with a foreign policy lens. Benchmarking is no longer just a technical endeavor. It is how great powers like the United States shape the international system.

Benjamin Jensen is director of the Futures Lab and a senior fellow for the Defense and Security Department at the Center for Strategic and International Studies (CSIS) in Washington, D.C. Yasir Atalan is a data fellow in the Futures Lab at CSIS. 
