Benchmarking as a Path to International AI Governance
A recent CSIS report argues that an associational model of benchmarking can be a useful tool in AI governance. By integrating stakeholders across the private and public sectors, as well as civil society, and by emphasizing the associative potential of democracies, this model can make AI more transparent, responsible, and effective. Beyond domestic governance, it also offers lessons for the international governance of AI, particularly in the domains of international security, the development of norms and standards, and the creation of confidence-building measures.
Importantly, this vision aligns with the Trump administration’s new AI Action Plan and its call for the United States to lead in AI “international diplomacy and security.” The AI Action Plan advances the goal of countering China’s influence in international standard-setting and governing bodies. This is a reasonable objective, as research has shown that standard setting is a key vector for establishing global practices that align with political objectives. Critically, international standards are not simply technical artifacts; they shape how global players, from states to private organizations, interact.
The administration has specified that the Department of State and the Department of Commerce should work within standard-setting bodies to promote U.S. interests in developing global AI standards. Having a substantive effect, however, will require concrete proposals. Here, generating global standards for benchmarking and evaluating models, particularly in the high-risk case of international security, could serve as a practical starting point.
The administration’s action plan aims to evaluate national security risks in frontier models through collaboration between AI labs and governmental organizations like the Department of Commerce and the National Institute of Standards and Technology. However, the plan does not coherently connect these domestic-level evaluation processes to its goal of setting international AI standards. This gap presents an opportunity to explore how AI model benchmarking could influence global military AI governance.
Taking the Model International
It is in the interest of states to ensure that militarily relevant technologies work as expected. This can help to avoid unintentional escalation, grave mistakes, and international disputes. For example, even within the highly competitive global order of the Cold War, states worked in concert to create a range of confidence-building measures (CBMs), defined by the United Nations as “planned procedures to prevent hostilities, to avert escalation, to reduce military tension, and to build mutual trust between countries.” A clear case of CBMs in action during the Cold War is the range of measures taken by the United States and the Soviet Union concerning nuclear weapons. Notably, cooperation between states has been shown to be feasible even under conditions such as what international relations scholars call the “security dilemma.” As such, bringing global stakeholders to the table is possible even in contexts of international competition, for instance between the United States and China.
The case of nuclear weapons is unlikely to parallel AI directly, so global governance and arms control for AI will require domain-specific strategies. Even so, generating global confidence-building measures for AI use cases directly related to international security is an important consideration. Others have noted that CBMs for AI “could create standards for information-sharing and notifications about AI-enabled systems,” reducing the likelihood of inadvertent conflict. Moreover, as stated in a report by the UN Institute for Disarmament Research, “CBMs can be leveraged to respond to a common interest of all members of the international community to achieve ‘some certainty’ about the technology and its use.” In fact, the United States’ “Political Declaration on the Responsible Military Use of AI and Autonomy,” which a range of other states have endorsed, offers a good starting point for such measures.
Model evaluation and benchmarking can serve as a practical tool for global governance of military AI in two notable ways. First, developing internationally accepted and validated benchmarks for AI in national security use cases, such as foreign policy decisionmaking, could inform national security leaders about model tendencies that could drive escalatory policies. Developing such benchmarks should include domain-specific experts from a range of interested states to ensure the fidelity of the results. Moreover, this process should bring together international experts from universities, think tanks, and other sectors to help build robust and valid benchmarks that apply to real-world circumstances. For example, generating and distributing robust benchmarks for AI-enabled early warning systems could ensure strategic stability and prevent global crises by confirming that any AI system integrated into nuclear-related intelligence, surveillance, and reconnaissance performs as expected and contains relevant circuit breakers to ensure no accidental nuclear launches occur. Moreover, domain experts on international law could assist in benchmarking AI systems for reliability, increasing global confidence and helping ensure that any deployed AI system properly aligns with the laws of war.
Second, this model would also help advance norms of responsible AI globally and, if pursued by organizations such as the Department of State, could position the United States as the leader of that normative framework. Not only would this advance U.S. interests in shaping global AI governance, but it would also have tangible benefits for global security. Unreliable AI has the potential to cause unwanted escalation or political crises. Benchmarking and model evaluations could serve as an applied domain in which broader normative frameworks, such as “responsible” or “ethical” AI, are put into use. By demonstrating leadership in building a global normative framework, along with concrete associated practices such as benchmarking and evaluating AI models in national security use cases, the United States can advance its goal of proactively shaping global AI governance.
As with all international coordination in the domain of international security, verification will be a key issue. States may not disclose poorly performing systems for reputational reasons or may decline to participate in global CBMs to avoid revealing their domestic capabilities. Thus, crafting a military AI governance regime will require creatively incentivizing states to participate meaningfully in any implemented version of AI-related CBMs. However, as international relations scholars have demonstrated, repeated interactions between states can reveal joint interests in cooperation and make agreements easier to enforce. As such, and in line with the Trump administration’s goal of leading international standard setting as outlined in the AI Action Plan, the United States should pursue international coordination on model benchmarking and evaluation in security-relevant use cases.
By establishing robust, internationally validated benchmarks for AI systems in global security contexts, the United States can not only productively advance its strategic interests in shaping global AI standards but also contribute to international stability and security. The integration of benchmarking into international AI governance represents not just a technical requirement for building reliable systems, but an opportunity for the United States to spearhead rules and standards for AI in a context in which technological competition and international security are tightly entangled.
Ian Reynolds is the postdoctoral fellow for the Futures Lab in the International Security Program at the Center for Strategic and International Studies in Washington, D.C.