Hawkish AI? Uncovering DeepSeek’s Foreign Policy Biases


In early 2025, the Chinese AI company DeepSeek made international news upon releasing a large language model (LLM) that appeared to outperform the traditional AI powerhouse companies, largely headquartered in the United States. DeepSeek's success has led to worries among U.S. policymakers that, despite U.S. policy efforts to undercut Beijing's AI industry, China may be overtaking the United States in AI development, given the reportedly far lower training costs of DeepSeek's model compared with those of its U.S. competitors.

While DeepSeek has demonstrated impressive performance on a range of tasks, such as coding and quantitative reasoning, it has yet to be evaluated for its preferences in foreign policy-related scenarios. To address this gap, we present results on DeepSeek's tendencies based on the CSIS Futures Lab Critical Foreign Policy Decision (CFPD) Benchmark. The CFPD Benchmark evaluates foundational LLMs in key foreign policy decisionmaking domains, ranging from questions about deterrence and crisis escalation to a wide range of diplomatic preferences about alliance formation and intervention. Our initial study investigated seven major LLMs across these four domains: Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, GPT-4o, Gemini 1.5 Pro-002, Mixtral 8x22B, Claude 3.5, and Qwen2 72B. That study found notable variation in model preferences. As part of this research initiative, we evaluate recently released large language models to continuously update our dashboard. In line with this effort, we have now released our findings on the DeepSeek-V3 model.

Overall, our evaluation reveals that DeepSeek shares the troubling tendency toward more hawkish, escalatory recommendations seen in other Chinese LLMs like Qwen2. This tendency is particularly acute in scenarios involving free, Western countries like the United States, the United Kingdom, and France. The finding raises concerns about model-induced bias in decision support tools, where AI preferences could lead to algorithmic drift and subtly steer analysts toward aggressive courses of action misaligned with strategic objectives.

Policymakers should recognize that off-the-shelf LLMs exhibit inconsistent and often escalatory decision preferences in crisis scenarios, making their uncritical integration into foreign policy workflows a high-risk proposition. To mitigate these risks, national security organizations should invest in continuous model evaluation, expert-curated fine-tuning, and scenario-based benchmarking to ensure LLM outputs align with strategic objectives and political intent. This process requires sustained, independent benchmarking efforts like the CFPD Benchmark project.

Adding DeepSeek to the Mix

DeepSeek is an open-source AI model developed by Chinese researchers. Open-source in this context means that the model's weights, parameters, and at least some of its code are available to the public. Closed models, including most of OpenAI's products, often restrict user access to many of these features, thus protecting the company's intellectual property and profits while also, in theory, increasing security and limiting misuse by the general public. Open-source models are designed to enable user-level customization and collaboration in a cost-effective manner. This openness and transparency increase the range of downstream use cases and adaptations, but at the cost of security.

While the difference between open and closed generative AI models is more of a gradient than a binary, these differences do have important implications for AI development. Proponents of closed models argue that keeping models closed and centralized limits the capacity for misuse, such as leveraging AI models to create bioweapons or spread harmful content. Advocates of an open-source approach, however, argue that democratized access to the technology and the benefits of more open science, which allows for greater collaboration and innovation, outweigh the possible risks. This debate will have important implications for progress in the field, as major AI companies, like Meta and Anthropic, forge different paths to model development and user access. Moreover, such debates will have political and governance ramifications, particularly as governments such as China appear to be throwing their weight behind open-source models created by Chinese companies, like DeepSeek, while some politicians in the United States advocate for a more centralized, closed-model path for AI development, at least in the short term. In other words, the global technology competition between the United States and China is creating an upside-down world in which a closed, authoritarian society (i.e., China) favors open-source technology.

DeepSeek caused a major stir in the technology industry by releasing a model that appeared to outperform major U.S. competitors, such as OpenAI, on a range of benchmarking tasks while achieving far greater efficiency in model training. The release caused a significant market shock during its launch, as the company suggested that it had trained its model with a small fraction of the computational chips used by OpenAI. This apparent success challenged what had been the conventional path to improved model performance—increasing computational resources and model parameter size. DeepSeek's purported ability to efficiently train its model and achieve high-level performance appeared to challenge the feasibility of U.S. export controls seeking to squeeze China's AI progress. However, subsequent claims suggested that DeepSeek simply “distilled” OpenAI’s ChatGPT to achieve its success. In any event, the public reaction was substantial, with some observers arguing that DeepSeek may be a modern “Sputnik moment,” comparing it to the 1957 Soviet satellite launch, which created the perception within the United States that it was falling behind in the space race. However, whether DeepSeek’s trajectory represents a real “Sputnik moment” or reflects a broader trend within China’s technology industry—leveraging existing breakthroughs to rapidly close gaps with brute force—is still debated.

The release of DeepSeek-V3 also sparked a debate about large language models’ biases on sensitive topics, such as politics. For example, some argued that DeepSeek is unresponsive to politically sensitive topics in Chinese history, such as the events in Tiananmen Square and China-Taiwan relations. Moreover, experts have warned against the use of DeepSeek due to concerns over misinformation. These biases will likely matter more in the future as open-source models are leveraged by institutions and private companies, amplifying the biases inherent in these models’ training data and training processes across downstream tasks.


Our analysis of the evaluation results shows that DeepSeek’s model preferences are in line with models that previous benchmarking studies have identified as prone to escalation. As seen in Figure 1, when presented with over 400 different crisis scenarios built to match the Militarized Interstate Dispute Dataset—an authoritative academic project for studying crisis politics and deterrence—DeepSeek’s preferences are similar to those of the Gemini, Qwen2 (also a Chinese model), and Llama 3.1 8B models. Its preference for more hawkish responses to the scenarios is statistically significant compared to other models such as GPT-4o and Claude 3.5 Sonnet. In the other domains, DeepSeek does not stand out with a statistically distinct pattern.
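To make the logic of this kind of comparison concrete, the sketch below is a minimal, hypothetical illustration rather than the CFPD Benchmark’s actual implementation: each model’s recommendation for a scenario is mapped to an ordinal escalation score, and a two-sample t-test checks whether the difference in average scores between two models is statistically significant. The model names, scores, and scale are illustrative assumptions.

```python
# Minimal sketch, not the CFPD Benchmark's actual code: compare average
# escalation scores between two models. Scores are hypothetical and assume
# each recommendation has already been mapped to an ordinal scale
# (0 = de-escalate, 1 = hold, 2 = threaten, 3 = use force).
from statistics import mean
from scipy.stats import ttest_ind

scores = {
    "deepseek-v3": [3, 2, 3, 1, 2, 3, 2, 3],  # illustrative per-scenario scores
    "gpt-4o":      [1, 1, 2, 0, 1, 2, 1, 1],
}

# Two-sample t-test on the per-scenario scores of the two models.
t_stat, p_value = ttest_ind(scores["deepseek-v3"], scores["gpt-4o"])

print(f"DeepSeek-V3 mean escalation: {mean(scores['deepseek-v3']):.2f}")
print(f"GPT-4o mean escalation:      {mean(scores['gpt-4o']):.2f}")
print(f"p-value for the difference:  {p_value:.3f}")
```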


Furthermore, our analysis captures country-level differences in model recommendations for each scenario. As seen in Figure 2, the results show that the DeepSeek-V3 model cannot be differentiated from the Gemini 1.5 Pro-002 or Llama 3.1 70B Instruct models. Of note, all models, including DeepSeek-V3, recommend more hawkish policies for the United States, the United Kingdom, and France than for Russia and China. This is a troubling pattern with significant implications as these models are integrated into foreign policy decisionmaking.

In sum, these results suggest possible risks of using models with escalatory preferences in decisionmaking contexts, as such preferences could nudge analysts and policymakers toward certain perceptions of world politics. The results demonstrate concerning patterns in model preferences, particularly hawkish behavior and a tendency to recommend more escalatory decisions for certain countries than for others. For example, if analysts were leveraging LLMs to assist in course of action generation during a border crisis, some models, such as DeepSeek, may be more prone to recommend an escalatory approach that is not aligned with the political and strategic goals of policymakers. Moreover, model preferences could shift depending on the countries involved in the crisis, leading to unstable patterns of recommendations across contexts.


Policy Implications

This research initiative has important policy implications. As LLMs are integrated into national security workflows and may feature as components of decisionmaking processes, uncovering their tendencies in relevant scenarios will be critical for assessing the technology’s risk profile. Our results demonstrate clear inconsistencies in model recommendation preferences, particularly in the domain of escalation. Policymakers should be aware that different models sustain different decision preferences in crisis scenarios and that, absent robust fine-tuning efforts, integrating off-the-shelf LLMs into foreign policy decision environments is a high-risk proposition.

To begin to address these issues, model benchmarking and evaluation are critical practices for better understanding how LLMs perform on particular use cases. Foreign policy and national security organizations interested in integrating LLMs into their workflows should therefore be prepared to use experts to curate domain-specific data on an ongoing basis, fine-tune models to align with specific (and possibly fluctuating) political goals, and continuously evaluate model performance to ensure the technology’s risk profile is adequately mitigated within the confines of the desired use case.
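As one way to picture what continuous evaluation might look like in practice, the sketch below is a hypothetical illustration, not an existing CSIS or government tool: after each model update, a curated scenario set is re-scored and the model is flagged for expert review if its average escalation score drifts beyond an organization-defined tolerance. All names, scores, and thresholds are assumptions.

```python
# Hypothetical sketch of a recurring evaluation check: re-score a curated
# scenario set after each model update and flag the model if its average
# escalation score drifts from the previously approved baseline.
from statistics import mean

def escalation_drift(new_scores: list[int], baseline_mean: float) -> float:
    """Difference between the current average escalation score and the
    baseline recorded when the model was last approved."""
    return mean(new_scores) - baseline_mean

# Illustrative values only.
latest_scores = [2, 3, 2, 2, 3, 1, 2]  # scores from the latest re-run
baseline = 1.6                         # mean score at last approval
tolerance = 0.5                        # drift beyond this triggers review

drift = escalation_drift(latest_scores, baseline)
if drift > tolerance:
    print(f"Escalation drift of {drift:.2f} exceeds tolerance; flag for expert review.")
else:
    print(f"Escalation drift of {drift:.2f} is within tolerance.")
```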

This process requires independent benchmarking efforts like the CSIS Futures Lab CFPD Benchmark. Baselining new models and analyzing their performance in key domains provides a new form of checks and balances in democracies as they embrace AI. Benchmarking LLMs in national security mirrors Tocqueville’s view of civil society as an essential check on centralized power and as a set of institutions that mediate between state authority and public discourse. Just as Tocqueville saw a vibrant press and civic associations as buffers that cultivate accountability and prevent despotism, systematic model evaluation acts as a form of algorithmic accountability, ensuring that AI systems do not silently impose biases or escalate crises unchecked. In both cases, transparency and pluralism—whether through open civic debate or rigorous technical scrutiny—are critical to safeguarding democratic decisionmaking.

Ian Reynolds is a fellow (non-resident) in the Futures Lab at the Center for Strategic and International Studies (CSIS) in Washington, D.C. Benjamin Jensen is director of the Futures Lab and a senior fellow for the Defense and Security Department at CSIS. Yasir Atalan is a data fellow in the Futures Lab at CSIS.