The Cold-Start Safety Gap in LLM Agents
Chung-En Sun, Linbo Liu, Tsui-Wei Weng
University of California, San Diego

Abstract

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks — a phenomenon we term the cold-start safety gap. To study this systematically, we introduce SODA (Safety Over Depth for Agents), a benchmark with 16 tool-use environments and 80 scenarios that controls how many regular tasks the agent completes before encountering a safety threat. Evaluating 7 models from 4 families, we find safety improves by 9–52% as depth increases from zero to twenty. By ablating which part of the conversation matters, we identify the key driver and propose a simple zero-cost deployment strategy that closes the gap while preserving utility.

Our contributions:

  1. Benchmark & Discovery: We design SODA, a benchmark with 16 tool-use environments that controls conversation depth. Evaluating 7 models from 4 families, we find safety improves by 9–52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms hidden states gradually shift toward a safety-aligned region.
  2. What drives the safety improvement: We systematically ablate warm-up conversations and find that regular agentic task requests are the primary driver of safety — the agent's own prior responses have less effect on safety but are essential for preserving utility.
  3. Safety generalizes without hurting utility: The warm-up effect generalizes to external safety benchmarks (AgentHarm, Agent Safety Bench) while preserving full tool-calling utility on BFCL Multi-Turn and API-Bank.
  4. Deployment recommendation: We recommend a simple zero-cost strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap.
Overview
Figure 1. Overview of the cold-start safety gap and the SODA benchmark.

SODA: Safety Over Depth for Agents

SODA controls how many regular agentic tasks an agent completes before encountering a harmful request. Each task involves real multi-turn tool interaction across 16 environments spanning banking, healthcare, social media, cloud infrastructure, and more.

SODA Benchmark
Figure 2. SODA benchmark design: 16 environments with controlled depth.

Agents Are Most Vulnerable at Session Start

All 7 models are significantly more likely to comply with harmful requests at the very beginning of a conversation (D=0) than after completing regular tasks. This vulnerability is universal across model families and scales.

Model D=0 D=5 D=10 D=20 Δ
Llama-3.1-8B-Instruct5.755.156.357.8+52.1
Llama-3.3-70B-Instruct23.653.259.761.9+38.3
Qwen3-4B-Instruct-250744.157.667.572.5+28.4
Qwen3-30B-A3B-Instruct-250759.174.878.879.1+20.0
Qwen3.5-9B53.162.865.767.4+14.3
Gemma-4-E4B-it61.872.074.875.0+13.2
Gemma-4-26B-A4B-it82.990.089.591.8+8.9
Table 1. Safety rate (%) at each depth D. The agent completes D regular agentic tasks before encountering the threat.

Representation Analysis

Hidden states at the moment a harmful request is presented migrate from the unsafe region to the safe region as depth increases.

PCA
Figure 3. PCA projections colored by safety verdict (blue=safe, red=unsafe). Dashed line: estimated boundary. At D=0, most points fall in the unsafe region; with more preceding tasks, they migrate into the safe region.

What Drives the Safety Improvement?

We isolate which part of the warm-up conversation drives safety by modifying the task requests (user side), the agent's responses (assistant side), or both:

Category Variant Llama3-8B Llama3-70B Qwen3-4B Qwen3-30B Qwen3.5-9B Gemma4-4B Gemma4-26B
BaselineFull Interaction6→58 +5224→62 +3844→72 +2859→79 +2053→67 +1462→75 +1383→92 +9
Fix
Requests
Compliant Response5→90 +8525→86 +6143→71 +2860→84 +2455→72 +1762→76 +1483→90 +7
Random Response5→92 +8724→80 +5644→85 +4159→84 +2553→76 +2363→82 +1983→94 +11
Empty Response6→50 +4424→75 +5144→72 +2859→78 +1954→60 +662→75 +1383→89 +6
Fix
Responses
Random Request6→60 +5424→26 +243→60 +1759→72 +1353→73 +2061→69 +883→85 +2
Empty Request6→42 +3625→49 +2444→57 +1360→70 +1055→68 +1359→65 +683→83 +0
Vary
Both
All Random6→70 +6424→36 +1244→38 -659→57 -253→77 +2460→72 +1283→85 +2
All Empty6→22 +1624→57 +3344→36 -860→50 -1053→65 +1262→64 +283→83 +0
Table 2. Safety rate (%) at D=0 with change (Δ) to D=20. Green: safety increases. Red: safety decreases. All variants preserving real task requests (Fix Requests) show substantial safety gains regardless of what the agent responds.

Finding: Task requests are the primary driver of safety — replacing the agent's responses with random text or leaving them empty still produces safety gains (Fix Requests group). However, as shown in Table 4, real agent responses are needed to preserve tool-calling utility: variants with fake responses degrade utility on BFCL and API-Bank.

Does the Warm-Up Generalize and Preserve Utility?

We test whether the warm-up effect holds on external safety benchmarks and whether it preserves tool-calling utility.

Benchmark Variant Llama3-8B Llama3-70B Qwen3-4B Qwen3-30B Qwen3.5-9B Gemma4-4B Gemma4-26B
AgentHarmFull Interaction35→78 +4327→74 +4761→81 +2063→85 +2265→74 +973→81 +876→88 +12
Compliant Resp.35→91 +5627→76 +4960→76 +1663→82 +1965→73 +872→79 +775→84 +9
Random Resp.35→89 +5427→73 +4661→81 +2063→78 +1565→78 +1373→74 +176→86 +10
Empty Resp.35→69 +3427→78 +5160→69 +963→80 +1765→49 -1672→70 -276→84 +8
ASBFull Interaction27→43 +1628→39 +1149→57 +849→54 +545→50 +551→59 +854→57 +3
Compliant Resp.28→40 +1228→31 +349→51 +248→50 +245→46 +151→56 +554→55 +1
Random Resp.28→40 +1228→32 +449→50 +149→49 046→44 -251→54 +353→54 +1
Empty Resp.27→34 +728→32 +449→52 +348→48 046→40 -652→53 +153→56 +3
Table 3. Safety rate (%) on external benchmarks at D=0 → D=20. Green: safety increases. Red: safety decreases. The warm-up effect generalizes across all variants and benchmarks.
Benchmark Variant Llama3-8B Llama3-70B Qwen3-4B Qwen3-30B Qwen3.5-9B Gemma4-4B Gemma4-26B
BFCL
Multi
Full Interaction33→38 +537→38 +164→66 +272→68 -465→65 036→34 -252→51 -1
Compliant Resp.32→29 -340→37 -366→53 -1372→60 -1265→61 -437→32 -551→50 -1
Random Resp.34→24 -1040→37 -366→54 -1270→69 -165→61 -438→34 -450→52 +2
Empty Resp.32→38 +639→38 -167→62 -574→67 -765→59 -636→38 +251→52 +1
API-
Bank
Full Interaction79→87 +886→89 +385→82 -387→85 -279→79 073→77 +479→77 -2
Compliant Resp.78→50 -2886→84 -284→66 -1888→65 -2380→75 -571→59 -1279→74 -5
Random Resp.82→53 -2983→80 -385→62 -2387→83 -479→80 +173→61 -1280→73 -7
Empty Resp.84→83 -185→82 -386→74 -1288→83 -582→77 -572→71 -179→75 -4
Table 4. Tool-calling utility (%) at D=0 → D=20. Green: utility preserved/improved. Red: utility degraded. Full Interaction preserves utility; Compliant/Random Response variants degrade it by teaching a non-tool-calling pattern.

Conclusion & Recommendation

We discover the cold-start safety gap: tool-calling LLM agents are most vulnerable at the very start of a session. A brief warm-up of regular agentic tasks (full interaction) substantially closes this gap across all models tested.

We additionally tested in-context refusal demonstrations and safety fine-tuning as alternative mitigations. Both improve safety but at significant cost: ICL refusal is unstable and causes over-refusal, while safety SFT collapses tool-calling utility (e.g., BFCL drops from 64% to 17%). This reveals a fundamental helpfulness–safety tradeoff — it is not easy to close the gap without losing agent capability.

The simplest and most practical solution today is full interaction warm-up: having the agent complete a few regular tasks before possible exposure to safety-critical requests. This requires no fine-tuning, no data collection, and nearly zero computational overhead. We hope future work explores strategies that achieve high safety without this tradeoff.

BibTeX

@article{sun2026coldstart,
  title={The Cold-Start Safety Gap in LLM Agents},
  author={Sun, Chung-En and Liu, Linbo and Weng, Tsui-Wei},
  journal={arXiv preprint arXiv:2606.07867},
  year={2026},
  url={https://arxiv.org/abs/2606.07867}
}