Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents
Abstract
Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, which often suffer from substantially lower clean rewards and weak robustness.
In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantees across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of \(\bf{2.16\times}\) under the strongest attack, but also surpass previous robustly-trained agents by an average factor of \(\bf{2.13\times}\). This represents a significant leap forward in the field.
Furthermore, we introduce Smoothed Attack, which is \(\bf{1.89\times}\) more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.
Overview of our method.
Motivation
Recently, there has been growing interest in enabling certifiable robustness in DRL agents using Randomized Smoothing (RS), which transforms agents into their smoothed counterparts during testing. Unfortunately, we found that existing smoothed agents (in brown) exhibit a notable deficiency: they yield substantially lower clean rewards and little improvement in robustness compared to their non-smoothed counterparts (in grey).
Comparison between our agents, existing smoothed agents, and non-smoothed agents.
S-DQN and S-PPO
We propose two training algorithms leveraging RS: S-DQN (Smoothed - Deep Q Network) and S-PPO (Smoothed - Proximal Policy Optimization). The pipelines are as follows:
S-DQN (Smoothed - Deep Q Network)
Training S-DQN
Training pipeline of S-DQN
The training process of S-DQN is shown in the figure above and involves two main steps: collecting transitions and updating the networks. First, we collect transitions with noisy states:
$$a_t=
\begin{cases}
\textrm{argmax}_a Q(D(\tilde{s}_t;\theta),a),\;\textrm{with probability}\;1-\epsilon \\
\textrm{Random Action},\;\textrm{with probability}\;\epsilon
\end{cases}$$
(1)
where \(\tilde{s}_t=s_t+\mathcal{N}(0,\sigma^2I_N)\) is the noisy state, \(D\) is the denoiser, \(Q\) is the pretrained Q-network, and \(\sigma\) is the standard deviation of the Gaussian noise. We introduce the denoiser \(D\) before the Q-network to mitigate the loss in clean reward caused by the noisy states. The collected transitions are stored in the replay buffer. In the second stage, we sample transitions from the replay buffer and update the parameters of the denoiser \(D\). The loss function consists of two parts: a reconstruction loss \(\mathcal{L}_\textrm{R}\) and a temporal difference loss \(\mathcal{L}_{\textrm{TD}}\). Given a sampled transition \(\{s,a,r,s^\prime\}\), the reconstruction loss \(\mathcal{L}_\textrm{R}\) is defined as:
$$\mathcal{L}_\textrm{R} =\frac{1}{N}||D(\tilde{s};\theta)-s||^2_2,$$
(2)
where \(\tilde{s}=s+\mathcal{N}(0,\sigma^2I_N)\), and \(N\) is the dimension of the state. The reconstruction loss is the mean squared error (MSE) between the original state and the output of the denoiser, training \(D\) to effectively reconstruct the original state from its noisy version. The temporal difference loss \(\mathcal{L}_{\textrm{TD}}\) is defined as:
$$\begin{split}
& \mathcal{L}_{\textrm{TD}} =
\begin{cases}
\frac{1}{2\zeta}\eta^2,\;\textrm{if}\;|\eta|<\zeta \\
|\eta|-\frac{\zeta}{2},\;\textrm{otherwise}
\end{cases} \\
& \eta=r+\gamma\max_{a^\prime}Q(s^\prime,a^\prime)-Q(D(\tilde{s};\theta),a),
\end{split}$$
(3)
where \(\zeta\) is set to \(1\). Our \(\mathcal{L}_{\textrm{TD}}\) differs from the common temporal difference loss in DQN learning: the current Q-value is estimated from the denoised state (the output of \(D\)), while the target Q-value is computed from the clean state without noise. Note that the pretrained Q-network \(Q\) can be replaced with a robust agent, and the S-DQN framework can also be combined with adversarial training to further improve robustness.
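Below is a minimal PyTorch-style sketch of the denoiser update described above, combining the reconstruction loss in Eq. (2) with the Huber-style TD loss in Eq. (3). The function and argument names (e.g., `denoiser`, `q_net`, the loss weight `lam`) are our own illustrative choices, not fixed by the method.

```python
import torch
import torch.nn.functional as F

def s_dqn_loss(denoiser, q_net, batch, sigma, gamma=0.99, lam=1.0):
    """Sketch of one S-DQN denoiser update. `q_net` is the pretrained
    Q-network and stays frozen; only the denoiser parameters are trained.
    `lam` (the weight between the two losses) is an assumed hyperparameter."""
    s, a, r, s_next, done = batch                     # s, s_next: [B, N]; a: long [B]; r, done: float [B]
    s_tilde = s + torch.randn_like(s) * sigma         # add Gaussian noise N(0, sigma^2 I_N)

    denoised = denoiser(s_tilde)                      # D(s~; theta)
    recon_loss = F.mse_loss(denoised, s)              # reconstruction loss, Eq. (2)

    q_pred = q_net(denoised).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(D(s~), a)
    with torch.no_grad():                             # target is computed from clean states, no noise
        q_target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    td_loss = F.huber_loss(q_pred, q_target, delta=1.0)  # matches Eq. (3) with zeta = 1

    return recon_loss + lam * td_loss
```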
Testing S-DQN
Testing pipeline of S-DQN
The testing process of S-DQN is shown in the above figure. In the testing stage, we need to obtain the smoothed Q-values of S-DQN. We leverage the hard Randomized Smoothing (hard RS) strategy to enhance robustness. We first define the hard Q-value \(Q_h(s,a)=\bf{1}_{\{a=\textrm{argmax}_{a^\prime}Q(s,a^\prime)\}}\). Note that the hard Q-value \(Q_h\) is always in \([0,1]\). Then, we define the hard RS for S-DQN as follows:
$$\widetilde{Q}(s,a)=\mathbb{E}_{\delta\sim\mathcal{N}(0,\sigma^2I_N)}Q_h(D(s+\delta),a).$$
(4)
In practice, we estimate the expectation \(\widetilde{Q}\) with Monte Carlo sampling. The action is then selected as \(\textrm{argmax}_{a}\widetilde{Q}(s,a)\).
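As a concrete illustration, the following sketch estimates the smoothed hard Q-values of Eq. (4) with Monte Carlo sampling and acts greedily on them; the function name and the noise-sample count `m` are our own choices.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def smoothed_action(denoiser, q_net, s, sigma, m=100):
    """Sketch of hard RS at test time: average the one-hot (hard) Q-values
    over m noisy copies of the state, then pick the argmax action."""
    s_batch = s.unsqueeze(0).repeat(m, *([1] * s.dim()))        # m copies of the state
    noise = torch.randn_like(s_batch) * sigma                   # delta ~ N(0, sigma^2 I_N)
    q_values = q_net(denoiser(s_batch + noise))                 # [m, num_actions]
    greedy = q_values.argmax(dim=1)                             # argmax_a' Q(D(s + delta), a')
    q_smoothed = F.one_hot(greedy, q_values.shape[1]).float().mean(dim=0)  # estimate of Q~(s, .)
    return q_smoothed.argmax().item()
```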
S-PPO (Smoothed - Proximal Policy Optimization)
Training S-PPO
Training pipeline of S-PPO
The training process of S-PPO is shown in the figure above. Initially, we gather trajectories using the smoothed policy and subsequently update both the value network and the policy network. In the trajectory-collection phase, we use the Median Smoothing strategy to smooth our agents. The median has a nice property: it is almost unaffected by outliers. Hence, Median Smoothing gives a better estimate of the expectation than mean smoothing when the number of samples is small. The smoothed policy of S-PPO is defined as follows:
$$\tilde{\pi}_i(a|s)=\mathcal{N}(\widetilde{M}_i,\widetilde{\Sigma}_i^2),\;\forall i\in\{1,...,N_{\textrm{action}}\}$$
(5)
where \(\widetilde{M}_i=\textrm{sup}\{M\in\mathbb{R}|\mathbb{P}_{\delta\sim\mathcal{N}(0,\sigma^2I_N)}[a^{\textrm{mean}}_i\leq M]\leq p\}\), \(\widetilde{\Sigma}_i=\textrm{sup}\{\Sigma\in\mathbb{R}|\mathbb{P}_{\delta\sim\mathcal{N}(0,\sigma^2I_N)}[a^{\textrm{std}}_i\leq \Sigma]\leq p\}\), \((a^{\textrm{mean}}_i,a^{\textrm{std}}_i)\) is the output of the policy network given the noisy state \(s+\delta\) as input, representing the mean and standard deviation of the \(i\)-th coordinate of the action, \(N_{\textrm{action}}\) is the dimension of the action, and \(p\) is the percentile. The loss function for S-PPO is defined as follows:
$$\begin{aligned}
& \mathcal{L}_{\tilde{\pi}}(\theta)=-\mathbb{E}_t[\min(\mathcal{R}_{\tilde{\pi}}\hat{A}_t,\textrm{clip}(\mathcal{R}_{\tilde{\pi}},1-\epsilon_{\textrm{c}},1+\epsilon_{\textrm{c}})\hat{A}_t)], \\
& \mathcal{R}_{\tilde{\pi}}=\frac{\tilde{\pi}(a_t|s_t;\theta)}{\tilde{\pi}(a_t|s_t;\theta_{\textrm{old}})},
\end{aligned}$$
(6)
where \(\hat{A}_t\) is the advantage estimate, and \(\epsilon_{\textrm{c}}\) is the clipping hyperparameter. This is the clipped loss of the classic PPO algorithm combined with RS. Note that S-PPO can also be combined with other robust PPO algorithms and with adversarial training.
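To make this concrete, here is a minimal PyTorch-style sketch of Median Smoothing (Eq. (5)) and the clipped surrogate loss with the smoothed ratio (Eq. (6)). It assumes the policy network returns a (mean, std) pair per action dimension and handles a single timestep for clarity; all names are illustrative.

```python
import torch

def smoothed_policy(policy_net, s, sigma, m=100, p=0.5):
    """Sketch of Median Smoothing: take the p-quantile (median for p = 0.5)
    of the predicted action mean and std over m noisy copies of the state."""
    s_batch = s.unsqueeze(0).repeat(m, *([1] * s.dim()))
    noise = torch.randn_like(s_batch) * sigma
    a_mean, a_std = policy_net(s_batch + noise)        # each [m, N_action]
    M = torch.quantile(a_mean, p, dim=0)               # smoothed mean, tilde{M}
    Sigma = torch.quantile(a_std, p, dim=0)            # smoothed std, tilde{Sigma}
    return torch.distributions.Normal(M, Sigma)        # tilde{pi}(a|s) = N(M, Sigma^2)

def s_ppo_loss(policy_net, old_policy_net, s, a, advantage, sigma, eps_clip=0.2):
    """Sketch of the clipped surrogate loss with the smoothed policy ratio."""
    logp_new = smoothed_policy(policy_net, s, sigma).log_prob(a).sum(-1)
    with torch.no_grad():
        logp_old = smoothed_policy(old_policy_net, s, sigma).log_prob(a).sum(-1)
    ratio = torch.exp(logp_new - logp_old)             # ratio of smoothed policies
    surrogate = torch.min(ratio * advantage,
                          torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantage)
    return -surrogate.mean()
```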
Testing S-PPO
Testing pipeline of S-PPO
We also use Median Smoothing during testing, but with a smoothed deterministic policy:
$$\tilde{\pi}_{i,\textrm{det}}(s)=\widetilde{M}_i,\;\forall i\in\{1,...,N_{\textrm{action}}\},$$
(7)
where \(\widetilde{M}_i=\textrm{sup}\{M\in\mathbb{R}|\mathbb{P}_{\delta\sim\mathcal{N}(0,\sigma^2I_N)}[a^{\textrm{mean}}_i\leq M]\leq p\}\), and \(a^{\textrm{mean}}_i\) is the output of the policy network given the noisy state \(s+\delta\) as input (\(a^{\textrm{mean}}_i=\pi_{i,\textrm{det}}(s+\delta)\)), representing the mean of the \(i\)-th coordinate of the action. Here, only the mean output \(a^{\textrm{mean}}\) of the policy network is used for smoothing.
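At test time, the same quantile estimate is applied to the mean head only. A short sketch, assuming the policy network returns a (mean, std) pair per action dimension; names are illustrative:

```python
import torch

@torch.no_grad()
def smoothed_det_action(policy_net, s, sigma, m=100, p=0.5):
    """Sketch of the deterministic median-smoothed policy at test time."""
    s_batch = s.unsqueeze(0).repeat(m, *([1] * s.dim()))
    noise = torch.randn_like(s_batch) * sigma
    a_mean, _ = policy_net(s_batch + noise)        # only the mean head is used
    return torch.quantile(a_mean, p, dim=0)        # tilde{pi}_det(s) = tilde{M}
```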
Smoothed Attack
We found that the classic PGD attack is ineffective at decreasing the reward of smoothed DQN agents. Hence, we propose a new attack framework named Smoothed Attack, which is specifically designed for smoothed agents.
Pipeline of Smoothed Attack
The pipeline of Smoothed Attack is shown in the above figure. The objective of Smoothed Attack is as follows:
$$\min_{\Delta s}\log\dfrac{\exp\left(Q(D(\tilde{s}+\Delta s),a^*)\right)}{\sum_a\exp\left(Q(D(\tilde{s}+\Delta s),a)\right)},\;\textrm{s.t.}\;||\Delta s||_{p}\leq\epsilon,$$
(8)
where \(a^*=\textrm{argmax}_{a}\widetilde{Q}(s,a)\), \(\widetilde{Q}(s,a)\) is defined in Eq. (4), \(\tilde{s}=s+\mathcal{N}(0,\sigma^2I_N)\), \(\epsilon\) is the attack budget, and \(p=2\;\textrm{or}\;\infty\) in our setting. In Smoothed Attack, the perturbed state is additionally injected with noise sampled from a Gaussian distribution with the same standard deviation \(\sigma\) used for smoothing. This setting can be integrated with various existing attacks, such as the PGD attack and PA-AD, by replacing their objective with the Smoothed Attack objective in Eq. (8). The comparison of our Smoothed Attack (S-PGD and S-PA-AD) against the PGD and PA-AD attacks is shown in the table below:
Performance of Smoothed Attack: Smoothed Attack (S-PGD and S-PA-AD) is much stronger at decreasing the reward of smoothed agents than the corresponding non-smoothed attacks.
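For illustration, here is a minimal sketch of S-PGD under an \(\ell_\infty\) budget: it performs gradient descent on the objective in Eq. (8), injecting fresh smoothing noise at every step. The target action `a_star` is the greedy action of the smoothed Q-values (Eq. (4)); the step-size heuristic and all names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def s_pgd_attack(denoiser, q_net, s, a_star, sigma, eps, steps=10, alpha=None):
    """Sketch of S-PGD (l_inf): minimize the log-softmax score of the
    smoothed greedy action a*, evaluated on a noisy denoised state."""
    if alpha is None:
        alpha = 2.5 * eps / steps                     # assumed step-size heuristic
    delta_s = torch.zeros_like(s, requires_grad=True)
    for _ in range(steps):
        s_tilde = s + torch.randn_like(s) * sigma     # fresh smoothing noise each step
        logits = q_net(denoiser((s_tilde + delta_s).unsqueeze(0)))
        obj = F.log_softmax(logits, dim=1)[0, a_star] # objective of Eq. (8)
        grad = torch.autograd.grad(obj, delta_s)[0]
        with torch.no_grad():
            delta_s -= alpha * grad.sign()            # descend on the objective
            delta_s.clamp_(-eps, eps)                 # project onto the l_inf ball
    return (s + delta_s).detach()
```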
Robustness certification
The strength of the smoothed agents lies in their certifiable robustness. We formally formulate the certified radius, action bound, and reward lower bound of our S-DQN and S-PPO agents.
Certified Radius for S-DQN
The certified radius for our S-DQN is defined as follows:
$$R_t=\dfrac{\sigma}{2}(\Phi^{-1}(\widetilde{Q}(s_t,a_1))-\Phi^{-1}(\widetilde{Q}(s_t,a_2))),$$
(9)
where \(a_1\) is the action with the largest smoothed Q-value, \(a_2\) is the "runner-up" action with the second-largest smoothed Q-value, \(R_t\) is the certified radius at time \(t\), \(\Phi\) is the CDF of the standard normal distribution, \(\sigma\) is the smoothing standard deviation, and \(\widetilde{Q}(s,a)\) is the smoothed Q-value. As long as the \(\ell_2\) perturbation is bounded by \(R_t\), the selected action is guaranteed to remain unchanged.
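A small sketch of how this radius can be computed from Monte Carlo estimates of the smoothed hard Q-values; note that a rigorous certificate would use confidence bounds on the estimates rather than the point estimates used here.

```python
import numpy as np
from scipy.stats import norm

def certified_radius(q_smoothed, sigma):
    """Sketch: certified l_2 radius from estimated smoothed hard Q-values
    (each in [0, 1]) of the top-1 and runner-up actions."""
    q = np.clip(np.asarray(q_smoothed, dtype=float), 1e-6, 1 - 1e-6)  # keep Phi^{-1} finite
    top1, top2 = np.sort(q)[-1], np.sort(q)[-2]
    return 0.5 * sigma * (norm.ppf(top1) - norm.ppf(top2))
```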
Action Bound for S-PPO
Unlike the discrete-action setting, there is no guarantee in the continuous-action setting that the action will not change under a certain radius. Hence, we derive the Action Bound, which confines the policy of S-PPO agents to a closed region:
$$\tilde{\pi}_{\textrm{det},{\underline{p}}}(s_t) \preceq \tilde{\pi}_{\textrm{det},p}(s_t+\Delta s)\preceq\tilde{\pi}_{\textrm{det},{\overline{p}}}(s_t),\;\textrm{s.t.}\;||\Delta s||_2\leq\epsilon,$$
(10)
where \(\tilde{\pi}_{i,\textrm{det},p}(s)=\textrm{sup}\{a_i\in\mathbb{R}|\mathbb{P}_{\delta\sim\mathcal{N}(0,\sigma^2I_N)}[\pi_{i,\textrm{det}}(s+\delta)\leq a_i]\leq p\},\forall i\in\{1,...,N_{\textrm{action}}\}\), \(\underline{p}=\Phi(\Phi^{-1}(p)-\frac{\epsilon}{\sigma})\), \(\overline{p}=\Phi(\Phi^{-1}(p)+\frac{\epsilon}{\sigma})\), and \(p\) is the percentile.
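As a quick illustration of the percentile shift, the following sketch computes \(\underline{p}\) and \(\overline{p}\) for a given budget; for example, with \(p=0.5\), \(\epsilon=0.1\), and \(\sigma=0.2\) it returns approximately \((0.31, 0.69)\).

```python
from scipy.stats import norm

def shifted_percentiles(p, eps, sigma):
    """Sketch: percentile shift used in the Action Bound. Under an l_2
    perturbation of size eps, the p-th percentile of the smoothed policy
    output is bracketed by the p_lower- and p_upper-th percentiles of the
    unperturbed smoothed policy."""
    p_lower = norm.cdf(norm.ppf(p) - eps / sigma)   # underline{p}
    p_upper = norm.cdf(norm.ppf(p) + eps / sigma)   # overline{p}
    return p_lower, p_upper
```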
Reward lower bound for smoothed agents
By viewing the whole trajectory as a function \(F_\pi\), we define \(F_\pi :\mathbb{R}^{H\times N}\rightarrow\mathbb{R}\) that maps the vector of perturbations for the whole trajectory \(\boldsymbol{\Delta s}=[\Delta s_0,...,\Delta s_{H-1}]^\top\) to the cumulative reward. Then, the reward lower bound is defined as follows:
$$\widetilde{F}_{\pi,p}(\boldsymbol{\Delta s})\geq\widetilde{F}_{\pi,\underline{p}}(\boldsymbol{0}),\;\textrm{s.t.}\;||\boldsymbol{\Delta s}||_2\leq B,$$
(11)
where \(\widetilde{F}_{\pi,p}(\boldsymbol{\Delta s})=\textrm{sup}\{r\in\mathbb{R}|\mathbb{P}_{\boldsymbol{\delta}\sim\mathcal{N}(0,\sigma^2I_{H\times N})}[F_\pi(\boldsymbol{\delta}+\boldsymbol{\Delta s})\leq r]\leq p\}\), \(\boldsymbol{\delta}=[\delta_0,...,\delta_{H-1}]^\top\), \(\underline{p}=\Phi(\Phi^{-1}(p)-\frac{B}{\sigma})\), \(H\) is the length of the trajectory, and \(B\) is the \(\ell_2\) attack budget for the entire trajectory. If the attack budget of each state is \(\epsilon\), then \(B=\epsilon\sqrt{H}\). This bound ensures that the reward will not fall below a certain value under any \(\ell_2\) perturbation with budget \(B\).
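A minimal sketch of how this lower bound can be estimated in practice, assuming we have collected cumulative rewards from rollouts with Gaussian noise added to every state; a rigorous certificate would replace the empirical quantile with an order-statistic confidence bound.

```python
import numpy as np
from scipy.stats import norm

def estimated_reward_lower_bound(noisy_returns, p, B, sigma):
    """Sketch: estimate the reward lower bound from noisy rollouts.
    `noisy_returns` holds the cumulative rewards of rollouts where
    delta_t ~ N(0, sigma^2 I_N) is added to every state."""
    p_lower = norm.cdf(norm.ppf(p) - B / sigma)          # shifted percentile underline{p}
    return np.quantile(np.asarray(noisy_returns, dtype=float), p_lower)
```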
Experiment
Setup
In our DQN settings, the evaluations are done in three Atari environments — Pong, Freeway, and RoadRunner. We train the denoiser \(D\) with different base agents and with adversarial training. Our methods are listed as follows:
- S-DQN ({Base agent}): S-DQN combined with a certain base agent. {Base agent} can be Radial [1] or Vanilla (simple DQN).
- S-DQN (S-PGD): S-DQN (Vanilla) adversarially trained with our proposed S-PGD.
We compare our S-DQN with the following baselines:
- Non-smoothed robust agents: RadialDQN [1], SADQN [2], WocaRDQN [3].
- Previous smoothed agents [4][5]: RadialDQN+RS, SADQN+RS, WocaRDQN+RS. We use {Base agent}+RS to denote them.
In our PPO settings, the evaluations are done on two continuous control tasks in the Mujoco environments — Walker and Hopper. We train each agent \(15\) times and report the median performance as suggested in [2], since the training variance of PPO algorithms is high. Our methods are listed as follows:
- S-PPO ({base algorithm}): S-PPO combined with a certain base algorithm. {base algorithm} can be Radial [1], SGLD [2], WocaR [3], or Vanilla (simple PPO).
- S-PPO (S-ATLA), S-PPO (S-PA-ATLA): S-PPO with smoothed adversarial training.
We compare our S-PPO with the following baselines:
- Non-smoothed robust agents: RadialPPO [1], SGLDPPO [2], WocaRPPO [3], ATLAPPO [6], PA-ATLAPPO [7].
- Previous smoothed agents [4][5]: RadialPPO+RS, SGLDPPO+RS, WocaRPPO+RS, ATLAPPO+RS, PA-ATLAPPO+RS.
Performance
1. Robust reward
Our S-DQNs and S-PPOs outperform the previous state-of-the-art non-smoothed robust agents and smoothed agents.
The average normalized reward of DQN and PPO agents under attack
2. Reward lower bound
Our S-DQNs and S-PPOs achieve a much higher lower bound than all the previous smoothed agents, indicating that our method can enhance not only the empirical robustness but also the robustness guarantee.
The reward lower bound of smoothed DQN and PPO agents
Related Works
[1] (Radial) Oikarinen, et al. "Robust deep reinforcement learning through adversarial loss", NeurIPS 2021
[2] (SADQN, SGLD) Zhang, et al. "Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations", NeurIPS 2020
[3] (WocaR) Liang, et al. "Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning", NeurIPS 2022
[4] Wu, et al. "CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing", ICLR 2022
[5] Kumar, et al. "Policy Smoothing for Provably Robust Reinforcement Learning", ICLR 2022
[6] (ATLA) Zhang, et al. "Robust Reinforcement Learning on State Observations with Learned Optimal Adversary", ICLR 2021
[7] (PA-ATLA) Sun, et al. "Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL", ICLR 2022
Cite this work
Chung-En Sun, Sicun Gao, Tsui-Wei Weng. "Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents", ICML 2024
@inproceedings{robustRSRL,
  title={Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents},
  author={Sun, Chung-En and Gao, Sicun and Weng, Tsui-Wei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2024}
}