<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="style/detail_T.xsl"?>
<bibitem type="J"> <ARLID>0648523</ARLID> <utime>20260417130701.6</utime><mtime>20260409235959.9</mtime> <SCOPUS>105033597476</SCOPUS> <WOS>001722680500001</WOS> <DOI>10.1080/21642583.2026.2646376</DOI> <title language="eng" primary="1">Robust sequential decision-making in adversarial environments</title> <specification> <page_count>21 s.</page_count> <media_type>E</media_type> </specification> <serial><ARLID>cav_un_epca*0630899</ARLID><ISSN>2164-2583</ISSN><title>Systems Science &amp; Control Engineering</title><part_num/><part_title/><volume_id>14</volume_id><volume/></serial> <keyword>dynamic programming</keyword> <keyword>adversarial machine learning</keyword> <keyword>multi-agent reinforcement learning</keyword> <keyword>robust reinforcement learning</keyword> <keyword>Bayesian reinforcement learning</keyword> <author primary="1"> <ARLID>cav_un_auth*0491463</ARLID> <name1>Ružejnikov</name1> <name2>Jurij</name2> <institution>UTIA-B</institution> <full_dept language="cz">Adaptivní systémy</full_dept> <full_dept language="eng">Department of Adaptive Systems</full_dept> <department language="cz">AS</department> <department language="eng">AS</department> <country>CZ</country> <share>80</share> <garant>K</garant> <fullinstit>Ústav teorie informace a automatizace AV ČR, v. v. i.</fullinstit> </author> <author primary="0"> <ARLID>cav_un_auth*0101092</ARLID> <name1>Guy</name1> <name2>Tatiana Valentine</name2> <institution>UTIA-B</institution> <full_dept language="cz">Adaptivní systémy</full_dept> <full_dept language="eng">Department of Adaptive Systems</full_dept> <department language="cz">AS</department> <department language="eng">AS</department> <share>20</share> <garant>S</garant> <fullinstit>Ústav teorie informace a automatizace AV ČR, v. v. i.</fullinstit> </author> <source> <url>https://library.utia.cas.cz/separaty/2026/AS/ruzejnikov-0648523.pdf</url> </source> <source> <url>https://www.tandfonline.com/doi/full/10.1080/21642583.2026.2646376</url> </source> <cas_special> <project> <project_id>101168272</project_id> <agency>EC</agency> <country>XE</country> <ARLID>cav_un_auth*0492513</ARLID> </project> <project> <project_id>CA24136</project_id> <agency>EC</agency> <country>XE</country> <ARLID>cav_un_auth*0504278</ARLID> </project> <project> <project_id>2025A1013</project_id> <agency>ČZU</agency> <country>CZ</country> <ARLID>cav_un_auth*0504279</ARLID> </project> <abstract language="eng" primary="1">Reinforcement learning (RL) agents often fail in adversarial environments, where the Markov Decision Process (MDP) assumption of a stationary environment is violated. While model-free solutions for this setting exist, planning-based counterparts remain less explored. This paper introduces offline and online value iteration algorithms within the Threatened Markov Decision Process (TMDP) framework, in which the RL agent maintains and updates a Bayesian belief over the adversary’s policy. The belief is integrated into a modified Bellman optimality equation to compute robust policies. We evaluate our framework on the stochastic adversarial multi-agent Coin Game. Our primary finding is that the model-based agent outperforms the TMDP version of model-free Q-learning by a significant margin, confirming that the benefits of model-based planning extend from MDPs to TMDPs. Furthermore, the proposed framework maintains a performance advantage over Q-learning baselines even when the system’s transition function is unknown. The RL agent also demonstrates robustness to direct adversarial interactions.
This work validates TMDP value iteration as an effective, planning-based approach for decision-making against adaptive adversaries.</abstract>     <result_subspec>WOS</result_subspec> <RIV>BC</RIV> <FORD0>10000</FORD0> <FORD1>10100</FORD1> <FORD2>10103</FORD2>    <reportyear>2027</reportyear>      <num_of_auth>2</num_of_auth>  <inst_support> RVO:67985556 </inst_support>  <permalink>https://hdl.handle.net/11104/0378170</permalink>  <cooperation> <ARLID>cav_un_auth*0478849</ARLID> <name>Provozně ekonomická fakulta, Česká zemědělská univerzita v Praze</name> <institution>PEF CZU</institution> <country>CZ</country> </cooperation>  <confidential>S</confidential>    <article_num> 2646376 </article_num> <unknown tag="mrcbC91"> A </unknown>         <unknown tag="mrcbT16-e">AUTOMATION&amp;CONTROLSYSTEMS</unknown> <unknown tag="mrcbT16-f">4.5</unknown> <unknown tag="mrcbT16-g">0.9</unknown> <unknown tag="mrcbT16-h">4.3</unknown> <unknown tag="mrcbT16-i">0.00285</unknown> <unknown tag="mrcbT16-j">1.004</unknown> <unknown tag="mrcbT16-k">2102</unknown> <unknown tag="mrcbT16-q">38</unknown> <unknown tag="mrcbT16-s">0.783</unknown> <unknown tag="mrcbT16-y">39.98</unknown> <unknown tag="mrcbT16-x">5.14</unknown> <unknown tag="mrcbT16-3">938</unknown> <unknown tag="mrcbT16-4">Q1</unknown> <unknown tag="mrcbT16-5">4.200</unknown> <unknown tag="mrcbT16-6">116</unknown> <unknown tag="mrcbT16-7">Q2</unknown> <unknown tag="mrcbT16-C">74.7</unknown> <unknown tag="mrcbT16-M">0.65</unknown> <unknown tag="mrcbT16-N">Q2</unknown> <unknown tag="mrcbT16-P">74.7</unknown> <arlyear>2026</arlyear>       <unknown tag="mrcbU14"> 105033597476 SCOPUS </unknown> <unknown tag="mrcbU24"> PUBMED </unknown> <unknown tag="mrcbU34"> 001722680500001 WOS </unknown> <unknown tag="mrcbU63"> cav_un_epca*0630899 Systems Science &amp; Control Engineering 14 1 2026 2164-2583 2164-2583 </unknown> <unknown tag="mrcbU88"> Robust Sequential Decision-Making in Adversarial Environments: Codebase 
https://hdl.handle.net/11104/0376313 </unknown> </cas_special> </bibitem>