Abstract
Deep reinforcement learning (DRL) algorithms have shown impressive results in various applications, but non-stationary environments, such as those with varying operating conditions and external disturbances, remain a significant challenge. To address this challenge, we propose the hidden transition inference (HTI) framework for learning non-stationary transitions in multi-step tree search. Unlike previous methods that focus on single-step transition changes, the HTI framework improves decision-making by inferring multi-step environmental variations. Specifically, the framework constructs a probabilistic graphical model for Monte Carlo Tree Search (MCTS) in latent space and uses the variational lower bound of hidden states for policy improvement. Furthermore, this work theoretically proves the convergence of the HTI framework, supporting its effectiveness in handling non-stationary environments.
The proposed framework is integrated with the state-of-the-art MCTS-based algorithm Sampled MuZero and evaluated on multiple control tasks with different types of non-stationary dynamics. Experimental results show that the HTI framework improves the inference capability of tree search in non-stationary environments, indicating its potential for addressing control problems in such settings.
Method
This work proposes a hidden transition inference (HTI) framework for learning non-stationary transitions. To establish the relationship between time-varying environmental parameters, a probabilistic graphical model is constructed for multi-step search in the latent space, and the variational lower bound of the optimal variables is derived. By learning the non-stationary hidden transitions, the HTI framework improves the decision-making ability of MCTS in non-stationary environments.
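For orientation, the generic shape of a variational lower bound on an optimality variable can be written as follows. This is the standard evidence lower bound, not the specific multi-step bound derived in this work; the symbols $\mathcal{O}$ (optimality variable), $z$ (hidden transition variable), and $q$ (variational posterior) are assumptions introduced here for illustration.

```latex
\log p(\mathcal{O})
  = \log \int p(\mathcal{O} \mid z)\, p(z)\, \mathrm{d}z
  \;\ge\; \mathbb{E}_{q(z)}\!\left[ \log p(\mathcal{O} \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q(z) \,\|\, p(z) \right)
```

The inequality follows from Jensen's inequality; the bound is tight when $q(z)$ equals the true posterior $p(z \mid \mathcal{O})$.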
Fig. 1 The probabilistic graphical model for MCTS in latent space. The posterior probability of the optimal variable in the hidden transition is related to the historical MDP.
The planning and training processes of the HTI framework. During the planning phase, the representation network processes historical observations and stores the extracted features in the root node; during the training phase, the decoder network reconstructs the original hidden transition.
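The two phases described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear encoder and decoder, the dictionary used as a root node, the Gaussian latent with a standard-normal prior, and the choice of reconstruction target are all hypothetical stand-ins, and every function name is invented for the example.

```python
import numpy as np

def encode_history(history, w_mu, w_logvar):
    """Hypothetical linear 'representation network': maps a flattened
    observation history to the mean and log-variance of a Gaussian latent."""
    h = history.reshape(-1)
    return w_mu @ h, w_logvar @ h

def sample_latent(mu, logvar, rng):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def negative_elbo(target, recon, mu, logvar):
    """Squared-error reconstruction term plus the KL regulariser."""
    recon_err = 0.5 * np.sum((target - recon) ** 2)
    return recon_err + kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
obs_dim, hist_len, latent_dim = 3, 4, 2  # assumed toy sizes

# Toy data and randomly initialised (untrained) weights.
history = rng.standard_normal((hist_len, obs_dim))
w_mu = 0.1 * rng.standard_normal((latent_dim, hist_len * obs_dim))
w_logvar = 0.1 * rng.standard_normal((latent_dim, hist_len * obs_dim))
w_dec = 0.1 * rng.standard_normal((obs_dim, latent_dim))

# Planning phase: encode the history and store the features in the root node.
mu, logvar = encode_history(history, w_mu, w_logvar)
root_node = {"latent_mean": mu, "latent_logvar": logvar, "children": {}}

# Training phase: decode a latent sample and score the reconstruction.
z = sample_latent(root_node["latent_mean"], root_node["latent_logvar"], rng)
recon = w_dec @ z
target = history[-1]  # assumption: placeholder for the hidden-transition target
loss = negative_elbo(target, recon, mu, logvar)
```

In a real system the encoder and decoder would be trained networks and the loss would be minimised by gradient descent; the sketch only shows how features cached at the root during planning can be reused for a reconstruction objective during training.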
Result
Comparison of the control performance of the LILAC and HTISZero policies in the DoorOpen-ns3 task. The LILAC policy fails to complete the task reliably in the presence of disturbances, whereas the HTISZero policy maintains stable control even under non-stationary disturbances.
Maximum return of different algorithms in robotic arm tasks with different non-stationary types.