Solving MDPs using Two-timescale Simulated Annealing with Multiplicative Weights

Abdulla, Mohammed Shahid; Bhatnagar, Shalabh (2007) Solving MDPs using Two-timescale Simulated Annealing with Multiplicative Weights. In: American Control Conference, 9-13 July 2007, New York, NY, USA.

Official URL: http://doi.org/10.1109/ACC.2007.4282586

Abstract

We develop extensions of the simulated annealing with multiplicative weights (SAMW) algorithm, which was proposed as a method for solving finite-horizon Markov decision processes (FH-MDPs). The extensions are in three directions: a) use of the dynamic programming principle in the policy update step of SAMW, b) a two-timescale actor-critic algorithm that uses simulated transitions alone, and c) extension of the algorithm to the infinite-horizon discounted-reward setting. In particular, a) reduces the storage required from exponential to linear in the number of actions per stage-state pair. On the faster timescale, a 'critic' recursion performs policy evaluation, while on the slower timescale an 'actor' recursion performs policy improvement using SAMW. We outline a proof of convergence w.p. 1 and present experimental results in two settings: semiconductor fabrication and flow control in communication networks.
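To make the two-timescale structure concrete, the following is a minimal sketch in Python of how such an actor-critic scheme can look for the infinite-horizon discounted-reward case in c). It is an illustration under our own assumptions, not the authors' exact recursions: the critic is a plain TD(0) evaluation step, the actor applies a SAMW-style multiplicative-weights update per state-action pair (so storage is linear in the number of actions, mirroring reduction a)), and the step sizes are chosen so that the actor's effective step log(beta_t) decays faster than the critic's alpha_t, which is the standard two-timescale condition.

    import numpy as np

    def samw_actor_critic(P, R, gamma=0.9, n_iters=20000, seed=0):
        """Illustrative two-timescale actor-critic with a SAMW-style
        multiplicative-weights actor. P has shape (S, A, S), R has shape
        (S, A). All names and step-size schedules here are assumptions
        made for this sketch."""
        rng = np.random.default_rng(seed)
        S, A, _ = P.shape
        V = np.zeros(S)        # critic: value estimates under the current policy
        w = np.ones((S, A))    # actor: one weight per state-action pair
        s = 0
        for t in range(1, n_iters + 1):
            pi = w[s] / w[s].sum()               # randomized policy at state s
            a = rng.choice(A, p=pi)
            s_next = rng.choice(S, p=P[s, a])    # simulated transition
            r = R[s, a]
            # Faster timescale: 'critic' recursion (TD(0) policy evaluation).
            alpha = 1.0 / t ** 0.6
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            # Slower timescale: 'actor' multiplicative-weights update with an
            # annealed base beta_t -> 1; log(beta_t) ~ 1/t decays faster than
            # alpha ~ t^(-0.6), so the critic equilibrates between actor moves.
            beta = 1.0 + 1.0 / t
            q_sample = r + gamma * V[s_next]     # one-sample Q(s, a) estimate
            w[s, a] *= beta ** q_sample
            w[s] /= w[s].max()                   # rescale; weight ratios unchanged
            s = s_next
        return V, w / w.sum(axis=1, keepdims=True)

    if __name__ == "__main__":
        # Toy 5-state, 2-action MDP with random dynamics and rewards.
        rng = np.random.default_rng(1)
        P = rng.random((5, 2, 5))
        P /= P.sum(axis=2, keepdims=True)
        R = rng.random((5, 2))
        V, policy = samw_actor_critic(P, R)
        print("values:", V)
        print("policy:", policy)

Keeping one weight per state-action pair, rather than one weight per policy as in the original SAMW, is what turns the exponential storage requirement into a linear one; the annealed base beta_t plays the role of the simulated-annealing temperature schedule.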

Item Type: Conference or Workshop Item (Paper)
Source: Copyright of this article belongs to the Institute of Electrical and Electronics Engineers (IEEE).
Keywords: Markov Decision Processes; Reinforcement Learning; Two-Timescale Stochastic Approximation; Simulated Annealing with Multiplicative Weights.
