Abdulla, Mohammed Shahid; Bhatnagar, Shalabh (2007). Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. *Discrete Event Dynamic Systems*, 17(1), pp. 23-52. ISSN 0924-6703
Full text not available from this repository.
Official URL: http://doi.org/10.1007/s10626-006-0003-y
Abstract
This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes with finite state space under the average-cost criterion. Two of the algorithms are for the compact (non-discrete) action setting, while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous-time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance than those of Konda and Borkar (1999).
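The two-timescale scheme described in the abstract can be illustrated with a short sketch: the critic runs a TD(0)-style update of the differential cost on a faster step-size schedule, while the actor nudges the policy parameters with an SPSA gradient estimate on a slower one. The Python snippet below is a minimal, hypothetical sketch under assumed choices (a softmax policy parameterisation, Bernoulli ±1 perturbations, one-measurement SPSA, and a randomly generated MDP); the paper's actual algorithms, including the normalized-Hadamard-matrix perturbations and the additional averaging step, differ in detail.

```python
import numpy as np

# Illustrative sketch only: the policy parameterisation, step-size
# schedules a_n and b_n, and the simulate_step helper are assumptions,
# not the paper's exact construction.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Hypothetical MDP: random transition kernel P[s, a] and cost c[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
c = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))   # actor: policy parameters
h = np.zeros(n_states)                    # critic: differential cost estimates
rho = 0.0                                 # running average-cost estimate
delta = 0.1                               # SPSA perturbation magnitude

def policy(th, s):
    """Softmax policy over actions -- an assumed parameterisation."""
    z = np.exp(th[s] - th[s].max())
    return z / z.sum()

def simulate_step(s, th):
    """Sample one transition of the hypothetical MDP under parameters th."""
    a = rng.choice(n_actions, p=policy(th, s))
    s_next = rng.choice(n_states, p=P[s, a])
    return a, c[s, a], s_next

s = 0
for n in range(1, 20000):
    b_n = 1.0 / n**0.6          # faster (critic) step size
    a_n = 1.0 / n               # slower (actor) step size; a_n / b_n -> 0

    # Faster timescale: TD(0)-style update of the differential cost h
    # and the average cost rho under the current policy.
    a, cost, s_next = simulate_step(s, theta)
    td_err = cost - rho + h[s_next] - h[s]
    h[s] += b_n * td_err
    rho += b_n * (cost - rho)

    # Slower timescale: one-measurement SPSA gradient estimate of the
    # average cost using a random +/-1 perturbation of theta. Noisy, but
    # unbiased in the limit; the paper's two SPSA variants are sharper.
    Delta = rng.choice([-1.0, 1.0], size=theta.shape)
    _, cost_plus, _ = simulate_step(s, theta + delta * Delta)
    grad_est = (cost_plus - cost) / (delta * Delta)
    theta -= a_n * grad_est

    s = s_next

print("estimated average cost:", rho)
```

The essential two-timescale requirement, visible in the step sizes above, is that the actor's step size a_n vanishes faster than the critic's b_n (a_n / b_n → 0), so the critic effectively equilibrates and tracks the differential cost of the current policy between actor updates.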
| Field | Value |
|---|---|
| Item Type: | Article |
| Source: | Copyright of this article belongs to Springer Nature. |
| Keywords: | Actor-Critic Algorithms; Two-Timescale Stochastic Approximation; Markov Decision Processes; Policy Iteration; Simultaneous Perturbation Stochastic Approximation; Normalized Hadamard Matrices; Reinforcement Learning; TD-Learning |
| ID Code: | 116597 |
| Deposited On: | 12 Apr 2021 07:07 |
| Last Modified: | 12 Apr 2021 07:07 |

