An actor critic algorithm based on Grassmanian search

Prabuchandran, K. J.; Bhatnagar, Shalabh; Borkar, Vivek S. (2014) An actor critic algorithm based on Grassmanian search. In: 53rd IEEE Conference on Decision and Control, 15-17 Dec. 2014, Los Angeles, CA, USA.


Official URL: http://doi.org/10.1109/CDC.2014.7039948

Abstract

We propose the first online actor-critic scheme with adaptive basis to find a locally optimal control policy for a Markov Decision Process (MDP) under the weighted discounted cost objective. We parameterize both the policy in the actor and the value function in the critic. The actor performs gradient search in the space of policy parameters using simultaneous perturbation stochastic approximation (SPSA) gradient estimates. This gradient computation requires estimates of the value function, which are provided by the critic by minimizing a mean square Bellman error objective. In order to obtain good estimates of the value function, the critic adaptively tunes the basis functions (or features) to obtain the best representation of the value function using gradient search on the Grassmannian of features. Our control algorithm makes use of multi-timescale stochastic approximation. The actor updates its parameters along the slowest timescale. The critic uses two timescales to estimate the value function. For any given feature value, our algorithm performs gradient search in the parameter space via a residual gradient scheme on the faster timescale and, on a medium timescale, performs gradient search on the Grassmann manifold of features. We provide an outline of the proof of convergence of our control algorithm to a locally optimal policy. We show empirical results using our algorithm as well as a similar algorithm that uses temporal difference (TD) learning in place of the residual gradient scheme for the faster timescale updates.
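
The following is a minimal sketch, not the authors' code, of the kind of multi-timescale scheme the abstract describes: a residual-gradient critic with linear value approximation on the fast timescale, feature adaptation by a projected-gradient step with QR retraction on the Grassmann manifold on the medium timescale, and a two-sided SPSA update of the actor parameters on the slow timescale. The random MDP, softmax policy, step sizes, and rollout-based cost estimate are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, k, gamma = 10, 2, 3, 0.9                 # states, actions, feature dimension, discount
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # assumed transition kernel P[s, a, s']
C = rng.random((nS, nA))                         # assumed one-stage cost c(s, a)

theta = np.zeros((nS, nA))                       # actor: tabular softmax policy parameters
Phi, _ = np.linalg.qr(rng.standard_normal((nS, k)))  # orthonormal features (point on the Grassmannian)
w = np.zeros(k)                                  # critic weights, V(s) ~ Phi[s] @ w

def policy(th, s):
    p = np.exp(th[s] - th[s].max())
    return p / p.sum()

def step(s, a):
    s2 = rng.choice(nS, p=P[s, a])
    return C[s, a], s2

def rollout_cost(th, s0, T=30):
    # Monte Carlo estimate of the discounted cost from s0 under softmax(th) (illustrative proxy).
    s, total, disc = s0, 0.0, 1.0
    for _ in range(T):
        a = rng.choice(nA, p=policy(th, s))
        c, s = step(s, a)
        total += disc * c
        disc *= gamma
    return total

alpha, beta, a_spsa, c_spsa = 0.05, 0.005, 0.001, 0.1   # fast / medium / slow step sizes (assumed)
s = 0
for t in range(5000):
    a = rng.choice(nA, p=policy(theta, s))
    cost, s2 = step(s, a)

    # Fast timescale: residual-gradient step on the mean square Bellman error.
    delta = cost + gamma * Phi[s2] @ w - Phi[s] @ w
    w -= alpha * delta * (gamma * Phi[s2] - Phi[s])

    # Medium timescale: Euclidean gradient of the squared Bellman error w.r.t. Phi,
    # projected onto the tangent space of the Grassmannian, then retracted via QR.
    G = np.zeros_like(Phi)
    G[s2] += delta * gamma * w
    G[s]  -= delta * w
    G_tan = G - Phi @ (Phi.T @ G)
    Phi, _ = np.linalg.qr(Phi - beta * G_tan)

    # Slow timescale: two-sided SPSA gradient estimate for the actor parameters.
    if t % 50 == 0:
        Delta = rng.choice([-1.0, 1.0], size=theta.shape)
        J_plus = rollout_cost(theta + c_spsa * Delta, s)
        J_minus = rollout_cost(theta - c_spsa * Delta, s)
        theta -= a_spsa * (J_plus - J_minus) / (2.0 * c_spsa) / Delta

    s = s2

Replacing the residual-gradient step with a TD(0) update (w += alpha * delta * Phi[s]) would give the TD-based variant mentioned at the end of the abstract; the Grassmann and SPSA updates would stay the same.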

Item Type:Conference or Workshop Item (Paper)
Source:Copyright of this article belongs to Institute of Electrical and Electronics Engineers.
Keywords:Control; Feature Adaptation; Online Learning; Residual Gradient Scheme; Temporal Difference Learning; Stochastic Approximation; Grassmann Manifold
ID Code:116671
Deposited On:12 Apr 2021 07:21
Last Modified:12 Apr 2021 07:21