Experiments with Infinite-Horizon, Policy-Gradient Estimation (2001)
Jonathan Baxter, Whizbang Labs, Peter L. Bartlett, Biowulf Technologies, Lex Weaver
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm...
Infinite-Horizon Policy-Gradient Estimation (2001)
Jonathan Baxter, Whizbang Labs, Peter L. Bartlett, Biowulf Technologies
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems...