Operations Research
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


OPERATIONS RESEARCH
Vol. 53, No. 1, January-February 2005, pp. 126-139
DOI: 10.1287/opre.1040.0145
This Article
Right arrow Full Text (PDF)
Right arrow References
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Chang, H. S.
Right arrow Articles by Marcus, S. I.
Right arrow Search for Related Content

An Adaptive Sampling Algorithm for Solving Markov Decision Processes

Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, Steven I. Marcus

Department of Computer Science and Engineering, Sogang University, Seoul, Korea
Robert H. Smith School of Business, and Institute for Systems Research, University of Maryland, College Park, Maryland 20742
Department of Electrical and Computer Engineering, and Institute for Systems Research, University of Maryland, College Park, Maryland 20742
Department of Electrical and Computer Engineering, and Institute for Systems Research, University of Maryland, College Park, Maryland 20742

hschang{at}ccs.sogang.ac.kr
mfu{at}rhsmith.umd.edu
jqhu{at}glue.umd.edu
marcus{at}eng.umd.edu

Based on recent results for multiarmed bandit problems, we propose an adaptive sampling algorithm that approximates the optimal value of a finite-horizon Markov decision process (MDP) with finite state and action spaces. The algorithm adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate (ln N)/N, where N is the total number of samples that are used per state sampled in each stage. The worst-case running-time complexity of the algorithm is O((|A|N)H), independent of the size of the state space, where |A| is the size of the action space and H is the horizon length. The algorithm can be used to create an approximate receding horizon control to solve infinite-horizon MDPs. To illustrate the algorithm, computational results are reported on simple examples from inventory control.

Subject classifications: dynamic programming/optimal control:Markov finite state.
History: Received June 2002; revision received May 2003; accepted November 2003.







HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2005 by INFORMS.