bibtype J - Journal Article
ARLID 0485146
utime 20240103215417.8
mtime 20180119235959.9
SCOPUS 85040739483
WOS 000424732300008
DOI 10.14736/kyb-2017-6-1086
title (primary) (eng) Second Order Optimality in Markov Decision Chains
specification
page_count 14 pp.
media_type P
serial
ARLID cav_un_epca*0297163
ISSN 0023-5954
title Kybernetika
volume_id 53
volume 6 (2017)
page_num 1086-1099
publisher
name Ústav teorie informace a automatizace AV ČR, v. v. i.
keyword Markov decision chains
keyword second order optimality
keyword optimality conditions for transient, discounted and average models
keyword policy and value iterations
author (primary)
ARLID cav_un_auth*0101196
full_dept (cz) Ekonometrie
full_dept (eng) Department of Econometrics
department (cz) E
department (eng) E
share 100%
name1 Sladký
name2 Karel
institution UTIA-B
garant K
fullinstit Ústav teorie informace a automatizace AV ČR, v. v. i.
source
url http://library.utia.cas.cz/separaty/2017/E/sladky-0485146.pdf
cas_special
project
ARLID cav_un_auth*0321097
project_id GA15-10331S
agency GA ČR
abstract (eng) The article is devoted to Markov reward chains in a discrete-time setting with finite state spaces. Unfortunately, the usual optimization criteria examined in the literature on Markov decision chains, such as total discounted reward, total reward up to reaching some specific state (the so-called first passage models), or mean (average) reward optimality, may be quite insufficient to characterize the problem from the point of view of a decision maker. It may therefore be preferable, if not necessary, to select more sophisticated criteria that also reflect the variability-risk features of the problem. Perhaps the best known approaches stem from the classical work of Markowitz on mean-variance selection rules, i.e. we optimize the weighted sum of the average or total reward and its variance. The article presents explicit formulae for calculating the variances for transient and discounted models (where the value of the discount factor depends on the current state and the action taken) for finite and infinite time horizons. Analogous results are presented for long-run average nondiscounted models, where finding stationary policies that minimize the average variance within the class of policies with a given long-run average reward is discussed.
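note (eng) The abstract refers to explicit formulae for the variance of the total reward and to a Markowitz-type mean-variance criterion. The following is a minimal illustrative sketch, not the paper's own derivation: it assumes a fixed stationary policy, a state-independent discount factor beta, and rewards depending only on the current state (the paper treats the more general state- and action-dependent discount case). The function name discounted_mean_and_variance and the example data are hypothetical.

```python
import numpy as np

def discounted_mean_and_variance(P, r, beta):
    """Mean and variance of the total discounted reward per initial state.

    P    : (n, n) transition matrix under a fixed stationary policy
    r    : (n,) one-step rewards depending on the current state
    beta : scalar discount factor in (0, 1)  [simplifying assumption]
    """
    n = len(r)
    I = np.eye(n)
    # First moment v solves v = r + beta * P v, i.e. (I - beta P) v = r.
    v = np.linalg.solve(I - beta * P, r)
    # Second moment s solves s = r^2 + 2 beta r * (P v) + beta^2 P s.
    rhs = r ** 2 + 2 * beta * r * (P @ v)
    s = np.linalg.solve(I - beta ** 2 * P, rhs)
    return v, s - v ** 2  # mean and variance of the total discounted reward

# Hypothetical two-state example and a Markowitz-type weighted criterion:
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 5.0])
v, var = discounted_mean_and_variance(P, r, beta=0.95)
kappa = 0.1                 # assumed risk-aversion weight
score = v - kappa * var     # weighted sum of reward and (negative) variance
```

Under these simplifying assumptions, the variance follows from the recursion for the second moment of the total discounted reward; comparing the score vector across policies mimics the mean-variance selection rule mentioned in the abstract.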
RIV BB
FORD0 10000
FORD1 10100
FORD2 10103
reportyear 2018
num_of_auth 1
inst_support RVO:67985556
permalink http://hdl.handle.net/11104/0280354
confidential S
mrcbC86 3+4 Article|Proceedings Paper Computer Science Cybernetics
mrcbT16-e COMPUTERSCIENCECYBERNETICS
mrcbT16-j 0.224
mrcbT16-s 0.321
mrcbT16-B 18.907
mrcbT16-D Q4
mrcbT16-E Q3
arlyear 2017
mrcbU14 85040739483 SCOPUS
mrcbU24 PUBMED
mrcbU34 000424732300008 WOS
mrcbU63 cav_un_epca*0297163 Kybernetika 0023-5954 Vol. 53 No. 6 2017 1086 1099 Ústav teorie informace a automatizace AV ČR, v. v. i.