
Thesis

English

ID: 10402/era.43521

Toward Practical Reinforcement Learning Algorithms: Classification Based Policy Iteration and Model-Based Learning

Abstract

Specialization: Statistical Machine Learning
Degree: Doctor of Philosophy

In this dissertation, we advance the theoretical understanding of two families of Reinforcement Learning (RL) methods: classification-based policy iteration (CBPI) and model-based reinforcement learning (MBRL) with factored semi-linear models. In contrast to generalized policy iteration, CBPI does not rely on value-function estimates (value estimates for all states and actions). Instead, CBPI uses a classifier to learn to discriminate value-maximizing actions, based on value estimates for a set of observed states and actions. This creates the potential for learning effective policies in settings where estimating full value functions is challenging, but where good value estimates for sampled states can be obtained and good actions can be distinguished from sub-optimal ones. Previous theoretical work on CBPI has required classifiers that are computed by solving a combinatorial problem, which we can expect to be computationally hard (with minimization of the infamous 0-1 loss as a special case). In contrast, classifiers that are computed by solving convex minimization problems (which can be done efficiently) enjoy limited or no performance guarantees, namely, bounds on the cost-sensitive generalization of the misclassification probability, the so-called "true risk". Therefore, we only have instances of CBPI that enjoy theoretical guarantees but cannot be used in practice, or vice versa. We present a theoretical analysis of CBPI that fills this gap by providing theoretical guarantees for instances of CBPI that can be used in practice (by virtue of the classification methods used). Our analysis extends an existing theoretical analysis of CBPI, and incorporates performance guarantees for classification methods that minimize a cost-sensitive multiclass convex surrogate loss that generalizes the hinge loss.
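As a rough illustration of the CBPI scheme described above, the sketch below runs classification-based policy iteration on a toy deterministic chain MDP. The environment, the rollout-based value estimates, and the table-lookup "classifier" are illustrative stand-ins chosen for brevity, not the dissertation's algorithms; the point is only the structure: no value function is maintained, and the policy is re-trained each iteration from (state, best-action) labels.

```python
# Tiny deterministic chain MDP: states 0..4, actions 0 (left) / 1 (right).
# Reward 1.0 whenever the transition lands on the goal state.
N_STATES, GOAL = 5, 4

def step(s, a):
    ns = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return ns, (1.0 if ns == GOAL else 0.0)

def rollout_q(s, a, policy, gamma=0.9, horizon=20):
    """Estimate Q^pi(s, a): take action a once, then follow the policy."""
    ns, r = step(s, a)
    total, disc = r, gamma
    for _ in range(horizon):
        ns, r = step(ns, policy(ns))
        total += disc * r
        disc *= gamma
    return total

def cbpi(iterations=5):
    policy = lambda s: 0                       # initial policy: always left
    for _ in range(iterations):
        # Label each state with its value-maximizing action, using only
        # pointwise value estimates for the observed (state, action) pairs.
        labels = {s: max((0, 1), key=lambda a: rollout_q(s, a, policy))
                  for s in range(N_STATES)}
        # "Train" a classifier on the labels; on this tiny discrete state
        # space a table lookup stands in for a real classifier.
        policy = lambda s, tbl=dict(labels): tbl[s]
    return policy
```

On this chain the greedy labels propagate back from the goal one state per iteration, so a handful of iterations yields the always-right policy.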
The hinge loss has been widely used in the design of classification methods, including the popular support vector machines (SVMs). As part of our analysis, we also present results for cost-sensitive multiclass classification: novel expected surrogate loss (surrogate-risk) bounds, as well as tools for converting surrogate risk bounds into true risk bounds. This is done with the help of novel calibration functions that are specific to cost-sensitive multiclass classification losses. Moreover, our analysis of CBPI can be easily adjusted to accommodate different classification methods, provided that the corresponding surrogate risk bounds and calibration functions are available.

We also present policy error bounds for MBRL methods that use factored semi-linear models. The factored semi-linear model framework generalizes the factored linear model framework and many existing MBRL methods. Notably, factored semi-linear models generalize a recent trend of MBRL methods that depart from learning "traditional" MDP models in order to achieve flexibility, computational efficiency, data efficiency and good empirical performance. As such, factored semi-linear models are both flexible and geared toward efficient policy computation, with instances that have been shown to be promising in practice. The policy error bounds that we present improve the previously existing bounds by relaxing conditions, refining the bounds themselves, and increasing the scope of models to which they apply, namely, factored semi-linear models. These bounds allow us to understand the policy error in norms other than the overly harsh supremum norm. For example, our Lp(mu) norm results show that policy error bounds for MBRL methods with factored semi-linear models are less sensitive to covariate shift than policy error bounds for competing methods, such as approximate linear programming or approximate dynamic programming methods.
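To make the surrogate concrete: one standard cost-sensitive generalization of the Crammer-Singer multiclass hinge loss takes the following form. This is an illustrative formulation only; the dissertation's exact surrogate and notation may differ.

```python
import numpy as np

def cs_hinge(scores, costs):
    """Cost-sensitive multiclass hinge surrogate (one common form):
    max over labels y of [cost(y) - cost(y*) + score(y) - score(y*)],
    where y* is the least-cost label. With 0/1 costs this reduces to
    the standard Crammer-Singer multiclass hinge loss."""
    scores, costs = np.asarray(scores, float), np.asarray(costs, float)
    ystar = int(np.argmin(costs))
    margins = (costs - costs[ystar]) + (scores - scores[ystar])
    return float(margins.max())  # always >= 0, since y = y* gives 0
```

The loss is convex in the scores (a pointwise maximum of affine functions), which is what makes minimizing it tractable, in contrast to the combinatorial 0-1/true-risk objective.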
This robustness to covariate shift suggests that MBRL methods with factored semi-linear models have considerable potential as an alternative to popular non-model-based RL methods.
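As a sketch of why factored models support efficient planning: in a factored linear model (the special case of the semi-linear framework) each transition kernel factors as P_a = Phi @ M_a, so a Bellman backup can compress the value function into d dimensions before lifting it back, costing O(Sd) per action instead of O(S^2). The toy construction below is illustrative only; the dissertation's semi-linear framework is more general.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 6, 2, 2, 0.9

# Factored transition model: P_a = Phi @ M[a], with Phi (S x d) and
# M[a] (d x S) both row-stochastic, so each P_a is a valid kernel.
Phi = rng.random((S, d)); Phi /= Phi.sum(1, keepdims=True)
M = rng.random((A, d, S)); M /= M.sum(2, keepdims=True)
R = rng.random((A, S))

# Value iteration exploiting the factorization: each backup first
# compresses V into the d-dimensional space (M[a] @ V), then lifts
# the result back to states via Phi.
V = np.zeros(S)
for _ in range(500):
    V = np.max([R[a] + gamma * Phi @ (M[a] @ V) for a in range(A)], axis=0)
```

Because the factored backups are exactly the backups of the induced kernels P_a = Phi @ M[a], this V agrees with ordinary value iteration on the full S x S model while never materializing it.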
