论文标题
多准子多臂匪徒的遗憾最小化
Constrained regret minimization for multi-criterion multi-armed bandits
论文作者
论文摘要
我们考虑了随机的多武器匪徒设置,并研究了给定时间范围内遗憾最小化的问题。每个手臂都与未知的,可能是多维分布相关联,并且手臂的优点由几个可能的,可能是冲突的属性决定。目的是将“主要”属性优化为其他“次要”属性的用户提供的约束。我们假设可以使用武器分布中的样品来估算属性,并且估计器具有合适的浓度属性。我们提出了一种称为CON-LCB的算法,该算法保证了对数遗憾,即,所有非最佳武器的平均戏剧数量最多是地平线上的对数。该算法还输出了一个布尔标志,该标志以高概率正确识别,是否相对于约束是可行的/不可行的。我们还表明,CON-LCB在通用常数中是最佳的,即更复杂的算法不能普遍做得更好。最后,我们在遗憾最小化和可行性识别之间建立了一个基本的权衡。我们的框架在财务组合优化中找到了自然应用,在这种情况下,风险限制了预期收益的最大化是有意义的。
We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a Boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.