Machine Learning and Friends Lunch






Optimal Coordinated Learning Among Self-Interested Agents in the Multi-Armed Bandit Problem


David Parkes
Harvard

Abstract


In the multi-armed bandit problem, one of a set of n independent
reward-generating processes (or arms) is activated at each time step
of a sequential decision problem. I consider a multi-agent variant, in
which each agent brings an action to the system and receives a
privately observed stochastic reward whenever any action is taken. I
model agent self-interest and consider a central arbitrator that takes
an action in each time period based on agent bids. I adapt the
factored method of Gittins (1974) to provide a distributed algorithm
for optimal coordinated learning, in which incentives are provided for
agents to report true rewards and the distributed algorithm itself is
in equilibrium. The arbitrator interleaves an "epsilon-sampling"
process with a Bayes-optimal learning policy. Together, these are used
to define a payment sequence that brings the joint learning process
into equilibrium.
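As a rough illustration only (not the algorithm from the talk), the interleaving of an epsilon-sampling step with an index-based exploitation step can be sketched as below. The Bayes-optimal Gittins-index policy is replaced here by a simple posterior-mean rule on Beta-Bernoulli arms, and the incentive payments are omitted entirely; the class and function names are invented for this sketch.

```python
import random

class BernoulliArm:
    """One agent's reward process, with a Beta posterior over its success rate."""
    def __init__(self, true_p):
        self.true_p = true_p          # unknown to the arbitrator
        self.alpha, self.beta = 1, 1  # Beta(1, 1) prior over the success rate

    def pull(self):
        # The agent privately observes a Bernoulli reward and reports it;
        # the arbitrator updates the arm's Beta posterior from the report.
        r = 1 if random.random() < self.true_p else 0
        self.alpha += r
        self.beta += 1 - r
        return r

    def posterior_mean(self):
        return self.alpha / (self.alpha + self.beta)

def run_arbitrator(arms, horizon, epsilon=0.1, seed=0):
    """Interleave uniform epsilon-sampling with a greedy index step
    (a crude stand-in for the Bayes-optimal policy)."""
    random.seed(seed)
    total = 0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.choice(arms)  # exploration: epsilon-sampling step
        else:
            arm = max(arms, key=lambda a: a.posterior_mean())  # exploitation step
        total += arm.pull()
    return total
```

With three arms of success rates 0.2, 0.5, and 0.8, the arbitrator's posterior counts grow by exactly one observation per time step, and the greedy step increasingly favors the best arm.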
