Online Learning in Opportunistic Spectrum Access: A Restless Bandit Approach
Full recordShow full item record
AbstractWe consider an opportunistic spectrum access (OSA) problem where the time-varying condition of each channel (e.g., as a result of random fading or certain primary users' activities) is modeled as an arbitrary finite-state Markov chain. At each instance of time, a (secondary) user probes a channel and collects a certain reward as a function of the state of the channel (e.g., good channel condition results in higher data rate for the user). Each channel has potentially different state space and statistics, both unknown to the user, who tries to learn which one is the best as it goes and maximizes its usage of the best channel. The objective is to construct a good online learning algorithm so as to minimize the difference between the user's performance in total rewards and that of using the best channel (on average) had it known which one is the best from a priori knowledge of the channel statistics (also known as the regret). This is a classic exploration and exploitation problem and results abound when the reward processes are assumed to be iid. Compared to prior work, the biggest difference is that in our case the reward process is assumed to be Markovian, of which iid is a special case. In addition, the reward processes are restless in that the channel conditions will continue to evolve independent of the user's actions. This leads to a restless bandit problem, for which there exists little result on either algorithms or performance bounds in this learning context to the best of our knowledge. In this paper we introduce an algorithm that utilizes regenerative cycles of a Markov chain and computes a sample-mean based index policy, and show that under mild conditions on the state transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over time, and that this regret bound is also optimal.