Bandits and Reinforcement Learning
COMS 6998-11, Spring 2022
Project Guidelines
The course project is an opportunity to explore additional topics in the theory of reinforcement learning that you are especially interested in.
Broadly, projects can fall into three categories.
- Literature Survey + Open problems. Conduct a literature review of research on a particular topic in RL theory. Provide a detailed and clear overview of results, along with algorithmic and analysis techniques, and identify potentially interesting directions for further research. If you can unify proof techniques for several algorithms or settings, that would be very useful.
- Literature Survey + Empirical evaluation. As above, focus on a particular setting in RL theory, provide a detailed description of algorithms for that setting to demonstrate your understanding of the theory, and then implement the algorithms and evaluate them on a task of your choice. Verifying theoretical results in experiments or testing conjectures would also be interesting.
- New Theoretical Research. Design and analyze a new algorithm for some bandit or reinforcement learning setting, improve on analysis for an existing algorithm, or derive negative results.
Any research tied to reinforcement learning, with a theoretical flavor, is acceptable here, but make sure you have a contingency plan.
Teams
You may work by yourself or in groups of up to two students. I expect two-person projects to be more ambitious.
Timeline and Evaluation
The project will consist of three components.
- Project Proposal (25%), due 3/11.
- 1-2 pages, not including references.
- Include the project title, team members, abstract, related work, problem formulation, and goals you hope to achieve.
- Project Presentations (25%), in class on 5/2.
- 5 minute presentation that briefly summarizes the problem you are studying and any results or progress that you have.
- Project Writeup (75%), due 5/9.
- 8 pages, not including references.
- The writeup should read like a research paper. Describe the problem you are studying, how you solve it, and why the approach is sound. The writeup will be evaluated like a research paper, on merit (e.g., is the approach reasonable?), technical depth (e.g., was it challenging?), and presentation (e.g., are the visualizations and writing clear?).
Both documents should be in NeurIPS format.
Some project ideas
These are heavily biased by my personal interests and tastes. Feel free to choose something else!
- Instance-dependent or adaptive guarantees for bandits and RL. In this course we will mostly see worst-case guarantees. However, much more favorable guarantees are often possible on benign instances. Can you identify a parameter that quantifies when a problem is easy, and design an algorithm that adapts to this parameter? This paper is a good place to start reading about instance-dependent guarantees for bandits. (A minimal UCB sketch appears at the end of this list.)
- Bandit learning with nonlinear models. The optimism principle works well for bandit settings with linear reward models, but it does not seem to work when the reward function is nonlinear. Can you develop algorithms that do work in some nonlinear settings? Here are some related papers. (A LinUCB-style sketch of the linear case appears at the end of this list.)
- Contextual bandit model selection. Searching over different function classes, also known as model selection, in contextual bandits was posed as an open problem in COLT 2020, and the most general question was (essentially) resolved in the negative in NeurIPS 2021. However, this does not rule out interesting model classes for which selection is possible. Can you develop an algorithm for model selection in contextual bandits whenever it is possible? (A rough statement of the target guarantee appears at the end of this list.)
- First-order contextual bandits. We recently showed how to achieve a "first-order" bound in contextual bandits using an online regression oracle. Can you do this using an offline oracle? (The schematic shape of a first-order bound appears at the end of this list.)
- Efficient planning in large state spaces (OR perspective). A major barrier to computationally efficient reinforcement learning is that planning, even when the environment is known, can be intractable. Outside of machine learning (in theoretical computer science, operations research, etc.), there are many results on approximate planning in structured MDPs. Can we extract some high-level principles for understanding when planning is possible? Here is perhaps one paper to start your search. (A tabular value iteration sketch at the end of this list shows where the cost comes from.)
- Exploration with policy gradient methods. Study the recent works on combining policy gradient methods with exploration schemes. Can you improve the sample efficiency of these methods? (A REINFORCE-with-bonus sketch appears at the end of this list.)
- Horizon dependence in tabular RL. A recent line of work has shown that one can avoid horizon dependence entirely in tabular reinforcement learning, but the guarantees are suboptimal in other factors. Can you improve these results?
- Efficient algorithms for linear Bellman completeness. We'll study the "linear Bellman completeness" setting for RL with linear function approximation in the second part of the course. This setting is statistically tractable, but we do not know of any computationally efficient algorithms. Can you develop one? (The completeness condition is stated at the end of this list.)
- Survey on lower bounds for linear realizability. A flurry of recent work has studied RL with linear function approximation, mostly establishing hardness results. Can you identify an interesting class of RL problems where linear function approximation is tractable? (The realizability assumption is stated at the end of this list.)
- Square root T regret for rich observation MDPs. We will see a line of work establishing statistical tractability for rich observation MDPs, that is, those with very complex state spaces. However, most of these algorithms do not achieve the typically optimal square-root-T-type regret. One notable exception is this paper. Is there a simpler algorithm? Or do these techniques apply more generally?
- Imitation learning when the expert is too powerful. Imitation learning guarantees often assume that your policy class can closely approximate the expert, yet in applications the expert typically has some privileged information, which necessarily makes it hard to approximate. What happens in this case, and can we develop better algorithms? Here is a recent paper to start your search. (A behavioral cloning sketch appears at the end of this list.)
- Survey of offline RL upper and lower bounds with general function approximation. In class we will see one algorithm for offline reinforcement learning, but this is also a rich space with many interesting upper and lower bound arguments and some open problems. Can you identify new conditions under which offline RL is tractable? (A fitted Q-iteration sketch appears at the end of this list.)
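Illustrative sketches for some of the ideas above
For the instance-dependent bandits idea: a minimal UCB1 simulation on a two-armed Bernoulli bandit. The gap between the arm means is one natural "easiness" parameter, since UCB1's regret scales roughly like (log T)/gap on easy instances versus a sqrt(KT)-style worst case. The arm means, horizon, and constants below are illustrative choices, not part of any assignment.

```python
import numpy as np

def ucb1(means, T, rng=np.random.default_rng(0)):
    """Run UCB1 on a Bernoulli bandit with the given arm means for T rounds;
    returns the cumulative (pseudo-)regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.binomial(1, means[a])
        regret += max(means) - means[a]
    return regret

# The gap 0.5 - 0.4 = 0.1 is an instance-dependent "easiness" parameter:
# larger gaps make the suboptimal arm easier to rule out.
print(ucb1([0.5, 0.4], T=10_000))
```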
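For the bandits-with-nonlinear-models idea: as a contrast with the nonlinear case, here is a minimal sketch of the optimism principle with a linear reward model (a LinUCB-style rule). The random feature matrix, regularization `lam`, confidence width `beta`, and noise level are placeholder choices for illustration.

```python
import numpy as np

def linucb(features, theta_star, T, lam=1.0, beta=1.0, noise=0.1,
           rng=np.random.default_rng(0)):
    """Optimism with a linear reward model: each round, pick the arm maximizing
    an upper confidence bound on <theta, x>. `features` is a (K, d) array."""
    K, d = features.shape
    A = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # running sum of x_t * r_t
    best = features @ theta_star
    regret = 0.0
    for t in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        widths = np.sqrt(np.einsum('kd,de,ke->k', features, A_inv, features))
        a = int(np.argmax(features @ theta_hat + beta * widths))
        x = features[a]
        r = x @ theta_star + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += r * x
        regret += best.max() - best[a]
    return regret

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
theta = rng.standard_normal(5)
print(linucb(X, theta, T=2000))
```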
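For the contextual bandit model selection idea: a rough statement of the target guarantee, written as a paraphrase, so treat the exact complexity measure and side conditions as assumptions rather than the precise form of the open problem.

```latex
% Rough shape of the model-selection goal (a paraphrase; the exact complexity
% measure, norm conditions, and constants differ across papers).
% Given nested classes $\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_M$,
% let $m^\star$ be the smallest index with the true reward function in
% $\mathcal{F}_{m^\star}$. The hope is
\mathrm{Regret}(T) \;\le\; \widetilde{O}\!\Big(\sqrt{\mathrm{comp}(\mathcal{F}_{m^\star})\, T}\Big)
\quad \text{without knowing } m^\star \text{ in advance.}
```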
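For the first-order contextual bandits idea: the schematic shape of a "first-order" bound, with constants, action-set dependence, and log factors omitted; the exact form in the paper may differ.

```latex
% Schematic shape of a "first-order" regret bound. Here $L^\star$ is the
% cumulative loss of the best policy and $\mathcal{F}$ is the regression class:
\mathrm{Regret}(T) \;\lesssim\; \sqrt{L^\star \log|\mathcal{F}|} \;+\; \log|\mathcal{F}|,
\qquad \text{which improves on } \sqrt{T \log|\mathcal{F}|} \text{ whenever } L^\star \ll T.
```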
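For the efficient planning idea: a minimal tabular value iteration sketch for a known MDP. Each sweep touches every transition entry, which is exactly the cost that becomes infeasible when the state space is enormous or only implicitly specified. The explicit arrays P and R (and the random MDP in the usage example) are assumptions of the sketch.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a known MDP.
    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Each sweep costs O(S^2 * A); this is the step that stops being feasible
    when the state space is huge or only implicitly specified."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)              # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R = rng.random((S, A))
print(value_iteration(P, R))
```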
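For the exploration-with-policy-gradient idea: a sketch of vanilla REINFORCE on a tabular softmax policy with a simple count-based bonus added to the observed rewards. This is one basic exploration scheme, not the scheme from any particular paper; `step_env` is a hypothetical environment callback, and using the full undiscounted return in the update is a simplification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_with_bonus(step_env, S, A, episodes=500, horizon=20,
                         lr=0.1, bonus_coef=0.1, rng=np.random.default_rng(0)):
    """Vanilla REINFORCE on a tabular softmax policy, with a count-based
    exploration bonus added to the observed rewards. `step_env(s, a)` is
    assumed to return (next_state, reward); episodes start in state 0."""
    theta = np.zeros((S, A))    # policy logits
    counts = np.ones(S)         # state visit counts used for the bonus
    for _ in range(episodes):
        s, grads, rewards = 0, [], []
        for _ in range(horizon):
            p = softmax(theta[s])
            a = rng.choice(A, p=p)
            s_next, r = step_env(s, a)
            counts[s] += 1
            rewards.append(r + bonus_coef / np.sqrt(counts[s]))  # shaped reward
            g = -p.copy()
            g[a] += 1.0                 # grad of log pi(a|s) w.r.t. theta[s]
            grads.append((s, g))
            s = s_next
        ret = sum(rewards)              # full undiscounted return (a simplification)
        for s_t, g in grads:
            theta[s_t] += lr * ret * g  # REINFORCE update on the shaped objective
    return theta
```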
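For the linear Bellman completeness idea: the condition stated roughly; per-timestep versions and norm restrictions vary across papers.

```latex
% Linear Bellman completeness, stated roughly: for every $w \in \mathbb{R}^d$
% there exists $w' \in \mathbb{R}^d$ such that, for all $(s, a)$,
\langle \phi(s,a), w' \rangle
  \;=\; r(s,a) \;+\; \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
        \Big[\max_{a'} \langle \phi(s',a'), w \rangle\Big].
```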
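For the linear realizability survey idea: one common version of the realizability assumption that these lower bounds target.

```latex
% Linear $Q^\star$-realizability: the optimal action-value function is linear
% in a known feature map, i.e., there exists $\theta^\star \in \mathbb{R}^d$ with
Q^\star(s,a) \;=\; \langle \phi(s,a), \theta^\star \rangle \quad \text{for all } (s,a).
% The hardness results show this assumption alone is not enough for
% sample-efficient learning.
```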
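For the imitation-learning-with-a-powerful-expert idea: a minimal behavioral cloning sketch on synthetic data. When the expert acts on privileged features that the learner never observes, even the best fit in the learner's class disagrees with the expert on a constant fraction of states. The data-generating process here is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the expert sees (x_obs, x_priv); the learner only sees x_obs.
n, d_obs, d_priv = 5000, 5, 5
x_obs = rng.standard_normal((n, d_obs))
x_priv = rng.standard_normal((n, d_priv))
w_obs = rng.standard_normal(d_obs)
w_priv = rng.standard_normal(d_priv)

# The expert's (binary) action depends heavily on the privileged part.
expert_action = (x_obs @ w_obs + 3.0 * x_priv @ w_priv > 0).astype(float)

# Behavioral cloning: least-squares fit of the expert's action on the
# observable features only. The privileged signal acts as irreducible noise,
# so even the best predictor in this class often disagrees with the expert.
w_bc, *_ = np.linalg.lstsq(x_obs, expert_action - 0.5, rcond=None)
pred = (x_obs @ w_bc > 0).astype(float)
print("agreement with expert:", (pred == expert_action).mean())
```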
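For the offline RL survey idea: a sketch of fitted Q-iteration with a linear-in-features Q-function on a fixed dataset, one standard offline baseline (not necessarily the algorithm covered in class). The `dataset` format and the feature map `phi` are assumptions of the sketch.

```python
import numpy as np

def fitted_q_iteration(dataset, phi, num_actions, gamma=0.95, iters=50):
    """Fitted Q-iteration on a fixed offline dataset of (s, a, r, s') tuples
    with a linear-in-features Q-function. `phi(s, a)` returns a d-dimensional
    feature vector; the dataset is never added to (purely offline)."""
    feats = np.array([phi(s, a) for (s, a, r, s2) in dataset])        # (n, d)
    rewards = np.array([r for (_, _, r, _) in dataset])               # (n,)
    next_feats = np.array([[phi(s2, a) for a in range(num_actions)]
                           for (_, _, _, s2) in dataset])             # (n, A, d)
    d = feats.shape[1]
    w = np.zeros(d)
    reg = 1e-3 * np.eye(d)  # small ridge term for numerical stability
    for _ in range(iters):
        targets = rewards + gamma * (next_feats @ w).max(axis=1)      # Bellman targets
        w = np.linalg.solve(feats.T @ feats + reg, feats.T @ targets) # regression step
    return w
```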