Bandits and Reinforcement Learning
COMS 6998-11, Spring 2022
Project Guidelines
The course project is an opportunity to explore additional topics in the theory of reinforcement learning that you are especially interested in.
Broadly, projects can fall into three categories.
- Literature Survey + Open problems. Conduct a literature review of research on a particular topic in RL theory. Provide a detailed and clear overview of results, along with algorithmic and analysis techniques, and identify potentially interesting directions for further research. If you can unify proof techniques for several algorithms or settings, that would be very useful.
- Literature Survey + Empirical evaluation. As above, focus on a particular setting in RL theory, provide a detailed description of algorithms for that setting to demonstrate your understanding of the theory, and then implement the algorithms and evaluate them on a task of your choice. Verifying theoretical results in experiments or testing conjectures would also be interesting.
- New Theoretical Research. Design and analyze a new algorithm for some bandit or reinforcement learning setting, improve on analysis for an existing algorithm, or derive negative results.
Any research tied to reinforcement learning, with a theoretical flavor, is acceptable here, but make sure you have a contingency plan.
Teams
You may work by yourself or in groups of up to two students. I expect two-person projects to be more ambitious.
Timeline and Evaluation
The project will consist of three components.
- Project Proposal (25%), due 3/11.
- 1-2 pages, not including references.
- Include the project title, team members, abstract, related work, problem formulation, and goals you hope to achieve.
- Project Presentations (25%), in class on 5/2.
- 5 minute presentation that briefly summarizes the problem you are studying and any results or progress that you have.
- Project Writeup (75%), due 5/9.
- 8 pages, not including references.
- The writeup should read like a research paper. Describe the problem you are studying, how you solve it, and why the approach is sound. The writeup will be evaluated like a research paper, on merit (e.g., is the approach reasonable?), technical depth (e.g., was it challenging?), and presentation (e.g., are the visualizations and writing clear?).
Both documents should be in NeurIPS format.
Some project ideas
These are heavily biased by my personal interests and tastes. Feel free to choose something else!
- Instance-dependent or adaptive guarantees for bandits and RL. In this course we will mostly see worst-case guarantees. However, much more favorable guarantees are often possible on benign instances. Can you identify a parameter that quantifies when a problem is easy, and design an algorithm that adapts to this parameter? This paper is a good place to start reading about instance-dependent guarantees for bandits. (A minimal UCB sketch appears at the end of this list.)
- Bandit learning with nonlinear models. The optimism principle works well for bandit settings with linear reward models, but it does not seem to work when the reward function is nonlinear. Can you develop algorithms that do work in some nonlinear settings? Here are some related papers. (A LinUCB-style sketch of the linear case appears at the end of this list.)
- Contextual bandit model selection. Searching over different function classes, also known as model selection, in contextual bandits was posed as an open problem in COLT 2020, and the most general question was (essentially) resolved in the negative in NeurIPS 2021. However, this does not rule out interesting model classes for which selection is possible. Can you develop an algorithm for model selection in contextual bandits whenever it is possible? (A rough statement of the target guarantee appears at the end of this list.)
- First-order contextual bandits. We recently showed how to achieve a "first-order" bound in contextual bandits using an online regression oracle. Can you do this using an offline oracle? (The schematic shape of a first-order bound appears at the end of this list.)
- Efficient planning in large state spaces (OR perspective). A major barrier to computationally efficient reinforcement learning is that planning, even when the environment is known, can be intractable. Outside of machine learning (in theoretical computer science, operations research, etc.), there are many results on approximate planning in structured MDPs. Can we extract some high-level principles for understanding when planning is possible? Here is perhaps one paper to start your search. (A tabular value iteration sketch at the end of this list shows where the cost comes from.)
- Exploration with policy gradient methods. Study the recent works on combining policy gradient methods with exploration schemes. Can you improve the sample efficiency of these methods? (A REINFORCE-with-bonus sketch appears at the end of this list.)
- Horizon dependence in tabular RL. A recent line of work has shown that one can avoid horizon dependence entirely in tabular reinforcement learning, but the guarantees are suboptimal in other factors. Can you improve these results?
- Efficient algorithms for linear Bellman completeness. We'll study the "linear Bellman completeness" setting for RL with linear function approximation in the second part of the course. This setting is statistically tractable, but we do not know of any computationally efficient algorithms. Can you develop one? (The completeness condition is stated at the end of this list.)
- Survey on lower bounds for linear realizability. A flurry of recent work has studied RL with linear function approximation, mostly establishing hardness results. Can you identify an interesting class of RL problems where linear function approximation is tractable? (The realizability assumption is stated at the end of this list.)
- Square root T regret for rich observation MDPs. We will see a line of work establishing statistical tractability for rich observation MDPs, that is, those with very complex state spaces. However, most of these algorithms do not achieve the typically optimal square-root-T-type regret. One notable exception is this paper. Is there a simpler algorithm? Or do these techniques apply more generally?
- Imitation learning when the expert is too powerful. Imitation learning guarantees often assume that your policy class can closely approximate the expert, yet in applications the expert typically has some privileged information, which necessarily makes it hard to approximate. What happens in this case, and can we develop better algorithms? Here is a recent paper to start your search. (A behavioral cloning sketch appears at the end of this list.)
- Survey of offline RL upper and lower bounds with general function approximation. In class we will see one algorithm for offline reinforcement learning, but this is also a rich space with many interesting upper and lower bound arguments and some open problems. Can you identify new conditions under which offline RL is tractable? (A fitted Q-iteration sketch appears at the end of this list.)
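Illustrative sketches for some of the ideas above
For the instance-dependent bandits idea: a minimal UCB1 simulation on a two-armed Bernoulli bandit. The gap between the arm means is one natural "easiness" parameter, since UCB1's regret scales roughly like (log T)/gap on easy instances versus a sqrt(KT)-style worst case. The arm means, horizon, and constants below are illustrative choices, not part of any assignment.

```python
import numpy as np

def ucb1(means, T, rng=np.random.default_rng(0)):
    """Run UCB1 on a Bernoulli bandit with the given arm means for T rounds;
    returns the cumulative (pseudo-)regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += rng.binomial(1, means[a])
        regret += max(means) - means[a]
    return regret

# The gap 0.5 - 0.4 = 0.1 is an instance-dependent "easiness" parameter:
# larger gaps make the suboptimal arm easier to rule out.
print(ucb1([0.5, 0.4], T=10_000))
```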
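For the bandits-with-nonlinear-models idea: as a contrast with the nonlinear case, here is a minimal sketch of the optimism principle with a linear reward model (a LinUCB-style rule). The random feature matrix, regularization `lam`, confidence width `beta`, and noise level are placeholder choices for illustration.

```python
import numpy as np

def linucb(features, theta_star, T, lam=1.0, beta=1.0, noise=0.1,
           rng=np.random.default_rng(0)):
    """Optimism with a linear reward model: each round, pick the arm maximizing
    an upper confidence bound on <theta, x>. `features` is a (K, d) array."""
    K, d = features.shape
    A = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # running sum of x_t * r_t
    best = features @ theta_star
    regret = 0.0
    for t in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        widths = np.sqrt(np.einsum('kd,de,ke->k', features, A_inv, features))
        a = int(np.argmax(features @ theta_hat + beta * widths))
        x = features[a]
        r = x @ theta_star + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += r * x
        regret += best.max() - best[a]
    return regret

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
theta = rng.standard_normal(5)
print(linucb(X, theta, T=2000))
```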
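For the contextual bandit model selection idea: a rough statement of the target guarantee, written as a paraphrase, so treat the exact complexity measure and side conditions as assumptions rather than the precise form of the open problem.

```latex
% Rough shape of the model-selection goal (a paraphrase; the exact complexity
% measure, norm conditions, and constants differ across papers).
% Given nested classes $\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_M$,
% let $m^\star$ be the smallest index with the true reward function in
% $\mathcal{F}_{m^\star}$. The hope is
\mathrm{Regret}(T) \;\le\; \widetilde{O}\!\Big(\sqrt{\mathrm{comp}(\mathcal{F}_{m^\star})\, T}\Big)
\quad \text{without knowing } m^\star \text{ in advance.}
```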
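For the first-order contextual bandits idea: the schematic shape of a "first-order" bound, with constants, action-set dependence, and log factors omitted; the exact form in the paper may differ.

```latex
% Schematic shape of a "first-order" regret bound. Here $L^\star$ is the
% cumulative loss of the best policy and $\mathcal{F}$ is the regression class:
\mathrm{Regret}(T) \;\lesssim\; \sqrt{L^\star \log|\mathcal{F}|} \;+\; \log|\mathcal{F}|,
\qquad \text{which improves on } \sqrt{T \log|\mathcal{F}|} \text{ whenever } L^\star \ll T.
```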
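For the efficient planning idea: a minimal tabular value iteration sketch for a known MDP. Each sweep touches every transition entry, which is exactly the cost that becomes infeasible when the state space is enormous or only implicitly specified. The explicit arrays P and R (and the random MDP in the usage example) are assumptions of the sketch.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a known MDP.
    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Each sweep costs O(S^2 * A); this is the step that stops being feasible
    when the state space is huge or only implicitly specified."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)              # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R = rng.random((S, A))
print(value_iteration(P, R))
```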
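For the exploration-with-policy-gradient idea: a sketch of vanilla REINFORCE on a tabular softmax policy with a simple count-based bonus added to the observed rewards. This is one basic exploration scheme, not the scheme from any particular paper; `step_env` is a hypothetical environment callback, and using the full undiscounted return in the update is a simplification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_with_bonus(step_env, S, A, episodes=500, horizon=20,
                         lr=0.1, bonus_coef=0.1, rng=np.random.default_rng(0)):
    """Vanilla REINFORCE on a tabular softmax policy, with a count-based
    exploration bonus added to the observed rewards. `step_env(s, a)` is
    assumed to return (next_state, reward); episodes start in state 0."""
    theta = np.zeros((S, A))    # policy logits
    counts = np.ones(S)         # state visit counts used for the bonus
    for _ in range(episodes):
        s, grads, rewards = 0, [], []
        for _ in range(horizon):
            p = softmax(theta[s])
            a = rng.choice(A, p=p)
            s_next, r = step_env(s, a)
            counts[s] += 1
            rewards.append(r + bonus_coef / np.sqrt(counts[s]))  # shaped reward
            g = -p.copy()
            g[a] += 1.0                 # grad of log pi(a|s) w.r.t. theta[s]
            grads.append((s, g))
            s = s_next
        ret = sum(rewards)              # full undiscounted return (a simplification)
        for s_t, g in grads:
            theta[s_t] += lr * ret * g  # REINFORCE update on the shaped objective
    return theta
```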
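For the linear Bellman completeness idea: the condition stated roughly; per-timestep versions and norm restrictions vary across papers.

```latex
% Linear Bellman completeness, stated roughly: for every $w \in \mathbb{R}^d$
% there exists $w' \in \mathbb{R}^d$ such that, for all $(s, a)$,
\langle \phi(s,a), w' \rangle
  \;=\; r(s,a) \;+\; \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
        \Big[\max_{a'} \langle \phi(s',a'), w \rangle\Big].
```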
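For the linear realizability survey idea: one common version of the realizability assumption that these lower bounds target.

```latex
% Linear $Q^\star$-realizability: the optimal action-value function is linear
% in a known feature map, i.e., there exists $\theta^\star \in \mathbb{R}^d$ with
Q^\star(s,a) \;=\; \langle \phi(s,a), \theta^\star \rangle \quad \text{for all } (s,a).
% The hardness results show this assumption alone is not enough for
% sample-efficient learning.
```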
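For the imitation-learning-with-a-powerful-expert idea: a minimal behavioral cloning sketch on synthetic data. When the expert acts on privileged features that the learner never observes, even the best fit in the learner's class disagrees with the expert on a constant fraction of states. The data-generating process here is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the expert sees (x_obs, x_priv); the learner only sees x_obs.
n, d_obs, d_priv = 5000, 5, 5
x_obs = rng.standard_normal((n, d_obs))
x_priv = rng.standard_normal((n, d_priv))
w_obs = rng.standard_normal(d_obs)
w_priv = rng.standard_normal(d_priv)

# The expert's (binary) action depends heavily on the privileged part.
expert_action = (x_obs @ w_obs + 3.0 * x_priv @ w_priv > 0).astype(float)

# Behavioral cloning: least-squares fit of the expert's action on the
# observable features only. The privileged signal acts as irreducible noise,
# so even the best predictor in this class often disagrees with the expert.
w_bc, *_ = np.linalg.lstsq(x_obs, expert_action - 0.5, rcond=None)
pred = (x_obs @ w_bc > 0).astype(float)
print("agreement with expert:", (pred == expert_action).mean())
```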
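For the offline RL survey idea: a sketch of fitted Q-iteration with a linear-in-features Q-function on a fixed dataset, one standard offline baseline (not necessarily the algorithm covered in class). The `dataset` format and the feature map `phi` are assumptions of the sketch.

```python
import numpy as np

def fitted_q_iteration(dataset, phi, num_actions, gamma=0.95, iters=50):
    """Fitted Q-iteration on a fixed offline dataset of (s, a, r, s') tuples
    with a linear-in-features Q-function. `phi(s, a)` returns a d-dimensional
    feature vector; the dataset is never added to (purely offline)."""
    feats = np.array([phi(s, a) for (s, a, r, s2) in dataset])        # (n, d)
    rewards = np.array([r for (_, _, r, _) in dataset])               # (n,)
    next_feats = np.array([[phi(s2, a) for a in range(num_actions)]
                           for (_, _, _, s2) in dataset])             # (n, A, d)
    d = feats.shape[1]
    w = np.zeros(d)
    reg = 1e-3 * np.eye(d)  # small ridge term for numerical stability
    for _ in range(iters):
        targets = rewards + gamma * (next_feats @ w).max(axis=1)      # Bellman targets
        w = np.linalg.solve(feats.T @ feats + reg, feats.T @ targets) # regression step
    return w
```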