The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery—a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.
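To make the surprise signal concrete: one standard way to quantify an epistemic shift of this kind, sketched below purely as an illustration (not necessarily AutoDiscovery's actual implementation), is the KL divergence from the LLM's prior Beta belief that a hypothesis holds to its posterior belief after observing experimental outcomes. All function names here are ours.

```python
from scipy.special import betaln, digamma

def beta_kl(a1, b1, a2, b2):
    """KL divergence KL(Beta(a1, b1) || Beta(a2, b2)) in closed form."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def bayesian_surprise(prior, successes, failures):
    """Shift from prior to posterior belief after observing experimental outcomes."""
    a, b = prior
    return beta_kl(a + successes, b + failures, a, b)

# An outcome that contradicts a confident prior is highly surprising and would
# earn a large MCTS reward; a confirming outcome earns a reward near zero.
print(bayesian_surprise(prior=(8, 2), successes=1, failures=9))  # high surprise
print(bayesian_surprise(prior=(8, 2), successes=9, failures=1))  # low surprise
```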
@inproceedings{agarwal2025autodiscovery,title={AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise},author={Agarwal, Dhruv and Majumder, Bodhisattwa Prasad and Adamson, Reece and Chakravorty, Megha and Gavireddy, Satvika Reddy and Parashar, Aditya and Surana, Harshit and Mishra, Bhavana Dalvi and McCallum, Andrew and Sabharwal, Ashish and Clark, Peter},booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},year={2025},}
ICLR
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Bodhisattwa Prasad Majumder*, Harshit Surana*, Dhruv Agarwal*, Bhavana Dalvi Mishra*, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark
In The Thirteenth International Conference on Learning Representations, 2025
Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations on data-driven workflows that are not covered in the manually collected split. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
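The task formalism described above (a dataset, its metadata, and a natural-language discovery goal) suggests a simple record structure. The sketch below is illustrative only; the field names are our assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DiscoveryTask:
    """One data-driven discovery task (illustrative; field names assumed)."""
    dataset_paths: list[str]      # one or more data files provided to the system
    metadata: dict                # e.g., column descriptions and units
    goal: str                     # discovery goal in natural language
    domain: str = "unspecified"   # e.g., "sociology" or "engineering"

task = DiscoveryTask(
    dataset_paths=["sociology/survey.csv"],
    metadata={"columns": {"income": "household income (USD)", "region": "US census region"}},
    goal="How does household income relate to reported life satisfaction across regions?",
)
```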
@inproceedings{majumder2025discoverybench,title={DiscoveryBench: Towards Data-Driven Discovery with Large Language Models},author={Majumder, Bodhisattwa Prasad and Surana, Harshit and Agarwal, Dhruv and Mishra, Bhavana Dalvi and Meena, Abhijeetsingh and Prakhar, Aryan and Vora, Tirth and Khot, Tushar and Sabharwal, Ashish and Clark, Peter},booktitle={The Thirteenth International Conference on Learning Representations},year={2025},}
ICLR
Searching for Optimal Solutions with LLMs via Bayesian Optimization
Dhruv Agarwal, Manoj Ghuhan Arivazhagan, Rajarshi Das, Sandesh Swamy, Sopan Khosla, and Rashmi Gangadharaiah
In The Thirteenth International Conference on Learning Representations, 2025
Scaling test-time compute to search for optimal solutions is an important step towards building generally-capable language models that can reason. Recent work, however, shows that tasks of varying complexity require distinct search strategies to solve optimally, thus making it challenging to design a one-size-fits-all approach. Prior solutions either attempt to predict task difficulty to select the optimal search strategy, which is often infeasible in practice, or use a static, pre-defined strategy, e.g., repeated parallel sampling or greedy sequential search, which is sub-optimal. In this work, we argue for an alternative view using the probabilistic framework of Bayesian optimization (BO), where the search strategy is adapted dynamically based on the evolving uncertainty estimates of solutions as search progresses. To this end, we introduce Bayesian-OPRO (BOPRO), a generalization of a recent method for in-context optimization that iteratively samples from new proposal distributions by modifying the prompt to the LLM with a subset of its previous generations, selected to explore or exploit different parts of the search space. We evaluate our method on word search, molecule optimization, and a joint hypothesis+program search task using a 1-D version of the challenging Abstraction and Reasoning Corpus (1D-ARC). Our results show that BOPRO outperforms all baselines in word search (≥10 points) and molecule optimization (higher quality and 17% fewer invalid molecules), but trails a best-k prompting strategy in program search. Our analysis reveals that despite BOPRO's ability to balance exploration and exploitation, failure is likely due to the inability of code representation models to distinguish sequences with low edit distances.
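A minimal sketch of one iteration in this spirit, assuming a text encoder `embed` and a black-box objective `score` (both placeholders, not part of the paper): a Gaussian-process surrogate over embeddings of past generations scores the LLM's new samples by expected improvement. BOPRO itself additionally feeds the selected generations back into the prompt to form the next proposal distribution, which is omitted here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    """EI acquisition function for maximization."""
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_step(candidates, embed, score, history):
    """Pick and evaluate the most promising of the LLM's new generations.

    candidates: strings freshly sampled from the LLM
    embed:      text -> feature vector (assumed: any sentence encoder)
    score:      black-box objective, called once per evaluated string
    history:    list of (string, score) pairs from earlier iterations
    """
    X = np.array([embed(s) for s, _ in history])
    y = np.array([v for _, v in history])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(np.array([embed(s) for s in candidates]), return_std=True)
    best = candidates[int(np.argmax(expected_improvement(mu, sigma, y.max())))]
    history.append((best, score(best)))
    return best
```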
@inproceedings{agarwal2025searching,title={Searching for Optimal Solutions with {LLM}s via Bayesian Optimization},author={Agarwal, Dhruv and Arivazhagan, Manoj Ghuhan and Das, Rajarshi and Swamy, Sandesh and Khosla, Sopan and Gangadharaiah, Rashmi},booktitle={The Thirteenth International Conference on Learning Representations},year={2025},}
arXiv
MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time
Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, and Andrew McCallum
In arXiv Preprint, 2025
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains, namely word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC), and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
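The mixed-policy group construction can be pictured as follows; the helper names and group proportions are our assumptions for illustration, not the paper's exact procedure.

```python
import random

def build_group(policy_sample, mutate, archive, group_size=8, n_greedy=2, n_neighbor=2):
    """Assemble one GRPO group from on- and off-policy completions (illustrative).

    policy_sample: () -> completion drawn from the current policy
    mutate:        completion -> structurally similar variant (neighborhood sampling)
    archive:       list of (completion, reward) pairs from earlier steps
    """
    group = []
    # Greedy sampling: replay the top-scoring past completions (off-policy).
    top = sorted(archive, key=lambda cr: cr[1], reverse=True)[:n_greedy]
    group += [c for c, _ in top]
    # Neighborhood sampling: perturb high-reward completions (off-policy).
    group += [mutate(c) for c, _ in random.sample(top, min(n_neighbor, len(top)))]
    # Fill the remainder on-policy to preserve exploration.
    group += [policy_sample() for _ in range(group_size - len(group))]
    return group
```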
@article{phan2025migrate,title={MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time},author={Phan, Peter and Agarwal, Dhruv and Srinivas, Kavitha and Samulowitz, Horst and Kapanipathi, Pavan and McCallum, Andrew},journal={arXiv Preprint},year={2025},}
2024
ICML
Position: Data-driven Discovery with Large Generative Models
Bodhisattwa Prasad Majumder*, Harshit Surana*, Dhruv Agarwal*, Sanchaita Hazra, Ashish Sabharwal, and Peter Clark
In Forty-first International Conference on Machine Learning, 2024
With the accumulation of data at an unprecedented rate, its potential to fuel scientific discovery is growing exponentially. This position paper urges the Machine Learning (ML) community to exploit the capabilities of large generative models (LGMs) to develop automated systems for end-to-end data-driven discovery—a paradigm encompassing the search and verification of hypotheses purely from a set of provided datasets, without the need for additional data collection or physical experiments. We first outline several desiderata for an ideal data-driven discovery system. Then, through DataVoyager, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata—a feat previously unattainable—while also highlighting important limitations in the current system that open up opportunities for novel ML research. We contend that achieving accurate, reliable, and robust end-to-end discovery systems solely through the current capabilities of LGMs is challenging. We instead advocate for fail-proof tool integration, along with active user moderation through feedback mechanisms, to foster data-driven scientific discoveries with efficiency and reproducibility.
@inproceedings{majumder2024position,title={Position: Data-driven Discovery with Large Generative Models},author={Majumder, Bodhisattwa Prasad and Surana, Harshit and Agarwal, Dhruv and Hazra, Sanchaita and Sabharwal, Ashish and Clark, Peter},booktitle={Forty-first International Conference on Machine Learning},year={2024},}
NAACL (Findings)
Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA
Dhruv Agarwal, Rajarshi Das, Sopan Khosla, and Rashmi Gangadharaiah
In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024
We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day—attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration—starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. Exploration in BYOKG leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to synthesize programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in zero-shot QA accuracy of 27.89 and 59.88 F1 on GrailQA and MetaQA, respectively. We further find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA. Lastly, we verify our universality claim by evaluating BYOKG on a domain-specific materials science KG and show that it improves zero-shot performance by 46.33 F1.
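A toy version of the exploration loop described above, over a hand-built two-node graph; the program template and question verbalization are naive stand-ins for BYOKG's LLM-backed symbolic agent.

```python
import random

def explore(kg, n_exemplars=5, seed=0):
    """Random-walk exploration emitting (question, program, answer) exemplars.

    kg: dict mapping a node to a list of (relation, neighbor) edges
    """
    rng = random.Random(seed)
    exemplars = []
    for _ in range(n_exemplars):
        start = rng.choice(list(kg))
        relation, answer = rng.choice(kg[start])
        program = f"(JOIN {relation} {start})"            # toy one-hop program
        question = f"What is the {relation} of {start}?"  # naive verbalization
        exemplars.append((question, program, answer))
    return exemplars

toy_kg = {"Radium": [("discovered_by", "Marie Curie")],
          "Marie Curie": [("born_in", "Warsaw")]}
print(explore(toy_kg, n_exemplars=2))
```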
@inproceedings{agarwal-etal-2024-bring,title={Bring Your Own {KG}: Self-Supervised Program Synthesis for Zero-Shot {KGQA}},author={Agarwal, Dhruv and Das, Rajarshi and Khosla, Sopan and Gangadharaiah, Rashmi},editor={Duh, Kevin and Gomez, Helena and Bethard, Steven},booktitle={Findings of the Association for Computational Linguistics: NAACL 2024},month=jun,year={2024},address={Mexico City, Mexico},publisher={Association for Computational Linguistics},doi={10.18653/v1/2024.findings-naacl.57},pages={896--919},}
2023
EMNLP (Findings)
Machine Reading Comprehension using Case-based Reasoning
Dung Thai, Dhruv Agarwal, Mudit Chaudhary, Wenlong Zhao, Rajarshi Das, Jay-Yoon Lee, Hannaneh Hajishirzi, Manzil Zaheer, and Andrew McCallum
In Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023
We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparametric memory and then predicts an answer by selecting the span in the test context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows it to attribute a prediction to the specific set of evidence cases, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC provides accuracy comparable to that of large reader models and outperforms baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability of CBR-MRC to identify not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find that CBR-MRC is robust to these variations while the performance of fully-parametric methods drops.
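The core selection step admits a compact sketch: given contextualized answer vectors retrieved from similar cases, pick the test-context span whose representation is closest to their mean. The encoder and case memory are assumed inputs here, not part of the paper's code.

```python
import numpy as np

def select_answer_span(test_spans, span_embed, case_answer_embs):
    """Return the candidate span most similar to the retrieved cases' answers.

    test_spans:       candidate answer spans from the test context
    span_embed:       span -> contextualized vector (assumed: any reader encoder)
    case_answer_embs: answer vectors from the retrieved similar cases
    """
    prototype = np.mean(case_answer_embs, axis=0)  # prototype answer representation
    def cosine(v):
        return float(np.dot(v, prototype) / (np.linalg.norm(v) * np.linalg.norm(prototype)))
    return max(test_spans, key=lambda s: cosine(span_embed(s)))
```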
@inproceedings{thai-etal-2023-machine,title={Machine Reading Comprehension using Case-based Reasoning},author={Thai, Dung and Agarwal, Dhruv and Chaudhary, Mudit and Zhao, Wenlong and Das, Rajarshi and Lee, Jay-Yoon and Hajishirzi, Hannaneh and Zaheer, Manzil and McCallum, Andrew},editor={Bouamor, Houda and Pino, Juan and Bali, Kalika},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},month=dec,year={2023},address={Singapore},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.findings-emnlp.564},pages={8414--8428},}
2022
NAACL
Entity Linking via Explicit Mention-Mention Coreference Modeling
Dhruv Agarwal, Rico Angell, Nicholas Monath, and Andrew McCallum
In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul 2022
Learning representations of entity mentions is a core component of modern entity linking systems for both candidate generation and making linking predictions. In this paper, we present and empirically analyze a novel training approach for learning mention and entity representations that is based on building minimum spanning arborescences (i.e., directed spanning trees) over mentions and entities across documents to explicitly model mention coreference relationships. We demonstrate the efficacy of our approach by showing significant improvements in both candidate generation recall and linking accuracy on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset. In addition, we show that our improvements in candidate generation yield higher quality re-ranking models downstream, setting a new SOTA result in linking accuracy on MedMentions. Finally, we demonstrate that our improved mention representations are also effective for the discovery of new entities via cross-document coreference.
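For intuition, a minimum spanning arborescence over a directed mention/entity graph can be computed with Edmonds' algorithm; the toy graph and weights below are invented, with lower weight standing in for higher affinity.

```python
import networkx as nx

# Toy graph: one entity node and three mention nodes; edge weights are invented
# negative-affinity costs (lower = more similar).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("entity:Aspirin", "m1", 0.2),  # entity -> mention linking edges
    ("entity:Aspirin", "m2", 0.9),
    ("m1", "m2", 0.1),              # mention -> mention coreference edges
    ("m1", "m3", 0.3),
    ("m2", "m3", 0.8),
])

# Edmonds' algorithm gives each mention exactly one parent, yielding coreference
# trees rooted at entities.
tree = nx.minimum_spanning_arborescence(G)
print(sorted(tree.edges(data="weight")))
```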
@inproceedings{agarwal-etal-2022-entity,title={Entity Linking via Explicit Mention-Mention Coreference Modeling},author={Agarwal, Dhruv and Angell, Rico and Monath, Nicholas and McCallum, Andrew},editor={Carpuat, Marine and de Marneffe, Marie-Catherine and Meza Ruiz, Ivan Vladimir},booktitle={Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},month=jul,year={2022},address={Seattle, United States},publisher={Association for Computational Linguistics},doi={10.18653/v1/2022.naacl-main.343},pages={4644--4658},}
Patent
Method, system, and non-transitory computer readable medium for an artificial intelligence based room assignment optimization system
Andrew Vakhutinsky, Setareh Borjian Boroujeni, Saraswati Yagnavajhala, Jorge Luis Rivero Perez, Dhruv Agarwal, and Akash Chatterjee
Embodiments provide optimized room assignments for a hotel in response to receiving a plurality of hard constraints and soft constraints, along with reservation preferences and room features. The optimization includes determining a guest satisfaction assignment cost based on the reservation preferences and room features, determining an operational efficiency assignment cost, generating a weighted cost matrix based on the guest satisfaction assignment cost and the operational efficiency assignment cost, and generating preliminary room assignments based on the weighted cost matrix. When the preliminary room assignments are feasible, they are the optimized room assignments, comprising a feasible selection of elements of the matrix. When the preliminary room assignments are infeasible, embodiments relax one or more constraints and repeat the optimization until the preliminary room assignments are feasible.
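The weighted-cost assignment at the heart of this approach can be illustrated with a standard linear assignment solver; the matrices, weight, and big-M constraint encoding below are our assumptions, and the patent does not specify this particular algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrices (reservations x rooms); all values invented for illustration.
guest_satisfaction = np.array([[1.0, 4.0, 2.0],
                               [3.0, 1.0, 5.0],
                               [2.0, 2.0, 1.0]])
operational_cost = np.array([[0.5, 0.0, 1.0],
                             [0.0, 2.0, 0.5],
                             [1.0, 0.5, 0.0]])

# Weighted cost matrix combining the two objectives; the weight is a tunable
# business parameter (value assumed here).
w = 0.7
cost = w * guest_satisfaction + (1 - w) * operational_cost

# One way to encode a hard constraint is a prohibitive cell cost; relaxing the
# constraint would restore the original value.
cost[0, 2] = 1e6  # e.g., reservation 0 may not take room 2

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
print(list(zip(rows.tolist(), cols.tolist())), float(cost[rows, cols].sum()))
```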
@misc{vakhutinsky2022method,title={Method, system, and non-transitory computer readable medium for an artificial intelligence based room assignment optimization system},author={Vakhutinsky, Andrew and Boroujeni, Setareh Borjian and Yagnavajhala, Saraswati and Perez, Jorge Luis Rivero and Agarwal, Dhruv and Chatterjee, Akash},year={2022},month=nov,publisher={Google Patents},note={Oracle International Corp, US Patent 11,514,374},}