@inproceedings{drozdov-etal-2022-cant,
    title = "You can{'}t pick your neighbors, or can you? When and How to Rely on Retrieval in the k{NN}-{LM}",
    author = "Drozdov, Andrew  and
      Wang, Shufan  and
      Rahimi, Razieh  and
      McCallum, Andrew  and
      Zamani, Hamed  and
      Iyyer, Mohit",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.218",
    doi = "10.18653/v1/2022.findings-emnlp.218",
    pages = "2997--3007",
    abstract = "Retrieval-enhanced language models (LMs), which condition their predictions on text retrieved from large external datastores, have recently shown significant perplexity improvements compared to standard LMs. One such approach, the kNN-LM, interpolates any existing LM{'}s predictions with the output of a k-nearest neighbors model and requires no additional training. In this paper, we explore the importance of lexical and semantic matching in the context of items retrieved by the kNN-LM. We find two trends: (1) the presence of large overlapping n-grams between the datastore and evaluation set is an important factor in strong performance, even when the datastore is derived from the training data; and (2) the kNN-LM is most beneficial when retrieved items have high semantic similarity with the query. Based on our analysis, we define a new formulation of the kNN-LM that uses retrieval quality to assign the interpolation coefficient. We empirically measure the effectiveness of our approach on two English language modeling datasets, Wikitext-103 and PG-19. Our re-formulation of the kNN-LM is beneficial in both cases, and leads to nearly 4{\%} improvement in perplexity on the Wikitext-103 test set.",
}
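
A minimal worked equation may help situate this entry. The kNN-LM (Khandelwal et al., 2020) scores the next token as a fixed-coefficient mixture of a base LM distribution and a k-nearest-neighbors distribution; the reformulation described in the abstract above instead assigns the coefficient from retrieval quality. This is a sketch for exposition only: the symbols $y$ (next token), $c$ (context), $\lambda$ (interpolation coefficient), and the quality score $q(c)$ are naming assumptions, not notation taken from the paper.

    % Standard kNN-LM: fixed interpolation coefficient \lambda
    p(y \mid c) = \lambda \, p_{\mathrm{kNN}}(y \mid c)
                + (1 - \lambda) \, p_{\mathrm{LM}}(y \mid c)

    % Reformulation sketched in the abstract: the coefficient becomes a
    % function of a per-context retrieval-quality signal q(c), e.g. the
    % semantic similarity of the retrieved neighbors to the query context
    p(y \mid c) = \lambda(q(c)) \, p_{\mathrm{kNN}}(y \mid c)
                + (1 - \lambda(q(c))) \, p_{\mathrm{LM}}(y \mid c)

Finding (2) in the abstract motivates the second form: since the kNN-LM helps most when retrieved items are semantically similar to the query, $\lambda(q(c))$ should grow with retrieval quality and shrink toward the plain LM when the neighbors are poor.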