# awesome-pretrained-models-for-information-retrieval

A curated list of awesome papers on pre-trained models for information retrieval (a.k.a. pre-training for IR). If I have missed any papers, please let me know! Any feedback and contributions are welcome!

Pre-training for IR

For readers who want to acquire basic and advanced knowledge about neural models for information retrieval, or to try some neural models by hand, we recommend the NeuIR survey below and the text-matching toolkit MatchZoo-py.

- Survey Papers
- First Stage Retrieval
  - Sparse Retrieval
    - Neural term re-weighting
    - Query or document expansion
    - Sparse representation learning
  - Dense Retrieval
    - Hard negative sampling
    - Late interaction and multi-vector representation
    - Knowledge distillation
    - Jointly learning retrieval and indexing
    - Domain adaptation
    - Pre-training tailored for dense retrieval
    - Dense retrieval in open domain QA
  - Combining Sparse Retrieval and Dense Retrieval
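The first-stage outline above distinguishes single-vector dense retrieval from late interaction and multi-vector representation. As a rough, hedged illustration of the scoring difference, here is a minimal pure-Python sketch; the embeddings are hand-made toy vectors, not the output of any real encoder, and all function names are illustrative:

```python
# Toy sketch: single-vector dense scoring vs. ColBERT-style late interaction.
# Vectors below are hand-made toy values, not outputs of a trained model.

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_score(query_vec, doc_vec):
    """Single-vector dense retrieval: one embedding per text, one dot product."""
    return dot(query_vec, doc_vec)

def late_interaction_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style MaxSim: for each query token embedding, take its maximum
    similarity over all document token embeddings, then sum over query tokens."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

if __name__ == "__main__":
    q_vec = [1.0, 0.0]
    d_vec = [0.8, 0.2]
    print(dense_score(q_vec, d_vec))  # 0.8

    q_tokens = [[1.0, 0.0], [0.0, 1.0]]
    d_tokens = [[0.9, 0.1], [0.2, 0.7]]
    # per-query-token maxima are 0.9 and 0.7, so the score is about 1.6
    print(late_interaction_score(q_tokens, d_tokens))
```

Late interaction keeps one vector per token and defers the query-document interaction to a cheap MaxSim step, which is why it sits between single-vector dense retrieval and full cross-attention re-ranking in cost.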

- Re-ranking Stage
  - Basic Usage
    - Discriminative ranking models
      - Representation-focused
      - Interaction-focused
    - Generative ranking models
    - Hybrid ranking models
  - Long Document Processing Techniques
    - Passage score aggregation
    - Passage representation aggregation
    - Designing new architectures
  - Improving Efficiency
    - Decoupling the interaction
    - Knowledge distillation
    - Early exit
  - Re-weighting Training Samples
  - Query Expansion
  - Partial Fine-tuning
  - Pre-training Tailored for Re-ranking
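As a loose illustration of the passage score aggregation idea listed under long document processing, here is a toy sketch: split a long document into passages, score each passage against the query, and aggregate. The `overlap_score` scorer is a hypothetical stand-in for a BERT-style re-ranker, and the FirstP/MaxP/SumP labels follow common usage in the long-document re-ranking literature:

```python
# Toy sketch of passage score aggregation for long documents:
# chunk the document, score each chunk, aggregate into one document score.

def split_into_passages(tokens, passage_len):
    """Chunk a token list into consecutive passages of at most passage_len."""
    return [tokens[i:i + passage_len] for i in range(0, len(tokens), passage_len)]

def overlap_score(query_tokens, passage_tokens):
    """Stand-in scorer: count of query tokens present in the passage.
    A real system would run a BERT-style cross-encoder here."""
    passage_set = set(passage_tokens)
    return sum(1 for t in query_tokens if t in passage_set)

def document_score(query_tokens, doc_tokens, passage_len=4, agg="max"):
    """Aggregate per-passage scores into a single document score."""
    scores = [overlap_score(query_tokens, p)
              for p in split_into_passages(doc_tokens, passage_len)]
    if agg == "first":   # FirstP: score of the first passage only
        return scores[0]
    if agg == "max":     # MaxP: best-scoring passage represents the document
        return max(scores)
    if agg == "sum":     # SumP: accumulate evidence across passages
        return sum(scores)
    raise ValueError(f"unknown aggregation: {agg}")

if __name__ == "__main__":
    doc = ["intro", "text", "about", "nothing",
           "neural", "models", "for", "retrieval"]
    query = ["neural", "retrieval"]
    print(document_score(query, doc, passage_len=4, agg="max"))    # 2
    print(document_score(query, doc, passage_len=4, agg="first"))  # 0
```

MaxP-style aggregation is often reported to work well because relevance evidence in a long document tends to concentrate in a few passages.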

- Cross-lingual Retrieval
- Jointly Learning to Retrieve and Re-rank
- Model-based IR System
- Multimodal Retrieval
  - Unified Single-stream Architecture
  - Multi-stream Architecture Applied on Input
- Other Resources
  - Some Retrieval Toolkits
  - Other Resources About Pre-trained Models in NLP
  - Surveys About Efficient Transformers