Poster in Workshop: AI for Science

LinkBERT: Language Model Pretraining with Document Link Knowledge

Michihiro Yasunaga · Jure Leskovec · Percy Liang


Abstract:

Language model (LM) pretraining can learn a wide range of knowledge from text corpora, which helps downstream tasks. However, existing methods such as BERT model a single document and do not capture dependencies or knowledge that span documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks and citation links. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on diverse downstream tasks in both the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). In particular, LinkBERT is effective for knowledge- and reasoning-intensive tasks such as multi-hop reasoning and few-shot inference (+7% absolute gain on BioASQ and MedQA), and achieves new state-of-the-art results on various biomedical NLP tasks, including relation extraction and literature classification. Our results suggest the promise of LinkBERT for scientific applications.
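
The input construction described in the abstract (treating the corpus as a document graph and placing linked documents in the same context) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the label set for document relation prediction, so the three-way scheme here (contiguous, random, linked) is assumed, and `build_pairs`, the `docs` and `links` dictionaries, and the segment format are all hypothetical names. Masked language modeling would be applied on top of the resulting pairs and is not shown.

```python
import random

# Assumed relation labels for document relation prediction (illustrative only).
CONTIGUOUS, RANDOM, LINKED = 0, 1, 2


def build_pairs(docs, links, rng):
    """Build (anchor, partner, label) training pairs from a document graph.

    docs:  dict mapping doc_id -> list of text segments (hypothetical format)
    links: dict mapping doc_id -> list of doc_ids it links to (e.g. hyperlinks)
    rng:   a random.Random instance, passed in for reproducibility
    """
    doc_ids = sorted(d for d in docs if docs[d])  # skip empty documents
    pairs = []
    for doc_id in doc_ids:
        segments = docs[doc_id]
        # Neighbours in the document graph that actually exist in the corpus.
        linked = [d for d in links.get(doc_id, []) if d in doc_ids and d != doc_id]
        # Documents that are neither this one nor linked from it.
        unlinked = [d for d in doc_ids if d != doc_id and d not in linked]

        for i, anchor in enumerate(segments):
            # Only offer the relations that are possible for this anchor.
            options = []
            if i + 1 < len(segments):
                options.append(CONTIGUOUS)
            if unlinked:
                options.append(RANDOM)
            if linked:
                options.append(LINKED)
            if not options:
                continue

            label = rng.choice(options)
            if label == CONTIGUOUS:
                partner = segments[i + 1]                   # next segment, same document
            elif label == LINKED:
                partner = rng.choice(docs[rng.choice(linked)])    # segment of a linked document
            else:
                partner = rng.choice(docs[rng.choice(unlinked)])  # segment of an unrelated document
            pairs.append((anchor, partner, label))
    return pairs


if __name__ == "__main__":
    toy_docs = {"a": ["a1", "a2"], "b": ["b1"], "c": ["c1"]}
    toy_links = {"a": ["b"]}  # document "a" links to document "b"
    for anchor, partner, label in build_pairs(toy_docs, toy_links, random.Random(0)):
        print(anchor, partner, label)
```

In a real pipeline each pair would be tokenized into a single sequence, with the label serving as the classification target for the relation prediction objective alongside the masked-token targets.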
