ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 [1] and GPT-3 [2] have shown that scaling up pre-trained language models can improve their generalization abilities. In particular, the GPT-3 model with 175 billion parameters shows strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way. As a result, the traditional fine-tuning approach shows relatively weak performance on downstream language understanding tasks. To solve these problems, we propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Empirical results show that the model outperforms the state-of-the-art models on 54 Chinese NLP tasks, and that its English version achieves first place on the SuperGLUE [3] benchmark (July 3, 2021), surpassing human performance by 0.8 percentage points (90.6% vs. 89.8%).
Related Works
Large-scale Pre-trained Models
Knowledge Enhanced Models
Universal Representation Module
ERNIE 3.0 uses a multi-layer Transformer-XL [34] as its backbone network, as do other pre-trained models such as XLNet [35], Segatron [36] and ERNIE-Doc [37]. Transformer-XL is similar to the Transformer but introduces an auxiliary recurrence memory module to help model longer texts.
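To make the recurrence memory idea concrete, the following is a minimal sketch rather than the ERNIE 3.0 implementation. It assumes a long sequence is processed in fixed-length segments and that each segment can additionally attend to a cache of the most recent states from earlier segments. The function name `iter_segments_with_memory` and the parameters `seg_len` and `mem_len` are hypothetical, and plain integers stand in for hidden states.

```python
from typing import List, Sequence, Tuple


def iter_segments_with_memory(
    states: Sequence[int], seg_len: int, mem_len: int
) -> List[Tuple[List[int], List[int]]]:
    """Split a long sequence into segments and pair each with its cached memory.

    Returns a list of (memory, segment) pairs. In a real Transformer-XL the
    memory would hold hidden states from earlier segments, reused as extra
    context; here plain integers stand in for those states (an assumption
    made purely for illustration).
    """
    if seg_len <= 0 or mem_len < 0:
        raise ValueError("seg_len must be positive and mem_len non-negative")
    memory: List[int] = []
    pairs: List[Tuple[List[int], List[int]]] = []
    for start in range(0, len(states), seg_len):
        segment = list(states[start:start + seg_len])
        # The current segment sees the cached memory plus its own positions.
        pairs.append((list(memory), segment))
        # Keep only the most recent mem_len states for the next segment.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return pairs


pairs = iter_segments_with_memory(list(range(10)), seg_len=4, mem_len=3)
for memory, segment in pairs:
    print("memory:", memory, "segment:", segment)
```

With a segment length of 4 and a memory length of 3, the second segment is paired with the last three states of the first, which is how context can extend beyond a single segment.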
Task-specific Representation Module
The task-specific representation module is also a multi-layer Transformer-XL; it captures the top-level semantic representations for different task paradigms.
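One way to picture how separate modules can serve understanding and generation paradigms is through their attention masks. The sketch below assumes, as is usual for auto-encoding versus auto-regressive networks, that an understanding-oriented module attends in both directions while a generation-oriented module attends only to earlier positions. The text here does not specify the masks, and the function names are hypothetical.

```python
from typing import List


def bidirectional_mask(n: int) -> List[List[int]]:
    """Every position may attend to every other position (auto-encoding style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[int]]:
    """Position i may attend only to positions j <= i (auto-regressive style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Rows are query positions, columns are key positions; 1 means "may attend".
for row in causal_mask(4):
    print(row)
```

Under this assumption the two paradigms can share the same lower layers and differ only in which mask the upper, task-specific layers apply.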
Pre-training Tasks
Word-aware Pre-training Tasks
Knowledge Masked Language Modeling
Document Language Modeling
Structure-aware Pre-training Tasks
Sentence Reordering
Sentence Distance
Knowledge-aware Pre-training Tasks
Universal Knowledge-Text Prediction
The universal knowledge-text prediction (UKTP) task is an extension of knowledge masked language modeling. While knowledge masked language modeling requires only unstructured texts, the universal knowledge-text prediction task requires both unstructured texts and knowledge graphs.
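As a rough illustration of how a training example might combine a knowledge graph triple with unstructured text, consider the sketch below. The passage above states only that both sources are required; the masking policy used here (hide either the relation of the triple or some words of the sentence), the `[MASK]` and `[SEP]` tokens, the function `build_uktp_example` and the sample triple are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask token
SEP = "[SEP]"    # hypothetical separator between triple and sentence


def build_uktp_example(
    triple: Tuple[str, str, str],
    sentence: Sequence[str],
    mask_relation: bool,
    rng: random.Random,
    word_mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (inputs, labels); labels hold the original token at masked slots."""
    head, relation, tail = triple
    tokens = [head, relation, tail, SEP] + list(sentence)
    if mask_relation:
        # Predict the relation of the triple from the accompanying sentence.
        masked_positions = {1}
    else:
        # Predict sentence words with the intact triple as extra context.
        offset = 4  # sentence tokens start after head, relation, tail, SEP
        masked_positions = {
            offset + i
            for i in range(len(sentence))
            if rng.random() < word_mask_prob
        }
        if not masked_positions and sentence:
            # Guarantee at least one prediction target.
            masked_positions = {offset + rng.randrange(len(sentence))}
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return inputs, labels


triple = ("Andersen", "write", "Nightingale")  # illustrative triple
sentence = "The Nightingale was written by Andersen".split()
rel_inputs, rel_labels = build_uktp_example(triple, sentence, True, random.Random(0))
word_inputs, word_labels = build_uktp_example(triple, sentence, False, random.Random(0))
print(rel_inputs)
print(word_inputs)
```

In the first call the model would have to recover the relation from the sentence; in the second it would have to recover masked words with the triple available as context.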
In detail, we build the corpus for ERNIE 3.0 from the ERNIE 2.0 corpus (including Baike, Wikipedia, feed, etc.), Baidu Search (including Baijiahao, Zhidao, Tieba and Experience), Web text, QA-long, QA-short, Poetry & Couplet, domain-specific data from the medical, law and financial areas, and the Baidu knowledge graph with more than 50 million facts.
