https://arxiv.org/pdf/1909.11942.pdf
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Abstract
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at this https URL.
Reason for selecting this paper
I wanted to read more about model-compression work, and this is one of the most representative NLP papers.
Introduction
Question raised:
“Is having better NLP models as easy as having larger models?”
Existing full-network pre-training:
It has become common practice to pre-train large models and distill them down to smaller ones.
ALBERT incorporates two parameter reduction techniques that lift the major obstacles in scaling pre-trained models.
- factorized embedding parameterization
- cross-layer parameter sharing
Related Work
Scaling up Representation Learning for Natural Language
Using a larger hidden size, more hidden layers, and more attention heads generally yields better performance.
However, increasing the hidden size to 2048 actually degrades performance.
Cross-Layer Parameter Sharing
The idea of sharing parameters across layers has been previously explored with the Transformer architecture (Vaswani et al., 2017), but this prior work has focused on training for standard encoder-decoder tasks rather than the pretraining/finetuning setting.
Networks with cross-layer parameter sharing (Universal Transformer, UT) get better performance on language modeling and subject-verb agreement than the standard transformer.
Sentence Ordering Objectives
- ALBERT predicts the order of two consecutive text segments and uses this as a pretraining loss (SOP).
- The NSP task used in BERT is an easier task than SOP.
- SOP is more useful for certain downstream tasks.
The Elements of ALBERT
The backbone of the ALBERT architecture is similar to BERT's in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities (Hendrycks & Gimpel, 2016).
Factorized embedding parameterization.
- WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations.
- NLP usually requires the vocabulary size V to be large, which makes a full V × H embedding matrix expensive.
For ALBERT, we use a factorization of the embedding parameters, decomposing them into two smaller matrices.
Instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower dimensional embedding space of size E, and then project it to the hidden space.
The parameter reduction is significant when H >> E: the embedding parameters go from O(V × H) to O(V × E + E × H).
This makes it feasible to scale the model up toward billions of parameters without the embedding table dominating the parameter count.
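Below is a minimal PyTorch sketch of the factorization idea (the module and variable names are mine, not from the official ALBERT code): instead of a single V × H embedding table, a V × E lookup is followed by an E × H projection. With V = 30,000, H = 4096, and E = 128, the embedding parameters drop from roughly 122.9M to roughly 4.4M.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of ALBERT-style factorized embedding: V -> E -> H."""

    def __init__(self, vocab_size: int, embedding_size: int, hidden_size: int):
        super().__init__()
        # V x E lookup table (context-independent token representations)
        self.token_embedding = nn.Embedding(vocab_size, embedding_size)
        # E x H projection up to the transformer's hidden size
        self.projection = nn.Linear(embedding_size, hidden_size, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.projection(self.token_embedding(token_ids))

# Parameter-count comparison for V=30000, H=4096, E=128:
#   unfactorized: V*H       = 30000*4096            ~ 122.9M
#   factorized:   V*E + E*H = 30000*128 + 128*4096  ~ 4.4M
factorized = FactorizedEmbedding(vocab_size=30000, embedding_size=128, hidden_size=4096)
print(sum(p.numel() for p in factorized.parameters()))  # 4364288
```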
Cross-layer parameter sharing.
- The default decision for ALBERT is to share all parameters across layers
Measurements of the L2 distance and cosine similarity between a layer's input and output embeddings show that ALBERT's embeddings oscillate rather than converge, unlike Deep Equilibrium Models (DQE; Bai et al., 2019), which reach an equilibrium point where a layer's input and output embeddings stay the same.
- Transitions from layer to layer are much smoother for ALBERT than for BERT.
- This shows that the solution space for ALBERT parameters is very different from the one found by DQE.
This indicates that parameter sharing has an effect on stabilizing the network's parameters.
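A rough sketch of the all-shared strategy, assuming a PyTorch-style encoder (the class and its arguments are illustrative, not the official ALBERT implementation): a single transformer layer's weights are allocated once and applied at every depth, so the parameter count does not grow with the number of layers.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Sketch of ALBERT's all-shared strategy: one layer applied num_layers times."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12,
                 ffn_size: int = 3072, num_layers: int = 12):
        super().__init__()
        # Only ONE set of attention + FFN weights is allocated...
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, activation="gelu",
            batch_first=True)
        self.num_layers = num_layers  # ...but it is applied this many times.

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):
            hidden_states = self.shared_layer(hidden_states)
        return hidden_states

# Parameters do not grow with depth: a 12-layer and a 24-layer shared
# encoder hold exactly the same number of weights.
encoder = SharedLayerEncoder(num_layers=12)
x = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
print(encoder(x).shape)       # torch.Size([2, 16, 768])
```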
Inter-sentence coherence loss.
BERT uses an additional loss called next-sentence prediction (NSP).
However, subsequent studies (Yang et al., 2019; Liu et al., 2019) found NSP’s impact unreliable and decided to eliminate it, a decision supported by an improvement in downstream task performance across several tasks.
They judged the effect of NSP to be unreliable.
We conjecture that the main reason behind NSP’s ineffectiveness is its lack of difficulty as a task, as compared to MLM. As formulated, NSP conflates topic prediction and coherence prediction in a single task.
For ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic prediction and instead focuses on modeling inter-sentence coherence.
The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order swapped.
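A small sketch of how SOP pairs could be constructed (the function and labels below are hypothetical, not the paper's actual data pipeline): positives are two consecutive segments in document order, negatives are the same two segments with their order swapped, so topic cues alone cannot separate the two classes.

```python
import random

def make_sop_examples(document_segments, rng=None):
    """Build sentence-order-prediction pairs from consecutive segments.

    document_segments: list of text segments from ONE document, in order.
    Returns (segment_a, segment_b, label) with label 1 = correct order,
    0 = swapped order. Both classes use the same segment pair, so the
    model must rely on coherence/order cues rather than topic cues.
    """
    rng = rng or random.Random(0)
    examples = []
    for first, second in zip(document_segments, document_segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))   # positive: original order
        else:
            examples.append((second, first, 0))   # negative: order swapped
    return examples

segments = ["The cat sat on the mat.", "It then fell asleep.", "Hours later it woke up."]
for a, b, label in make_sop_examples(segments):
    print(label, "|", a, "->", b)
```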
Model Setup
Overall Comparison between BERT and ALBERT
ALBERT-xxlarge achieves significant improvements over BERT-large, as measured by the difference on development set scores for several representative downstream tasks: SQuADv1.1 (+1.9%), SQuAD v2.0 (+3.1%), MNLI (+1.4%), SST-2 (+2.2%), and RACE (+8.4%).
ALBERT models have far fewer parameters than the corresponding BERT models (ALBERT-xxlarge has only about 70% of BERT-large's parameters).
FACTORIZED EMBEDDING PARAMETERIZATION
An embedding size of E = 128 was found to work best and is used in all subsequent settings.
CROSS-LAYER PARAMETER SHARING
We choose all-shared strategy as our default choice.
WHAT IF WE TRAIN FOR THE SAME AMOUNT OF TIME?
After training for roughly the same amount of time, ALBERT-xxlarge is significantly better than BERT-large: +1.5% better on Avg, with the difference on RACE as high as +5.2%.
ADDITIONAL TRAINING DATA AND DROPOUT EFFECTS
DISCUSSION
• ALBERT-xxlarge has fewer parameters than BERT-large and achieves significantly better results, but it is computationally more expensive because of its larger structure.
• An important next step is therefore to speed up ALBERT's training and inference through methods such as sparse attention and block attention.
• Also, while there is convincing evidence that SOP leads to better language representations, there may be additional dimensions not yet captured by the current self-supervised training losses that could give the resulting representations additional representational power.