Powerful, large-scale pre-trained language models like Google’s BERT have been a game-changer in natural language processing (NLP) and beyond. The impressive achievements, however, were accompanied by huge demands on compute and memory, which made it difficult to deploy such models on devices with limited resources.
Previous studies have proposed task-agnostic BERT distillation to address this problem – an approach that aims to produce a small, general BERT model that can be fine-tuned directly, just like its teacher model (such as BERT-Base). But even task-agnostic BERT distillation is computationally expensive, due to the large-scale corpora involved and the need to perform both a forward pass for the teacher model and a forward-backward pass for the student model.
In the paper Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation, a research team from Huawei Noah’s Ark Lab and Tsinghua University proposes Extract Then Distill (ETD), a generic and flexible strategy that reuses the teacher model’s parameters for efficient task-agnostic distillation and can be applied to student models of any size.
The researchers summarize their contributions as follows:
- Propose ETD, a method that improves the efficiency of task-agnostic BERT distillation by reusing teacher parameters to initialize the student model.
- The proposed ETD method is flexible and applicable to student models of any size.
- Demonstrate the effectiveness of ETD on the GLUE and SQuAD benchmarks.
- Demonstrate that ETD is general and can be applied to various existing advanced distillation methods, such as TinyBERT and MiniLM, to further improve their performance.
- Validate that the ETD extraction process is efficient and introduces almost no additional computation.
The proposed ETD strategy consists of three stages: width-wise extraction, uniform layer selection, and transformer distillation.
Width-wise extraction extracts parameters from the teacher into a thin teacher model. The process covers feed-forward network (FFN) neurons, head neurons and hidden neurons. The extraction of hidden neurons follows the hidden-consistency principle, which ensures that hidden neurons extracted from different modules share the same position indices. The researchers propose two approaches for extracting teacher parameters: ETD-Rand extracts teacher parameters width-wise at random, while ETD-Impt does so based on importance scores. This step yields a thin teacher model with the same width as the student.
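To make the shape bookkeeping concrete, here is a minimal Python/PyTorch sketch of width-wise extraction. The helper names (select_hidden_indices, extract_linear) and the stand-in weights are our own illustration, not the authors’ released code; the sketch only shows how ETD-Rand and ETD-Impt differ in how indices are chosen, and how a single shared set of hidden indices enforces hidden consistency across modules.

```python
import torch

def select_hidden_indices(d_teacher, d_student, importance=None):
    """Pick d_student neuron indices out of d_teacher.

    ETD-Rand: importance is None -> random selection.
    ETD-Impt: importance is a (d_teacher,) score tensor -> keep the top-scoring neurons.
    """
    if importance is None:                        # ETD-Rand
        idx = torch.randperm(d_teacher)[:d_student]
    else:                                         # ETD-Impt
        idx = torch.topk(importance, d_student).indices
    return torch.sort(idx).values                 # keep the teacher's ordering

def extract_linear(weight, bias, row_idx=None, col_idx=None):
    """Slice a linear layer's weight (out_dim, in_dim) and bias along the selected indices."""
    if row_idx is not None:
        weight, bias = weight[row_idx, :], bias[row_idx]
    if col_idx is not None:
        weight = weight[:, col_idx]
    return weight, bias

# Example: shrink a 768-wide teacher FFN to a 384-wide thin teacher.
d_t, d_s, ffn_t, ffn_s = 768, 384, 3072, 1536
hidden_idx = select_hidden_indices(d_t, d_s)      # shared by every module (hidden consistency)
ffn_idx = select_hidden_indices(ffn_t, ffn_s)     # FFN intermediate neurons

w1, b1 = torch.randn(ffn_t, d_t), torch.randn(ffn_t)   # stand-in teacher FFN weights
w2, b2 = torch.randn(d_t, ffn_t), torch.randn(d_t)

w1_s, b1_s = extract_linear(w1, b1, row_idx=ffn_idx, col_idx=hidden_idx)
w2_s, b2_s = extract_linear(w2, b2, row_idx=hidden_idx, col_idx=ffn_idx)
assert w1_s.shape == (ffn_s, d_s) and w2_s.shape == (d_s, ffn_s)
```

Head neurons would be handled in the same way, slicing the attention projection matrices along the kept head dimensions while reusing hidden_idx for their input and output sides.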
After width-wise extraction, the researchers adopt a uniform layer selection strategy for depth-wise extraction. More precisely, given a thin teacher with N transformer layers and a student with M transformer layers, they uniformly choose M layers from the thin teacher model to initialize the student model.
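As a rough illustration (assuming zero-based layer indices and an N that divides evenly by M – our own convention rather than a detail taken from the paper), uniform layer selection can be sketched as:

```python
def uniform_layer_selection(n_teacher_layers: int, n_student_layers: int):
    """Pick evenly spaced layer indices from the thin teacher to initialize the student."""
    step = n_teacher_layers // n_student_layers
    return [step * (i + 1) - 1 for i in range(n_student_layers)]

# Example: a 12-layer thin teacher and a 6-layer student.
print(uniform_layer_selection(12, 6))  # [1, 3, 5, 7, 9, 11]
```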
Finally, they initialize the student with the extracted parameters and adopt the last-layer distillation strategy in ETD.
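The article does not spell out the distillation objective, so the following is only a hedged sketch of a TinyBERT-style last-layer loss – matching the student’s last-layer hidden states and attention scores to the thin teacher’s with mean-squared error – which needs no projection matrices because the extraction step already gives the student the same width as the thin teacher.

```python
import torch.nn.functional as F

def last_layer_distill_loss(student_hidden, teacher_hidden, student_attn, teacher_attn):
    """student_hidden / teacher_hidden: (batch, seq_len, hidden)
    student_attn / teacher_attn: (batch, heads, seq_len, seq_len)"""
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)  # match last-layer hidden states
    attn_loss = F.mse_loss(student_attn, teacher_attn)        # match last-layer attention scores
    return hidden_loss + attn_loss
```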
The team used English Wikipedia and the Toronto Book Corpus as their distillation datasets. They used BERT-Base (a 12-layer transformer with a hidden size of 768) as the teacher and a DistilBERT-like architecture (a 6-layer model whose width is the same as the teacher’s) as the student to test ETD’s distillation performance. They also applied ETD-Impt to popular distillation methods such as TinyBERT and MiniLM to assess the generality of ETD.
In the experiments, the ETD-Rand and ETD-Impt strategies achieved results comparable to Rand-Init while using only 43% and 28% of the computational cost, respectively. Compared with TinyBERT and MiniLM, ETD-Impt achieved similar performance with a computational cost still below 28%, validating both the efficiency and the generality of the proposed ETD approach.
The paper Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research findings. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.