Deep neural networks (DNNs) underpin state-of-the-art applications of artificial intelligence (AI) in almost all fields, such as image, speech and natural language processing. However, DNN architectures are often hungry for data, compute, memory, power and energy, typically requiring powerful GPUs or large-scale clusters to train and deploy, and have therefore been criticised as a "non-green" technology. Furthermore, the best-performing models are often ensembles of hundreds or thousands of base-level models. The space required to store these cumbersome models, and the time required to execute them at run-time, severely limit their use in applications with constrained memory, storage or computational power, such as mobile devices or sensor networks, and in applications where real-time predictions are needed.
Knowledge distillation, a model compression method for deep neural networks, transfers the knowledge from a teacher network (a cumbersome model) to a student network (a small model). It is therefore a promising technique for NLP, where almost all current systems rely on cumbersome DNN architectures. Knowledge distillation has already been successfully applied to the state-of-the-art speech synthesis model WaveNet, which generates realistic-sounding voices for the Google Assistant; the resulting production model is more than 1,000 times faster than the original while delivering higher quality. However, for NLP tasks that use cumbersome DNNs (e.g. neural machine translation), distilling knowledge is more challenging and differs substantially from the speech setting.
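To make the teacher-to-student transfer concrete, the following sketch shows the core mechanism behind most distillation recipes: softening the teacher's output distribution with a temperature T, so that the student can also learn from the relative probabilities the teacher assigns to wrong classes. The logit values are made-up illustrative numbers, not outputs of any real model.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Convert logits to probabilities; a higher temperature T spreads
    probability mass onto the non-argmax classes, exposing the teacher's
    'dark knowledge' about how classes relate to each other."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [6.0, 2.0, -1.0]

hard_targets = softmax_with_temperature(teacher_logits, T=1.0)
soft_targets = softmax_with_temperature(teacher_logits, T=4.0)
```

At T=1 nearly all probability sits on the top class; at T=4 the secondary classes receive noticeably more mass, which is exactly the extra signal the student trains on.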
Therefore, our goal in this proposal is to develop a more efficient and effective knowledge distillation framework that builds fast and compact DNN models for NLP tasks, which can be deployed in resource-constrained environments with low latency and without quality loss. To achieve this goal, we address three specific questions:
(1) architecture of the student model: it needs to be simple and small, suitable for parallel computation during both training and inference, and suitable for deployment in resource-constrained environments;
(2) the kind of knowledge to be transferred or distilled: the teacher model memorises the whole training set and learns many kinds of knowledge, so how should we design the objective function so that only the required knowledge is transferred to the student model?
(3) balance between model size and performance: we need to carefully design the architecture, the knowledge to be distilled, and the objective function so as to strike a better balance between model size and system performance, based on the deployment and run-time requirements.
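Questions (2) and (3) both hinge on how the training objective is composed. A common starting point, which we state here only as a baseline sketch rather than our proposed method, combines a hard-label cross-entropy term with a soft-target term from the teacher, weighted by a coefficient alpha; the temperature T and alpha are exactly the knobs that trade transferred knowledge against student capacity. All numeric values below are illustrative.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      T=2.0, alpha=0.5):
    """Baseline distillation objective:
    alpha weights the hard-label cross-entropy against the soft-target
    cross-entropy; the T**2 factor keeps the soft term's gradient
    magnitude comparable as the temperature changes."""
    # Hard term: standard cross-entropy against the ground-truth label.
    p_student = softmax(student_logits)
    hard_ce = -math.log(p_student[true_label])
    # Soft term: cross-entropy between softened teacher and student outputs.
    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    soft_ce = -sum(t * math.log(s) for t, s in zip(q_teacher, q_student))
    return alpha * hard_ce + (1 - alpha) * (T ** 2) * soft_ce
```

Setting alpha=1 recovers ordinary supervised training, while smaller alpha leans harder on the teacher; sweeping these two hyper-parameters is one concrete way to explore the size/performance trade-off in question (3).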