
Model Compression Introduction

Introduction

  • Large deep learning models provide high accuracy for the majority of tasks, but they compromise on speed. This can be a deal breaker for industrial use cases, where sometimes hundreds or thousands of inferences need to be made in a short amount of time. In fact, we have two major problems to solve here, so let's take a detour and explore what can be done about them from the AIOps perspective (i.e. without involving Data Scientists at all),

    1. Slower models leading to high inference time. To solve this issue we can,
      • Use more powerful compute: just use a newer and faster machine! 😎
      • Use GPU or TPU compute: move to GPUs (if you have only been using CPUs till now) to speed up deep learning models, thanks to the much faster matrix operations GPUs provide.
    2. High demand leading to lots of inference requests in a short amount of time. To solve this issue we can,
      • Multi Threading: if your system performs I/O calls, DB calls or 3rd party API polling, multi threading could be the way to improve inference throughput. Basically, while one thread is busy waiting on an asynchronous task, another thread can take over and start running.
      • Multi Processing: we can utilize multiple cores of a compute to create multiple processes, each working independently of the others (see the sketch after this list).

      Note

      Multi-processing is a memory (or GPU) consuming option. This is because if you create 4 workers (4 processes), the OS will create 4 copies of your model! For a 1GB model, 4GB of memory will be consumed just to keep the system up and running.

      • Multi Instances: if you are using Kubernetes (K8s), just create multiple replicas of the service. It's like creating multiple copies of the code and running them independently on different computes.

      Note

      Multi Instances is a costly option, as each new replica needs a new compute, which increases the overall cost.
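
To make the memory caveat concrete, below is a minimal sketch of multi-process inference in Python. The `Model` class is a toy stand-in for a real deep learning model and the 4-worker setup is just an example; the point is that every worker process ends up holding its own copy of the model.

```python
# Minimal sketch of multi-process inference: each of the 4 worker processes
# loads its OWN copy of the model, so memory usage grows with the worker count.
from multiprocessing import Pool

class Model:
    """Toy stand-in for a real (e.g. 1GB) deep learning model."""
    def __init__(self):
        self.weight = 2.0  # imagine ~1GB of weights here

    def __call__(self, x):
        return self.weight * x

_model = None  # one global per process, filled in by the initializer

def init_worker():
    global _model
    _model = Model()  # runs once per worker => 4 workers, 4 model copies

def predict(x):
    return _model(x)

if __name__ == "__main__":
    inputs = list(range(8))
    # 4 processes => 4 independent copies of the model in memory
    with Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(predict, inputs))  # [0.0, 2.0, 4.0, ...]
```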

  • All of the above proposed solutions have one major flaw -- they will cost you money 💰, and that too on a recurring basis if you plan to keep the model running for months. On top of that, none of these are the "Data Scientist" way of tackling the problem 😊 Can we do something more scientific? 🤔
  • Enter model compression techniques, where the intuition is to reduce the size of the model, which inherently increases the speed of inference (or training). And to do this with minimal compromise on accuracy!

Note

Please understand that the main assumption behind applying model compression techniques is that, to create a good smaller model, it is more optimal (in terms of cost, time, data or accuracy) to utilize an existing good bigger model rather than training the smaller model from scratch on the same data.

From another perspective, people might argue that if they have sufficient compute and data, why not just train a smaller model from scratch? A counter argument is that, for the same data, a bigger model will most likely provide better accuracy, so using the bigger model to distil a smaller one might transfer some additional knowledge from the teacher model, or at the very least drastically reduce the learning time.

Types of Model compression

  • At a high level, there are the following types of model compression. We will go through them one by one.
    • Knowledge Distillation: in these methods, we distil the learned information (or knowledge) from one neural network model (generally larger) to another model (generally smaller).
    • Quantization: in these methods, we change the data type used to represent the weights and activations of a neural network (e.g. from 32-bit floats to 8-bit integers), which leads to a reduction in memory consumption and often faster inference (see the sketch after this list).
    • Parameter Pruning and Sharing: in these methods, we remove the non-essential parameters from a neural network with minimal to no effect on the overall performance. (Lazy Data Scientist at work - 😴)
  • We will discuss each of these methods in detail in the next sections.
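
To give a first concrete taste of these techniques before the detailed sections, here is a minimal sketch of post-training dynamic quantization in PyTorch. The tiny `Sequential` model is just a placeholder; the same call works for larger Linear/LSTM based networks.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch.
# The quantized model stores Linear weights as int8 instead of float32,
# shrinking its size with little to no change in the calling code.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

quantized_model = torch.quantization.quantize_dynamic(
    model,            # float32 model to compress
    {nn.Linear},      # layer types whose weights are quantized
    dtype=torch.qint8,
)

x = torch.randn(1, 512)
print(model(x).shape, quantized_model(x).shape)  # same interface, smaller weights
```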