Machine Learning / Deep Learning Snippets
Sharing some of the most widely used, but arguably not so well-known, Machine Learning snippets.
Feature importance
- Feature importance calculation is an important technique to identify the features which "help" with the downstream classification or regression tasks.
- Sklearn provides several options to infer the importance of a feature. Most importantly, many models automatically compute the importance and store it in `model.feature_importances_` after you call `.fit()`.
- As an example, let's take a text based classification task and try to infer the following,
  - Part 1: First use `CountVectorizer` for feature engineering and `ExtraTreesClassifier` for classification.
  - Part 2: Show the top N features.
  - Part 3: Show evidence of a feature (by value count over different class labels).
- The following dataset based assumptions have been made,
  - We assume `x_train` and `y_train` contain a list of sentences and labels respectively.
  - We assume a pandas dataframe named `train_df` is present, which contains `x_train` and `y_train` as columns named `title` and `label` respectively.
Cross validation
- Cross validation is a technique in which, at each iteration, you create a different split of train and dev data. At each such iteration, we train the model on the train split and validate on the remaining split. This way, even with small training data, we can perform multiple folds of validation.
- If you repeat this operation (for \(N\) iterations) over the complete data such that (1) each data point belongs to the dev split at most once, and (2) each data point belongs to the train split \(N-1\) times - it's cross-validation.
- I have used the Stratified K-Folds cross-validator; you can use any function from the complete list mentioned here - Sklearn Model selection
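A minimal sketch of Stratified K-Folds cross-validation on a toy text dataset (the sentences and labels are illustrative assumptions, standing in for `x_train`/`y_train`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# toy data; stratification keeps the class ratio in every fold
x = np.array(["cheap pills", "team meeting", "win money", "lunch plan", "free offer", "project sync"])
y = np.array(["spam", "ham", "spam", "ham", "spam", "ham"])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = []
for train_idx, dev_idx in skf.split(x, y):
    # fit the vectorizer on the train split only, to avoid leakage
    vec = CountVectorizer()
    X_train = vec.fit_transform(x[train_idx])
    X_dev = vec.transform(x[dev_idx])
    clf = ExtraTreesClassifier(random_state=42).fit(X_train, y[train_idx])
    scores.append(accuracy_score(y[dev_idx], clf.predict(X_dev)))
print("mean accuracy:", np.mean(scores))
```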
Hyper-parameter tuning
- Below is an example of hyperparameter tuning for the SVR regression algorithm. There we specify the search space, i.e. the list of algorithm parameters to try, and for each parameter combination perform a 5-fold CV test. Refer to Sklearn Hyperparameter tuning and Sklearn SVR Algorithm for more details.
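A minimal sketch with `GridSearchCV` on synthetic regression data (the parameter grid values are illustrative assumptions, not tuned recommendations):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# synthetic regression data
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(40)

# search space: every combination of these parameters is tried
param_grid = {
    "kernel": ["rbf", "linear"],
    "C": [0.1, 1, 10],
    "epsilon": [0.01, 0.1],
}

# cv=5 -> each parameter combination is scored with 5-fold CV
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV score:", search.best_score_)
```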
Callbacks
- Callbacks are the hooks that you can attach to your deep learning training or validation process.
- They can be used to affect the training process, from simple metric logging to even terminating the training when special conditions are met.
- Below is an example of the `EarlyStopping` and `ModelCheckpoint` callbacks.
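In Keras these are available as `tf.keras.callbacks.EarlyStopping` and `tf.keras.callbacks.ModelCheckpoint`. To keep things framework-free, below is a minimal pure-Python sketch of what the two callbacks do at the end of each epoch; the validation losses are simulated, not from a real training run:

```python
class EarlyStopping:
    """Stop training when val_loss has not improved for `patience` epochs."""
    def __init__(self, patience=2):
        self.patience, self.best, self.wait = patience, float("inf"), 0
        self.stop = False
    def on_epoch_end(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stop = True

class ModelCheckpoint:
    """Remember the epoch with the best val_loss (save_best_only behaviour)."""
    def __init__(self):
        self.best, self.saved_epoch = float("inf"), None
    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best:
            self.best, self.saved_epoch = val_loss, epoch

early, ckpt = EarlyStopping(patience=2), ModelCheckpoint()
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.5]  # simulated val_loss per epoch
for epoch, loss in enumerate(val_losses):
    ckpt.on_epoch_end(epoch, loss)
    early.on_epoch_end(loss)
    if early.stop:
        print(f"early stopping at epoch {epoch}")
        break
print("best checkpoint from epoch:", ckpt.saved_epoch)
```

Note that training stops at epoch 4 (two epochs without improvement after the best loss at epoch 2), while the checkpoint keeps the weights from epoch 2.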
Mean pooling
- Reference: this stackoverflow answer.
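The referenced approach applies average pooling over a recurrent layer's timestep axis (in Keras, an `AveragePooling1D` layer). A minimal NumPy sketch of the same operation, with assumed shapes, is:

```python
import numpy as np

# Assumed output of a recurrent layer: (batch_size, timesteps, features)
batch_size, timesteps, features = 2, 4, 3
rnn_output = np.random.rand(batch_size, timesteps, features)

# Equivalent of AveragePooling1D(pool_size=timesteps): average over all
# timesteps, keeping a downsampled_steps dimension of size 1.
pooled = rnn_output.mean(axis=1, keepdims=True)
print(pooled.shape)  # (2, 1, 3) -> (batch_size, downsampled_steps, features)
```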
- Some points to consider,
  - The `pool_size` should be equal to the step/timestep size of the recurrent layer.
  - The shape of the output is (`batch_size`, `downsampled_steps`, `features`), which contains one additional `downsampled_steps` dimension. This will always be 1 if you set the `pool_size` equal to the timestep size of the recurrent layer.
Dataset and Dataloader
- Dataset can be downloaded from Kaggle.
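A minimal sketch of a custom PyTorch `Dataset` wrapped in a `DataLoader`; the tensors here are synthetic stand-ins for the Kaggle data:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A map-style Dataset only needs __len__ and __getitem__."""
    def __init__(self, features, labels):
        self.features, self.labels = features, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# synthetic data: 10 samples, 4 features each, binary labels
dataset = ToyDataset(torch.randn(10, 4), torch.randint(0, 2, (10,)))

# DataLoader handles batching and shuffling
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for x_batch, y_batch in loader:
    print(x_batch.shape, y_batch.shape)
```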
Freeze Layers
- Example of how to freeze certain layers while training.
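A PyTorch sketch: the model and the choice of which layer to freeze are illustrative assumptions, not from the original snippet.

```python
import torch
import torch.nn as nn

# small model: freeze the first layer, train only the head
model = nn.Sequential(
    nn.Linear(8, 16),  # layer 0 - will be frozen
    nn.ReLU(),
    nn.Linear(16, 2),  # layer 2 - stays trainable
)

# freezing = no gradients are computed or applied for these weights
for param in model[0].parameters():
    param.requires_grad = False

# pass only the trainable parameters to the optimizer
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable params:", trainable)
```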
Check for GPU availability
- We need GPUs for deep learning, and before we start training or inference it's a good idea to check if a GPU is available on the system or not.
- The most basic way to check for GPUs (if it's an NVIDIA one) is to run the `nvidia-smi` command. It will return a detailed output with the driver version, CUDA version and the processes using the GPU. Refer to this for more details on the individual components.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce MX110 Off | 00000000:01:00.0 Off | N/A |
| N/A 43C P0 N/A / N/A | 164MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6348 G /usr/lib/xorg 53MiB |
| 0 13360 G ...BBBBBaxsxsuxbssxsxs --shared-files 28MiB |
+-----------------------------------------------------------------------------+
- You can even use deep learning frameworks like Pytorch to check for GPU availability. In fact, this is most probably where you will use them.
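A minimal Pytorch sketch of the usual check-and-select-device pattern:

```python
import torch

# check GPU availability and pick a device accordingly
gpu_available = torch.cuda.is_available()
device = torch.device("cuda" if gpu_available else "cpu")
print("GPU available:", gpu_available)
print("using device:", device)
if gpu_available:
    print("device name:", torch.cuda.get_device_name(0))

# tensors (and models) are moved to the chosen device with .to(device)
x = torch.randn(2, 3).to(device)
```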
Monitor GPU usage
- If you want to continuously monitor the GPU usage, you can use the `watch -n 2 nvidia-smi --id=0` command. This will refresh the `nvidia-smi` output every 2 seconds.
HuggingFace Tokenizer
- A tokenizer is a pre-processing step that converts text into a sequence of tokens. The HuggingFace tokenizer is a wrapper around the tokenizers library, which contains multiple base algorithms for fast tokenization.
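The usual entry point is `transformers.AutoTokenizer.from_pretrained(...)`, whose output contains `input_ids` and an `attention_mask`. Since that requires downloading a pretrained vocabulary, below is a toy, dependency-free sketch of the same encode interface; the vocabulary is an invented assumption:

```python
# toy vocabulary standing in for a pretrained one
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3, "machine": 4, "learning": 5}

def encode(text, max_length=6):
    """Mimic the HuggingFace encode output: input_ids + attention_mask."""
    tokens = text.lower().split()
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens][:max_length]
    attention_mask = [1] * len(ids)
    # pad to max_length; padding positions are masked out with 0
    while len(ids) < max_length:
        ids.append(vocab["[PAD]"])
        attention_mask.append(0)
    return {"input_ids": ids, "attention_mask": attention_mask}

print(encode("hello machine learning"))
```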
Explore Model
- You can use the `summary` method to check the model's architecture. This will show the layers, their output shapes and the number of parameters in each layer.
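For Pytorch models (where a Keras-style `summary` needs a helper library such as `torchinfo`), printing the model and counting its parameters gives a similar overview. A minimal sketch, with an assumed toy model:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
print(model)  # prints the layer-by-layer architecture

# total parameter count across all layers
total_params = sum(p.numel() for p in model.parameters())
print("total parameters:", total_params)
```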
- To check the named parameters of the model and their dtypes, you can use the following code,
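A minimal Pytorch sketch, with an assumed toy model:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# iterate over named parameters and print their shapes and dtypes
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.dtype)
```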