Interview Questions

  • Here are some questions and their answers to help you prepare for your next interview. Best of luck 👋

What is Deep learning and how is it different from traditional Machine learning?

Deep learning is a subfield of machine learning that uses neural networks with many layers, called deep neural networks, to learn and make predictions. It is different from traditional machine learning in that it can automatically learn hierarchical representations of the data and doesn't rely heavily on feature engineering.

What is Dummy Variable Trap in ML?

  • When using linear models, like logistic regression, on a one-hot encoded (dummy variable) dataset built from a categorical column with a finite set of levels (unique values), it is suggested to drop one level from the final data so that the total number of new one-hot encoded columns is one less than the number of unique levels in the column. For example, consider a season column that contains four unique values: spring, summer, fall, and winter. When one-hot encoding, it is suggested to keep only three of the four resulting columns (see the sketch after the note below).
  • The reason: "If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity, so that the matrix inversion in the estimation algorithm would be impossible." Refer Wikipedia

Note

If using regularization, then don't drop a level, as doing so biases your model in favor of the variable you dropped. Refer Damien Martin's Blog
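A minimal pandas sketch of the idea (column name and values are illustrative; `drop_first=True` keeps k-1 dummy columns):

```python
import pandas as pd

# A toy dataset with a 4-level categorical column
df = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "summer"]})

# drop_first=True keeps k-1 dummy columns, avoiding the dummy variable trap
dummies = pd.get_dummies(df["season"], prefix="season", drop_first=True)
print(dummies.columns.tolist())  # 3 columns instead of 4
```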

How does backpropagation work in a neural network?

Backpropagation is an algorithm used to train neural networks. It starts by propagating the input forward through the network, calculating the output. Then it compares the output to the desired output and calculates the error. The error is then propagated backwards through the network, adjusting the weights in the network so as to minimize the error. This process is repeated multiple times until the error is minimized.
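A minimal NumPy sketch of the forward and backward passes for a tiny one-hidden-layer network with a sigmoid activation and MSE loss (data, sizes, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 features, 1 target each
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# One hidden layer (sigmoid) followed by a linear output
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
lr = 0.1

for step in range(100):
    # Forward pass
    h = 1 / (1 + np.exp(-(X @ W1 + b1)))   # hidden activations
    y_hat = h @ W2 + b2                     # network output
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: propagate the error from output to input
    d_yhat = 2 * (y_hat - y) / len(X)       # dL/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T
    d_pre = d_h * h * (1 - h)               # sigmoid derivative
    dW1 = X.T @ d_pre
    db1 = d_pre.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)  # loss decreases as the weights are adjusted
```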

While training deep learning models, why do we prefer training on mini-batch rather than on individual sample?

First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms. Ref

What are the benefits of using Batch Normalization?

Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes. Ref

What is Entropy (information theory)?

Entropy is a measurement of uncertainty of a system. Intuitively, it is the amount of information needed to remove uncertainty from the system. The entropy of a probability distribution p for various states of a system can be computed as: \(-\sum_{i} p_i \log p_i\)
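A small NumPy sketch of the formula (base-2 logarithm, so the result is in bits):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (zero-probability states are ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.08 bits: a heavily biased coin is almost certain
```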

Even though Sigmoid function is non-linear, why is Logistic regression considered a linear classifier?

Logistic regression is often referred to as a linear classifier despite using the sigmoid (logistic) activation function because it models the relationship between the input features and the log-odds (logit) of the binary target variable in a linear manner. The linearity in logistic regression refers to the fact that it creates a linear decision boundary in the feature space, which is a hyperplane. Refer
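A small scikit-learn sketch illustrating the point on toy data: since \(\sigma(w \cdot x + b) \ge 0.5 \iff w \cdot x + b \ge 0\), the learned decision boundary is the straight line \(w \cdot x + b = 0\).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy 2-D data, only so we can inspect the fitted boundary
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
# Points on the linear decision boundary w . x + b = 0
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x1_on_boundary = -(w[0] * x0 + b) / w[1]
print(list(zip(x0.round(2), x1_on_boundary.round(2))))
```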

What is the difference between Logits, Soft and Hard targets?

  • Let us understand each of the terms one by one. For better understanding, let's take dog-vs-cat image classification as an example (a small numeric sketch follows the diagram below).
    • Logits are the un-normalized outputs of the model. In our cat vs dog example, the logits might be, say, 10.1 for cat and 5.6 for dog for an image of a cat. Refer this SE question.
    • Soft targets: the logits normalized into probabilities, typically by applying the softmax function. In our example, applying softmax to the logits gives roughly 0.99 for cat and 0.01 for dog.
    • Hard targets: the one-hot encoding of the soft targets. In our example, as the model (here correctly) predicted the image as a cat, the hard targets are 1 for cat and 0 for dog.
```mermaid
graph LR
    A[Logits] -- normalization --> B[Soft Targets]
    B -- encoding --> C[Hard Targets]
```
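A small NumPy sketch of the three terms, using the cat-vs-dog logits above:

```python
import numpy as np

logits = np.array([10.1, 5.6])                            # un-normalized scores for [cat, dog]

soft_targets = np.exp(logits) / np.exp(logits).sum()      # softmax -> probabilities
hard_targets = (soft_targets == soft_targets.max()).astype(int)  # one-hot of the argmax

print(soft_targets)   # ~[0.989, 0.011]
print(hard_targets)   # [1, 0]
```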

How do you handle overfitting in deep learning models?

  • Overfitting occurs when a model becomes too complex and starts to fit the noise in the training data, rather than the underlying pattern. There are several ways to handle overfitting in deep learning models, including:
    • Regularization techniques such as L1 and L2 regularization, which add a penalty term to the loss function to discourage large weights
    • Early stopping, where training is stopped before the model has a chance to fully fit the noise in the training data (see the sketch after this list)
    • Dropout, which randomly drops out a certain percentage of neurons during training to prevent them from co-adapting and becoming too specialized
    • Adding more data to the training set
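As a concrete example of the early-stopping idea, here is a minimal sketch; `train_one_epoch`, `evaluate`, `model`, `train_loader`, and `val_loader` are hypothetical placeholders for your own training code:

```python
# Hypothetical placeholders: train_one_epoch, evaluate, model, train_loader, val_loader
best_val_loss, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(100):
    train_one_epoch(model, train_loader)         # fit on the training set for one epoch
    val_loss = evaluate(model, val_loader)       # measure loss on held-out data

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0  # improvement: reset patience counter
    else:
        bad_epochs += 1                          # no improvement this epoch
        if bad_epochs >= patience:
            break                                # stop before the model fits the noise
```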

Explain the concept of temperature in deep learning?

In deep learning, the concept of "temperature" is often associated with the Softmax function and is used to control the degree of confidence or uncertainty in the model's predictions. It's primarily applied in the context of classification tasks, such as image recognition or natural language processing, where the model assigns probabilities to different classes.

The Softmax function is used to convert raw model scores or logits into a probability distribution over the classes. Each class is assigned a probability score, and the class with the highest probability is typically selected as the predicted class.

The Softmax function is defined as follows for a class "i":

\[P(i) = \frac{e^{z_i / \tau}}{\sum_{j} e^{z_j / \tau}}\]

Where:

  • \(P(i)\) is the probability of class "i."
  • \(z_i\) is the raw score or logit for class "i."
  • \(\tau\), known as the "temperature," is a positive scalar parameter.

The temperature parameter, \(\tau\), affects the shape of the probability distribution. When \(\tau\) is high, the distribution becomes "soft," meaning that the probabilities are more evenly spread among the classes. A lower temperature results in a "harder" distribution, with one or a few classes having much higher probabilities.

Here's how temperature impacts the Softmax function:

  • High \(\tau\): The model is more uncertain, and the probability distribution is more uniform, which can be useful when exploring diverse options or when dealing with noisy data.
  • Low \(\tau\): The model becomes more confident, and the predicted class will have a much higher probability. This is useful when you want to make decisive predictions.

Temperature allows you to control the trade-off between exploration and exploitation in the model's predictions. It's a hyperparameter that can be adjusted during training or inference to achieve the desired level of certainty in the model's output, depending on the specific requirements of your application.
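A small NumPy sketch of softmax with temperature (the logits are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits divided by a temperature tau (> 0)."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [10.1, 5.6, 1.2]
print(softmax_with_temperature(logits, tau=1.0))   # peaked ("harder") distribution
print(softmax_with_temperature(logits, tau=10.0))  # flatter ("softer") distribution
print(softmax_with_temperature(logits, tau=0.1))   # nearly one-hot
```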

Can you explain the concept of convolutional neural networks (CNN)?

A convolutional neural network (CNN) is a type of neural network that is primarily used for learning image and video patterns. CNNs are designed to automatically and adaptively learn spatial hierarchies of features from input data. They use a variation of multi-layer perceptrons, designed to require minimal preprocessing. Instead of hand-engineered features, CNNs learn features from data using a process called convolution.

How do you handle missing data in deep learning?

  • Missing data can be handled in several ways (see the sketch after this list), including:
    • Removing the rows or columns with missing data
    • Interpolation or imputation of missing values
    • Using a technique called masking, which allows the model to ignore missing values when making predictions
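A minimal pandas sketch of the three options on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "salary": [50_000, 60_000, np.nan]})

dropped = df.dropna()                             # remove rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute missing values with column means
mask = df.notna()                                 # boolean mask a model can use to ignore gaps
```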

Can you explain the concept of transfer learning in deep learning?

Transfer learning is a technique where a model trained on one task is used as a starting point for a model on a second, related task. This allows the model to take advantage of the features learned from the first task and apply them to the second task, which can lead to faster training and better performance. This can be done by using a pre-trained model as a feature extractor or fine-tuning the pre-trained model on new data.

What is Gradient Descent in deep learning?

Gradient Descent is an optimization algorithm used to minimize the loss function of a neural network. It works by updating the weights of the network in the opposite direction of the gradient of the loss function with respect to the weights. The magnitude of the update is determined by the learning rate. There are several variants of gradient descent, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
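A minimal NumPy sketch of batch gradient descent on a linear-regression MSE loss (synthetic data; the learning rate and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss w.r.t. w
    w -= lr * grad                          # step in the opposite direction of the gradient

print(w)  # close to [2.0, -1.0, 0.5]
```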

What is Representation learning?

Representation learning is a fundamental concept in AI that refers to a system's ability to learn multiple levels of feature representation with increasing abstraction, i.e., to learn representations of the data itself. These learned representations are encoded in the network's weights and activations and are used to make predictions and decisions.

Explain Label smoothing.

Label smoothing is a technique used in machine learning to prevent the model from becoming over-confident (overfitting). The smoothing is done by adding a small amount of noise to the labels of the training data, which makes the model less likely to overfit to the training data. Technically it generates soft labels by applying a weighted average between the uniform distribution and the hard label. Refer Paper 1 or Paper 2
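A small NumPy sketch of the weighted average between the hard one-hot label and the uniform distribution (\(\epsilon\) is the smoothing factor):

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Mix the hard one-hot label with a uniform distribution over K classes."""
    k = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / k

hard = np.array([0.0, 0.0, 1.0])   # hard label for a 3-class problem
print(smooth_labels(hard))         # [0.0333, 0.0333, 0.9333]
```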

Please explain what is Dropout in deep learning?

Dropout is a regularization technique used in deep learning to prevent overfitting. It works by randomly dropping out a certain percentage of neurons during training, effectively reducing the capacity of the network. This forces the network to learn multiple independent representations of the data, making it less prone to overfitting.
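A minimal NumPy sketch of inverted dropout, where surviving activations are rescaled by \(1/(1-p)\) so that their expected value is unchanged:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero out activations with probability p during training."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) > p).astype(x.dtype)
    return x * mask / (1.0 - p)    # rescale so the expected activation is unchanged

h = np.ones((2, 4))
print(dropout(h, p=0.5))           # roughly half the units are zeroed, the rest are doubled
```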

What are Autoencoders?

An autoencoder is a type of neural network that is trained to reconstruct its input. It has an encoder part that compresses the input into a lower-dimensional representation called the bottleneck or latent code, and a decoder part that reconstructs the input from the latent code. Autoencoders can be used for tasks such as dimensionality reduction, anomaly detection and generative modelling.

Can you explain the concept of attention mechanism in deep learning?

Attention mechanism is a way to weight different parts of the input in a neural network, giving more importance to certain features than others. It is commonly used in tasks such as machine translation, where the model needs to focus on different parts of the input sentence at different times. Attention mechanisms can be implemented in various ways, such as additive attention, dot-product attention, and multi-head attention.

What are Generative Adversarial Networks (GANs)?

Generative Adversarial Networks (GANs) are a type of generative model that consists of two parts, a generator and a discriminator. The generator is trained to generate new data that is similar to the data it was trained on, while the discriminator is trained to distinguish the generated data from the real data. The two parts are trained together in a game-theoretic manner, where the generator tries to generate data that can fool the discriminator, and the discriminator tries to correctly identify the generated data.

Can you explain the concept of Memory Networks in deep learning?

Memory networks are a type of neural network architecture that allow the model to access and manipulate an external memory matrix, which can be used to store information that is relevant to the task. This allows the model to reason about the past and use this information to make predictions about the future. Memory networks have been used in tasks such as language understanding and question answering. Refer this for more details.

Explain Capsule Networks in deep learning?

Capsule networks are a type of neural network architecture that aims to overcome the limitations of traditional convolutional neural networks (CNNs) by using a new type of layer called a capsule. A capsule contains multiple neurons that work together to represent an object or part of an object, and the activities of the neurons are used to represent the properties of the object such as position, size and orientation. Capsule networks have been used in tasks such as image classification and object detection.

Can you explain the concept of generative models in deep learning?

Generative models are a type of deep learning model that can generate new data that is similar to the data it was trained on. These models are trained on a dataset and learn the underlying probability distribution of the data, allowing them to generate new, unseen data that fits that distribution. Examples of generative models include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).

What is the concept of adversarial training in deep learning?

Adversarial training is a technique used to improve the robustness of deep learning models by generating adversarial examples and using them to train the model. Adversarial examples are inputs that are slightly perturbed in such a way as to cause the model to make a mistake. By training the model on these examples, it becomes more robust to similar perturbations in the real world.

What is weight initialization in deep learning?

Weight initialization is the process of setting the initial values for the weights in a neural network. The initial values of the weights can have a big impact on the network's performance and training time. There are several methods to initialize weights, including random initialization, Glorot (Xavier) initialization, and He initialization. Each of these methods has different properties and is more suitable for different types of problems.
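A small NumPy sketch of Glorot and He initialization for a single weight matrix (the fan-in and fan-out values are illustrative):

```python
import numpy as np

fan_in, fan_out = 512, 256

# Glorot/Xavier initialization: variance scaled by both fan-in and fan-out
w_glorot = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

# He initialization: variance scaled by fan-in, suited to ReLU networks
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

print(w_glorot.std(), w_he.std())
```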

Explain data augmentation?

Data augmentation is a technique used to increase the amount of data available for training a deep learning model. This is done by creating new training examples by applying various random transformations to the original data, such as random cropping, flipping, or rotation. Data augmentation can be a powerful tool to prevent overfitting and improve the generalization performance of a model.

What is the difference between Standardization and Normalization?

  • Normalization is the process of scaling the data to a common scale. It is also known as Min-Max Scaling, where the final range could be [0, 1], [-1, 1], or something else: \(X_{new} = (X - X_{min})/(X_{max} - X_{min})\)
  • Standardization is the process of scaling the data to have zero mean and unit variance: \(X_{new} = (X - \mu)/\sigma\)
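A small NumPy sketch of both transformations:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

x_norm = (x - x.min()) / (x.max() - x.min())   # min-max normalization -> range [0, 1]
x_std = (x - x.mean()) / x.std()               # standardization -> mean 0, std 1

print(x_norm)                                  # [0.  0.25 0.5 0.75 1. ]
print(x_std.mean(), x_std.std())               # ~0.0 and 1.0
```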

Is it possible that during ML training, both validation (or test) loss and accuracy, are increasing?

Accuracy and loss are not necessarily exactly (inversely) correlated, as loss measures a difference between raw prediction (float) and class (0 or 1), while accuracy measures the difference between thresholded prediction (0 or 1) and class. So if raw predictions change, loss changes but accuracy is more "resilient" as predictions need to go over/under a threshold to actually change accuracy. Soltius's answer on SE

Is K-means clustering algorithm guaranteed to converge with unique result?

The K-means clustering algorithm is guaranteed to converge, but the final result may vary based on the centroid initialisation. This is why it is suggested to try multiple initialization strategies and pick the one with the best clustering. The convergence is guaranteed as the sum of squared distances between each point and its centroid strictly decreases over each iteration. Also, the practical run time of k-means is basically linear. Refer Visualizing K Means Clustering - Naftali Harris

In K-means clustering, is it possible that a centroid has no data points assigned to it?

Yes, it is possible; imagine a centroid placed in the middle of a ring of other centroids. Many implementations either remove such a centroid or randomly re-place it somewhere else in the data space. Refer Visualizing K Means Clustering - Naftali Harris

What is entropy in information theory?

Entropy is a measure of the amount of uncertainty or randomness in a system. It is often used in information theory and statistical mechanics to describe the unpredictability of a system or the amount of information required to describe it. Its formula is: \(\mathrm {H} (X):=-\sum _{x\in {\mathcal {X}}}p(x)\log p(x)=\mathbb {E} [-\log p(X)]\)

Here is an excellent video from Aurelien Geron, explaining the topic.

What is the difference between supervised and unsupervised learning?

Supervised learning uses labeled data to train a model to make predictions, while unsupervised learning uses unlabeled data to find patterns or structure in the data.

How do you evaluate the performance of a machine learning model?

One common method is to split the data into a training set and a test set, and use metrics such as accuracy, precision, recall, and F1 score to evaluate the model's performance on the test set.

What is overfitting in machine learning and how can it be prevented?

Overfitting occurs when a model is too complex and is able to fit the noise in the training data, leading to poor performance on new, unseen data. To prevent overfitting, methods such as cross-validation, regularization, and early stopping can be used.

What is the difference between a decision tree and random forest?

A decision tree is a single tree model that makes a prediction by traversing the tree from the root to a leaf node, while a random forest is an ensemble of decision trees, where the final prediction is made by averaging the predictions of all the trees in the forest.

What is the Bias-Variance trade-off in machine learning?

The Bias-Variance trade-off refers to the trade-off between a model's ability to fit the training data well (low bias) and its ability to generalize well to new, unseen data (low variance). A model with high bias will underfit the training data, while a model with high variance will overfit the training data.

What is the difference between batch and online learning?

Batch learning is a type of machine learning where the model is trained on a fixed dataset and the parameters are updated after processing the entire dataset. In contrast, online learning is a type of machine learning where the model is trained on a continuous stream of data and the parameters are updated incrementally after processing each example.

What is the difference between a decision boundary and a decision surface in machine learning?

A decision boundary is a boundary that separates different classes in a dataset; in a two-dimensional feature space it can be represented by a line, and in a multi-dimensional space by a hyperplane. A decision surface is a generalization of a decision boundary: a (possibly curved) surface that separates different classes in a multi-dimensional space. In simple words, a decision boundary can be thought of as a lower-dimensional view of a decision surface.

What is the use of principal component analysis (PCA) in machine learning?

Principal component analysis (PCA) is a technique used to reduce the dimensionality of a dataset by identifying the most important features, called principal components. PCA finds a new set of uncorrelated features, called principal components, that can explain most of the variance in the original data. This can be useful for visualizing high-dimensional data, reducing noise, and improving the performance of machine learning models.
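A minimal scikit-learn sketch (random data used purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # hypothetical high-dimensional data

pca = PCA(n_components=2)               # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance explained by each component
```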

What is the use of the Random Forest algorithm in machine learning?

Random Forest is an ensemble learning technique that combines multiple decision trees to improve the performance and stability of the model. It works by creating multiple decision trees using random subsets of the features and training data, and then averaging the predictions of all the trees to make a final prediction. The Random Forest algorithm is often used for classification and regression problems; it's robust to outliers, missing values, and irrelevant features, and it can also be used for feature selection and feature-importance analysis.

What is the difference between a generative model and a discriminative model?

A generative model learns the probability distribution of the data and can generate new samples from it, while a discriminative model learns the boundary between different classes and makes predictions based on it. Generative models are used for tasks such as density estimation, anomaly detection, and data generation, while discriminative models are used for tasks such as classification and regression.

What is the difference between an autoencoder and a variational autoencoder?

An autoencoder is a type of neural network that learns to encode and decode input data; it can be used to reduce the dimensionality of the data and learn a compact representation of it. A variational autoencoder (VAE) is a type of autoencoder that learns a probabilistic encoding of the input data and can generate new samples from the learned distribution. VAEs can be used for tasks such as image generation and anomaly detection.

What is Expectation-Maximization (EM) algorithm?

The Expectation-Maximization (EM) algorithm is a method for finding maximum likelihood estimates in incomplete data problems, where some of the data is missing or hidden. EM works by iteratively refining estimates of the missing data and the parameters of the model, until it converges to a local maximum of the likelihood function. It can be used for tasks such as clustering, image segmentation, and missing data imputation.

What is the difference between L1 and L2 regularization in machine learning?

L1 and L2 regularization are methods used to prevent overfitting in machine learning models by adding a penalty term to the loss function. L1 regularization adds the absolute value of the weights to the loss function, while L2 regularization adds the square of the weights. L1 regularization leads to sparse models where some weights will be zero, while L2 regularization leads to models where all weights are small.
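A small scikit-learn sketch contrasting the two penalties on synthetic data where only two features matter (the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 useful features

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print((lasso.coef_ == 0).sum())      # many coefficients driven exactly to zero (sparse model)
print((ridge.coef_ == 0).sum())      # typically none are exactly zero, just small
```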

Explain Support Vector Machine (SVM).

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification and regression tasks. SVM works by finding the hyperplane that maximally separates the different classes in a dataset and then uses this hyperplane to make predictions. SVM is particularly useful when the data is linearly separable; it's also robust to high-dimensional data, and it can be used with kernel functions to solve non-linearly separable problems.

What is the use of the k-nearest neighbors (k-NN) algorithm?

k-nearest neighbors (k-NN) is a type of instance-based learning algorithm that can be used for classification and regression tasks. The algorithm works by finding the k training examples that are closest to a new input and using the majority class or average value of those examples to make a prediction. k-NN is a simple and efficient algorithm that can be used for tasks such as image classification, anomaly detection, and recommendation systems.

What is the use of the Random Sampling method for feature selection in machine learning?

Random Sampling is a method for feature selection that involves randomly selecting a subset of features from the dataset and evaluating the performance of a model trained on that subset. The subset of features that result in the best performance are then chosen for further analysis or use in a final model. This method can be useful when the number of features is large and there is no prior knowledge of which features are most relevant.

Explain Bagging method in ensemble learning?

Bagging (Bootstrap Aggregating) is a method for ensemble learning that involves training multiple models on different subsets of the data and then combining the predictions of those models. The subsets of data are created by randomly sampling the original data with replacement; this helps to reduce the variance of the model and increase the robustness of the predictions. Bagging is commonly used with decision trees, as in the Random Forest algorithm.

Explain AdaBoost method in ensemble learning?

AdaBoost (Adaptive Boosting) is a method for ensemble learning that trains multiple models sequentially and then combines their predictions. Each new model gives more weight to the examples that were misclassified by the previous models, which helps to increase the accuracy of the ensemble. AdaBoost is commonly used with decision trees but can be used with any type of base classifier.

Explain Gradient Boosting method in ensemble learning?

Gradient Boosting is a method for ensemble learning that trains multiple models in a sequential manner, where each model tries to correct the mistakes of the previous one. The method uses gradient descent to minimize the loss function; it is commonly used with decision trees but can be used with other base learners. Gradient Boosting is a powerful method that can achieve state-of-the-art performance on many machine learning tasks.

Explain XGBoost method in ensemble learning?

XGBoost (Extreme Gradient Boosting) is a specific implementation of the Gradient Boosting method that uses a more efficient tree-based model and a number of techniques to speed up the training process and reduce overfitting. XGBoost is commonly used in machine learning competitions and it's one of the most popular libraries used for gradient boosting. It's used for classification and regression problems.

What is group_size in context of Quantization?

Group size is a parameter used in the quantization process that determines the number of weights or activations (imagine weights in a row of a matrix) that are quantized together, i.e., that share the same quantization parameters. A smaller group size can lead to better quantization accuracy, but it can also increase the memory and computational requirements of the model. Group size is an important hyperparameter that needs to be tuned to achieve the best trade-off between accuracy and efficiency. Note that the default group size for GPTQ is 1024. Refer this interesting Reddit discussion

What is EMA (Exponential Moving Average) in context of deep learning?

EMA is a technique used in deep learning to stabilize the training process and improve the generalization performance of the model. It works by maintaining a moving average of the model's parameters during training, which helps to smooth out the noise in the gradients and prevent the model from overfitting to the training data. Usually during training, the model's parameters are updated using the gradients of the loss function, but the EMA parameters are updated using a weighted average of the current parameters and the previous EMA parameters, as shown below:

\[ \text{ema\_param} = \text{ema\_param} \cdot \text{decay} + \text{param} \cdot (1 - \text{decay}) \]

One more advantage of the EMA model is that it can be used to resume training; some released checkpoints provide both the raw model parameters and the EMA parameters.
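A minimal sketch of the update above, assuming plain PyTorch modules (the decay value is illustrative):

```python
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """Blend the current parameters into the EMA copy of the model."""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(4, 2)
ema_model = copy.deepcopy(model)   # the EMA copy starts as a clone of the model
# ... after each optimizer step during training:
update_ema(ema_model, model, decay=0.999)
```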

What is Bradley-Terry model? And how is it used in machine learning?

The Bradley-Terry model is a probability model for the outcome of a pairwise comparison between two items. It is commonly used in sports analytics to rank teams or players based on their performance in head-to-head matches. Refer: Wikipedia

In the case of ML, the model is a popular choice for modeling human preferences in the context of training language models. It stipulates that the probability of a human preferring one completion over another can be expressed as a ratio of exponentials of the latent rewards associated with each completion. Specifically, the human preference distribution can be written as:

\[ p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))} \]

This model is commonly used to capture human preferences and provides a framework for understanding and incorporating human feedback into the training of language models.
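A tiny sketch of the preference probability for two illustrative latent rewards:

```python
import math

def bradley_terry_preference(r1, r2):
    """Probability that completion 1 is preferred, given latent rewards r1 and r2."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

# Illustrative rewards for two candidate completions
print(bradley_terry_preference(2.0, 0.5))   # ~0.82: completion 1 is usually preferred
```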