Weight decay, also known as L2 regularization, is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function that discourages large weights. Concretely, the squared sum of the model's weights, scaled by a coefficient, is added to the loss.
How it works:
Given a loss function L(w), the weight decay term is added as:

L_total(w) = L(w) + λ · Σᵢ wᵢ²

Where:
- L(w) is the original loss function (e.g., mean squared error or cross-entropy),
- w are the model's weights,
- λ is a regularization hyperparameter that controls the strength of the penalty (larger values result in stronger regularization).
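As a minimal sketch of what this looks like in code (assuming PyTorch; the model, data, and the value of the hypothetical coefficient "lam" are all illustrative), the penalty can simply be added to the data loss before backpropagation:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)      # any model with learnable weights
criterion = nn.MSELoss()      # the original loss L(w)
lam = 1e-4                    # regularization strength λ (illustrative value)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

data_loss = criterion(model(x), y)
# Squared sum of all parameters: Σ wᵢ² (in practice biases are often excluded)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
total_loss = data_loss + lam * l2_penalty
total_loss.backward()         # gradients now include the decay term
```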
Effect of weight decay:
- It shrinks the weights during training, which can help prevent the model from fitting noise or overly complex patterns in the data (see the sketch after this list for the update-rule view of this shrinkage).
- It encourages the model to learn simpler and more generalizable features.
- When λ is too large, the model may underfit because the weights become too small to capture meaningful patterns. Conversely, a very small λ may lead to overfitting, as it doesn't penalize large weights enough.
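To see the shrinking effect concretely, here is a small sketch of a single hand-written gradient-descent step with the penalty λ · Σᵢ wᵢ² included; the numbers and the names "lr" and "lam" are illustrative, not taken from any particular library:

```python
import numpy as np

# The gradient of lam * sum(w**2) with respect to w is 2 * lam * w,
# so the update multiplies the weights by (1 - 2 * lr * lam), a factor
# slightly below one, before applying the data-loss gradient.
lr, lam = 0.1, 0.01
w = np.array([2.0, -3.0, 0.5])
grad_data = np.array([0.4, -0.1, 0.2])   # gradient of the original loss (made up)

w_new = w - lr * (grad_data + 2 * lam * w)
# Equivalent form: w_new = (1 - 2 * lr * lam) * w - lr * grad_data
print(w_new)
```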
Weight decay is particularly useful in neural networks, where large weights can lead to unstable training and overfitting. It's commonly used in combination with gradient-based optimization methods like stochastic gradient descent (SGD).
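In practice, most deep learning frameworks expose weight decay through the optimizer rather than requiring the penalty to be written into the loss. The sketch below assumes PyTorch, whose SGD optimizer adds the decay term to each parameter's gradient via its weight_decay argument; the exact scaling convention (λ vs. λ/2) varies between libraries, so the value shown is illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# The decay is applied inside optimizer.step(), so only the original
# loss is computed explicitly below.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```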