Refresher:
Adam (Adaptive Moment Estimation): Combines momentum (tracking an exponentially decaying average of past gradients) with RMSprop (tracking an exponentially decaying average of squared gradients) to adapt the learning rate for each parameter individually. Uses bias correction to account for the moment estimates being initialized at zero, making it particularly effective early in training. Widely used as a default optimizer that typically works well across many problems with minimal hyperparameter tuning.
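To make the update concrete, here is a minimal NumPy sketch of one Adam step for a single parameter array. The function name, variable names (`m`, `v`, `t`), and the default hyperparameters are illustrative choices, not something prescribed by the text above.

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch). t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * grad**2     # decaying average of squared gradients (RMSprop part)
    m_hat = m / (1 - beta1**t)                # bias correction: moments start at zero
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

The bias-corrected estimates `m_hat` and `v_hat` matter most in the first few steps, when the raw averages are still biased toward their zero initialization.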
AdaGrad (Adaptive Gradient): Accumulates the sum of squared gradients for each parameter throughout training and divides the learning rate by the square root of this accumulated value. Works well for sparse data and parameters that receive infrequent updates, giving them larger effective learning rates. Its major drawback is that the accumulated squared gradients grow monotonically, causing learning rates to eventually become vanishingly small and training to stall.
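A minimal sketch of the AdaGrad rule, with illustrative names (`accum` for the running sum of squared gradients) and an assumed default learning rate:

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad step (illustrative sketch)."""
    accum = accum + grad**2                              # monotonically growing sum of squared gradients
    param = param - lr * grad / (np.sqrt(accum) + eps)   # rarely updated params keep a larger effective lr
    return param, accum
```

Because `accum` only ever grows, the effective step size `lr / sqrt(accum)` shrinks toward zero over time, which is exactly the drawback noted above.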
RMSprop (Root Mean Square Propagation): Addresses AdaGrad's diminishing-learning-rate problem by using an exponentially decaying average of squared gradients instead of accumulating all historical values. Maintains a separate adaptive learning rate for each parameter while allowing continued learning throughout training. Particularly effective for non-stationary objectives and problems with noisy gradients, such as mini-batch training.
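A sketch of the RMSprop step, assuming a decay factor `rho` of 0.9 (a common but not mandated choice); only the accumulation line differs from AdaGrad:

```python
import numpy as np

def rmsprop_update(param, grad, sq_avg, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop step (illustrative sketch)."""
    sq_avg = rho * sq_avg + (1 - rho) * grad**2          # old squared gradients decay away instead of accumulating
    param = param - lr * grad / (np.sqrt(sq_avg) + eps)
    return param, sq_avg
```

Since `sq_avg` tracks only recent gradient magnitudes, the effective learning rate can recover when gradients shrink, so training does not grind to a halt the way it can with AdaGrad.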
AdaDelta: Extension of RMSprop that eliminates the need to manually set a learning rate by replacing it with an exponentially decaying average of squared parameter updates from previous steps. Ensures that parameter updates have consistent units/magnitudes by forming a ratio between the RMS of recent parameter updates and the RMS of recent gradients. More robust to hyperparameter choices, but adds computational overhead by tracking an additional running average.
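A sketch of the AdaDelta step; the names `sq_grad_avg` and `sq_delta_avg` (the two running averages) and the defaults `rho=0.95`, `eps=1e-6` are illustrative assumptions:

```python
import numpy as np

def adadelta_update(param, grad, sq_grad_avg, sq_delta_avg, rho=0.95, eps=1e-6):
    """One AdaDelta step (illustrative sketch): RMS of past updates replaces the learning rate."""
    sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad**2               # decaying average of squared gradients
    delta = -np.sqrt(sq_delta_avg + eps) / np.sqrt(sq_grad_avg + eps) * grad  # ratio keeps units consistent with param
    sq_delta_avg = rho * sq_delta_avg + (1 - rho) * delta**2            # decaying average of squared updates
    param = param + delta
    return param, sq_grad_avg, sq_delta_avg
```

Note that no `lr` argument appears: the RMS of previous updates in the numerator plays the role the learning rate plays in RMSprop, which is the sense in which the manual learning rate is eliminated.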