QR decomposition in neural networks refers to using the QR matrix factorization technique within neural network architectures. Let me explain what this means and why it's useful.
What is QR Decomposition?
First, QR decomposition is a mathematical technique that factors a matrix A into two matrices:
- Q: An orthogonal matrix (columns are orthonormal vectors)
- R: An upper triangular matrix
So: A = QR
Applications in Neural Networks
1. Weight Initialization and Normalization
QR decomposition is used to initialize weight matrices with orthogonal properties:
- Orthogonal initialization: Weights are initialized as orthogonal matrices using QR decomposition of random matrices
- Benefits: Helps prevent vanishing/exploding gradients and improves training stability
- Example: In RNNs and deep networks, orthogonal initialization helps maintain gradient flow
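As a concrete illustration, here is a minimal NumPy sketch of QR-based orthogonal initialization (the helper name `orthogonal_init` is just for this example; PyTorch offers `torch.nn.init.orthogonal_` for the same purpose):

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Initialize a (rows, cols) weight matrix with orthonormal columns via QR."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    # Sign correction so the result is uniformly distributed over orthogonal matrices
    q *= np.sign(np.diag(r))
    return q

W = orthogonal_init(64, 64)
print(np.allclose(W.T @ W, np.eye(64)))  # True: columns are orthonormal
```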
2. Orthogonal Regularization
During training, QR decomposition enforces orthogonality constraints:
# Pseudo-code for orthogonal regularization
W_normalized = QR_decompose(W)[0] # Keep only Q part
This ensures weight matrices maintain orthogonal properties, which:
- Prevents feature redundancy
- Improves gradient flow
- Helps keep layer activations well-conditioned (often described as reducing internal covariate shift)
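In practice the hard QR projection shown above is often replaced, or complemented, by a soft penalty on ‖WᵀW − I‖. A minimal PyTorch sketch, with a stand-in task loss and a hypothetical `orthogonality_penalty` helper:

```python
import torch

def orthogonality_penalty(weight, strength=1e-4):
    """Soft orthogonality regularizer: penalize ||W^T W - I||_F^2."""
    gram = weight.t() @ weight
    eye = torch.eye(gram.shape[0], device=weight.device)
    return strength * ((gram - eye) ** 2).sum()

W = torch.randn(128, 64, requires_grad=True)
task_loss = (W.sum() - 1.0) ** 2               # stand-in for a real task loss
loss = task_loss + orthogonality_penalty(W)
loss.backward()                                # gradients now also push W toward orthonormal columns
```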
3. Spectral Normalization
QR decomposition helps control the spectral properties of weight matrices:
- Constrains the singular values of weight matrices
- Particularly useful in GANs for training stability
- Prevents mode collapse and improves convergence
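Strictly speaking, spectral normalization is usually implemented with a few power-iteration steps rather than a full QR or SVD; a minimal sketch of the idea (PyTorch also ships a built-in `torch.nn.utils.parametrizations.spectral_norm`):

```python
import torch

def spectral_normalize(weight, n_iters=10):
    """Estimate the largest singular value by power iteration and rescale the weight by it."""
    u = torch.randn(weight.shape[0])
    for _ in range(n_iters):
        v = weight.t() @ u
        v = v / v.norm()
        u = weight @ v
        u = u / u.norm()
    sigma = torch.dot(u, weight @ v)   # estimate of the dominant singular value
    return weight / sigma

W = torch.randn(64, 32)
W_sn = spectral_normalize(W)
print(torch.linalg.matrix_norm(W_sn, ord=2))  # approximately 1.0: spectral norm constrained
```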
4. Efficient Computation in Transformers
Some research on efficient attention mechanisms uses QR decomposition indirectly:
- Performers build their random feature maps from blocks of orthogonal vectors, typically obtained via QR (Gram-Schmidt) of Gaussian matrices
- These linear-attention methods reduce the cost of attention from O(n²) to roughly O(n) in sequence length
- Orthogonality of the random features lowers the variance of the attention approximation
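A rough sketch of how Performer-style orthogonal random features can be assembled with QR (the helper name and the row-norm rescaling only loosely follow the published FAVOR+ recipe):

```python
import numpy as np

def orthogonal_random_features(n_features, dim, seed=0):
    """Stack orthogonal row-blocks obtained by QR of Gaussian matrices."""
    rng = np.random.default_rng(seed)
    blocks, rows_left = [], n_features
    while rows_left > 0:
        g = rng.standard_normal((dim, dim))
        q, _ = np.linalg.qr(g)                  # rows of q are orthonormal
        blocks.append(q[: min(rows_left, dim)])
        rows_left -= dim
    w = np.concatenate(blocks, axis=0)
    # Rescale rows so their lengths match those of i.i.d. Gaussian vectors
    norms = np.sqrt(rng.chisquare(dim, size=n_features))
    return w * norms[:, None]

Omega = orthogonal_random_features(256, 64)     # 256 features for 64-dim queries/keys
```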
5. Neural ODE Networks
In continuous-depth neural networks (Neural ODEs):
- QR-style orthogonalization can improve the numerical stability of the learned dynamics
- This keeps the differential equation better conditioned for the ODE solver
- Orthogonality can be (approximately) maintained along the continuous transformation
Specific Architectures Using QR Decomposition
Orthogonal RNNs (ORNNs)
- Use QR decomposition to maintain orthogonal recurrent weight matrices
- Address the vanishing gradient problem directly; on some long-dependency benchmarks they outperform LSTM/GRU
- Better at learning long-term dependencies
Cayley Neural Networks
- Use the Cayley transform (an alternative to QR-based parameterizations) to parameterize orthogonal matrices
- Guarantees orthogonality throughout training
- More stable optimization landscape
Householder Networks
- Use Householder reflections (building blocks of QR) as layers
- Efficient parameterization of orthogonal transformations
- Memory-efficient for large matrices
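As an illustration, a minimal PyTorch sketch of a Householder-style layer (the class name `HouseholderLayer` and the choice of k reflections are just for this example):

```python
import torch
import torch.nn as nn

class HouseholderLayer(nn.Module):
    """Orthogonal transform built as a product of k Householder reflections."""
    def __init__(self, dim, k=4):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(k, dim))

    def forward(self, x):
        # Apply H = I - 2 v v^T (with unit v) for each stored reflection vector
        for v in self.vs:
            v = v / v.norm()
            x = x - 2.0 * (x @ v).unsqueeze(-1) * v
        return x

layer = HouseholderLayer(16)
x = torch.randn(8, 16)
print(torch.allclose(layer(x).norm(dim=1), x.norm(dim=1), atol=1e-5))  # True: norms preserved
```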
Implementation Example
Here's a simplified example of how QR decomposition might be used in practice:
```python
import torch
import torch.nn as nn

class QRLinear(nn.Module):
    """Linear layer that uses only the orthogonal factor Q of its weight matrix.
    (Sketch: assumes out_features >= in_features so that Q has shape (out, in).)"""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        # Apply QR decomposition to the weight matrix on every forward pass
        q, r = torch.linalg.qr(self.weight)
        # Use only the orthogonal part Q
        return torch.matmul(x, q.t())
```
Key Benefits
- Training Stability: Orthogonal matrices preserve norm, preventing gradient issues
- Feature Diversity: Orthogonal weights ensure different neurons learn different features
- Computational Efficiency: In some cases, QR structure enables faster computations
- Better Generalization: Orthogonality constraints act as implicit regularization
Challenges
- Computational Cost: QR decomposition can be expensive for large matrices
- Gradient Computation: Backpropagation through QR decomposition requires careful implementation
- Limited Flexibility: Orthogonality constraints may be too restrictive for some tasks
QR decomposition neural networks represent an important technique for improving training dynamics and model performance, especially in scenarios requiring stable gradient flow or efficient computation. They're particularly valuable in RNNs, normalizing flows, and modern transformer variants.
I'll solve simple problems by hand for each method to show how they work step-by-step.
Problem Setup
Let's solve a simple linear system: Find the best-fit line through points (1,3) and (2,5).
This gives us the system:
- a + b = 3 (when x=1, y=3)
- 2a + b = 5 (when x=2, y=5)
In matrix form: Ax = y where:
A = [1 1; 2 1],  x = [a; b],  y = [3; 5]
1. QR Decomposition Solution (by hand)
Step 1: Decompose A into Q and R
Using Gram-Schmidt process to find Q:
Column 1 of A: v₁ = [1, 2]ᵀ
Normalize it:
- ||v₁|| = √(1² + 2²) = √5
- q₁ = v₁/||v₁|| = [1/√5, 2/√5]ᵀ
Column 2 of A: v₂ = [1, 1]ᵀ
Orthogonalize against q₁:
- projection = (v₂ · q₁)q₁ = (1×1/√5 + 1×2/√5) × [1/√5, 2/√5]ᵀ
- projection = (3/√5) × [1/√5, 2/√5]ᵀ = [3/5, 6/5]ᵀ
- v₂⊥ = v₂ - projection = [1, 1]ᵀ - [3/5, 6/5]ᵀ = [2/5, -1/5]ᵀ
Normalize it:
- ||v₂⊥|| = √((2/5)² + (-1/5)²) = √(4/25 + 1/25) = √(5/25) = 1/√5
- q₂ = v₂⊥/||v₂⊥|| = [2/√5, -1/√5]ᵀ
So: Q = [1/√5 2/√5; 2/√5 -1/√5]
Step 2: Calculate R = QᵀA
R₁₁ = q₁ᵀ × column₁(A) = [1/√5, 2/√5] · [1, 2]ᵀ = 1/√5 + 4/√5 = 5/√5 = √5
R₁₂ = q₁ᵀ × column₂(A) = [1/√5, 2/√5] · [1, 1]ᵀ = 1/√5 + 2/√5 = 3/√5
R₂₂ = q₂ᵀ × column₂(A) = [2/√5, -1/√5] · [1, 1]ᵀ = 2/√5 - 1/√5 = 1/√5
So: R = [√5 3/√5; 0 1/√5]
Step 3: Solve Rx = Qᵀy
First, calculate Qᵀy:
Qᵀy = [1/√5 2/√5; 2/√5 -1/√5] [3; 5] = [3/√5 + 10/√5; 6/√5 - 5/√5] = [13/√5; 1/√5]
Now solve Rx = Qᵀy using back substitution:
[√5 3/√5; 0 1/√5] [a; b] = [13/√5; 1/√5]
From row 2: (1/√5)b = 1/√5 → b = 1
From row 1: √5a + (3/√5)×1 = 13/√5
- √5a = 13/√5 - 3/√5 = 10/√5
- a = 10/5 = 2
Answer: a = 2, b = 1 (line equation: y = 2x + 1)
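A quick NumPy check of the hand calculation (NumPy may flip the signs of Q and R, but the solution is the same):

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0]])
y = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ y)
print(x)  # [2. 1.]  ->  y = 2x + 1
```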
2. Gradient Descent Solution (by hand)
Setup
Minimize loss: L = ½[(a + b - 3)² + (2a + b - 5)²]
Starting point: a₀ = 0, b₀ = 0; learning rate: η = 0.1
Iteration 1
Calculate gradients:
- ∂L/∂a = (a + b - 3)×1 + (2a + b - 5)×2
- ∂L/∂a = (0 + 0 - 3)×1 + (0 + 0 - 5)×2 = -3 - 10 = -13
- ∂L/∂b = (a + b - 3)×1 + (2a + b - 5)×1
- ∂L/∂b = (0 + 0 - 3)×1 + (0 + 0 - 5)×1 = -3 - 5 = -8
Update weights:
- a₁ = a₀ - η(∂L/∂a) = 0 - 0.1×(-13) = 1.3
- b₁ = b₀ - η(∂L/∂b) = 0 - 0.1×(-8) = 0.8
Iteration 2
Calculate gradients with a=1.3, b=0.8:
- ∂L/∂a = (1.3 + 0.8 - 3)×1 + (2×1.3 + 0.8 - 5)×2
- ∂L/∂a = (-0.9)×1 + (-1.6)×2 = -0.9 - 3.2 = -4.1
- ∂L/∂b = (1.3 + 0.8 - 3)×1 + (2×1.3 + 0.8 - 5)×1
- ∂L/∂b = (-0.9)×1 + (-1.6)×1 = -0.9 - 1.6 = -2.5
Update weights:
- a₂ = 1.3 - 0.1×(-4.1) = 1.3 + 0.41 = 1.71
- b₂ = 0.8 - 0.1×(-2.5) = 0.8 + 0.25 = 1.05
Iteration 3
Calculate gradients with a=1.71, b=1.05:
- ∂L/∂a = (1.71 + 1.05 - 3)×1 + (2×1.71 + 1.05 - 5)×2
- ∂L/∂a = (-0.24)×1 + (-0.53)×2 = -0.24 - 1.06 = -1.30
- ∂L/∂b = (-0.24)×1 + (-0.53)×1 = -0.77
Update weights:
- a₃ = 1.71 - 0.1×(-1.30) = 1.84
- b₃ = 1.05 - 0.1×(-0.77) = 1.13
After more iterations, this converges toward a = 2, b = 1
3. Matrix Inversion Solution (by hand)
Using Normal Equation: x = (AᵀA)⁻¹Aᵀy
Step 1: Calculate AᵀA
Aᵀ = [1 2; 1 1],  A = [1 1; 2 1]
AᵀA = [1 2; 1 1] [1 1; 2 1] = [1×1+2×2  1×1+2×1; 1×1+1×2  1×1+1×1] = [5 3; 3 2]
Step 2: Calculate Aᵀy
Aᵀy = [1 2; 1 1] [3; 5] = [1×3+2×5; 1×3+1×5] = [13; 8]
Step 3: Find (AᵀA)⁻¹
For a 2×2 matrix: [a b; c d]⁻¹ = 1/(ad-bc) × [d -b; -c a]
For AᵀA = [5 3; 3 2]:
- Determinant = 5×2 - 3×3 = 10 - 9 = 1
(AᵀA)⁻¹ = 1/1 × [2 -3; -3 5] = [2 -3; -3 5]
Step 4: Calculate x = (AᵀA)⁻¹Aᵀy
x = [2 -3; -3 5] [13; 8] = [2×13 + (-3)×8; (-3)×13 + 5×8] = [26-24; -39+40] = [2; 1]
Answer: a = 2, b = 1
Comparison Summary
| Method | Steps | Result | Effort |
|---|---|---|---|
| QR Decomposition | 1. Gram-Schmidt orthogonalization<br>2. Form Q and R matrices<br>3. Back substitution | a=2, b=1<br>(exact) | High complexity, but numerically stable |
| Gradient Descent | 1. Initialize randomly<br>2. Calculate gradients<br>3. Update iteratively<br>4. Repeat until convergence | a≈2, b≈1<br>(after many iterations) | Simple steps, but many iterations needed |
| Matrix Inversion | 1. Calculate AᵀA<br>2. Calculate Aᵀy<br>3. Find inverse<br>4. Multiply | a=2, b=1<br>(exact) | Direct but requires matrix inverse |
Key Insights from Hand Calculations:
- QR: Most work is in the orthogonalization process, but gives stable solution
- Gradient Descent: Simple operations but needs many iterations to converge
- Matrix Inversion: Fastest for small problems but finding inverse can be unstable for larger/ill-conditioned matrices
Each method arrives at the same solution (y = 2x + 1), but through very different mathematical paths!
I'll compare QR decomposition with Gradient Descent and Matrix Inversion in the context of solving linear systems and neural network training, providing clear examples for each.
Overview Comparison Table
| Aspect | QR Decomposition | Gradient Descent | Matrix Inversion |
|---|---|---|---|
| Type | Direct algebraic method | Iterative optimization | Direct algebraic method |
| Use Case | Solving Ax=b, orthogonalization | Minimizing loss functions | Solving Ax=b exactly |
| Computational Complexity | O(n³) | O(n²) per iteration | O(n³) |
| Stability | Very stable | Depends on learning rate | Unstable for ill-conditioned matrices |
| Memory | Moderate | Low | High |
| Exact vs Approximate | Exact (numerical precision) | Approximate | Exact (if matrix invertible) |
1. QR Decomposition Example
Solving Linear Regression: Ax = b
```python
import numpy as np

# Problem: Find best-fit line for points
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])  # Design matrix [1, x]
y = np.array([2, 3, 4, 5])                      # Target values

# QR Decomposition Solution
Q, R = np.linalg.qr(X)
# Solve Rx = Q^T y
weights_qr = np.linalg.solve(R, Q.T @ y)
print("QR Solution:", weights_qr)  # [1.0, 1.0] - equation: y = 1 + 1*x
```
Neural Network Application:
```python
import torch
import torch.nn as nn

class OrthogonalLayer(nn.Module):
    """Layer that maintains orthogonal weights using QR"""
    def __init__(self, size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(size, size))

    def forward(self, x):
        # Ensure weights stay orthogonal
        Q, R = torch.linalg.qr(self.weight)
        return torch.matmul(x, Q)

# Usage
layer = OrthogonalLayer(64)
x = torch.randn(32, 64)
output = layer(x)  # Maintains orthogonality throughout training
```
Advantages:
- Numerically stable
- Preserves orthogonality
- One-shot solution
Disadvantages:
- Computationally expensive for large matrices
- Not suitable for non-linear problems
2. Gradient Descent Example
Same Linear Regression Problem
```python
import numpy as np

# Same data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 4, 5])

# Gradient Descent Solution
weights_gd = np.random.randn(2)  # Random initialization
learning_rate = 0.01
epochs = 1000

for epoch in range(epochs):
    # Forward pass
    predictions = X @ weights_gd
    # Compute loss (MSE)
    loss = np.mean((predictions - y) ** 2)
    # Compute gradients
    gradients = 2 * X.T @ (predictions - y) / len(y)
    # Update weights
    weights_gd -= learning_rate * gradients
    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}, Weights: {weights_gd}")

print("Final GD Solution:", weights_gd)  # Converges toward [1.0, 1.0]
```
Neural Network Application:
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Simple neural network trained with gradient descent
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Training with gradient descent
model = SimpleNet()
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    output = model(torch.randn(32, 10))
    loss = criterion(output, torch.randn(32, 1))
    loss.backward()   # Compute gradients
    optimizer.step()  # Update weights via gradient descent
```
Advantages:
- Works for non-linear optimization
- Memory efficient
- Can handle large-scale problems
- Online learning capability
Disadvantages:
- Requires many iterations
- Sensitive to learning rate
- Can get stuck in local minima
3. Matrix Inversion Example
Direct Solution for Linear System
```python
import numpy as np

# Same problem
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 4, 5])

# Matrix Inversion Solution (Normal Equation)
# Solve: (X^T X) w = X^T y
XtX = X.T @ X
Xty = X.T @ y

# Direct inversion
XtX_inv = np.linalg.inv(XtX)
weights_inv = XtX_inv @ Xty
print("Matrix Inversion Solution:", weights_inv)  # [1.0, 1.0]

# Or using pseudo-inverse (more stable)
weights_pinv = np.linalg.pinv(X) @ y
print("Pseudo-inverse Solution:", weights_pinv)  # [1.0, 1.0]
```
Neural Network Application:
```python
import torch
import torch.nn as nn

class DirectSolveLayer(nn.Module):
    """Layer that directly computes optimal weights using the pseudo-inverse"""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = None

    def fit(self, X, y):
        # Directly solve for optimal weights: w = (X^T X)^(-1) X^T y = pinv(X) y
        self.weight = torch.linalg.pinv(X) @ y   # shape: (in_features, out_features)

    def forward(self, x):
        return x @ self.weight

# Usage - direct fitting without iterative training
layer = DirectSolveLayer(10, 1)
X_train = torch.randn(100, 10)
y_train = torch.randn(100, 1)
layer.fit(X_train, y_train)  # One-shot solution
```
Advantages:
- Exact solution in one step
- No hyperparameter tuning
- Guaranteed global optimum for convex problems
Disadvantages:
- O(n³) complexity
- Numerically unstable for ill-conditioned matrices
- Requires entire dataset in memory
- Only works for linear problems
Practical Comparison Example
Let's solve the same problem with all three methods:
```python
import numpy as np
import time

# Generate larger problem
np.random.seed(42)
n_samples, n_features = 1000, 100
X = np.random.randn(n_samples, n_features)
true_weights = np.random.randn(n_features)
y = X @ true_weights + 0.1 * np.random.randn(n_samples)

# 1. QR Decomposition
start = time.time()
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)
time_qr = time.time() - start

# 2. Gradient Descent
start = time.time()
w_gd = np.zeros(n_features)
lr = 0.01
for _ in range(1000):
    grad = X.T @ (X @ w_gd - y) / n_samples
    w_gd -= lr * grad
time_gd = time.time() - start

# 3. Matrix Inversion (via normal equation)
start = time.time()
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y
time_inv = time.time() - start

# Compare results
print(f"QR Error: {np.linalg.norm(w_qr - true_weights):.4f}, Time: {time_qr:.4f}s")
print(f"GD Error: {np.linalg.norm(w_gd - true_weights):.4f}, Time: {time_gd:.4f}s")
print(f"Inv Error: {np.linalg.norm(w_inv - true_weights):.4f}, Time: {time_inv:.4f}s")
```
When to Use Each Method
Use QR Decomposition when:
- You need numerical stability
- Solving medium-sized linear systems
- Maintaining orthogonality is important (e.g., in RNNs)
- Working with ill-conditioned matrices
Use Gradient Descent when:
- Problem is non-linear
- Dataset is too large for memory
- Need online/streaming updates
- Working with neural networks
Use Matrix Inversion when:
- Need exact solution immediately
- Problem is small and well-conditioned
- Have sufficient memory
- Working with simple linear regression
In neural networks, gradient descent dominates because of non-linearity and scale, but QR decomposition appears in specialized architectures for stability, and matrix inversion is rarely used except in specific analytical solutions or small linear layers.
Other Methods Used with Neural Networks
Excellent question! Beyond QR decomposition, gradient descent, and matrix inversion, neural networks employ many other optimization and mathematical methods. Let me break down the major categories with hand-calculation examples where feasible.
1. Advanced Gradient-Based Methods
Momentum-Based Methods
Classical Momentum
Instead of just using current gradient, accumulate velocity:
velocity = β × velocity + gradient
weights = weights - learning_rate × velocity
Hand Example: Starting from previous gradient descent at iteration 2 (a=1.3, b=0.8):
- Previous velocity: v_a = 0, v_b = 0
- Current gradient: ∇a = -4.1, ∇b = -2.5
- β = 0.9 (momentum coefficient)
v_a = 0.9 × 0 + (-4.1) = -4.1
v_b = 0.9 × 0 + (-2.5) = -2.5
a_new = 1.3 - 0.1 × (-4.1) = 1.71
b_new = 0.8 - 0.1 × (-2.5) = 1.05
(This first momentum step is identical to plain gradient descent because the velocity starts at zero; on later steps the accumulated velocity speeds movement in consistently downhill directions.)
Adam (Adaptive Moment Estimation)
Combines momentum with adaptive learning rates:
m = β₁ × m + (1-β₁) × gradient (momentum term)
v = β₂ × v + (1-β₂) × gradient² (RMSprop term)
m̂ = m/(1-β₁^t), v̂ = v/(1-β₂^t) (bias correction)
weights = weights - lr × m̂/(√v̂ + ε)
Key Insight: Adapts learning rate per parameter, faster convergence than vanilla gradient descent
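For concreteness, a single hand-sized Adam update in NumPy, applied to the same starting point (a=0, b=0) as the gradient-descent example (the helper name `adam_step` is just for this sketch):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.0, 0.0])                   # [a, b]
grad = np.array([-13.0, -8.0])             # gradients at a=0, b=0 from the earlier example
w, m, v = adam_step(w, grad, np.zeros(2), np.zeros(2), t=1)
print(w)                                   # ~[0.1, 0.1]: each parameter moves by about lr on step 1
```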
Newton's Method and Quasi-Newton Methods
Newton's Method
Uses second-order derivatives (Hessian):
x_new = x_old - H⁻¹∇f
Hand Example, minimizing f(x) = x² - 4:
- Start: x₀ = 3
- f'(x) = 2x, f''(x) = 2
- x₁ = 3 - (2×3)/2 = 3 - 3 = 0, the exact minimizer: one full Newton step solves a quadratic
- With damping (step size 0.5) the iterates approach it gradually instead: x₁ = 3 - 0.5×(6/2) = 1.5, x₂ = 1.5 - 0.5×(3/2) = 0.75, ... → converges to 0
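A tiny sketch of the (optionally damped) Newton iteration for this example (function names are just for illustration):

```python
def newton_minimize(x, grad, hess, steps=5, damping=1.0):
    """Damped Newton iteration: x <- x - damping * f'(x) / f''(x)."""
    for _ in range(steps):
        x = x - damping * grad(x) / hess(x)
    return x

grad = lambda x: 2 * x      # f(x) = x^2 - 4  ->  f'(x) = 2x
hess = lambda x: 2.0        # f''(x) = 2

print(newton_minimize(3.0, grad, hess, steps=1))               # 0.0: one full step solves a quadratic
print(newton_minimize(3.0, grad, hess, steps=3, damping=0.5))  # 0.375: damped steps halve the distance
```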
L-BFGS (Limited-memory BFGS)
Approximates Hessian using gradient history:
- Stores last m (typically 5-20) gradient updates
- Builds approximate inverse Hessian
Key Insight: Often converges in far fewer iterations than plain gradient descent, but needs full-batch (or large-batch) gradients and stores extra vector pairs, which limits its use for very large networks
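In PyTorch, L-BFGS is available as `torch.optim.LBFGS`; a minimal sketch fitting the same two-point line from earlier (the hyperparameters are arbitrary choices for this example):

```python
import torch

params = torch.zeros(2, requires_grad=True)        # [a, b]
optimizer = torch.optim.LBFGS([params], lr=1.0, history_size=10,
                              line_search_fn="strong_wolfe")

A = torch.tensor([[1.0, 1.0], [2.0, 1.0]])
y = torch.tensor([3.0, 5.0])

def closure():
    # L-BFGS re-evaluates the loss several times per step, so it needs a closure
    optimizer.zero_grad()
    loss = ((A @ params - y) ** 2).sum()
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)
print(params)                                      # ~[2., 1.], i.e. y = 2x + 1
```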
2. Stochastic Methods
Stochastic Gradient Descent (SGD)
Uses random mini-batches instead of full dataset:
Hand Example: Dataset: [(1,3), (2,5), (3,7), (4,9)] Instead of using all 4 points, randomly pick 1-2 each iteration:
- Iteration 1: Use only (2,5) → gradient based on single point
- Iteration 2: Use only (3,7) → different gradient
- Adds noise but enables online learning
Simulated Annealing
Probabilistically accepts worse solutions to escape local minima:
If new_loss < old_loss: accept
Else: accept with probability e^(-(new_loss-old_loss)/T)
Key Insight: Temperature T decreases over time, allowing exploration early and exploitation later
3. Eigenvalue/Eigenvector Methods
Power Iteration
Finds dominant eigenvector:
Hand Example: Matrix A = [2 1; 1 2], start v₀ = [1; 0]
v₁ = Av₀ = [2; 1], normalize: [2/√5; 1/√5]
v₂ = Av₁ = [5/√5; 4/√5], normalize: [5/√41; 4/√41]
Converges to dominant eigenvector
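The same iteration in a few lines of NumPy, run until convergence:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
v = np.array([1.0, 0.0])
for _ in range(20):
    v = A @ v
    v = v / np.linalg.norm(v)
print(v)          # ~[0.707, 0.707]: the dominant eigenvector
print(v @ A @ v)  # ~3.0: the dominant eigenvalue (Rayleigh quotient)
```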
Singular Value Decomposition (SVD)
Decomposes A = UΣVᵀ
Application in Neural Networks:
- Weight compression
- Noise reduction
- Computing pseudo-inverse
Key Insight: More general than eigendecomposition, works for non-square matrices
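A small NumPy sketch of SVD-based weight compression (the matrix sizes and the rank k are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))                 # stand-in weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 32                                              # keep only the top-k singular values
W_k = (U[:, :k] * S[:k]) @ Vt[:k]                   # rank-k approximation of W
print(np.linalg.norm(W - W_k) / np.linalg.norm(W))  # relative error of the compressed weights
```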
4. Iterative Linear System Solvers
Conjugate Gradient Method
For solving Ax = b iteratively:
Hand Example for simple 2×2:
A = [4 1; 1 3], b = [1; 2]
- Start: x₀ = [0; 0], r₀ = b - Ax₀ = [1; 2]
- p₀ = r₀ = [1; 2]
- α₀ = (r₀ᵀr₀)/(p₀ᵀAp₀) = 5/20 = 1/4
- x₁ = x₀ + α₀p₀ = [1/4; 1/2]
Key Insight: In exact arithmetic, converges in at most n steps for an n×n symmetric positive-definite matrix, with no explicit matrix inversion needed
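The full iteration, verifying the corrected hand calculation in NumPy:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
r = b - A @ x
p = r.copy()
for _ in range(2):                        # exact solution in n = 2 steps for this 2x2 SPD matrix
    alpha = (r @ r) / (p @ A @ p)         # first pass gives alpha_0 = 5/20 = 1/4
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
print(x)                                  # ~[0.0909, 0.6364], i.e. x = [1/11, 7/11] solves Ax = b
```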
Jacobi Method
Iterative solver using diagonal dominance:
x_i^(k+1) = (b_i - Σ(j≠i) a_ij × x_j^(k)) / a_ii
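A minimal NumPy sketch of Jacobi iteration on the same 2×2 system used for conjugate gradient above:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # diagonally dominant, so Jacobi converges
b = np.array([1.0, 2.0])

x = np.zeros(2)
D = np.diag(A)                            # diagonal entries a_ii
R = A - np.diag(D)                        # off-diagonal part
for _ in range(25):
    x = (b - R @ x) / D                   # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii
print(x)                                  # converges to ~[0.0909, 0.6364]
```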
5. Specialized Neural Network Methods
Backpropagation Through Time (BPTT)
For RNNs - unrolls network and applies chain rule:
∂L/∂W = Σ_t ∂L_t/∂W
Contrastive Divergence
For training Restricted Boltzmann Machines:
- Positive phase: sample from data
- Negative phase: sample from model
- Update: weights += learning_rate × (positive - negative)
Expectation Maximization (EM)
For mixture models and latent variables:
- E-step: Estimate hidden variables
- M-step: Maximize parameters
- Repeat until convergence
6. Regularization and Constraint Methods
Proximal Gradient Methods
For L1 regularization (sparsity):
Soft threshold: S(x,λ) = sign(x) × max(|x|-λ, 0)
Hand Example: weight after gradient step = 1.5, L1 penalty λ = 0.3; new weight = sign(1.5) × max(1.5 - 0.3, 0) = 1.2
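The soft-threshold operator in a few lines of NumPy, reproducing the hand example:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 norm: S(x, lam) = sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([1.5, -0.2, 0.05]), 0.3))  # [1.2  0.   0. ]
```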
Lagrangian Methods
For constrained optimization:
L(x,λ) = f(x) + λg(x)
7. Probabilistic/Sampling Methods
Monte Carlo Methods
Approximate expectations through sampling:
E[f(x)] ≈ (1/N) Σ f(x_i) where x_i ~ p(x)
Variational Inference
Approximate posterior with simpler distribution:
KL(q||p) = E_q[log q] - E_q[log p]
Comparison Table
| Method | Convergence Rate | Memory | Stability | Use Case |
|---|---|---|---|---|
| Vanilla GD | O(1/k) | Low | Good | Simple problems |
| Momentum | Up to O(1/k²) (Nesterov) | Low | Good | General training |
| Adam | Adaptive | Moderate | Very Good | Default choice |
| Newton | Quadratic (near the optimum) | High | Poor | Small, smooth problems |
| L-BFGS | Superlinear | Moderate | Good | Batch optimization |
| SGD | O(1/√k) | Very Low | Good | Large datasets |
| Conjugate Gradient | O(√κ log(1/ε)) | Low | Good | Linear systems |
| SVD | Direct | High | Excellent | Dimensionality reduction |
Key Insights from All Methods:
- QR: Stability through orthogonalization
- Gradient Descent: Simple but slow
- Matrix Inversion: Fast but unstable
- Momentum Methods: Accelerate convergence using history
- Adaptive Methods (Adam): Adjust learning per parameter
- Second-Order (Newton): Use curvature information
- Stochastic Methods: Trade accuracy for speed/scalability
- Iterative Solvers: Avoid explicit inversion
- Decomposition Methods: Reveal structure in data
- Probabilistic Methods: Handle uncertainty
Modern neural networks typically use:
- Adam or SGD with momentum for training (the overwhelming majority of cases)
- L-BFGS for small networks or fine-tuning
- SVD/QR for initialization or specific architectures
- Almost never use direct matrix inversion due to scale
The choice depends on problem size, required accuracy, available memory, and whether the problem is convex or non-convex.