
QR Decomposition in Neural Networks, Compared to Gradient Descent and Matrix Inversion

QR decomposition in neural networks refers to using the QR matrix factorization technique within neural network architectures. Let me explain what this means and why it's useful.

What is QR Decomposition?

First, QR decomposition is a mathematical technique that factors a matrix A into two matrices:

  • Q: An orthogonal matrix (columns are orthonormal vectors)
  • R: An upper triangular matrix

So: A = QR
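
For a concrete feel, here is a minimal numpy sketch (matrix values are arbitrary) verifying the two properties of the factors:

import numpy as np

# A small 3x2 example matrix (values chosen only for illustration)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 3x2, R is 2x2

print(np.allclose(Q.T @ Q, np.eye(2)))   # True: columns of Q are orthonormal
print(np.allclose(np.triu(R), R))        # True: R is upper triangular
print(np.allclose(Q @ R, A))             # True: A = QR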

Applications in Neural Networks

1. Weight Initialization and Normalization

QR decomposition is used to initialize weight matrices with orthogonal properties:

  • Orthogonal initialization: Weights are initialized as orthogonal matrices using QR decomposition of random matrices
  • Benefits: Helps prevent vanishing/exploding gradients and improves training stability
  • Example: In RNNs and deep networks, orthogonal initialization helps maintain gradient flow
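
For illustration, here is a minimal sketch of this recipe (the function name is made up for this example, and it assumes out_features ≥ in_features; PyTorch's built-in torch.nn.init.orthogonal_ implements a production version of the same idea):

import torch

def qr_orthogonal_init(out_features, in_features):
    # Orthogonal initialization via QR of a random Gaussian matrix; Q has orthonormal columns
    W = torch.randn(out_features, in_features)
    Q, _ = torch.linalg.qr(W)   # reduced QR: Q has shape (out_features, min(out, in))
    return Q

W0 = qr_orthogonal_init(128, 64)
print(torch.allclose(W0.T @ W0, torch.eye(64), atol=1e-5))  # True: columns are orthonormal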

2. Orthogonal Regularization

During training, QR decomposition can be used to enforce orthogonality constraints:

# Pseudo-code for orthogonal regularization
Q, R = QR_decompose(W)
W_normalized = Q  # keep only the orthogonal part Q

This ensures weight matrices maintain orthogonal properties, which:

  • Prevents feature redundancy
  • Improves gradient flow
  • Reduces internal covariate shift

3. Spectral Normalization

QR decomposition helps control the spectral properties of weight matrices:

  • Constrains the singular values of weight matrices
  • Particularly useful in GANs for training stability
  • Prevents mode collapse and improves convergence
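
A minimal sketch of why this works: replacing a weight matrix by its Q factor forces every singular value to 1, which caps the layer's spectral norm (the matrix sizes below are arbitrary):

import torch

W = torch.randn(256, 128)
Q, _ = torch.linalg.qr(W)               # Q has orthonormal columns
s = torch.linalg.svdvals(Q)             # singular values of Q
print(s.min().item(), s.max().item())   # both ≈ 1.0, so the spectral norm is 1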

4. Efficient Computation in Transformers

Recent research uses QR decomposition for efficient attention mechanisms:

  • Performers and related linear-attention models use QR decomposition of random matrices to build orthogonal random features that approximate attention
  • Reduces computational complexity from O(n²) to O(n)
  • Maintains mathematical properties while speeding up computation

5. Neural ODE Networks

In continuous-depth neural networks (Neural ODEs):

  • QR decomposition ensures numerical stability
  • Helps solve differential equations within the network
  • Maintains orthogonality during continuous transformations

Specific Architectures Using QR Decomposition

Orthogonal RNNs (ORNNs)

  • Use QR decomposition to maintain orthogonal recurrent weight matrices
  • Solve the vanishing gradient problem more effectively than LSTM/GRU
  • Better at learning long-term dependencies

Cayley Neural Networks

  • Use Cayley transform (related to QR) to parameterize orthogonal matrices
  • Guarantees orthogonality throughout training
  • More stable optimization landscape

Householder Networks

  • Use Householder reflections (building blocks of QR) as layers
  • Efficient parameterization of orthogonal transformations
  • Memory-efficient for large matrices
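
As a rough sketch of the idea (an illustrative layer, not a specific published architecture): a Householder reflection H = I - 2vvᵀ/‖v‖² is orthogonal, and composing a few reflections gives an orthogonal transform with only O(n) parameters per reflection:

import torch
import torch.nn as nn

class HouseholderLayer(nn.Module):
    """Applies a product of k Householder reflections to the input (illustrative sketch)."""
    def __init__(self, dim, k=4):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(k, dim))   # one vector per reflection

    def forward(self, x):
        for v in self.vs:
            v = v / v.norm()
            # H x = x - 2 v (v·x), applied to each row of the batch
            x = x - 2.0 * torch.outer(x @ v, v)
        return x

layer = HouseholderLayer(64)
x = torch.randn(32, 64)
y = layer(x)
print(torch.allclose(x.norm(dim=1), y.norm(dim=1), atol=1e-4))  # True: orthogonal maps preserve norms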

Implementation Example

Here's a simplified example of how QR decomposition might be used in practice:

import torch
import torch.nn as nn

class QRLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        
    def forward(self, x):
        # Factor the weight matrix (torch.qr is deprecated; torch.linalg.qr is the current API)
        q, r = torch.linalg.qr(self.weight)
        # Use only the orthogonal part Q (assumes out_features >= in_features)
        return torch.matmul(x, q.t())

Key Benefits

  1. Training Stability: Orthogonal matrices preserve norm, preventing gradient issues
  2. Feature Diversity: Orthogonal weights ensure different neurons learn different features
  3. Computational Efficiency: In some cases, QR structure enables faster computations
  4. Better Generalization: Orthogonality constraints act as implicit regularization

Challenges

  • Computational Cost: QR decomposition can be expensive for large matrices
  • Gradient Computation: Backpropagation through QR decomposition requires careful implementation
  • Limited Flexibility: Orthogonality constraints may be too restrictive for some tasks

QR decomposition neural networks represent an important technique for improving training dynamics and model performance, especially in scenarios requiring stable gradient flow or efficient computation. They're particularly valuable in RNNs, normalizing flows, and modern transformer variants.


I'll solve simple problems by hand for each method to show how they work step-by-step.

Problem Setup

Let's solve a simple linear system: Find the best-fit line through points (1,3) and (2,5).

This gives us the system:

  • a + b = 3 (when x=1, y=3)
  • 2a + b = 5 (when x=2, y=5)

In matrix form: Ax = y where:

A = [1  1]    x = [a]    y = [3]
    [2  1]        [b]        [5]

1. QR Decomposition Solution (by hand)

Step 1: Decompose A into Q and R

Using Gram-Schmidt process to find Q:

Column 1 of A: v₁ = [1, 2]ᵀ

Normalize it:

  • ||v₁|| = √(1² + 2²) = √5
  • q₁ = v₁/||v₁|| = [1/√5, 2/√5]ᵀ

Column 2 of A: v₂ = [1, 1]ᵀ

Orthogonalize against q₁:

  • projection = (v₂ · q₁)q₁ = ((1×1/√5 + 1×2/√5)) × [1/√5, 2/√5]ᵀ
  • projection = (3/√5) × [1/√5, 2/√5]ᵀ = [3/5, 6/5]ᵀ
  • v₂⊥ = v₂ - projection = [1, 1]ᵀ - [3/5, 6/5]ᵀ = [2/5, -1/5]ᵀ

Normalize it:

  • ||v₂⊥|| = √((2/5)² + (-1/5)²) = √(4/25 + 1/25) = √(5/25) = 1/√5
  • q₂ = v₂⊥/||v₂⊥|| = [2/√5, -1/√5]ᵀ

So: Q = [1/√5   2/√5]
        [2/√5  -1/√5]

Step 2: Calculate R = QᵀA

R₁₁ = q₁ᵀ × column₁(A) = [1/√5, 2/√5] · [1, 2]ᵀ = 1/√5 + 4/√5 = 5/√5 = √5

R₁₂ = q₁ᵀ × column₂(A) = [1/√5, 2/√5] · [1, 1]ᵀ = 1/√5 + 2/√5 = 3/√5

R₂₂ = q₂ᵀ × column₂(A) = [2/√5, -1/√5] · [1, 1]ᵀ = 2/√5 - 1/√5 = 1/√5

So: R = [√5   3/√5]
        [0    1/√5]

Step 3: Solve Rx = Qᵀy

First, calculate Qᵀy:

Qᵀy = [1/√5   2/√5] [3]   = [3/√5 + 10/√5]  = [13/√5]
      [2/√5  -1/√5] [5]     [6/√5 - 5/√5]      [1/√5]

Now solve Rx = Qᵀy using back substitution:

[√5   3/√5] [a]   [13/√5]
[0    1/√5] [b] = [1/√5]

From row 2: (1/√5)b = 1/√5 → b = 1

From row 1: √5a + (3/√5)×1 = 13/√5

  • √5a = 13/√5 - 3/√5 = 10/√5
  • a = 10/5 = 2

Answer: a = 2, b = 1 (line equation: y = 2x + 1)
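
A quick numerical check of the hand calculation (a minimal numpy sketch):

import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
y = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)
x = np.linalg.solve(R, Q.T @ y)
print(x)   # [2. 1.]  ->  a = 2, b = 1, matching the hand calculation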


2. Gradient Descent Solution (by hand)

Setup

Minimize loss: L = ½[(a + b - 3)² + (2a + b - 5)²]

Starting point: a₀ = 0, b₀ = 0
Learning rate: η = 0.1

Iteration 1

Calculate gradients:

  • ∂L/∂a = (a + b - 3)×1 + (2a + b - 5)×2

  • ∂L/∂a = (0 + 0 - 3)×1 + (0 + 0 - 5)×2 = -3 - 10 = -13

  • ∂L/∂b = (a + b - 3)×1 + (2a + b - 5)×1

  • ∂L/∂b = (0 + 0 - 3)×1 + (0 + 0 - 5)×1 = -3 - 5 = -8

Update weights:

  • a₁ = a₀ - η(∂L/∂a) = 0 - 0.1×(-13) = 1.3
  • b₁ = b₀ - η(∂L/∂b) = 0 - 0.1×(-8) = 0.8

Iteration 2

Calculate gradients with a=1.3, b=0.8:

  • ∂L/∂a = (1.3 + 0.8 - 3)×1 + (2×1.3 + 0.8 - 5)×2

  • ∂L/∂a = (-0.9)×1 + (-1.6)×2 = -0.9 - 3.2 = -4.1

  • ∂L/∂b = (1.3 + 0.8 - 3)×1 + (2×1.3 + 0.8 - 5)×1

  • ∂L/∂b = (-0.9)×1 + (-1.6)×1 = -0.9 - 1.6 = -2.5

Update weights:

  • a₂ = 1.3 - 0.1×(-4.1) = 1.3 + 0.41 = 1.71
  • b₂ = 0.8 - 0.1×(-2.5) = 0.8 + 0.25 = 1.05

Iteration 3

Calculate gradients with a=1.71, b=1.05:

  • ∂L/∂a = (1.71 + 1.05 - 3)×1 + (2×1.71 + 1.05 - 5)×2

  • ∂L/∂a = (-0.24)×1 + (-0.53)×2 = -0.24 - 1.06 = -1.30

  • ∂L/∂b = (-0.24)×1 + (-0.53)×1 = -0.77

Update weights:

  • a₃ = 1.71 - 0.1×(-1.30) = 1.84
  • b₃ = 1.05 - 0.1×(-0.77) = 1.13

After more iterations, this converges toward a = 2, b = 1
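
The same iterations in a few lines of Python (a sketch that reproduces the hand calculation above and then runs to convergence):

a, b = 0.0, 0.0
lr = 0.1
for i in range(1, 1001):
    r1 = a + b - 3           # residual of the first equation
    r2 = 2*a + b - 5         # residual of the second equation
    grad_a = r1 + 2*r2
    grad_b = r1 + r2
    a -= lr * grad_a
    b -= lr * grad_b
    if i <= 3:
        print(i, round(a, 2), round(b, 2))   # 1.3/0.8, 1.71/1.05, 1.84/1.13, as above
print(round(a, 3), round(b, 3))              # 2.0 1.0 after enough iterations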


3. Matrix Inversion Solution (by hand)

Using Normal Equation: x = (AᵀA)⁻¹Aᵀy

Step 1: Calculate AᵀA

Aᵀ = [1  2]    A = [1  1]
     [1  1]        [2  1]

AᵀA = [1  2] [1  1] = [1×1+2×2  1×1+2×1] = [5  3]
      [1  1] [2  1]   [1×1+1×2  1×1+1×1]   [3  2]

Step 2: Calculate Aᵀy

Aᵀy = [1  2] [3] = [1×3+2×5] = [13]
      [1  1] [5]   [1×3+1×5]   [8]

Step 3: Find (AᵀA)⁻¹

For 2×2 matrix inverse:

[a  b]⁻¹ = 1/(ad-bc) × [d  -b]
[c  d]                  [-c  a]

For AᵀA = [5  3]
          [3  2]

  • Determinant = 5×2 - 3×3 = 10 - 9 = 1

(AᵀA)⁻¹ = 1/1 × [2  -3] = [2  -3]
                [-3  5]   [-3  5]

Step 4: Calculate x = (AᵀA)⁻¹Aᵀy

x = [2  -3] [13] = [2×13 + (-3)×8] = [26-24] = [2]
    [-3  5] [8]    [(-3)×13 + 5×8]   [-39+40]  [1]

Answer: a = 2, b = 1
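
And a numpy check of the normal-equation solution (a minimal sketch):

import numpy as np

A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
y = np.array([3.0, 5.0])

x = np.linalg.inv(A.T @ A) @ (A.T @ y)
print(x)   # [2. 1.]  ->  a = 2, b = 1, same answer as QR and gradient descent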


Comparison Summary

Method           | Steps                                                                                            | Result                               | Effort
QR Decomposition | 1. Gram-Schmidt orthogonalization  2. Form Q and R matrices  3. Back substitution               | a = 2, b = 1 (exact)                 | High complexity, but numerically stable
Gradient Descent | 1. Initialize randomly  2. Calculate gradients  3. Update iteratively  4. Repeat until convergence | a ≈ 2, b ≈ 1 (after many iterations) | Simple steps, but many iterations needed
Matrix Inversion | 1. Calculate AᵀA  2. Calculate Aᵀy  3. Find inverse  4. Multiply                                 | a = 2, b = 1 (exact)                 | Direct, but requires a matrix inverse

Key Insights from Hand Calculations:

  1. QR: Most work is in the orthogonalization process, but gives stable solution
  2. Gradient Descent: Simple operations but needs many iterations to converge
  3. Matrix Inversion: Fastest for small problems but finding inverse can be unstable for larger/ill-conditioned matrices

Each method arrives at the same solution (y = 2x + 1), but through very different mathematical paths!

I'll compare QR decomposition with Gradient Descent and Matrix Inversion in the context of solving linear systems and neural network training, providing clear examples for each.

Overview Comparison Table

Aspect                   | QR Decomposition                | Gradient Descent          | Matrix Inversion
Type                     | Direct algebraic method         | Iterative optimization    | Direct algebraic method
Use Case                 | Solving Ax=b, orthogonalization | Minimizing loss functions | Solving Ax=b exactly
Computational Complexity | O(n³)                           | O(n²) per iteration       | O(n³)
Stability                | Very stable                     | Depends on learning rate  | Unstable for ill-conditioned matrices
Memory                   | Moderate                        | Low                       | High
Exact vs Approximate     | Exact (numerical precision)     | Approximate               | Exact (if matrix invertible)

1. QR Decomposition Example

Solving Linear Regression: Ax = b

import numpy as np

# Problem: Find best-fit line for points
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])  # Design matrix
y = np.array([2, 3, 4, 5])  # Target values

# QR Decomposition Solution
Q, R = np.linalg.qr(X)
# Solve Rx = Q^T y
weights_qr = np.linalg.solve(R, Q.T @ y)

print("QR Solution:", weights_qr)  # [1.0, 1.0] - equation: y = 1 + 1*x

Neural Network Application:

import torch
import torch.nn as nn

class OrthogonalLayer(nn.Module):
    """Layer that maintains orthogonal weights using QR"""
    def __init__(self, size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(size, size))
    
    def forward(self, x):
        # Ensure weights stay orthogonal
        Q, R = torch.linalg.qr(self.weight)
        return torch.matmul(x, Q)

# Usage
layer = OrthogonalLayer(64)
x = torch.randn(32, 64)
output = layer(x)  # Maintains orthogonality throughout training

Advantages:

  • Numerically stable
  • Preserves orthogonality
  • One-shot solution

Disadvantages:

  • Computationally expensive for large matrices
  • Not suitable for non-linear problems

2. Gradient Descent Example

Same Linear Regression Problem

import numpy as np

# Same data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 4, 5])

# Gradient Descent Solution
weights_gd = np.random.randn(2)  # Random initialization
learning_rate = 0.01
epochs = 1000

for epoch in range(epochs):
    # Forward pass
    predictions = X @ weights_gd
    
    # Compute loss (MSE)
    loss = np.mean((predictions - y) ** 2)
    
    # Compute gradients
    gradients = 2 * X.T @ (predictions - y) / len(y)
    
    # Update weights
    weights_gd -= learning_rate * gradients
    
    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}, Weights: {weights_gd}")

print("Final GD Solution:", weights_gd)  # Converges to [1.0, 1.0]

Neural Network Application:

import torch
import torch.nn as nn
import torch.optim as optim

# Simple neural network trained with gradient descent
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Training with gradient descent
model = SimpleNet()
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    output = model(torch.randn(32, 10))
    loss = criterion(output, torch.randn(32, 1))
    loss.backward()  # Compute gradients
    optimizer.step()  # Update weights via gradient descent

Advantages:

  • Works for non-linear optimization
  • Memory efficient
  • Can handle large-scale problems
  • Online learning capability

Disadvantages:

  • Requires many iterations
  • Sensitive to learning rate
  • Can get stuck in local minima

3. Matrix Inversion Example

Direct Solution for Linear System

import numpy as np

# Same problem
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 4, 5])

# Matrix Inversion Solution (Normal Equation)
# Solve: (X^T X) w = X^T y
XtX = X.T @ X
Xty = X.T @ y

# Direct inversion
XtX_inv = np.linalg.inv(XtX)
weights_inv = XtX_inv @ Xty

print("Matrix Inversion Solution:", weights_inv)  # [1.0, 1.0]

# Or using pseudo-inverse (more stable)
weights_pinv = np.linalg.pinv(X) @ y
print("Pseudo-inverse Solution:", weights_pinv)  # [1.0, 1.0]

Neural Network Application:

import torch
import torch.nn as nn

class DirectSolveLayer(nn.Module):
    """Layer that directly computes optimal weights using matrix inversion"""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
    def fit(self, X, y):
        # Directly solve for optimal weights using pseudo-inverse
        # w = (X^T X)^(-1) X^T y
        self.weight = torch.linalg.pinv(X) @ y
        
    def forward(self, x):
        # self.weight has shape (in_features, 1) after fit, so multiply directly (no transpose)
        return x @ self.weight

# Usage - direct fitting without iterative training
layer = DirectSolveLayer(10, 1)
X_train = torch.randn(100, 10)
y_train = torch.randn(100, 1)
layer.fit(X_train, y_train)  # One-shot solution

Advantages:

  • Exact solution in one step
  • No hyperparameter tuning
  • Guaranteed global optimum for convex problems

Disadvantages:

  • O(n³) complexity
  • Numerically unstable for ill-conditioned matrices
  • Requires entire dataset in memory
  • Only works for linear problems

Practical Comparison Example

Let's solve the same problem with all three methods:

import numpy as np
import time

# Generate larger problem
np.random.seed(42)
n_samples, n_features = 1000, 100
X = np.random.randn(n_samples, n_features)
true_weights = np.random.randn(n_features)
y = X @ true_weights + 0.1 * np.random.randn(n_samples)

# 1. QR Decomposition
start = time.time()
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)
time_qr = time.time() - start

# 2. Gradient Descent
start = time.time()
w_gd = np.zeros(n_features)
lr = 0.01
for _ in range(1000):
    grad = X.T @ (X @ w_gd - y) / n_samples
    w_gd -= lr * grad
time_gd = time.time() - start

# 3. Matrix Inversion (via normal equation)
start = time.time()
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y
time_inv = time.time() - start

# Compare results
print(f"QR Error: {np.linalg.norm(w_qr - true_weights):.4f}, Time: {time_qr:.4f}s")
print(f"GD Error: {np.linalg.norm(w_gd - true_weights):.4f}, Time: {time_gd:.4f}s")
print(f"Inv Error: {np.linalg.norm(w_inv - true_weights):.4f}, Time: {time_inv:.4f}s")

When to Use Each Method

Use QR Decomposition when:

  • You need numerical stability
  • Solving medium-sized linear systems
  • Maintaining orthogonality is important (e.g., in RNNs)
  • Working with ill-conditioned matrices

Use Gradient Descent when:

  • Problem is non-linear
  • Dataset is too large for memory
  • Need online/streaming updates
  • Working with neural networks

Use Matrix Inversion when:

  • Need exact solution immediately
  • Problem is small and well-conditioned
  • Have sufficient memory
  • Working with simple linear regression

In neural networks, gradient descent dominates because of non-linearity and scale, but QR decomposition appears in specialized architectures for stability, and matrix inversion is rarely used except in specific analytical solutions or small linear layers.

Other Methods Used with Neural Networks

Beyond QR decomposition, gradient descent, and matrix inversion, neural networks employ many other optimization and mathematical methods. Let me break down the major categories with hand-calculation examples where feasible.

1. Advanced Gradient-Based Methods

Momentum-Based Methods

Classical Momentum

Instead of just using current gradient, accumulate velocity:

velocity = β × velocity + gradient
weights = weights - learning_rate × velocity

Hand Example: Starting from previous gradient descent at iteration 2 (a=1.3, b=0.8):

  • Previous velocity: v_a = 0, v_b = 0
  • Current gradient: ∇a = -4.1, ∇b = -2.5
  • β = 0.9 (momentum coefficient)
v_a = 0.9 × 0 + (-4.1) = -4.1
v_b = 0.9 × 0 + (-2.5) = -2.5
a_new = 1.3 - 0.1 × (-4.1) = 1.71
b_new = 0.8 - 0.1 × (-2.5) = 1.05
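
Here is the same update as a short Python sketch, running classical momentum on the two-point problem from a₀ = b₀ = 0 (β = 0.9 and η = 0.1 as above):

beta, lr = 0.9, 0.1
a, b, v_a, v_b = 0.0, 0.0, 0.0, 0.0
for _ in range(300):
    r1, r2 = a + b - 3, 2*a + b - 5     # residuals of the two equations
    grad_a, grad_b = r1 + 2*r2, r1 + r2
    v_a = beta * v_a + grad_a           # accumulate velocity
    v_b = beta * v_b + grad_b
    a -= lr * v_a
    b -= lr * v_b
print(round(a, 3), round(b, 3))         # converges toward a = 2, b = 1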

Adam (Adaptive Moment Estimation)

Combines momentum with adaptive learning rates:

m = β₁ × m + (1-β₁) × gradient     (momentum)
v = β₂ × v + (1-β₂) × gradient²    (RMSprop)
weights = weights - lr × m/√(v + ε)

Key Insight: Adapts learning rate per parameter, faster convergence than vanilla gradient descent
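
A short Python sketch of the simplified Adam update shown above (the full algorithm also applies bias correction to m and v, omitted here; the hyperparameters are the common defaults plus an illustrative learning rate):

import math

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
a, b = 0.0, 0.0
m_a = m_b = v_a = v_b = 0.0
for _ in range(500):
    r1, r2 = a + b - 3, 2*a + b - 5
    g_a, g_b = r1 + 2*r2, r1 + r2
    m_a = beta1 * m_a + (1 - beta1) * g_a       # first moment (momentum)
    m_b = beta1 * m_b + (1 - beta1) * g_b
    v_a = beta2 * v_a + (1 - beta2) * g_a**2    # second moment (RMSprop-style)
    v_b = beta2 * v_b + (1 - beta2) * g_b**2
    a -= lr * m_a / math.sqrt(v_a + eps)
    b -= lr * m_b / math.sqrt(v_b + eps)
print(round(a, 2), round(b, 2))   # approaches a ≈ 2, b ≈ 1 (may hover slightly around the optimum)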

Newton's Method and Quasi-Newton Methods

Newton's Method

Uses second-order derivatives (Hessian):

x_new = x_old - H⁻¹∇f

Hand Example for minimizing f(x) = x² - 4:

  • Start: x₀ = 3
  • f'(x) = 2x, f''(x) = 2
  • Full Newton step: x₁ = 3 - (2×3)/2 = 0, the exact minimizer in a single step (the function is quadratic)
  • With damping (step size 0.5, often used for stability on non-quadratic problems): x₁ = 3 - 0.5×(6/2) = 1.5
  • x₂ = 1.5 - 0.5×(2×1.5)/2 = 0.75, ... → converges to 0
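
A minimal Python sketch of the damped iteration (step size 0.5, matching the hand example):

def f_prime(x):
    return 2 * x       # f(x) = x**2 - 4, so f'(x) = 2x

f_double_prime = 2.0   # f''(x) = 2 (constant)

x, damping = 3.0, 0.5
for step in range(8):
    x = x - damping * f_prime(x) / f_double_prime   # damped Newton step
    print(step + 1, x)                              # 1.5, 0.75, 0.375, ... -> 0, the minimizer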

L-BFGS (Limited-memory BFGS)

Approximates Hessian using gradient history:

  • Stores last m (typically 5-20) gradient updates
  • Builds approximate inverse Hessian

Key Insight: Faster convergence than gradient descent, but memory intensive

2. Stochastic Methods

Stochastic Gradient Descent (SGD)

Uses random mini-batches instead of full dataset:

Hand Example: Dataset: [(1,3), (2,5), (3,7), (4,9)]

Instead of using all 4 points, randomly pick 1-2 each iteration:

  • Iteration 1: Use only (2,5) → gradient based on single point
  • Iteration 2: Use only (3,7) → different gradient
  • Adds noise but enables online learning

Simulated Annealing

Probabilistically accepts worse solutions to escape local minima:

If new_loss < old_loss: accept
Else: accept with probability e^(-(new_loss-old_loss)/T)

Key Insight: Temperature T decreases over time, allowing exploration early and exploitation later
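
A minimal sketch of the acceptance rule with a simple geometric cooling schedule (the losses and temperatures are illustrative):

import math
import random

def accept(new_loss, old_loss, T):
    """Metropolis-style rule: always accept improvements, otherwise accept
    with probability exp(-(new_loss - old_loss) / T)."""
    if new_loss < old_loss:
        return True
    return random.random() < math.exp(-(new_loss - old_loss) / T)

T = 1.0
for step in range(5):
    print(T, accept(new_loss=1.2, old_loss=1.0, T=T))  # worse moves are accepted more often while T is high
    T *= 0.9                                           # cool down: explore early, exploit later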

3. Eigenvalue/Eigenvector Methods

Power Iteration

Finds dominant eigenvector:

Hand Example: Matrix A = [2 1; 1 2], start v₀ = [1; 0]

v₁ = Av₀ = [2; 1], normalize: [2/√5; 1/√5]
v₂ = Av₁ = [5/√5; 4/√5], normalize: [5/√41; 4/√41]

Converges to dominant eigenvector
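
The same computation in numpy (a minimal sketch continuing the hand example for a few more steps):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 0.0])
for _ in range(20):
    v = A @ v
    v = v / np.linalg.norm(v)   # re-normalize each step
print(v)                        # ≈ [0.7071, 0.7071], the dominant eigenvector
print(v @ A @ v)                # ≈ 3.0, the dominant eigenvalue (Rayleigh quotient)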

Singular Value Decomposition (SVD)

Decomposes A = UΣVᵀ

Application in Neural Networks:

  • Weight compression
  • Noise reduction
  • Computing pseudo-inverse

Key Insight: More general than eigendecomposition, works for non-square matrices

4. Iterative Linear System Solvers

Conjugate Gradient Method

For solving Ax = b iteratively:

Hand Example for simple 2×2:

A = [4 1; 1 3], b = [1; 2]
  1. Start: x₀ = [0; 0], r₀ = b - Ax₀ = [1; 2]
  2. p₀ = r₀ = [1; 2]
  3. α₀ = (r₀ᵀr₀)/(p₀ᵀAp₀) = 5/20 = 1/4   (Ap₀ = [6; 7], so p₀ᵀAp₀ = 1×6 + 2×7 = 20)
  4. x₁ = x₀ + α₀p₀ = [1/4; 1/2]

Key Insight: Guaranteed convergence in n steps for n×n matrix, no matrix inversion needed
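
The same system in a short numpy sketch of the full conjugate gradient loop (for this 2×2 SPD matrix it converges in two steps):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
r = b - A @ x                    # initial residual
p = r.copy()                     # initial search direction
for _ in range(2):               # at most n steps for an n x n SPD system
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
print(x)                         # ≈ [0.0909, 0.6364], the exact solution of Ax = b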

Jacobi Method

Iterative solver using diagonal dominance:

x_i^(k+1) = (b_i - Σ(j≠i) a_ij × x_j^(k)) / a_ii
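
A minimal numpy sketch of this update, applied to the same system A = [4 1; 1 3], b = [1; 2] used above (Jacobi converges here because A is diagonally dominant):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
D = np.diag(A)                  # the diagonal entries a_ii
Off = A - np.diag(D)            # the off-diagonal part
for _ in range(30):
    x = (b - Off @ x) / D       # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii
print(x)                        # ≈ [0.0909, 0.6364], same solution as conjugate gradient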

5. Specialized Neural Network Methods

Backpropagation Through Time (BPTT)

For RNNs - unrolls network and applies chain rule:

∂L/∂W = Σ_t ∂L_t/∂W

Contrastive Divergence

For training Restricted Boltzmann Machines:

  1. Positive phase: sample from data
  2. Negative phase: sample from model
  3. Update: weights += learning_rate × (positive - negative)

Expectation Maximization (EM)

For mixture models and latent variables:

  1. E-step: Estimate hidden variables
  2. M-step: Maximize parameters
  3. Repeat until convergence

6. Regularization and Constraint Methods

Proximal Gradient Methods

For L1 regularization (sparsity):

Soft threshold: S(x,λ) = sign(x) × max(|x|-λ, 0)

Hand Example:

  • Weight after gradient step: 1.5
  • L1 penalty λ = 0.3
  • New weight = sign(1.5) × max(1.5 - 0.3, 0) = 1.2
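
The same operator as a tiny Python sketch:

def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty: shrink |x| by lam and clip at zero."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * max(abs(x) - lam, 0.0)

print(soft_threshold(1.5, 0.3))   # 1.2, matching the hand example
print(soft_threshold(0.2, 0.3))   # 0.0 -> small weights are driven exactly to zero (sparsity)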

Lagrangian Methods

For constrained optimization:

L(x,λ) = f(x) + λg(x)

7. Probabilistic/Sampling Methods

Monte Carlo Methods

Approximate expectations through sampling:

E[f(x)] ≈ (1/N) Σ f(x_i) where x_i ~ p(x)

Variational Inference

Approximate posterior with simpler distribution:

KL(q||p) = E_q[log q] - E_q[log p]

Comparison Table

Method             | Convergence Rate | Memory   | Stability | Use Case
Vanilla GD         | O(1/k)           | Low      | Good      | Simple problems
Momentum           | O(1/k²)          | Low      | Good      | General training
Adam               | Adaptive         | Moderate | Very Good | Default choice
Newton             | Quadratic        | High     | Poor      | Small, smooth problems
L-BFGS             | Superlinear      | Moderate | Good      | Batch optimization
SGD                | O(1/√k)          | Very Low | Good      | Large datasets
Conjugate Gradient | O(√κ log(1/ε))   | Low      | Good      | Linear systems
SVD                | Direct           | High     | Excellent | Dimensionality reduction

Key Insights from All Methods:

  1. QR: Stability through orthogonalization
  2. Gradient Descent: Simple but slow
  3. Matrix Inversion: Fast but unstable
  4. Momentum Methods: Accelerate convergence using history
  5. Adaptive Methods (Adam): Adjust learning per parameter
  6. Second-Order (Newton): Use curvature information
  7. Stochastic Methods: Trade accuracy for speed/scalability
  8. Iterative Solvers: Avoid explicit inversion
  9. Decomposition Methods: Reveal structure in data
  10. Probabilistic Methods: Handle uncertainty

Modern neural networks typically use:

  • Adam or SGD with momentum for training (99% of cases)
  • L-BFGS for small networks or fine-tuning
  • SVD/QR for initialization or specific architectures
  • Almost never use direct matrix inversion due to scale

The choice depends on problem size, required accuracy, available memory, and whether the problem is convex or non-convex.
