Optimizer

The Optimizer trait defines the interface for all optimization algorithms in MaidenX. Optimizers update the parameters of neural network layers based on gradient information to minimize the loss function.

Optimizer Trait Definition

```rust
pub trait Optimizer {
    fn step(&mut self, parameters: &mut [&mut Tensor]) -> Result<()>;
    fn zero_grad(&mut self, parameters: &mut [&mut Tensor]) -> Result<()>;
    fn set_learning_rate(&mut self, learning_rate: impl Into<Scalar>);
}
```

Core Methods

step

```rust
fn step(&mut self, parameters: &mut [&mut Tensor]) -> Result<()>;
```

The step method updates the parameters based on their current gradients. This is the core method that performs the optimization algorithm's update rule.
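Conceptually, plain SGD's step applies `param -= learning_rate * grad` to every parameter that has a gradient. A minimal sketch of that rule on plain `f32` slices (deliberately avoiding the MaidenX `Tensor` API; `sgd_step` is a hypothetical helper for illustration, not part of the library):

```rust
// SGD-style update on plain f32 slices: param -= lr * grad.
// This mirrors what an Optimizer::step implementation does per parameter,
// minus the Tensor bookkeeping.
fn sgd_step(params: &mut [f32], grads: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grads) {
        *p -= lr * g;
    }
}

fn main() {
    let mut params = vec![1.0_f32, -2.0];
    let grads = vec![0.5_f32, -0.5];
    sgd_step(&mut params, &grads, 0.1);
    // params is now [0.95, -1.95]
    println!("{:?}", params);
}
```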

zero_grad

```rust
fn zero_grad(&mut self, parameters: &mut [&mut Tensor]) -> Result<()>;
```

Resets the gradients of all parameters to zero, typically called before computing gradients for the next batch.

set_learning_rate

```rust
fn set_learning_rate(&mut self, learning_rate: impl Into<Scalar>);
```

Allows dynamic adjustment of the learning rate during training, which can be useful for learning rate scheduling.

Available Optimizers

SGD (Stochastic Gradient Descent)

The SGD optimizer implements basic gradient descent with a configurable learning rate:

```rust
pub struct SGD {
    learning_rate: Scalar,
}

impl SGD {
    pub fn new(learning_rate: impl Into<Scalar>) -> Self {
        Self {
            learning_rate: learning_rate.into(),
        }
    }
}
```

Usage Example:

```rust
let mut sgd = SGD::new(0.01);

// Training loop
for _ in 0..num_epochs {
    // Forward and backward pass
    // ...

    // Update parameters
    sgd.step(&mut parameters)?;
    sgd.zero_grad(&mut parameters)?;
}
```

Adam (Adaptive Moment Estimation)

The Adam optimizer implements adaptive learning rates for each parameter with momentum and RMSProp-like behavior:

```rust
pub struct Adam {
    learning_rate: Scalar,
    beta1: Scalar,       // Exponential decay rate for first moment
    beta2: Scalar,       // Exponential decay rate for second moment
    epsilon: Scalar,     // Small constant for numerical stability
    t: usize,            // Timestep
    m: Vec<Tensor>,      // First moment vectors
    v: Vec<Tensor>,      // Second moment vectors
}

impl Adam {
    pub fn new(
        learning_rate: impl Into<Scalar>,
        beta1: impl Into<Scalar>,
        beta2: impl Into<Scalar>,
        epsilon: impl Into<Scalar>,
    ) -> Self {
        Self {
            learning_rate: learning_rate.into(),
            beta1: beta1.into(),
            beta2: beta2.into(),
            epsilon: epsilon.into(),
            t: 0,
            // Moment vectors are allocated lazily on the first step call
            m: Vec::new(),
            v: Vec::new(),
        }
    }
}
```

Usage Example:

```rust
let mut adam = Adam::new(0.001, 0.9, 0.999, 1e-8);

// Training loop
for _ in 0..num_epochs {
    // Forward and backward pass
    // ...

    // Update parameters
    adam.step(&mut parameters)?;
    adam.zero_grad(&mut parameters)?;
}
```
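The arithmetic behind each Adam step can be sketched for a single scalar parameter on plain `f32` values (no `Tensor` API; `adam_step` is a hypothetical helper written for illustration, not part of MaidenX):

```rust
// One Adam update for a single scalar parameter: exponentially decayed
// first and second moments, bias correction, then a scaled step.
fn adam_step(
    param: &mut f32, grad: f32,
    m: &mut f32, v: &mut f32, t: &mut usize,
    lr: f32, beta1: f32, beta2: f32, eps: f32,
) {
    *t += 1;
    *m = beta1 * *m + (1.0 - beta1) * grad;
    *v = beta2 * *v + (1.0 - beta2) * grad * grad;
    // Bias-corrected moment estimates
    let m_hat = *m / (1.0 - beta1.powi(*t as i32));
    let v_hat = *v / (1.0 - beta2.powi(*t as i32));
    *param -= lr * m_hat / (v_hat.sqrt() + eps);
}

fn main() {
    let (mut p, mut m, mut v, mut t) = (1.0_f32, 0.0_f32, 0.0_f32, 0usize);
    adam_step(&mut p, 0.5, &mut m, &mut v, &mut t, 0.001, 0.9, 0.999, 1e-8);
    // On the first step m_hat == grad and v_hat == grad^2, so the update
    // magnitude is approximately lr regardless of the gradient's scale.
    println!("{p}");
}
```

Note how the bias correction makes the very first update roughly `lr` in size even though `m` and `v` start at zero.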

Implementing Custom Optimizers

To create a custom optimizer, implement the Optimizer trait:

```rust
struct MyCustomOptimizer {
    learning_rate: Scalar,
    momentum: Scalar,
    velocity: Vec<Tensor>,
}

impl MyCustomOptimizer {
    pub fn new(learning_rate: impl Into<Scalar>, momentum: impl Into<Scalar>) -> Self {
        Self {
            learning_rate: learning_rate.into(),
            momentum: momentum.into(),
            velocity: Vec::new(),
        }
    }
}

impl Optimizer for MyCustomOptimizer {
    fn step(&mut self, parameters: &mut [&mut Tensor]) -> Result<()> {
        // Lazily initialize velocity vectors to match the parameter shapes
        if self.velocity.is_empty() {
            self.velocity = parameters
                .iter()
                .map(|param| Tensor::zeros_like(param))
                .collect::<Result<Vec<_>>>()?;
        }

        // Classical momentum: v = momentum * v + grad; param -= lr * v
        for (param_idx, param) in parameters.iter_mut().enumerate() {
            if let Some(grad) = param.grad()? {
                // Update velocity
                self.velocity[param_idx] = self.velocity[param_idx]
                    .mul_scalar(self.momentum)?
                    .add(&grad)?;

                // Update parameter
                param.sub_(&self.velocity[param_idx].mul_scalar(self.learning_rate)?)?;
            }
        }
        Ok(())
    }

    fn zero_grad(&mut self, parameters: &mut [&mut Tensor]) -> Result<()> {
        for param in parameters.iter_mut() {
            param.zero_grad()?;
        }
        Ok(())
    }

    fn set_learning_rate(&mut self, learning_rate: impl Into<Scalar>) {
        self.learning_rate = learning_rate.into();
    }
}
```
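To see the momentum rule behave as expected, here is the same velocity update run on a 1-D quadratic f(x) = x², whose gradient is 2x (plain `f32`, no MaidenX types; `momentum_minimize` is a hypothetical helper for illustration):

```rust
// Heavy-ball momentum on f(x) = x^2, using the same rule as above:
// v = momentum * v + grad, then x -= lr * v.
fn momentum_minimize(mut x: f32, lr: f32, momentum: f32, steps: usize) -> f32 {
    let mut v = 0.0_f32;
    for _ in 0..steps {
        let grad = 2.0 * x; // d/dx of x^2
        v = momentum * v + grad;
        x -= lr * v;
    }
    x
}

fn main() {
    let x = momentum_minimize(5.0, 0.1, 0.9, 100);
    println!("final x = {x}"); // close to the minimum at 0
}
```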

Learning Rate Scheduling

You can implement learning rate scheduling by adjusting the learning rate during training:

```rust
let mut lr = 0.1_f32;
let mut optimizer = SGD::new(lr);

for epoch in 0..num_epochs {
    // Decay the learning rate by a factor of 10 every 10 epochs.
    // Track the current value locally, since the optimizer's
    // learning_rate field is private.
    if epoch > 0 && epoch % 10 == 0 {
        lr *= 0.1;
        optimizer.set_learning_rate(lr);
    }

    // Training loop
    // ...
}
```
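The same decay logic can be expressed as a pure function of the epoch, which keeps the schedule easy to test in isolation (`step_decay` is a hypothetical helper, not part of MaidenX):

```rust
// Step-decay schedule: multiply the base learning rate by `factor`
// once every `step_size` epochs.
fn step_decay(base_lr: f32, factor: f32, step_size: usize, epoch: usize) -> f32 {
    base_lr * factor.powi((epoch / step_size) as i32)
}

fn main() {
    // With base_lr = 0.1, factor = 0.1, step_size = 10:
    // epochs 0..=9 -> 0.1, epochs 10..=19 -> 0.01, epochs 20..=29 -> 0.001
    for epoch in [0, 9, 10, 19, 20] {
        println!("epoch {epoch}: lr = {}", step_decay(0.1, 0.1, 10, epoch));
    }
}
```

Calling `set_learning_rate(step_decay(base_lr, factor, step_size, epoch))` at the top of each epoch then replaces the inline `if` check.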