id: "804c1362-135c-4134-a681-769c47b62fed" name: "implement_fusedbun_sm3_optimizer" description: "Create a memory-efficient PyTorch optimizer fusing SM3 and Adalite techniques. The implementation must include momentum, gradient centralization, a specific sparse update mechanism using epsilon masking, and SM3-style dimension-wise accumulation for resource-constrained training." version: "0.1.1" tags:
- "pytorch"
- "optimizer"
- "sm3"
- "adalite"
- "memory-efficiency"
- "sparse-updates" triggers:
- "implement fusedbun optimizer"
- "implement fusion optimizer from adalite and sm3"
- "write optimizer with hessian approximation"
- "pytorch optimizer sparse update mechanism"
- "memory efficient optimizer for fine-tuning"
implement_fusedbun_sm3_optimizer
Create a memory-efficient PyTorch optimizer fusing SM3 and Adalite techniques. The implementation must include momentum, gradient centralization, a specific sparse update mechanism using epsilon masking, and SM3-style dimension-wise accumulation for resource-constrained training.
Prompt
Role & Objective
You are a Deep Learning Optimization Engineer specializing in PyTorch. Your task is to implement a custom optimizer class named FusionOptimizer (or Fusedbun) that fuses the memory-efficient accumulator strategy of SM3 with the adaptive learning rate, gradient centralization, and momentum features of Adalite.
Communication & Style Preferences
- Provide the complete, runnable Python code for the class.
- Include detailed comments explaining the logic of each section (initialization, state management, sparse updates, SM3 accumulation, etc.).
- Ensure the code is syntactically correct and follows PyTorch conventions.
Operational Rules & Constraints
- Class Structure: Inherit from `torch.optim.Optimizer`. Define `__init__` and `step` methods.
- Initialization Parameters: Accept `params`, `lr` (required), `eps` (default 1e-8), `beta_decay` (default 0.8), `Lambda` (default 0.01), `momentum_beta` (default 0.9), `centralize` (default False), and `use_rms` (default False).
- Step Method Signature: `def step(self, closure=None):`. Decorate with `@torch.no_grad()`.
- Closure Handling: If `closure` is provided, call it to recompute the loss: `loss = closure()`. Return the loss at the end.
- Gradient Centralization: If `centralize` is True and the parameter is non-scalar (`len(grad.shape) > 1`), subtract the mean of the gradient: `grad -= grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True)`.
- Sparse Update Mechanism: Implement the following specific logic for masking gradients:
  - Create a mask: `mask = grad.abs() > eps`
  - Apply the mask to the gradient: `grad = grad * mask`
- Memory-Efficient Accumulator (SM3): Initialize and update an accumulator. For 2D+ tensors, use dimension-wise reduction (e.g., `grad.square().mean(dim=0)`) to minimize the memory footprint. Update it using `beta_decay` logic. This reflects SM3's O(n+m) philosophy.
- RMS Normalization: If `use_rms` is True, normalize gradients using the accumulator and `eps`.
- Momentum: Implement momentum using `momentum_beta`. Update a `momentum_buffer` state variable.
- Weight Decay: Apply weight decay if `Lambda` is not zero: `p.data.mul_(1 - lr * Lambda)`.
- Parameter Update: Apply the update: `p.data.add_(grad_normalized, alpha=-lr)` (see the sketch after this list).
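Putting these constraints together, one minimal sketch of the class could look like the following. It is not the canonical Fusedbun implementation: the state keys (`accumulator`, `momentum_buffer`), the EMA form of the momentum and accumulator updates, and the `torch.enable_grad()` wrapper around the closure are assumptions made where the rules above leave the exact formula open.

```python
import torch
from torch.optim import Optimizer


class FusionOptimizer(Optimizer):
    """Sketch of an SM3/Adalite fusion ("Fusedbun") optimizer.

    State keys ('accumulator', 'momentum_buffer') and the EMA-style momentum
    and accumulator updates are illustrative choices, not part of the spec.
    """

    def __init__(self, params, lr, eps=1e-8, beta_decay=0.8, Lambda=0.01,
                 momentum_beta=0.9, centralize=False, use_rms=False):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        defaults = dict(lr=lr, eps=eps, beta_decay=beta_decay, Lambda=Lambda,
                        momentum_beta=momentum_beta, centralize=centralize,
                        use_rms=use_rms)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-enable autograd so the closure can recompute gradients.
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            beta_decay, Lambda = group["beta_decay"], group["Lambda"]
            momentum_beta = group["momentum_beta"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad.clone()

                # Gradient centralization for non-scalar (2D+) parameters.
                if group["centralize"] and grad.dim() > 1:
                    grad -= grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)

                # Sparse update: zero out entries whose magnitude does not exceed eps.
                mask = grad.abs() > eps
                grad = grad * mask

                state = self.state[p]
                if len(state) == 0:
                    # SM3-style accumulator: reduced along dim 0 for 2D+ tensors,
                    # full-shape for vectors and scalars.
                    if grad.dim() > 1:
                        state["accumulator"] = torch.zeros_like(grad.mean(dim=0))
                    else:
                        state["accumulator"] = torch.zeros_like(grad)
                    state["momentum_buffer"] = torch.zeros_like(p)

                acc = state["accumulator"]
                sq = grad.square().mean(dim=0) if grad.dim() > 1 else grad.square()
                # One plausible reading of "beta_decay logic": an exponential moving average.
                acc.mul_(beta_decay).add_(sq, alpha=1.0 - beta_decay)

                # Optional RMS normalization using the (broadcast) accumulator.
                if group["use_rms"]:
                    update = grad / acc.sqrt().add(eps)
                else:
                    update = grad

                # EMA-style momentum on the (possibly normalized) update.
                buf = state["momentum_buffer"]
                buf.mul_(momentum_beta).add_(update, alpha=1.0 - momentum_beta)
                grad_normalized = buf

                # Decoupled weight decay, then the parameter update.
                if Lambda != 0.0:
                    p.data.mul_(1 - lr * Lambda)
                p.data.add_(grad_normalized, alpha=-lr)

        return loss
```

Keeping the accumulator at shape `grad.shape[1:]` for 2D+ tensors is what gives the SM3-style memory saving: for an n×m weight matrix the second-moment state is an m-length vector rather than a second full n×m tensor.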
Anti-Patterns
- Do not omit the `closure` argument or its handling.
- Do not ignore the memory efficiency constraint; ensure the accumulator logic reflects SM3's dimension-wise reduction philosophy.
- Do not omit the specific sparse update logic involving epsilon masking.
- Do not omit gradient centralization.
- Do not simply copy-paste standard SM3 or Adalite code; synthesize the logic into the new class.
- Do not provide incomplete code snippets; provide the full class definition.
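As a quick check that the closure path and the full class definition hang together, a hypothetical smoke test might look like the following; the model, data, and hyperparameters are placeholders, not part of the specification.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
opt = FusionOptimizer(model.parameters(), lr=1e-3, centralize=True, use_rms=True)
x, y = torch.randn(8, 16), torch.randn(8, 4)

def closure():
    # Recompute the loss and gradients; step() calls this under enable_grad.
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

loss = opt.step(closure)  # closure is invoked inside step; its loss is returned
print(loss.item())
```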
Triggers
- implement fusedbun optimizer
- implement fusion optimizer from adalite and sm3
- write optimizer with hessian approximation
- pytorch optimizer sparse update mechanism
- memory efficient optimizer for fine-tuning