Multi-Network Training with Moving Average Target
When neural networks must learn from each other, they create unstable feedback loops where each network chases the moving targets produced by the other.
Traditional gradient descent assumes fixed targets, but in many scenarios two or more networks must learn from each other simultaneously. This creates a "chicken-and-egg" problem where:
- Network A needs stable targets from Network B to train properly
- Network B needs stable targets from Network A to train properly
- But both are updating constantly, making their outputs unstable targets
Why This Causes Instability: Every update to one network shifts the target distribution seen by the other, which can lead to oscillations, divergence, or collapse to trivial solutions.
The Moving Average Solution: Use exponential moving averages (EMA) to create slowly-evolving "target" versions of networks:
target_params ← τ * main_params + (1-τ) * target_params
Where τ ≈ 0.001-0.01, so the target network drifts toward the main network only slightly at each step.
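The update rule above can be sketched in a few lines of plain Python (the function and parameter names here are illustrative, not from any particular library):

```python
def ema_update(main_params, target_params, tau=0.005):
    """Move each target parameter a small step toward its main counterpart."""
    return [tau * m + (1 - tau) * t for m, t in zip(main_params, target_params)]

# The target tracks the main network with an exponential lag: starting at 0
# and tracking a constant main value of 1.0, after n steps the target equals
# 1 - (1 - tau)**n, so it approaches the main value but never overshoots.
target = [0.0]
for _ in range(1000):
    target = ema_update([1.0], target)
```

In a deep learning framework the same rule is applied parameter tensor by parameter tensor, typically inside a `no_grad` context so the target network never receives gradients.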
Examples
Target Networks (RL - DQN/DDPG/SAC)
- Problem: The Q-learning update Q(s,a) ← r + γ·max_a′ Q(s′,a′) uses the same network on both sides of the equation
- Instability: As Q-network updates, the targets Q(s',a') keep changing, preventing convergence
- Solution: A separate target Q-network, updated via EMA (soft updates, as in DDPG/SAC) or periodic hard copies (as in the original DQN), provides targets that stay nearly constant for many steps
- Result: Main network can learn against consistent targets, achieving stable Q-learning
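A minimal sketch of this idea with tabular Q-values instead of a neural network (all names, the environment indices, and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
gamma, tau, lr = 0.99, 0.005, 0.1

q_main = rng.normal(size=(n_states, n_actions))  # updated every step
q_target = q_main.copy()                         # slow-moving EMA copy

def train_step(s, a, r, s_next):
    # Bootstrap from the slow target table, not the fast main one.
    td_target = r + gamma * q_target[s_next].max()
    q_main[s, a] += lr * (td_target - q_main[s, a])
    # Soft update: the target table drifts slowly toward the main table.
    q_target[:] = tau * q_main + (1 - tau) * q_target

train_step(s=0, a=1, r=1.0, s_next=2)
```

The key design choice is that `td_target` reads only from `q_target`, so the regression target barely moves between consecutive gradient steps on `q_main`.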
Momentum Encoders (SSL - MoCo/BYOL)
- Problem: Learning representations by maximizing agreement between different augmented views
- Instability: If both encoder networks update together, they can collapse to output identical constant vectors (trivial solution)
- Solution: Asymmetric setup - "student" encoder trains normally, "teacher" encoder updates via EMA
- Result: Teacher provides stable, diverse targets while slowly incorporating student's improvements
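The asymmetric student/teacher setup can be sketched with simple linear encoders. This is a heavily simplified illustration, not MoCo or BYOL as published (those use contrastive or cosine objectives, projection heads, and a predictor); the squared-error loss and all names here are assumptions for clarity:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, tau, lr = 8, 0.01, 0.05
student_w = rng.normal(size=(dim, dim)) * 0.1
teacher_w = student_w.copy()             # teacher starts as a copy of the student

def train_step(x_view1, x_view2):
    global student_w, teacher_w
    target = teacher_w @ x_view2         # teacher output: treated as a constant
    pred = student_w @ x_view1           # student output: gradients flow here
    # Gradient of 0.5 * ||pred - target||^2 with respect to student_w.
    student_w -= lr * np.outer(pred - target, x_view1)
    # The teacher never receives gradients; it only follows the student via EMA.
    teacher_w = tau * student_w + (1 - tau) * teacher_w

for _ in range(10):
    x = rng.normal(size=dim)
    train_step(x + 0.1 * rng.normal(size=dim), x + 0.1 * rng.normal(size=dim))
```

Because the teacher is a lagged average rather than a gradient-trained copy, the student cannot drag both encoders into the constant-output collapse in a single step.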
Why Moving Averages Work So Well
- Temporal Decoupling: Breaks the circular dependency by introducing a time delay
- Stability: Targets change slowly enough for networks to actually learn from them
- Information Preservation: Still incorporates improvements, just gradually
- Regularization Effect: The averaged network is often more robust than the instantaneous one
- Simple Implementation: Just one line of code to dramatically improve training
The Elegance: You get the benefits of networks co-evolving and learning from each other, without the chaos of circular dependencies. It's a simple solution to a fundamental problem in multi-network training.