NoProp

NOPROP: TRAINING NEURAL NETWORKS WITHOUT BACK-PROPAGATION OR FORWARD-PROPAGATION

Qinyu Li, Yee Whye Teh, Razvan Pascanu (arXiv:2503.24322)

The authors propose NoProp, a method for training neural networks without end-to-end back-propagation or forward-propagation. Instead of learning through gradients propagated across layers, each layer is trained independently to denoise a noisy version of the target label; at inference time, the noise is removed layer by layer to recover the prediction.
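
To make this concrete, here is a minimal PyTorch-style sketch of the discrete-time idea. The shapes, the MLP blocks, the linear noise schedule, and the nearest-embedding read-out are placeholders of mine, not the paper's exact formulation, so treat it as an illustration of the training pattern rather than a faithful implementation.

```python
import torch
import torch.nn as nn

# Assumed toy setup: flattened 28x28 inputs, 10 classes, 10 denoising steps, small MLP blocks.
in_dim, embed_dim, num_classes, T = 784, 32, 10, 10
W_embed = nn.Parameter(torch.randn(num_classes, embed_dim))  # learnable class-embedding matrix
alpha_bar = torch.linspace(0.01, 0.9999, T)                  # toy schedule: signal grows with depth

# One denoising block per step: (input x, noisy label embedding) -> estimate of the clean embedding.
blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(in_dim + embed_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
    for _ in range(T)
])
opt = torch.optim.Adam(list(blocks.parameters()) + [W_embed], lr=1e-3)

def train_step(x, y):                                 # x: (B, in_dim), y: (B,) integer labels
    loss = 0.0
    for t, block in enumerate(blocks):
        u_y = W_embed[y]                              # clean label embedding: every block's target
        eps = torch.randn_like(u_y)
        # Block t's input is a pre-fixed noised version of the target, not the previous block's
        # output, so no activations or gradients ever flow across blocks.
        z_t = (alpha_bar[t].sqrt() * u_y + (1 - alpha_bar[t]).sqrt() * eps).detach()
        u_hat = block(torch.cat([x, z_t], dim=-1))
        loss = loss + ((u_hat - u_y) ** 2).mean()     # purely local denoising loss
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def predict(x):
    z = torch.randn(x.shape[0], embed_dim)            # inference starts from pure noise
    for t, block in enumerate(blocks):
        u_hat = block(torch.cat([x, z], dim=-1))      # this layer's guess at the clean embedding
        if t + 1 < T:                                  # re-noise to the level the next block expects
            z = alpha_bar[t + 1].sqrt() * u_hat + (1 - alpha_bar[t + 1]).sqrt() * torch.randn_like(u_hat)
        else:
            z = u_hat
    return torch.cdist(z, W_embed).argmin(dim=-1)      # read out the nearest class embedding
```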

Problem

  • Biological implausibility of back-propagation.
  • Memory cost of storing activations from the forward pass for use in the backward pass.
  • The sequential dependencies of forward and backward propagation hinder parallelization.
  • Catastrophic forgetting in continual learning.

Ideation

The authors were inspired by recent advances in generative modeling, specifically diffusion models and flow matching. The key insight is to reconceptualize neural network training, which can be broken down into:

  • Reframing the problem: denoising at each layer instead of sequentially propagating information across layers.
  • Fixing the representation at each layer beforehand to a noised version of the target (formalized just after this list).
  • Questioning the assumption that hierarchical representations are necessary for effective learning.
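
In diffusion-style notation (my own symbols and indexing, which may not match the paper's exactly), the representation targeted at layer t is a noised version of the embedding u_y of the label y, fixed ahead of time rather than produced by earlier layers:

$$
z_t = \sqrt{\bar\alpha_t}\, u_y + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
$$

so each layer can be trained against its own fixed target without ever running the rest of the network.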

Contribution

Introducing NoProp and its variants.

Variants

  • Discrete-Time NoProp (NoProp-DT): has a fixed number of denoising steps.
  • Continuous-Time NoProp (NoProp-CT): learns the denoising process in continuous time, with the block conditioned on a time variable.
  • Flow Matching (NoProp-FM): learns a vector field that transports noise to the label embedding (a generic sketch follows this list).
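
As a rough illustration of the flow-matching variant, below is generic conditional flow matching with straight-line paths and an Euler sampler; the network, shapes, and schedule are placeholders, and the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

# Generic conditional flow-matching sketch (straight-line probability paths, Euler sampler).
in_dim, embed_dim = 784, 32
v_theta = nn.Sequential(nn.Linear(in_dim + embed_dim + 1, 256), nn.ReLU(),
                        nn.Linear(256, embed_dim))            # predicts a velocity field
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)

def fm_train_step(x, u_y):                    # u_y: clean label embeddings, shape (B, embed_dim)
    z0 = torch.randn_like(u_y)                # noise endpoint of the path
    t = torch.rand(u_y.shape[0], 1)           # random time in [0, 1]
    z_t = (1 - t) * z0 + t * u_y              # point on the straight path from noise to embedding
    target_v = u_y - z0                       # constant velocity along that path
    pred_v = v_theta(torch.cat([x, z_t, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def fm_sample(x, steps=20):
    z = torch.randn(x.shape[0], embed_dim)    # start from noise
    dt = 1.0 / steps
    for i in range(steps):                    # Euler-integrate dz/dt = v_theta(z, x, t)
        t = torch.full((x.shape[0], 1), i * dt)
        z = z + dt * v_theta(torch.cat([x, z, t], dim=-1))
    return z                                  # should land near a class embedding
```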

Modeling Considerations

  • NoProp requires pre-fixing the representation at each layer, which calls for careful modeling and design choices.
  • The authors experimented with different initialization strategies for the class-embedding matrix, including one-hot vectors, orthogonal matrices, and prototype-based approaches (roughly sketched below).
  • The continuous-time variants add the complexity of conditioning the block on time.
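
For illustration, the embedding initializations and time conditioning mentioned above might look roughly like the following; these are generic stand-ins (e.g. the sinusoidal time embedding is a common choice, not necessarily the paper's).

```python
import math
import torch
import torch.nn as nn

num_classes, embed_dim = 10, 32   # assumes embed_dim >= num_classes for the one-hot option

# Possible class-embedding initializations (generic stand-ins for the strategies listed above).
one_hot    = torch.eye(num_classes, embed_dim)                        # one-hot rows, zero-padded
orthogonal = nn.init.orthogonal_(torch.empty(num_classes, embed_dim)) # mutually orthogonal rows
# A "prototype-based" option could start from per-class feature means of a pretrained encoder
# (placeholder: any (num_classes, embed_dim) matrix of class prototypes).

# Sinusoidal time embedding, one common way to condition a continuous-time block on t in [0, 1].
def time_embedding(t, dim=64):                     # t: (B, 1) -> (B, dim)
    freqs = torch.exp(torch.linspace(0, math.log(1000.0), dim // 2))
    angles = t * freqs                             # broadcasts to (B, dim // 2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)
```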

Validation

  • NoProp-DT performs on par with or better than back-propagation on MNIST and CIFAR-10.
  • NoProp variants outperform prior back-propagation-free methods such as Forward-Forward (Hinton, 2022), Difference Target Propagation, and Local Greedy Forward Gradient.
  • NoProp uses less memory during training than back-propagation.