NOPROP#
NOPROP: TRAINING NEURAL NETWORKS WITHOUT BACK-PROPAGATION OR FORWARD-PROPAGATION
Qinyu Li, Yee Whye Teh, Razvan Pascanu (arXiv:2503.24322)
The authors propose NoProp, a method for training neural networks without end-to-end back-propagation or forward-propagation. Instead of learning via gradients propagated across layers, each layer is trained independently, using only local gradients, to denoise a noisy version of the target label. During inference, the noise is removed layer by layer.
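As a rough sketch of the layer-by-layer denoising inference this describes: the code below uses tiny random linear maps as stand-ins for trained blocks, and the mixing update is an illustrative diffusion-style rule, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4          # label-embedding dimension (illustrative)
T = 3          # number of layers / denoising steps

# Stand-ins for the trained per-layer denoisers u_hat_t(z, x): in the real
# method each of these is a neural block trained with a local denoising loss.
Ws = [rng.normal(size=(D, D)) * 0.1 for _ in range(T)]

def denoise_step(z, x, W, alpha):
    """One NoProp-style inference step (sketch): predict the clean label
    embedding from the current noisy estimate z and the input x, then blend
    the prediction back into z."""
    u_pred = np.tanh(W @ z + x)   # stand-in for the learned block
    return np.sqrt(alpha) * u_pred + np.sqrt(1 - alpha) * z

x = rng.normal(size=D)            # input features (stand-in)
z = rng.normal(size=D)            # inference starts from pure Gaussian noise
for t in range(T):
    z = denoise_step(z, x, Ws[t], alpha=0.9)

print(z.shape)  # (4,)
```

Note that no information is forward-propagated through a stack of hidden representations; only the noisy label estimate `z` is refined step by step.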
Problem#
- Biological implausibility of back-propagation.
- Memory costs due to storing activations during forward pass to facilitate backward passes.
- Sequential dependencies of propagation hinder parallel computation.
- Catastrophic forgetting in continual learning.
Ideation#
The authors were inspired by recent advances in generative modeling, specifically diffusion models and flow matching. The key insight is to reconceptualize neural network training, which can be broken down into:
- Reframing the problem: denoising at each layer instead of sequentially propagating information across layers.
- Fixing the representation at each layer beforehand to a noised version of the target.
- Questioning the assumption that hierarchical representations are necessary for effective learning.
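Fixing each layer's representation to a noised version of the target can be written as a diffusion-style forward process, z_t = sqrt(alpha_t) * u_y + sqrt(1 - alpha_t) * eps, where u_y is the clean class embedding, eps is Gaussian noise, and alpha_t is a noise-schedule value in [0, 1]. A minimal sketch (the schedule values and embedding are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
u_y = np.array([1.0, 0.0, 0.0])   # clean class embedding (stand-in)

def noised_target(u_y, alpha_t, rng):
    """z_t = sqrt(alpha_t) * u_y + sqrt(1 - alpha_t) * eps.
    Larger alpha_t keeps more of the clean target signal."""
    eps = rng.normal(size=u_y.shape)
    return np.sqrt(alpha_t) * u_y + np.sqrt(1 - alpha_t) * eps

# Early layers are assigned mostly noise, later layers mostly signal.
z_early = noised_target(u_y, alpha_t=0.1, rng=rng)
z_late = noised_target(u_y, alpha_t=0.99, rng=rng)
print(np.linalg.norm(z_early - u_y), np.linalg.norm(z_late - u_y))
```

Because these targets are fixed in advance, no layer needs to wait for any other layer's output during training.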
Contribution#
Introducing NoProp and its variants.
Variants
- Discrete-Time NoProp (NoProp-DT): has a fixed number of denoising steps.
- Continuous-Time NoProp (NoProp-CT): learns a dynamic denoising process.
- Flow Matching (NoProp-FM): learns a vector field to carry noise to the label embedding.
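To make the per-layer independence concrete, here is a toy sketch in the spirit of NoProp-DT: each "layer" is a separate linear model with its own denoising loss, and gradients never flow between layers. The linear blocks, noise schedule, and hyperparameters are all illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
D, T, steps, lr = 3, 4, 200, 0.1
u_y = np.array([1.0, 0.0, 0.0])      # clean class embedding (stand-in)
alphas = np.linspace(0.2, 0.95, T)   # assumed discrete noise schedule

# One independent linear denoiser per layer; training them in any order
# (or in parallel) would give the same result.
Ws = [np.zeros((D, D)) for _ in range(T)]
for t in range(T):
    for _ in range(steps):
        eps = rng.normal(size=D)
        z_t = np.sqrt(alphas[t]) * u_y + np.sqrt(1 - alphas[t]) * eps
        pred = Ws[t] @ z_t                # layer t's denoised estimate
        grad = np.outer(pred - u_y, z_t)  # local MSE gradient only
        Ws[t] -= lr * grad
```

Each inner loop touches only `Ws[t]`: there is no forward pass through earlier layers and no backward pass through later ones, only a local gradient step on that layer's denoising objective.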
Modeling Considerations
- NoProp pre-fixes the representation at each layer, so careful modeling and design choices are required.
- The authors experimented with different initialization strategies for the class embedding matrix, including one-hot vectors, orthogonal matrices, and prototype-based approaches.
- For the continuous-time variants, the denoising network must additionally be conditioned on the time variable, which adds complexity.
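Two of the embedding initializations mentioned above can be sketched as follows; the QR-based recipe for the orthogonal case is an assumed construction, and the 4-class, 4-dimensional setup is purely illustrative.

```python
import numpy as np

num_classes, D = 4, 4

# One-hot initialization: class i is embedded as the i-th standard basis vector.
E_onehot = np.eye(num_classes, D)

# Orthogonal initialization: orthonormal rows via QR of a random Gaussian matrix.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(num_classes, D)))
E_ortho = Q

# Both choices give mutually orthogonal class embeddings:
print(np.allclose(E_ortho @ E_ortho.T, np.eye(num_classes)))  # True
```

Orthogonal embeddings keep classes maximally separated in the embedding space, which matters here because the embeddings double as denoising targets.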
Validation#
- NoProp-DT performs on par with or better than backprop on MNIST and CIFAR-10.
- NoProp variants outperform prior backprop-free methods such as Forward-Forward (Hinton, 2022), Difference Target Propagation, and Local Greedy Forward Gradient.
- NoProp uses less memory during training.