
Neural networks that can learn to compress input data into a lower-dimensional representation while preserving the most important patterns and relationships.
While working on machine learning projects, we often encounter datasets with hundreds of features, each potentially carrying valuable information we can't simply discard.
The immediate approach, when facing a challenge like this, may be to throw all these features into a model and let it figure things out. However, this often leads to increased computational complexity, longer training times, or the dreaded curse of dimensionality.
This is where dimensionality reduction techniques enter the picture. And one powerful dimensionality reduction technique is autoencoders - neural networks that learn to compress input data into a lower-dimensional representation while preserving the most important patterns and relationships.
In this article, I'll talk about implementing autoencoders to tackle high-dimensional data. We'll explore how autoencoders can effectively compress hundreds of features into a more manageable representation while maintaining the essential information needed for downstream tasks.
At its core, an autoencoder is a neural network designed to "copy" its input to its output. I put "copy" in quotes because autoencoders are not an identity function, and we don't want them to be one; rather, they try to learn an approximation to the identity function. While this copy behavior might sound trivial at first, there's an important constraint: the network must pass the input through a bottleneck, i.e., a layer with fewer neurons than the input dimensions. This constraint forces the network to learn a compressed representation of the data.
If we only need a compressed representation of the data, wouldn't the encoder be enough? We'll talk about this in the engineering design section.
An autoencoder consists of three main components:
- The encoder, which maps the input to a lower-dimensional representation.
- The bottleneck (or code), the compressed representation of the input.
- The decoder, which reconstructs the input from the bottleneck representation.
So if we denote the encoder function by g and the decoder function by f, the autoencoder's output x' is decoded from the encoding of x, and we can write the autoencoder as:

x' = f(g(x))
An autoencoder learns by trying to minimize the difference between input and output. But since the input must pass through the bottleneck layer, the network is forced to learn which features are most important for reconstruction.
The learning process consists of defining a reconstruction evaluation function that compares the input x with the output x', and then optimizing a loss function over the parameters of the encoder and decoder functions we saw above.
For example, if the reconstruction evaluation is the L2 distance, we can use an MSE loss:

L(x, x') = ||x - f(g(x))||²
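To make this concrete, here is a minimal sketch of an autoencoder trained with an MSE reconstruction loss in PyTorch. The layer sizes (784 → 128 → 32 and back) and the training details are illustrative assumptions, not prescriptions from the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder g: compresses the input into the bottleneck code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder f: reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # x' = f(g(x))

model = Autoencoder()
criterion = nn.MSELoss()   # L(x, x') = ||x - f(g(x))||², averaged over elements
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)    # a dummy batch standing in for real data
x_hat = model(x)
loss = criterion(x_hat, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```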
For dimensionality reduction, we technically only need the encoder, whose job is compressing the data into a lower dimension. But how do we know that the encoder did a good job? This is where the decoder comes into play. The decoder validates whether the encoding actually captures the important information.
The elegance of the autoencoder's design lies in incorporating the decoder, which serves as a built-in validation mechanism during training.
Without the decoder:
- there would be no training signal: the reconstruction error is what tells the network how good its compression is;
- there would be no way to verify that the encoding actually retains the information needed to recover the input.
This is the key essence: the decoder serves both as a training mechanism and as an evaluation tool for the encoder's learned representations.
In practice, once training is complete, we discard the decoder when our sole goal is dimensionality reduction.
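In code, this amounts to keeping only the encoder module once training is done. Continuing the minimal sketch above (the names `model` and `x` are the ones assumed there):

```python
# After training, use only the encoder to project data into the low-dimensional space
with torch.no_grad():
    codes = model.encoder(x)   # shape: (64, 32) - the compressed representation
```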
While the term autoencoder as applied in modern deep learning is relatively recent, its theoretical underpinnings draw from several older concepts and foundational ideas from the fields of information theory, unsupervised learning, dimensionality reduction and neural networks.
One of the theoretical frameworks for understanding autoencoders comes from information theory. Tishby's Information Bottleneck theory (Tishby et al., 2000) provides an explanation of why this "self-encoding" approach works: such networks find an optimal balance between compression (reducing dimensionality) and preservation of relevant information.
In the mid-2000s, autoencoders were formalized as a specific type of unsupervised learning model for feature learning. They were used as a way to learn efficient, compact representations of input data, typically by training the model to reconstruct the input as accurately as possible from a lower-dimensional representation, i.e., the bottleneck.
In 2006, Hinton and Salakhutdinov (Hinton & Salakhutdinov, 2006) demonstrated that deep autoencoders could perform dimensionality reduction more effectively than principal component analysis (PCA), particularly for complex, non-linear data, marking a significant step in the development of autoencoders for feature learning.
What also makes autoencoders awesome is their evolution from simple reconstruction tools to powerful representation learning mechanisms. As outlined in comprehensive reviews (see, for example, Bengio et al., 2013), autoencoders shifted from being viewed as mere dimensionality reduction tools to becoming fundamental building blocks in representation learning and generative modeling.
This evolution led to several autoencoder variants:
- Sparse autoencoders, which add a sparsity penalty on the bottleneck activations.
- Denoising autoencoders, which learn to reconstruct clean inputs from corrupted ones.
- Contractive autoencoders, which penalize the encoding's sensitivity to input perturbations.
- Variational autoencoders, which learn a probabilistic latent space and enable generative modeling.
The autoencoder architecture provides remarkable flexibility. Depending on the problem at hand, one can design an effective autoencoder tailored to the specificities of their task.
Before taking a look at some of the key design variations deep autoencoders offer, let's demonstrate a practical implementation of an autoencoder.
We'll recreate the architecture from Hinton and Salakhutdinov's influential 2006 paper. This implementation will serve as a concrete example of the concepts we've discussed and as a stepping stone to some elements we'll be discussing in the architecture variations sections that follow.
Hinton and Salakhutdinov's work demonstrated that deep autoencoders could perform dimensionality reduction more effectively than PCA. We'll use the MNIST dataset for our demonstration purposes.
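Below is a sketch of such a deep autoencoder in PyTorch, following the 784→1000→500→250→30 encoder layout from the paper. The activation choices and the mirrored decoder are assumptions of this sketch; the original work pretrained the layers as stacked RBMs, which we skip here.

```python
import torch.nn as nn

# Deep autoencoder following the 784-1000-500-250-30 layout of
# Hinton & Salakhutdinov (2006); activations are assumptions of this sketch
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),                  # 30-dimensional code
)
decoder = nn.Sequential(
    nn.Linear(30, 250), nn.ReLU(),
    nn.Linear(250, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),  # pixel intensities in [0, 1]
)
deep_autoencoder = nn.Sequential(encoder, decoder)
```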
We maintained the overall structure (as shown in the above figure); however, there are some differences in this implementation:
We aimed to achieve similar compression quality, and indeed the reconstructed output seemed satisfactory.
When compared with PCA and Logistic PCA, the autoencoder yields better reconstruction outputs.
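As a rough way to reproduce that comparison, one can project the data onto the same number of components with scikit-learn's PCA and compare reconstruction errors. This is a sketch; the random array stands in for the flattened MNIST images actually used.

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy stand-in for (n_samples, 784) flattened MNIST images in [0, 1]
x = np.random.rand(1000, 784)

pca = PCA(n_components=30)           # same code size as the autoencoder's bottleneck
codes = pca.fit_transform(x)
x_hat = pca.inverse_transform(codes)
mse = np.mean((x - x_hat) ** 2)      # compare against the autoencoder's reconstruction MSE
print(f"PCA reconstruction MSE: {mse:.4f}")
```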
Now that we have an idea about an autoencoder's architecture, we can move to discussing some key variations its design offers.
The encoder and decoder don't have to be symmetrical in their layer sizes. For example, the encoder could use [784→1000→500→250→30] while the decoder could use [30→400→800→784].
They can also have different layer depths (for example, a deeper encoder vs. a shallower decoder) and arrangements (for example, an encoder with capacity concentrated near the input and a decoder with capacity concentrated near the output).
This is useful when the encoding and/or decoding complexity differs.
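As a quick illustration, here is a sketch of an asymmetric autoencoder using the example layouts above (the activation choices are assumptions):

```python
import torch.nn as nn

# Asymmetric autoencoder: the two sides need not mirror each other
asym_encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),
)
asym_decoder = nn.Sequential(
    nn.Linear(30, 400), nn.ReLU(),
    nn.Linear(400, 800), nn.ReLU(),
    nn.Linear(800, 784),
)
asym_autoencoder = nn.Sequential(asym_encoder, asym_decoder)
```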
The encoder and the decoder can also use different neural network architectures, for example a CNN encoder with a dense decoder for image feature extraction. This is useful for leveraging specialized architectures for specific data types.
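A minimal sketch of such a mixed architecture for 28×28 images follows; the filter counts and the 32-dimensional code are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConvDenseAutoencoder(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        # Convolutional encoder: exploits the 2D structure of images
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, code_dim),
        )
        # Dense decoder: reconstructs flattened pixels from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 400), nn.ReLU(),
            nn.Linear(400, 784), nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, 1, 28, 28)
        return self.decoder(self.encoder(x))  # output: (batch, 784)
```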
We can add residual connections between encoder and decoder layers. This involves directly passing outputs from the encoder layers to the decoder layers, resulting in better gradient flow and reduced information loss.
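Here is a sketch of this idea, passing an intermediate encoder activation directly to the corresponding decoder layer (a U-Net-style skip connection; the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class SkipAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Linear(784, 256)
        self.enc2 = nn.Linear(256, 32)        # bottleneck
        self.dec1 = nn.Linear(32, 256)
        # The skip connection concatenates enc1's output with dec1's output
        self.dec2 = nn.Linear(256 + 256, 784)
        self.act = nn.ReLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))           # intermediate encoder activation
        code = self.enc2(h1)                  # compressed representation
        d1 = self.act(self.dec1(code))
        # Skip connection: give the decoder direct access to h1
        return self.dec2(torch.cat([d1, h1], dim=1))
```

Note that because the skip connection lets information bypass the bottleneck, this variant trades some compression pressure for reconstruction quality; it is most useful when reconstruction itself, rather than the code alone, is the goal.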
Please cite this work as
NNZ. (Nov 2024). Dimensionality Reduction Through Autoencoders. NonNeutralZero. https://non-neutralzero.github.io/article-dimensionality-reduction-autoencoders/.
BibTeX citation
@article{nnz2024dimredautoencoders,
  title   = "Dimensionality Reduction Through Autoencoders",
  author  = "NNZ",
  journal = "nonneutralzero.com",
  year    = "2024",
  month   = "Nov",
  url     = "https://non-neutralzero.github.io/article-dimensionality-reduction-autoencoders"
}