
Neural networks that can learn to compress input data into a lower-dimensional representation while preserving the most important patterns and relationships.
While working on machine learning projects, we often encounter datasets with hundreds of features, each potentially carrying valuable information we can't simply discard.
The immediate approach, when facing a challenge like this, may be to throw all these features into a model and let it figure things out. However, this often leads to increased computational complexity, longer training times, or the dreaded curse of dimensionality.
This is where dimensionality reduction techniques enter the picture. And one powerful dimensionality reduction technique is autoencoders - neural networks that learn to compress input data into a lower-dimensional representation while preserving the most important patterns and relationships.
In this article, I'll talk about implementing autoencoders to tackle high-dimensional data. We'll explore how autoencoders can effectively compress hundreds of features into a more manageable representation while maintaining the essential information needed for downstream tasks.
At its core, an autoencoder is a neural network designed to "copy" its input to its output. I put "copy" in quotes because autoencoders are not an identity function, and we don't want them to be one; rather, they try to learn an approximation to the identity function. While this copy behavior might sound trivial at first, there's an important constraint: the network must pass the input through a bottleneck, i.e., a layer with fewer neurons than the input dimensions. This constraint forces the network to learn a compressed representation of the data.
If we only need a compressed representation of the data, wouldn't the encoder be enough? We'll talk about this in the engineering design section.
An autoencoder consists of three main components:
- The encoder, which maps the input to a lower-dimensional representation.
- The bottleneck (or code), the compressed representation of the input.
- The decoder, which reconstructs the input from the bottleneck representation.
So if we denote the encoder function by g and the decoder function by f, the autoencoder's output x' is decoded from the encoding of x, and we can write the autoencoder as:

x' = f(g(x))
An autoencoder learns by trying to minimize the difference between input and output. But since the input must pass through the bottleneck layer, the network is forced to learn which features are most important for reconstruction.
The learning process consists of defining a reconstruction evaluation function that compares the input x with the output x', and then optimizing a loss function over the parameters of the encoder and decoder functions we saw above.
For example, if the reconstruction evaluation is the L2 distance, we can use an MSE loss:

L(x, x') = ||x - f(g(x))||²
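To make this concrete, here is a minimal sketch of an autoencoder trained with an MSE reconstruction loss in PyTorch. The layer sizes (784 → 128 → 32 and back) and the training details are illustrative assumptions, not prescriptions from the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder g: compresses the input into the bottleneck code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder f: reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # x' = f(g(x))

model = Autoencoder()
criterion = nn.MSELoss()   # L(x, x') = ||x - f(g(x))||², averaged over elements
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)    # a dummy batch standing in for real data
x_hat = model(x)
loss = criterion(x_hat, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```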
For dimensionality reduction, we technically only need the encoder, whose job is compressing the data into a lower dimension. But how do we know that the encoder did a good job? This is where the decoder comes into play. The decoder validates whether the encoding actually captures the important information.
The elegance of the autoencoder's design lies in incorporating the decoder, which serves as a built-in validation mechanism during training.
Without the decoder:
- there would be no training signal: the reconstruction error is what tells the network how good its compression is;
- there would be no way to verify that the encoding actually retains the information needed to recover the input.
This is the key essence: the decoder serves both as a training mechanism and as an evaluation tool for the encoder's learned representations.
In practice, once training is complete, we discard the decoder when our sole goal is dimensionality reduction.
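In code, this amounts to keeping only the encoder module once training is done. Continuing the minimal sketch above (the names `model` and `x` are the ones assumed there):

```python
# After training, use only the encoder to project data into the low-dimensional space
with torch.no_grad():
    codes = model.encoder(x)   # shape: (64, 32) - the compressed representation
```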
While the term autoencoder as applied in modern deep learning is relatively recent, its theoretical underpinnings draw from several older concepts and foundational ideas from the fields of information theory, unsupervised learning, dimensionality reduction and neural networks.
One of the theoretical frameworks for understanding autoencoders comes from information theory. Tishby's Information Bottleneck theory (Tishby et al., 2000) provides an explanation of why this "self-encoding" approach works: such networks find an optimal balance between compression (reducing dimensionality) and preservation of relevant information.
In the mid-2000s, autoencoders were formalized as a specific type of unsupervised learning model for feature learning. They were used as a way to learn efficient, compact representations of input data, typically by training the model to reconstruct the input as accurately as possible from a lower-dimensional representation, i.e., the bottleneck.
In 2006, Hinton and Salakhutdinov (Hinton & Salakhutdinov, 2006) demonstrated that deep autoencoders could perform dimensionality reduction more effectively than principal component analysis (PCA), particularly for complex, non-linear data, marking a significant step in the development of autoencoders for feature learning.
What also makes autoencoders awesome is their evolution from simple reconstruction tools to powerful representation learning mechanisms. As outlined in comprehensive reviews (see, for example, Bengio et al., 2013), autoencoders shifted from being viewed as mere dimensionality reduction tools to becoming fundamental building blocks in representation learning and generative modeling.
This evolution led to several autoencoder variants:
- Sparse autoencoders, which add a sparsity penalty on the bottleneck activations.
- Denoising autoencoders, which learn to reconstruct clean inputs from corrupted ones.
- Contractive autoencoders, which penalize the encoding's sensitivity to input perturbations.
- Variational autoencoders, which learn a probabilistic latent space and enable generative modeling.
The autoencoder architecture provides remarkable flexibility. Depending on the problem at hand, one can design an effective autoencoder tailored to the specificities of their task.
Before taking a look at some of the key design variations deep autoencoders offer, let's demonstrate a practical implementation of an autoencoder.
We'll recreate the architecture from Hinton and Salakhutdinov's influential 2006 paper. This implementation will serve as a concrete example of the concepts we've discussed and as a stepping stone to some elements we'll be discussing in the architecture variations sections that follow.
Hinton and Salakhutdinov's work demonstrated that deep autoencoders could perform dimensionality reduction more effectively than PCA. We'll use the MNIST dataset for our demonstration purposes.
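Below is a sketch of such a deep autoencoder in PyTorch, following the 784→1000→500→250→30 encoder layout from the paper. The activation choices and the mirrored decoder are assumptions of this sketch; the original work pretrained the layers as stacked RBMs, which we skip here.

```python
import torch.nn as nn

# Deep autoencoder following the 784-1000-500-250-30 layout of
# Hinton & Salakhutdinov (2006); activations are assumptions of this sketch
encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),                  # 30-dimensional code
)
decoder = nn.Sequential(
    nn.Linear(30, 250), nn.ReLU(),
    nn.Linear(250, 500), nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),  # pixel intensities in [0, 1]
)
deep_autoencoder = nn.Sequential(encoder, decoder)
```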
We maintained the overall structure (as shown in the above figure); however, there are some differences in this implementation:
We aimed to achieve similar compression quality, and indeed the reconstructed output seemed satisfactory.
When compared with PCA and Logistic PCA, the autoencoder yields better reconstruction outputs.
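As a rough way to reproduce that comparison, one can project the data onto the same number of components with scikit-learn's PCA and compare reconstruction errors. This is a sketch; the random array stands in for the flattened MNIST images actually used.

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy stand-in for (n_samples, 784) flattened MNIST images in [0, 1]
x = np.random.rand(1000, 784)

pca = PCA(n_components=30)           # same code size as the autoencoder's bottleneck
codes = pca.fit_transform(x)
x_hat = pca.inverse_transform(codes)
mse = np.mean((x - x_hat) ** 2)      # compare against the autoencoder's reconstruction MSE
print(f"PCA reconstruction MSE: {mse:.4f}")
```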
Now that we have an idea about an autoencoder's architecture, we can move to discussing some key variations its design offers.
The encoder and decoder don't have to be symmetrical in their layer sizes. For example, the encoder could use [784→1000→500→250→30] while the decoder could use [30→400→800→784].
They can also have different layer depths (for example, a deeper encoder vs. a shallower decoder) and arrangements (for example, an encoder with capacity concentrated near the input and a decoder with capacity concentrated near the output).
This is useful when the encoding and/or decoding complexity differs.
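As a quick illustration, here is a sketch of an asymmetric autoencoder using the example layouts above (the activation choices are assumptions):

```python
import torch.nn as nn

# Asymmetric autoencoder: the two sides need not mirror each other
asym_encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250), nn.ReLU(),
    nn.Linear(250, 30),
)
asym_decoder = nn.Sequential(
    nn.Linear(30, 400), nn.ReLU(),
    nn.Linear(400, 800), nn.ReLU(),
    nn.Linear(800, 784),
)
asym_autoencoder = nn.Sequential(asym_encoder, asym_decoder)
```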
The encoder and the decoder can also use different neural network architectures, for example a CNN encoder with a dense decoder for image feature extraction. This is useful for leveraging specialized architectures for specific data types.
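A minimal sketch of such a mixed architecture for 28×28 images follows; the filter counts and the 32-dimensional code are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ConvDenseAutoencoder(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        # Convolutional encoder: exploits the 2D structure of images
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, code_dim),
        )
        # Dense decoder: reconstructs flattened pixels from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 400), nn.ReLU(),
            nn.Linear(400, 784), nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, 1, 28, 28)
        return self.decoder(self.encoder(x))  # output: (batch, 784)
```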
We can add residual connections between encoder and decoder layers. This involves directly passing outputs from the encoder layers to the decoder layers, resulting in better gradient flow and reduced information loss.
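Here is a sketch of this idea, passing an intermediate encoder activation directly to the corresponding decoder layer (a U-Net-style skip connection; the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class SkipAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Linear(784, 256)
        self.enc2 = nn.Linear(256, 32)        # bottleneck
        self.dec1 = nn.Linear(32, 256)
        # The skip connection concatenates enc1's output with dec1's output
        self.dec2 = nn.Linear(256 + 256, 784)
        self.act = nn.ReLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))           # intermediate encoder activation
        code = self.enc2(h1)                  # compressed representation
        d1 = self.act(self.dec1(code))
        # Skip connection: give the decoder direct access to h1
        return self.dec2(torch.cat([d1, h1], dim=1))
```

Note that because the skip connection lets information bypass the bottleneck, this variant trades some compression pressure for reconstruction quality; it is most useful when reconstruction itself, rather than the code alone, is the goal.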
Please cite this work as
NNZ. (Nov 2024). Dimensionality Reduction Through Autoencoders. NonNeutralZero. https://non-neutralzero.github.io/article-dimensionality-reduction-autoencoders/.
BibTeX citation
@article{nnz2024dimredautoencoders,
  title   = "Dimensionality Reduction Through Autoencoders",
  author  = "NNZ",
  journal = "nonneutralzero.com",
  year    = "2024",
  month   = "Nov",
  url     = "https://non-neutralzero.github.io/article-dimensionality-reduction-autoencoders"
}