Guide to Autoencoders
(A tutorial on autoencoders)
Useful Resources
Autoencoders
Introduction
We aren’t going to spend too much time on just autoencoders because they are not as widely used today due to the development of better models. However, we will cover them because they are essential to understanding the later topics of this guide.
The premise: you are trying to create a neural network that can efficiently encode your input data in a lower-dimensional encoding, which it can then decode back into the original input while losing as little of the original input as possible. This is useful for the following reason. Imagine your input data is very high dimensional, but in reality, the only valid inputs you would ever receive occupy a small part of this high-dimensional space. In fact, they lie on a manifold within this space, which can be described using fewer dimensions, and these dimensions can have properties that are useful to learn, as they capture some intrinsic/invariant aspect of the input space.
To achieve this dimensionality reduction, the autoencoder was introduced as an unsupervised learning method that attempts to reconstruct a given input from fewer bits of information.
Basic Architecture
Now at this point, the theory starts to involve an understanding of what neural networks are. The prototypical autoencoder is a neural network whose input and output layers are identical in width. It “funnels” the input, through a sequence of hidden layers, into a hidden layer narrower than the input, and then “fans out” back to the original input dimension to construct the output. Typically, the sequence of layers leading to the middle layer is repeated in reverse order to scale back up to the output layer. The sequence of funneling layers is referred to as the “encoder,” and the fanning-out layers are called the “decoder.”
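The funnel shape described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the original text: the layer widths `[8, 4, 2]`, the sigmoid activation on every layer, the uniform weight range, and all function names are hypothetical choices, and the weights are left untrained.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_autoencoder(widths, rng):
    """widths lists the encoder layer sizes; the decoder mirrors them in reverse."""
    encoder = [init_layer(a, b, rng) for a, b in zip(widths, widths[1:])]
    mirrored = widths[::-1]
    decoder = [init_layer(a, b, rng) for a, b in zip(mirrored, mirrored[1:])]
    return encoder, decoder

def forward(x, layers):
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

rng = random.Random(0)
encoder, decoder = build_autoencoder([8, 4, 2], rng)
x = [rng.random() for _ in range(8)]
code = forward(x, encoder)               # the narrow middle layer
reconstruction = forward(code, decoder)  # same width as the input
print(len(x), len(code), len(reconstruction))
```

The input and the reconstruction have the same width, while the middle layer is narrower, which is the only structural property the sketch is meant to show.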
The loss function ^{1} typically used in these architectures is mean squared error $J(x,z) = \lVert x - z\rVert^2$, which measures how close the reconstructed input $z$ is to the original input $x$. When the data resembles a vector of binary values or a vector of probabilities (both of which take values in the range $[0,1]$), you can also use the cross-entropy of the reconstruction as the loss function, which measures how many “bits” of information are preserved in the reconstruction compared to the original. This loss function is

$$J(x,z) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log (1 - z_k) \right]$$

where $d$ is the dimension of the input.
Once you’ve picked a loss function, you need to consider what activation functions to use on the hidden layers of the autoencoder. In practice, if you use the reconstruction cross-entropy as the loss, it is important to make sure that
(a) your data is binary or scaled to the range from 0 to 1, and (b) you are using a sigmoid activation in the last layer.
You can also optionally use sigmoid activations for each hidden layer, as that keeps the activation values between 0 and 1, so every layer works with values in the same range as the data it was given.
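The two reconstruction losses discussed in this section can be written as small helper functions. This is a minimal sketch: the function names, the `eps` clipping used to avoid `log(0)`, and the example vectors are assumptions made for illustration. The natural logarithm is used, so the cross-entropy is measured in nats rather than bits.

```python
import math

def squared_error(x, z):
    """Squared Euclidean distance between input x and reconstruction z."""
    return sum((xk - zk) ** 2 for xk, zk in zip(x, z))

def reconstruction_cross_entropy(x, z, eps=1e-12):
    """Cross-entropy between x and z, both with entries in [0, 1]."""
    total = 0.0
    for xk, zk in zip(x, z):
        zk = min(max(zk, eps), 1.0 - eps)  # keep log() finite
        total -= xk * math.log(zk) + (1.0 - xk) * math.log(1.0 - zk)
    return total

x = [1.0, 0.0, 1.0, 1.0]        # binary input
good = [0.9, 0.1, 0.8, 0.95]    # reconstruction close to x
bad = [0.4, 0.6, 0.5, 0.3]      # reconstruction far from x

print(squared_error(x, good), squared_error(x, bad))
print(reconstruction_cross_entropy(x, good), reconstruction_cross_entropy(x, bad))
```

Both losses are smaller for the closer reconstruction; the cross-entropy version is only meaningful when the entries of `x` and `z` lie between 0 and 1, which is why the sigmoid output layer matters.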
Application to pretraining networks
There are many ways to select the initial weights of a neural network architecture. A common scheme is random initialization, which sets the biases and weights of all the nodes in each hidden layer randomly, so the network starts at a random point of the parameter space and of the objective function, and then finds a nearby local minimum using an algorithm like SGD or Adam. In 2006-2007, autoencoders were discovered to be a useful way to pretrain networks (in 2012 this was applied to conv nets), in effect initializing the weights of the network to values that are closer to the optimum and therefore require fewer epochs to train. While I could try re-explaining how that works here, Quoc Le’s explanation from his series of Stanford lectures is much better, so I’ll include the links to that below.^{2} ^{3} In particular, look at section 2.2 of the deep learning tutorial for the part about pretraining with autoencoders.
However, other random initialization schemes have been found more recently to work better than pretraining with autoencoders. For more on this, see Martens for Hessian-free optimization as one of these methods, and Sutskever, Martens et al. for an overview of initialization and momentum.
Sparsity
One of the things that I am currently experimenting with is the construction of sparse autoencoders. These can be implemented in a number of ways, one of which uses sparse, wide hidden layers before the middle layer to make the network discover properties in the data that are useful for “clustering” and visualization. Typically, however, a sparse autoencoder creates a sparse encoding by enforcing an $\ell_1$ constraint on the middle layer. It does this by including the $\ell_1$ penalty in the cost function, so, if we are using MSE, the cost function becomes

$$J(x,z) = \lVert x - z\rVert^2 + \lambda \lVert s\rVert_1$$

where $s$ is the sparse coding in the middle layer, and $\lambda$ is a regularization parameter that weights the influence of the $\ell_1$ constraint on the overall cost function. For more on these, see sparse coding.
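As a small worked example of the penalized cost described above, the sketch below adds an L1 term on the middle-layer code to the squared reconstruction error. The function name, the value of `lam`, and the example vectors are illustrative assumptions.

```python
def l1_penalized_cost(x, z, s, lam):
    """Squared reconstruction error plus lam times the L1 norm of the code s."""
    reconstruction = sum((xk - zk) ** 2 for xk, zk in zip(x, z))
    sparsity = sum(abs(sk) for sk in s)
    return reconstruction + lam * sparsity

x = [1.0, 0.0, 1.0]
z = [0.9, 0.1, 0.8]                  # same reconstruction in both cases
dense_code = [0.5, -0.4, 0.6, 0.3]   # many active units
sparse_code = [0.9, 0.0, 0.0, 0.0]   # one active unit

print(l1_penalized_cost(x, z, dense_code, lam=0.1))
print(l1_penalized_cost(x, z, sparse_code, lam=0.1))
```

With the reconstruction held fixed, the code with fewer active units has the lower cost, which is the pressure that pushes the middle layer toward sparse encodings. Setting `lam` to 0 recovers the plain squared error.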
Denoising Autoencoders
Introduction
 autoencoders to reconstruct noisy data
 Useful for weight initialization
 unsupervised learning criterion for layer-by-layer initialization ^{4}:
 each layer is trained to produce a higher-level representation
 with successive layers, the representation becomes more abstract
 then, global fine-tuning of parameters with another training criterion
 robustness to partial destruction of input
Denoising Approach
 introduce noise into the observed input
 the corrupted input should yield almost the same representation as the clean input
 guided by the fact that a good representation captures stable structures in the form of dependencies and regularities characteristic of the unknown distribution of the input
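 goal:
 minimize the average reconstruction error over the $n$ training examples
 $$\frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, z^{(i)}\big)$$
 where $L$ is a loss function such as squared error, $x^{(i)}$ is the $i$th training input, and $z^{(i)}$ is its reconstruction
 An alternative loss is the reconstruction cross-entropy, for vectors of bit probabilities
 if $x$ is a binary vector, the binary cross-entropy becomes the negative log-likelihood for $x$, given the Bernoulli parameters $z$ (Eq. 1):
 $$L(x, z) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log (1 - z_k) \right] \quad (1)$$
 where $d$ is the dimension of the input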
DAE objective function
 one way to destroy components of the input is to zero the values of a randomly chosen subset of them, giving the corrupted input $\tilde{x}$
 the corrupted input is then mapped to a hidden representation $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$, from which we reconstruct $z = g_{\theta'}(y) = s(W'y + b')$
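 define the joint distribution
 $$q^0(X, \tilde{X}, Y) = q^0(X)\, q_D(\tilde{X} \mid X)\, \delta_{f_\theta(\tilde{X})}(Y)$$
 where $q^0(X)$ is the empirical distribution of the training inputs, $q_D(\tilde{X} \mid X)$ is the corruption process, and $\delta_u(v)$ puts mass $0$ when $u \neq v$, so $Y$ is a deterministic function of $\tilde{X}$.
 the objective function minimized by SGD is:
 $$\arg\min_{\theta, \theta'} \; \mathbb{E}_{q^0(X, \tilde{X})} \Big[ L\big(X, g_{\theta'}(f_\theta(\tilde{X}))\big) \Big]$$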
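The corruption step and the per-example loss from the notes above can be sketched as follows. Several things here are assumptions rather than part of the original notes: the corruption zeroes a fixed fraction of components (one common form of masking noise), $s$ is taken to be the sigmoid, $L$ is taken to be squared error, and the weights are random and untrained. All names are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine_sigmoid(x, W, b):
    """Compute s(Wx + b) for a weight matrix W (list of rows) and bias b."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def corrupt(x, fraction, rng):
    """Masking noise: force a randomly chosen subset of components to 0."""
    n_zero = int(round(fraction * len(x)))
    zeroed = set(rng.sample(range(len(x)), n_zero))
    return [0.0 if i in zeroed else xi for i, xi in enumerate(x)]

def dae_loss(x, W, b, W_prime, b_prime, fraction, rng):
    """Squared reconstruction error for one example under masking noise."""
    x_tilde = corrupt(x, fraction, rng)        # corrupted input
    y = affine_sigmoid(x_tilde, W, b)          # y = s(W x_tilde + b)
    z = affine_sigmoid(y, W_prime, b_prime)    # z = s(W' y + b')
    # The target is the clean x, not the corrupted x_tilde.
    return sum((xk - zk) ** 2 for xk, zk in zip(x, z))

rng = random.Random(0)
x = [1.0] * 10
x_tilde = corrupt(x, 0.3, rng)
print(x_tilde)  # 3 of the 10 components are zeroed

W = [[rng.uniform(-0.5, 0.5) for _ in range(10)] for _ in range(4)]
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(10)]
loss = dae_loss(x, W, [0.0] * 4, W_prime, [0.0] * 10, 0.3, rng)
print(loss)
```

The detail worth noticing is in `dae_loss`: the network only ever sees the corrupted input, but the error is measured against the clean input, which is what forces the representation to be robust to the corruption.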
Layer-wise initialization and fine-tuning
 the representation of the $k$th layer is used to train the $(k+1)$th layer ^{5}
 the result is used as the initialization for optimizing the network with respect to a supervised training criterion
 the greedy layer-wise approach finds better local minima than random initialization
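The greedy procedure in the notes above can be sketched as a loop that trains one small autoencoder per layer and feeds its hidden representation to the next one. This is a toy illustration under stated assumptions: each layer is a plain (not denoising) sigmoid autoencoder with untied weights trained by SGD on squared error, and the data, layer sizes, learning rate, and function names are all hypothetical. The supervised fine-tuning stage is not shown.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine_sigmoid(x, W, b):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def train_autoencoder(data, n_hidden, rng, epochs=200, lr=0.5):
    """Train one autoencoder with SGD; return encoder parameters and loss history."""
    n_in = len(data[0])
    W, b = random_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
    W2, b2 = random_matrix(n_in, n_hidden, rng), [0.0] * n_in
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            y = affine_sigmoid(x, W, b)      # hidden representation
            z = affine_sigmoid(y, W2, b2)    # reconstruction
            total += sum((zk - xk) ** 2 for zk, xk in zip(z, x))
            # Backpropagate the squared error through both sigmoid layers.
            d_out = [2 * (zk - xk) * zk * (1 - zk) for zk, xk in zip(z, x)]
            d_hid = [sum(d_out[k] * W2[k][j] for k in range(n_in)) * y[j] * (1 - y[j])
                     for j in range(n_hidden)]
            for k in range(n_in):
                for j in range(n_hidden):
                    W2[k][j] -= lr * d_out[k] * y[j]
                b2[k] -= lr * d_out[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    W[j][i] -= lr * d_hid[j] * x[i]
                b[j] -= lr * d_hid[j]
        history.append(total / len(data))
    return (W, b), history

def greedy_pretrain(data, hidden_sizes, rng):
    """Train layer k+1 on the representation produced by layer k."""
    layers, histories, reps = [], [], data
    for n_hidden in hidden_sizes:
        (W, b), history = train_autoencoder(reps, n_hidden, rng)
        layers.append((W, b))
        histories.append(history)
        reps = [affine_sigmoid(x, W, b) for x in reps]
    return layers, histories

rng = random.Random(0)
data = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]]
layers, histories = greedy_pretrain(data, [4, 2], rng)
for depth, history in enumerate(histories, start=1):
    print(depth, history[0], history[-1])
```

The returned `layers` list holds one `(W, b)` pair per hidden layer; in the full procedure these would be used as the starting weights of a network that is then trained on the supervised objective.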
Practical Considerations
So, what does any of this mean? How can I use this? First, it’s important to note what autoencoders are useful for. The main uses of autoencoders today are generation and denoising, which are handled by variational and denoising autoencoders respectively. A third application is dimensionality reduction for data visualization, as autoencoders find interesting lower-dimensional embeddings of the data.
Variational Autoencoders
To learn more about the statistical background to VAEs, Eric Jang’s post is a great resource to get started.
Variational Autoencoders are a relatively recent application of neural networks to generate ‘samples’ based on the representations of the input space that they have ‘learned.’ Eric’s article goes in depth into the methods that are applied in these models, but the key takeaway is the goal of learning an approximation of an underlying distribution in the data that allows you to generate samples that are close to the data input into your model. This is done by optimizing the “encoding” $z \sim Q(Z|X)$ and “decoding” $x \sim P(X|Z)$ distributions to maximize the variational lower bound $\mathcal{L} = \log p(x) - KL(Q(Z|X) \,\|\, P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X) \,\|\, P(Z))$.
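The two expressions for the lower bound can be checked numerically on a toy model. The sketch below assumes a discrete latent variable with three values and a single fixed observation $x$; the prior, likelihood, and approximate posterior are made-up numbers chosen only for illustration, not anything from a real VAE.

```python
import math

prior = [0.5, 0.3, 0.2]          # P(Z)
likelihood = [0.9, 0.2, 0.05]    # P(x | Z = z) for one fixed observation x
q = [0.7, 0.2, 0.1]              # Q(Z | x), an arbitrary approximate posterior

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

evidence = sum(p * l for p, l in zip(prior, likelihood))            # p(x)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]   # P(Z | x)

# First form: log p(x) minus the KL to the true posterior.
bound_a = math.log(evidence) - kl(q, posterior)
# Second form: expected log-likelihood minus the KL to the prior.
bound_b = sum(qi * math.log(li) for qi, li in zip(q, likelihood)) - kl(q, prior)

print(bound_a, bound_b, math.log(evidence))
```

The two forms agree up to floating-point error, and both sit below $\log p(x)$ because the KL term is non-negative. The gap closes exactly when $Q$ equals the true posterior, which is why pushing the bound up improves the approximation.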
Adversarial Autoencoders
https://arxiv.org/abs/1511.05644
References

^{1} http://www.deeplearning.net/tutorial/dA.html
^{2} http://www.trivedigaurav.com/blog/quocleslecturesondeeplearning/
^{3} http://ai.stanford.edu/~quocle/tutorial2.pdf
^{4} http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
^{5} http://info.usherbrooke.ca/hlarochelle/publications/vincent10a.pdf