
Natural evolution strategies as a proxy to variational Bayesian inference for deep policy networks, with applications to session-based recommendation.

ABSTRACT

This work aims to develop innovative preference-learning methodologies and related recommendation policies suitable for data with uncertainty. This is achieved through a new approach to variational Bayesian inference for GRU networks that resembles ideas from the theory of natural evolution strategies. A simple adaptation of truncated backpropagation through time can yield good-quality uncertainty estimates and superior regularization at only a small extra computational cost during training. Local gradient information is incorporated into the approximate posterior to sharpen it around the current batch statistics, and it is demonstrated how this technique can be applied to train Bayesian Neural Networks in general, rather than exclusively Recurrent Neural Networks. This work will be implemented in Python and TensorFlow, extending existing code.

1. INTRODUCTION

Recurrent Neural Networks (RNNs) achieve state-of-the-art performance on a wide variety of sequence prediction tasks (Wu et al., 2016; Amodei et al., 2015; Jozefowicz et al., 2016; Zaremba et al., 2014; Lu et al., 2016). This work investigates how to add uncertainty and regularization to RNNs by applying Bayesian methods during training. Such an approach allows the network to express uncertainty through its parameters. At the same time, by using a prior to integrate out the parameters and thus average over many models during training, it has a regularizing effect on the network. Recent approaches either justify dropout (Srivastava et al., 2014) and weight decay as a variational inference scheme (Gal & Ghahramani, 2016), or apply Stochastic Gradient Langevin Dynamics (Welling & Teh, 2011, SGLD) directly to truncated backpropagation through time (Gan et al., 2016). Notably, recent work has not investigated in depth the direct application of a variational Bayes inference scheme (Beal, 2003) to RNNs, as was done in Graves (2011). A straightforward approach based on Bayes by Backprop (Blundell et al., 2015) is developed that demonstrates good performance on large-scale problems. The technique is a simple modification of truncated backpropagation through time that yields an approximation to the posterior distribution over the weights of the RNN.

Such a formulation leads straightforwardly to a cost function grounded in information theory via a bits-back argument (Hinton & Van Camp, 1993), in which the role of the regulariser is played by a KL divergence.

The form of the posterior in variational inference determines the quality of the uncertainty estimates and, as a result, the overall performance of the model. It is shown later in this document how the performance of the RNN improves when the posterior is sharpened locally to a batch. Using gradients computed on the batch, this sharpening process adapts the variational posterior to a batch of data. This can be viewed as a form of hierarchical distribution in which a local batch gradient is used to readjust a global posterior, forming a local approximation for each batch. Thus, when variational inference is applied to neural networks, this gives a more flexible form than the standard assumption of a Gaussian posterior, and it reduces variance. The technique can be applied more widely across other Bayesian models.

Furthermore, an efficient application of Bayes by Backprop (BBB) to RNNs is demonstrated, along with the development of an innovative technique for reducing the variance of BBB that can be adopted more widely in other maximum likelihood frameworks. The method outperforms established regularization techniques such as dropout by a big margin on two widely studied benchmarks, and a new benchmark for studying the uncertainty of language models is introduced.

2. BAYES BY BACKPROP

The scheme used for learning the posterior distribution in this work is Bayes by Backprop (Graves, 2011; Blundell et al., 2015). It is a variational inference scheme (Wainwright et al., 2008) applied to the weights θ ∈ R^d of a neural network. The approximating distribution is typically taken to be a Gaussian with mean parameter μ ∈ R^d and standard deviation parameter σ ∈ R^d, denoted N(θ | μ, σ²), with a diagonal covariance matrix; d, the dimensionality of the parameters of the network, is typically in the order of millions. Let log p(y | θ, x) be the log-likelihood of the model. The network is then trained by minimizing the variational free energy:

\mathcal{L}(\theta) = \mathbb{E}_{q(\theta)}\left[\log \frac{q(\theta)}{p(y \mid \theta, x)\, p(\theta)}\right], \quad (1)

where p(θ) is a prior on the parameters.

Minimizing the variational free energy (1) maximizes the log-likelihood log p(y | θ, x), subject to a KL complexity term on the parameters of the network that acts as a regulariser:

\mathcal{L}(\theta) = -\mathbb{E}_{q(\theta)}[\log p(y \mid \theta, x)] + \mathrm{KL}[q(\theta) \,\|\, p(\theta)]. \quad (2)

In the Gaussian case with a zero mean prior, the KL term can be seen as a form of weight decay on the mean parameters. The standard deviation parameters of the prior and posterior automatically tune the rate of weight decay.
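To make the training objective concrete, the following is a minimal NumPy sketch of a single-sample Monte Carlo estimate of the free energy (2), assuming a fully factorized Gaussian posterior and a zero-mean Gaussian prior. The names (`sample_weights`, `bbb_loss`, the softplus parameterization of σ, and the `neg_log_lik` callable) are illustrative assumptions, not taken from the original implementation.

```python
import numpy as np

def sample_weights(mu, rho, rng):
    # Reparameterization trick: sigma = softplus(rho) keeps the std positive,
    # and theta = mu + sigma * eps is differentiable w.r.t. (mu, rho).
    sigma = np.log1p(np.exp(rho))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, sigma

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL[q || p] between fully factorized Gaussians.
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
                  - 0.5)

def bbb_loss(mu, rho, x, y, neg_log_lik, rng, prior_sigma=1.0):
    # One Monte Carlo sample of the variational free energy in eq. (2).
    theta, sigma = sample_weights(mu, rho, rng)
    nll = neg_log_lik(theta, x, y)  # -log p(y | theta, x) for the model
    kl = kl_diag_gaussians(mu, sigma, np.zeros_like(mu), prior_sigma)
    return nll + kl
```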

The uncertainty afforded by Bayes by Backprop trained networks has been used successfully for training feedforward models for supervised learning and for aiding exploration by reinforcement learning agents (Blundell et al., 2015; Lipton et al., 2016; Houthooft et al., 2016), but it has not yet found application to recurrent neural networks.

3. TRUNCATED BAYES BY BACKPROP THROUGH TIME

The core of an RNN, f, is a neural network that maps the RNN state s_t at step t and an input observation x_t to a new RNN state s_{t+1}: f : (s_t, x_t) → s_{t+1}. An LSTM core (Hochreiter & Schmidhuber, 1997) has a state s_t = (c_t, h_t) consisting of two parts: an internal core state c_t and an exposed state h_t. The effect of the inputs on the outputs is modulated by intermediate gates: the input gate i_t, the forget gate f_t, and the output gate o_t. The equations describing the gates of an LSTM cell are as follows:

i_t = \sigma(W_i [x_t, h_{t-1}]^T + b_i),

f_t = \sigma(W_f [x_t, h_{t-1}]^T + b_f),

c_t = f_t c_{t-1} + i_t \tanh(W_c [x_t, h_{t-1}]^T + b_c),

o_t = \sigma(W_o [x_t, h_{t-1}]^T + b_o),

h_t = o_t \tanh(c_t),

where the weights (and biases) W_i (b_i), W_f (b_f), W_c (b_c) and W_o (b_o) parameterize, respectively, the input gate, forget gate, cell update, and output gate, and products between gate activations and states are element-wise.
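Below is a minimal NumPy sketch of a single LSTM step following the equations above. The dictionary layout of `W` and `b` and the helper name `lstm_step` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b map each gate name to its weight matrix / bias vector acting
    # on the concatenated vector [x_t, h_{t-1}].
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W["i"] @ z + b["i"])                        # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])                        # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])   # cell update
    h_t = o_t * np.tanh(c_t)                                  # exposed state
    return h_t, c_t
```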

Training the RNN on a sequence of length T proceeds by unrolling the core T times, turning it into a feedforward network that can be trained by backpropagation. Writing s_i = f(s_{i-1}, x_i) for i = 1, ..., T, an RNN core unrolled for T steps is denoted by s_{1:T} = F_T(x_{1:T}, s_0). In the truncated version of the algorithm, s_0 is taken to be the last state of the previous batch, s_T.

RNN parameters are then learnt in much the same way as in a feedforward neural network: a loss is applied to the states s_{1:T} of the RNN, and the weights of the network are updated using backpropagation. Crucially, the weights at each of the unrolled steps are shared, so each weight of the RNN core receives T gradient contributions when the RNN is unrolled for T steps.

Applying BBB to RNNs is illustrated in Figure 1, where the weight matrices of the RNN are drawn from a distribution learnt by BBB. Yet this direct application raises two important questions: when to sample the parameters of the RNN, and how to weight the contribution of the KL regulariser of equation (2). A brief justification of the adaptation of BBB to RNNs is given below. The variational free energy of equation (2) for an RNN on a sequence of length T is:

\mathcal{L}(\theta) = -\mathbb{E}_{q(\theta)}[\log p(y_{1:T} \mid \theta, x_{1:T})] + \mathrm{KL}[q(\theta) \,\|\, p(\theta)], \quad (3)

where p(y_{1:T} | θ, x_{1:T}) is the likelihood of a sequence produced when the states of an unrolled RNN F_T are fed into an appropriate probability distribution, and θ denotes the parameters of the entire network. Even though the RNN is unrolled T times, each weight is penalized just once by the KL term, rather than T times. It is also clear from (3) that when a Monte Carlo approximation is taken to the expectation, the parameters θ should be held fixed throughout the entire sequence.

In practice, two major complications arise from the above naive derivation: first, sequences are often long enough, and models sufficiently large, that unrolling the RNN for the whole sequence is prohibitive; second, more than one sequence is trained at a time, in order to reduce the variance of the gradients. Consequently, the typical regime for training RNNs involves training on mini-batches of truncated sequences. Let B be the number of mini-batches and C the number of truncated sequences ("cuts"); then equation (3) can be written as:

\mathcal{L}(\theta) = -\mathbb{E}_{q(\theta)}\left[\log \prod_{b=1}^{B} \prod_{c=1}^{C} p(y^{(b,c)} \mid \theta, x^{(b,c)})\right] + \mathrm{KL}[q(\theta) \,\|\, p(\theta)], \quad (4)

where the (b, c) superscript denotes elements of the c-th truncated sequence in the b-th minibatch. In this way, the free energy of minibatch b of a truncated sequence c can be written as:

\mathcal{L}^{(b,c)}(\theta) = -\mathbb{E}_{q(\theta)}[\log p(y^{(b,c)} \mid \theta, x^{(b,c)}, s^{(b,c)}_{\mathrm{prev}})] + w^{(b,c)}_{\mathrm{KL}} \, \mathrm{KL}[q(\theta) \,\|\, p(\theta)], \quad (5)

where w^{(b,c)}_{KL} distributes the responsibility of the KL cost among minibatches and truncated sequences (thus \sum_{b=1}^{B} \sum_{c=1}^{C} w^{(b,c)}_{KL} = 1), and s^{(b,c)}_{prev} is the initial state of the RNN for the minibatch x^{(b,c)}. In practice, w^{(b,c)}_{KL} is picked so that the KL penalty is equally distributed among all minibatches and truncated sequences, i.e. w^{(b,c)}_{KL} = 1/(CB). The truncated sequences in subsequent minibatches are picked in order, and so s^{(b,c)}_{prev} is set to the last state of the RNN for x^{(b,c-1)}.

Finally, the question of when to sample weights follows naturally from taking a Monte Carlo approximation to equation (5): for each minibatch, a fresh set of parameters is sampled.
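Putting the pieces together, here is a minimal sketch of the per-minibatch objective (5), reusing the `sample_weights` and `kl_diag_gaussians` helpers from the sketch in Section 2. The `unroll_nll` callable, which unrolls the core for T steps from `s_prev` and returns the negative log-likelihood together with the final state, is an assumed placeholder.

```python
import numpy as np

def truncated_bbb_minibatch_loss(mu, rho, batch, s_prev, unroll_nll,
                                 num_batches, num_cuts, rng, prior_sigma=1.0):
    # One Monte Carlo sample of eq. (5): a fresh set of parameters is drawn
    # per minibatch and held fixed across all T unrolled steps of the cut.
    x, y = batch
    theta, sigma = sample_weights(mu, rho, rng)
    nll, s_last = unroll_nll(theta, x, y, s_prev)  # -log p(y | theta, x, s_prev)
    kl_weight = 1.0 / (num_cuts * num_batches)     # w_KL^{(b,c)} = 1 / (C * B)
    kl = kl_diag_gaussians(mu, sigma, np.zeros_like(mu), prior_sigma)
    return nll + kl_weight * kl, s_last            # s_last seeds the next cut
```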

4. POSTERIOR SHARPENING

As described above, the choice of variational posterior q(θ) can be enhanced by adding side information, resulting in a more accurate posterior over the parameters and thereby reducing the variance of the learning process.

This paper proposes an approach similar to Variational Auto Encoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), which use a powerful distribution q(z|x) to improve the gradient estimates of the (intractable) likelihood function p(x). Specifically, a distribution q(θ | (x, y)) is constructed for a given minibatch of data (inputs and targets) (x, y) sampled from the training set. Thus, a proposal distribution is obtained in which the latents (z in VAEs) are the parameters θ (which we wish to integrate out), and the "privileged" information upon which it conditions is a minibatch of data.

If this were instead applied to a single example (x, y), a different parameter vector θ would be generated per example. The major advantage of producing a single θ per minibatch is that matrix-matrix operations can still be carried out.

This "sharpened" posterior provides more stable optimization, avoiding a common pitfall of Bayesian approaches: instability when training neural networks. The backbone of the method derives from strong empirical evidence and extensive work on VAEs.

The real challenge in modelling the variational posterior q(θ | (x, y)) is the large number of dimensions of θ ∈ R^d. Following the proposals in Kingma & Welling (2013) and Rezende et al. (2014), when the dimensionality is not in the order of millions, an effective non-linear function such as a neural network can be used to transform the observations (x, y) into the parameters of a Gaussian distribution. Here, however, such a network would itself require a massive number of parameters, making the approach impractical.

Since the loss -log p(y | θ, x) is a differentiable function of θ, the proposal is instead to parameterize q as a linear combination of θ and g_θ = -∇_θ log p(y | θ, x), both d-dimensional vectors.

Thus, a hierarchical posterior is defined, of the form

q(\theta \mid x, y) = \int q(\theta \mid \varphi, (x, y))\, q(\varphi)\, d\varphi, \quad (6)

with μ, σ ∈ R^d, and q(φ) = N(φ | μ, σ), the same as in the standard BBB method. Finally, letting * denote element-wise multiplication, the sharpened posterior takes the form:

q(\theta \mid \varphi, (x, y)) = \mathcal{N}(\theta \mid \varphi - \eta * g_\varphi, \, \sigma_0^2 I), \quad (7)

where η ∈ R^d is a free parameter to be learnt and σ_0 is a scalar hyper-parameter of the model. η can be interpreted as a per-parameter learning rate.
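A minimal sketch of drawing one sample from the sharpened posterior (7) is given below; `sharpened_sample` is an illustrative name, and `g_phi` is the minibatch gradient g_φ = -∇_φ log p(y | φ, x), computed by whatever autodiff machinery the model uses.

```python
import numpy as np

def sharpened_sample(phi, g_phi, eta, sigma0, rng):
    # theta ~ N(theta | phi - eta * g_phi, sigma0^2 I), as in eq. (7):
    # the mean is phi moved along the negative batch gradient, with eta
    # acting as a learnt per-parameter learning rate ('*' is element-wise).
    mean = phi - eta * g_phi
    return mean + sigma0 * rng.standard_normal(phi.shape)
```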

During training, θ ~ q(θ | (x, y)) is obtained via ancestral sampling in order to optimize the loss

L(\mu, \sigma, \eta) = \mathbb{E}_{(x,y)}\left[\mathbb{E}_{q(\varphi)\, q(\theta \mid \varphi, (x,y))}[L(x, y, \theta, \varphi \mid \mu, \sigma, \eta)]\right], \quad (8)

with L(x,y,θ,φ|μ,σ,η) given by

L(x, y, \theta, \varphi \mid \mu, \sigma, \eta) = -\log p(y \mid \theta, x) + \mathrm{KL}[q(\theta \mid \varphi, (x,y)) \,\|\, p(\theta \mid \varphi)] + \frac{1}{C}\, \mathrm{KL}[q(\varphi) \,\|\, p(\varphi)], \quad (9)

where μ, σ, η are the model parameters and the p are the priors for the distributions defining q. The constant C is, as in Section 3, the number of truncated sequences. The bound on the true data likelihood which yields eq. (8) is derived in Section 4.1, and Algorithm 1 presents how learning is performed in practice.
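For concreteness, here is a minimal sketch of one training-loss evaluation under posterior sharpening, reusing the `sample_weights`, `kl_diag_gaussians`, and `sharpened_sample` helpers defined above. It assumes, purely as an illustration, the prior p(θ | φ) = N(θ | φ, σ_0² I), under which the first KL term in (9) has the closed form ||η * g_φ||² / (2σ_0²); `grad_fn` and `neg_log_lik` are placeholder callables.

```python
import numpy as np

def sharpening_loss(mu, rho, eta, sigma0, batch, neg_log_lik, grad_fn,
                    num_cuts, rng, prior_sigma=1.0):
    # One Monte Carlo sample of eq. (8)-(9) via ancestral sampling.
    x, y = batch
    phi, sigma = sample_weights(mu, rho, rng)   # phi ~ q(phi) = N(mu, sigma^2)
    g_phi = grad_fn(phi, x, y)                  # g_phi = -grad_phi log p(y | phi, x)
    theta = sharpened_sample(phi, g_phi, eta, sigma0, rng)
    nll = neg_log_lik(theta, x, y)              # -log p(y | theta, x)
    # KL between the equal-covariance Gaussians q(theta | phi, (x,y)) and the
    # assumed prior p(theta | phi) = N(theta | phi, sigma0^2 I):
    kl_theta = np.sum((eta * g_phi) ** 2) / (2.0 * sigma0 ** 2)
    kl_phi = kl_diag_gaussians(mu, sigma, np.zeros_like(mu), prior_sigma)
    return nll + kl_theta + kl_phi / num_cuts   # eq. (9) with the 1/C weight
```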

The lower bound in equation (8) improves when the gain in the log-likelihood log p(y | θ, x) along the gradient g_φ is greater than the KL cost added by posterior sharpening, KL[q(θ | φ, (x,y)) || p(θ | φ)]. This justifies the effectiveness of the posterior over the parameters proposed in equation (7), which will be effective when the curvature of log p(y | θ, x) is large. Since η is learnt, it controls the trade-off between curvature improvement and KL loss.

There are two ways to perform inference under posterior sharpening. The first uses q(φ) and discards any KL terms, akin to regular BBB; the second uses q(θ | φ, (x,y)), which requires the term KL[q(θ | φ, (x,y)) || p(θ | φ)] and yields an upper bound on perplexity. This parameterization involves computing an extra gradient, which adds an extra cost to training speed.

4.1 DERIVATION OF FREE ENERGY FOR POSTERIOR SHARPENING

This section derives the training loss function used for posterior sharpening. The main idea is to hierarchically factorize a variational approximation to the marginal likelihood p(x). Previous work on hierarchical variational schemes for topic models was done in Ranganath et al. (2016). Assume a hierarchical prior for the parameters, such that p(x) = ∫ p(x | θ) p(θ | φ) p(φ) dθ dφ, and pick a variational posterior that conditions upon x and factorizes as q(θ, φ | x) = q(θ | φ, x) q(φ). The lower bound on p(x) is then:

\log p(x) = \log \int p(x \mid \theta)\, p(\theta \mid \varphi)\, p(\varphi)\, d\theta\, d\varphi \quad (10)

\geq \mathbb{E}_{q(\varphi, \theta \mid x)}\left[\log \frac{p(x \mid \theta)\, p(\theta \mid \varphi)\, p(\varphi)}{q(\varphi, \theta \mid x)}\right] \quad (11)

= \mathbb{E}_{q(\theta \mid \varphi, x)\, q(\varphi)}\left[\log \frac{p(x \mid \theta)\, p(\theta \mid \varphi)\, p(\varphi)}{q(\theta \mid \varphi, x)\, q(\varphi)}\right] \quad (12)

= \mathbb{E}_{q(\varphi)}\left[\mathbb{E}_{q(\theta \mid \varphi, x)}\left[\log p(x \mid \theta) + \log \frac{p(\theta \mid \varphi)}{q(\theta \mid \varphi, x)}\right] + \log \frac{p(\varphi)}{q(\varphi)}\right] \quad (13)

= \mathbb{E}_{q(\varphi)}\left[\mathbb{E}_{q(\theta \mid \varphi, x)}[\log p(x \mid \theta)] - \mathrm{KL}[q(\theta \mid \varphi, x) \,\|\, p(\theta \mid \varphi)]\right] - \mathrm{KL}[q(\varphi) \,\|\, p(\varphi)] \quad (14)

4.2 DERIVATION OF PREDICTIONS WITH POSTERIOR SHARPENING

Predictions are made by means of Bayesian model averaging over the approximate posterior. Without posterior sharpening, evaluating E_{q(θ)}[log p(x̂ | θ)] provides the necessary predictions. For posterior sharpening, a bound on a Bayesian model average over the approximate posterior of φ is derived:

\mathbb{E}_{q(\varphi)}[\log p(\hat{x} \mid \varphi)] = \mathbb{E}_{q(\varphi)}\left[\log \int p(\hat{x} \mid \theta)\, p(\theta \mid \varphi)\, d\theta\right] \quad (15)

\geq \mathbb{E}_{q(\varphi)}\left[\mathbb{E}_{q(\theta \mid \varphi, x)}\left[\log \frac{p(\hat{x} \mid \theta)\, p(\theta \mid \varphi)}{q(\theta \mid \varphi, x)}\right]\right] \quad (16)

= \mathbb{E}_{q(\varphi)}\left[\mathbb{E}_{q(\theta \mid \varphi, x)}[\log p(\hat{x} \mid \theta)] - \mathrm{KL}[q(\theta \mid \varphi, x) \,\|\, p(\theta \mid \varphi)]\right] \quad (17)

5. RESULTS AND DISCUSSION

The experiment that took place did not return results. A likely reason is the large amount of data fed into the model. The dataset used is the yoochoose dataset from the RecSys 2015 challenge, which was processed to fit the model input. A successful implementation of this model could be achieved with the Penn Treebank (PTB) dataset from the Linguistic Data Consortium, which contains substantially less data (yoochoose: approx. 40,000,000 words, compared to PTB: 10,000 words).

Another reason could be poorly chosen minibatch and step sizes, which points to a further issue worth taking into account: insufficient optimization of the code to run smoothly without causing a memory overload, and to terminate properly.

What this experiment did show is that the proposed model lacks code optimization with respect to memory: although the minibatch length and other variables are free and set automatically, this can result in poor memory usage, a problem that could be solved by better algorithms for computing the free variables.

It is noted that the posterior sharpening procedure explained above has similarities with other techniques. At the same time, it introduces a new way of training such models whose full capabilities have not yet been demonstrated, given the limited number of further studies. Probabilistic interpretations have been given to line search in e.g. Mahsereci & Hennig (2015), but the model proposed in this paper is the first to use a variational posterior with the reparameterization/perturbation-analysis gradient. The probabilistic treatment of line search can also be interpreted as a trust region method.

Dynamic evaluation (Mikolov et al., 2010), which trains an RNN during evaluation of the model with a fixed learning rate, is another related technique. There, the adjustment is cumulative and uses only previously seen data, so a purely deterministic approach can be taken, ignoring any KL between a posterior with privileged information and a prior.

Finally, learning to optimize (or learning to learn) (Li & Malik, 2016; Andrychowicz et al., 2016) is related in that a learning rate is learned to produce better updates than those provided by e.g. AdaGrad (Duchi et al., 2011) or Adam (Kingma & Ba, 2014). Whereas those works train a parametric model, the model in this paper treats the parameters as free, so that they adapt more quickly to a non-stationary distribution. Gradient information is used to inform a variational posterior so as to reduce the variance of Bayesian Neural Networks, which sets this model apart from the rest.

Applying Bayesian methods to neural networks has a long history, and the most common approximations have already been tried. Buntine & Weigend (1991) proposed various maximum a posteriori schemes for neural networks, including an approximate posterior centered at the mode; they also suggested using second-order derivatives in the prior to encourage smoothness of the resulting network. Hinton & Van Camp (1993) proposed several uses of variational methods for compressing the weights of neural networks as a regulariser. Hochreiter et al. (1995) suggested an MDL loss for single-layer networks that penalizes non-robust weights by means of an approximate penalty based upon perturbations of the weights on the outputs. Denker & LeCun (1991) and MacKay (1995) investigated using the Laplace approximation for capturing the posterior of neural networks. Neal (2012) investigated the use of hybrid Monte Carlo for training neural networks, although it has so far been difficult to apply to the large sizes of networks.

In more recent work, Graves (2011) derived a variational inference scheme for neural networks, and Blundell et al. (2015) extended this with an update for the variance that is unbiased and simpler to compute. Graves (2016) derives a similar algorithm in the case of a mixture posterior. Several authors have claimed that dropout (Srivastava et al., 2014) and Gaussian dropout (Wang & Manning, 2013) can be viewed as approximate variational inference schemes (Gal & Ghahramani, 2015; Kingma et al., 2015, respectively).

Approximate Bayesian recurrent neural networks have been investigated by only a few papers. Mirikitani & Nikolaev (2010) proposed a second-order, online training scheme for recurrent neural networks, while Chien & Ku (2016) only capture a single point estimate of the weight distribution. Gal & Ghahramani (2016) highlighted Monte Carlo dropout for LSTMs, whilst Graves (2011) proposed a variational scheme with biased gradients for the variance parameter using the Fisher matrix. The model proposed in this report extends this by using an unbiased gradient estimator without the need to approximate the Fisher matrix, and also adds a novel posterior approximation. Several papers explore applying expectation propagation to neural networks: Soudry et al. (2014) derive a closed-form approximate online expectation propagation algorithm, whereas Hernandez-Lobato & Adams (2015) proposed using multiple passes of assumed density filtering, attaining good performance on a number of small data sets. Hasenclever et al. (2015) propose a distributed expectation propagation scheme with SGLD (Welling & Teh, 2011) as an inner loop. Others have also considered applying SGLD to neural networks (Li et al., 2015), and Gan et al. (2016) more recently used SGLD for LSTMs.

