PhD Literature Review Report (1)
Yannan (Summary of Methods to Optimize DNNs)

1. Machine Learning and Related Deep Learning

As the subject of my PhD concerns deep-learning-based neuromorphic systems with applications, the category of deep learning algorithm has to be selected carefully depending on the type of real-world problem as well as on the neuromorphic hardware. When we set up a neural network (NN), the most important aspects of its performance are training speed, training set accuracy and validation set accuracy, and we are usually concerned with preventing the results from overfitting. The recent optimization methods found in the literature and in online tutorials can be summarised as follows:

1. L1/L2 regularization

Define the cost function we are trying to minimise as

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} F\big(Y_{out}^{(i)}, Y^{(i)}\big)$$

L2 regularization adds the squared Euclidean norm of the weight vector $w$ as a penalty, omitting the low-variance bias parameter $b$, in order to reduce the effect of high variance:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} F\big(Y_{out}^{(i)}, Y^{(i)}\big) + \lambda\,\|w\|_2^2, \qquad \|w\|_2^2 = \sum_{i=1}^{n} w_i^2 = w^{T}w$$

L1 regularization drives more parameters to zero and therefore makes the model sparse:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} F\big(Y_{out}^{(i)}, Y^{(i)}\big) + \lambda\,\|w\|_1, \qquad \|w\|_1 = \sum_{i=1}^{n} |w_i|$$

2. Dropout

Dropout is widely used to stop a deep NN from overfitting: a manually set keep-probability is used to randomly eliminate neurons from a layer during training. It is usually implemented by multiplying the previous layer's output by a matrix of the same shape containing ones and zeros. Dropout shrinks the weights and therefore performs some of the regularization that helps prevent overfitting, which makes it similar to L2 regularization. However, dropout can be shown to be an adaptive form of regularization without an explicit penalty term, whereas the effective L2 penalty differs from weight to weight depending on the size of the activations being multiplied into them. (A short code sketch of the L2 penalty and of the dropout mask is given after Section 5 below.)

Figure 1: Dropout sample with (a) before dropout and (b) after dropout.

3. Data augmentation

This method is useful when the data set is small but each sample contains many features, such as colour images. Flipping, rotating, zooming and adding small distortions to the images can help generate additional training data from the original set.

Figure 2: Horizontally flipped images.
Figure 3: Rotated and zoomed image.

4. Early stopping

As shown in Figure 4, the test set accuracy does not always increase together with the training set accuracy, and a local minimum can be found before the total number of iterations has been completed. Early stopping usually improves the validation set accuracy at the cost of some training set accuracy, and at the same time prevents the network training from overfitting.

Figure 4: Early stopping description.

5. Normalizing the input

Normalizing the input usually speeds up training and increases the performance of a neural network. The usual steps are to subtract the mean and to normalize the variance, so that every feature of the training set spans the same range. The learning rate then does not need to be adaptive or changed at every gradient descent step, and the normalization helps the gradient descent (GD) algorithm find the optimal parameters more accurately and more quickly.

Figure 5: Left: after data normalization; right: before normalization.
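To make the L2 penalty of Section 1 and the dropout mask of Section 2 concrete, below is a minimal NumPy sketch. It assumes a cross-entropy loss for $F$ and a single weight matrix W for illustration; the names l2_regularized_cost, apply_inverted_dropout and keep_prob are hypothetical and not taken from any particular library, and the division by keep_prob follows the common "inverted dropout" variant so that no rescaling is needed at test time.

```python
import numpy as np

def l2_regularized_cost(Y_out, Y, W, lam):
    """Cross-entropy cost F plus an L2 penalty on the weights W (Section 1)."""
    m = Y.shape[1]                                     # number of training examples
    cross_entropy = -np.sum(Y * np.log(Y_out + 1e-12)) / m
    l2_penalty = lam * np.sum(W ** 2)                  # lambda * ||W||_2^2
    return cross_entropy + l2_penalty

def apply_inverted_dropout(A, keep_prob, rng):
    """Randomly zero units of the activation A with a mask of ones and zeros (Section 2)."""
    mask = (rng.random(A.shape) < keep_prob).astype(A.dtype)
    return (A * mask) / keep_prob                      # keep expected activation unchanged

# Example usage with random data
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))                        # 4 hidden units, 5 examples
A_dropped = apply_inverted_dropout(A, keep_prob=0.8, rng=rng)
```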
6. Weight initialization for vanishing/exploding gradients

When training a very deep neural network, the derivatives can sometimes become either very large or very small. A very deep network with the bias values ignored can be considered as a stacked multiplication of the weight matrices of each layer:

$$Y = W_{n}\, W_{n-1}\, W_{n-2} \cdots W_{3}\, W_{2}\, W_{1}\, X$$

so weight values even slightly greater than 1 or less than 1 can make the product grow to a huge value or shrink to a tiny one after many layers. Scaling the initialised weights by the square root of a variance term reduces the vanishing and exploding problem, and the variance is activation-function dependent:

tanh (Xavier initialization): scale by $\sqrt{1/n^{[l-1]}}$
ReLU (He initialization): scale by $\sqrt{2/n^{[l-1]}}$

where $n^{[l-1]}$ is the number of units in the previous layer.

7. Mini-batch gradient descent

When the training set becomes very large, traditional stochastic gradient descent results in a very slow training process, because gradient descent is carried out on individual inputs one at a time. Mini-batch gradient descent splits the whole training set into several batches of an assigned batch size (for 10,000 inputs with a batch size of 100, the number of batches is 100), stacks the inputs within each batch into a matrix/vector, and trains on all of the data in the batch together. If the batch size is set to 1, this is exactly stochastic gradient descent, operating on every single input rather than on a group of inputs. One epoch means that all of the batches have been passed through the NN once (one iteration corresponds to a single mini-batch). Typical mini-batch sizes are 64, 128, 256, 512 or 1024, usually a power of 2, for large training data sets.

Figure 6: Mini-batch with 10 batches.

8. Momentum

In every iteration, momentum computes $dW$ and $db$ on the current mini-batch, and then computes

$$V_{dW} = \beta V_{dW} + (1-\beta)\, dW, \qquad V_{db} = \beta V_{db} + (1-\beta)\, db$$

The weights and biases are then updated by

$$W = W - \alpha V_{dW}, \qquad b = b - \alpha V_{db}$$

Momentum can be understood as applying exponentially weighted averages (EWA) to gradient descent, so that the update direction is an average over the previous gradients, controlled by the defined parameter $\beta$ (the momentum factor). The usual choice of $\beta$ is 0.9, which corresponds to averaging over roughly the last $\frac{1}{1-\beta} = 10$ gradients and gives the most suitable updates.

9. RMSprop

RMSprop also computes $dW$ and $db$ on the current mini-batch in every iteration, and then computes

$$S_{dW} = \beta S_{dW} + (1-\beta)\, dW^{2}, \qquad S_{db} = \beta S_{db} + (1-\beta)\, db^{2}$$

RMSprop then updates the parameters as follows:

$$W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}, \qquad b = b - \alpha \frac{db}{\sqrt{S_{db}}}$$

RMSprop can speed up learning by scaling the steps taken for the weights and the biases differently: where one of them needs a large step and the other a small one, dividing by $\sqrt{S}$ damps the oscillating direction and makes GD converge more quickly.

10. Adam

Adam is essentially the combination of momentum and RMSprop. It computes $dW$ and $db$ on the current mini-batch and then computes the same quantities as momentum and RMSprop:

$$V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW, \qquad V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db$$
$$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^{2}, \qquad S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^{2}$$

with two different hyperparameters $\beta_1$ and $\beta_2$. On the $t$-th iteration Adam applies the EWA bias correction:

$$V_{dW}^{corr} = \frac{V_{dW}}{1-\beta_1^{t}}, \quad V_{db}^{corr} = \frac{V_{db}}{1-\beta_1^{t}}, \quad S_{dW}^{corr} = \frac{S_{dW}}{1-\beta_2^{t}}, \quad S_{db}^{corr} = \frac{S_{db}}{1-\beta_2^{t}}$$

$W$ and $b$ are then updated as

$$W = W - \alpha \frac{V_{dW}^{corr}}{\sqrt{S_{dW}^{corr}} + \varepsilon}, \qquad b = b - \alpha \frac{V_{db}^{corr}}{\sqrt{S_{db}^{corr}} + \varepsilon}$$

The general hyperparameter choices for Adam are: the learning rate $\alpha$ needs to be tuned; $\beta_1 = 0.9$; $\beta_2 = 0.999$; $\varepsilon$ does not really affect performance and is set to $10^{-8}$. (A short code sketch of this update appears after Section 11 below.)

11. Learning rate decay

A fixed learning rate usually results in a noisy learning process and may never reach the optimal point. A learning rate decay algorithm reduces the learning rate along with the iterations so that the NN can finally end up at a relatively accurate optimal result. This can be implemented per epoch as

$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\, \alpha_0$$

or alternatively

$$\alpha = \text{decay\_rate}^{\,\text{epoch\_num}} \cdot \alpha_0 \quad \text{(exponential decay)}$$
$$\alpha = \frac{\alpha_0}{\sqrt{\text{epoch\_num}}}, \quad \text{or a discrete staircase schedule}$$
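As a concrete illustration of the Adam update from Section 10 combined with the per-epoch decay from Section 11, below is a minimal NumPy sketch for a single parameter matrix W. The names adam_step and decayed_learning_rate and their arguments are hypothetical; they only follow the equations above and are not a particular library's API.

```python
import numpy as np

def adam_step(W, dW, V_dW, S_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter matrix W (Section 10).

    V_dW and S_dW are the EWAs of the gradient and of its square; t is the
    1-based iteration count used for the bias correction.
    """
    V_dW = beta1 * V_dW + (1 - beta1) * dW            # momentum term
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2       # RMSprop term
    V_corr = V_dW / (1 - beta1 ** t)                  # EWA bias correction
    S_corr = S_dW / (1 - beta2 ** t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)  # parameter update
    return W, V_dW, S_dW

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Per-epoch decay: alpha = alpha0 / (1 + decay_rate * epoch_num) (Section 11)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example usage on dummy gradients
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
V_dW = np.zeros_like(W)
S_dW = np.zeros_like(W)
for t in range(1, 6):                                 # five illustrative iterations
    dW = rng.standard_normal(W.shape)                 # stand-in for a real gradient
    alpha = decayed_learning_rate(0.01, decay_rate=1.0, epoch_num=t)
    W, V_dW, S_dW = adam_step(W, dW, V_dW, S_dW, t, alpha=alpha)
```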
12. Picking hyperparameters

A common pain point with DNNs is having to pick a sheer number of hyperparameters, which may include: the learning rate, the momentum factor $\beta$, the Adam factors $\beta_1$, $\beta_2$ and $\varepsilon$, the number of layers, the number of hidden units, the learning rate decay rate, the batch size, and so on. The range of each hyperparameter can be determined depending on the problem to be solved; the usual way is to randomly sample within a reasonable scale, test a few of the sampled values, and then narrow the range or change the sampling scale to refine the choice.

13. Batch normalization

Similar to input normalization, a good distribution of the data saves computation and makes the algorithm work faster. Batch normalization normalizes the outputs of the previous hidden layer (i.e. the inputs of a hidden layer) so that the computation within that layer becomes faster. It is also implemented by extracting the mean and variance of the computed data and normalizing as

$$Z^{(i)}_{norm} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

For the hidden units an alternative mean and variance can then be given by

$$\tilde{Z}^{(i)} = \gamma Z^{(i)}_{norm} + \beta$$

where $\gamma$ and $\beta$ are parameters learnable by the model; if $\gamma = \sqrt{\sigma^2 + \varepsilon}$ and $\beta = \mu$, then $\tilde{Z}^{(i)} = Z^{(i)}$.

Implementing batch normalization is as simple as adding a layer, the BN layer, with the additional parameters $\gamma$ and $\beta$ for each unit; these can also be updated by an optimizer such as SGD, RMSprop, etc. One thing to note is that subtracting the mean actually eliminates the bias term, which means the parameter $b$ can be removed from the layer in front of the BN layer. The mean and variance are usually estimated with an EWA across the mini-batches of the training set, and those running estimates are then used on the test set.
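To illustrate the batch normalization formulas above, here is a minimal NumPy sketch of the forward pass over one mini-batch, assuming $Z$ has shape (units, batch_size). The function name batchnorm_forward and the way the batch statistics are returned for a running EWA are simplifying assumptions, not a specific framework's implementation.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize pre-activations Z of shape (units, batch_size), then scale and shift (Section 13)."""
    mu = Z.mean(axis=1, keepdims=True)                # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)                # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)            # Z_norm = (Z - mu) / sqrt(sigma^2 + eps)
    Z_tilde = gamma * Z_norm + beta                   # learnable scale gamma and shift beta
    return Z_tilde, mu, var

# Example usage; the returned mu and var would feed a running EWA for use at test time
rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 32))                      # 4 hidden units, mini-batch of 32
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, mu, var = batchnorm_forward(Z, gamma, beta)
```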