Artificial Intelligence and Machine Learning for EDGE Computing

About this ebook

Artificial Intelligence and Machine Learning for Predictive and Analytical Rendering in Edge Computing focuses on the role of AI and machine learning as they impact and work alongside Edge Computing. Sections cover the growing number of devices and applications in diversified domains of industry, including gaming, speech recognition, medical diagnostics, robotics and computer vision, and how they are being driven by Big Data, Artificial Intelligence, Machine Learning and distributed computing, be it Cloud Computing or the evolving Fog and Edge Computing paradigms.

Challenges covered include remote storage and computing; bandwidth overload due to transporting data from end nodes to the Cloud, leading to latency issues; security issues in transporting sensitive medical and financial information across large gaps between the points of data generation and computing; and the design features of Edge nodes to store and run AI/ML algorithms for effective rendering.

  • Provides a reference handbook on the evolution of distributed systems, including Cloud, Fog and Edge Computing
  • Integrates the various Artificial Intelligence and Machine Learning techniques for effective predictions at Edge rather than Cloud or remote Data Centers
  • Provides insight into the features and constraints in Edge Computing and storage, including hardware constraints and the technological/architectural developments that will overcome those constraints
Language: English
Release date: Apr 26, 2022
ISBN: 9780128240557


    Artificial Intelligence and Machine Learning for EDGE Computing - Rajiv Pandey

    Part I

    AI and machine learning

    Chapter 1: Supervised learning

    Kanishka Tyagi (a); Chinmay Rane (b); Michael Manry (c)

    (a) Aptiv Advanced Research Center, Agoura Hills, CA, United States

    (b) Quantiphi, Inc., Marlborough, MA, United States

    (c) Department of Electrical Engineering, The University of Texas at Arlington, Arlington, TX, United States

    Abstract

    Machine learning models learn different tasks with different paradigms that effectively aim to improve the models through training. Supervised learning is a common machine learning training paradigm that has been used successfully in real-world applications. Typical supervised learning involves two phases. In phase 1 (commonly called training), we present inputs together with the desired outputs (also known as ground truth) and train a model with respect to a metric using an optimization algorithm. In phase 2 (commonly called testing), we deploy the model on unseen data and expect it to either classify the inputs or approximate the outputs. Although supervised learning is covered in almost all machine learning textbooks, we introduce and explain supervised learning from an application point of view and relate it to edge computing. All the algorithms are explained from a mathematical and theoretical point of view as well as from a programmer's perspective. This will help readers gain hands-on experience implementing the algorithms for a variety of problems.

    Keywords

    Linear regression; Logistic regression; Steepest descent; Conjugate gradient; Multilayer perceptron; Second-order algorithms; KL divergence; Generalized linear models; Kernel machines; Bootstrapping

    1: Introduction

    This chapter discusses various supervised learning paradigms that are used to train and deploy machine learning models. While the theory of supervised learning is dominated mainly by the computer vision and natural language processing communities, much research is required in its application to edge computing. With recent advances in machine learning algorithms, combined with increasing computational power, edge computing is another application area that machine learning will improve. According to a white paper from Cisco [1], 50 billion internet of things (IoT) devices will be connected to the internet by the end of 2020, and even though nearly 850 ZB of data will be generated per year outside the cloud by 2021, global data center traffic will be only approximately 21 ZB. This means that a transformation from big cloud data centers to a wide range of edge devices is happening.

    We plan to cover the traditional supervised learning algorithm basics and the tricks of the trade used when implementing them for various real-life applications. The assumptions and programming decisions made while implementing them are also discussed. We start with generalized linear models, along with the optimization algorithms (gradient descent, Newton's method) and metrics (mean square error [MSE], cross-entropy) used in training the models. Then, naïve Bayes and kernel methods are discussed, along with the decision tree and its variants. We provide pseudocode for all the supervised learning algorithms that we discuss, to give readers an idea of how the algorithms move from theory to practice. In addition, supervised learning algorithms that capture the variation of the input data distribution, algorithms that project data into higher-dimensional subspaces, and first- and second-order learning paradigms are the subject matter of this chapter. The chapter covers these learning paradigms in sync with modern-day edge computing applications from an implementation perspective.

    Edge computing and machine learning will enable the devices to have a more pervasive and fine-grained intelligence. Supervised learning is a common way of training machine learning algorithms that can be later used in the inference stage on IoT and other edge devices.

    This chapter covers several supervised learning algorithms that are commonly used in various real-life applications. Section 2 covers the basic perceptron model, which helps in understanding the learning algorithm. In Section 3, the linear regression algorithm is discussed, which extends the idea of the perceptron algorithm to solve supervised regression tasks. We discuss commonly used gradient-based learning algorithms along with their pseudocode. Section 4 discusses logistic regression, a very commonly used classifier due to its structural simplicity and generalizability. In Section 5, we bring together the concepts from the previous sections to discuss the multilayer perceptron (MLP) network, a widely used neural network in real-life applications. We explain its structure and initialization strategy, and detail various learning algorithms that are used to train an MLP. Section 6 discusses the Kullback-Leibler (KL) divergence, a measure used to calculate how different two probability distributions are from each other. Section 7 discusses generalized linear models, which extend linear models with a nonlinear activation on the output nodes. Section 8 covers kernel methods, leading to Section 9, which explains nonlinear support vector machine (SVM) classifiers. We conclude this chapter in Section 10 by discussing various tree ensemble algorithms.

    2: Perceptron

    Perceptron algorithms arrived in the early 1960s; however, Minsky and Papert [2] showed the restrictions they have. Rosenblatt [3] described an alpha perceptron that is an example of a statistical pattern recognition (SPR) system. In a typical SPR system, the features are obtained from the raw input using no learning mechanism but instead using common-sense rules. Therefore, a human decides what a good feature is, sees if it works, and, if it does not work, tries another. We learn the weight associated with each feature activation to get a single scalar quantity, and we then decide based on whether this quantity is above or below a threshold. In a typical perceptron model, the input vector x is used to compute a weighted sum over all the neurons, to which a bias vector is added. This bias vector is also known as the threshold vector in the literature.

    s = Wx + b    (1)

    The linear perceptron will output y depending on the following rule as shown in Fig. 1.

    y = 1 if s ≥ 0; otherwise, y = 0    (2)

    The perceptron algorithm is fast and straightforward, and if the dataset is linearly separable, it is guaranteed to converge.
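    To make the update rule concrete, the following minimal NumPy sketch (our own illustration; the function and variable names are not from the text) trains a perceptron with a step activation on a linearly separable dataset:

    import numpy as np

    def train_perceptron(X, t, n_epochs=100):
        # Augment each input with a leading 1 to absorb the threshold (bias).
        Xa = np.hstack([np.ones((X.shape[0], 1)), X])
        w = np.zeros(Xa.shape[1])
        for _ in range(n_epochs):
            errors = 0
            for xp, tp in zip(Xa, t):
                y = 1.0 if xp @ w >= 0.0 else 0.0  # step rule of Eq. (2)
                if y != tp:
                    w += (tp - y) * xp             # classical perceptron update
                    errors += 1
            if errors == 0:  # converged; guaranteed for linearly separable data
                break
        return w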

    Fig. 1 Perceptron activation functions.

    3: Linear regression

    In this section, we discuss the structure and notation of linear regression. As shown in Fig. 2, a linear regression model is a weight matrix W that transforms an input vector x into a discriminant vector y [4]. The weight w(m, n) connects the nth input to the mth output. The training dataset (xp, tp) consists of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training patterns. The threshold is handled by augmenting x with an extra element equal to 1, so that xa = [1 : x^T]^T. Thus xa contains Nu basis functions, where Nu = N + 1. For the pth training pattern, the network output vector yp can be written as

    yp = W xap    (3)

    where xap denotes xa for the pth pattern.

    Fig. 2 Linear regression.

    3.1: Training a linear regression

    To train the linear regression, we minimize an error function E that is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in a least-squares sense, where the MSE between the desired and actual outputs is

    E = (1/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [tp(m) − yp(m)]²    (4)

    Here, tp(m) is the target, that is, the correct output, and M denotes the total number of outputs. We minimize the error function from Eq. (4) with respect to W by solving the M sets of N + 1 linear equations given by

    R W^T = C    (5)

    where the cross-correlation matrix C and the auto-correlation matrix R are, respectively,

    C = (1/Nv) Σ_{p=1}^{Nv} xap tp^T    (6)

    R = (1/Nv) Σ_{p=1}^{Nv} xap xap^T    (7)

    Since R in Eq. (5) is often ill-conditioned, it is unsafe to use Gauss-Jordan elimination, and Eq. (5) is instead solved using orthogonal least squares (OLS) [6]. In Ref. [7], OLS is used to solve for radial basis function network parameters. OLS is useful in practical applications for two primary reasons. First, training is fast, since solving linear equations is straightforward. Second, it helps us avoid some local minima [8]. In terms of optimization theory, solving Eq. (5) for W is merely Newton's algorithm for the output weights [9].

    Given

    yp = W xap    (8)

    and the error function

    E = (1/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [tp(m) − yp(m)]²    (9)

    expanding Eq. (9) gives

    E = (1/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} tp(m)² − 2 Σ_{m=1}^{M} wm^T cm + Σ_{m=1}^{M} wm^T R wm    (10)

    where wm denotes the mth row of W and cm denotes the mth column of C, so that for each output m,

    E(wm) = (1/Nv) Σ_{p=1}^{Nv} tp(m)² − 2 wm^T cm + wm^T R wm    (11)

    Taking the partial derivative of the error with respect to the weight vector wm,

    ∂E/∂wm = −2 cm + 2 R wm    (12)

    Setting Eq. (12) equal to 0 leads to the following normal equations:

    R wm = cm,  i.e.,  wm = R⁻¹ cm    (13)

    Since there is a matrix inversion in Eq. (13), the computational cost is roughly O(n³) using the Gauss-Jordan elimination method. Therefore, normal equations are used only for up to a few thousand patterns; for more complex or higher-order equations, gradient descent-based algorithms are applied.
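    As a sketch of how the normal equations translate to code, the snippet below (our own illustration, with hypothetical names) forms R and C from the augmented training data and solves for the weights with a least-squares solver rather than an explicit inverse, since R may be ill-conditioned:

    import numpy as np

    def linreg_normal_equations(X, T):
        # Augment with a leading 1 so the threshold is part of W (xa = [1 : x^T]^T).
        Xa = np.hstack([np.ones((X.shape[0], 1)), X])
        Nv = Xa.shape[0]
        R = (Xa.T @ Xa) / Nv   # auto-correlation matrix, Eq. (7)
        C = (Xa.T @ T) / Nv    # cross-correlation matrix, Eq. (6)
        # Solve R W^T = C; lstsq stands in for the OLS solver used in the text.
        Wt, *_ = np.linalg.lstsq(R, C, rcond=None)
        return Wt.T            # M x (N + 1) weight matrix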

    For gradient descent on this quadratic error, with g = −∂E/∂w denoting the negative gradient from Eq. (12), expanding E(w + B2 g) as a Taylor series in the learning factor B2 gives

    E(w + B2 g) = E(w) − B2 g^T g + B2² g^T R g    (14)

    and minimizing Eq. (14) with respect to B2 yields the optimal learning factor

    B2 = (g^T g) / (2 g^T R g)    (15)

    3.2: Steepest descent

    The steepest descent gradient algorithm can be summarized in Algorithm 1.

    Algorithm 1 Steepest descent gradient algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Calculate g from Eq. (12)
    4:     Compute B2 from Eq. (15)
    5:     Update w as w ← w + B2 ⋅ g
    6:     it ← it + 1
    7: end while
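    A direct translation of Algorithm 1 for a single-output linear model might look as follows (a minimal sketch under our reconstruction of Eqs. (12) and (15); names are our own):

    import numpy as np

    def steepest_descent(Xa, t, n_it=100):
        # Rows of Xa are augmented input vectors; t holds the desired outputs.
        Nv = Xa.shape[0]
        R = (Xa.T @ Xa) / Nv
        c = (Xa.T @ t) / Nv
        w = np.zeros(Xa.shape[1])
        for _ in range(n_it):
            g = 2.0 * (c - R @ w)                       # negative gradient, Eq. (12)
            B2 = (g @ g) / (2.0 * (g @ R @ g) + 1e-12)  # optimal learning factor, Eq. (15)
            w = w + B2 * g                              # step 5 of Algorithm 1
        return w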

    Fig. 3 illustrates gradient descent using a two-dimensional (2D) contour plot. The X-axis denotes the first weight w1 and the Y-axis denotes the second weight w2. The arrows [g0, g1, g2, g3, g4, g5] in Fig. 3 show the directions of the negative gradients that the weights follow to reach a minimum. The learning factor derived in Eq. (15) controls the step size. We can observe that the gradients get smaller as they approach the minimum.

    Fig. 3 Steepest descent 2D contour plot.

    From a programmer's perspective, some of the debugging checks are:

    1. E is nonincreasing.

    2. The gradient g approaches 0.

    3. B2 ≥ 0.

    4. si20_e

    3.3: Conjugate gradient

    As we saw in the previous section, the weights are updated in the negative gradient direction in the basic gradient algorithm. Although the error function decreases most rapidly along the negative gradient direction, this does not necessarily produce fast convergence. The conjugate gradient (CG) algorithm [10] performs a line search in the conjugate direction and has faster convergence than the backpropagation (BP) algorithm. Although CG is a general unconstrained optimization technique, its use in efficiently training an MLP is well documented in Ref. [11].

    To train an MLP using the CG algorithm, we use a direction vector p that is obtained from the gradient g as

    p ← g + B1 p    (16)

    Here, p = vec(P, Poh, Poi) and P, Poi, and Poh are the direction vectors. B1 is the ratio of the gradient energy from two consecutive iterations. This direction vector, in turn, updates all the weights simultaneously as follows:

    w ← w + z p    (17)

    The CG algorithm has many attractive qualities, such as the following:

    1. For a quadratic error function, it converges in at most as many iterations as there are unknowns.

    2. It performs better than steepest descent, and it can be applied to nonquadratic error functions.

    3. Since there is no Hessian involved, we do not invert any matrix, and the computational cost per iteration is O(w), where w is the size of the weight vector.

    The CG algorithm can be summarized in Algorithm 2.

    Algorithm 2 Conjugate gradient algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Calculate p from g
    4:     Compute z from Eq. (15)
    5:     Update w as w ← w + z ⋅ p
    6:     it ← it + 1
    7: end while
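    The following sketch implements Algorithm 2 for the same quadratic MSE (our own illustration; B1 is the gradient-energy ratio described above, and the line search uses the quadratic form of E):

    import numpy as np

    def conjugate_gradient(Xa, t, n_it=50):
        Nv = Xa.shape[0]
        R = (Xa.T @ Xa) / Nv
        c = (Xa.T @ t) / Nv
        w = np.zeros(Xa.shape[1])
        p = np.zeros_like(w)
        g_old = None
        for _ in range(n_it):
            g = 2.0 * (c - R @ w)                      # negative gradient
            B1 = 0.0 if g_old is None else (g @ g) / (g_old @ g_old)
            p = g + B1 * p                             # conjugate direction, Eq. (16)
            z = (p @ g) / (2.0 * (p @ R @ p) + 1e-12)  # exact line search for quadratic E
            w = w + z * p                              # Eq. (17)
            g_old = g
        return w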

    Fig. 4 illustrates CG using a 2D contour plot. The X-axis denotes the first weight w1 and the Y-axis denotes the second weight w2. From the plot, we can observe that CG reaches the minimum in N + 1 steps.

    Fig. 4 Conjugate gradient 2D contour plot.

    From a programmer’s perspective, some of the debugging steps are as follows:

    1. E is nonincreasing if E is quadratic.

    2. Forcing B1 = 0 should result in the steepest descent algorithm.

    3. B2 ≥ 0.

    4. Calculate the inner product IP = g^T p for every iteration.

    4: Logistic regression

    The name logistic regression is a misnomer, since it is still a classification algorithm. Consider a general two-class classification problem, with the posterior probability of one class being

    P(C1|x) = p(x|C1)P(C1) / [p(x|C1)P(C1) + p(x|C2)P(C2)]    (18)

    P(C1|x) = 1 / (1 + e^(−a)) = σ(a)    (19)

    a = ln[p(x|C1)P(C1) / (p(x|C2)P(C2))]    (20)

    Since it is a two-class classification, the classes C1 and C2 can be denoted as y ∈ {0, 1}. Linear regression does not normally work for classification since, even when the targets are 0 or 1, the linear output w^T x can be greater than 1 or less than 0. Therefore, we use logistic regression, in which the output lies in [0, 1]. It should be emphasized that this is still classification rather than regression. Here, σ() is a logistic function defined as

    σ(w^T x) = 1 / (1 + e^(−w^T x))    (21)

    When w^T x tends to ∞, σ(w^T x) tends to 1, and when w^T x tends to −∞, σ(w^T x) tends to 0.

    The decision boundary for logistic regression is a property of the parameters and not of the training set. So, if there is a nonlinear decision boundary, it is mapped according to the nonlinear activation function used in logistic regression.

    Since σ(w^T x) is nonlinear (a sigmoid), learning the decision boundary can be viewed as manifold learning of nonlinear surfaces. In order to train the logistic regression, the cost function J(w) should preferably be convex so that only one global minimum is present. We do not use the MSE as in linear regression, since with the sigmoid it becomes a nonconvex function with multiple local minima. However, there are ways to train logistic regression with the MSE [12, 13].

    The logistic regression cost function will be as follows:

    J(w) = −(1/Nv) Σ_{p=1}^{Nv} [tp log(yp) + (1 − tp) log(1 − yp)]    (22)

    Eq. (22) comes from maximum-likelihood estimation (MLE). To fit the parameter vector w, we minimize J(w). Once training is completed, the prediction for a new input x is given as

    y = σ(w^T x)    (23)

    In order to obtain g, we apply the chain rule, with a = w^T x and y = σ(a):

    ∂J/∂w = (∂J/∂y)(∂y/∂a)(∂a/∂w)    (24)

    ∂J/∂y = −[t/y − (1 − t)/(1 − y)]    (25)

    ∂y/∂a = y(1 − y)    (26)

    ∂a/∂w = x    (27)

    Therefore,

    g = −∂J/∂w = (1/Nv) Σ_{p=1}^{Nv} (tp − yp) xp    (28)

    In order to train, we minimize J(w) using the gradient descent algorithm in Algorithm 3. Algorithm 3 is a fundamental building block for more advanced algorithms like CG, BFGS, and L-BFGS.

    Algorithm 3 Gradient descent algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Calculate g
    4:     Update w as w ← w + z ⋅ g
    5:     it ← it + 1
    6: end while
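    A minimal NumPy version of Algorithm 3 for logistic regression is sketched below (our own illustration; we use a fixed learning factor z, whereas more advanced variants derive it adaptively):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def train_logistic(X, t, z=0.5, n_it=1000):
        # Gradient descent on the cross-entropy cost J(w) of Eq. (22).
        Xa = np.hstack([np.ones((X.shape[0], 1)), X])
        Nv = Xa.shape[0]
        w = np.zeros(Xa.shape[1])
        for _ in range(n_it):
            y = sigmoid(Xa @ w)
            g = (Xa.T @ (t - y)) / Nv  # negative gradient of J, cf. Eq. (28)
            w = w + z * g              # step 4 of Algorithm 3
        return w

    def predict(X, w):
        Xa = np.hstack([np.ones((X.shape[0], 1)), X])
        return (sigmoid(Xa @ w) >= 0.5).astype(int)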

    4.1: Softmax classifier

    The softmax classifier [5] is a generalized logistic regression classifier that outputs approximate class probabilities. Structurally, it is a linear model with a softmax function [14] at the output units. For the pth pattern, it maps the input vector xp to the output class probabilities as

    yp(m) = exp(wm^T xp) / Σ_{j=1}^{M} exp(wj^T xp)    (29)

    where W is a weight matrix. The performance measure is a cross-entropy loss function [5] as

    E = −(1/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} tp(m) log yp(m)    (30)

    The softmax classifier is often trained using the L-BFGS training algorithm [15].
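    A compact sketch of the softmax mapping and its cross-entropy loss follows (our own illustration; T holds one-hot desired outputs, and the rows of W are the per-class weight vectors wm):

    import numpy as np

    def softmax_outputs(X, W):
        # Rows of X are input vectors x_p; W is the M x N weight matrix of Eq. (29).
        A = X @ W.T
        A = A - A.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        expA = np.exp(A)
        return expA / expA.sum(axis=1, keepdims=True)

    def cross_entropy(Y, T):
        # Cross-entropy loss of Eq. (30), averaged over the N_v patterns.
        return -np.mean(np.sum(T * np.log(Y + 1e-12), axis=1))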

    5: Multilayer perceptron

    In this section, we describe the multilayer perceptron (MLP), a nonlinear signal processor with good approximation and classification properties. The MLP has basis functions that can adapt during the training process by utilizing example inputs and desired outputs. An MLP minimizes an error criterion and closely mimics an optimal processor, in which the computational burden of processing an input vector is controlled by slowly varying the number of coefficients [16, 17]. We review first- and second-order training algorithms for the MLP, followed by the design of an MLP classifier through regression.

    5.1: Structure and notation

    Fig. 5 illustrates a fully connected MLP with a single hidden layer. The input weights w(k, n) connect the nth input to the kth hidden unit. Output weights woh(m, k) connect the kth hidden unit's nonlinear activation Op(k) to the mth output yp(m), which has a linear activation. The bypass weights woi(m, n) connect the nth input to the mth output. The training data, described by the set of independent identically distributed input-output pairs (xp, tp), consist of N-dimensional input vectors xp and M-dimensional desired output vectors tp. The pattern number p varies from 1 to Nv, where Nv denotes the number of training vectors present in the dataset. Let Nh denote the number of hidden units. In order to handle the thresholds in the input layer, the input vector is augmented by an extra element xp(N + 1), where xp(N + 1) = 1. For each training pattern p, the hidden layer net function vector np can be written as

    np = W xp    (31)

    The kth element of the hidden unit activation vector Op is calculated as Op(k) = f(np(k)), where f(⋅) denotes the sigmoid activation function. The network output vector yp can be written as

    yp = Woi xp + Woh Op    (32)

    The expression for the actual outputs given in Eq. (32) can be rewritten as

    yp = Wo Xa    (33)

    where Xa = [xp^T : Op^T]^T is the augmented input column vector with Nu basis functions, where Nu = 1 + N + Nh. The total number of weights is Nw = Nh ⋅ (1 + N) + M ⋅ Nu. Similarly, Wo is the M by Nu augmented output weight matrix defined as Wo = [Woh : Woi].

    Fig. 5 Fully connected MLP.
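    The forward pass implied by Eqs. (31), (32) can be sketched as follows (our own illustration; xp is assumed to already carry its augmenting 1):

    import numpy as np

    def mlp_forward(xp, W, Woh, Woi):
        # W: Nh x (N + 1) input weights, Woh: M x Nh output weights,
        # Woi: M x (N + 1) bypass weights.
        net = W @ xp                     # hidden net function vector, Eq. (31)
        Op = 1.0 / (1.0 + np.exp(-net))  # sigmoid hidden activations
        yp = Woi @ xp + Woh @ Op         # linear output with bypass, Eq. (32)
        return yp, Op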

    To train an MLP, we recast the MLP learning problem as an optimization problem and use a structural risk minimization framework to design the learning algorithm [5, 18]. Essentially, this framework minimizes the error function E, as in Eq. (9), which is a surrogate for a nonsmooth classification error. As in Ref. [5], from a Bayesian point of view, we consider maximizing the likelihood function or minimizing the MSE in a least-squares sense. Therefore, the MSE between the desired and actual outputs is defined as

    E = (1/Nv) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [tp(m) − yp(m)]² + λ ‖w‖²    (34)

    Here, λ is an L2 regularization parameter used to avoid memorization and overfitting. The nonlinearity in yp causes the error E to be nonconvex, so in practice local minima of the error function may be found. In the earlier discussion, we assumed that tp has a Gaussian distribution given the input xp. If the conditional distribution of the targets given the input instead has a Bernoulli distribution, the error function, given by the negative log likelihood, is the cross-entropy error function [5].

    In Ref. [5], it is concluded that using a cross-entropy error function instead of the MSE for a classification problem leads to faster training as well as improved generalization. Apart from the cross-entropy and L2 error forms, we also have an L1 error measure. Golik et al. [19] and Simard et al. [20] provide a good comparison between L2 and cross-entropy and suggest using the cross-entropy error function for classification in order to obtain faster training and improved generalization. Our goal is to obtain optimal values of the weights of an MLP. To achieve this, we use the empirical risk minimization framework [17] to design the learning algorithms. An essential benefit of converting the training of an MLP into an optimization problem is that we can now use various optimization algorithms to optimize the learning of an MLP.

    5.2: Initialization

    5.2.1: Input means and standard deviations

    If some inputs have larger standard deviations than others, they can dominate training, even if they are relatively uninformative. Inputs are therefore normalized to zero mean and unit standard deviation.

    5.2.2: Randomizing the input weights

    Following Manry [16], the input weight matrix W is initialized randomly from a zero-mean Gaussian random number generator. The training of the input weights strongly depends on the gradients of the hidden units' activation functions with respect to the inputs. Training of the weights feeding a hidden unit will cease if that unit's activation function derivative is zero for all patterns. In order to remove the dominance of large-variance inputs, we divide the input weights by the inputs' standard deviations. We then adjust the means and standard deviations of all the hidden units' net functions. This is called net control, as in Ref. [21]. At this point, we have determined the initial input weights, and we are ready to initialize the output weights. To solve for the weights connected to the outputs of the network, we use a technique called output weight optimization (OWO) [22, 23]. OWO minimizes the error function from Eq. (9) with respect to Wo by solving the M sets of Nu equations given by

    R Wo^T = C    (35)

    Here, the cross-correlation matrix C and the auto-correlation matrix R are

    C = (1/Nv) Σ_{p=1}^{Nv} Xa tp^T,    R = (1/Nv) Σ_{p=1}^{Nv} Xa Xa^T    (36)

    In order to incorporate the regularization, we modify the R matrix elements, except the threshold term, as

    R ← R + λ diag(r)    (37)

    where r is a vector containing the diagonal elements of R and diag() is an operator that creates a diagonal matrix from the vector.
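    The OWO solve of Eq. (35), with the diagonal regularization of Eq. (37), can be sketched as below (our own illustration; we assume the constant threshold basis sits in the first column of the augmented basis matrix):

    import numpy as np

    def output_weight_optimization(Xa, T, lam=0.0):
        # Rows of Xa are augmented basis vectors [x_p^T : O_p^T]; rows of T are t_p.
        Nv = Xa.shape[0]
        R = (Xa.T @ Xa) / Nv
        C = (Xa.T @ T) / Nv
        if lam > 0.0:
            r = np.diag(R).copy()
            r[0] = 0.0              # leave the threshold element unmodified, Eq. (37)
            R = R + lam * np.diag(r)
        WoT, *_ = np.linalg.lstsq(R, C, rcond=None)  # solve R Wo^T = C, Eq. (35)
        return WoT.T                # M x Nu output weight matrix Wo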

    The MLP network is now initialized and ready to be trained with first- or second-order algorithms. Training an MLP can be seen as an unconstrained optimization problem that usually involves first-order gradient methods, such as BP and CG, or second-order methods, such as Levenberg-Marquardt (LM) and Newton's method, as the most popular learning algorithms. Training algorithms can be classified as

    1. One-stage algorithms, in which all the weights of the network are updated simultaneously.

    2. Two-stage algorithms, in which the input and output weights are trained alternately.

    Fig. 6 shows a flowchart that summarizes all the training algorithms that will be described in the subsequent sections.

    Fig. 6

    Fig. 6 Typical training algorithms for training an MLP.

    5.3: First-order learning algorithms

    The first-order learning algorithms update the weights of the MLP based on gradient matrices, that is, first-order information, hence the name. We start by discussing the training of an MLP with a one-stage algorithm, in which we train both the output and input weights simultaneously using either the BP or CG algorithm. We then describe a two-stage algorithm called OWO-hidden weight optimization.

    5.3.1: Backpropagation algorithm

    The BP algorithm is a greedy line search algorithm with a step size chosen to achieve the maximum decrease of the objective function at each step [5]. BP is a computationally efficient method, used in conjunction with gradient-based algorithms, that is widely applied to train MLPs [24]. However, due to the nonconvexity of the error function (Eq. 9) in neural networks, BP is not guaranteed to find a global minimum but only local minima. Although this is considered a major drawback, Ref. [25] discusses why local minima are still useful in many practical problems. In each training epoch, we update all the weights of the network in the BP algorithm as follows:

    w ← w + z ⋅ g    (38)

    Here, w is a vector of network weights, w = vec(W, Woh, Woi), and g is a vector of network gradients, g = vec(G, Goh, Goi). The gradient matrices are the negative partials of E with respect to the weights: G = −∂E/∂W, Goh = −∂E/∂Woh, and Goi = −∂E/∂Woi. The vec() operator performs a lexicographic ordering of a matrix into a vector. z is the optimal learning factor, derived using a Taylor series expansion of the MSE E, expressed in terms of z, as [26]

    z = −(∂E/∂z) / (∂²E/∂z²)    (39)

    The BP algorithm can be summarized in Algorithm 4.

    Algorithm 4 Backpropagation algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Calculate g
    4:     Compute z from Eq. (39)
    5:     Update w as w ← w + z ⋅ g
    6:     it ← it + 1
    7: end while
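    One BP epoch for the MLP of Fig. 5 can be sketched as follows (our own illustration with a fixed learning factor z; the text instead derives an optimal z from Eq. (39)):

    import numpy as np

    def bp_epoch(X1, T, W, Woh, Woi, z=0.01):
        # Rows of X1 are augmented inputs [x_p^T, 1]; rows of T are desired outputs.
        Nv = X1.shape[0]
        Net = X1 @ W.T                    # hidden net functions
        Op = 1.0 / (1.0 + np.exp(-Net))   # sigmoid activations
        Y = X1 @ Woi.T + Op @ Woh.T       # network outputs, Eq. (32)
        D = 2.0 * (T - Y) / Nv            # output deltas
        Goi = D.T @ X1                    # negative gradient w.r.t. Woi
        Goh = D.T @ Op                    # negative gradient w.r.t. Woh
        Dh = (D @ Woh) * Op * (1.0 - Op)  # backpropagated hidden deltas
        G = Dh.T @ X1                     # negative gradient w.r.t. W
        return W + z * G, Woh + z * Goh, Woi + z * Goi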

    As in Ref. [5], the BP algorithm faces two major criticisms. First, it does not scale well, that is, it takes si54_e operations for sufficiently large Nw; second, being a simple gradient descent procedure, it is unduly slow in the presence of flat error surfaces and is not a very reliable learning paradigm.

    5.3.2: Training lemmas

    Lemma 1

    For the kth hidden unit, if f ′(np(k)) = 0 for all patterns, then the weights feeding into that unit will not change during BP training.

    Proof.

    Observe the partial derivative formula below. The partial of E with respect to w(k, n) is 0 under the conditions of the lemma.

    δp(k) = f ′(np(k)) Σ_{m=1}^{M} [tp(m) − yp(m)] woh(m, k)    (40)

    ∂E/∂w(k, n) = −(2/Nv) Σ_{p=1}^{Nv} δp(k) xp(n)    (41)

    Note that f  ′(np(k)) is greatest for 0-valued net functions.

    Implications

    1. It would seem then that 0-valued initial weights are a good idea, since f ′() is largest for net values equal to 0.

    2. During net control, it seems that standard deviations should be less than 2 and means should be close to 0.

    Lemma 2

    Let the kth hidden unit have any set of weights. Let units k + j for j between 1 and (K − 1) have input and output weights identical in value to those of the kth unit. In other words, w(k + j, n) is the same for j between 0 and (K − 1), and woh(i, k + j) is the same for j between 0 and (K − 1). After training with BP, all K of these hidden units still have identically valued weights and can be replaced by a single hidden unit.

    Proof.

    If the units start with identical weights, they have identically valued weight changes and are still identical after training.

    The hidden unit outputs start identical and may experience identical weight changes.

    OWO-BP is more difficult to predict.

    Implications

    Random initial weights ensure that no hidden units are identical.

    Lemma 3

    During BP training, input weights are sensitive to input means.

    Proof.

    Modeling xp(n) as ep(n) + mn, where ep(n) is zero mean and mn is the mean of the nth input, the negative partial of E with respect to w(k, n) is found as

    −∂E/∂w(k, n) = (2/Nv) Σ_{p=1}^{Nv} δp(k) mn + (2/Nv) Σ_{p=1}^{Nv} δp(k) ep(n)    (42)

    Note that as mn increases, the gradient becomes dominated by the first term, which carries no information related to changes in the nth input.

    Lemma 4

    OWO-BP will not change thresholds (θ(k) or w(k, N + 1)) unless f′(np(k)) changes with p.

    Proof

    Assume f′(np(k)) does not change with p. Hence, we get

    −∂E/∂w(k, N + 1) = (2/Nv) f ′(n(k)) Σ_{p=1}^{Nv} Σ_{m=1}^{M} [tp(m) − yp(m)] woh(m, k) = 0    (43)

    Lemma 5

    An optimal MLP structure for the multioutput approximation case (M > 1) has different hidden units for each output.

    Proof.

    The input weight gradient matrix G is found as

    G = (2/Nv) Σ_{p=1}^{Nv} δp xp^T    (44)

    When updating the weights, the gradient and the learning factor differ for each output; hence, the optimal MLP will have different hidden units with different input weights for each output.

    Implications

    1. If the kth hidden unit's activation has constant slope f ′(np(k)), θk will not change during BP training. So, during net control, we should not have mean = 0 and a standard deviation that is too small.

    2. Together, Lemmas 1 through 4 provide the motivation for zero-mean inputs, random initial weights, and net control with nonzero mean and standard deviation.

    3. Lemma 5 shows that training a multioutput MLP with the same hidden units for each output is not optimal.

    4. Lemma 3 shows that BP lacks affine invariance [27].

    5.4: Second-order learning algorithms

    The basic idea behind using a second-order method is to improve on first-order algorithms by using the second derivative along with the first derivative [5]. We present two one-stage algorithms, namely Newton's method and LM, and then a two-stage algorithm called OWO-multiple optimal learning factors [28–30].

    5.4.1: Newton’s method

    In Newton's method, given a starting point, we construct a quadratic approximation to a twice-differentiable error function that matches the first- and second-order derivative values at that point. We then minimize this quadratic function instead of the original error function by expanding the Taylor series of E about the point wk, as is clear from the equation below:

    E(w) ≈ E(wk) − g^T (w − wk) + (1/2)(w − wk)^T H (w − wk)    (45)

    Here, w is the new version of the weight vector that we are trying to find. We calculate the Hessian H and gradient g of the MSE error function, where the elements of H are given by

    h(i, j) = ∂²E / (∂w(i) ∂w(j))    (46)

    and the elements of g are given by

    g(i) = −∂E / ∂w(i)    (47)

    We calculate the second-order direction, d, by solving the set of linear equations with OLS

    H d = g    (48)

    Assuming quadratic error function in Eq. (9) and H to be positive definite, applying first-order necessary condition (FONC) [31], on all the weights in an MLP, we update the weights as

    w ← w + d    (49)

    Newton's algorithm can be summarized in Algorithm 5.

    Algorithm 5 Newton’s algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Calculate g and H from Eqs. (46), (47)
    4:     Compute d from Eq. (48)
    5:     Update w as w ← w + d
    6:     it ← it + 1
    7: end while

    Newton's method is quadratically convergent and affine invariant [16]. Since it converges fast, we would like to use it to train an MLP; however, the Hessian H is generally singular [32].

    If the error function is quadratic, Newton's method reaches the minimum in one step; otherwise, the quadratic approximation provides only an estimate of the exact solution. In the case of a nonquadratic error measure, we require a line search, and w is updated as

    w ← w + z ⋅ d    (50)
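    A single Newton iteration, in code (our own illustration; lstsq stands in for the OLS solve of Eq. (48), since H is frequently singular or ill-conditioned):

    import numpy as np

    def newton_step(w, g, H):
        # g is the negative gradient (Eq. 47) and H the Hessian (Eq. 46).
        d, *_ = np.linalg.lstsq(H, g, rcond=None)  # solve H d = g, Eq. (48)
        return w + d                               # update of Eq. (49)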

    5.4.2: LM algorithm

    The LM algorithm is a compromise between Newton's method, which converges rapidly near a local or global minimum but may diverge, and gradient descent, which has assured convergence through a proper selection of the step size parameter but converges slowly. Following Eq. (45), the LM algorithm is a suboptimal method. Since H is usually singular in Newton's method, an alternative is to modify the Hessian matrix as in the LM algorithm [33] or to use a two-step method such as layer-by-layer training [34]. In LM, we modify the Hessian as

    HLM = H + λ I    (51)

    Here, I is the identity matrix with the same dimensions as H, and λ is a regularizing parameter that forces the sum matrix (H + λI) to be positive definite and safely well-conditioned throughout the computation. We calculate the second-order direction d, similarly to Newton's method, as

    HLM d = g    (52)

    After obtaining HLM, the weights of the model are updated using Eq. (49).

    The regularizing parameter λ plays a crucial role in the way the LM algorithm functions. If we set λ equal to 0, Eq. (52) reduces to Newton's method (Eq. 49). On the other hand, if we assign a large value to λ such that λI overpowers the Hessian H, the LM algorithm effectively behaves as a gradient descent algorithm. Press et al. [35] recommend an excellent Marquardt recipe for the selection of λ.

    From a practical perspective, the computational complexity of obtaining HLM can be demanding, mainly when the dimensionality of the weight vector w is high. Therefore, due to scalability constraints, LM is particularly suitable for small networks.

    The LM algorithm can be summarized in Algorithm 6.

    Algorithm 6 LM algorithm.

    1: Initialize w, Nit, it ← 0
    2: while it < Nit do
    3:     Present all patterns to the network to compute the error Eold from Eq. (9)
    4:     Calculate g and H from Eqs. (46), (47)
    5:     Obtain HLM from Eq. (51)
    6:     Compute d from Eq. (52)
    7:     Update w as w ← w + d
    8:     Recompute the error Enew using the updated weights
    9:     if Enew < Eold then
    10:        Reduce the value of λ
    11:        go to Step 3
    12:    else
    13:        Increase the value of λ
    14:    end if
    15:    it ← it + 1
    16: end while
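    One LM iteration with the accept/reject logic of Algorithm 6 can be sketched as below (our own illustration; error_fn, grad_fn, and hess_fn are hypothetical user-supplied callables returning E, the negative gradient g, and the Hessian H):

    import numpy as np

    def lm_step(w, error_fn, grad_fn, hess_fn, lam, factor=10.0):
        E_old = error_fn(w)
        g, H = grad_fn(w), hess_fn(w)
        H_lm = H + lam * np.eye(H.shape[0])  # Eq. (51)
        d = np.linalg.solve(H_lm, g)         # second-order direction, Eq. (52)
        w_new = w + d
        if error_fn(w_new) < E_old:
            return w_new, lam / factor       # accept the step and reduce lambda
        return w, lam * factor               # reject the step and increase lambda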

    6: KL divergence

    From an information theory point of view, the Kullback-Leibler (KL) divergence is a relative entropy measure that estimates how closely an approximate distribution q(x) models an unknown distribution p(x). In order to arrive at the expression for the KL divergence, we follow a probabilistic viewpoint. The following explanation is suitable for both continuous and discrete distributions. To quantitatively determine how good the approximate model q(x) is compared to p(x), we calculate the log-likelihood ratio over an entire dataset.
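    For the discrete case, the KL divergence D(p ∥ q) = Σx p(x) log[p(x)/q(x)] can be computed as below (a minimal sketch using the standard definition; the chapter's own derivation continues from the log-likelihood ratio):

    import numpy as np

    def kl_divergence(p, q):
        # D(p || q) for discrete distributions; terms with p(x) = 0 contribute 0.
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))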
