Deep Learning for Medical Image Analysis
Ebook · 1,034 pages · 10 hours
About this ebook

Deep Learning for Medical Image Analysis, Second Edition is a great learning resource for academic and industry researchers and graduate students taking courses on machine learning and deep learning for computer vision and medical image computing and analysis. Deep learning provides exciting solutions for medical image analysis problems and is a key method for future applications. This book gives a clear understanding of the principles and methods of neural network and deep learning concepts, showing how the algorithms that integrate deep learning as a core component are applied to medical image detection, segmentation, registration, and computer-aided analysis.

· Covers common research problems in medical image analysis and their challenges

· Describes the latest deep learning methods and the theories behind approaches for medical image analysis

· Teaches how algorithms are applied to a broad range of application areas including cardiac, neural and functional, colonoscopy, OCTA applications and model assessment

· Includes a Foreword written by Nicholas Ayache

Language: English
Release date: Nov 23, 2023
ISBN: 9780323858885
    Book preview

    Deep Learning for Medical Image Analysis - S. Kevin Zhou

    Part 1: Deep learning theories and architectures

    Outline

    Chapter 1. An introduction to neural networks and deep learning

    Chapter 2. Deep reinforcement learning in medical imaging

    Chapter 3. CapsNet for medical image segmentation

    Chapter 4. Transformer for medical image analysis

    Chapter 1: An introduction to neural networks and deep learning

    Ahmad Wisnu Mulyadi (b); Jee Seok Yoon (b); Eunjin Jeon (b); Wonjun Ko (b); Heung-Il Suk (a,b)

    (a) Korea University, Department of Artificial Intelligence, Seongbuk-Gu, Seoul, Korea

    (b) Korea University, Department of Brain and Cognitive Engineering, Seongbuk-Gu, Seoul, Korea

    Abstract

    Artificial neural networks, conceptually and structurally inspired by neural systems, are of great interest along with deep learning, thanks to their great successes in various fields, including medical imaging analysis. In this chapter, we describe the fundamental concepts and ideas of (deep) neural networks and explain algorithmic advances to learn network parameters efficiently by avoiding overfitting. Specifically, this chapter focuses on introducing 1) feed-forward neural networks, 2) gradient descent-based parameter optimization algorithms, 3) different types of deep models, 4) technical tricks for fast and robust training of deep models and 5) open-source deep learning frameworks for quick practice.

    Keywords

    Artificial neural network; Deep learning; Feedforward neural network; Convolutional neural network; Recurrent neural network; Deep generative models

    1.1 Introduction

    The brain, or biological neural network, is considered the most well-organized system for processing information from the different senses, such as sight, hearing, touch, taste and smell, in an efficient and intelligent manner. One of the key mechanisms for information processing in the human brain is that complicated high-level information is processed by means of the collaboration, i.e., connections (called synapses), of a large number of structurally simple elements (called neurons). In machine learning, artificial neural networks are a family of models that mimic the structural elegance of the neural system and learn patterns inherent in observations.

    1.2 Feed-forward neural networks

    This section introduces neural networks that process information in a feed-forward manner. Throughout the chapter, matrices and vectors are denoted as boldface uppercase letters and boldface lowercase letters, respectively, and scalars are denoted as normal italic letters. For a transpose operator, a superscript ⊤ is used.

    1.2.1 Perceptron

    The simplest learnable artificial neural model, known as the perceptron [1], is structured with input visible units $\mathbf{x} = [x_1, \dots, x_D]^\top$, trainable connection weights $\mathbf{w}$ and a bias $b$, and an output unit $y$, as shown in Fig. 1.1(a). Since the perceptron model has a single layer of an output unit, not counting the input visible layer, it is also called a single-layer neural network. Given an observation¹ or datum $\mathbf{x}$, the value of the output unit $y$ is obtained from an activation function $f(\cdot)$ by taking the weighted sum of the inputs as follows:

    $y = f(z) = f\left(\sum_{d=1}^{D} w_d x_d + b\right) = f\big(\mathbf{w}^\top \mathbf{x} + b\big)$    (1.1)

    where $\Theta = \{\mathbf{w}, b\}$ denotes a parameter set, $\mathbf{w} = [w_1, \dots, w_D]^\top$ is a connection weight vector and $b$ is a bias. Let us introduce a pre-activation variable $z$ that is determined by the weighted sum of the inputs, i.e., $z = \mathbf{w}^\top\mathbf{x} + b$. As for the activation function $f(\cdot)$, a "logistic sigmoid" function, i.e., $f(z) = 1/(1 + \exp(-z))$, is commonly used for a binary classification task.

    Figure 1.1 An architecture of a single-layer neural network.
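    As a concrete illustration of Eq. (1.1), the following minimal NumPy sketch (with illustrative variable names and toy values of our own choosing) computes the perceptron output with a logistic sigmoid activation:

```python
# A minimal NumPy sketch of the perceptron forward pass in Eq. (1.1);
# the array names (x, w, b) mirror the notation above and are illustrative.
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Weighted sum of the inputs followed by the activation function."""
    z = w @ x + b          # pre-activation z = w^T x + b
    return sigmoid(z)      # output unit y = f(z)

x = np.array([0.5, -1.2, 3.0])   # an observation with D = 3 visible units
w = np.array([0.1, 0.4, -0.2])   # trainable connection weights
b = 0.05                         # bias
y = perceptron_forward(x, w, b)  # probability-like output for binary classification
```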

    Regarding a multi-output task, e.g., multi-class classification or multi-output regression, it is straightforward to extend the perceptron model by adding multiple output units $y_1, \dots, y_K$ (Fig. 1.1(b)), one for each class, with their respective connection weights as follows:

    $y_k = f\left(\sum_{d=1}^{D} w_{kd} x_d + b_k\right), \quad k = 1, \dots, K$    (1.2)

    where $w_{kd}$ denotes a connection weight from $x_d$ to $y_k$. As for the activation function, it is common to use a "softmax" function for multi-class classification, where the output values can be interpreted as probabilities.

    1.2.2 Multi-layer perceptron

    One of the main limitations of the single-layer neural network is that it can represent only linear decision boundaries for a classification task, despite the use of a nonlinear activation function. This limitation can be circumvented by introducing a so-called "hidden" layer between the input layer and the output layer, as shown in Fig. 1.2. For a two-layer neural network, which is also known as a multi-layer perceptron (MLP), we can write its composition function as follows:

    $\mathbf{y} = f^{(2)}\!\left(\mathbf{W}^{(2)} f^{(1)}\!\left(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right) + \mathbf{b}^{(2)}\right)$    (1.3)

    where the superscript $(l)$ denotes a layer index, $M$ denotes the number of hidden units and $\mathbf{W}^{(1)} \in \mathbb{R}^{M \times D}$. Hereafter, the bias term is omitted for simplicity. It is possible to add a number of hidden layers ($L \geq 2$) and the corresponding estimation function is defined as

    $\mathbf{y} = f^{(L)}\!\left(\mathbf{W}^{(L)} f^{(L-1)}\!\left(\cdots f^{(1)}\!\left(\mathbf{W}^{(1)}\mathbf{x}\right)\cdots\right)\right)$    (1.4)

    Although different types of activation functions can, in theory, be applied to different layers or even different units, it is common in the literature to apply the same type of activation function to all hidden layers. The activation should be a nonlinear function; otherwise, the whole network collapses to a single-layer neural network whose weight matrix equals the product of the hidden layers' weight matrices. Regarding the activation function, a sigmoidal function such as the logistic sigmoid or the hyperbolic tangent was commonly used in earlier models thanks to its nonlinear and differentiable characteristics. However, these two activation functions make it difficult to train a neural network when stacking layers deeply. In this respect, recent works [2–5] proposed other nonlinear functions, and their details are provided in Section 1.6.2.

    Figure 1.2 An architecture of a two-layer neural network.

    1.2.3 Learning in feed-forward neural networks

    In terms of network learning, there are two fundamental problems, namely, network architecture learning and network parameter learning. While network architecture learning still remains an open question,² there exists an efficient algorithm for network parameter learning, as described below.

    The problem of learning the parameters of an $L$-layer neural network can be formulated as error function minimization. Assume a training data set $\{(\mathbf{x}_n, \mathbf{t}_n)\}_{n=1}^{N}$, where $\mathbf{x}_n$ denotes an observation and $\mathbf{t}_n$ denotes a class indicator vector with one-of-K encoding, i.e., for a class $k$, only the $k$th element in the vector is 1 and all the other elements are 0. For a $K$-class classification, it is common to use a cross-entropy cost function defined as follows:

    $E(\mathbf{W}) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_{nk}$    (1.5)

    where $t_{nk}$ denotes the $k$th element of the target vector $\mathbf{t}_n$, and $y_{nk}$ is the $k$th element of the prediction vector for $\mathbf{x}_n$, which is obtained by Eq. (1.4) with the parameter set $\mathbf{W} = \{\mathbf{W}^{(l)}\}_{l=1}^{L}$.

    The error function in Eq. (1.5) is highly nonlinear and nonconvex. Thus there is no analytic solution of the parameter set W that minimizes Eq. (1.5). Instead, we resort to a gradient descent algorithm by updating the parameters iteratively. Specifically, the parameters of L-layers, W, are updated as follows:

    $\mathbf{W}^{(\tau+1)} = \mathbf{W}^{(\tau)} - \eta \nabla E\big(\mathbf{W}^{(\tau)}\big)$    (1.6)

    where $\tau$ denotes an iteration index, $\eta$ is a learning rate, and $\nabla E(\mathbf{W}^{(\tau)})$ denotes the set of gradients with respect to $\mathbf{W}$, obtained by means of error backpropagation [6]. To compute the derivative of an error function $E$ with respect to the parameters of the $l$th layer, i.e., $\partial E / \partial \mathbf{W}^{(l)}$, we propagate errors from the output layer back to the input layer by the chain rule:

    $\dfrac{\partial E}{\partial \mathbf{W}^{(l)}} = \dfrac{\partial E}{\partial \mathbf{z}^{(L)}} \dfrac{\partial \mathbf{z}^{(L)}}{\partial \mathbf{a}^{(L-1)}} \dfrac{\partial \mathbf{a}^{(L-1)}}{\partial \mathbf{z}^{(L-1)}} \cdots \dfrac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{a}^{(l)}} \dfrac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} \dfrac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}}$    (1.7)

    where $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$ denote, respectively, the pre-activation vector and the activation vector of the layer $l$, and $\mathbf{a}^{(0)} = \mathbf{x}$. Note that $\partial E / \partial \mathbf{z}^{(L)}$, or equally $\boldsymbol{\delta}^{(L)}$, corresponds to the error computed at the output layer. For the estimation of the gradient of an error function $E$ with respect to the parameter $\mathbf{W}^{(l)}$, it utilizes the error propagated from the output layer through the chains in the form of $\partial \mathbf{z}^{(l'+1)} / \partial \mathbf{a}^{(l')}$, $\partial \mathbf{a}^{(l')} / \partial \mathbf{z}^{(l')}$, along with $\partial \mathbf{z}^{(l)} / \partial \mathbf{W}^{(l)}$. These fractions can also be computed in a similar way as follows:

    $\dfrac{\partial \mathbf{z}^{(l'+1)}}{\partial \mathbf{a}^{(l')}} = \mathbf{W}^{(l'+1)}$    (1.8)

    $\dfrac{\partial \mathbf{a}^{(l')}}{\partial \mathbf{z}^{(l')}} = f'\big(\mathbf{z}^{(l')}\big)$    (1.9)

    where $f'(\mathbf{z}^{(l')})$ denotes the gradient of the activation function with respect to the pre-activation vector $\mathbf{z}^{(l')}$.

    As for the parameter update in Eq. (1.6), there are two different approaches depending on the timing of parameter update, namely, batch gradient descent and stochastic gradient descent. The batch gradient descent updates the parameters based on the gradients ∇E evaluated over the whole training samples. Meanwhile, the stochastic gradient descent sequentially updates weight parameters by computing gradient on the basis of one sample at a time. When it comes to large-scale learning such as deep learning, it is advocated to apply stochastic gradient descent [7]. As a trade-off between batch gradient and stochastic gradient, a mini-batch gradient descent method, which computes and updates the parameters on the basis of a small set of samples, is commonly used in the literature [8].
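    The following hedged NumPy sketch ties Eqs. (1.3)–(1.9) together: a two-layer network trained with mini-batch gradient descent on synthetic data. The layer sizes, learning rate and data are illustrative assumptions, not values from the book:

```python
# Hedged sketch of mini-batch gradient descent for a two-layer network
# (forward pass, error backpropagation, and the update of Eq. (1.6)).
import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 256, 10, 32, 3                  # samples, inputs, hidden units, classes
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]    # one-of-K target vectors

W1 = rng.normal(scale=0.1, size=(D, M))      # first-layer weights
W2 = rng.normal(scale=0.1, size=(M, K))      # second-layer weights

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

eta, batch = 0.1, 32
for epoch in range(20):
    for i in range(0, N, batch):
        x, t = X[i:i + batch], T[i:i + batch]
        # forward pass
        a1 = np.tanh(x @ W1)                 # hidden activations
        y = softmax(a1 @ W2)                 # predictions
        # backward pass (error backpropagation via the chain rule)
        d2 = (y - t) / len(x)                # error at the output layer
        d1 = (d2 @ W2.T) * (1 - a1 ** 2)     # error propagated to the hidden layer
        # parameter update, Eq. (1.6)
        W2 -= eta * (a1.T @ d2)
        W1 -= eta * (x.T @ d1)
```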

    1.3 Convolutional neural networks

    In conventional multi-layer neural networks, the inputs are always in vector form. However, for (medical) images, the structural or configural information among neighboring pixels or voxels is another source of information. Hence, vectorization inevitably destroys such structural and configural information in images. A convolutional neural network (CNN) that typically has convolutional layers interspersed with pooling (or sub-sampling) layers and then followed by fully connected layers as in a standard multi-layer neural network (Fig. 1.3) is designed to better utilize such spatial and configuration information by taking 2D or 3D images as input. Unlike the conventional multi-layer neural networks, a CNN exploits extensive weight-sharing to reduce the degrees of freedom of models. A pooling layer helps reduce computation time and gradually builds up spatial and configural invariance.

    Figure 1.3 An architecture of a convolutional neural network.

    1.3.1 Convolution and pooling layer

    The role of a convolution layer is to detect local features at different positions in the input feature maps with learnable kernels $\mathbf{k}_{ij}^{(l)}$, i.e., connection weights between the feature map $i$ at the layer $l-1$ and the feature map $j$ at the layer $l$. Specifically, the units of the convolution layer $l$ compute their activations $\mathbf{a}_j^{(l)}$ based only on a spatially contiguous subset of units in the feature maps $\mathbf{a}_i^{(l-1)}$ of the preceding layer by convolving the kernels as follows:

    $\mathbf{a}_j^{(l)} = f\left(\sum_{i=1}^{M^{(l-1)}} \mathbf{a}_i^{(l-1)} \ast \mathbf{k}_{ij}^{(l)} + b_j^{(l)}\right)$    (1.10)

    where $M^{(l-1)}$ denotes the number of feature maps in the layer $l-1$, $\ast$ denotes a convolution operator, $b_j^{(l)}$ is a bias parameter and $f(\cdot)$ is a nonlinear activation function. Due to the local connectivity and weight sharing, we can greatly reduce the number of parameters compared to a fully connected neural network, and it becomes possible to avoid overfitting. Further, when the input image is shifted, the activations of the units in the feature maps are shifted by the same amount, which allows a CNN to be equivariant to small shifts, as illustrated in Fig. 1.4. In the figure, when the pixel values in the input image are shifted by one pixel right and one pixel down, the outputs after convolution are also shifted by one pixel right and one pixel down.

    Figure 1.4 Illustration of translation invariance in convolution neural network. The bottom leftmost input is a translated version of the upper leftmost input image by one-pixel right and one-pixel down.
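    As an illustration of the convolution in Eq. (1.10), here is a hedged NumPy sketch for a single input and output feature map; it implements "valid" cross-correlation without padding, as is common in deep learning libraries, and the toy input and kernel are our own:

```python
# Illustrative NumPy sketch of the convolution in Eq. (1.10) for one
# input/output feature map ("valid" mode, no padding).
import numpy as np

def conv2d_valid(a_prev, kernel, bias=0.0):
    """Slide the kernel over the input map and take weighted sums."""
    H, W = a_prev.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for u in range(out.shape[0]):
        for v in range(out.shape[1]):
            patch = a_prev[u:u + kH, v:v + kW]
            out[u, v] = np.sum(patch * kernel) + bias
    return out

a = np.arange(25.0).reshape(5, 5)        # toy 5x5 input feature map
k = np.array([[1., 0.], [0., -1.]])      # a 2x2 kernel (fixed here for the demo)
z = conv2d_valid(a, k)                   # pre-activation map; apply f(.) afterwards
```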

    A pooling layer follows a convolution layer by downsampling the feature maps of the preceding convolution layer. Specifically, each feature map in a pooling layer is linked with a feature map in the convolution layer, and each unit in a feature map of the pooling layer is computed based on a subset of units within a receptive field. Similar to the convolution layer, the pooling operation, which takes the maximal value among the units in its receptive field, is slid over the convolution map, but with a stride equal to the size of the receptive field so that contiguous receptive fields do not overlap; a sketch follows below. The role of the pooling layer is to progressively reduce the spatial size of the feature maps, thereby reducing the number of parameters and the computation involved in the network. Another important function of the pooling layer is to provide translation invariance over small spatial shifts in the input. In Fig. 1.4, while the bottom leftmost image is a translated version of the top leftmost image by one pixel right and one pixel down, their outputs after the convolution and pooling operations are the same, especially for the units in green.
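    A matching NumPy sketch of non-overlapping max pooling, with an assumed 2 x 2 receptive field and a stride equal to the field size as described above:

```python
# Non-overlapping max pooling with stride equal to the receptive field size.
import numpy as np

def max_pool(feature_map, size=2):
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    out = np.zeros((H2, W2))
    for u in range(H2):
        for v in range(W2):
            out[u, v] = feature_map[u * size:(u + 1) * size,
                                    v * size:(v + 1) * size].max()
    return out

fm = np.arange(16.0).reshape(4, 4)
pooled = max_pool(fm)   # 2x2 output; small input shifts often leave it unchanged
```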

    1.3.2 Computing gradients

    Assume that a convolution layer is followed by a pooling layer. In such a case, units in a feature map of a convolution layer $l$ are connected to a single unit of the corresponding feature map in the pooling layer $l+1$. By up-sampling the error (sensitivity) map $\boldsymbol{\delta}_j^{(l+1)}$ of the pooling layer to recover the reduced size of the maps, all we need to do is to multiply it with the derivative of the activation function evaluated at the convolution layer's pre-activations as follows:

    $\boldsymbol{\delta}_j^{(l)} = f'\big(\mathbf{z}_j^{(l)}\big) \odot \operatorname{up}\big(\boldsymbol{\delta}_j^{(l+1)}\big)$    (1.11)

    where $\odot$ and $\operatorname{up}(\cdot)$ denote an element-wise multiplication and an up-sampling operation, respectively.

    For the case when a current layer, whether it is a pooling layer or a convolution layer, is followed by a convolution layer, we must figure out which patch in the current layer's feature map corresponds to a unit in the next layer's feature map. The weights multiplying the connections between the input patch and the output unit are exactly the weights of the convolutional kernel. The gradients for the kernel weights are computed by the chain rule, similar to backpropagation. However, since the same weights are now shared across many connections, we need to sum the gradients for a given weight over all the connections that use it as follows:

    $\dfrac{\partial E}{\partial \mathbf{k}_{ij}^{(l)}} = \sum_{u,v} \boldsymbol{\delta}_j^{(l)}(u,v)\, \mathbf{p}_i^{(l-1)}(u,v)$    (1.12)

    where $\mathbf{p}_i^{(l-1)}(u,v)$ denotes the patch in the $i$th feature map of the layer $l-1$, i.e., $\mathbf{a}_i^{(l-1)}$, which was multiplied element-wise by $\mathbf{k}_{ij}^{(l)}$ during convolution to compute the element at $(u,v)$ in the output feature map $\mathbf{a}_j^{(l)}$.

    1.3.3 Deep convolutional neural networks

    With the advances in computing hardware, recent works utilizing neural networks have grown in depth (i.e., convolution and pooling layers) and width (i.e., channel size). However, as CNNs get deeper and wider, the difficulties (e.g., computational cost, vanishing gradients and degradation) in training them also grow. Thus various methods for improving the computational efficiency of deep models are described in the following sections.

    1.3.3.1 Skip connection

    A skip connection, or shortcut connection, constructs an alternative path for gradients to flow from one layer to layers in the deeper part of the neural network. Specifically, a skip connection constructs a path that jumps over one or more layers via addition or concatenation. For example, the residual connection [9] constructs the skip connection via addition as follows:

    $\mathbf{a}^{(l+1)} = f\big(\mathbf{k}^{(l)} \ast \mathbf{a}^{(l)} + b^{(l)}\big) + \mathbf{a}^{(l)}$    (1.13)

    where $\mathbf{k}^{(l)}$ and $b^{(l)}$ are the kernel and bias parameters, respectively. Similarly, the dense connection [10] constructs the skip connection via channel-wise concatenation. This construction via concatenation allows a layer to receive the feature maps of all previously connected layers, introducing on the order of $l(l+1)/2$ connections by the $l$th skip connection instead of the $l$ connections in the residual connection.
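    As a sketch of the residual construction in Eq. (1.13), the following minimal PyTorch module adds the input back to the output of two convolution layers; the channel count and kernel size are illustrative assumptions:

```python
# A minimal residual block in the spirit of Eq. (1.13), sketched with PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # padding=1 keeps the spatial size so the skip addition is well defined
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)   # skip connection via addition

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock()(x)           # same shape as x
```

    A dense connection would instead concatenate the block input with its output along the channel dimension rather than adding them.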

    1.3.3.2 Inception module

    The inception module [11] was introduced to significantly reduce the computational cost of deep neural networks via sparse multi-scale processing. Specifically, it reduces the number of arithmetic operations of the convolution functions by reducing the filter size, i.e., introducing sparsity, via $1 \times 1$ convolution layers followed by convolution layers of different kernel sizes. In practice, the inception module achieved state-of-the-art performance in image classification tasks while reducing the number of arithmetic operations by 100 times compared to its counterparts. Several improvements to the inception module were made throughout the past decade. For example, Inception-v2 and -v3 [12] utilize even sparser convolution operations, i.e., smaller kernel sizes, to improve the computational efficiency. Inception-v4, or Inception-ResNet-v1 and -v2 [13], include the skip connection construction in addition to the multi-scale convolution operations.

    1.3.3.3 Attention

    The attention mechanism in deep learning allows neural networks not only to attend to salient information in noisy data but also to act as a memory function. Attention methods can be broadly categorized by the form of the attention function: soft vs. hard attention, global vs. local attention, and multi-head attention. Soft attention [14], also commonly known as global attention [15], places attention over all patches of an image, while hard attention [16] selects a single patch at a time. Soft attention is generally more favorable in terms of computational efficiency, because hard attention models are nondifferentiable and require special techniques such as reinforcement learning. Local attention [14] is a differentiable model that combines the advantages of soft and hard attention. Meanwhile, multi-head attention [17] attends to different information in a parallel manner.

    In practice, attention mechanisms for medical image analysis typically utilize channel-wise and spatial-wise attention [18] as well as global attention [19] to improve model performance. Note that such attention techniques have also been used as a tool for visual interpretation [20], where the most attended regions can localize the features that support the decision made by a neural network.
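    The sketch below illustrates the soft-attention idea in NumPy: every spatial position of a (flattened) feature map receives a softmax weight from a scoring vector, and the output is the attention-weighted sum. The shapes and the scoring vector are assumptions for illustration only:

```python
# Hedged NumPy sketch of soft (spatial) attention over a flattened feature map.
import numpy as np

def soft_spatial_attention(features, score_w):
    """features: (H*W, C) flattened feature map, score_w: (C,) scoring weights."""
    scores = features @ score_w                      # one scalar score per position
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                      # soft attention weights sum to 1
    context = alpha @ features                       # weighted sum over all positions
    return context, alpha                            # alpha can be visualized for interpretation

feats = np.random.default_rng(0).normal(size=(64, 128))   # 8x8 map with 128 channels
w = np.random.default_rng(1).normal(size=128)
ctx, attn = soft_spatial_attention(feats, w)
```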

    1.4 Recurrent neural networks

    (Medical) images are commonly accompanied by corresponding attributes recorded at the time of measurement; in the medical domain, these could be the subject's clinical measurements (e.g., vital signs, lab results, clinical notes). When such data are acquired periodically, (multi-modal) sequential data emerge, which require dedicated deep models that can effectively incorporate the entire timespan of the data. We therefore devote this section to concisely covering recurrent neural networks (RNNs), which are well established for handling variable-length sequential data across diverse (clinical) downstream tasks.

    1.4.1 Recurrent cell

    RNNs process sequential data through the so-called recurrent cell, as illustrated in Fig. 1.5. Suppose that the $T$-length sequential data $\{\mathbf{x}_t\}_{t=1}^{T}$, with $\mathbf{x}_t \in \mathbb{R}^{D}$, comes with the corresponding labels $\{\mathbf{y}_t\}_{t=1}^{T}$. For each timestep, a typical recurrent cell integrates the input $\mathbf{x}_t$ with the previous hidden state $\mathbf{h}_{t-1}$ as

    $\mathbf{h}_t = \tanh\big(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}\big)$    (1.14)

    where $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$ and $\mathbf{b}$ denote the input transformation weights, the hidden-state transformation weights and the bias, respectively. Here, the hyperbolic tangent serves as the activation function, squashing the outcome into $(-1, 1)$. Meanwhile, the initial hidden state $\mathbf{h}_0$ could be either initialized with zeros or inferred from auxiliary networks. Thus, as $\mathbf{h}_t$ holds a summary of the underlying information in the sequence so far, the prediction could be inferred as

    $\mathbf{o}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{c}$    (1.15)

    $\hat{\mathbf{y}}_t = \operatorname{softmax}(\mathbf{o}_t)$    (1.16)

    with $\mathbf{W}_{hy}$ and $\mathbf{c}$ denoting the weights and bias, correspondingly. Note that other settings may employ only the last hidden state $\mathbf{h}_T$ for predicting a single label $\mathbf{y}$. For now, let us assume that the sequence of data and its labels have equal length, so that we require the model to make a prediction at each $t$th timestep. Thus we train the RNN by devising the following loss function over the entire $T$-length sequence:

    $\mathcal{L} = \sum_{t=1}^{T} \ell\big(\mathbf{y}_t, \hat{\mathbf{y}}_t\big)$    (1.17)

    As RNNs perform forward propagation over the whole timespan of the sequence, the gradient should be evaluated via backpropagation through time [21]. Furthermore, as the weights of the recurrent cell are shared across timesteps, starting from the output we can calculate the gradients with respect to $\mathbf{W}_{hy}$ and $\mathbf{c}$, respectively, as follows:

    $\dfrac{\partial \mathcal{L}}{\partial \mathbf{W}_{hy}} = \sum_{t=1}^{T} \dfrac{\partial \ell_t}{\partial \mathbf{o}_t}\, \mathbf{h}_t^{\top}$    (1.18)

    $\dfrac{\partial \mathcal{L}}{\partial \mathbf{c}} = \sum_{t=1}^{T} \dfrac{\partial \ell_t}{\partial \mathbf{o}_t}$    (1.19)

    Subsequently, we could aggregate the gradients with respect to the weights $\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$ and bias $\mathbf{b}$ across the entire timesteps as

    $\dfrac{\partial \mathcal{L}}{\partial \mathbf{W}_{xh}} = \sum_{t=1}^{T} \dfrac{\partial \mathcal{L}}{\partial \mathbf{h}_t}\, \dfrac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{xh}}$    (1.20)

    $\dfrac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \dfrac{\partial \mathcal{L}}{\partial \mathbf{h}_t}\, \dfrac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}$    (1.21)

    $\dfrac{\partial \mathcal{L}}{\partial \mathbf{b}} = \sum_{t=1}^{T} \dfrac{\partial \mathcal{L}}{\partial \mathbf{h}_t}\, \dfrac{\partial \mathbf{h}_t}{\partial \mathbf{b}}$    (1.22)

    Figure 1.5 Graphical illustration of RNNs with the (a) rolled and (b) unrolled computational graph over the timesteps.
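    To make the recurrence in Eqs. (1.14)–(1.17) concrete, here is a hedged NumPy sketch of the forward pass and the per-timestep cross-entropy loss; the dimensions, weight names and random data are illustrative assumptions:

```python
# Illustrative NumPy sketch of the vanilla recurrent cell and per-timestep prediction.
import numpy as np

rng = np.random.default_rng(0)
T, D, H, K = 5, 8, 16, 3                       # sequence length, input, hidden, classes
Wxh = rng.normal(scale=0.1, size=(H, D))
Whh = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)
Why = rng.normal(scale=0.1, size=(K, H))
c = np.zeros(K)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

x_seq = rng.normal(size=(T, D))
y_seq = np.eye(K)[rng.integers(0, K, size=T)]  # one-hot label per timestep

h = np.zeros(H)                                # h_0 initialized with zeros
loss = 0.0
for t in range(T):
    h = np.tanh(Wxh @ x_seq[t] + Whh @ h + b)  # recurrent cell, Eq. (1.14)
    y_hat = softmax(Why @ h + c)               # prediction, Eqs. (1.15)-(1.16)
    loss += -np.sum(y_seq[t] * np.log(y_hat))  # cross-entropy term of Eq. (1.17)
```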

    1.4.2 Vanishing gradient problem

    As RNNs deal with $T$-length sequential data, the longer the sequence $T$, the higher the risk of vanishing or exploding gradients, due to the repeated matrix multiplications involved in computing the gradients across timesteps [22]. To address this issue, improvements upon the vanilla recurrent cell were proposed, pioneered by the long short-term memory (LSTM) [23], which incorporates a dedicated memory cell and gating mechanisms to adequately govern the information flow over the sequence, allowing long-term dependencies to be learned.

    An LSTM cell is composed of the following gating operations:

    $\mathbf{f}_t = \sigma\big(\mathbf{W}_f [\mathbf{h}_{t-1} \oplus \mathbf{x}_t] + \mathbf{b}_f\big)$    (1.23)

    $\mathbf{i}_t = \sigma\big(\mathbf{W}_i [\mathbf{h}_{t-1} \oplus \mathbf{x}_t] + \mathbf{b}_i\big)$    (1.24)

    $\tilde{\mathbf{c}}_t = \tanh\big(\mathbf{W}_c [\mathbf{h}_{t-1} \oplus \mathbf{x}_t] + \mathbf{b}_c\big)$    (1.25)

    $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$    (1.26)

    $\mathbf{o}_t = \sigma\big(\mathbf{W}_o [\mathbf{h}_{t-1} \oplus \mathbf{x}_t] + \mathbf{b}_o\big)$    (1.27)

    $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$    (1.28)

    with $\oplus$ denoting the concatenation operator and $\sigma(\cdot)$ the logistic sigmoid. In a nutshell, given the input $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$, an LSTM cell introduces $\mathbf{f}_t$ and $\mathbf{i}_t$, which serve as the forget and input gate, respectively. These factors regulate which information shall be pruned or retained in the new cell state $\mathbf{c}_t$. The cell further incorporates the cell state to obtain the current hidden state $\mathbf{h}_t$ by considering the output gate $\mathbf{o}_t$. Owing to these gating mechanisms and the dedicated memory cell, meaningful information can be conveyed by the recurrent cell to (distant) future timesteps, mitigating the vanishing (or exploding) gradient issue and also promising improvements on downstream tasks. Finally, similar to vanilla RNNs, we could employ such a hidden state to obtain the predicted label for each timestep as follows:

    $\hat{\mathbf{y}}_t = \operatorname{softmax}\big(\mathbf{W}_{hy}\mathbf{h}_t + \mathbf{c}\big)$    (1.29)

    In addition to the LSTM, numerous alternatives to the vanilla recurrent cell have been proposed, with the peephole LSTM [24] and the gated recurrent unit (GRU) [25] being two of the most popular.
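    A hedged NumPy sketch of a single LSTM step following Eqs. (1.23)–(1.28); all weight shapes and names are illustrative assumptions:

```python
# Hedged NumPy sketch of one LSTM step (forget, input, candidate, cell, output, hidden).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    u = np.concatenate([h_prev, x_t])          # [h_{t-1} (+) x_t]
    f_t = sigmoid(Wf @ u + bf)                 # forget gate, Eq. (1.23)
    i_t = sigmoid(Wi @ u + bi)                 # input gate, Eq. (1.24)
    c_tilde = np.tanh(Wc @ u + bc)             # candidate cell state, Eq. (1.25)
    c_t = f_t * c_prev + i_t * c_tilde         # new cell state, Eq. (1.26)
    o_t = sigmoid(Wo @ u + bo)                 # output gate, Eq. (1.27)
    h_t = o_t * np.tanh(c_t)                   # new hidden state, Eq. (1.28)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 8, 16
W = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bvec = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, *W, *bvec)
```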

    1.5 Deep generative models

    1.5.1 Restricted Boltzmann machine

    A restricted Boltzmann machine (RBM) is a two-layer undirected graphical model consisting of a layer of visible units $\mathbf{v}$ and a layer of hidden units $\mathbf{h}$. Note that the visible units are related to observations, and the hidden units represent the structures or dependencies over the visible units. It assumes symmetric connectivity $\mathbf{W}$ between the visible layer and the hidden layer but no connections within a layer, and each layer has a bias term, $\mathbf{a}$ and $\mathbf{b}$, respectively. Due to the symmetry of the weight matrix $\mathbf{W}$, it is possible to reconstruct the input observations from the hidden representations. Hence, an RBM is naturally regarded as an autoencoder [26], and these favorable characteristics are used in RBM parameter learning [26]. In an RBM, the joint probability of $(\mathbf{v}, \mathbf{h})$ is given by

    $P(\mathbf{v}, \mathbf{h}; \Theta) = \dfrac{1}{Z(\Theta)} \exp\big(-E(\mathbf{v}, \mathbf{h}; \Theta)\big)$    (1.30)

    where $\Theta = \{\mathbf{W}, \mathbf{a}, \mathbf{b}\}$, $E(\mathbf{v}, \mathbf{h}; \Theta)$ is an energy function, and $Z(\Theta) = \sum_{\mathbf{v}} \sum_{\mathbf{h}} \exp\big(-E(\mathbf{v}, \mathbf{h}; \Theta)\big)$ is a partition function that can be obtained by summing over all possible pairs of $\mathbf{v}$ and $\mathbf{h}$. For the sake of simplicity, by assuming binary visible and hidden units, which is the commonly studied case, the energy function is defined as

    $E(\mathbf{v}, \mathbf{h}; \Theta) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top}\mathbf{W}\mathbf{h}$    (1.31)

    The conditional distribution of the hidden units given the visible units and also the conditional distribution of the visible units given the hidden units are respectively computed as

    $P(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big)$    (1.32)

    $P(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$    (1.33)

    where $\sigma(\cdot)$ is a logistic sigmoid function. Due to the unobservable hidden units, the objective function is defined as the marginal distribution of the visible units as

    $P(\mathbf{v}; \Theta) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h}; \Theta)$    (1.34)

    The RBM parameters are usually trained using a contrastive divergence algorithm [27] that maximizes the log-likelihood of observations.
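    The following hedged NumPy sketch performs one contrastive-divergence (CD-1) update for a binary RBM, using the conditionals in Eqs. (1.32)–(1.33); the sizes and learning rate are illustrative assumptions:

```python
# Hedged NumPy sketch of a single CD-1 update for a binary RBM.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, M, eta = 6, 4, 0.05
W = rng.normal(scale=0.1, size=(D, M))               # symmetric visible-hidden weights
a, b = np.zeros(D), np.zeros(M)                      # visible and hidden biases

v0 = rng.integers(0, 2, size=D).astype(float)        # an observed binary sample
ph0 = sigmoid(b + v0 @ W)                            # P(h = 1 | v0), Eq. (1.32)
h0 = (rng.random(M) < ph0).astype(float)             # sample hidden states
pv1 = sigmoid(a + W @ h0)                            # reconstruction, Eq. (1.33)
v1 = (rng.random(D) < pv1).astype(float)
ph1 = sigmoid(b + v1 @ W)

# CD-1 gradient approximation: positive phase minus negative phase
W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
a += eta * (v0 - v1)
b += eta * (ph0 - ph1)
```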

    1.5.2 Deep belief network

    Since an RBM is a kind of autoencoder, it is straightforward to stack multiple RBMs to construct a deep architecture, similar to the stacked autoencoder (SAE) covered later in Section 1.6.1, which results in a single probabilistic model called a deep belief network (DBN). That is, a DBN has one visible layer $\mathbf{v}$ and a series of hidden layers $\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}$. Between any two consecutive layers, let $\Theta^{(l)}$ denote the corresponding RBM parameters. Note that while the top two layers still form an undirected generative model, i.e., an RBM, the lower layers form directed generative models. Hence, the joint distribution of the observed units $\mathbf{v}$ and the $L$ hidden layers $\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}$ in a DBN is given as follows:

    $P\big(\mathbf{v}, \mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}\big) = \left(\prod_{l=0}^{L-2} P\big(\mathbf{h}^{(l)} \mid \mathbf{h}^{(l+1)}\big)\right) P\big(\mathbf{h}^{(L-1)}, \mathbf{h}^{(L)}\big)$    (1.35)

    where $\mathbf{h}^{(0)} = \mathbf{v}$, $P\big(\mathbf{h}^{(l)} \mid \mathbf{h}^{(l+1)}\big)$ corresponds to a conditional distribution for the units of the layer $l$ given the units of the layer $l+1$, and $P\big(\mathbf{h}^{(L-1)}, \mathbf{h}^{(L)}\big)$ denotes the joint distribution of the units in the layers $L-1$ and $L$.

    As for the parameter learning, the pretraining scheme described in Section 1.6.1 can also be applied as follows:

    (i)  Train the first layer as an RBM with the raw input $\mathbf{v} = \mathbf{h}^{(0)}$ as its visible layer.

    (ii)  Use the first layer to obtain a representation of the input that will be used as observation for the second layer, i.e., either the mean activations $P\big(\mathbf{h}^{(1)} = 1 \mid \mathbf{h}^{(0)}\big)$ or samples drawn from $P\big(\mathbf{h}^{(1)} \mid \mathbf{h}^{(0)}\big)$.

    (iii)  Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of the RBM).

    (iv)  Iterate (ii) and (iii) for the desired number of layers, each time propagating upward either samples or mean activations.

    This greedy layerwise training of the DBN can be justified as increasing a variational lower bound on the log-likelihood of the data [26]. After the greedy layerwise procedure is completed, it is possible to perform generative fine-tuning using the wake-sleep algorithm [28]; in practice, however, no further procedure is usually carried out to train the whole DBN jointly. To use a DBN for classification, the trained DBN can be directly used to initialize a deep neural network with the trained weights and biases. Then the deep neural network can be fine-tuned by means of backpropagation and (stochastic) gradient descent.

    1.5.3 Deep Boltzmann machine

    A deep Boltzmann machine (DBM) is also structured by stacking multiple RBMs in a hierarchical manner. However, unlike the DBN, all the layers in a DBM still form an undirected generative model after stacking the RBMs. For a classification task, a DBM replaces the RBM at the top hidden layer with a discriminative RBM [29]. That is, the top hidden layer is now connected to both the lower hidden layer and an additional label layer (the label of the input). In order to learn the parameters, including the connectivities among hidden layers and the additional connectivity between the top hidden layer and the label layer, we maximize the log-likelihood of the observed data (i.e., the visible data and a class label) with a gradient-based optimization strategy. In this way, a DBM can be trained to discover hierarchical and discriminative feature representations [29]. Similar to the DBN, a greedy layerwise pretraining strategy can be applied to provide a good initial configuration of the parameters, which helps the learning procedure converge much faster than random initialization. However, since the DBM integrates both bottom-up and top-down information, the first and last RBMs in the network need modification, using weights twice as big in one direction. Training then alternates iteratively between a variational mean-field approximation to estimate the posterior probabilities of the hidden units and stochastic approximation to update the model parameters.

    1.5.4 Variational autoencoder

    1.5.4.1 Autoencoder

    An autoencoder, also called an auto-associator, is a special type of two-layer neural network composed of an input layer, a hidden layer and an output layer. The input layer is fully connected to the hidden layer (i.e., an encoder), which is further fully connected to the output layer (i.e., a decoder), as illustrated in Fig. 1.6(a). Depending on the nature of the input data, the choice of the underlying network type is quite broad, ranging from a straightforward MLP to CNNs and graph neural networks (GNNs). In general, the aim of an autoencoder is to learn a latent or compressed representation of the input by minimizing the reconstruction error between the input and the values reconstructed from the learned representation.

    Figure 1.6 A graphical illustration of (a) an autoencoder, (b) variational autoencoder, and (c) stacked autoencoder. Note that the dashed red arrows indicate the reparameterization trick.

    Let $M$ and $D$ denote the number of hidden units and the number of input units in a neural network, respectively. An autoencoder maps an input $\mathbf{x} \in \mathbb{R}^{D}$ to a latent representation $\mathbf{z} \in \mathbb{R}^{M}$ through a linear mapping and then a nonlinear transformation with a nonlinear activation function $f$ as follows:

    $\mathbf{z} = f\big(\mathbf{W}\mathbf{x} + \mathbf{b}\big)$    (1.36)

    where $\mathbf{W} \in \mathbb{R}^{M \times D}$ is an encoding weight matrix and $\mathbf{b}$ is a bias vector. The representation $\mathbf{z}$ of the hidden layer is then mapped back to a vector $\hat{\mathbf{x}}$, which approximately reconstructs the input vector $\mathbf{x}$, by another mapping as follows:

    $\hat{\mathbf{x}} = g\big(\mathbf{W}'\mathbf{z} + \mathbf{b}'\big)$    (1.37)

    where $\mathbf{W}' \in \mathbb{R}^{D \times M}$ and $\mathbf{b}'$ are a decoding weight matrix and a bias vector, respectively, and $g$ is the decoder's activation function. Structurally, the number of input units and the number of output units are determined by the dimension of an input vector. Meanwhile, the number of hidden units can be determined based on the nature of the data. If the number of hidden units is less than the dimension of the input data, then the autoencoder can be used for dimensionality reduction. However, it is worth noting that, to capture complicated nonlinear relations among input features, it is possible to allow the number of hidden units to be even larger than the input dimension, in which case we can still find an interesting structure by imposing a sparsity constraint [30,31].

    From a learning perspective, the goal of an autoencoder is to minimize the reconstruction error between the input $\mathbf{x}$ and the output $\hat{\mathbf{x}}$ with respect to the parameters. Given a training set $\{\mathbf{x}_n\}_{n=1}^{N}$, let $\sum_{n=1}^{N} \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|_2^2$ denote a reconstruction error over the training samples. To encourage sparseness of the hidden units, it is common to use the Kullback–Leibler (KL) divergence to measure the difference between the average activation $\hat{\rho}_j$ of the $j$th hidden unit over the training samples and the target average activation $\rho$, defined as [32]

    $\operatorname{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \ln\dfrac{\rho}{\hat{\rho}_j} + (1-\rho)\ln\dfrac{1-\rho}{1-\hat{\rho}_j}$    (1.38)

    Then our objective function can be written as

    $\mathcal{J} = \sum_{n=1}^{N} \big\|\mathbf{x}_n - \hat{\mathbf{x}}_n\big\|_2^2 + \gamma \sum_{j=1}^{M} \operatorname{KL}(\rho \,\|\, \hat{\rho}_j)$    (1.39)

    where $\gamma$ denotes a sparsity control parameter. With the introduction of the KL divergence and a small target activation $\rho$, the error function penalizes large average activations of the hidden units over the training samples. This penalization drives the activations of many hidden units to be equal or close to zero, yielding sparse connections between layers.
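    A hedged NumPy sketch of the sparse autoencoder objective in Eqs. (1.36)–(1.39), combining the reconstruction error with the KL sparsity penalty; the values of rho and gamma and all shapes are assumptions:

```python
# Illustrative NumPy sketch of the sparse autoencoder objective, Eq. (1.39).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_loss(X, W, b, W_dec, b_dec, rho=0.05, gamma=0.1):
    Z = sigmoid(X @ W + b)                    # hidden representations, Eq. (1.36)
    X_hat = sigmoid(Z @ W_dec + b_dec)        # reconstructions, Eq. (1.37)
    recon = np.sum((X - X_hat) ** 2)          # reconstruction error over the set
    rho_hat = Z.mean(axis=0)                  # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (1.38)
    return recon + gamma * kl                 # Eq. (1.39)

rng = np.random.default_rng(0)
N, D, M = 100, 20, 30                         # over-complete hidden layer (M > D)
X = rng.random((N, D))
loss = sparse_ae_loss(X, rng.normal(scale=0.1, size=(D, M)), np.zeros(M),
                      rng.normal(scale=0.1, size=(M, D)), np.zeros(D))
```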

    1.5.4.2 Variational autoencoder

    We can further extend the concept of the autoencoder into a deep generative model by means of the variational autoencoder (VAE), as illustrated in Fig. 1.6(b). In contrast with the vanilla autoencoder, a VAE takes into account a prior distribution $p(\mathbf{z})$ over the latent representation $\mathbf{z}$, in which we assume that such a latent vector governs the generation of the data $\mathbf{x}$ through a conditional distribution $p_\theta(\mathbf{x} \mid \mathbf{z})$. Furthermore, a typical VAE approximates the intractable true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$ by introducing an approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ using a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2\mathbf{I})$. Such mean and variance are inferred from the respective encoding neural networks (i.e., the encoder $q_\phi$) as

    $[\boldsymbol{\mu}, \boldsymbol{\sigma}] = \operatorname{Encoder}_\phi(\mathbf{x})$    (1.40)

    with $\phi$ denoting the parameters of such networks. To obtain the latent representation $\mathbf{z}$, we draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and further apply the reparameterization trick [33] such that

    $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$    (1.41)

    Such a trick is necessary to enable the optimization of the network's parameters via gradient-based approaches. Furthermore, we can generate $\hat{\mathbf{x}}$ by passing the latent representation $\mathbf{z}$ through a decoding neural network with parameters $\theta$ as

    $\hat{\mathbf{x}} = \operatorname{Decoder}_\theta(\mathbf{z})$    (1.42)

    Finally, the VAE is trained to optimize the variational evidence lower bound (ELBO) through the objective function in Eq. (1.43), consisting of an expected reconstruction error as well as a KL divergence term that forces the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ to be as close as possible to the prior $p(\mathbf{z})$:

    $\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\ln p_\theta(\mathbf{x} \mid \mathbf{z})\big] - \operatorname{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)$    (1.43)
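    The following NumPy sketch illustrates the reparameterization trick and a single-sample ELBO estimate for a Gaussian VAE (Eqs. (1.40)–(1.43)). The toy "encoder" and "decoder" are plain linear maps standing in for real neural networks, and a Gaussian likelihood is assumed for the reconstruction term:

```python
# Hedged NumPy sketch of the reparameterization trick and a Monte-Carlo ELBO estimate.
import numpy as np

rng = np.random.default_rng(0)
D, M = 10, 2
x = rng.normal(size=D)

We, Wd = rng.normal(scale=0.1, size=(2 * M, D)), rng.normal(scale=0.1, size=(D, M))
enc = We @ x
mu, log_var = enc[:M], enc[M:]                      # Eq. (1.40): encoder outputs

eps = rng.normal(size=M)                            # eps ~ N(0, I)
z = mu + np.exp(0.5 * log_var) * eps                # Eq. (1.41): z = mu + sigma * eps

x_hat = Wd @ z                                      # Eq. (1.42): decoder output
recon = -np.sum((x - x_hat) ** 2)                   # Gaussian log-likelihood up to a constant
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)  # KL(q || N(0, I))
elbo = recon - kl                                   # Eq. (1.43), to be maximized
```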

    1.5.5 Generative adversarial network

    Recently, the generative adversarial network (GAN), a deep learning-based implicit density estimation model, has demonstrated a remarkable generation capability by learning deep representations of the data distribution without labels [34]. As conceptualized in Fig. 1.7, a GAN is composed of two neural networks: (i) a generator $G$, which tries to synthesize realistic samples $G(\mathbf{z})$ using a latent code vector $\mathbf{z}$; and (ii) a discriminator $D$, which learns to discriminate the real sample $\mathbf{x}$ from the generated one, i.e., $G(\mathbf{z})$, by estimating the probability that the input is real. To simultaneously optimize those two neural networks $G$ and $D$, a GAN uses a game-theoretic min-max objective function:

    $\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]$    (1.44)

    where $p_{\text{data}}(\mathbf{x})$ and $p_{\mathbf{z}}(\mathbf{z})$ denote the real data distribution and the latent code distribution, respectively. Mathematically, optimizing Eq. (1.44) amounts to minimizing the Jensen–Shannon distance (JSD) between the two distributions, i.e., the real and the generated data distributions. Note that $V(D,G)$ is minimized by the generator when $D(G(\mathbf{z}))$ comes close to 1, i.e., the generator makes realistic samples, and is maximized by the discriminator when $D(\mathbf{x})$ goes to 1 while $D(G(\mathbf{z}))$ reaches 0; the discriminator therefore tries to correctly distinguish real and fake samples. Although the GAN has shown promising generation performance, there is still room for improvement through modifications of the loss function [35,36]. In this regard, attempts to exploit other distances for the GAN loss function instead of the JSD have gained widespread attention from deep learning researchers.

    Figure 1.7 Illustration of a generative adversarial network.
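    A minimal PyTorch sketch of one adversarial update following Eq. (1.44). The tiny MLP generator and discriminator and all hyperparameters are illustrative; the generator step uses the common non-saturating variant (maximizing log D(G(z))) rather than the exact min-max form:

```python
# Hedged PyTorch sketch of one GAN update (discriminator step, then generator step).
import torch
import torch.nn as nn

D_in, Z_dim = 16, 8
G = nn.Sequential(nn.Linear(Z_dim, 32), nn.ReLU(), nn.Linear(32, D_in))
D = nn.Sequential(nn.Linear(D_in, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.randn(64, D_in)              # stand-in for a batch of real samples
z = torch.randn(64, Z_dim)                  # latent codes

# discriminator step: push D(x) toward 1 and D(G(z)) toward 0
opt_d.zero_grad()
loss_d = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
loss_d.backward()
opt_d.step()

# generator step: non-saturating objective, push D(G(z)) toward 1
opt_g.zero_grad()
loss_g = bce(D(G(z)), torch.ones(64, 1))
loss_g.backward()
opt_g.step()
```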

    Mao et al. [35] slightly modified the GAN loss function and named their method least-squares GAN (LSGAN). More specifically, they minimized the Pearson $\chi^2$ divergence between the real and the synthesized data distributions. To do so, they modified the loss to

    $\min_D V_{\text{LSGAN}}(D) = \dfrac{1}{2}\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\big[(D(\mathbf{x}) - b)^2\big] + \dfrac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[(D(G(\mathbf{z})) - a)^2\big]$    (1.45)

    $\min_G V_{\text{LSGAN}}(G) = \dfrac{1}{2}\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[(D(G(\mathbf{z})) - c)^2\big]$    (1.46)

    and set $b - c = 1$ while $b - a = 2$, under which minimizing Eqs. (1.45)–(1.46) corresponds to minimizing the Pearson $\chi^2$ divergence. This modified objective function gives a greater gradient value to fake samples that are farther from the decision boundary of real samples, thereby suppressing the gradient vanishing problem.

    Similar to LSGAN [35], Arjovsky et al. [36] also focused on replacing the JSD with another distance. They showed that the Wasserstein distance can be applied to the GAN objective function in a mathematically rigorous manner and proposed a modified loss function:

    $\min_G \max_{C} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\big[C(\mathbf{x})\big] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[C(G(\mathbf{z}))\big]$    (1.47)

    where the critic $C$ is a 1-Lipschitz function that is used instead of the discriminator. In this objective, the critic scores the realness or fakeness of the input, whereas the discriminator estimates the probability that the input is real. To make the critic satisfy the Lipschitz constraint, Arjovsky et al. used weight clipping on the critic, and this method is widely known as the Wasserstein GAN (WGAN). On the other hand, Gulrajani et al. [37] removed the weight clipping by adding a regularization term, the so-called gradient penalty (GP). The objective function of WGAN with GP is

    $V_{\text{WGAN-GP}} = \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[C(G(\mathbf{z}))\big] - \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\big[C(\mathbf{x})\big] + \lambda\, \mathbb{E}_{\hat{\mathbf{x}}}\Big[\big(\|\nabla_{\hat{\mathbf{x}}} C(\hat{\mathbf{x}})\|_2 - 1\big)^2\Big]$    (1.48)

    where $\|\cdot\|_2$ is the $\ell_2$-norm and $\hat{\mathbf{x}}$ is defined as

    $\hat{\mathbf{x}} = \epsilon\,\mathbf{x} + (1 - \epsilon)\, G(\mathbf{z}), \quad \epsilon \sim U[0, 1]$    (1.49)

    Here, Gulrajani et al. penalize the norm of the critic network's gradients with respect to its input. By doing so, WGAN with GP can also satisfy the Lipschitz condition.
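    A hedged PyTorch sketch of the gradient penalty term in Eqs. (1.48)–(1.49); the toy critic and the choice of lambda = 10 are illustrative assumptions:

```python
# Hedged PyTorch sketch of the WGAN gradient penalty on interpolated samples.
import torch
import torch.nn as nn

D_in, lam = 16, 10.0
critic = nn.Sequential(nn.Linear(D_in, 32), nn.ReLU(), nn.Linear(32, 1))

x_real = torch.randn(64, D_in)
x_fake = torch.randn(64, D_in)              # stand-in for G(z)

eps = torch.rand(64, 1)                     # epsilon ~ U[0, 1] per sample
x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)   # Eq. (1.49)

c_out = critic(x_hat)
grads = torch.autograd.grad(outputs=c_out, inputs=x_hat,
                            grad_outputs=torch.ones_like(c_out),
                            create_graph=True)[0]
gp = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()        # gradient penalty term

critic_loss = critic(x_fake).mean() - critic(x_real).mean() + lam * gp   # Eq. (1.48)
```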

    1.6 Tricks for better learning

    Earlier, LeCun et al. showed that by transforming the data to have an identity covariance and a zero mean, i.e., data whitening, network training converges faster [38,39]. Besides such a simple trick, recent studies have devised further tricks for better training of deep models.

    1.6.1 Parameter initialization in autoencoder

    Regarding the autoencoder, note that the outputs of the units in the hidden layer of the encoding network become the latent representation of the input vector. However, due to its simple and shallow structure, the representational power of a single-layer autoencoder is known to be very limited. When multiple autoencoders are stacked, by taking the activation values of the hidden units of one autoencoder as the input to the following upper autoencoder, building an SAE (Fig. 1.6(c)), it is possible to improve the representational power greatly [40]. Thanks to the hierarchical structure, one of the most important characteristics of the SAE is its ability to learn or discover highly nonlinear and complicated patterns such as the relations among input features. When an input vector is presented to an SAE, the different layers of the network represent different levels of information: the lower a layer in the network, the simpler the patterns that are learned; the higher the layer, the more complicated or abstract the patterns inherent in the input feature vector.

    With regard to training the weight matrices and biases of an SAE, a straightforward way is to apply backpropagation with a gradient-based optimization technique starting from random initialization, regarding the SAE as a conventional multi-layer neural network. Unfortunately, it is generally known that deep networks trained in this manner perform worse than networks with a shallow architecture, suffering from falling into a poor local optimum [31]. A greedy layerwise learning [26] can be used to circumvent this problem. The key idea of greedy layerwise learning is to train one layer at a time by maximizing the variational lower bound. That is, we first train the 1st hidden layer with the training data as input, then train the 2nd hidden layer with the outputs from the 1st hidden layer as input, and so on; the representation of the $l$th hidden layer is used as input for the $(l+1)$th hidden layer. This greedy layerwise learning is performed as "pretraining" (Figs. 1.8(a)–1.8(c)). An important feature of pretraining is that it is conducted in an unsupervised manner with a standard backpropagation algorithm [41]. When it comes to a classification problem, we stack another output layer on top of the SAE (Fig. 1.8(d)) with an appropriate activation function; this top output layer is used to represent the class label of an input sample. Then, by taking the pretrained connection weights as the initial parameters for the hidden units and randomly initializing the connection weights between the top hidden layer and the output layer, it is possible to train all the parameters jointly in a supervised manner by gradient descent with a backpropagation algorithm. Note that the initialization of the parameters via pretraining helps the supervised optimization, called "fine-tuning", reduce the risk of falling into poor local optima [26,31].

    Figure 1.8 Greedy layerwise pretraining (highlighted with the blue connections in (a–c)) and fine-tuning of the whole network. ($H_i$ denotes the $i$th hidden layer in the network.)
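    A hedged NumPy sketch of the greedy layerwise pretraining described above: each layer is trained as a single-layer autoencoder on the activations of the layer below, and the learned encoder weights would then initialize the network for supervised fine-tuning. Sizes, step counts and the learning rate are illustrative assumptions:

```python
# Hedged NumPy sketch of greedy layerwise pretraining of a stacked autoencoder.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, M, steps=200, eta=0.1, seed=0):
    """Train one autoencoder layer to reconstruct X; return encoder weights and codes."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W, b = rng.normal(scale=0.1, size=(D, M)), np.zeros(M)
    W_dec, b_dec = rng.normal(scale=0.1, size=(M, D)), np.zeros(D)
    for _ in range(steps):
        Z = sigmoid(X @ W + b)
        X_hat = sigmoid(Z @ W_dec + b_dec)
        dX_hat = (X_hat - X) * X_hat * (1 - X_hat)       # squared-error gradient
        dZ = (dX_hat @ W_dec.T) * Z * (1 - Z)
        W_dec -= eta * Z.T @ dX_hat / len(X); b_dec -= eta * dX_hat.mean(0)
        W -= eta * X.T @ dZ / len(X); b -= eta * dZ.mean(0)
    return W, b, sigmoid(X @ W + b)

rng = np.random.default_rng(0)
X = rng.random((256, 30))                 # training data
pretrained = []
H = X
for M in (20, 10):                        # two hidden layers, trained one at a time
    W, b, H = pretrain_layer(H, M)        # outputs of layer l feed layer l+1
    pretrained.append((W, b))             # used later to initialize fine-tuning
```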

    1.6.2 Activation functions

    In a deep learning framework, the main purpose of the activation function is to introduce nonlinearity into deep neural networks. The nonlinearity means that the output of the neural network cannot be reproduced from affine transformations, i.e., the output must be different from a linear combination of the input values.

    There are classical nonlinear activations such as the logistic sigmoid and the hyperbolic tangent, but the gradients of these functions vanish as the value of the input increases or decreases, which is known as one of the sources of the vanishing gradient problem. In this regard, Nair and Hinton suggested using the rectified linear unit (ReLU) function [2]. The ReLU function passes only positive input values:

    $f(x) = \max(0, x)$    (1.50)

    thereby improving training time by alleviating the vanishing gradient problem. However, the ReLU has two mathematical problems: (i) it is nondifferentiable at $x = 0$, and thus not strictly valid for use with a gradient-based method; (ii) it is unbounded on the positive side, which can potentially cause overfitting. Nonetheless, as for the first problem, since it is highly unlikely that the input to any hidden unit will be exactly $0$ at any time, in practice the gradient of the ReLU at $x = 0$ is set to either 0 or 1. Regarding the unboundedness, applying a regularization technique helps limit the magnitude of the weights, thus circumventing the overfitting issue. The curve of the ReLU function is depicted in Fig. 1.9(a).

    Figure 1.9 Plots of (a) rectified linear unit (ReLU), (b) leaky ReLU, (c) exponential linear unit (ELU) and (d) Swish activation functions.
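    For completeness, a short NumPy definition of the ReLU in Eq. (1.50), together with one common subgradient convention at x = 0 (choosing 0 there), as discussed above:

```python
# Illustrative NumPy definition of the ReLU and its subgradient convention at 0.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for x > 0 and 0 for x < 0; at x == 0 we choose 0 by convention
    return (x > 0).astype(float)

x = np.linspace(-3, 3, 7)
print(relu(x), relu_grad(x))
```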

    Since the ReLU activation showed its power, many variants have been proposed for more robust and sound learning. The leaky ReLU (lReLU) is one of the improved versions of the ReLU function [3]. For the ReLU function, the gradient is 0 for $x < 0$, which deactivates the units in the negative region. The leaky ReLU function slightly activates negative inputs to address this problem and is defined
