Bayesian Learning: Fundamentals and Applications
About this ebook

What Is Bayesian Learning


In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. The expectation (E) step of each EM iteration constructs a function for the expectation of the log-likelihood evaluated using the current parameter estimate, and the maximization (M) step computes parameters that maximize the expected log-likelihood found in the E step. These parameter estimates are then used in the next E step to determine the distribution of the latent variables.


How You Will Benefit


(I) Insights, and validations about the following topics:


Chapter 1: Expectation-maximization algorithm


Chapter 2: Likelihood function


Chapter 3: Maximum likelihood estimation


Chapter 4: Logistic regression


Chapter 5: Exponential family


Chapter 6: Fisher information


Chapter 7: Generalized linear model


Chapter 8: Mixture model


Chapter 9: Variational Bayesian methods


Chapter 10: EM algorithm and GMM model


(II) Answering the public's top questions about Bayesian learning.


(III) Real-world examples of the use of Bayesian learning in many fields.


Who This Book Is For


Professionals, undergraduate and graduate students, enthusiasts, hobbyists, and anyone who wants to go beyond basic knowledge about any kind of Bayesian learning.


What is Artificial Intelligence Series


The artificial intelligence book series provides comprehensive coverage in over 200 topics. Each ebook covers a specific Artificial Intelligence topic in depth, written by experts in the field. The series aims to give readers a thorough understanding of the concepts, techniques, history and applications of artificial intelligence. Topics covered include machine learning, deep learning, neural networks, computer vision, natural language processing, robotics, ethics and more. The ebooks are written for professionals, students, and anyone interested in learning about the latest developments in this rapidly advancing field.
The artificial intelligence book series provides an in-depth yet accessible exploration, from the fundamental concepts to the state-of-the-art research. With over 200 volumes, readers gain a thorough grounding in all aspects of Artificial Intelligence. The ebooks are designed to build knowledge systematically, with later volumes building on the foundations laid by earlier ones. This comprehensive series is an indispensable resource for anyone seeking to develop expertise in artificial intelligence.

Language: English
Release date: Jul 1, 2023
    Book preview

    Bayesian Learning - Fouad Sabry

    Chapter 1: Expectation–maximization algorithm

    To find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables, statisticians frequently turn to expectation-maximization (EM) algorithms. A function for the expectation of the log-likelihood evaluated using the current estimate for the parameters is constructed in the expectation (E) step of the EM iteration, and parameters that maximize the expected log-likelihood are computed in the maximization (M) step. The E step makes use of these parameter estimates to find out how the latent variables are distributed.

    In a seminal 1977 paper, Arthur Dempster, Nan Laird, and Donald Rubin described and named the EM algorithm, generalizing the approach and sketching a convergence analysis for a wide class of problems. Their paper established EM as an important tool of statistical analysis. See also Meng and van Dyk (1997).

    In 1983, C. F. Jeff Wu published a correct convergence analysis, showing that the convergence argument in the Dempster-Laird-Rubin paper was flawed. Wu's proof established the Dempster-Laird-Rubin claim that the EM method converges, and did so even outside of the exponential family.

    When the equations of a statistical model cannot be solved directly, the EM algorithm is used to find (local) maximum likelihood parameters. These models typically involve latent variables in addition to unknown parameters and observed data. That is, either some values in the data are missing, or the model can be expressed more simply by positing additional unobserved data points. For example, a mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which it belongs.

    Maximum likelihood solutions are typically found by solving a system of simultaneous equations involving the likelihood function and all of the unknown values, parameters, and latent variables. This is typically not possible in statistical models involving latent variables: the solution for the parameters usually requires the values of the latent variables and vice versa, and substituting one set of equations into the other produces an unsolvable system.

    The EM algorithm proceeds from the observation that these two sets of equations can be solved numerically. One can pick arbitrary values for one set of unknowns, use them to estimate the other set, then use the new values to obtain a better estimate of the first set, and keep alternating until the estimates converge to a fixed point. It is not obvious that this works, but it can be proven that it does, and that the derivative of the likelihood is (arbitrarily close to) zero at that point, which means the point is either a local maximum or a saddle point. There is no assurance that the global maximum will be found, and multiple maxima are possible. Some likelihoods also have singularities, i.e., nonsensical maxima. For instance, EM in a mixture model may find a solution in which one component has zero variance and a mean parameter equal to one of the data points.

    Given a statistical model which generates a set $\mathbf{X}$ of observed data, a set of unobserved latent data or missing values $\mathbf{Z}$, and a vector of unknown parameters $\boldsymbol{\theta}$, along with a likelihood function $L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data

    $$L(\boldsymbol{\theta}; \mathbf{X}) = p(\mathbf{X} \mid \boldsymbol{\theta}) = \int p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\, d\mathbf{Z} = \int p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol{\theta})\, p(\mathbf{Z} \mid \boldsymbol{\theta})\, d\mathbf{Z}.$$

    However, this quantity is often intractable, since $\mathbf{Z}$ is unobserved and the distribution of $\mathbf{Z}$ is unknown before $\boldsymbol{\theta}$ is obtained.
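    As a concrete illustration (a standard example assumed here, not taken from the text), consider a mixture of $K$ univariate Gaussians with mixing weights $\pi_k$, means $\mu_k$, and variances $\sigma_k^2$. The marginal log-likelihood of $n$ observations $x_1, \dots, x_n$ is

    $$\log L(\boldsymbol{\theta}; \mathbf{X}) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2),$$

    and the logarithm of a sum prevents the maximizing equations from being solved in closed form, which is exactly the situation EM is designed for.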

    The EM algorithm uses these two procedures in an iterative fashion to determine the MLE of the marginal likelihood:

    Expectation step (E step): Define $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ as the expected value of the log likelihood function of $\boldsymbol{\theta}$, with respect to the current conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ and the current estimates of the parameters $\boldsymbol{\theta}^{(t)}$:

    $$Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})}\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\right].$$

    Maximization step (M step): Find the parameters that maximize this quantity:

    $$\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\operatorname{arg\,max}}\; Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}).$$

    It can be expressed more simply as a single equation:

    $$\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\operatorname{arg\,max}}\; \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})}\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\right].$$
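    Continuing the Gaussian-mixture illustration from above (again an assumed example rather than part of the original text), both steps have closed forms. The E step computes responsibilities, and $Q$ becomes a responsibility-weighted complete-data log-likelihood:

    $$\gamma_{ik}^{(t)} = p(z_i = k \mid x_i, \boldsymbol{\theta}^{(t)}) = \frac{\pi_k^{(t)}\, \mathcal{N}(x_i \mid \mu_k^{(t)}, \sigma_k^{2\,(t)})}{\sum_{j=1}^{K} \pi_j^{(t)}\, \mathcal{N}(x_i \mid \mu_j^{(t)}, \sigma_j^{2\,(t)})}, \qquad Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik}^{(t)} \left[\log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)\right].$$

    The M step then maximizes $Q$ by setting each $\pi_k$ to the average responsibility of component $k$, and each $(\mu_k, \sigma_k^2)$ to the responsibility-weighted mean and variance of the data.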

    The typical models to which EM is applied use $\mathbf{Z}$ as a latent variable indicating membership in one of a set of groups:

    The observed data points $\mathbf{X}$ may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set).

    Each data point may have an accompanying vector of observations.

    The missing values (i.e., latent variables) $\mathbf{Z}$ are discrete, drawn from a fixed number of values, with one latent variable per observed data point.

    The parameters are continuous and of two kinds: those associated with all data points and those associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value).

    However, EM can also be applied to other sorts of models.

    The motivation is as follows.

    If the value of the parameters $\boldsymbol{\theta}$ is known, usually the value of the latent variables $\mathbf{Z}$ can be found by maximizing the log-likelihood over all possible values of $\mathbf{Z}$, either simply by iterating over $\mathbf{Z}$ or through an algorithm such as the Viterbi algorithm for hidden Markov models.

    Conversely, if we know the value of the latent variables $\mathbf{Z}$, we can find an estimate of the parameters $\boldsymbol{\theta}$ fairly easily, typically by grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group.

    This suggests an iterative algorithm for the case where both $\boldsymbol{\theta}$ and $\mathbf{Z}$ are unknown:

    First, initialize the parameters $\boldsymbol{\theta}$ to some random values.

    Compute the probability of each possible value of $\mathbf{Z}$, given $\boldsymbol{\theta}$.

    Then, use the just-computed values of $\mathbf{Z}$ to compute a better estimate for the parameters $\boldsymbol{\theta}$.

    Repeat steps 2 and 3 until the estimates converge, as sketched in the example below.
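    The following is a minimal sketch of this loop for a two-component univariate Gaussian mixture. The synthetic data, initial values, and convergence tolerance are illustrative assumptions rather than anything from the text; the structure of the loop follows steps 1-4 above.

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic observed data X: two Gaussian clusters whose labels Z are hidden.
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.7, 200)])

    # Step 1: initialize theta = (weights, means, variances) to rough values.
    w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    def component_densities():
        # p(x_i, z_i = k | theta) for every data point i and component k.
        return w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    prev = -np.inf
    for it in range(200):
        dens = component_densities()

        # Step 2 (E step): probability of each value of Z given the current theta.
        resp = dens / dens.sum(axis=1, keepdims=True)

        # Step 3 (M step): better estimate of theta from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

        # Step 4: repeat until the observed-data log-likelihood stops improving.
        log_lik = np.log(component_densities().sum(axis=1)).sum()
        if log_lik - prev < 1e-8:
            break
        prev = log_lik

    print(f"after {it + 1} iterations: weights={w}, means={mu}, variances={var}")

    With enough data, the recovered parameters typically land near the values used to generate the sample, up to relabeling of the components.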

    The algorithm just described converges monotonically to a local optimum of the objective; each iteration is guaranteed not to decrease the observed-data likelihood.

    Although an EM iteration does increase the observed-data (i.e., marginal) likelihood function, there is no guarantee that the sequence converges to a maximum likelihood estimator.

    For multimodal distributions, this means that an EM algorithm may converge to a local maximum of the observed-data likelihood function, depending on the starting values.

    Many different heuristic and metaheuristic strategies exist for breaking out of a local maximum, such as random-restart hill climbing (starting from several different random initial estimates $\boldsymbol{\theta}^{(t)}$) or using techniques like simulated annealing.

    For a detailed discussion of EM and its applications, see Sundberg (2019, Chapter 8). EM is particularly convenient when the likelihood belongs to an exponential family: the E step then reduces to computing expected sufficient statistics, and the M step maximizes a linear function of them.

    In their original paper, Dempster, Laird, and Rubin adapted the EM technique to calculate maximum a posteriori (MAP) estimates for Bayesian inference.

    Other methods exist for finding maximum likelihood estimates, such as gradient descent, conjugate gradient, and variants of the Gauss-Newton algorithm. Unlike EM, these techniques typically require the first and/or second derivatives of the likelihood function.

    Expectation-Maximization works to improve $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ rather than directly improving $\log p(\mathbf{X} \mid \boldsymbol{\theta})$.

    Here, we demonstrate that enhancing the former necessarily enhances the latter.

    For any $\mathbf{Z}$ with non-zero probability $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$, we can write

    $$\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) - \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}).$$

    We take the expectation over possible values of the unknown data $\mathbf{Z}$ under the current parameter estimate $\boldsymbol{\theta}^{(t)}$ by multiplying both sides by $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$ and summing (or integrating) over $\mathbf{Z}$.

    The left-hand side is the expectation of a constant, so it follows that

    $$\begin{aligned}\log p(\mathbf{X} \mid \boldsymbol{\theta}) &= \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}) \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) - \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}) \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \\ &= Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}),\end{aligned}$$

    where $H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ is defined by the negated sum it is replacing.

    This last equation holds for every value of $\boldsymbol{\theta}$, including $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$:

    $$\log p(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}) = Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}).$$

    Subtracting this last equation from the preceding one gives

    $$\log p(\mathbf{X} \mid \boldsymbol{\theta}) - \log p(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}) = Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}).$$

    However, Gibbs' inequality tells us that

    $$H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \geq H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}),$$

    so we can conclude that

    $$\log p(\mathbf{X} \mid \boldsymbol{\theta}) - \log p(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}).$$

    In words, choosing $\boldsymbol{\theta}$ to improve $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ causes $\log p(\mathbf{X} \mid \boldsymbol{\theta})$ to improve at least as much.

    The EM algorithm can also be viewed as two alternating maximization steps, that is, as an instance of coordinate descent. Consider the function

    $$F(q, \theta) := \operatorname{E}_q[\log L(\theta; x, Z)] + H(q),$$

    where $q$ is an arbitrary probability distribution over the unobserved data $Z$ and $H(q)$ is the entropy of the distribution $q$. For this function, we have

    $$F(q, \theta) = -D_{\mathrm{KL}}\big(q \parallel p_{Z \mid X}(\cdot \mid x; \theta)\big) + \log L(\theta; x),$$

    where $p_{Z \mid X}(\cdot \mid x; \theta)$ is the conditional distribution of the unobserved data given the observed data $x$ and $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence.

    Then the EM algorithm's steps can be understood as:

    Expectation step (E step): Choose $q$ to maximize $F$:

    $$q^{(t)} = \operatorname{arg\,max}_{q}\; F(q, \theta^{(t)})$$

    Maximization step (M step): Choose $\theta$ to maximize $F$:

    $$\theta^{(t+1)} = \operatorname{arg\,max}_{\theta}\; F(q^{(t)}, \theta)$$
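    The decomposition of $F$ above can be checked numerically. The sketch below uses a toy model with a single observation and a binary latent variable; all of the probabilities are illustrative assumptions, chosen only so the identity and the E-step property are easy to verify.

    import numpy as np

    p_z = np.array([0.3, 0.7])          # p(Z | theta), an assumed prior over Z in {0, 1}
    p_x_given_z = np.array([0.9, 0.2])  # p(x | Z, theta) for the single observed x

    joint = p_z * p_x_given_z           # p(x, Z | theta)
    log_lik = np.log(joint.sum())       # log L(theta; x) = log p(x | theta)
    posterior = joint / joint.sum()     # p(Z | x, theta)

    def F(q):
        # Free energy: E_q[log p(x, Z | theta)] + entropy H(q).
        return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

    def kl(q, p):
        return np.sum(q * np.log(q / p))

    q = np.array([0.5, 0.5])            # an arbitrary distribution over Z
    assert np.isclose(F(q), -kl(q, posterior) + log_lik)

    # The E step chooses q to maximize F; the maximizer is the posterior itself,
    # where the KL term vanishes and F equals the observed-data log-likelihood.
    assert np.isclose(F(posterior), log_lik)
    print("F(q) =", F(q), " log-likelihood =", log_lik)

    Because the log-likelihood term does not depend on $q$, maximizing $F$ over $q$ is the same as minimizing the KL term, which is why the E step sets $q^{(t)}$ to the conditional distribution of $\mathbf{Z}$ given the data and the current parameters.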

    Parameter estimation in mixed models is a common application of EM. EM is also a crucial tool for psychometricians when estimating the item parameters and latent abilities of item response theory models.

    Because it can handle missing data and unobserved variables, EM is also becoming a useful tool for pricing and managing the risk of a portfolio.

    In medical imaging reconstruction, the EM algorithm (and its faster variant ordered subset expectation maximization) is frequently employed, particularly in positron emission tomography, single-photon emission computed tomography, and x-ray computed tomography. For additional, quicker EM variants, see the list below.

    The Structural Identification using Expectation Maximization (STRIDE) algorithm is an output-only method in structural engineering that determines the structural system's natural vibration properties from sensor data (see Operational Modal Analysis).

    Cluster analysis is another application of EM. In natural language processing, two prominent instances of the algorithm are the Baum-Welch algorithm for hidden Markov models and the inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.

    The EM algorithm has been very useful in the study of intertrade waiting times, or the intervals between stock trades on a stock exchange.

    On-line state estimation is typically performed using a Kalman filter, while off-line or batch state estimation may make use of a minimum-variance smoother. These minimum-variance solutions, however, call for parameter estimates from a state-space model. Joint state and parameter estimation problems are amenable to EM algorithms.

    By iterating this two-stage process, we can generate EM algorithms for filtering and smoothing:

    E-step

    Get new state estimates by running a Kalman filter or a minimum-variance smoother built with the latest parameter estimates.

    M-step

    Use the filtered or smoothed state estimates within maximum-likelihood calculations to obtain updated parameter estimates.

    Suppose a Kalman filter or minimum-variance smoother operates on measurements of a single-input, single-output system that contain additive white noise. An updated estimate of the measurement-noise variance can be obtained from the maximum likelihood calculation

    $$\widehat{\sigma}_v^{2} = \frac{1}{N} \sum_{k=1}^{N} (z_k - \widehat{x}_k)^2,$$

    where $\widehat{x}_k$ are scalar output estimates calculated by a filter or a smoother from $N$ scalar measurements $z_k$.

    The same update can also be applied to updating a Poisson measurement noise intensity.

    Similarly, for a first-order autoregressive process, an updated estimate of the process-noise variance can be calculated by

    $$\widehat{\sigma}_w^{2} = \frac{1}{N} \sum_{k=1}^{N} (\widehat{x}_{k+1} - \widehat{F}\,\widehat{x}_k)^2,$$

    where $\widehat{x}_k$ and $\widehat{x}_{k+1}$ are scalar state estimates calculated by a filter or a smoother.

    An updated estimate of the model coefficient is obtained via

    $$\widehat{F} = \frac{\sum_{k=1}^{N} \widehat{x}_{k+1}\,\widehat{x}_k}{\sum_{k=1}^{N} \widehat{x}_k^{2}}.$$
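    A minimal sketch of these three M-step updates, assuming the measurements and the filtered or smoothed state estimates are already available as NumPy arrays (the array names and the synthetic stand-in data are illustrative):

    import numpy as np

    def em_parameter_updates(z, x_hat):
        # Updated (sigma_v^2, F, sigma_w^2) for a scalar first-order state-space model,
        # computed from N measurements z and filtered/smoothed state estimates x_hat.
        sigma_v2 = np.mean((z - x_hat) ** 2)                               # measurement-noise variance
        F_hat = np.sum(x_hat[1:] * x_hat[:-1]) / np.sum(x_hat[:-1] ** 2)   # model coefficient
        sigma_w2 = np.mean((x_hat[1:] - F_hat * x_hat[:-1]) ** 2)          # process-noise variance
        return sigma_v2, F_hat, sigma_w2

    # Illustrative usage with synthetic arrays standing in for filter or smoother output.
    rng = np.random.default_rng(1)
    x_hat = 0.1 * rng.standard_normal(100).cumsum()
    z = x_hat + rng.normal(0.0, 0.5, size=x_hat.shape)
    print(em_parameter_updates(z, x_hat))

    In a full EM loop these updates form the M step; the E step reruns the filter or smoother with the new parameters, and the two steps alternate until the estimates settle.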

    The convergence of parameter estimates such as those above is well studied.

    Conjugate gradient and modified Newton's methods (Newton-Raphson) are two approaches that have been proposed to speed up the EM algorithm's sluggish convergence. In addition, constrained estimation methods can be used with EM.

    The parameter-expanded expectation maximization (PX-EM) algorithm is often found to provide a speedup by "us[ing] a covariance adjustment to correct the analysis of the M step, capitalizing on extra information captured in the imputed complete data".

    The EM algorithm uses a Q-function derived from the log likelihood.

    It is therefore regarded as the log-EM algorithm.

    The use of the log likelihood can be generalized to that of the α-log likelihood ratio.

    Then, the α-log likelihood ratio of the observed data can be exactly expressed as equality by using the Q-function of the α-log likelihood ratio and the α-divergence.

    Obtaining this Q-function is a generalized E step.

    Its maximization is a generalized M step.

    This pair is called the α-EM algorithm which contains the log-EM algorithm as its subclass.

    Thus, the α-EM algorithm by Yasuo Matsuyama is an exact generalization of the log-EM algorithm.

    There is no requirement for calculating the gradient or Hessian matrix.

    The α-EM shows faster convergence than the log-EM algorithm by choosing an appropriate α.

    The α-EM algorithm leads to a faster version of the Hidden Markov model estimation algorithm α-HMM.

    EM is a partially non-Bayesian, maximum likelihood method.

    Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or a posterior mode).

    It's possible that a fully Bayesian variant of this is required, giving a probability distribution over θ and the latent variables.

    The Bayesian approach to inference is simply to treat θ as another latent variable.

    In this paradigm, the distinction between the E and M steps disappears.

    If the factorized Q approximation described above is used (variational Bayes), the solution can iterate over each latent variable (now including $\theta$) and optimize them one at a time.
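    As a sketch in standard variational-Bayes notation (assumed here for illustration rather than quoted from the text), the factorized approximation replaces the joint posterior over the latent variables and $\theta$ with a product of factors, each updated in turn while the others are held fixed:

    $$p(\mathbf{Z}, \theta \mid \mathbf{X}) \approx q_{\mathbf{Z}}(\mathbf{Z})\, q_{\theta}(\theta), \qquad \log q_j^{*}(\cdot) = \operatorname{E}_{q_{-j}}\!\left[\log p(\mathbf{X}, \mathbf{Z}, \theta)\right] + \text{const},$$

    where $q_{-j}$ denotes the product of all factors other than the $j$-th. Each such update plays the role that the E and M steps played before, which is why the distinction between them disappears.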
