Handbook of HydroInformatics: Volume I: Classic Soft-Computing Techniques

About this ebook

Classic Soft-Computing Techniques is the first volume of three in the Handbook of HydroInformatics series. Through this comprehensive, 34-chapter work, the contributors explore the difference between traditional computing, also known as hard computing, and soft computing, which is based on the importance given to issues like precision, certainty, and rigor. The chapters go on to define fundamental classic soft-computing techniques such as the Artificial Neural Network, Fuzzy Logic, Genetic Algorithm, Support Vector Machine, Ant-Colony-Based Simulation, Bat Algorithm, Decision Tree Algorithm, Firefly Algorithm, Fish Habitat Analysis, Game Theory, Hybrid Cuckoo-Harmony Search Algorithm, Honey-Bee Mating Optimization, Imperialist Competitive Algorithm, Relevance Vector Machine, etc. It is a fully comprehensive handbook providing all the information needed around classic soft-computing techniques.

This volume is a true interdisciplinary work, and the audience includes postgraduates and early career researchers interested in Computer Science, Mathematical Science, Applied Science, Earth and Geoscience, Geography, Civil Engineering, Engineering, Water Science, Atmospheric Science, Social Science, Environment Science, Natural Resources, and Chemical Engineering.

  • Contains key insights from global contributors in the fields of data management research, climate change and resilience, insufficient data problems, etc.
  • Offers applied examples and case studies in each chapter, providing the reader with real-world scenarios for comparison.
  • Introduces classic soft-computing techniques necessary for a range of disciplines.
Language: English
Release date: Nov 30, 2022
ISBN: 9780128219706

    Handbook of HydroInformatics - Saeid Eslamian

    Preface

    Classic Soft-Computing Techniques is the first volume of three in the Handbook of HydroInformatics series. Through this comprehensive, 26-chapter work, the contributors explore the difference between traditional computing, also known as hard computing, and soft computing, which is based on the importance given to issues like precision, certainty, and rigor. The chapters go on to define fundamental classic soft-computing techniques such as multivariate regressions, bat algorithm optimized extreme learning machine (Bat-ELM), Bayesian inference, computational fluid dynamics (CFD) models, cross validation, selected node and link-based performance indices, conodal system analysis, data assimilation, data reduction techniques, decision tree algorithm, entropy and resilience indices, generalized autoregressive conditional heteroskedasticity (GARCH), exponential general autoregressive conditional heteroskedastic (EGARCH), and Glosten, Jagannathan, and Runkle (GJR) models, gene expression models, gradient-based optimization, gray wolf optimization (GWO) algorithm, kernel-based modeling, subgrid-scale (SGS) modeling with neural network, lattice Boltzmann method (LBM), multigene genetic programming (MGGP), ontology-based knowledge management framework, parallel chaos search-based incremental extreme learning, relevance vector machine (RVM), stochastic learning algorithms, support vector machine, uncertainty analysis using fuzzy logic models, uncertainty-based resiliency evaluation, etc. It is a fully comprehensive handbook providing all the information needed regarding classic soft-computing techniques.

    This volume is a true interdisciplinary work, and the intended audience includes postgraduates and early-career researchers interested in computer science, mathematical science, applied science, Earth and geoscience, geography, civil engineering, engineering, water science, atmospheric science, social science, environmental science, natural resources, and chemical engineering.

    The Handbook of HydroInformatics corresponds to courses taught at the undergraduate and postgraduate levels, to research students, and in short-course programs. Typical course titles of this type include: HydroInformatics, Soft Computing, Learning Machine Algorithms, Statistical Hydrology, Artificial Intelligence, Optimization, Advanced Engineering Statistics, Time Series, Stochastic Processes, Mathematical Modeling, Data Science, Data Mining, etc.

    The three-volume Handbook of HydroInformatics is recommended not only for universities and colleges, but also for research centers, governmental departments, policy makers, engineering consultants, federal emergency management agencies, and related bodies.

    Key features are as follows:

    •Contains key insights from global contributors in the fields of data management research, climate change and resilience, insufficient data problems, etc.

    •Offers applied examples and case studies in each chapter, providing the reader with real-world scenarios for comparison

    •Introduces classic soft-computing techniques necessary for a range of disciplines

    Saeid Eslamian, College of Agriculture, Isfahan University of Technology, Isfahan, Iran

    Faezeh Eslamian, McGill University, Montreal, QC, Canada

    Chapter 1: Advanced machine learning techniques: Multivariate regression

    Reza Daneshfar (a); Mohammad Esmaeili (b); Mohammad Mohammadi-Khanaposhtani (c); Alireza Baghban (d); Sajjad Habibzadeh (d); Saeid Eslamian (e,f)

    (a) Department of Petroleum Engineering, Ahwaz Faculty of Petroleum Engineering, Petroleum University of Technology, Ahwaz, Iran

    (b) Department of Petroleum Engineering, Amirkabir University of Technology (Polytechnic of Tehran), Tehran, Iran

    (c) Fouman Faculty of Engineering, College of Engineering, University of Tehran, Tehran, Iran

    (d) Chemical Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), Mahshahr Campus, Mahshahr, Iran

    (e) Department of Water Engineering, College of Agriculture, Isfahan University of Technology, Isfahan, Iran

    (f) Center of Excellence in Risk Management and Natural Hazards, Isfahan University of Technology, Isfahan, Iran

    Abstract

    Complicated problems in a variety of fields that cannot be solved using conventional techniques are handled using machine learning. Linear regression, introduced by Galton (1894), is a simple and popular machine learning technique employed for prediction purposes. It is a mathematical approach for analyzing and quantifying the associations between variables. Univariate methods such as the chi-square test, Fisher's exact test, and analysis of variance (ANOVA) cannot incorporate the effects of other confounders/covariates into a model. As a result, partial correlation and regression are employed to identify the association between two variables and evaluate the confounding effect. Mathematical algorithms typically employ linear regression to measure and model the predicted effect against several inputs. This data analysis approach linearly relates independent and dependent variables, modeling the relationships between them based on model training. The present chapter reviews recent popular methodologies in the machine learning and linear regression literature published from 2017 to 2020, including their databases, performance, accuracy, and algorithms.

    Keywords

    Machine learning; Multivariate regression; Linear regression; Learning curve; Polynomial regression; Gradient descent method

    1: Introduction

    Complicated problems in a variety of fields that cannot be solved using conventional techniques are handled using machine learning (Zeebaree et al., 2019; Bargarai et al., 2020; Dargan et al., 2020). Linear regression, introduced by Galton (1894), is a simple and popular machine learning technique employed for prediction purposes. It is a mathematical approach for analyzing and quantifying the associations between variables (Akgün and Öğüdücü, 2015; Dehghan et al., 2015; Liu et al., 2017). Univariate methods such as the chi-square test, Fisher's exact test, and analysis of variance (ANOVA) cannot incorporate the effects of other confounders/covariates into a model. As a result, partial correlation and regression are employed to identify the association between two variables and evaluate the confounding effect (Zebari et al., 2020; Sulaiman, 2020; Epskamp and Fried, 2018). Mathematical algorithms typically employ linear regression to measure and model the predicted effect against several inputs (Lim, 2019). This data analysis approach linearly relates independent and dependent variables, modeling the relationships between them based on model training. The present chapter reviews recent popular methodologies in the machine learning and linear regression literature published from 2017 to 2020, including their databases, performance, accuracy, and algorithms (Sarkar et al., 2015).

    This chapter is divided into the following sections: the first section focuses on linear regression; this is followed by an explanation of multivariate linear regression, and then the gradient descent method is described. The polynomial regression concept is then explained, together with overfitting and underfitting, cross-validation, and the learning curve. Finally, the chapter discusses regularized linear models, ridge regression, the impact of outliers, lasso regression, the elastic net, early stopping, and logistic regression.

    2: Linear regression

    When we know that a property (a dependent variable) depends on several variables but the form of this dependence is unclear, a linear model is the simplest choice for gaining insight into it. Although the simplest choice is not necessarily the best one, linear models can accomplish a lot when there is an algebraic dependency between a function and its variables. A linear model can provide a reasonable estimation of any function, at least in a small neighborhood. Moreover, some nonlinear dependencies suggested by theory can be transformed into linear ones. For example, consider the following chemical reaction rate law

    rA = k CA^n    (1)

    In which k and n are constants to be determined from experimental data of reaction rate (rA) versus species concentration (CA). To apply the favorable linear model, one can transform the above equation by taking the natural logarithm of both sides to obtain

    ln rA = ln k + n ln CA    (2)

    Another example that is relevant to multivariate problems is polynomial regression. This topic will be discussed in a separate section; however, the linear model can cover such problems to some extent by an interesting trick. Consider the following model

    y = α0 + α1 z + α2 z²    (3)

    In this case, by introducing a new variable, the nonlinear model is transformed into a linear model. Assume that z = x1 and z² = x2 then

    y = α0 + α1 x1 + α2 x2    (4)

    Although the values of x2 are not independent of x1, this has no bearing on the application of the linear regression algorithm. These two examples demonstrate that linear models for multivariate problems are a fundamental tool that cannot be ignored by practitioners, especially in the field of machine learning or, more elegantly, artificial intelligence (Olive, 2017; Matloff, 2017).
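    As a minimal illustration of this substitution trick, the Python sketch below fits a quadratic in z with an ordinary linear least-squares solver by treating x1 = z and x2 = z² as two separate input columns; the synthetic data and coefficients are illustrative assumptions.

```python
# A minimal sketch (illustrative data) of the substitution trick: fit
# y = a0 + a1*z + a2*z^2 with an ordinary linear least-squares solver by
# treating x1 = z and x2 = z^2 as two separate input columns.
import numpy as np

rng = np.random.default_rng(0)
z = np.linspace(0.0, 5.0, 50)
y = 1.0 + 2.0 * z - 0.3 * z**2 + rng.normal(0.0, 0.2, z.size)  # assumed coefficients/noise

X = np.column_stack([np.ones_like(z), z, z**2])   # columns [1, x1, x2]
alpha, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear least-squares fit
print(alpha)                                      # roughly [1.0, 2.0, -0.3]
```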

    In this section, we are going to work through a project called the nutrient removal efficiency project. We use a data set containing 7876 records to predict the total phosphorus (TP), ammonium (NH4-N), and total nitrogen (TN) removal efficiency of an anaerobic-anoxic-oxic membrane bioreactor system; the output values are predicted from the nine inputs given in Table 1. This data set was taken from the data reported in an article published by Yaqub et al. (2020).

    Table 1

    In this part, we are only using one explanatory variable (e.g., TOC) to explain the output (RE of TN).

    The linear regression diagram for this example is shown in Fig. 1. After a successful fit, it is clear that the removal efficiency of TN increases with increasing TOC.

    Fig. 1 Linear regression for nutrient removal efficiency project.
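    A minimal sketch of this single-variable fit is given below, assuming the nutrient removal efficiency data sit in a CSV file with hypothetical column names "TOC" and "RE_TN"; the file name and column names would need to be adapted to the actual data set.

```python
# Sketch of the single-variable fit shown in Fig. 1, assuming the nutrient
# removal efficiency data sit in a CSV file with hypothetical column names
# "TOC" and "RE_TN"; adjust the file name and columns to the actual data set.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("nutrient_removal_efficiency.csv")   # hypothetical file name
X = data[["TOC"]].values          # single explanatory variable
y = data["RE_TN"].values          # removal efficiency of TN

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2 on the training data:", model.score(X, y))
```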

    3: Multivariate linear regression

    When y is a function of n variables, namely x1 to xn, the simplest model for the dependency is a linear model, which can provide an estimation ŷ of the function as

    ŷ = α0 + α1 x1 + α2 x2 + ⋯ + αn xn    (5)

    Where α0 to αn are the model parameters to be determined using available data in combination with a proper linear regression algorithm (Hackeling, 2017). Matrix notations help provide a compact form of the equations in multivariate problems. In the matrix form

    ŷ = xTα    (6)

    Where xT = [1 x1 x2 ⋯ xn] indicates the transpose of the column matrix x and, similarly, α denotes the column matrix of parameters. Note that a new term, x0 = 1, is introduced to make the matrix product possible. Now the problem is reduced to the determination of the elements of the matrix α under suitable constraints, which eventually provide a system of linear equations for specifying the model parameters. First, one must note that at each point the model error is defined as ei = yi − ŷi, that is

    ei = yi − xiTα    (7)

    Where xiT can be interpreted as the ith row of the matrix XT, which includes the values of each variable at different points. The error vector can then be defined as a column matrix as

    e = y − XTα    (8)

    Where both e and y have p elements (column vectors with p rows) and XT is a p × (n + 1) matrix:

    XT =
    [ 1   x11   x12   ⋯   x1n
      1   x21   x22   ⋯   x2n
      ⋮     ⋮     ⋮           ⋮
      1   xp1   xp2   ⋯   xpn ]    (9)

    At first glance, minimization of the absolute value of the error results in the best values of the model parameters. But since the error is represented by the vector e, one should talk about the minimization of a suitable norm of it. Moreover, the first norm, which adds up the absolute values of the elements of e, would bring some problems in terms of differentiation. Thus, the better choice is the second norm, or the Euclidean norm, of the error vector:

    ‖e‖2 = √(e1² + e2² + ⋯ + ep²)    (10)

    And finally, since minimization of the above function is equivalent to the minimization of the summation on the right side, the Sum of Squares of Errors (SSE) is taken as the target function in linear regression problems:

    SSE = Σi=1…p ei² = Σi=1…p (yi − xiTα)² = (y − XTα)T (y − XTα)    (11)

    Minimization of this function without any constraint results in the ordinary least squares (OLS) method for the determination of the model parameters, i.e., α0 to αn. Obviously, this method involves setting the first partial derivatives of the SSE with respect to the model parameters to zero, which provides the required n + 1 equations:

    ∂SSE/∂αj = −2 Σi=1…p xij (yi − xiTα) = 0,   j = 0, 1, …, n    (12)

    Or in the matrix form

    X XTα = X y,  so that  α = (X XT)⁻¹ X y  (with X the transpose of XT)    (13)

    Of course, the solution of this system of linear equations can be accomplished with rather low-cost calculations. Indeed, explicit determination of the inverse matrix for the solution of a linear system of equations is almost always avoided. Instead, a direct method such as Gauss-Jordan elimination or LU factorization is advised when the system is not too large (say, for n < 100), and iterative methods such as successive over-relaxation (SOR) are advised for large systems. Of course, a favorable model should not have too many parameters; that is, the number of independent variables is kept low by incorporating the effective terms and neglecting the variables with minor impact on the output. Therefore, no matter how large the data set, one intends to solve a linear system of equations with a reasonable number of unknowns.

    Fig. 2 illustrates synthetic data in which a random error is involved in the measurement. Here, the data points are generated by y = 5 + 2.5x + error.

    Fig. 2 Applying linear regression for the synthetic data.

    The linear regression then yields α0 = 4.9943 and α1 = 2.4824.
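    The following sketch reproduces this kind of synthetic experiment under assumed settings (noise level, number of points) and solves the normal equations of Eq. (13) directly rather than inverting a matrix; the fitted values will differ slightly from those quoted above because they depend on the random draw.

```python
# Synthetic-data sketch in the spirit of Fig. 2: generate y = 5 + 2.5x + noise
# (noise level assumed) and solve the normal equations X X^T alpha = X y of
# Eq. (13) directly, without forming a matrix inverse.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = 5.0 + 2.5 * x + rng.normal(0.0, 1.0, x.size)    # assumed noise level

XT = np.column_stack([np.ones_like(x), x])          # p x (n + 1) matrix XT
alpha = np.linalg.solve(XT.T @ XT, XT.T @ y)        # normal equations
print(alpha)   # close to [5, 2.5]; the exact values depend on the random draw
```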

    The calculations involve the solution of a linear system of n + 1 equations, which involves inverting a matrix and generally requires O(n^p) operations, where p lies between 2.373 and 3 depending on the direct method applied. For example, Gauss-Jordan requires O(n³) operations; hence, if the number of variables (terms) is doubled, the number of operations increases to eight times that of the original problem. For many problems this does not introduce much difficulty, but for problems with a large number of variables, iterative procedures are advised. Among these methods, the gradient descent method is helpful both for problems with too many variables and for problems with a very large set of data (Konishi, 2014; Izenman, 2008).

    4: Gradient descent method

    This technique, also known as steepest descent, is a well-known method for the optimization of differentiable functions. The strategy is based on taking steps proportional to the negative gradient of the function at each point to get closer to a local minimum of the function. For a convex function, the global minimum can be reached, albeit with a proper choice of the step size (Harrington, 2012). For a multivariable function F(x), the gradient descent method is put into action as follows

    x(k+1) = x(k) − λ ∇F(x(k))    (14)

    Where λ is a small positive number that must be small enough to prevent overshooting the local minimum, but not so small that the process effectively gets stuck in the neighborhood of the initial guess. Note that λ can be updated at each step and, under certain circumstances, values of this parameter can be chosen to guarantee convergence. As an example, consider the contour plot shown in Fig. 3.

    Fig. 3 Contours of the function y = x1²/4 + x2² − x2³/8 + 1: reaching the local minimum by GD.

    This plot represents the function y = x1²/4 + x2² − x2³/8 + 1 and if one starts searching for the minimum from (x1, x2) = (4, 4), the direction for the steepest descent is opposite to the gradient of the function at this point, i.e., − ∇ y(4, 4) = − 2i − 2j. Hence, the new point is given by

    (x1, x2)new = (4, 4) − λ (2, 2) = (4 − 2λ, 4 − 2λ)    (15)

    Usually, at the first step, values of λ are set to less than unity and then larger values are examined. Practically, enlarging λ is permitted as long as it provides smaller values of the objective function. For the present problem, the new values of the variables in terms of λ can be substituted into the function y, and with straightforward single-variable optimization one arrives at λopt = 2; in this way the local minimum is reached in just one shot at (x1, x2) = (0, 0) with ymin = 1. Of course, real-world problems are not that easy, and several steps with proper step sizes are required. In each step, a single-variable optimization might be performed to infer the best value of the step size. For complicated functions, direct searching based on a small step size, its enlargement (usually by 10 times), and comparison of the resulting function values is better suited.
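    A minimal sketch of gradient descent on this same function, starting from (4, 4) as in the example, is given below; the constant step size is an assumption made for illustration.

```python
# Gradient descent on y = x1^2/4 + x2^2 - x2^3/8 + 1 starting from (4, 4),
# following Eq. (14); the constant step size is an assumption for the sketch.
import numpy as np

def grad(x):
    x1, x2 = x
    return np.array([x1 / 2.0, 2.0 * x2 - 3.0 * x2**2 / 8.0])   # gradient of y

x = np.array([4.0, 4.0])
lam = 0.5                       # assumed constant step size
for _ in range(100):
    x = x - lam * grad(x)       # step opposite to the gradient
print(x)                        # approaches the local minimum at (0, 0)
```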

    As far as our regression problem is concerned, one has to put it in the form of a minimization problem to apply the gradient descent method. The objective function is simply the sum of squared errors

    F(α) = SSE = Σi=1…p (yi − xiTα)²    (16)

    Whose gradient follows directly from the earlier expressions for its derivatives with respect to the model parameters; that is

    ∇α SSE = −2 X (y − XTα)    (17)

    With an initial guess and a small enough value of step size, one can initiate the algorithm to obtain the right values of the model’s parameters:

    α(k+1) = α(k) + 2λ X (y − XTα(k))    (18)

    There are three approaches to applying the gradient descent method to the training data, depending on how a large data set is handled. If the whole training set is used at every step of the calculations, the method is referred to as batch gradient descent. For a very large data set, batch gradient descent might not be economical. Hence, two other variants are used by practitioners: stochastic gradient descent and mini-batch gradient descent. In stochastic GD, only a small random subset of the training data is used at each step to calculate the gradient, which makes it much faster than the original batch GD; however, the stochastic nature of the method results in nonmonotonic convergence toward the local minimum, so stopping criteria are needed to prevent bouncing around it. The mini-batch variant of GD, on the other hand, splits the training data into small batches and computes the gradient on these batches, which allows the benefit of parallel computation while reducing the oscillatory convergence observed in stochastic GD (Shalev-Shwartz and Ben-David, 2014; Brownlee, 2016).
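    The following sketch contrasts the batch and stochastic variants on the linear-regression SSE of Eq. (16); the synthetic data and step sizes are assumptions.

```python
# Sketch contrasting batch and stochastic gradient descent on the
# linear-regression SSE of Eq. (16); data and step sizes are assumptions.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 1000)
y = 5.0 + 2.5 * x + rng.normal(0.0, 1.0, x.size)
XT = np.column_stack([np.ones_like(x), x])           # p x (n + 1)

def batch_gd(XT, y, lam=1e-5, epochs=2000):
    a = np.zeros(XT.shape[1])
    for _ in range(epochs):
        a += 2.0 * lam * XT.T @ (y - XT @ a)         # full-gradient step, Eq. (18)
    return a

def stochastic_gd(XT, y, lam=1e-3, epochs=20):
    a = np.zeros(XT.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # one random sample per step
            a += 2.0 * lam * XT[i] * (y[i] - XT[i] @ a)
    return a                                         # hovers near the minimum

print(batch_gd(XT, y))        # close to [5, 2.5]
print(stochastic_gd(XT, y))   # also close, but bounces around the minimum
```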

    We are going to apply gradient descent to the nutrient removal efficiency project, using TOC and the removal efficiency of NH4-N. As can be seen from Fig. 4, the sum of squared errors decreases with increasing epochs until it reaches a minimum value.

    Fig. 4 SSE in terms of epoch for the nutrient removal efficiency project.

    5: Polynomial regression

    When the data do not exhibit linear behavior, linear regression can still be applied by introducing new variables in terms of powers of the original variable, as discussed in the previous section. Consider Fig. 5, which displays synthetic data built from y = 5 + 3x − 0.5x² + noise. To apply linear regression to this problem, consider the following model

    ŷ = α0 + α1 x1 + α2 x2    (19)

    Where x1 = x itself and x2 = x², so that the values of the new variables are known everywhere. Therefore, the matrix of variables XT is built up as follows

    XT =
    [ 1   x(1)   x(1)²
      1   x(2)   x(2)²
      ⋮     ⋮       ⋮
      1   x(p)   x(p)² ]    (20)

    where x(i) denotes the value of x at the ith data point.

    And finally, the model parameters are determined as

    α = (X XT)⁻¹ X y    (21)

    The model prediction is also displayed in Fig. 5.

    Fig. 5 Noisy data and the second-degree least-square polynomial.
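    A sketch of this second-degree fit using scikit-learn's PolynomialFeatures together with LinearRegression is shown below; the data generation (range of x, noise level) is an assumption, so the recovered coefficients are only approximately 5, 3, and −0.5.

```python
# Sketch of the second-degree fit of Fig. 5 using PolynomialFeatures with
# LinearRegression; the data generation (range of x, noise level) is assumed,
# so the coefficients are only approximately 5, 3, and -0.5.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3.0, 3.0, 100)).reshape(-1, 1)
y = 5.0 + 3.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0.0, 1.0, 100)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)    # roughly 5 and [3, -0.5]
```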

    When there are indeed multiple variables (or features), a true polynomial regression is necessary to capture the relationship between these features. Mathematically, this relationship is depicted in nonlinear terms which contain a combination of the variables. For example, the second-order terms are either made by squaring a single variable or multiplying one feature by another. In this respect, the number of terms of a polynomial of degree m for a problem with n features will be determined as

    (n + m)! / (n! m!)    (22)

    which includes all possible combinations of variables to construct a multivariate mth-degree polynomial. Now, the question is what degree is best for a given problem. A high-degree polynomial can get closer to more data points, but it can easily lose track by following the inherent noise of the data; in this regard, its predictions may not be reliable. When this is the case, we say that the model has high variance. Fig. 6 compares a polynomial of 20th degree with the second-degree polynomial.

    Fig. 6 The variance of a too high-degree polynomial which makes it an improper choice.

    Choosing the proper degree of the regression polynomial is a statistical task that considers the trade-off between the high variance (unacceptable sensitivity of the model with high-degree polynomials) and the bias (underfitting the data with low degree polynomials) (Shalev-Shwartz and Ben-David, 2014; Raschka, 2015; Ramasubramanian and Singh, 2018). The issue is discussed in the following section.

    6: Overfitting and underfitting

    Compared to plain linear regression, a high-degree polynomial regression provides a better opportunity for fitting the training data. As shown in Fig. 7, when a 40-degree polynomial model is applied to the training data, the data points are approximated to a great extent, but the trend is clearly not acceptable at either end, which is why it is considered an overfitting regression polynomial. The linear model neither follows the general trend nor touches the data points satisfactorily; hence, it underfits the data. However, the quadratic regression satisfactorily follows the general trend and presents a reasonable approximation as well (Swamynathan, 2019; Burger, 2018).

    Fig. 7 The high-degree polynomial regression.

    This is expected, since the initial data set was created by introducing some errors into a quadratic function. Nonetheless, in many practical cases there is no way to identify the original function behind the data set. Therefore, there is a need to assess the complexity of a model and to determine whether it is underfitting or overfitting the data (Ganesh, 2017).

    7: Cross-validation

    One of the most common ways to estimate how well a model generalizes is cross-validation. The model is said to be overfitting if it performs well on the training data but generalizes poorly, as determined by evaluating the cross-validation measures. Conversely, the model is said to be underfitting if it performs poorly both on the training data and on the cross-validation measures. Hence, this is a satisfactory method for determining whether the model is too complex or too simple.
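    The sketch below illustrates this check by comparing training scores with cross-validated scores for models of increasing complexity; the synthetic data and the chosen degrees are illustrative assumptions.

```python
# Sketch of the cross-validation check: compare training scores with
# cross-validated scores for models of increasing complexity. Synthetic
# quadratic data; degrees and settings are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, 80).reshape(-1, 1)
y = 5.0 + 3.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0.0, 1.0, 80)

for degree in (1, 2, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv = cross_val_score(model, x, y, cv=5, scoring="r2")
    train = model.fit(x, y).score(x, y)
    print(f"degree {degree:2d}: train R^2 = {train:.2f}, CV R^2 = {cv.mean():.2f}")
    # underfitting: both scores low; overfitting: high train score, low CV score
```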

    8: Comparison between linear and polynomial regressions

    In this section, we intend to examine polynomial regression for the nutrient removal efficiency project. Plotting MLSS on the horizontal axis against the removal efficiency of TN on the y-axis reveals a nonlinear downward trend. The figure related to this example is given in Fig. 8.

    Fig. 8 MLSS versus removal efficiency of TN for the nutrient removal efficiency project.

    After applying linear regression to the data in this example, it becomes clear that a straight line cannot fit these data well (see Fig. 9).

    Fig. 9 Using linear regression to predict the values of MLSS and removal efficiency of TN.

    Applying polynomial regression in the quadratic mode to the data of this example makes it clear that this regression fits the data better than the linear mode. This was also examined numerically: the R² of the quadratic regression was 0.14, while the R² of the linear regression was 0.10 (see Fig. 10).

    Fig. 10 Using polynomial regression (quadratic mode) to predict the values of MLSS and removal efficiency of TN.

    Now we change the polynomial features to degree 10 and run the code again; this time we can see from the following figure that the shape of the curve changes. The value of R² is 0.04, and we appear to have a case of overfitting (see Fig. 11).

    Fig. 11 Using polynomial regression (degree of 10) to predict the values of MLSS and removal efficiency of TN.
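    A sketch of this degree comparison is given below, again assuming a hypothetical file name and column names ("MLSS", "RE_TN"); the R² values quoted in the text can only be reproduced with the actual data set.

```python
# Sketch of the degree comparison of Figs. 9-11, again with hypothetical
# file and column names ("MLSS", "RE_TN"); the R^2 values in the text can
# only be reproduced with the actual data set.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv("nutrient_removal_efficiency.csv")   # hypothetical file name
X = data[["MLSS"]].values
y = data["RE_TN"].values

for degree in (1, 2, 10):
    X_poly = PolynomialFeatures(degree, include_bias=False).fit_transform(X)
    pred = LinearRegression().fit(X_poly, y).predict(X_poly)
    print(f"degree {degree:2d}: R^2 = {r2_score(y, pred):.2f}")
```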

    9: Learning curve

    One of the other available methods is to evaluate the learning curves. Learning curves show the performance of the model on both training and validation sets as a function of the size of the training set or the training iteration. In order to plot these curves, the model is trained several times using various subsets of the training set where each subset is of a different size (Jaber, 2016).

    It should be noted that, in general, a straight line cannot model these data well. This is confirmed by the fact that the training error reaches a relatively constant level that is very close to the validation curve. Such learning curves, i.e., curves that reach constant error levels that are close together and relatively high, are typical of an underfitting model. A common method for improving an overfitting model, on the other hand, is to provide it with more training instances until the validation error reaches the training error.

    The learning curve for the nutrient removal efficiency project is given in the diagram below. Plotting the training and test scores versus the number of samples shows that the two curves approach each other as the number of training examples increases; from about 5000 samples onwards, both the slope and the rate at which the two curves converge become less pronounced (see Fig. 12).

    Fig. 12 The learning curve for the nutrient removal efficiency project.
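    A sketch of this diagnostic using scikit-learn's learning_curve utility is shown below; the file name, column names, and scoring choice are assumptions.

```python
# Sketch of the learning-curve diagnostic of Fig. 12 using scikit-learn's
# learning_curve utility; the file name, column names, and scoring choice
# are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

data = pd.read_csv("nutrient_removal_efficiency.csv")   # hypothetical file name
X = data[["TOC"]].values
y = data["RE_TN"].values

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="r2")
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n = {n:5d}: train R^2 = {tr:.2f}, validation R^2 = {va:.2f}")
```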

    10: Regularized linear models

    One of the possible ways to decrease overfitting involves regularizing the model, which is another way of saying limiting or restricting it. By reducing the degrees of freedom of the model, it becomes harder for the model to overfit the data. One of the easiest ways to regularize a polynomial model is to decrease the polynomial degree. In the case of a linear model, regularization is generally performed by restricting the weights of the model. To better illustrate this, the Ridge regression and Lasso regression models can be examined, since these models utilize two distinct methods for restricting the weights (Gori, 2017).

    11: The ridge regression

    Ridge regression is in fact a regularized or restricted version of linear regression. In order to regularize the linear regression model, a regularization term equal to α Σi θi² is introduced into the cost function of the model. Adding this term makes the learning algorithm fit the data while also keeping the weights of the model as small as possible. It is worth mentioning that this regularization term must only be introduced into the cost function during the training stage. After training the model, the unregularized performance measure can be used to assess the performance of the model (Saleh et al., 2019; Aldrich and Auret, 2013).

    The extent of the regularization of the model can be controlled using the hyper-parameter α. When α = 0, the Ridge regression will be the same as the linear regression. However, if α is very large, all the weights will be close to zero, resulting in a flat line going through the mean values of the data. The cost function of the Ridge regression model is presented in Eq. (23).

    J(θ) = MSE(θ) + α (1/2) Σi=1…n θi²    (23)

    It should be noted that the bias term, denoted by θ0, is not regularized, and the sum starts at i = 1 and not i = 0. If w denotes the vector of feature weights (θ1 to θn), the regularization term can be written as (1/2)‖w‖2², in which ‖w‖2 signifies the ℓ2 norm of the weight vector. Moreover, for gradient descent, αw is simply added to the MSE gradient vector (Alpaydin, 2020).

    Similar to the linear regression model, the Ridge regression can be performed by calculating a closed-form equation or by applying the gradient descent. The advantages and disadvantages are the same. The closed-form solution is presented in Eq. (24). It should be noted that in this equation, A is the (n + 1) × (n + 1) identity matrix, with one difference, i.e., the presence of a 0 in the top-left cell, which corresponds to the bias term.

    θ = (X XT + α A)⁻¹ X y    (24)
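    The sketch below places the closed form of Eq. (24) next to scikit-learn's Ridge estimator on synthetic data; the data and the value of α are assumptions, and the two results should agree because neither version penalizes the bias term.

```python
# Ridge sketch: the closed form of Eq. (24) next to scikit-learn's Ridge
# estimator; data and the value of alpha are assumptions. Neither version
# penalizes the bias term, so the two results should agree.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 3.0, (50, 1))
y = 4.0 + 2.0 * X[:, 0] + rng.normal(0.0, 0.5, 50)
alpha = 1.0                                          # assumed regularization strength

Xb = np.column_stack([np.ones(len(X)), X])           # prepend the bias column
A = np.eye(Xb.shape[1])
A[0, 0] = 0.0                                        # do not regularize the bias term
theta = np.linalg.solve(Xb.T @ Xb + alpha * A, Xb.T @ y)   # closed form, Eq. (24)

ridge = Ridge(alpha=alpha).fit(X, y)                 # scikit-learn equivalent
print(theta)
print(ridge.intercept_, ridge.coef_)
```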

    12: The effect of collinearity in the coefficients of an estimator

    As mentioned earlier, α ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of α, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity. Each color represents a different feature of the coefficient vector, and this is displayed as a function of the regularization parameter. This example also shows the usefulness of applying Ridge regression to highly ill-conditioned matrices. For such matrices, a slight change in the target variable can cause huge variances in the calculated weights. In such cases, it is useful to set a certain regularization (alpha) to reduce this variation (noise) (see Fig. 13).

    Fig. 13 Ridge coefficients as a function of the regularization.

    13: Outliers impact

    Let's look at an example of the effect of outliers on the slope of the regression line, and then show how ridge regression can reduce these effects. For a data set containing 100 randomly generated points around a line with a slope of 0.5, the slope of the fitted line after performing linear regression is 0.47134857. The diagram for this example is given below (see Fig. 14):

    Fig. 14 Data set containing 100 randomly generated points and performing linear regression.

    Now to show the effect of outliers on the previous example, we change two points in the data set: we replace the first point of the chart with − 200 and the last point of the chart with + 200. After performing linear regression, we see that the slope of the obtained line is equal to 1.50556072, which is significantly different from the slope of the previous chart and shows the effect of outliers. The diagram for this example is given below (see Fig. 15).

    Fig. 15 Showing the outlier’s effect.

    After applying the ridge regression, we can see that this regression is substantially better than linear regression and it recovers the original coefficient with a fairly good approximation. The slope of the line obtained in this regression is equal to 1.00370714. The diagram for this example is given below (see Fig. 16).

    Fig. 16 Using ridge regression to offset the impact of outliers.
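    The following sketch reproduces the spirit of this experiment (clean fit, outlier-contaminated fit, and a Ridge fit) on assumed synthetic data; the slopes quoted in the text correspond to the authors' own random draw and settings, so the numbers here will differ.

```python
# Sketch in the spirit of Figs. 14-16: fit a line to clean points, corrupt
# two of them, and compare plain linear regression with Ridge. The data
# generation and alpha are assumptions, so the slopes will not match the
# figures in the text exactly.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = 0.5 * x[:, 0] + rng.normal(0.0, 0.5, 100)        # true slope 0.5

y_out = y.copy()
y_out[0], y_out[-1] = -200.0, 200.0                  # inject two outliers

print("clean slope:  ", LinearRegression().fit(x, y).coef_[0])
print("outlier slope:", LinearRegression().fit(x, y_out).coef_[0])   # inflated
print("ridge slope:  ", Ridge(alpha=1000.0).fit(x, y_out).coef_[0])
# Ridge shrinks the inflated slope back toward smaller values; how far it
# moves depends on the assumed alpha.
```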

    14: Lasso regression

    Another regularized version of linear regression is Lasso regression, whose name comes from Least Absolute Shrinkage and Selection Operator. Similar to Ridge regression, this regularized version introduces a regularization term into the cost function, except that instead of half the square of the ℓ2 norm, it utilizes the ℓ1 norm of the weight vector. This is expressed in Eq. (25) (Sra et al., 2012; Bali et al., 2016).

    J(θ) = MSE(θ) + α Σi=1…n |θi|    (25)

    One of the differences between this type of regression and Ridge regression is that, with Ridge, the gradients get smaller as the parameters approach the global optimum, so gradient descent naturally slows down, which helps convergence because there is no bouncing around. Another difference is that with Ridge, increasing α brings the optimal parameters gradually closer to the origin without ever making them exactly zero, whereas Lasso tends to eliminate the weights of the least important features entirely.

    It should be noted that the Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, …, n); however, if a subgradient vector g is used wherever θi = 0, gradient descent still performs well enough. A subgradient vector equation that can be used for gradient descent with the Lasso cost function is presented in Eq. (26).

    g(θ, J) = ∇θ MSE(θ) + α [sign(θ1)  sign(θ2)  ⋯  sign(θn)]T,  where sign(θi) = −1 if θi < 0, 0 if θi = 0, and +1 if θi > 0    (26)

    After applying Lasso regression to the example data used for ridge regression, it was shown that this regression fits the data of this example with a much better approximation than the linear one and is less affected by outliers. The slope of the line for this regression is 1.06289489. The diagram for this example is given below (see Fig. 17):

    Fig. 17 Using Lasso regression to offset the impact of outliers.

    15: Elastic net

    The elastic net is a middle ground between the Ridge and Lasso regression models. In this model, the regularization term is a mix of the Ridge and Lasso regularization terms, controlled by the mix ratio r. It should be noted that the elastic net is equal to Ridge regression when r is set to 0, while it equals Lasso regression when r is set to 1. This is expressed in Eq. (27) (Humphries et al., 2018; Forsyth, 2019).

    J(θ) = MSE(θ) + r α Σi=1…n |θi| + ((1 − r)/2) α Σi=1…n θi²    (27)

    The question is when to use linear regression without regularization, Ridge regression, Lasso regression, or the elastic net. In making this decision, it should be noted that at least a little regularization is almost always preferable, so plain linear regression should generally be avoided. Ridge regression is a good default option. However, if only a limited number of features are actually useful, it is better to use Lasso regression or the elastic net, since they tend to set the weights of the useless features to zero, as noted earlier. Nonetheless, when the number of features is larger than the number of training instances, or when several features are strongly correlated, it is better to use the elastic net instead of Lasso regression, since Lasso can behave erratically in such cases.
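    The sketch below compares Ridge, Lasso, and the elastic net on synthetic data with many correlated, mostly irrelevant features; in scikit-learn's ElasticNet the l1_ratio parameter plays the role of the mix ratio r, and all data and hyperparameter choices are illustrative assumptions.

```python
# Sketch comparing Ridge, Lasso, and ElasticNet on data with many correlated,
# mostly irrelevant features; all data and hyperparameters are assumptions.
# In scikit-learn's ElasticNet, l1_ratio plays the role of the mix ratio r.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)          # two strongly correlated features
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0.0, 0.5, 100)   # only two useful features

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(np.abs(model.coef_) < 1e-6))
    print(type(model).__name__, "- coefficients set to zero:", n_zero)
```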

    After applying elastic net regression to the example data used for the ridge and lasso regressions, it was shown that this regression gives a better approximation than the two previous regressions and is less affected by outliers. The slope of the fitted line for this regression was 0.74724704. The diagram for this example is given below (see Fig. 18):

    Fig. 18 Using elastic net regression to offset the impact of outliers.

    16: Early stopping

    Another distinct way to regularize iterative learning algorithms, including gradient descent, is to stop the training as soon as the validation error reaches its minimum; this method is known as early stopping. It is a simple, elegant, and efficient way to regularize iterative learning algorithms (Shukla, 2018).

    It should be noted that when using stochastic or mini-batch gradient descent, it is difficult to determine whether the error has reached its minimum, since the curves are not that smooth. A possible solution is to stop the training after the validation error has stayed above the minimum for a while and the chance of the model performing better has become small. Afterward, the parameters of the model can be rolled back to the values they had when the validation error was at its minimum.
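    An early-stopping sketch in this spirit is given below: an SGD regressor is trained one epoch at a time with warm starts, and the parameters from the epoch with the lowest validation error are kept. The data, the degree-10 features, and the learning-rate settings are assumptions.

```python
# Early-stopping sketch: train an SGD regressor one epoch at a time
# (warm_start=True) and keep the parameters from the epoch with the lowest
# validation error. Data, the degree-10 features, and the learning-rate
# settings are assumptions.
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(8)
x = rng.uniform(-3.0, 3.0, 200).reshape(-1, 1)
y = 5.0 + 3.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0.0, 1.0, 200)

X_train, X_val, y_train, y_val = train_test_split(x, y, test_size=0.5, random_state=0)
prep = make_pipeline(PolynomialFeatures(degree=10), StandardScaler())
X_train, X_val = prep.fit_transform(X_train), prep.transform(X_val)

sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=5e-4)
best_err, best_model = float("inf"), None
for epoch in range(500):
    sgd.fit(X_train, y_train)                         # continues where it left off
    err = mean_squared_error(y_val, sgd.predict(X_val))
    if err < best_err:
        best_err, best_model = err, deepcopy(sgd)     # remember the best epoch
print("best validation MSE:", best_err)
```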

    17: Logistic regression

    Some regression algorithms can also be applied to classification problems. The probability that an example belongs to a particular class can be estimated by logistic regression; for example, how likely is it that an email is spam? Logit regression is another name for this method. In this model, whether or not a sample is assigned to a class depends on whether the estimated probability is above or below 50%: a probability above 50% corresponds to the positive class, represented by 1, and a probability below 50% to the negative class, indicated by 0. Such a division is called binary classification (Mohammed et al., 2016; Lesmeister, 2015).

    18: Estimation of probabilities

    The question that may be encountered here is how logistic regression works. A logistic regression model is similar to a linear regression model in the sense that it computes a weighted sum of the inputs along with a bias term. However, its main difference from the linear regression model is that it does not output this sum directly; rather, its output is the logistic of the result. This is expressed in Eq. (28).

    p̂ = hθ(x) = σ(xTθ)    (28)

    It should be noted that the logistic, which is denoted by σ(.), is a sigmoid or S-shaped function, whose output ranges from 0 to 1. This function is expressed by Eq. (29) and it is depicted in Fig. 19.

    σ(t) = 1 / (1 + e^(−t))    (29)

    Fig. 19 The logistic function.

    After the logistic regression model estimates the probability p̂ = hθ(x) that an instance x belongs to the positive class, the model can calculate the prediction ŷ in a straightforward fashion. This prediction is expressed in Eq. (30) (Lantz, 2019).

    ŷ = 0 if p̂ < 0.5;  ŷ = 1 if p̂ ≥ 0.5    (30)

    It should be noted that when t < 0 we have σ(t) < 0.5, while when t ≥ 0 we have σ(t) ≥ 0.5. Accordingly, if xTθ is positive, the logistic regression model's prediction will be equal to 1; otherwise, the output will be equal to 0.
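    The sketch below ties Eqs. (28)-(30) together: the sigmoid is applied to the weighted sum xTθ learned by scikit-learn's LogisticRegression on a synthetic binary problem; the data generation is an assumption.

```python
# Sketch tying Eqs. (28)-(30) together: the sigmoid applied to the weighted
# sum x^T theta learned by LogisticRegression on a synthetic binary problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))                  # Eq. (29)

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
p_hat = sigmoid(X @ clf.coef_.ravel() + clf.intercept_[0])   # Eq. (28)
print(np.allclose(p_hat, clf.predict_proba(X)[:, 1]))        # True: same probabilities
print("prediction for [1, 1]:", clf.predict([[1.0, 1.0]])[0])  # Eq. (30): 1 if p_hat >= 0.5
```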
