Machine Learning in Earth, Environmental and Planetary Sciences: Theoretical and Practical Applications
Ebook · 804 pages · 6 hours


About this ebook

Machine Learning in Earth, Environmental and Planetary Sciences: Theoretical and Practical Applications is a practical guide to implementing a variety of extreme learning machine algorithms on Earth and environmental data. The book provides guided examples using real-world data for numerous novel and mathematically detailed machine learning techniques that can be applied in Earth, environmental, and planetary sciences, including detailed MATLAB coding coupled with line-by-line descriptions of the advantages and limitations of each method. The book also presents common postprocessing techniques required for correct data interpretation.

This book provides students, academics, and researchers with a detailed understanding of how machine learning algorithms can be applied to solve real-world problems, how to prepare data, and how to interpret the results.

  • Describes how to develop different schemes of machine learning techniques and apply them to Earth, environmental, and planetary data
  • Provides detailed, guided line-by-line examples using real-world data, including the appropriate MATLAB codes
  • Includes numerous figures, illustrations and tables to help readers better understand the concepts covered
Language: English
Release date: Jul 3, 2023
ISBN: 9780443152856
Author

Hossein Bonakdari

Dr. Bonakdari obtained his PhD in Civil Engineering from the University of Caen Normandy (France). He has worked for several organizations, most recently as an Associate Professor in the Department of Civil Engineering at the University of Ottawa (Canada). He is one of the most influential scientists in the field of developing novel algorithms for solving practical problems through the decision-making abilities of AI. His research also focuses on creating comprehensive methodologies in the areas of simulation modeling, optimization, and machine learning algorithms. The results of his research have been published in international journals and presented at international conferences. He was included in the list of the world's top 2% scientists published by Stanford University and serves on the editorial boards of several journals.


    Book preview


    Chapter 1

    Dataset preparation

    Abstract

    The machine learning (ML) approach, a powerful tool for solving complex nonlinear problems, has attracted the attention of many scholars in various fields of applied sciences, including social science, chemical engineering, physics and astronomy, agriculture and biological science, mathematics, earth and planetary sciences, environmental science, computer science, etc. The primary objective of the ML approach is to generate an intelligent model capable of producing solutions to complex problems, problems which humans would be unable to solve without the help of an expert system. In this chapter, the prerequisite steps required to model using ML techniques are presented in detail. This chapter begins with an overview of the modeling process for ML applications, a process which may be implemented regardless of the specific ML technique considered. The reader is introduced to five different real-world problems which contain two to six input variables and 100 to more than 1000 sample points.

    Keywords

    Data; machine learning; artificial intelligence; MATLAB; barplot

    1.1 The modeling process

    Prior to implementing any machine learning (ML) technique, the modeler must first understand the general approach to be followed when faced with any modeling problem. In the analysis and resolution of such problems, a consistent modeling paradigm or methodology may often be followed. Fig. 1.1 presents a schematic detailing all of the steps involved in the resolution of ML-based modeling problems. As shown in this figure, the ML application process is composed of four main components: (1) data collection, (2) preprocessing, (3) modeling by ML techniques, and (4) postprocessing. During data collection, a set of data relating independent variables to one or more target variables is identified.

    Figure 1.1 The modeling process/paradigm in machine learning.

    In Fig. 1.1, for example, three independent input variables are identified as raw data. Data collection/data generation can be done in many ways, including through personal experimentation (laboratory experimental results), from open data sources (such as USGS or Statistics Canada), or from review documentation and/or existing published data sets (Fig. 1.2).

    Figure 1.2 Different types of data collections.

    The second step is preprocessing. This is considered one of the most critical steps when modeling with ML techniques, as data transformation and cleaning can result in more meaningful modeling results and better model performance. In fact, without preprocessing, it may prove difficult in some circumstances to fit an adequate and generalizable model to the data set (Niu et al., 2020; Obaid et al., 2019). Consider the following example. Suppose the ranges of the different input variables within a dataset are not the same and span several orders of magnitude. In this case, the model will place more effort into optimizing the adjustable parameters associated with the variables of higher magnitude than those associated with the variables of smaller magnitude. The changes obtained during optimization for the variables with smaller values will be negligible, and the model may even ignore them altogether. One of the most well-known approaches to overcoming this limitation of ML models is to apply preprocessing in the form of normalization (Ebtehaj & Bonakdari, 2016a; Ivanyuk & Soloviev, 2019; Qasem et al., 2017) or standardization (Ebtehaj et al., 2019; Gómez-Escalonilla et al., 2022; Zeynoddin et al., 2019) of the raw data. From this simple example, it can be readily appreciated that familiarity with preprocessing methods is fundamental to modeling with ML methods. In addition to improving model accuracy, preprocessing may make the input data simpler for the modeler to understand and easier to compare (Bonakdari et al., 2019; Moeeni et al., 2017). Besides data scaling in the form of normalization (Zeynoddin, Bonakdari et al., 2020) or standardization (Zeynoddin, Ebtehaj et al., 2020; Zhang et al., 2018), data splitting (Ebtehaj et al., 2020) and cross-validation (Bonakdari & Zeynoddin, 2022; Ebtehaj & Bonakdari, 2016b; Ferdinandy et al., 2020) are also used during the preprocessing phase to divide the data into training and testing samples (Fig. 1.3).

    Figure 1.3 Summary of fundamental preprocessing techniques.
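    As a concrete illustration of the scaling operations mentioned above, the following minimal MATLAB sketch applies min-max normalization and z-score standardization to a raw input matrix. The variable names and the synthetic data are illustrative only and are not taken from the book's listings.

        % Minimal sketch: scaling a raw input matrix X (rows = samples,
        % columns = variables). X here is synthetic placeholder data.
        % Requires MATLAB R2016b or later for implicit expansion.
        X  = [rand(100,1)*20, rand(100,1)*0.05, rand(100,1)*1000];  % very different ranges
        Xn = (X - min(X)) ./ (max(X) - min(X));   % min-max normalization to [0, 1]
        Xs = (X - mean(X)) ./ std(X);             % standardization (zero mean, unit SD)

    Newer MATLAB releases (R2018a and later) also provide the built-in normalize function, for example normalize(X,'range') and normalize(X,'zscore'), which perform the same operations.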

    The third step in the paradigm is modeling using ML approaches. In this step, the best model should be identified through the use of optimization techniques coupled with preprocessing of the input data. To find the optimum model, a set of quantitative postprocessing tools, such as statistical indices, and qualitative tools, such as scatter plots (Kim et al., 2019), box plots (Jato-Espino et al., 2019), Taylor diagrams (Hu et al., 2021), uncertainty analysis (Herrera et al., 2022; Sharafati et al., 2020), and reliability analysis (Hariri-Ardebili & Pourkamali-Anaraki, 2018a,b), must be considered. After selecting the best model with these tools, the final model can be applied to new data sets in practical tasks. A summary of the fundamental postprocessing techniques is presented in Fig. 1.4.

    Figure 1.4 Summary of fundamental postprocessing techniques.
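    To make the quantitative side of postprocessing concrete, the sketch below compares a vector of model predictions against observations using a few common statistical indices and a scatter plot. The choice of indices (RMSE, MAE, and the correlation coefficient), the variable names, and the synthetic data are assumptions for illustration; they are not prescribed by the book.

        % Minimal sketch of quantitative postprocessing with synthetic data.
        Obs  = 10*rand(48,1);                  % placeholder observed values
        Pred = Obs + 0.5*randn(48,1);          % placeholder model predictions
        RMSE = sqrt(mean((Pred - Obs).^2));    % root mean square error
        MAE  = mean(abs(Pred - Obs));          % mean absolute error
        C    = corrcoef(Obs, Pred);            % 2 x 2 correlation matrix
        R    = C(1,2);                         % Pearson correlation coefficient
        scatter(Obs, Pred, 'filled'); hold on
        plot(xlim, xlim, 'k--')                % 1:1 line for visual comparison
        xlabel('Observed'); ylabel('Predicted')
        title(sprintf('RMSE = %.3f, MAE = %.3f, R = %.3f', RMSE, MAE, R))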

    1.2 Data description

    Throughout this text, five different sample data sets (all collected from real-world projects) are considered to demonstrate the development, application, and performance of different ML models. A description of each dataset, termed Examples 1–5 (the data are provided in Appendix 1A), is given in the following subsections. The number of input variables and the total number of samples for each example are provided in Fig. 1.5.

    Figure 1.5 Characteristics of example data. NIV, Number of input variables; NS, number of all samples.

    1.3 Different types of problems

    1.3.1 Example 1: a problem with six input variables

    The dataset considered in Example 1 is a composite set formed by aggregating data from two different studies (Bagheri et al., 2014; Cheong, 1991). The collected data for Example 1 is therefore a combination of two published data sets. In order for a modeler to aggregate datasets, the laboratory conditions under which they were produced must be practically identical, so that the data were collected in the same way. Given that it is time-consuming to perform experiments spanning the full ranges of all variables affecting the investigation, it is often the case that scholars cannot examine all conditions in a single study. By juxtaposing several studies in which the experimental conditions are consistent with each other, the limitations of each individual study can be overcome. In addition, the use of ML methods requires a wide range of independent input values to train the desired model(s), as well as a typically large number of samples. This breadth is required to provide the model with the experience needed to estimate the target variable with acceptable accuracy for unseen samples (i.e., testing samples). For example, if one dataset only covers a range of input values from 0 to 20, while another ranges from 15 to 70, it may prove valuable to develop a model that spans the greater range defined by the juxtaposed set (i.e., from 0 to 70) so that it has a greater range of application in solving real-world problems. Considering more than one data set is a well-known approach in real-world practical applications of ML by scholars (Azimi et al., 2016; Ebtehaj & Bonakdari, 2014; Ebtehaj et al., 2015, 2016, 2017; Gholami et al., 2017).

    In Example 1, the total number of samples is 161, with 113 samples randomly selected to train the model (i.e., training samples), while the remaining 48 samples are used to check the performance of the developed ML-based model when faced with unseen samples (i.e., testing samples). Modeling is thus performed with 70% of all samples, while 30% of the samples are reserved to verify the generalizability of the developed model. It should be noted that the modeling process for ML-based models should be controlled in such a way that the developed model performs well in both the training and testing phases, so that it is generalizable to a range of future tasks.

    Different splitting ratios may be considered for assigning training and testing data, where the maximum proportion of test data is about 50% (i.e., 50% for the training stage and 50% for the testing stage) and the minimum is 10% (i.e., 90% for the training stage and 10% for the testing stage) (Ebtehaj et al., 2020). Reserving 30% of the total data as testing samples is a well-known choice for the testing stage, and this ratio is therefore used throughout this text. The reader should be aware, however, that in some instances the nature of the data set necessitates that different splitting ratios be studied to obtain the optimal distribution of training and testing samples. The optimal percentage is defined such that neither the training nor the testing subset is too small, while the performance of the model in the training and testing stages remains very close.
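    The 70/30 split described above can be implemented in a few lines of MATLAB. The sketch below is illustrative only; the variable names Data and Target, the placeholder values, and the use of randperm are assumptions rather than the book's own code.

        % Minimal sketch: random 70/30 train/test split for the 161 samples
        % of Example 1 (113 training, 48 testing). Data/Target are placeholders.
        Data   = rand(161, 6);   Target = rand(161, 1);
        N      = size(Data, 1);
        idx    = randperm(N);                    % shuffle the sample indices
        nTrain = round(0.70 * N);                % 70% of samples -> 113
        TrainInputs  = Data(idx(1:nTrain), :);
        TrainTargets = Target(idx(1:nTrain), :);
        TestInputs   = Data(idx(nTrain+1:end), :);
        TestTargets  = Target(idx(nTrain+1:end), :);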

    1.3.1.1 Statistical description of Example 1 data using barplot analysis

    The minimum (Min.), average (Avg.), maximum (Max.), and standard deviation (SD) values for all independent inputs (In1, In2, In3, In4, In5, In6), as well as the dependent output (Out) for the training, testing, and total data, are provided in Fig. 1.6A–G for Example 1 data.

    Figure 1.6 Statistical indices of Example 1 training, testing, and total data. (A) Input 1, (B) input 2, (C) input 3, (D) input 4, (E) input 5, (F) input 6, and (G) output.

    From this figure, it can be seen that the ranges of the input values are significantly different from each other, as well as from the range of the output variable. For example, the maximum value of In3 is 10, while this same value is 4 (i.e., In2) or less than 4 (In1, In4, In5, In6) for all other inputs. If the range of values for the different input and output variables differs greatly from one variable to another, the modeler may need to apply normalization during the preprocessing stage. Another consideration for the modeler is the similarity of the ranges of the data used for the training and testing subsets. It is desirable to split the data such that the training subset provides all the necessary experience to the developed model through exposure to the full range of the input variables. If the data distribution spans different ranges in the training and testing subsets, the model may yield poor results.
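    One quick way to check this similarity is to tabulate the minimum and maximum of each input variable for the training and testing subsets side by side. The following sketch uses placeholder data in place of TrainInputs and TestInputs; the table layout is an illustrative choice, not the book's code.

        % Minimal sketch: compare the per-variable ranges of the training and
        % testing subsets. Placeholder data stand in for the Example 1 samples.
        TrainInputs = rand(113, 6);   TestInputs = rand(48, 6);
        RangeCheck  = [min(TrainInputs); max(TrainInputs); ...
                       min(TestInputs);  max(TestInputs)];
        disp(array2table(RangeCheck, ...
             'RowNames',      {'TrainMin', 'TrainMax', 'TestMin', 'TestMax'}, ...
             'VariableNames', {'In1', 'In2', 'In3', 'In4', 'In5', 'In6'}))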

    1.3.1.2 The barplot coding using MATLAB®

    In Fig. 1.6, several statistical indices (minimum, average, maximum, and standard deviation) were computed for each independent input variable as well as for the output variable. In the following subsection, the detailed steps required to code and generate the barplots, including the aggregation of training and test data, the calculation of the indices, and the plotting of the figures, are presented.

    The coding syntax for the generation of a barplot can be divided into several general categories: (1) load the data; (2) merge all samples; (3) calculate the statistical indices; (4) prepare the data for plotting; and (5) plot the results. First, the data must be read or loaded into the MATLAB environment. To do so, the data is first saved within a Microsoft Excel file that contains four sheets (i.e., sheet1, sheet2, sheet3, sheet4), where sheets 1–4 contain the training input, training output, testing input, and testing output, respectively (Fig. 1.7). For the Example 1 data, the number of input variables is six, while the number of output variables is one, as previously shown in the dataset description in Fig. 1.6. In addition, 70% of the data was considered as training data, while the other 30% was reserved as the testing subset. This results in 113 training and 48 testing data samples, respectively.

    Figure 1.7 Data preparation in Microsoft Excel file. (A) Training inputs, (B) training targets, (C) testing inputs, and (D) testing target.

    Code 1.1 presents the required syntax for the load data and merge all samples steps. Before providing the details of the code, its function is conceptualized in Fig. 1.8. According to this figure, using the xlsread command in the MATLAB environment, four different variables are loaded from the previously developed Excel spreadsheet (i.e., TrainInputs, TrainTargets, TestInputs, and TestTargets). In the next step, the training and testing inputs, as well as their corresponding targets, are merged as Inputs and Targets.

    Figure 1.8 The conceptual coding process of Code 1.1.

    Below, the coding details related to loading the data (i.e., Code 1.1), calculating the indices (i.e., Code 1.2), and plotting the figures (i.e., Code 1.3) are explained. In lines 1–3 of Code 1.1, presented next, some general MATLAB functions are used to prepare the MATLAB environment prior to the execution of any programming. These commands are used in almost all MATLAB scripts and are the real-life equivalent of wiping a whiteboard clean: a clean slate. In line 1, clc is used to clear the command window, which erases the text that was previously displayed. Once this command has been executed, the earlier output can no longer be seen using the scroll bar, but previous statements can still be recalled using the up-arrow key (↑). In line 2, the clear command clears variables and functions from the program's memory. Alternatively, the clear all command may be used, which also removes other stored items from memory, such as cached memory, breakpoints, and persistent variables. It is often unnecessary to employ clear all, and the clear command is sufficient. The third command, shown in line 3, is close all, which removes all figures whose handles are not hidden.

    Code 1.1

    In lines 6–9 of Code 1.1, the xlsread command is used to load the data from the Excel spreadsheet developed previously (i.e., Fig. 1.7). The general format of this built-in MATLAB function is xlsread(filename, sheet), where filename is the name of the saved file and sheet is the name of the reference sheet where the data is contained.

    In the case of the Example 1 data set, the data is saved under the Excel filename Example1, which is used as the filename argument, while the sheet argument is specified as either sheet1, sheet2, sheet3, or sheet4, for the training input, training target, testing input, and testing target, respectively. Because the training and testing data are independently read into the MATLAB environment, it is necessary to define a variable capable of storing the data features for the entire set. To achieve this, the training inputs and testing inputs are merged and stored into the Inputs variable in line 12, while the training and testing output are merged and stored under the Targets variable in line 13.
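    Putting the pieces described above together, a minimal sketch of what Code 1.1 does (not the book's verbatim listing) might look as follows, assuming the workbook is saved as Example1.xlsx in the current folder. Recent MATLAB releases recommend readmatrix over xlsread, but xlsread is kept here to match the text.

        % Minimal sketch consistent with the description of Code 1.1.
        clc              % clear the command window
        clear            % clear variables from the workspace
        close all        % close all open figure windows

        % Load the four sheets prepared in the Excel file (Fig. 1.7)
        TrainInputs  = xlsread('Example1.xlsx', 'sheet1');   % 113 x 6 training inputs
        TrainTargets = xlsread('Example1.xlsx', 'sheet2');   % 113 x 1 training targets
        TestInputs   = xlsread('Example1.xlsx', 'sheet3');   %  48 x 6 testing inputs
        TestTargets  = xlsread('Example1.xlsx', 'sheet4');   %  48 x 1 testing targets

        % Merge the training and testing samples into the total data set
        Inputs  = [TrainInputs;  TestInputs];    % 161 x 6
        Targets = [TrainTargets; TestTargets];   % 161 x 1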

    Once the training, testing, and total data have been read into the MATLAB environment, the statistical indices, including the minimum, the average, the maximum, and the standard deviation, may be calculated (Code 1.2). Code 1.2 includes four different sections that are independently discussed as Code 1.2.A (finding the minimum), Code 1.2.B (finding the mean), Code 1.2.C (finding the maximum), and Code 1.2.D (finding the standard deviation). A simple graphical definition of Code 1.2 is provided in Fig. 1.9. The statistical indices are computed for each of the training input, training output, testing input, and testing output subsets, as well as the total input and total output. This results in 24 different parameters computed by the MATLAB code. The statistical indices are computed using the built-in MATLAB functions min(x), mean(x), max(x), and std(x) where x contains the data set of interest. For example, x is TrainInputs for calculating the minimum, maximum, mean, and standard deviations of the training inputs.

    Figure 1.9 A graphical definition of Code 1.2.

    Code 1.2.A

    Code 1.2.B

    Code 1.2.C

    Code 1.2.D
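    The listings for Code 1.2.A through 1.2.D are not reproduced in this preview; a minimal sketch of the calculations they are described as performing is given below. The variable names mirror those referenced later in the text (e.g., Min_TrIn); the exact naming in the book's own listings may differ. The sketch assumes the variables created in Code 1.1 are already in the workspace.

        % Minimal sketch of Code 1.2: the four statistical indices computed
        % for the training, testing, and total inputs and outputs
        % (4 indices x 6 data groups = 24 parameters in all).
        Min_TrIn  = min(TrainInputs);    Min_TsIn  = min(TestInputs);    Min_allIn  = min(Inputs);
        Avg_TrIn  = mean(TrainInputs);   Avg_TsIn  = mean(TestInputs);   Avg_allIn  = mean(Inputs);
        Max_TrIn  = max(TrainInputs);    Max_TsIn  = max(TestInputs);    Max_allIn  = max(Inputs);
        SD_TrIn   = std(TrainInputs);    SD_TsIn   = std(TestInputs);    SD_allIn   = std(Inputs);

        Min_TrOut = min(TrainTargets);   Min_TsOut = min(TestTargets);   Min_allOut = min(Targets);
        Avg_TrOut = mean(TrainTargets);  Avg_TsOut = mean(TestTargets);  Avg_allOut = mean(Targets);
        Max_TrOut = max(TrainTargets);   Max_TsOut = max(TestTargets);   Max_allOut = max(Targets);
        SD_TrOut  = std(TrainTargets);   SD_TsOut  = std(TestTargets);   SD_allOut  = std(Targets);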

    During the fourth step, the data is prepared for plotting using Code 1.3, which is schematically represented in Fig. 1.10. For each input and output variable, the minimum, average, maximum, and standard deviation values computed for the training, testing, and total data are merged in Code 1.3. Considering line 3 of Code 1.3.A, for example, the minimum values for the training and testing subsets (i.e., Min_TrIn(1) and Min_TsIn(1)) for In1, as well as for the total data set (Min_allIn(1)), are merged into the variable Min1. Similarly, the mean, maximum, and standard deviation of In1 (i.e., input one) are merged and saved into the variables Avg1, Max1, and SD1 in lines 4 through 6, respectively. This process is repeated for the remaining input variables (i.e., In2, In3, In4, In5, In6) in Codes 1.3.B to 1.3.F and for the output variable (i.e., Out) in Code 1.3.G. Following this, the information for each input and output variable is stored in an array format in Code 1.3.H. These final variables (i.e., In1, In2, In3, In4, In5, In6, and Out) are employed in the next step to plot all of the input and output characteristics. Indeed, each of the newly generated variables (i.e., In1, In2, In3, In4, In5, In6, and Out) is a matrix that contains four rows and three columns. As seen in Fig. 1.11, each row is associated with a given data statistic (i.e., minimum, average, maximum, standard deviation), while the columns present the results for each subset of data (i.e., training, testing, total).

    Figure 1.10 Schematic of Code 1.3.

    Figure 1.11 The size and type of the stored parameters in variable
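    A minimal sketch of the preparation and plotting steps described for Code 1.3 is given below for the first input variable only; the same pattern repeats for In2 through In6 and for Out. The grouped-bar layout and the axis and legend labels are illustrative assumptions, not the book's verbatim figure code, and the sketch assumes the variables from Code 1.2 are in the workspace.

        % Minimal sketch of Code 1.3 for In1: merge the indices computed in
        % Code 1.2 into a 4 x 3 matrix (rows = Min./Avg./Max./SD, columns =
        % training/testing/total) and draw a grouped bar plot in the spirit
        % of Fig. 1.6A.
        Min1 = [Min_TrIn(1), Min_TsIn(1), Min_allIn(1)];
        Avg1 = [Avg_TrIn(1), Avg_TsIn(1), Avg_allIn(1)];
        Max1 = [Max_TrIn(1), Max_TsIn(1), Max_allIn(1)];
        SD1  = [SD_TrIn(1),  SD_TsIn(1),  SD_allIn(1)];
        In1  = [Min1; Avg1; Max1; SD1];           % 4 x 3 array for plotting

        figure
        bar(In1)                                  % grouped bars per statistic
        set(gca, 'XTickLabel', {'Min.', 'Avg.', 'Max.', 'SD'})
        legend({'Training', 'Testing', 'Total'}, 'Location', 'best')
        ylabel('Value');  title('Input 1 (In1)')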
