
Data Science for Genomics
Ebook · 793 pages · 6 hours


About this ebook

Data Science for Genomics presents the foundational concepts of data science as they pertain to genomics, encompassing the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making. Sections cover Data Science, Machine Learning, Deep Learning, data analysis, and visualization techniques. The authors then present the fundamentals of Genomics, Genetics, Transcriptomes and Proteomes as basic concepts of molecular biology, along with DNA and key features of the human genome, as well as the genomes of eukaryotes and prokaryotes.

Techniques that are more specifically used for studying genomes are then described in the order in which they are used in a genome project, including methods for constructing genetic and physical maps, DNA sequencing methodology, the strategies used to assemble a contiguous genome sequence, and methods for identifying genes in a genome sequence and determining their functions in the cell. Readers will learn how the information contained in the genome is released and made available to the cell, as well as methods centered on cloning and PCR.

  • Provides a detailed explanation of data science concepts, methods and algorithms, all reinforced by practical examples that are applied to genomics
  • Presents a roadmap of future trends suitable for innovative Data Science research and practice
  • Includes topics such as blockchain technology for securing data on the end-user and server sides
  • Presents real world case studies, open issues and challenges faced in Genomics, including future research directions and a separate chapter for Ethical Concerns
Language: English
Release date: Nov 27, 2022
ISBN: 9780323985765


    Book preview

    Data Science for Genomics - Amit Kumar Tyagi

    Chapter 1: Genomics and neural networks in electrical load forecasting with computational intelligence

    Prasannavenkatesan Theerthagiri, Department of Computer Science and Engineering, GITAM School of Technology, GITAM Deemed to be University, Bengaluru, India

    Abstract

    Background: Electrical load forecasting plays an important role in electric utilities for planning the generation, transmission, and distribution systems. Other applications of load forecasting include maintaining the supply and demand of electricity, determining the resources required to operate a power plant, spinning reserve planning, generating unit scheduling, and other applications in power systems. Methodology: In this paper, we have used well-defined genomics learning methods, namely the recurrent neural network (RNN) and long short-term memory (LSTM), for predicting future electrical load. Paschim Gujarat Vij Company Limited (PGVCL, India) and New York Independent System Operator (NYISO, USA) historical load data are used to check the effectiveness of the applied methods. Results/Conclusion: The applied methods achieved higher accuracy than the compared algorithms, and their error metrics, namely mean absolute percentage error and root mean squared error, were lower.

    Keywords

    Genomics; Load forecasting; Long short-term memory (LSTM); Machine learning; Recurrent neural network (RNN)

    1. Introduction

    Load forecasting is the procedure of predicting future electricity demand from historical data so that electric utilities can manage electricity generation and demand. In the present scenario, load forecasting is an essential task in a smart grid. The smart grid is an electrical grid that uses computers, digital technologies, or other advanced technologies for real-time monitoring, for maintaining generation and demand, and for acting on particular information (such as the behavior of electric utilities or consumers) to improve efficiency, reliability, sustainability, and economics [1]. Load forecasting plays an important role in fulfilling the applications of a smart grid. A smart grid involves various modes of forecasting: load forecasting, price forecasting, solar-based electricity generation forecasting, and wind-based electricity generation forecasting. Load forecasting is classified into four categories [2–4]: (i) very short-term, (ii) short-term, (iii) mid-term, and (iv) long-term load forecasting. This paper focuses on short-term load forecasting. As the demand for electricity increases, very short-term and short-term load forecasting help provide additional security, reliability, and protection to smart grids. They are also useful for energy efficiency, electricity pricing, market design, demand-side management, matching generation and demand, and unit commitment [5]. Machine learning can accurately predict the electrical load to fulfill the needs of smart grids.

    The well-defined long short-term memory (LSTM) and recurrent neural network (RNN) methods have been used in many papers for load forecasting, and these methods have been hybridized to improve predictions. A review of the RNN and LSTM methods used for load forecasting follows. In paper [6], the author applied an LSTM RNN to nonresidential energy consumption forecasting. The real-time energy consumption data is from South China and contains multiple sequences of 48 nonresidential consumers' energy consumption. The measured data is in kilowatts and was collected from advanced metering infrastructure (AMI) with a sampling interval of 15 min. Prediction accuracy is calculated with the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) metrics. In paper [7], the author applied an RNN-LSTM neural network to long-term load forecasting. Real-time ISO New England load data is used for 5-year load prediction, and MAPE is used to assess the forecasts. Year-wise and season-wise MAPE is calculated; the majority of MAPE values are below 5%, and none exceed 8%.

    In paper [8], the author notes that multisequence LSTM has become an attractive approach for load prediction because of the increasing volume and variety of smart meters, automation systems, and other data sources in smart grids. For energy load forecasting, the multisequence LSTM, LSTM-genetic algorithm (GA), LSTM-particle swarm optimization (PSO), random forest, support vector machine (SVM), artificial neural network (ANN), and extra-trees regressor methods are used and compared using RMSE and MAE. The load data was obtained from Réseau de Transport d'Électricité (RTE), the French electricity transmission network operator. In paper [9], the author used LSTM for power demand forecasting and compared the LSTM predictions with gradient boosted trees (GBT) and support vector regression (SVR). The LSTM gives better predictions than GBT and SVR, decreasing MSE by 21.80% and 28.57%, respectively. Time series features, weather features, and calendar features are considered for forecasting. The University of Massachusetts provided the power data, and model accuracy is evaluated using MSE and MAPE.

    In paper [10], electricity consumption prediction is carried out for residential and commercial buildings using a deep recurrent neural network (RNN) model. Residential electricity consumption data from Austin, Texas, is used for mid-term to long-term forecasting, and commercial-building electricity consumption data from Salt Lake City, Utah, is used for prediction. For commercial buildings, the RNN performs better than a multilayer perceptron model. In paper [11], the author used the LSTM method for power load forecasting on the EUNITE real power load data. Next-hour and next-half-day predictions were made using a single-point LSTM forecasting model and a multiple-point LSTM forecasting model, with accuracy calculated using MAPE. The single-point forecasting model performs better than the multiple-point model.

    In paper [12], the author applied an RNN to next-24-h load prediction and compared the RNN results with a back-propagation neural network. In paper [13], the author used deep RNN (DRNN), DRNN-gated recurrent unit (GRU), DRNN-LSTM, multilayer perceptron (MLP), autoregressive integrated moving average (ARIMA), SVM, and MLR methods for load demand forecasting, using residential load data from Austin, Texas, USA; the methods were evaluated using MAE, RMSE, and MAPE. In paper [14], the author used RNN, LSTM, and GBT for wind power forecasting. Using wind velocity data from Kolkata, India, the wind power output was forecast, and accuracy was calculated using MAE, MAPE, MSE, and RMSE.

    In paper [15], the author used LSTM for short-term load forecasting. Here, 24-h, 48-h, 7-day, and 30-day-ahead predictions were made and compared with the actual load, and LSTM accuracy was tested using RMSE and MAPE. In paper [16], the author made long-term energy consumption predictions using LSTM on real-time industrial data. The LSTM results were compared with ARMA, ARFIMA, and BPNN predictions, and LSTM performed best; MAE, MAPE, MSE, and RMSE were used to evaluate the methods' accuracy.

    The contribution of this paper is to accurately forecast the load using well-defined machine learning methods. Two real-time load datasets from different zones are used for prediction: the first from Paschim Gujarat Vij Company Ltd. (PGVCL), India, and the second from NYISO, USA. For both datasets, the well-defined machine learning methods RNN and LSTM are applied for load prediction, and the accuracy of the forecasted load is calculated using root mean squared error and mean absolute percentage error. Further, the machine learning results are compared with time series model predictions, with the aim of achieving better predictions than the time series models; in most cases the machine learning methods work excellently. The time series model results are taken from Ref. [17]; this paper is an extension of that work.

    The rest of the paper is organized as follows. Section 2 explains the applied machine learning methods, i.e., RNN and LSTM. Section 3 presents the prediction results of the applied methods for both load datasets. Section 4 concludes the paper.

    2. Methodology

    2.1. RNN

    The concept of the RNN was introduced to process sequence data and to recognize patterns in sequences. The motivation for the RNN is that a feed-forward network fails to predict the next value in a sequence, or predicts it poorly, because each new output has no relation to previous outputs. Let us see how the RNN solves this problem. Fig. 1.1 illustrates the generalized representation of an RNN, in which there is a loop through which information flows from the previous timestamp to the next. For a better understanding, Fig. 1.2 shows the unrolling of the generalized form of the RNN in Fig. 1.1 [18].

    From Fig. 1.2, we have the input at t−1, which is fed to the network to produce the output at t−1. At the next time stamp t, the input at time t is given to the network along with the information from the previous timestamp t−1, which helps us obtain the output at t. Similarly, for the output at t+1, there are two inputs: the new input at t+1 fed to the network, and the information coming from the previous time stamp t. Likewise, it can go on [19]. Fig. 1.3 shows the mathematical structure of the RNN, from which two generalized equations can be written as follows:

    h_t = g_h(w_i ∗ x_t + W_R ∗ h_{t−1} + b_h)   (1.1)

    y_t = g_y(w_y ∗ h_t + b_y)   (1.2)

    where w_i is the input weight matrix, w_y is the output weight matrix, W_R is the hidden layer weight matrix, g_h and g_y are activation functions, and b_h and b_y are the biases. Eqs. (1.1) and (1.2) are used to calculate the h_0, h_1, h_2, … and y_0, y_1, y_2, … values shown in Fig. 1.3.

    Figure 1.1  Representation of RNN.

    Figure 1.2  Unrolling of RNN.

    Figure 1.3  Mathematical representation of RNN.

    For calculating h_0 and y_0, let us consider time t equal to zero (i.e., t = 0); at t = 0 the input is x_0. Substituting t = 0 and the input x_0 into Eqs. (1.1) and (1.2), we get

    h_0 = g_h(w_i ∗ x_0 + W_R ∗ h_{−1} + b_h)   (1.3)

    But in Eq. (1.3) the term W_R ∗ h_{−1} cannot be applied because time can never be negative, so Eq. (1.3) is rewritten, with the corresponding output given by Eq. (1.2):

    h_0 = g_h(w_i ∗ x_0 + b_h)   (1.4)

    y_0 = g_y(w_y ∗ h_0 + b_y)   (1.5)

    From Eqs. (1.4) and (1.5), we can calculate h_0 and y_0. Now, let us consider t = 1 and the input x_1 at t = 1 for calculating h_1 and y_1. Substituting t = 1 and the input into Eqs. (1.1) and (1.2), we get

    h_1 = g_h(w_i ∗ x_1 + W_R ∗ h_0 + b_h)   (1.6)

    y_1 = g_y(w_y ∗ h_1 + b_y)   (1.7)

    From Eqs. (1.6) and (1.7), we can find h_1 and y_1. Similarly, for input x_2 at t = 2, we can calculate h_2 and y_2. Substituting these values into Eqs. (1.1) and (1.2), we get

    h_2 = g_h(w_i ∗ x_2 + W_R ∗ h_1 + b_h)   (1.8)

    y_2 = g_y(w_y ∗ h_2 + b_y)   (1.9)

    From Eqs. (1.8) and (1.9), we can calculate h_2 and y_2. Likewise, the recursion continues up to n periods of time. This is how the RNN works mathematically. This method is explained by referring to various sources [11,18,19].
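    To make the recursion concrete, the following is a minimal NumPy sketch of the unrolled forward pass in Eqs. (1.1) and (1.2). The choice of tanh for g_h, an identity output activation for g_y, and all array shapes are illustrative assumptions, not details taken from the chapter.

```python
import numpy as np

def rnn_forward(x_seq, w_i, W_R, w_y, b_h, b_y):
    """Unrolled RNN forward pass per Eqs. (1.1) and (1.2).

    x_seq has shape (T, n_in): one input vector per time stamp.
    Assumes g_h = tanh and g_y = identity (illustrative choices).
    """
    h = np.zeros(W_R.shape[0])  # no h_{-1} term at t = 0, per Eq. (1.4)
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(w_i @ x_t + W_R @ h + b_h)  # Eq. (1.1)
        y = w_y @ h + b_y                       # Eq. (1.2)
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Toy usage with random weights (all sizes hypothetical)
rng = np.random.default_rng(0)
T, n_in, n_h = 5, 3, 4
hs, ys = rnn_forward(
    rng.normal(size=(T, n_in)),   # x_0 ... x_{T-1}
    rng.normal(size=(n_h, n_in)), # w_i
    rng.normal(size=(n_h, n_h)),  # W_R
    rng.normal(size=(1, n_h)),    # w_y
    np.zeros(n_h),                # b_h
    np.zeros(1),                  # b_y
)
```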

    2.2. Long short-term memory

    The LSTM neural network is a special type of RNN, proposed by Hochreiter and Schmidhuber [20,21], that can solve various problems faced by RNNs. As sequence length increases, the problems faced by an RNN are vanishing gradients, limited storage, limited memory, and short-term memory. The LSTM structure contains a cell state and three different gates, which effectively solve these problems. The cell state carries relevant information throughout the processing of the network and acts as its memory. Because of the cell state, values from earlier time stamps can be used at later time stamps, so the LSTM reduces the effect of short-term memory. The various gates in the LSTM are responsible for adding or removing information in the cell state; during training, the gates learn what information to keep or forget, and thereby regulate the flow of information in the network. Fig. 1.4 illustrates a single LSTM cell, i.e., the internal layout of the LSTM. The LSTM has a chain-type layout similar to the RNN; the only difference is the internal structure and the way the hidden state (h_t) is calculated. The hidden state is passed from one cell to the next in a chain. An internal RNN cell has only a tanh activation, but, as Fig. 1.4 shows, the LSTM has a complex internal cell; in Fig. 1.4, σ is the sigmoid activation.

    Figure 1.4  LSTM cell.

    For understanding the mathematics behind the LSTM and how its hidden state is calculated, the forget gate, input gate, cell state, and output gate are split into different parts, shown in Fig. 1.5A–D, respectively. Before going to the mathematical equations, let us see the function of the tanh and sigmoid activation layers. The values flowing through the LSTM network are regulated with the help of the tanh activation, which squishes (lessens) values to between −1 and 1. The sigmoid activation has a similar function, except that it squishes values to between 0 and 1. For the vector that comes out of the sigmoid activation, values closer to 0 are completely forgotten, and values closer to 1 are kept in the network or cell state.
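    As a quick numeric illustration of these squashing ranges (a minimal sketch, not code from the chapter), the snippet below evaluates both activations at a few points:

```python
import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # approx. [0.0067, 0.5, 0.9933]
print(np.tanh(z))   # approx. [-0.9999, 0.0, 0.9999]
```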

    The forget gate is the first step in the LSTM. This gate decides which information should be kept and which removed from the cell state or network. From Fig. 1.5A, the mathematical representation of the forget gate is

    f_t = σ(w_f ∗ [h_{t−1}, x_t] + b_f)   (1.10)

    In Eq. (1.10), σ is the sigmoid activation, w_f is the weight, h_{t−1} is the output from the previous time stamp, x_t is the new input, and b_f is the bias. Per Fig. 1.5A and Eq. (1.10), to calculate f_t, the previous hidden state h_{t−1} and the new input x_t are combined and multiplied by the weight; after the bias is added, the result is passed through the sigmoid activation. The sigmoid activation squishes the values to between 0 and 1: values nearer to 0 are discarded, and values nearer to 1 are kept.

    The next step is the input gate, which updates the values of the cell state. To update the cell state, the previous output (h_{t−1}) and the present input are passed through the sigmoid activation, which converts the values to between 0 and 1; from this we know which values should be updated. The output of the sigmoid activation is i_t. Further, the previous output and present input are passed through the tanh activation, which squishes the values to between −1 and 1 to regulate the network [22]; its output is Č_t. From Fig. 1.5B, the mathematical representation of the input gate is

    i_t = σ(w_i ∗ [h_{t−1}, x_t] + b_i)   (1.11)

    Č_t = tanh(w_c ∗ [h_{t−1}, x_t] + b_c)   (1.12)

    The next step is to update the old cell state c_{t−1} into the new cell state c_t. First, the old cell state is multiplied by f_t; since the vector f_t has values between 0 and 1, old cell state values multiplied by values near 0 are dropped. Next, the sigmoid activation output (i_t) and the tanh activation output (Č_t) are multiplied; here the sigmoid output decides what to keep and what to remove, since i_t has vector values between 0 and 1. Finally, pointwise addition gives the new cell state, as shown in Fig. 1.5C. The mathematical equation is written as

    c_t = f_t ∗ c_{t−1} + i_t ∗ Č_t   (1.13)

    Figure 1.5  Various gates and cell states are split from LSTM cell to understand the mathematics behind it: (A) forget gate, (B) input gate, (C) cell state, and (D) output gate.

    The last step is the output gate, in which the hidden state (h_t) is calculated; this hidden state is passed forward to the next time stamp (next cell). The hidden state is used for prediction, and it carries information about previous inputs. To find the hidden state, first the previous hidden state (h_{t−1}) and the present input are passed through the sigmoid activation to get o_t. The new cell state (c_t) is then passed through the tanh activation, and the tanh output is multiplied by the sigmoid output o_t to get the new hidden state h_t, as shown in Fig. 1.5D. The mathematical equations are written as

    o_t = σ(w_o ∗ [h_{t−1}, x_t] + b_o)   (1.14)

    h_t = o_t ∗ tanh(c_t)   (1.15)

    Further, the hidden state h_t and the new cell state c_t are carried over to the next time stamp. This method is explained by referring to various sources [13,23,24].
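    As a minimal sketch of one LSTM cell step following Eqs. (1.10)–(1.15), consider the NumPy code below; the dictionary layout of the per-gate weights and the concatenation of [h_{t−1}, x_t] into a single vector are illustrative assumptions rather than details given in the chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step per Eqs. (1.10)-(1.15).

    W['f'], W['i'], W['c'], W['o'] each map the concatenated
    [h_{t-1}, x_t] vector to the hidden size (layout assumed).
    """
    z = np.concatenate([h_prev, x_t])      # combine h_{t-1} and x_t
    f_t = sigmoid(W['f'] @ z + b['f'])     # forget gate, Eq. (1.10)
    i_t = sigmoid(W['i'] @ z + b['i'])     # input gate, Eq. (1.11)
    c_hat = np.tanh(W['c'] @ z + b['c'])   # candidate values, Eq. (1.12)
    c_t = f_t * c_prev + i_t * c_hat       # cell state update, Eq. (1.13)
    o_t = sigmoid(W['o'] @ z + b['o'])     # output gate, Eq. (1.14)
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq. (1.15)
    return h_t, c_t                        # carried to the next time stamp
```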

    3. Experiment evaluation

    3.1. Testing the methods' effectiveness on PGVCL data

    For the PGVCL load dataset, short-term load forecasting was carried out; i.e., day-ahead and week-ahead predictions were made using RNN and LSTM. The actual observed data provided by PGVCL spans April 1, 2015 to March 31, 2019 (approximately 4 years) at an hourly resolution; i.e., one point was observed at each hour of the day. Fig. 1.6 shows the real-time load observed by PGVCL [25].

    For day-ahead forecasting, the methods' effectiveness was checked for March 31, 2019 (24 h). The historical load data from April 1, 2015 to March 30, 2019 forms the training set, and the March 31, 2019 data forms the testing set; using the training set, the prediction for March 31, 2019 is made. Likewise, for week-ahead forecasting, effectiveness was checked for March 25, 2019 to March 31, 2019 (each hour over 1 week): the historical load data from April 1, 2015 to March 24, 2019 forms the training set, and the March 25, 2019 to March 31, 2019 data forms the testing set; using the training set, the prediction for March 25–31, 2019 is made. Fig. 1.7 illustrates the comparison between the actual PGVCL load data and the day-ahead load predicted by RNN and LSTM.
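    As an illustration of these train/test splits, here is a hedged pandas sketch; the file name and column names are hypothetical, since the chapter does not specify how the PGVCL data is stored.

```python
import pandas as pd

# Hypothetical file and column names for the hourly PGVCL load data
df = pd.read_csv("pgvcl_hourly_load.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Day-ahead: train through March 30, 2019; test on March 31, 2019
train_day = df.loc["2015-04-01":"2019-03-30"]
test_day = df.loc["2019-03-31"]

# Week-ahead: train through March 24, 2019; test on the final week
train_week = df.loc["2015-04-01":"2019-03-24"]
test_week = df.loc["2019-03-25":"2019-03-31"]
```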

    The load predicted by RNN and LSTM is further compared with the time series model predictions, as shown in Table 1.1; the time series results are taken from Ref. [17]. In this paper, we tried to achieve better predictions with RNN and LSTM and examined how well the machine learning methods work on PGVCL load data. Per Table 1.1, the AR(25) model gives a better day-ahead prediction than the machine learning methods (RNN and LSTM): the AR(25) model achieves approximately 99% accuracy (1.92% MAPE) with a measured error of 95.78 MW, while RNN achieves approximately 97% accuracy (2.77% MAPE) with 148.83 MW measured error, and LSTM achieves approximately 97% accuracy (2.85% MAPE) with 153.38 MW measured error.

    Fig. 1.8 illustrates the comparison between the actual PGVCL load data and the week-ahead load predicted by RNN and LSTM. This predicted load is further compared with the time series model predictions, as shown in Table 1.2. For week-ahead forecasting too, we tried to achieve better predictions with RNN and LSTM than the time series models and examined how well the machine learning methods work on PGVCL load data for weekly prediction. Per Table 1.2, RNN gives a better week-ahead prediction than the time series models: RNN achieves approximately 97% accuracy (2.74% MAPE) with 147.94 MW measured error, and LSTM also works well for week-ahead prediction, achieving approximately 97% accuracy (2.77% MAPE) with 148.35 MW measured error. Both RNN and LSTM outperform the time series models for week-ahead prediction.
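    For reference, the two error metrics reported above can be computed as follows; this is a sketch based on the standard definitions of MAPE and RMSE, not code from the chapter.

```python
import numpy as np

def mape(actual, predicted):
    # mean absolute percentage error, in percent
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

def rmse(actual, predicted):
    # root mean squared error, in the units of the load data (MW here)
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```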

    Figure 1.6  Observed PGVCL load data set from April 1, 2015 to March 31, 2019.

    Figure 1.7  Comparison of RNN and LSTM prediction results for 1 day with actual PGVCL load.

    Figure 1.8  Comparison of RNN and LSTM prediction results for 1 week with actual PGVCL load.

    Table 1.1
