
Machine Learning for Subsurface Characterization
Ebook · 774 pages · 5 hours


About this ebook

Machine Learning for Subsurface Characterization develops and applies neural networks, random forests, deep learning, unsupervised learning, Bayesian frameworks, and clustering methods for subsurface characterization. Machine learning (ML) focuses on developing computational methods/algorithms that learn to recognize patterns and quantify functional relationships by processing large datasets, also referred to as "big data." Deep learning (DL) is a subset of machine learning that processes "big data" to construct numerous layers of abstraction to accomplish the learning task. DL methods do not require the manual step of extracting/engineering features; however, they require large amounts of data along with high-performance computing to obtain reliable results in a timely manner. This reference helps engineers, geophysicists, and geoscientists become familiar with data science and analytics terminology relevant to subsurface characterization and demonstrates the use of data-driven methods for outlier detection, geomechanical/electromagnetic characterization, image analysis, fluid saturation estimation, and pore-scale characterization in the subsurface.
  • Learn from 13 practical case studies using field, laboratory, and simulation data
  • Become knowledgeable with data science and analytics terminology relevant to subsurface characterization
  • Learn frameworks, concepts, and methods important for the engineer's and geoscientist's toolbox
Language: English
Release date: Oct 12, 2019
ISBN: 9780128177372
Author

Siddharth Misra

Siddharth Misra is currently associate professor at the Harold Vance Department of Petroleum Engineering, Texas A&M University, College Station, Texas. His research work is in the area of data-driven predictive models, machine learning, geosensors, and subsurface characterization. He earned a PhD in petroleum engineering from the University of Texas and a bachelor of technology in electrical engineering from the Indian Institute of Technology in Bombay. He received the Department of Energy Early Career Award in 2018 to promote geoscience research.



    Chapter 1

    Unsupervised outlier detection techniques for well logs and geophysical data

    Siddharth Misra⁎; Oghenekaro Osogba†,a; Mark Powers‡

    ⁎ Harold Vance Department of Petroleum Engineering, Texas A&M University, College Station, TX, United States

    † Texas A&M University, College Station, TX, United States

    ‡ The University of Oklahoma, Norman, OK, United States

    a Formerly at the University of Oklahoma, Norman, OK, United States

    Abstract

    Outliers in well logs and other borehole-based subsurface measurements are often due to poor borehole conditions, problems in data acquisition, irregularity in operating procedures, the presence of rare geological formations, or certain rare processes/phenomena in the subsurface. Detection of outliers is an important step prior to building a robust data-driven or machine learning-based model. We perform a comparative study of the performances of four unsupervised outlier detection techniques (ODTs) on various original and synthetic well-log datasets. The four unsupervised ODTs compared in this study are isolation forest (IF), one-class SVM (OCSVM), local outlier factor (LOF), and density-based spatial clustering of applications with noise (DBSCAN). The unsupervised ODTs are evaluated on four labeled outlier-prone validation datasets using the precision-recall curve, F1 score, area under the curve (AUC) score, and receiver operating characteristic (ROC) curve. Isolation forest is the most robust unsupervised ODT for detecting various types of outliers, whereas DBSCAN is particularly effective in detecting noise in a well-log dataset. Efficient feature engineering and feature selection are important to ensure robust detection of outliers in well logs and subsurface measurements using unsupervised outlier detection methods.

    Keywords

    Isolation forest; DBSCAN; Support vector; Local outlier factor; ROC; AUC; Precision; Recall; Outliers; Precision-recall curve

    Chapter outline

    1 Introduction

    1.1 Basic terminologies in machine learning and data-driven models

    1.2 Types of machine learning techniques

    1.3 Types of outliers

    2 Outlier detection techniques

    3 Unsupervised outlier detection techniques

    3.1 Isolation forest

    3.2 One-class SVM

    3.3 DBSCAN

    3.4 Local outlier factor

    3.5 Influence of hyperparameters on the unsupervised ODTs

    4 Comparative study of unsupervised outlier detection methods on well logs

    4.1 Description of the dataset used for the comparative study of unsupervised ODTs

    4.2 Data preprocessing

    4.3 Validation dataset

    4.4 Metrics/scores for the assessment of the performances of unsupervised ODTs on the conventional logs

    5 Performance of unsupervised ODTs on the four validation datasets

    5.1 Performance on Dataset #1 containing noisy measurements

    5.2 Performance on Dataset #2 containing measurements affected by bad holes

    5.3 Performance on Dataset #3 containing shaly layers and bad holes with noisy measurements

    5.4 Performance on Dataset #4 containing manually labeled outliers

    6 Conclusions

    Appendix A Popular methods for outlier detection

    Appendix B Confusion matrix to quantify the inlier and outlier detections by the unsupervised ODTs

    Appendix C Values of important hyperparameters of the unsupervised ODT models

    Appendix D Receiver operating characteristics (ROC) and precision-recall (PR) curves for various unsupervised ODTs on Dataset #1

    Acknowledgments

    References

    Acknowledgments

    Workflows and visualizations used in this chapter are based upon the work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division, under Award Number DE-SC-00019266.

    1 Introduction

    From a statistical standpoint, outliers are data points (samples) that are significantly different from the general trend of the dataset. From a conceptual standpoint, a sample is considered an outlier when it does not represent the behavior of the phenomenon/process as represented by most of the samples in a dataset. Outliers are indicative of issues in the data collection/measurement procedure or of unexpected events in the operation/process that generated the data. Detection and removal of outliers is an important step prior to building a robust data-driven (DD) or machine learning-based (ML) model. Outliers skew the descriptive statistics that data analysis and DD/ML methods use to build a model. A model developed on data containing outliers will not accurately represent the normal behavior of the data, because the model picks up the unrepresentative patterns exhibited by the outliers. As a result, there will be nonuniqueness in the model predictions. Data-driven models affected by outliers have lower predictive accuracy and generalization capability.

    Outlier handling refers to all the steps taken to negate the adverse effects of outliers in a dataset. After detecting the outliers in a dataset, how they are handled depends on the immediate use of the dataset. Outliers can be removed, replaced, or transformed depending on the type of dataset and its use. Outlier handling is particularly important as outliers could enhance or mask relevant statistical characteristics of the dataset. For instance, outliers in weather data could be early signs of a weather disaster; ignoring this could have catastrophic consequences. However, before considering outlier handling, we must first detect them.

    Outliers in well logs and other borehole-based subsurface measurements occur due to wellbore conditions, logging tool deployment, and physical characteristics of the geological formations. For example, washed-out zones in the wellbore and borehole rugosity significantly affect the readings of shallow-sensing logs, such as density, sonic, and photoelectric factor (PEF) logs, resulting in outlier responses. Along with wellbore conditions, uncommon beds and sudden changes in physical/chemical properties at certain depths in a formation also result in outlier behavior of the subsurface measurements. In this chapter, we perform a comparative study of the performances of four unsupervised outlier detection techniques (ODTs) on various original and synthetic well-log datasets.

    1.1 Basic terminologies in machine learning and data-driven models

    Before discussing more about outliers, the authors would like to clearly distinguish the following terms: dataset, sample, feature, and target. Data-driven (DD) and machine learning-based (ML) methods find statistical/probabilistic functions by processing a relevant dataset to either relate features to targets (referred to as supervised learning) or appropriately transform features and/or samples (referred to as unsupervised learning). A dataset is a collection of values corresponding to features and/or targets for several samples. Features are physical properties or attributes that can be measured or computed for each sample in the dataset. Targets are the observable/measurable outcomes, and the target values for a sample are consequences of certain combinations of features for that sample. For purposes of unsupervised learning, a relevant dataset is a collection of only the features for all the available samples, whereas for purposes of supervised learning, a dataset is a collection of features and the corresponding targets for all the available samples. An increase in the number of samples increases the size of the dataset, whereas an increase in the number of features increases the dimensionality of the dataset. A DD/ML model becomes more robust with an increase in the size of the dataset. However, with an increase in the dimensionality of the dataset, a model tends to overfit and becomes less generalizable, unless the increase in dimensionality is due to the addition of informative, relevant, uncorrelated features. Prior to building a DD/ML model using supervised learning, a dataset is split into training and testing datasets to ensure the model does not overfit the training dataset and generalizes well to the testing dataset. Further, the training dataset is divided into a certain number of splits to perform cross-validation, which ensures the model learns from and is evaluated on all the statistical distributions present in the training dataset. When evaluating the model on the testing dataset, it is of utmost importance to avoid any form of mixing (leakage) between the training and testing datasets, and one should select relevant evaluation metrics from the several available metrics, each of which has its own assumptions and limitations.
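The split-and-cross-validate workflow described above can be sketched with scikit-learn as follows; the feature matrix, target, and model choice here are synthetic stand-ins for illustration, not data or settings from this chapter:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # 200 samples, 5 features
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # target driven by feature 0

# Hold out a testing dataset; it is never touched during training (no leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold cross-validation on the training dataset only
model = RandomForestRegressor(n_estimators=50, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final, one-time evaluation on the untouched testing dataset
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R^2 on unseen samples
```

The key discipline the sketch encodes is that cross-validation scores come only from the training split, and the testing split is used once, at the end.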

    1.2 Types of machine learning techniques

    Machine learning (ML) models can be broadly categorized into three techniques: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning (e.g., regression and classification), a data-driven model is developed by first training the model on samples with known features/attributes and corresponding targets/outcomes from the training dataset; following that, the trained model is evaluated on the testing dataset; and finally, the data-driven model is used to predict targets/outcomes based on the features/attributes of new, unseen samples during model deployment. In unsupervised learning (e.g., clustering and transformation), a data-driven model learns to generate an outcome based on the features/attributes of samples without any prior information about the outcomes. In reinforcement learning (which tends to be very challenging), a data-driven model learns to perform a specific task by interacting with an environment and receiving rewards based on the actions the model performs toward accomplishing the task; the model learns the policy for the task by optimizing the cumulative reward obtained from the environment. These three learning techniques have several day-to-day applications. For instance, supervised learning is commonly used in spam detection: the spam detection model is trained on mails labeled as spam or not spam, and after gaining knowledge from the training dataset and subsequent evaluation on the testing dataset, the trained model can detect whether a new mail is spam. Unsupervised learning is used in marketing, where customers are categorized/segmented based on the similarity/dissimilarity of their purchasing trends; for instance, Netflix's recommendation engine uses the similarity/dissimilarity between users' viewing histories when recommending movies. Reinforcement learning was used to train DeepMind's AlphaGo to beat world champions at the game of Go; it has also been used to train chess-playing engines, where the model is penalized for moves that lead to losing a piece and rewarded for moves that lead to a checkmate.
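The supervised/unsupervised distinction can be illustrated with a toy scikit-learn sketch; the two-cluster data and labels below are invented for illustration and stand in for problems like spam detection and customer segmentation, not for any system named above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# --- Supervised learning: features AND labeled targets are available ---
X = np.vstack([rng.normal(0, 1, (50, 2)),    # class 0, e.g., "not spam"
               rng.normal(5, 1, (50, 2))])   # class 1, e.g., "spam"
y = np.array([0] * 50 + [1] * 50)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = clf.predict([[5.1, 4.9]])             # label for a new, unseen sample

# --- Unsupervised learning: only features, no targets ---
# KMeans groups similar samples (e.g., customer segments) without any labels
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
```

The classifier needs `y` to learn; KMeans never sees it and discovers the two groups from the feature values alone.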

    A machine learning method first processes the training dataset to build a data-driven model; following that, the performance of the newly developed model is evaluated against the testing dataset. After confirming the accuracy and precision of the data-driven model on the testing dataset, the model is deployed on the new dataset. These three types of dataset, namely, the training, testing, and new datasets, comprise measurements of certain specific features for numerous samples. The training and testing datasets, when used in supervised learning, contain additional measurements of the targets/outcomes. A supervised learning technique tries to functionally relate the features to the targets for all the samples in the dataset. In contrast, for unsupervised learning, the data-driven model development takes place without the targets; in other words, there are no targets to be considered during the training and testing stages of unsupervised learning. Information about the targets is never available in the new dataset, because the trained models are deployed on the new dataset precisely to compute the desired targets or certain outcomes.

    1.3 Types of outliers

    In the context of this work, outliers can be broadly categorized into three types: point/global, contextual, and collective outliers [1]. Point/global outliers refer to individual data points or samples that significantly deviate from the overall distribution of the entire dataset or from the distribution of a certain combination of features. These outliers exist at the tail end of a distribution and vary largely from the mean of the distribution, generally lying beyond 2 standard deviations away from the mean; for example, subsurface depths where porosity is > 40 porosity units or permeability is > 5 Darcy should be considered point/global outliers. From an event perspective, a house getting hit by a meteorite is an example of a point outlier. The second category is the contextual/conditional outliers, which deviate significantly from the data points within a specific context; for example, a large gamma ray reading in sandstone due to an increase in potassium-rich minerals (feldspar). Snow in summer is an example of a contextual outlier. Points labeled as contextual outliers are valid outliers only for a specific context; a change in the context will result in a similar point being considered an inlier. Collective outliers are a small cluster of data points that as a whole deviate significantly from the entire dataset; for example, log measurements from regions affected by borehole washout. Similarly, it is not rare that people move from one residence to the next; however, when an entire neighborhood relocates at the same time, it will be considered a collective outlier. Contextual and collective outliers need a domain expert to guide the outlier detection.
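A minimal sketch of the "beyond 2 standard deviations" rule for point/global outliers mentioned above; the porosity values (in porosity units) are hypothetical:

```python
import numpy as np

# Hypothetical porosity log samples; 45.0 is the planted point/global outlier
porosity = np.array([12.0, 14.5, 13.2, 15.1, 11.8, 13.9, 45.0, 12.6])

mean, std = porosity.mean(), porosity.std()
# Flag samples lying more than 2 standard deviations from the mean
is_outlier = np.abs(porosity - mean) > 2 * std
outliers = porosity[is_outlier]
```

Note that contextual and collective outliers cannot be caught by such a one-feature threshold, which is why they require domain expertise and the multivariate methods discussed later in this chapter.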

    2 Outlier detection techniques

    An outlier detection technique (ODT) is used to detect anomalous observations/samples that do not fit the typical/normal statistical distribution of a dataset. Simple methods for outlier detection use statistical tools, such as the boxplot and the Z-score, on each individual feature of the dataset. A boxplot is a standardized way of representing the distributions of samples corresponding to various features using boxes and whiskers. The box represents the interquartile range (IQR) of the data, and the whiskers extend a multiple of the IQR (typically 1.5 × IQR) beyond the first and third quartiles; any data point/sample outside these limits is considered an outlier. The next simple statistical tool for feature-specific outlier detection is the Z-score, which indicates how far the value of a data point/sample is from the mean of a specific feature. A Z-score of 1 means the sample point is 1 standard deviation away from the mean. Typically, Z-score values greater than +3 or less than −3 are considered outliers. The Z-score is expressed as Z = (x − μ)/σ, where x is the value of the sample for the feature, μ is the mean of the feature, and σ is its standard deviation.
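The two feature-specific rules described above, the boxplot (1.5 × IQR whisker) rule and the |Z| > 3 rule, can be sketched as follows on a synthetic feature with one planted outlier:

```python
import numpy as np

rng = np.random.default_rng(2)
# 100 well-behaved values near 10, plus one planted outlier at 25.0 (last entry)
x = np.concatenate([rng.normal(10.0, 0.5, 100), [25.0]])

# Boxplot rule: flag points lying more than 1.5 x IQR outside the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Z-score rule: flag points with |Z| = |x - mean| / std greater than 3
z = (x - x.mean()) / x.std()
z_outlier = np.abs(z) > 3
```

One caveat worth noting: the planted outlier inflates the very mean and standard deviation it is judged against, so with few samples or many outliers the Z-score rule can miss extreme points, which motivates the more robust unsupervised ODTs compared in this chapter.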
