Machine Learning for Subsurface Characterization
By Siddharth Misra, Hao Li and Jiabo He
About this ebook
- Learn from 13 practical case studies using field, laboratory, and simulation data
- Become knowledgeable with data science and analytics terminology relevant to subsurface characterization
- Learn frameworks, concepts, and methods important for the engineer’s and geoscientist’s toolbox
Siddharth Misra
Siddharth Misra is currently an associate professor at the Harold Vance Department of Petroleum Engineering, Texas A&M University, College Station, Texas. His research is in the area of data-driven predictive models, machine learning, geosensors, and subsurface characterization. He earned a PhD in petroleum engineering from the University of Texas and a bachelor of technology in electrical engineering from the Indian Institute of Technology Bombay. He received the Department of Energy Early Career Award in 2018 to promote geoscience research.
Book preview
Chapter 1
Unsupervised outlier detection techniques for well logs and geophysical data
Siddharth Misra⁎; Oghenekaro Osogba†,a; Mark Powers‡
⁎ Harold Vance Department of Petroleum Engineering, Texas A&M University, College Station, TX, United States
† Texas A&M University, College Station, TX, United States
‡ The University of Oklahoma, Norman, OK, United States
a Formerly at the University of Oklahoma, Norman, OK, United States
Abstract
Outliers in well logs and other borehole-based subsurface measurements are often due to poor borehole conditions, problems in data acquisition, irregularities in operating procedures, the presence of rare geological formations, or certain rare processes/phenomena in the subsurface. Detection of outliers is an important step prior to building a robust data-driven or machine learning-based model. We perform a comparative study of the performances of four unsupervised outlier detection techniques (ODTs) on various original and synthetic well-log datasets. The four unsupervised ODTs compared in this study are isolation forest (IF), one-class SVM (OCSVM), local outlier factor (LOF), and density-based spatial clustering of applications with noise (DBSCAN). The unsupervised ODTs are evaluated on four labeled outlier-prone validation datasets using the precision-recall curve, F1 score, area under the curve (AUC) score, and receiver operating characteristic (ROC) curve. Isolation forest is the most robust unsupervised ODT for detecting various types of outliers, whereas DBSCAN is particularly effective in detecting noise in a well-log dataset. Efficient feature engineering and feature selection are important to ensure robust detection of outliers in well-log and subsurface measurements using unsupervised outlier detection methods.
Keywords
Isolation forest; DBSCAN; Support vector; Local outlier factor; ROC; AUC; Precision; Recall; Outliers; Precision-recall curve
Chapter outline
1 Introduction
1.1 Basic terminologies in machine learning and data-driven models
1.2 Types of machine learning techniques
1.3 Types of outliers
2 Outlier detection techniques
3 Unsupervised outlier detection techniques
3.1 Isolation forest
3.2 One-class SVM
3.3 DBSCAN
3.4 Local outlier factor
3.5 Influence of hyperparameters on the unsupervised ODTs
4 Comparative study of unsupervised outlier detection methods on well logs
4.1 Description of the dataset used for the comparative study of unsupervised ODTs
4.2 Data preprocessing
4.3 Validation dataset
4.4 Metrics/scores for the assessment of the performances of unsupervised ODTs on the conventional logs
5 Performance of unsupervised ODTs on the four validation datasets
5.1 Performance on Dataset #1 containing noisy measurements
5.2 Performance on Dataset #2 containing measurements affected by bad holes
5.3 Performance on Dataset #3 containing shaly layers and bad holes with noisy measurements
5.4 Performance on Dataset #4 containing manually labeled outliers
6 Conclusions
Appendix A Popular methods for outlier detection
Appendix B Confusion matrix to quantify the inlier and outlier detections by the unsupervised ODTs
Appendix C Values of important hyperparameters of the unsupervised ODT models
Appendix D Receiver operating characteristics (ROC) and precision-recall (PR) curves for various unsupervised ODTs on Dataset #1
Acknowledgments
References
Acknowledgments
Workflows and visualizations used in this chapter are based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division, under Award Number DE-SC-00019266.
1 Introduction
From a statistical standpoint, outliers are data points (samples) that deviate significantly from the general trend of the dataset. From a conceptual standpoint, a sample is considered an outlier when it does not represent the behavior of the phenomenon/process represented by most of the samples in a dataset. Outliers are indicative of issues in the data collection/measurement procedure or of unexpected events in the operation/process that generated the data. Detection and removal of outliers is an important step prior to building a robust data-driven (DD) or machine learning-based (ML) model. Outliers skew the descriptive statistics that data analysis and machine learning methods use to build a data-driven model. A model developed on data containing outliers will not accurately represent the normal behavior of the data, because the model picks up the unrepresentative patterns exhibited by the outliers. As a result, there will be nonuniqueness in the model predictions. Data-driven models affected by outliers have lower predictive accuracy and generalization capability.
Outlier handling refers to all the steps taken to negate the adverse effects of outliers in a dataset. After detecting the outliers in a dataset, how they are handled depends on the immediate use of the dataset. Outliers can be removed, replaced, or transformed depending on the type of dataset and its use. Outlier handling is particularly important as outliers could enhance or mask relevant statistical characteristics of the dataset. For instance, outliers in weather data could be early signs of a weather disaster; ignoring this could have catastrophic consequences. However, before considering outlier handling, we must first detect them.
Outliers in well logs and other borehole-based subsurface measurements occur due to wellbore conditions, logging tool deployment, and the physical characteristics of the geological formations. For example, washed-out zones in the wellbore and borehole rugosity significantly affect the readings of shallow-sensing logs, such as density, sonic, and photoelectric factor (PEF) logs, resulting in outlier responses. Along with wellbore conditions, uncommon beds and sudden changes in physical/chemical properties at certain depths in a formation also result in outlier behavior of the subsurface measurements. In this chapter, we perform a comparative study of the performances of four unsupervised outlier detection techniques (ODTs) on various original and synthetic well-log datasets.
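All four unsupervised ODTs compared in this chapter are available in scikit-learn. The sketch below shows how they might be instantiated on a small synthetic two-feature dataset; the feature values, injected outliers, and hyperparameter settings are illustrative assumptions, not the data or settings used in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Synthetic stand-in for two well-log features (e.g., bulk density and
# gamma ray); real inputs would be depth-indexed log measurements.
inliers = rng.normal(loc=[2.5, 60.0], scale=[0.05, 10.0], size=(200, 2))
outliers = np.array([[1.8, 150.0], [3.2, 10.0], [2.0, 200.0]])
X = np.vstack([inliers, outliers])

# Standardize features so no single log dominates the distance metric.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Each detector labels inliers +1 and outliers -1 (DBSCAN instead
# labels noise points -1 and cluster members 0, 1, ...).
if_labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(Xs)
svm_labels = OneClassSVM(nu=0.02, gamma="scale").fit_predict(Xs)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(Xs)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit(Xs).labels_
```

Note that each technique exposes different hyperparameters (e.g., `contamination`, `nu`, `eps`), whose influence on detection is discussed in Section 3.5.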
1.1 Basic terminologies in machine learning and data-driven models
Before discussing outliers further, the authors would like to clearly distinguish the following terms: dataset, sample, feature, and target. Data-driven (DD) and machine learning-based (ML) methods find statistical/probabilistic functions by processing a relevant dataset either to relate features to targets (referred to as supervised learning) or to appropriately transform features and/or samples (referred to as unsupervised learning). A dataset is a collection of values of features and/or targets for several samples. Features are physical properties or attributes that can be measured or computed for each sample in the dataset. Targets are the observable/measurable outcomes, and the target values for a sample are consequences of certain combinations of features for that sample. For purposes of unsupervised learning, a relevant dataset is a collection of only the features for all the available samples, whereas for purposes of supervised learning, a dataset is a collection of the features and corresponding targets for all the available samples. An increase in the number of samples increases the size of the dataset, whereas an increase in the number of features increases the dimensionality of the dataset. A DD/ML model becomes more robust with an increase in the size of the dataset. However, with an increase in the dimensionality of the dataset, a model tends to overfit and becomes less generalizable, unless the increase in dimensionality is due to the addition of informative, relevant, uncorrelated features. Prior to building the DD/ML model using supervised learning, a dataset is split into training and testing datasets to ensure the model does not overfit the training dataset and generalizes well to the testing dataset.
Further, the training dataset is divided into a certain number of splits to perform cross validation, which ensures the model learns from and is evaluated on all the statistical distributions present in the training dataset. When evaluating the model on the testing dataset, it is of utmost importance to avoid any form of mixing (leakage) between the training and testing datasets. Also, when evaluating the model on the testing dataset, one should select relevant evaluation metrics from among the several available metrics, each with its own assumptions and limitations.
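The split-then-cross-validate workflow described above can be sketched with scikit-learn; the array sizes, fold count, and random data below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features
y = rng.normal(size=100)       # one target value per sample

# Hold out a testing dataset that is never touched during training,
# preventing any leakage between training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Cross validation operates on the training dataset only: each of the
# 5 splits is held out once for evaluation while the model learns from
# the remaining 4 splits.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_train)]
```

Because the folds are drawn from `X_train` alone, the held-out `X_test` plays no role in model selection, which is exactly the leakage safeguard described above.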
1.2 Types of machine learning techniques
Machine learning (ML) models can be broadly categorized into three techniques: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning (e.g., regression and classification), a data-driven model is developed by first training the model on samples with known features/attributes and corresponding targets/outcomes from the training dataset; following that, the trained model is evaluated on the testing dataset; and finally, the data-driven model is used during deployment to predict targets/outcomes based on the features/attributes of new, unseen samples. In unsupervised learning (e.g., clustering and transformation), a data-driven model learns to generate an outcome based on the features/attributes of samples without any prior information about the outcomes. In reinforcement learning (which tends to be very challenging), a data-driven model learns to perform a specific task by interacting with an environment and receiving rewards based on the actions the model performs toward accomplishing the task; the model learns the policy for the task by optimizing the cumulative reward obtained from the environment. These three learning techniques have several day-to-day applications. For instance, supervised learning is commonly used in spam detection: a spam detection model is trained on emails labeled as spam or not spam, and after learning from the training dataset and being evaluated on the testing dataset, the trained model can detect whether a new email is spam. Unsupervised learning is used in marketing, where customers are categorized/segmented based on the similarity/dissimilarity of their purchasing trends compared with other customers; for instance, Netflix's computational engine uses the similarity/dissimilarity between what other users have watched when recommending movies.
Reinforcement learning was used to train DeepMind’s AlphaGo to beat world champions at the game of Go. Reinforcement learning has also been used to train chess-playing engines, where the model is penalized for moves that lead to losing a piece and rewarded for moves that lead to checkmate.
A machine learning method first processes the training dataset to build a data-driven model; following that, the performance of the newly developed model is evaluated on the testing dataset. After confirming the accuracy and precision of the data-driven model on the testing dataset, the model is deployed on the new dataset. These three types of dataset, namely, the training, testing, and new datasets, comprise measurements of certain specific features for numerous samples. The training and testing datasets, when used in supervised learning, contain additional measurements of the targets/outcomes, and a supervised learning technique tries to functionally relate the features to the targets for all the samples in the dataset. On the contrary, in unsupervised learning, the data-driven model is developed without targets; in other words, no targets are considered during the training and testing stages. Information about the targets is never available in the new dataset, because the trained models are deployed on the new dataset precisely to compute the desired targets or certain outcomes.
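The contrast between the two workflows can be sketched with scikit-learn; the two-cluster data, model choices, and settings below are illustrative assumptions, not examples from the chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated groups of samples, each with two features.
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # targets, used only in supervised learning

# Supervised: train on features AND targets, then predict the target
# for a new, unseen sample during deployment.
clf = LogisticRegression().fit(X, y)
new_sample = np.array([[4.0, 4.0]])
prediction = clf.predict(new_sample)

# Unsupervised: the model sees features only and generates an outcome
# (cluster labels) without any prior information about the targets.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The classifier needs `y` at training time; the clustering model never sees it, which is the defining difference between the two techniques.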
1.3 Types of outliers
In the context of this work, outliers can be broadly categorized into three types: point/global, contextual, and collective outliers [1]. Point/global outliers are individual data points or samples that significantly deviate from the overall distribution of the entire dataset or from the distribution of certain combinations of features. These outliers exist at the tail end of a distribution and vary largely from the mean, generally lying beyond 2 standard deviations away from it; for example, subsurface depths where porosity is > 40 porosity units or permeability is > 5 Darcy should be considered point/global outliers. From an event perspective, a house getting hit by a meteorite is an example of a point outlier. The second category is the contextual/conditional outliers, which deviate significantly from the data points within a specific context, for example, a large gamma ray reading in sandstone due to an increase in potassium-rich minerals (feldspar). Snow in summer is another example of a contextual outlier. Points labeled as contextual outliers are valid outliers only for a specific context; a change in the context will result in a similar point being considered an inlier. Collective outliers are a small cluster of data that as a whole deviates significantly from the entire dataset, for example, log measurements from regions affected by borehole washout. It is not rare for people to move from one residence to the next; however, an entire neighborhood relocating at the same time would be considered a collective outlier. Detection of contextual and collective outliers needs the guidance of a domain expert.
2 Outlier detection techniques
An outlier detection technique (ODT) is used to detect anomalous observations/samples that do not fit the typical/normal statistical distribution of a dataset. Simple methods for outlier detection use statistical tools, such as the boxplot and the Z-score, on each individual feature of the dataset. A boxplot is a standardized way of representing the distribution of samples for a feature using a box and whiskers. The box represents the interquartile range (IQR) of the data, and the whiskers typically extend to 1.5 times the IQR beyond the first and third quartiles; any data point/sample outside these limits is considered an outlier. The next simple statistical tool for feature-specific outlier detection is the Z-score, which indicates how far the value of a data point/sample is from the mean of a specific feature. A Z-score of 1 means the sample is 1 standard deviation away from the mean. Typically, samples with Z-score values greater than + 3 or less than − 3 are considered outliers. The Z-score is expressed as Z = (x − μ)/σ, where x is the value of the sample for the feature, μ is the mean of the feature, and σ is its standard deviation.
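The two feature-specific rules just described can be sketched as follows; the synthetic data are an illustrative assumption, and a real application would apply the rules to each log feature individually.

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 well-behaved values for one feature, plus one anomalous reading
# appended at the end.
x = np.append(rng.normal(10.0, 0.5, 50), 25.0)

# Z-score rule: distance from the mean in units of standard deviation;
# samples with |Z| > 3 are flagged as outliers.
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

# Boxplot rule: flag samples beyond 1.5 times the interquartile range
# (IQR) outside the first and third quartiles (i.e., past the whiskers).
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

Both rules flag the appended reading of 25.0; note that because the outlier itself inflates the mean and standard deviation, the |Z| > 3 rule can miss outliers in small or heavily contaminated datasets, which is one motivation for the more robust unsupervised ODTs studied in this chapter.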