
Chair of Drilling and Completion Engineering

Master's Thesis

Application of Data Mining to Predict and Assess the ROP Response

Mildred Rosa Mejia Orellana

May 2019


AFFIDAVIT

I declare on oath that I wrote this thesis independently, did not use other than the specified sources and aids, and did not otherwise use any unauthorized aids.

I declare that I have read, understood, and complied with the guidelines of the senate of the Montanuniversität Leoben for "Good Scientific Practice".

Furthermore, I declare that the electronic and printed versions of the submitted thesis are identical, both formally and with regard to content.

Date: 21.05.2019

Signature of Author: Mildred Rosa Mejia Orellana
Matriculation Number: 01629933


Application of Data Mining to Predict and Assess the ROP Response

Mildred Mejía Orellana

Master's Thesis supervised by
Univ.-Prof. Dipl.-Ing. Dr.mont. Gerhard Thonhauser
Dipl.-Ing. Asad Elmgerbi


To my family, my kids and my dear friends.


Abstract

Performance enhancement is a central goal in any industry. In the drilling process, the challenge lies in finding the right conditions to reach a desired depth faster, while balancing the operational complexities with the associated risks. In this regard, drilling operations generate enormous quantities of data and metadata, with the main goal of providing a detailed visualization of operations that is accessible remotely and in real time. This aligns with the current big-data era, in which data mining techniques serve as a means to process data efficiently and generate new and valuable information. From this perspective, the ultimate goal of this thesis is to assess the application of data mining software to transform commonly acquired drilling data into actionable information with a possible impact on well planning and later operations. To achieve this goal, the Rate of Penetration (ROP) was selected as the focus of the study.

The ROP, one of the main contributors to the time estimation of operations, is the variable of interest for analysis and prediction. This work applies data mining techniques to examine pre-existing data sets of previously drilled wells, looking for meaningful information about the measured ROP. Machine-learning models are then used to predict the ROP and serve as a reference to evaluate any deviation and its possible causes by testing the prediction on a new data set.

This thesis is divided into four main parts. The first part explores data mining functionalities and their applications, including specific examples related to the Oil & Gas (O&G) industry. The second part involves understanding drilling data: its origins in measurements, its data types, and some of the challenges faced during the acquisition process. The ROP measurement is discussed in detail at this stage as well. With a general overview of the resources, the third part is dedicated to the methodology, developing a workflow that includes pre-processing and processing of the data using a commercial data mining software to implement a model for ROP prediction. In the last part, the data analysis and model evaluation are performed using different visualization tools, reinforced by descriptive statistics. A discussion of the model implementation and testing process is presented as well, based on the obtained results.

The outcome of this work draws a road map for further research on the causes of ROP deviation. It offers insight into data mining applications for practical analysis and prediction derived from drilling data, and it endorses their application when objectives are clearly defined and no resource constraints exist.


Zusammenfassung (Summary)

Efficiency improvement is one of the main goals in every branch of industry. In the drilling process, the challenge is to adjust the drilling parameters so that the planned depth is reached as quickly as possible, while keeping the risks and operational difficulties associated with drilling low. Drilling activity generates enormous amounts of data and metadata, with the main goal of enabling a detailed visualization of operations that can be accessed from anywhere and in real time. This development goes hand in hand with the prevailing trend towards big data, in which data mining methods are used to increase the efficiency of data processing and to extract new and valuable information. Based on this, the goal of this work is to evaluate the application of data mining software to routinely recorded drilling data in order to obtain actionable information that may influence the planning phase and the later operational course of wells. For this purpose, the rate of penetration (ROP) was chosen as the focus of the study.

The rate of penetration is a well-known factor in the time planning of wells and serves here as the variable to be examined for analysis and prediction. The work applies data mining methods to existing data sets of previously drilled wells in order to examine them for meaningful information about the measured rate of penetration. Machine-learning methods are then used to predict the rate of penetration; these predictions serve as a reference for evaluating deviations and their possible causes by applying them to new data sets.

The work is divided into four main parts, beginning with the functionalities of data mining and their applications, including specific examples from the oil and gas industry. Subsequently, drilling data, the origins of its recording, its data format, and the difficulties associated with its acquisition are discussed; this includes a detailed discussion of the measurement of the rate of penetration. The third part covers the methodology, with a general overview of the resources, in which a workflow is developed that comprises the pre-processing and processing of the data with a commercial data mining software in order to implement a model for predicting the rate of penetration. In the last part, the data analysis and the model evaluation are carried out with various visualization tools, supported by descriptive statistics. The model implementation and testing processes are discussed on the basis of the results obtained.

The outcome of the work points the way for further research into the causes of deviations of the rate of penetration. It offers insight into data mining applications for practical analysis and prediction derived from drilling data. Based on the results, the application of data mining is endorsed when the objectives are clearly defined and no resource constraints exist.


Acknowledgements

First of all, I want to thank my thesis advisor Dipl.-Ing. Asad Elmgerbi for his support and guidance during my studies, and particularly for the culmination of this work. He has been a true mentor for my professional and personal growth.

I would also like to thank my friends from Ecuador, Iran, England, Austria, Syria and Russia, who constantly inspire me and provided their support whenever I needed it.

Furthermore, I want to thank Arash, who always believed in me, giving me his light during my darkest moments.

Special thanks to my family, who gave me their love and unconditional support during the highs and lows of this journey.

Last but not least, I want to thank my favourite person, my beloved sister, who made many sacrifices for me. Without your support this would not have been possible. Love you, Kiki!


Contents

Chapter 1 Introduction
1.1 Overview
1.2 Motivation and Objectives
Chapter 2 Data Mining
2.1 Overview
2.2 Functionalities
2.2.1 Concept/Class Description: Characterization and Discrimination
2.2.2 Frequent Patterns, Associations, and Correlations
2.2.3 Classification and Prediction
2.2.4 Outlier Detection
2.2.5 Cluster Analysis
2.2.6 Regression
2.3 Common Applications
2.4 Applications in the Industry
2.4.1 Example 1 – Reservoir Management
2.4.2 Example 2 – Data Mining to Understand Drilling Conditions
2.4.3 Example 3 – Predictions of Formation Evaluation Measurements to Replace Logging Tools Used in Lateral Sections
Chapter 3 Measurements and Data
3.1 Sensors and Rate of Penetration
3.1.1 Sensors Measurements
3.1.2 Rate of Penetration (ROP)
3.2 Data Type
3.3 Data Issues, Limitations and Resource Constraints
Chapter 4 Methodology
4.1 Data Gathering
4.1.1 LAS Files
4.1.2 Survey Data
4.1.3 Bottom Hole Assembly (BHA) Configuration
4.2 Data Pre-processing
4.2.1 RapidMiner Studio Software
4.2.2 Data Transformation
4.2.3 Project Creation and Data Loading
4.2.4 Data Integration
4.2.5 Data Cleaning
4.2.5.1 Data Cleaning and Filling
4.2.5.2 Data Quality Control (QC)
4.3 Data Processing
4.3.1 Manual Model Implementation
4.3.1.1 Model Evaluation
4.3.2 Auto Model Extension
4.3.2.1 Model Selection
4.3.2.2 Model Implementation and Evaluation
Chapter 5 Data Analysis and Results Discussion
5.1 Training Error and Prediction Error
5.2 Model Evaluation
5.3 Models Comparison
5.3.1 Performance Error
5.3.2 Graphical View and Statistical Description
5.4 ROP Statistical Summaries
5.5 Evaluation of Results
5.5.1 Visualization Tools Functionalities
5.5.2 Predictive Modelling Evaluation
5.5.3 Resource Constraints
Chapter 6 Conclusions & Recommendations
6.1 Conclusions
6.2 Recommendations
6.3 Further Work
Appendix
Bibliography
Acronyms
Symbols
List of Figures
List of Tables


Chapter 1 Introduction

1.1 Overview

There is no doubt that this is a data-driven era, in which scientific data, medical data, financial data, and practically every daily interaction within a system is registered and stored in some format as data. Only by understanding what to do with this vast amount of data, and how to use it, can the possibilities of knowledge be opened.

In this regard, data mining provides the resources to handle this large amount of data. It brings promising solutions as a dynamic, broad, and multidisciplinary field founded on statistics, data visualization, artificial intelligence, and machine learning, along with database technology and high-performance computing. In brief, its focus is on finding insights regardless of the methods; it commonly uses machine-learning algorithms to build models, but its ultimate aim is knowledge discovery.

Drilling data is no exception, with volume and variety growing at a constantly accelerating pace, yet it is still being explored toward its theoretical potential.

Since ROP is one of the parameters of concern during drilling operations, its proper understanding and prediction have become of great interest for optimization, and data mining and machine learning techniques, directly related to data analysis and prediction, appear as a positive alternative for this purpose. This is particularly true given that much theoretical research has been done regarding ROP, usually under limited conditions that end up preventing its applicability. Data mining, on the other hand, opens the possibility of insights and predictions based on real drilling data generated under tangible and, in many cases, repetitive operating conditions.

1.2 Motivation and Objectives

With the increase of automated processes during drilling operations, an increase in data sources is expected, with more and different types of sensors installed to accomplish all kinds of tasks.

In addition to this increase, Figure 1 shows the number of wells drilled in the US shale sector until 2018, with a projection to drill and complete more than 20,000 wells in 2019. Growth is estimated to continue until 2022, reflecting how drilling data is expected to keep growing tremendously in the upcoming years. Handling such a large amount of data demands the application of data mining, covering all aspects from data preparation to analysis, particularly since it has already been successfully applied in several fields.

Thus, the challenge consists in directing data mining functionalities toward drilling performance. This work therefore represents an opportunity to combine drilling engineering with data mining by applying some of its techniques to a set of drilling data.


Figure 1 Wells drilled, completed, and drilled-but-uncompleted per year until 2018, with a projection until 2022 (Jacobs, Journal of Petroleum Technology 2019)

The main objective is to improve the understanding of ROP behaviour and, when possible, identify the factors affecting its expected performance, with the creation of a model to predict its response. The means for this purpose is sensor data collected continuously during normal drilling operations, along with geographical well position data from one specific field.

In order to achieve the intended goal, a comprehensive workflow was created; its main phases are shown in Figure 2.

Figure 2 Workflow divided in four specific phases

The two initial phases involved the literature review and research associated with the topic to support the proposed methodology, by studying existing data mining applications along with more detailed examples directly related to the O&G industry. In addition, the second phase includes the use of a commercial data mining software to process and analyse drilling data. The last two phases then cover the implementation of a predictive model for ROP using data mining techniques, and finally the evaluation of the model and its applicability for drilling performance.

#1 TO EXPLORE existing data mining applications in the industry and their benefits.
#2 TO ANALYSE real drilling data using a commercial data mining software.
#3 TO CREATE a model to predict ROP using data mining techniques.
#4 TO EVALUATE the model, to assess the performance of a well while drilling and, when possible, assist in the detection of potential problems.


Chapter 2 Data Mining

2.1 Overview

Data mining emerged during the late 1980s, with important advances through the following decade and up to today. It refers to the application of science to extract useful information from large data sets or databases, focusing on issues relating to feasibility, usefulness, effectiveness, and scalability. In other words, it is executed by a person, in a particular situation, working with specific data sets and pursuing well-defined objectives. (Gung 2016)

There is much discussion about the proper definition of data mining and how it differs from machine learning, and many authors and researchers in the area still disagree to some extent. However, data mining researchers Jiawei Han and Micheline Kamber, in their book Data Mining: Concepts and Techniques, provide a formal definition:

“Data mining also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.” (Han and Kamber 2006)

Figure 3 Data Mining system (Abou-Sayed 2012)

Considering that vast amounts of data are constantly being created, transmitted, and stored at an ever more frequent rate, data mining serves the purpose of providing a description of the observed data regardless of its volume or type. Research and commercial interest align with this demand through the development of software solutions designed and dedicated exclusively to handling massive amounts of data, including algorithms and tools that simplify the process.


The term is not yet commonly used in the O&G industry; for that reason, some of its functionalities and common applications are briefly discussed here to recognize its value for the industry. There are two main tasks that can be performed using data mining: descriptive and predictive. Descriptive tasks characterize the main features or general properties of the data in a convenient way; the objective is to derive patterns that summarize the relationships in the data. Predictive tasks, on the other hand, interpret the current data to model the future behaviour of some variables based on the values of other known variables.

To perform any of these tasks, a suite of techniques is employed. The selected approach depends highly on the nature of the task and the availability of the data. The techniques include statistics, Artificial Intelligence (AI), pattern recognition, machine learning, and data systems analysis.

2.2 Functionalities

In order to be familiar with the terminology used in the framework of data mining, it is important to properly distinguish some common terms such as model and pattern. A model is a global concept that provides a full description of the data and can be applied to all points in the database. A pattern, on the other hand, corresponds to a local description of some subset of the data that can hold for some variables, but not for all of them.

Patterns are used to extract unusual structures within the data and are valuable for both main mining tasks. Data mining techniques can then be classified based on different criteria, such as the type of database to be mined, the type of knowledge to be discovered, and the types of methods to be used. (Platon and Amazouz 2007)

Because it is a field in constant change, there are best algorithms for certain problems, together with pragmatic rules of thumb about when to apply each technique to make it highly effective. Usually, a data mining system consists of a set of elements for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis. In addition, there are several variations of those tasks, resulting in new algorithms, considered in some cases as "new techniques." For the purpose of this thesis, only the broad classes of data mining algorithms are discussed. (Pinki, Prinima and Indu 2017)

2.2.1 Concept/Class Description: Characterization and Discrimination

Class/concept description refers to the advantage of associating data with classes or concepts in order to summarize individual descriptions in these precise terms.

There are three techniques used to derive this description:

• Data characterization: the class of interest, also referred to as the target class, is summarized in general terms or based on its features.

• Data discrimination: the general features of the class under study are compared with the general features of one or more comparative classes, to obtain a contrast between them.

• A combination of both data characterization and discrimination.


The methods used for characterization and discrimination include summaries and output presentations based on statistical measures, generalized relations, rule forms, and descriptive plots such as bar charts, curves, pie charts, multidimensional tables, and so on.
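As a minimal sketch of both techniques (the well-run records, column names, and values below are hypothetical), the target class can be summarized with descriptive statistics and contrasted against a comparative class:

    # Sketch: class characterization and discrimination with pandas.
    # The data set and its column names are hypothetical.
    import pandas as pd

    runs = pd.DataFrame({
        "section": ["12.25in", "12.25in", "8.5in", "8.5in"],
        "rop_avg": [85.0, 92.0, 41.0, 38.5],   # ft/h
        "wob_avg": [18.0, 20.0, 25.0, 27.0],   # klbf
    })

    # Characterization: summarize the target class in general terms.
    print(runs[runs["section"] == "8.5in"].describe())

    # Discrimination: contrast the target class with a comparative class.
    print(runs.groupby("section")[["rop_avg", "wob_avg"]].mean())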

2.2.2 Frequent Patterns, Associations, and Correlations

Frequent pattern mining is one of the most basic techniques and is about learning to recognize frequent patterns in data sets. It is usually based on distinguishing recurring structures in the data over time. Different kinds of frequent patterns include:

• Frequent item-sets: denote a set of items that recurrently appear together in a transactional data set.

• Frequent sequential patterns: refer to patterns occurring in a sub-sequential trend, one after another, repeatedly.

• Frequent sub-structured patterns: occur when different structural arrangements take place on a regular basis. The form of those arrangements can be graphs, trees, or lattices, and may be combined with sub-sequences or item-sets.

Associations and correlations occur when frequent patterns within the data are tracked in a more specific way to link variables dependently. In association analysis, two groups can be distinguished according to the number of attributes/dimensions involved:

• Single-dimensional association rules: involve a single attribute or predicate that repeats (e.g., buys).

• Multidimensional association rules: consist of more than one attribute or predicate (e.g., age, income, and buys).

When certain association rules are considered interesting, statistical correlations can be applied to show whether, and how strongly, associated attribute-value pairs relate.
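To make the terms concrete, the support and confidence of a single-dimensional rule can be computed by hand; a minimal sketch with hypothetical market-basket transactions:

    # Sketch: support and confidence for the rule {bread} -> {milk}.
    # The transactions are hypothetical.
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"milk", "butter", "bread"},
        {"milk"},
    ]

    def support(itemset):
        """Fraction of transactions containing the whole item-set."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    antecedent, consequent = {"bread"}, {"milk"}
    confidence = support(antecedent | consequent) / support(antecedent)
    print(f"support={support(antecedent | consequent):.2f}, "
          f"confidence={confidence:.2f}")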

2.2.3 Classification and Prediction

A more complex and commonly applied mining technique is classification, where a model is created to describe and differentiate data classes/concepts and then group them into discernible categories. The final aim is to use the model, derived from the known data (also called 'training data'), to make predictions for data labelled as unknown.

There are a number of ways to represent the model, for example using:

• Classification rules: using IF-THEN functions.

• Decision trees: creating a flow chart via an algorithm based on the "information gain" of the attributes. Each node is tested on an attribute value, the tree branches represent the outcomes, and the leaves denote the class distribution.

• k-nearest neighbour (k-NN) classification: uses the data to determine the model structure, making no assumptions about the original data distribution (non-parametric) but learning based on feature similarity. Hence, it does not generalize from the training data; rather, it utilizes the training data during the testing phase.

• Neural networks: structurally consist of many small units called neurons and are a powerful mathematical tool for solving problems. The neurons are linked to each other in layers and cooperate to propagate the inputs through weighted connections and 'activation functions'. Bias values are then applied mathematically to continue the transformation of the inputs into outputs in the best possible manner. (Solesa 2017)

• Support Vector Machines: combine linear modelling and instance-based learning to overcome the limitations of linear boundaries. The method relies on selecting a small number of critical boundary instances, called support vectors, from each class and building a linear discriminant function that separates them as widely as possible. This permits the inclusion of extra nonlinear terms in the function, in order to form higher-order decision boundaries. (Witten and Frank 2005)

• Naïve Bayes: based on Bayes' rule (named after Rev. Thomas Bayes, 1702-1761), it is mainly appropriate when the dimensionality of the inputs is high; in simple terms, it assumes independence between attributes. This technique works well when combined with procedures to eliminate redundancy (non-independent attributes). The algorithm output is a function of the prior probability, based on previous experience, and the likelihood for a new object to be classified in a certain class. Naïve Bayes fails if a particular attribute value never occurs in the training set together with every class value. (Witten and Frank 2005)

Though conventionally the term prediction covers both numeric prediction and class label prediction, more precisely, classification is used for categorical labels (discrete, unordered), while prediction emphasizes models describing continuous-valued functions. In this context, regression analysis is the statistical methodology most commonly used for numeric prediction; however, other methods exist and can also provide good performance.
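As an illustration of the classification task, a minimal k-NN sketch with scikit-learn follows; the two features (loosely labelled WOB and ROP) and the two formation classes are synthetic and only serve to show the train/test mechanics described above:

    # Sketch: k-NN classification on synthetic data with scikit-learn.
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X = [[18, 85], [20, 90], [25, 40], [27, 38], [19, 88], [26, 42]]  # e.g. WOB, ROP
    y = ["soft", "soft", "hard", "hard", "soft", "hard"]              # class labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0)

    # k-NN keeps the training data and classifies new points by similarity.
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(model.predict(X_test), model.score(X_test, y_test))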

2.2.4 Outlier Detection

In many cases, data sets may include anomalies, or outliers, i.e., data that do not comply with the general behaviour or model of the data. Such data need to be identified and investigated to obtain a clear understanding of the data set. In general, data mining offers algorithms to discard outliers as noise or exceptions, since this type of data can distort the analysis and therefore needs to be excluded.

This functionality can also serve another purpose when anomalies provide information of interest, as, for example, in fraud detection, where credit cards are used to purchase extremely large amounts compared to regular transactions.

Outlier detection is conventionally possible through statistical tests, where a certain type of distribution is assumed, or by using probability models to discard anomalies.

Other methods use distance or density measures, where examples substantially far from any cluster, or lying in regions of low data density, are identified as outliers. Deviation-based methods, on the other hand, compare the main characteristics between examples and set apart outliers by examining the differences.
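A minimal sketch of the statistical approach, assuming roughly normally distributed values and using a z-score threshold (the numbers are synthetic and the threshold is a tuning choice, not a fixed rule):

    # Sketch: z-score outlier test on a synthetic sample with one spike.
    import numpy as np

    values = np.array([52.0, 49.5, 51.2, 50.8, 48.9, 180.0, 50.3])
    z = (values - values.mean()) / values.std()
    print(values[np.abs(z) > 2])   # -> [180.]; the common 3-sigma rule needs more data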

2.2.5 Cluster Analysis

Clustering seems similar to classification but involves grouping data without a known class label, using only the data's similarities. Clustering can, in fact, be used to generate the necessary labels.

The principle used to group the data is to search for examples that maximize the similarity within a group (intraclass) while minimizing the similarity to other groups (interclass). The final result is a set of groups (clusters) such that examples in the same group are similar to each other but different from examples in other groups. Groups are clearly distinguished and can be used to derive rules. (Han and Kamber 2006)

Different techniques are used for clustering; the most common examples are hierarchical clustering and k-means clustering. (Abou-Sayed 2012)

Figure 4 Plot of customer data in relation to its location in a city. Three data clusters are clearly identified (Han and Kamber 2006)
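A minimal k-means sketch on synthetic two-dimensional points (three artificial groups, no labels given) illustrates how clusters emerge from similarity alone:

    # Sketch: k-means clustering with scikit-learn on synthetic points.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
                        for loc in ([0, 0], [4, 4], [0, 5])])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
    print(np.bincount(labels))   # roughly 30 points per cluster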

2.2.6 Regression

Regression is a statistical method used to approximate the given data, primarily as a form of planning and of modelling continuous values. There are different types of regression analysis, but the principle consists in evaluating the influence of one or more independent variables on a dependent variable. It allows examining the behaviour of a certain variable in the presence of other variables, providing a way to uncover the exact relationship between two or more variables in a certain data set.

The simplest form is called linear regression, where the response variable is modelled as a linear function of another variable. When two or more variables have a linear relationship with the dependent variable, the regression is known as multiple linear regression. Linear regression is very sensitive to outliers, which can distort the fit.

Multiple regression is an extension of the simple form, used to model a relationship between multiple variables, which increases the complexity of the prediction.


Some of the most popular types of regression are logistic regression, polynomial regression, stepwise regression, ridge regression, and lasso regression, among others, each one following specific conditions to better suit particular problems (Ray 2015).
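For the simplest case, a linear fit can be obtained with an ordinary least-squares regression; a minimal sketch with synthetic data, where the true slope of 2.5 and intercept of 1.0 are recovered from the noisy samples:

    # Sketch: simple linear regression via numpy's least-squares polyfit.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 2.5 * x + 1.0 + np.random.default_rng(1).normal(0, 0.1, x.size)

    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"y = {slope:.2f}*x + {intercept:.2f}")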

2.3 Common Applications

Data mining is widely popular in credit risk applications and in fraud detection. The common technique applied is classification, where a model is developed employing pre-classified examples and the records are then categorized using decision-tree or neural-network-based classification algorithms. Outlier detection is also used for fraud detection, where the outliers become the data of interest. In general, the process requires records of both valid and fraudulent activities to properly train the model to determine the parameters required for the discrimination. (Pinki, Prinima and Indu 2017)

Data mining is also being used successfully in industrial process applications, in areas that include process monitoring, fault detection and diagnosis, decision-making support to improve process understanding, soft sensors, process parameter inference, and many others. Each application demands different techniques along with different types of databases. One of the most popular techniques for predictive modelling is based on neural network approaches, due to their well-known predictive capabilities. In some cases, a combination of various methods can also be used to create hybrid models and overcome individual limitations. (Platon and Amazouz 2007)

Retail analysis of buying patterns is another classical application of data mining, where its proficiency in analysing long-stored databases full of customer actions and loyalty data represents an open door for the marketplace. In every transaction, customers expose their choices, along with some of their profile data, which, when properly processed, results in patterns of customer behaviour. This information makes it possible to identify distinguishing characteristics related to customer loyalty and the likelihood of churn for certain products. The results provide client profiles and preferences that serve as inputs for marketing strategies and market predictions, in a customer-oriented economy where increasing sales is the final aim. As an example, the retail giant Wal-Mart transfers all its relevant daily transactions to a data warehouse collecting terabytes of data, which is also accessible to its suppliers, enabling them to extract information regarding customer buying patterns and shopping habits, as well as the most-shopped days, the most sought-after products, and so on.

There are many other specific applications: screening satellite images with a hazard detection system to identify oil slicks and give early warning of ecological disasters; forecasting the load for the electricity supply industry based on historical records of consumption; and, in the medical field, selecting the best treatment, or analysing over 60 recorded features of embryos simultaneously in human in-vitro fertilization; and countless more. (Witten and Frank 2005)


2.4 Applications in the Industry

It has been estimated that a large offshore field delivers more than 0.75 terabytes of data weekly, and a large refinery 1 terabyte of raw data per day; references have been made to input/output points somewhere between 4,000 and 10,000 per second (Abou-Sayed 2012). With this amount of data flowing constantly, the key lies in ensuring that the right information reaches the right people at the right time.

The industry's emphasis has traditionally been on monitoring and production assurance; recognizing the potential of data mining, some operators and service companies have therefore started to make important investments in this direction.

Some examples of the potential of data mining, already applied to optimize solutions, relate to:

• Predicting well productivity, reservoir recovery factors, and decline rates.

• Identifying key drivers for the performance of producers and water injectors subjected to multiple factors such as high pressures and temperatures.

• Defining best practices in completion.

• Minimizing production downtime and well intervention costs.

• Extending the production life of wells.

The scope of data mining is still evolving; therefore, some interesting advances are discussed below, showing its application in three different disciplines. The first example refers to reservoir management and how, supported by seismic data, it is possible to identify and advise regarding sweet spots. The second example is related to wellbore stability and how data can be used to prevent some of its causes and the associated risks. In the last case, an application for formation evaluation predictions is presented as an alternative to reduce completion costs.

2.4.1 Example 1 - Reservoir Management

British Petroleum (BP), together with Beyond Limits, is working on a project to absorb the learnings of petrotechnical experts, such as geologists and petroleum engineers, using cognitive computing to imitate their decision-making processes as they work on subsurface challenges.

The first joint program has been running since July 2018 with a group from BP's upstream engineering team, the aim being that their expertise trains the system and remains available digitally for longer. It was meant to be used on the job, in a way that a number of Artificial Intelligence (AI) agents constantly interact with members of the team to build experience, learn the art of solving problems, and store knowledge. It starts as a design tool and, through a process of learning, becomes a recommendation tool that can build trust with experience, to later be used as a control system. As a glimpse of the early stage of the project, BP expects the system to answer how to mitigate the impact of sand production, and to predict and advise on asphaltene buildups in a well.

A cognitive computing system involves self-learning technologies that use basic analytics, deep learning, data mining, pattern recognition, and natural language processing to solve problems the way humans do: by thinking, reasoning, and remembering. It can combine data from different information sources, weigh its context, and resolve conflicts using the evidence to propose the best possible answers. Through deep learning, the information is processed in layers, where the output from one layer becomes the input for the next one, improving the result. (Jacobs, Journal of Petroleum Technology 2018)

BP's interest in Beyond Limits arose from its work with the Jet Propulsion Laboratory (JPL) on the Mars rover Curiosity. One of the company's principals authored a distinctive AI program in charge of managing one of the rover's batteries. Remarkably, when the program detected that the solar panels were suffering from dust storms, it autonomously accessed data from pressure and temperature sensors to build a weather model and understand how to orient the solar panels to protect them from the dust. This aligns with the definition of AI as "the science of making computers do things that require intelligence when done by humans" (Evans 2017). In the Curiosity mission, the program was capable of executing a task that was not designed into its model.

Beyond Limits is relatively new and not exclusive to the Oil & Gas industry, and therefore little known there. However, it is developing a system, referred to as the Reservoir Management advisor, that will learn from geologists and reservoir engineers as they search for sweet spots in offshore seismic data, in order to recommend probable well locations and the most suitable well designs to maximize the recovery of hydrocarbons. It is supported by another software package called Sherlock IQ, born from the experience of the rover program and based on machine cognition, which autonomously sifts through different paths of data to discover specific details and scenarios that ultimately allow it to assess risks. It is expected to become reliable, faster, and capable of appraising more data in a period of just a few hours, complementing the work of real experts, which usually takes months.

Figure 5 Beyond Limits Reservoir Management advisor (Jacobs, Journal of Petroleum Technology 2018)


2.4.2 Example 2 – Data Mining to Understand Drilling Conditions.

Lately, terminology like "intelligent wells" or "digital oilfields" is used more and more often, reflecting how data use is changing in the industry. The usual approach of established workflows using only a specific set of relationships between variables, like linking core data to well logs, has become obsolete.

Data mining functionality, along with the proper technology, allows working with disparate data types, structured and unstructured, and with different degrees of accuracy and granularity. The combination enables rapid associations between data that would normally be assumed to be unlinked. This perspective was tested by the UK Department of Energy & Climate Change in a project together with CGG, the official UK Continental Shelf data release agent. The purpose of the study was to improve drilling results using the main data mining tasks: descriptive modelling and predictive relationships. More specifically, the project's aim was to find the optimum conditions for drilling efficiency and to identify high-risk situations.

A total of 350 wells located in the UK North Sea were used for the study, in the form of 20,000 files with different formats, including data from well logs, well geographical locations, drilling parameters, geological reports, and well deviations. All data was thoroughly loaded, quality controlled, and finally used for the analysis. The caliper reading, normalized by the bit size, was chosen as the main reference to identify poor hole conditions. The other drilling parameters used in detail were torque, Weight on Bit (WOB), and ROP. The visualization tool made it possible to combine and contrast the inputs/variables in order to understand how their variations affect borehole quality. (Johnston and Aurelien 2015)

Figure 6 Anomaly detection: a high-risk situation was identified (Johnston and Aurelien 2015)


Data mining revealed its power by working with a large amount of data, performing better than the usual approach and in a very short period. Figure 6 is an example of how anomaly detection was possible using one of the visualization tools. It shows the case of a single well where an increase in WOB ended in poor hole conditions, affecting the caliper reading. Subsequently, it was found that the well faced stuck-logging-tool issues and required an extra wiper trip. In conclusion, a high-risk situation was identified.

There were more discoveries resulting from the study, including some predictive statistics meant to provide valuable information for drilling future wells in the same area, hopefully with fewer problems.

2.4.3 Example 3 – Predictions of Formation Evaluation Measurements to Replace Logging Tools used in Lateral Sections.

The growth of the shale revolution over the past two decades positioned the United States at the top of oil producers worldwide, competing with Saudi Arabia and Russia (Donnelly 2019). However, the threat of the low oil price market after the crisis at the end of 2014 forced producers to become extremely efficient, cut costs, and look for innovation.

In this regard, in the Eagle Ford Shale in Texas, EOG Resources reported a 70% decrease in average drilling days, from 14.2 in 2012 to 4.3 in 2015. The curious side of this improvement in efficiency relates to the overall cost per well, which decreased by only 20%, from USD 7.2 million to USD 5.7 million. The discrepancy is due to the completion cost, the major contributor, which is independent of any efficiency improvement during drilling operations (Parshall 2015). Innovation was in demand.

It is important to bear in mind that, to provide smart completions, the location of stages and perforation clusters is essential, and it is currently engineered using formation evaluation technology. This technology, known to be costly, must be added to the already considerable cost per stage, while experience has shown that between 30% and 50% of the perforation clusters do not even produce. This situation caught the attention of Quantico Energy Solutions, a data-driven company, which understood the need for more and better information about reservoirs and their geological complexity without the investment required by conventional logs.

The necessity of innovation became stronger due to the way shale fields are developed, where operators can afford to log a few appraisal wells but not all the subsequent wells, which ideally should be smartly completed too. Therefore, data mining became an alternative, considering a scenario where thousands of wells in an area have already been drilled, collecting not just important geological data from logging tools and cutting samples, but also a huge amount of data regarding drilling parameters, completion, and production.

After two years of research, Quantico Energy Solutions, supported by several major shale operators, along with industry specialists in neural networks and openhole logging tool designers, developed a source of formation evaluation characteristics called QLog. It is a commercial logging system based on machine-learning software that trained neural network models using the drilling and logging data from horizontal wells collected over years by operators. It is capable of simulating compressional, shear, and density logs on horizontal wells, avoiding the use of expensive logging tools. From the results, it is possible to derive elastic properties such as Young's modulus, Poisson's ratio, horizontal stress, and brittleness, which are fundamental to engineering the completions. Later on, the company developed the QFrac software, which uses the simulated results to recommend engineered stage locations.

The success of the system, requiring less investment compared to the actual design and testing of physical logging tools, created a network effect in which more operators decided to step in, providing more data. As a consequence, a real-time simulator service, QDrill, was developed to also assist drillers with well placement operations. It is a software based on artificial intelligence that provides the petrophysical properties of a reservoir.

The algorithm was developed using several hundred wells from many basins that have measured well logs along with drilling data. It was designed to use gamma ray logs and drilling dynamics parameters, such as ROP, WOB, torque, and so on, as input.

There are several advantages in using data mining to simulate formation evaluation logs, starting with the reduction of capital expenses, with savings of up to 80% of conventional logging costs. Other benefits include having no nuclear or acoustic sources in the well (Quantico Energy 2019). In fact, models for specific fields can be generated in a few days. The main advantage, however, is the elimination of the risks of running expensive logging tools, with the latent possibility of their becoming stuck or, in the worst scenario, lost in hole. This is especially relevant since the results of the simulations have shown repeatable accuracy consistent with that obtained by logging tools in both deep-water and land wells.

Figure 7 Differences were less than the precision of the logging tools; real-time measurements are highly dependent on hole conditions and largely affected by hole washouts (Zhang 2018)


Figure 7 refers to a case study in the Midcontinent region of the U.S., where the target was a clastic laminated/layered sandstone reservoir. The AI model was prepared, and the client drilled two lateral sections using Quantico logs for real-time geosteering interpretations, to place completion stages in intervals with higher porosity and to equalize the minimum horizontal stress across stages. To compare the predictive accuracy and repeatability of the model with the real-time measurements, two models were used: a static one, based on information from a proprietary database, and an adaptive one, constantly incorporating the data acquired from logging tools into the training set. The results showed negligible differences between the bulk densities from both models relative to the one measured by logging tools (Zhang 2018).


Chapter 3 Measurements and Data

The first step in data mining consists of gathering all the data relevant to the study, which might not be an obvious task; this is why it is important to state a clear objective in order to identify the necessary data. In this regard, and as mentioned earlier, this work aims at a better understanding of the ROP measurement, which in operations reflects the drilling conditions and includes, among others, the drilling parameters set while drilling. Thus, prior to mining the data, it is key to understand the main measurements and sensors involved in drilling operations and providing the data, as well as some relevant concepts and considerations regarding the data itself.

3.1 Sensors and Rate of Penetration

3.1.1 Sensors Measurements

There are different numbers and types of sensors involved in normal drilling operations, highly related to the nature of the rig, the sort of operation, and the available budget. Sensors are used in the process to measure parameters, and their outputs are the values that describe these parameters.

It is important to distinguish how some measurements originate from sensors installed on the surface while others come from downhole sensors included in the tools of the Bottom Hole Assembly (BHA). In addition, there are different types of measurements, some direct and others indirect. Finally, two domains operate in parallel, so data measurements are acquired in both time and depth.

In general, there are roughly 10 key measurements obtained from surface sensors. However, due to the scope of the present work, only some of the main and most common measurements will be discussed, as they provide the input for further analysis and modelling. Table 1 summarizes the surface sensor measurements, normally acquired by the mud logging service provider during daily drilling operations, with a brief description of each attribute (Nguyen 1996).

As previously mentioned, data is acquired in two domains, DEPTH being one of them; for that reason, this attribute is by far the most important one regarding measurements. Nevertheless, concerning rig operations, there are three measurements that are indispensable: hook load, rotation, and pump discharge pressure. Moreover, the majority of these measurements are indirect, demanding a certain level of interpretation along with regular on-site calibration, and are thus more susceptible to human error.


Hole Depth [DEPTH]: permits depth tracking and refers to the most recent position of the bit while drilling along the trajectory.

Hook Load [WOH]: corresponds to the average value of the weight/load on the hook.

Rate of Penetration [ROPins]: the rate of movement of the bit while drilling a certain interval.

Rotation per Minute [RPM]: provides the average revolutions transmitted to the drill-string by the top drive.

Standpipe Pressure [SPP]: indicates the average pressure delivered by the pumps, usually measured at the standpipe.

Weight on Bit [WOB]: calculated as the difference between the weight on the hook while off bottom and on bottom.

Torque [TRQ]: average torque in the drill-string.

Flow Rate [FLOW]: average flow rate delivered by the mud pumps, usually referred to as flow in.

Table 1 Summary of attributes coming from surface sensors

Downhole measurements are also indirect, but in most cases their calibration process is more rigorous, normally performed only in the workshop and under specific conditions (i.e., yearly, once per job, etc.). Downhole measurements are mainly used for wellbore positioning, directional work, and formation evaluation. Table 2 shows the attributes related to the directional work obtained from downhole sensors. It includes DEPTH, which is measured on the surface and adjusted by the offset of the downhole sensors' position in the BHA; it is necessary as a point of reference.

Depth [DEPTH]: corresponds to the depth position of the sensor in the borehole when the measurement is taken.

Inclination [Inclination]: provides the deviation of the borehole with respect to the vertical.

Azimuth [Azimuth]: gives the position of the borehole with respect to north, projected onto a horizontal plane.

Build Rate [BR]: the incremental increase or decrease in inclination angle from vertical, specified in degrees per 100 ft or per 30 m (Azar and Samuel 2007).

Turn Rate [TR]: a measurement of the incremental change in azimuth per 100 ft or per 30 m (Azar and Samuel 2007).

Dogleg Severity [DLS]: describes the amount of change in the inclination and/or direction of a wellbore, also expressed in degrees per 100 ft or per 30 m (Carden 2007).

Table 2 Attributes related to the directional work


With this first glimpse of the measurements involved in the study, it is important to discuss one of them in further detail: the ROP. During normal drilling operations, this parameter is of main concern due to its influence on drilling performance and efficiency, and therefore on drilling costs.

3.1.2 Rate of Penetration (ROP)

The ROP is defined as the "advancement in unit time, while the drill bit is on bottom and drilling ahead", and the factors affecting it are categorized into three main groups (Mensa-Wilmot, et al. 2010):

1. Planning.

2. Environment.

3. Execution.

The first group is defined during the planning stage and includes hole size and casing depths, well profile, the drive mechanism selected to drill (motor, RSS, etc.), BHA configuration, bit selection (aggressiveness of the design), bit hydraulic horsepower per square inch (HSI), flow rate, drilling fluid type and rheology, and hole cleaning. Of the listed factors, it is important to notice that hole size, bit selection, HSI, drive mechanism, and BHA are constant for a run, i.e., from the moment the BHA is run in hole until it is pulled out of hole again.

The environment category refers to the lithology of the area, the formation drillability (rock strength, abrasiveness, etc.), the pressure conditions (differential and hydrostatic), and the deviation tendencies, among others. The differential pressure and deviation tendencies are in constant change during well construction; however, the formation-related factors can be considered constant for a specific area or field.

Last, but not least, are the execution factors: Weight on Bit (WOB), RPM, drilling dynamics, etc. (IADC 2014). These factors also change constantly and are an essential part of the drilling parameters set on the surface to construct a well following the previously planned trajectory. It is important to consider that some technical limitations exist in this regard. For example, the bit selection usually determines the maximum applicable WOB. In the same way, the maximum RPM is limited by the rig capability, and also by the BHA configuration, the motor bent housing when used as a deflection tool, the resulting torque, the well profile, vibrations, and many other factors.

Furthermore, it is necessary to differentiate between the two main types of ROP: the average and the instantaneous. The average ROP is used to describe the measurement over a certain interval or in relation to a particular BHA. The instantaneous ROP, on the other hand, refers to the measurement over a finite time or distance and offers a reference in real time (Mensa-Wilmot, et al. 2010).
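The distinction can be stated numerically; a minimal sketch (the 30-second sampling and the depth values are assumptions for illustration) computes both quantities from depth/time records:

    # Sketch: average vs. instantaneous ROP from depth/time samples.
    import numpy as np

    time_s   = np.array([0, 30, 60, 90, 120])                     # elapsed drilling time [s]
    depth_ft = np.array([1000.0, 1000.4, 1000.9, 1001.5, 1002.0]) # bit depth [ft]

    # Instantaneous ROP over each finite interval, converted to ft/h.
    rop_inst = np.diff(depth_ft) / np.diff(time_s) * 3600.0

    # Average ROP over the whole interval.
    rop_avg = (depth_ft[-1] - depth_ft[0]) / (time_s[-1] - time_s[0]) * 3600.0
    print(rop_inst, rop_avg)   # [48. 60. 72. 60.] and 60.0 ft/h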

3.2 Data Type

Working with data usually represents a challenge because the majority of the data is collected in an unstructured way, meaning it does not involve a pre-defined data model or is simply not organized in a pre-defined manner. It therefore becomes important to understand how to work with different data sets based on the final aim. By definition, "an attribute is a property or characteristic of an object, that may vary, either from one object to another or from one time to another" (Tan, Steinbach and Kumar 2006). The description of data is done using different attributes, which not only differ in their values but may also vary in their type.

At the most basic level, the physical values of different attributes are mapped to numbers or symbols, where the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. A way to differentiate the types of attributes is to recognise which properties of numbers apply to the properties of the attribute. Four main operations are used to distinguish between attributes:

1. Distinctness.

2. Order.

3. Addition.

4. Multiplication.

This results in four types of attributes, with specific properties and operations clearly defined and valid for each type (Tan, Steinbach and Kumar 2006):

1. Nominal: provide enough information to distinguish one object from another (=, ≠). For example, gender, ID numbers, etc.

2. Ordinal: the information allows objects to be ordered by a logical criterion (<, >). For example, grades, costs, quality, etc.

3. Interval: The differences between values are meaningful (+, -). For example, temperature in Celsius or Fahrenheit, where a unit of measurement exists.

4. Ratio: The differences and ratios between values are meaningful. For example, monetary quantities, age, length, etc.

The first two types of attributes are commonly denoted categorical or qualitative and cannot be treated as numbers, even if represented by numbers, because they lack the properties of numbers. In contrast, the last two types are usually referred to as quantitative or numeric; they are not only represented by numbers, but those numbers have a direct meaning as measurements and possess most of the properties of numbers.

In addition, attributes can also be classified based on their numeric values, which can be discrete or continuous. Discrete attributes are usually represented using integer variables and have a finite set of values, i.e., can only take certain values. A special subgroup of discrete attributes is binary attributes, where only two values are possible (0 or 1, True or False, etc.) and often represented as Boolean variables. Continuous attributes are essentially real numbers, can occupy any value over a continuous range and are represented as floating-point variables. Normally, categorical attributes are discrete, while numeric attributes are continuous.
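A minimal sketch of how these attribute types can be encoded in practice (the well records are hypothetical; pandas dtypes are one possible mapping, not the only one):

    # Sketch: mapping the four attribute types onto pandas dtypes.
    import pandas as pd

    df = pd.DataFrame({
        "well_id":   ["W1", "W2", "W3"],        # nominal
        "hole_qual": ["poor", "fair", "good"],  # ordinal
        "temp_degC": [65.0, 71.5, 80.2],        # interval
        "rop_fth":   [48.0, 60.0, 72.0],        # ratio, continuous
    })
    df["well_id"] = df["well_id"].astype("category")
    df["hole_qual"] = pd.Categorical(df["hole_qual"],
                                     categories=["poor", "fair", "good"],
                                     ordered=True)
    print(df.dtypes)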


3.3 Data Issues, Limitations and Resource Constraints

The most time-consuming stage in any application of data mining is the preparation of the data for processing, which includes collecting and cleaning the data. Surveys show that between 60 and 80% of the time is dedicated to this purpose.

Figure 8 Results of a survey among data scientists showing the time needed to massage the data prior to its use (Press 2016)

There are several measurement and data collection issues, mainly related to human error, limitations of measuring devices, or defects in the data collection process, which include inappropriate sensor installation or a poor understanding of the physics behind the measurement (Maidla, et al. 2018). Therefore, it is very common to find missing data, duplicate objects, outliers, and inconsistent values.

Typical errors during the measurement process result in differences between the recorded value and the true value, known as discrepancy. This can happen for several reasons, such as a sensor defect, the use of wrong calibrations, or an inadequate installation. Other common problems involve noise in the signal or simply lack of maintenance, allowing debris or humidity to affect the measurement.

Signal noise is normally associated with spatial or temporal components that result in spiking signals distorting the measurement (Tan, Steinbach and Kumar 2006).

Errors concerning the data collection process include omitting relevant data or inappropriately including data that is not suitable for the analysis, as well as a general lack of data availability. Finally, yet importantly, data frequency and range must also be considered, because they affect the granularity of the data and thus have an impact on the results.
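A minimal sketch of one common cleaning step for spiking signals, a rolling median filter over a hypothetical standpipe pressure channel (the window length is a tuning choice, not a fixed rule):

    # Sketch: despiking a noisy sensor channel with a rolling median.
    import pandas as pd

    spp = pd.Series([2510, 2515, 2512, 90, 2518, 2520, 2516])  # psi, one dropout
    smoothed = spp.rolling(window=3, center=True, min_periods=1).median()
    print(smoothed.tolist())   # the 90 psi spike is replaced by a local median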

Some illustrations in this regard can be found in Figure 9, where the standpipe pressure measurement corresponds to the reading of a pressure transducer installed in the manifold. When the sensor is wrongly placed, with a possibly closed valve in the fluid path, the reading could suggest a false pumps-off. Figure 10, on the other hand, shows the mounting of a hook load sensor (Clamp Line Tensor, CLT type) on the drill line, about 6-8 ft above the dead-line anchor. Its reading can be affected by drill-line vibrations when not properly adjusted or when installed too far from the anchor.

Finally, debris and humidity are a concern in all sensor connections, which are constantly exposed to the environmental conditions.

Figure 9 Pressure Transducer installed in manifold on the rig floor

Figure 10 Hookload sensor installed in the drill line


Chapter 4 Methodology

With the clear objective of modelling the ROP response based on drilling data, considering all the factors influencing it along with the data available for the project, the following general structure was developed:

Figure 11 General structure for the methodology

The starting point was clearly to gather data from the same field: wells drilled in the same field normally share the same geology, lithology, and formation drillability, and even face similar issues while drilling. The second important point relates to the well schematic; again, wells drilled in the same field tend to share similar well schematics, i.e., hole sizes, CSG depths, fluid type, bit selection, well profiles, etc.

The last point refers to the drilling parameters. Ideally, in this context, the same rig would be used to drill all the wells in the field, so the technical limitations would be the same. In addition, a relation between the directional work and the drilling parameters is assumed, as they are linked and usually defined between the directional driller, the bit engineer, and the company man. By gathering enough data, data mining techniques can be used to train a model that predicts the ROP, providing a point of reference to assess the performance of a new well and hopefully offering insights into the potential factors affecting its result.

Based on the general structure, Figure 12 shows the specific phases defined to cover all aspects involved in the methodology workflow.

Figure 12 Phases for the workflow: 1. Data Gathering, 2. Data Pre-processing, 3. Data Processing, 4. Data Analysis, 5. Model Evaluation, 6. Results

[Figure 11 box labels: Field (same formations, lithology); Sections (bit ~ BHA, CSG points); Parameters ↔ directional work (building, dropping, turning, maintaining)]


In the following sections, only the first four phases of the workflow are explained in detail; the remaining phases are covered in the next chapter.

4.1 Data Gathering

Data confidentiality is the most important clause in any company, especially when so much is at stake, with high monetary investments and considerable associated environmental risks. Therefore, obtaining data is the first challenge.

In this regard, drilling operations are described through different means in the form of reports. For this project, it was possible to collect a limited amount of data to work with. The data set consisted of different files, with different formats and granularity, from four wells drilled onshore in the same field with the same rig.

Parameter          | Well_1           | Well_2           | Well_3           | Well_4
TD MD/TVD          | 11250 / 10838    | 12330 / 10600    | 11660 / 10800    | 11455 / 10823
VS                 | 2794             | 3188             | 3553             | 3204
Section 1 MD/TVD   | 6490 / 6267.4    | 6280 / 6247.15   | 6502 / 6225.1    | 6520 / 6253.18
Section 2 MD/TVD   | 10387 / 9988.02  | 10493 / 10010.93 | 10820 / 10001.53 | 10627 / 10011.60
Section 3 MD/TVD   | 11250 / 10838.22 | 11950 / 10587.33 | 11660 / 10800.66 | 11455 / 10823.34
Section 4 MD/TVD   |                  | 12330 / 10600.12 |                  |
Max. Inc/VSA [°]   | 18.476 / 136.087 | 88.794 / 98.375  | 30.993 / 35.682  | 25.66 / 158.608
Well Type          | S / 2D           | Horizontal / 3D  | S / 2D           | S / 2D

Table 3 Well candidates' basic details. All units related to distances are in feet [ft]

Following the main objective of this work, along with the general structure described in Figure 11 for the methodology, only the reports available for this thesis that have a potential impact on the ROP are further explained:

- Survey Listing: refers to the well profile, which is the result of the drilling parameters used to build the trajectory, and therefore influences the ROP.

- BHA Report: includes information about the bit size (i.e., hole size), its type (cutting mechanism), the position of the stabilizers, and the deflection tool used; components affecting the resulting ROP.

- LAS Files: present a list of all sensor measurements taken on the surface, including the ROP; in other words, they summarize, with different granularities and domains, the parameters used during operations. The number of sensors installed varies according to the mud logging company contract. (A minimal loading sketch is shown after this list.)

- Geological Tops List: provides a simple description of the different formations from the surface to the target. It usually states the names used to identify each formation top in relation to its MD and TVD; formation names vary according to the geographical location. This document differs from the Geological Report, which is a much more detailed description of the cuttings and their composition. In this case, the geological tops list was included with the Survey Listing.
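A minimal loading sketch using the lasio package ("well_1.las" is a placeholder path; curve mnemonics vary by mud logging provider):

    # Sketch: loading an LAS file into a pandas DataFrame with lasio.
    import lasio

    las = lasio.read("well_1.las")  # placeholder file name
    df = las.df()                   # index is the depth curve, columns are the others
    print(df.columns.tolist())     # e.g. ROP, WOB, RPM, SPP ...
    print(df.describe())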
