# WELL PRED

Predict missing logs from available data using artificial intelligence

###### tags: `Machine learning, wells, GENEVA`

:::info
:bulb: Can combining information from different sources through non-parametric approaches help infer unavailable (but very important) well log data?
:::

### Merge available wells data

Create a dataset of the available well log data. The data come from these wells:

![](https://i.imgur.com/la4keCS.png)

Read and merge the available data, such as LAS data and QEMSCAN data.

![](https://i.imgur.com/REefPwr.png)

Not all the wells have the same kind of data. However, the dataset contains all the variables encountered across the wells. Each column of the dataset corresponds to a feature. If a feature is absent for a well, an '*out-of-range*' (-999999) value is assigned.

### Train and test dataset

The training set consists of the well data on which we train the ML algorithm. The test set consists of a single well (or a few) on which we test our ML model. For this example, **Humilly 2** was kept aside to test the ML model. In particular, we want to test how well training on the 8 other wells can predict some of its logs. The test predictions are made using the XGBoost[^1] algorithm. The logs available in Humilly 2 are the following:

![](https://i.imgur.com/Gu1bDQI.png)

**Test 1: predict RHOB in Humilly 2**

![](https://i.imgur.com/GGR5rXA.jpg =600x)

The RHOB curve in Humilly 2 is available only between 2560 and 3060 m. The prediction (blue) clearly shows the same trends as the true data (red). However, there are considerable gaps at the top and at the bottom of the curve. One interesting (and useful) thing about XGBoost, and ensemble methods in general, is the ability to plot the so-called feature importance: a ranking (based on a chosen metric) of how much each feature (each well log curve, in this case) contributes to predicting RHOB. We note that the most 'important' feature is the caliper curve and the least important is temperature.
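The hold-out workflow above (train on the other wells, predict a log in Humilly 2, rank feature importances) can be sketched as below. This is a minimal sketch, not the actual pipeline: the dataset is synthetic, the column names (`DEPT`, `GR`, `CALI`) are assumed LAS mnemonics, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost (the real analysis uses XGBoost's `XGBRegressor` / `plot_importance`).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

NODATA = -999999  # 'out-of-range' placeholder used in the merged dataset

# Synthetic stand-in for the merged dataset: one row per depth sample per well.
# Real column names would be the LAS mnemonics (assumptions here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "WELL": rng.choice(["Humilly 2", "Well A", "Well B"], size=300),
    "DEPT": rng.uniform(500, 3000, size=300),
    "GR":   rng.uniform(20, 150, size=300),
    "CALI": rng.uniform(6, 16, size=300),
    "RHOB": rng.uniform(2.0, 2.9, size=300),
})

features, target = ["DEPT", "GR", "CALI"], "RHOB"

# Keep one well aside as the test set, as done with Humilly 2.
train = df[df["WELL"] != "Humilly 2"]
test = df[df["WELL"] == "Humilly 2"]

# Mask the -999999 placeholders before fitting.
train = train.replace(NODATA, np.nan).dropna(subset=features + [target])

model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train[target])
pred = model.predict(test[features])

# Feature-importance ranking, analogous to XGBoost's plot_importance.
importance = pd.Series(model.feature_importances_, index=features)
print(importance.sort_values(ascending=False))
```

Note that XGBoost can handle the -999999 placeholders natively as missing values; with scikit-learn's estimator they are masked explicitly instead.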
![](https://i.imgur.com/QTTWQmO.jpg)

**Test 2: predict DT (sonic log) in Humilly 2**

![](https://i.imgur.com/vzsPqLv.jpg =400x)

The predicted DT trend (blue) clearly follows the true data (red). The major differences are between 500 m and 750 m, 1000 m and 1250 m, and 2100 m and 2500 m. And the feature importance plot:

![](https://i.imgur.com/dCbSqGm.jpg)

### Comments

As the results show, there is a lot of potential in predicting missing well logs from existing data. Some considerations:

- No feature engineering has been done (e.g. creating new features, such as statistics on the logs or wavelet decompositions, that can capture patterns not visible in a curve alone). Feature engineering is typically a key step in this kind of analysis, as it can greatly improve the predictions.
- The algorithms have not been tuned (e.g. by searching for the best parameters). Such ML algorithms typically have many parameters, and finding the combination that gives the best results is almost a separate ML discipline.
- Cross-validation has not been applied; it should probably improve the results. CV re-splits the training dataset several times in order to increase the robustness of the ML model.

By pushing the analysis a little further (the 3 points above), I think we can achieve outstanding results.

:::info
:pushpin: For the moment, QEMSCAN data have not been integrated into the dataset, since we need to define the intervals for the wells other than GEO01. If we have cuttings descriptions, we should also integrate them into the analysis: this kind of algorithm can work with both numerical and categorical data.
:::

[^1]: Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.
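The cross-validation mentioned in the comments could be done with grouped folds, so that each fold is evaluated on wells the model has never seen (the same leave-one-well-out logic used for Humilly 2). A minimal sketch with scikit-learn, where the well names and log columns are invented for the example and scikit-learn's gradient boosting stands in for XGBoost:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data: 4 wells, 50 depth samples each (column names are assumptions).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "WELL": np.repeat(["W1", "W2", "W3", "W4"], 50),
    "GR":   rng.uniform(20, 150, size=200),
    "CALI": rng.uniform(6, 16, size=200),
    "RHOB": rng.uniform(2.0, 2.9, size=200),
})

X, y = df[["GR", "CALI"]], df["RHOB"]

# Grouping by well keeps all samples of a well in the same fold, so every
# fold tests on wells absent from its training set.
cv = GroupKFold(n_splits=4)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0),
    X, y, groups=df["WELL"], cv=cv,
    scoring="neg_root_mean_squared_error",
)
print(scores)  # one (negated) RMSE per held-out well
```

The spread of the per-well scores gives a more honest robustness estimate than a single hold-out well, and the same splitter can feed a hyperparameter search to address the tuning point above.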