Document Type : Research Article
Authors
 David Wood ^{1}
 Abouzar Choubineh ^{2}
^{1} Professor DWA Energy Limited Lincoln, United Kingdom
^{2} MSc Petroleum University of Technology, Ahwaz, Iran
Abstract
Machine-learning algorithms aid predictions for complex systems with multiple influencing variables. However, many neural-network related algorithms behave as black boxes in terms of revealing how the prediction of each data record is performed. This drawback limits their ability to provide detailed insights concerning the workings of the underlying system, or to relate predictions to specific characteristics of the underlying variables. The recently proposed transparent open box (TOB) learning network algorithm successfully addresses these issues by revealing the exact calculation involved in the prediction of each data record. That algorithm, described in summary, can be applied in spreadsheet or fully-coded configurations and offers significant benefits to the analysis and prediction of many natural gas systems. The algorithm is applied to the prediction of natural gas density using a published dataset of 693 data records involving 14 variables (temperature and pressure plus the molecular fractions of the twelve components: methane, ethane, propane, 2-methylpropane, butane, 2-methylbutane, pentane, octane, toluene, methylcyclopentane, nitrogen and carbon dioxide). The TOB network demonstrates very high prediction accuracy (up to R^{2} = 0.997), achieving comparable accuracy to the predictions reported (R^{2} = 0.995) for an artificial neural-network (ANN) algorithm applied to the same data set. With its high levels of transparency, the TOB learning network offers a new approach to machine learning as applied to many natural gas systems.
Keywords
 Predicting gas density
 Learning networks
 Multicomponent natural gas
 Auditable machine learning
 Transparent predictions
1. Introduction
Machine learning algorithms are increasingly employed to provide accurate predictions from complex systems governed by multiple variables with poorly defined nonlinear relationships. System learning tools, such as artificial neural networks (ANN), adaptive neuro-fuzzy inference systems (ANFIS), support vector machines (SVM), and least squares support vector machines (LSSVM), are being ever more widely applied (Schmidhuber, 2015).
The learning potential of ANN was recognized in the 1950s (Kleene, 1956) and has developed with a number of different algorithms now routinely exploited, such as the multilayer perceptron (MLP) (Hush & Horne, 1993; Haykin, 1995) and radial basis functions (RBF) (Broomhead & Lowe, 1988). Developed in the 1990s, ANFIS combines ANN with a Takagi–Sugeno fuzzy inference system (Jang, 1993) and has successfully demonstrated its learning capabilities when applied to approximate and uncertain nonlinear functions (Jang, Sun, & Mizutani, 1997). SVM and LSSVM algorithms, also developed in the 1990s, provide supervised learning that is successfully applied to nonlinear regression and correlation analysis (Cortes & Vapnik, 1995; Vapnik, 2000). These machine learning algorithms are now commonly applied to provide predictions for many nonlinear systems, including those in the gas and oil industries. Moreover, they are also widely used in hybrid form, coupled with various optimization algorithms, such as genetic algorithms, to improve their performance (Ghorbani, Ziabasharhagh, & Amidpour, 2014; Ghorbani, Hamedi, Shirmohammadi, Mehrpooya, & Hamedi, 2016) and data handling capabilities (Ghorbani, Shirmohammadi, Mehrpooya, & Hamedi, 2018; Shirmohammadi, Ghorbani, Hamedi, Hamedi, & Romeo, 2015; Shirmohammadi, Soltanieh, & Romeo, 2018; Choubineh, Ghorbani, Wood, Moosavi, Khalafi, & Sadatshojaei, 2017).
However, in academia and industry, the extensive exploitation of machine learning algorithms in their many hybrid forms has polarized scientists, particularly in the oil and gas industry. Perhaps the most contentious issue is the lack of transparency provided by neural networks regarding their inner calculations, particularly the relative weightings and adjustments made to input variables in deriving specific predictions. This often leads to them being used and viewed as black boxes (Heinert, 2008). It requires complex and sometimes cumbersome simulations to gain insight into the ways variables are treated in their calculations. At best this turns them into “white boxes” that provide insight into the relative influences of input variables on the calculations being made (Elkatatny, Tariq, & Mahmoud, 2016).
This black-box condition frustrates and infuriates some scientists and industry practitioners. If it is not possible to see, quickly and in detail, how a prediction is derived from a machine learning tool, and no new fundamental insight is provided about the underlying system, then, for example, many experimental scientists see no value in such systems. Those in this camp, when reviewing machine-learning studies, simply dismiss them as correlation analysis of minor importance with no experimental justification. On the other hand, some oil and gas companies and their suppliers/service companies are comfortable with a black-box approach, as in some circumstances it can enhance their competitive advantage and keep their underlying data analysis confidential. Some researchers also embrace the black-box condition by their willingness to just enter the input-variable data into opaque coding (e.g., some MatLab machine-learning functions), derive accurate correlations and predictions for their objective function and make bold claims concerning the superiority of their newly developed algorithms. This leaves many practitioners blind to the inner workings of such systems. Nevertheless, the uptake of machine learning and the diversity of its applications continue to grow and have more impact on the way decisions are taken and real-time actions determined in the field.
What is urgently required are more-transparent machine learning tools that raise awareness about their underlying systems rather than obscure them. A recently proposed algorithm, the Transparent Open-Box (TOB) network (Wood, 2018), demonstrates that it is possible to do this in such a way that sufficient prediction accuracy is provided while also revealing the inner network calculations involved in deriving the prediction for each data record.
In the natural gas industry, the density of natural gas (ρ) depends on several complex nonlinear relationships relating to its physical and chemical characteristics. Its prediction from underlying physicochemical conditions makes it suitable for machine learning applications.
Natural gas density is an important metric contributing to the calculations of other variables relevant to numerous systems involving natural gas (e.g. pipelines, storage facilities, and underground reservoirs). It is complex because it varies significantly with respect to pressure, temperature and gas composition (Al-Quraishi & Shokir, 2011). Measuring it experimentally is time-consuming and expensive, and estimating it from PVT data involves significant assumptions about related metrics, which are difficult to define with accuracy (e.g. z-factors). Although various equations of state (EOS) have been proposed for calculating ρ, they have proved to be too simple and inconsistent across the full P-T and compositional ranges encountered (Elsharkawy, 2003; Farzaneh-Gord, Khamforoush, Hashemi, & Pourkhadem, 2010).
Shokir (2008) proposed a fuzzy logic method and Al-Quraishi and Shokir (2011) developed the probabilistic alternating conditional expectations model to predict ρ. Several machine learning algorithms have also been applied to predict ρ accurately from various medium-sized and large databases. These include: ANN (Al-Quraishi & Shokir, 2009); LSSVM (Esfahani, Baselizadeh, & Hemmati-Sarapardeh, 2015); ANN-TLBO (Choubineh, Khalafi, Kharrat, Bahreini, & Hosseini, 2017); and ANFIS (Dehaghani & Badizad, 2017). These studies have typically achieved accuracies of predicted versus measured data with coefficients of determination of about 0.99 and very low values of statistical error measures (e.g., root mean squared error).
Here, we describe the methodology and mathematical basis of the TOB algorithm and demonstrate its application benefits using, as an example, a complex nonlinear natural gas system for predicting natural gas density from mole fractions of gas composition, together with its temperature and pressure. We have selected a comprehensive published dataset (Atilhan, Aparicio, Karadas, Hall, & Alcade, 2012) of experimental measurements performed on Qatar North Field natural gas samples (693 data records), because it has previously been used in a published ANN study to predict ρ (Choubineh, Khalafi, Kharrat, Bahreini, & Hosseini, 2017). Our objective is to show that the TOB algorithm is capable of producing accuracy comparable to that of the ANN study when predicting from this dataset, with the additional benefit of providing transparency to each individual prediction it calculates. We make no claims that the TOB algorithm can outperform other more mathematically complex machine-learning algorithms (ANN, ANFIS, LSSVM etc.) in terms of prediction accuracy, but rather that it can achieve acceptable levels of accuracy with the additional benefit of greater transparency.
2. The Transparent Open-Box Learning Network Algorithm: Methodology and Mathematical Basis
The TOB network approach was proposed and outlined by Wood (2018). Applying the TOB learning network algorithm involves 14 steps, divided into two stages (stage 1 and stage 2). The steps are explained here and summarized in a flow diagram (Figure 1).
Step 1: Assemble data into a two-dimensional (2D) array consisting of (N+1) variables and M data records. The variables include N input variables plus one dependent variable (i.e., the prediction-objective dependent variable or PODV) to predict for the M data records.
Step 2: Sort and rank the data records into ascending or descending order of the PODV values.
Step 3: Calculate standard statistical metrics (i.e., minimum, maximum, mean, standard deviation, etc.) defining the range and distribution of each of the N+1 variables in the dataset. These statistics must include the minimum and maximum values, as these are used for the normalization process described in step 4.
Figure 1. Flow diagram illustrating the 14 steps involved in the application of the transparent open-box (TOB) learning network algorithm (Wood, 2018). The methodology and mathematical details associated with each step are described in the text.
Step 4: Normalize the M data records for each of the (N+1) variables. To provide a normalized range of (-1, +1) for variable X, with its minimum and maximum values as the limits, use Eq. (1):
$$X_{i}^{*} = 2\left[\frac{X_{i} - X_{min}}{X_{max} - X_{min}}\right] - 1 \tag{1}$$
Where:
X_{i} is the actual value of variable X for the i^{th} data record
X_{min} is the minimum value of variable X that exists in the entire data set
X_{max} is the maximum value of variable X that exists in the entire data set
X_{i}^{*} is the normalized value of variable X for the i^{th} data record
Step 5. Calculate the standard statistical metrics for the normalized dataset. This is not an essential step, but it provides a useful check to confirm that the normalized variables do, indeed, all fall within the range -1 to +1 as intended.
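The normalization of Step 4 and the range check of Step 5 can be sketched in a few lines of Python. This is a minimal illustration only: the function name `normalize_column` and the five sample temperatures are hypothetical, chosen to span the 250 to 450 K range of the dataset.

```python
def normalize_column(values):
    """Scale a list of numbers to the range [-1, +1] using Eq. (1)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("variable is constant; Eq. (1) is undefined")
    return [2.0 * (x - x_min) / (x_max - x_min) - 1.0 for x in values]

# Hypothetical sample of temperatures (K) spanning the dataset range.
temperatures_k = [250.0, 300.0, 350.0, 400.0, 450.0]
normalized = normalize_column(temperatures_k)

# Step 5 check: the normalized extremes must be exactly -1 and +1.
assert min(normalized) == -1.0 and max(normalized) == 1.0
print(normalized)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

The minimum and maximum are taken over the whole column, which is why Step 3 must compute them for the entire dataset before any subsets are formed.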
Step 6. Divide the normalized 2D array into training, tuning and testing subsets. The testing subset is kept apart and is not involved in model training or tuning. Sensitivity analysis can be conducted to establish the best division of data records (i.e., percentages of the entire dataset) to allocate to each subset. Typically, more than about 70% of the data records are allocated to the training subset. Consequently, up to about 15%, depending on the size of the data set, are then allocated to each of the other two subsets (i.e., tuning and testing). Such divisions of the data records produce meaningful levels of prediction accuracy. For a specific dataset, exact percentages allocated to each subset can be refined by running sensitivity cases.
Step 7. Calculate the squared error for each of the N+1 variables (i.e., the variable squared error, VSE) between each of the tuning records (J in total) and all the data records allocated to the training subset (K in total) as expressed in Eq. (2):
$$VSE(X)_{jk} = \left(X_{k}^{tr} - X_{j}^{tu}\right)^{2} \tag{2}$$
Where:
X_{k}^{tr} is the value of variable X for the k^{th} data record in the training subset
X_{j}^{tu} is the value of variable X for the j^{th} data record in the tuning subset
VSE(X)_{jk} is the squared error of variable X for the j^{th} data record in the tuning subset versus the k^{th} data record of the training subset.
Then sum the VSE values for each of the N+1 variables between each data record to calculate ∑VSE. In this step and for TOB stage 1 up to step 10, equal weighting factors (W_{n}) are applied to each variable as expressed in Eq. (3):
$$\sum VSE_{jk} = \sum_{n=1}^{N+1} W_{n} \, VSE(X_{n})_{jk} \tag{3}$$
Where:
VSE(X_{n})_{jk} is the squared error for variable X_{n} for the j^{th} data record in the tuning subset versus the k^{th} data record of the training subset.
∑VSE_{jk} is the sum of the squared errors for all N+1 variables for the j^{th} data record in the tuning subset versus the k^{th} data record of the training subset.
W_{n} is a weighting factor applied to the squared error of variable n. Each of the N+1 weighting factors is free to be allocated an independent value between 0 and 1. However, in stage one of the algorithm (i.e., for step 7 to step 10, Figure 1) the weighting factors for all N+1 variables are set to the same non-zero constant (e.g., all set to 1, all set to 0.5, or all set to another constant between 0 and 1) so that no bias is introduced among the variable contributions to the match established in stage one of the algorithm.
Use the ∑VSE values as the basis for ranking the matches in the training subset, in ascending order of ∑VSE, for each tuning subset record.
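As a rough sketch of Step 7, the weighted sum of squared errors between one tuning record and each training record can be computed and ranked as follows. The three-variable records and the helper name `sum_vse` are hypothetical; real records in this study would carry 15 normalized values each (14 inputs plus density).

```python
def sum_vse(tuning_record, training_record, weights):
    """Weighted sum of squared errors between two normalized records (Eqs. 2 and 3)."""
    return sum(
        w * (x_tr - x_tu) ** 2
        for w, x_tr, x_tu in zip(weights, training_record, tuning_record)
    )

# Hypothetical normalized records with N + 1 = 3 variables (two inputs plus the PODV).
training = [[-1.0, -1.0, -1.0], [0.0, 0.5, 0.2], [1.0, 1.0, 1.0]]
tuning_record = [0.0, 0.0, 0.0]
weights = [0.5, 0.5, 0.5]  # equal, non-zero weights as used in TOB stage 1

scores = [sum_vse(tuning_record, record, weights) for record in training]

# Rank training records in ascending order of their score (best match first).
ranking = sorted(range(len(training)), key=lambda k: scores[k])
print(ranking[0])  # 1
```

Because the weights are equal in stage 1, changing their common value rescales every score by the same factor and leaves the ranking unchanged.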
Step 8. Select the top-Q-ranking data records (i.e., those with the lowest ∑VSE) in the training subset for each tuning subset data record. Rank these high-matching records in order, so that the training subset data record with the lowest ∑VSE versus data record j in the tuning subset is ranked as the #1 match for tuning subset record j. The integer value of Q is typically set to 10 for TOB stage 1 (i.e., up to step 10) and is refined later in optimization stage two (see Step 11). These top-10-matching data records from the training subset for each tuning subset record are recorded and then made available for the detailed calculation of the PODV prediction for each tuning subset data record.
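Selecting the Q best matches is a partial sort. A minimal sketch using Python's standard `heapq` module is shown below; the six scores are hypothetical and `top_q_matches` is an illustrative helper name, not part of any published TOB code.

```python
import heapq

def top_q_matches(scores, q=10):
    """Return (training_index, sum_vse) pairs for the q lowest scores, best match first."""
    return heapq.nsmallest(q, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical sum-of-VSE scores for one tuning record against six training records.
scores_for_record_j = [0.40, 0.02, 0.75, 0.10, 0.31, 0.05]

matches = top_q_matches(scores_for_record_j, q=3)
print(matches)  # [(1, 0.02), (5, 0.05), (3, 0.1)]
```

Keeping the training index alongside each score is what allows the matched records to be traced back later when a prediction is audited.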
Step 9. The top-ten-ranking records in the training subset for the j^{th} data record in the tuning subset each contribute a fraction to the predicted value of the dependent variable for that j^{th} data record. That fractional contribution is calculated by Eq. (4) to Eq. (6) and is proportional to their relative ∑VSE scores for the j^{th} data record.
$$f_{q} = \frac{\sum VSE_{jq}}{\sum_{r=1}^{Q} \sum VSE_{jr}} \tag{4}$$
Where:
q and r are each one of the Q-top-ranking records from the training subset with the closest matches to the j^{th} record in the tuning subset.
f_{q} is the fractional contribution of one of the top-Q-ranking records for the j^{th} record in the tuning subset calculated such that Eq. (5) applies.
$$\sum_{q=1}^{Q} f_{q} = 1 \tag{5}$$
In order to ensure that the matching record with the lowest ∑VSE value contributes most to the dependent variable prediction for the j^{th} data record, Eq. (6) is then calculated involving all of the top-Q-ranking records.
(6) 
Where:
is the dependent variable for the q^{th} data record in the training subset, which is one of the Qtopranking data records in the training subset for the j^{th} data record in the tuning subset.
is the initially predicted value for the dependent variable for the j^{th} data record in the tuning subset (with equal weighting, as described in Step 7, applied to all the variables).
Applying Eq. (6) ensures that the rank #1 record among the top-matching records in the training subset contributes most to the predicted values. On the other hand, the rank #Q match in the training subset contributes least to the dependent-variable prediction for the j^{th} data record in the tuning subset.
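The behaviour described for Step 9 can be illustrated with a short Python sketch. It assumes that f_{q} is each matched record's share of the total ∑VSE across the Q matches, so that the fractions sum to one. It further assumes, purely for illustration, that each matched PODV value is weighted by (1 - f_{q})/(Q - 1). That weighting form is an assumption of this sketch rather than a statement of Eq. (6); it is used because it gives the lowest-∑VSE record the largest contribution and makes the contributions sum to one. The three matches are hypothetical.

```python
def predict_podv(matches):
    """Predict the PODV from (sum_vse, podv) pairs of the top-Q training matches."""
    q = len(matches)
    total = sum(score for score, _ in matches)
    if q < 2 or total == 0.0:
        raise ValueError("need at least two matches with a non-zero total score")
    fractions = [score / total for score, _ in matches]        # f_q, sums to one
    # Assumed weighting: a smaller f_q (closer match) gives a larger contribution.
    contributions = [(1.0 - f) / (q - 1) for f in fractions]
    prediction = sum(c * podv for c, (_, podv) in zip(contributions, matches))
    return prediction, fractions, contributions

# Hypothetical top-3 matches: (sum of VSE, measured density in kg/m^3).
matches = [(0.01, 200.0), (0.03, 210.0), (0.06, 190.0)]
prediction, fractions, contributions = predict_podv(matches)

for (score, podv), f, c in zip(matches, fractions, contributions):
    print(f"score={score:.2f} podv={podv:.1f} f_q={f:.2f} contribution={c:.2f}")
print(f"predicted density = {prediction:.1f}")
```

Printing each contribution next to its matched record is the kind of per-record audit trail that the transparency of the method refers to.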
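Step 10. Compute the coefficient of determination (R^{2}), mean square error (MSE) and root mean square error (RMSE) for the predicted versus actual or measured values of the PODV for all J data records in the tuning subset using Eq. (7) to Eq. (9).
$$R^{2} = 1 - \frac{\sum_{j=1}^{J}\left(X_{j}^{actual} - X_{j}^{predicted}\right)^{2}}{\sum_{j=1}^{J}\left(X_{j}^{actual} - \bar{X}^{actual}\right)^{2}} \tag{7}$$
$$MSE = \frac{1}{J}\sum_{j=1}^{J}\left(X_{j}^{actual} - X_{j}^{predicted}\right)^{2} \tag{8}$$
$$RMSE = \sqrt{MSE} \tag{9}$$
Where:
X_{j} is the dependent variable (designated in Eq. (6)) for the j^{th} data record in the tuning subset
X_{j}^{actual} is the actual value of the dependent variable for the j^{th} data record
X_{j}^{predicted} is the predicted value of the dependent variable for the j^{th} data record
\bar{X}^{actual} is the average actual value of the dependent variable for all J data records in the tuning subset.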
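A small Python sketch of the Step 10 metrics follows. It assumes the common definition of R^{2} as one minus the ratio of the residual sum of squares to the total sum of squares about the mean actual value. The four actual and predicted values are hypothetical.

```python
import math

def prediction_metrics(actual, predicted):
    """Return R^2, MSE and RMSE for paired lists of actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {"r2": 1.0 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}

# Hypothetical measured and predicted densities (kg/m^3).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

metrics = prediction_metrics(actual, predicted)
print(metrics)
```

RMSE is reported in the units of the dependent variable (kg/m^{3} for density), which makes it easier to interpret than MSE when comparing solutions.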
This step represents the end of TOB stage one of the prediction process. Step 10 provides a provisionally tuned TOB network that provides predictions based upon uniform weighting (as described in Step 7) applied to all the variables and by matching data records with those in the training subset.
TOB stage 2 involves applying optimization to improve the accuracy of the predictions for the tuning set as a whole. TOB stage 2 also tests the optimized prediction metrics with the yet-to-be-used independent testing subset.
Step 11. Apply an optimizer to the provisionally tuned TOB network to improve the accuracy of the PODV prediction for the tuning subset. The optimizer is set up with its objective function to minimize RMSE (Eq. 9) for all J data records in the tuning subset by varying a set of optimization variables within specified constraints. These optimization variables are:
1. The weights (W_{n}) applied by the optimizer to each of the N input variables in Eq. (3) are allowed to vary independently between values 0 and 1. This contrasts with Step 7 of stage one of the algorithm, where all the weights were initially set to the same constant number between 0 and 1. Also, in Step 11 the dependent variable (identified as variable N+1) is not involved in the optimization as it is considered as an unknown, so a zero weight is applied to it. Sometimes, very low weights (e.g. 1.0E-10 or less) may be selected as optimum weights for certain input variables. A very low weight does not mean that such a variable is insignificant in the optimum solution. Non-zero weights, albeit small, still contribute to selecting the relative contributions of each of the top-matching records in the predictions made. This point is illustrated for an example data record from the dataset evaluated.
2. The integer value of Q (how many of the top-matching records to include in the predictions) is allowed to vary in Eqs. (4), (5) and (6). Typically, 2<=Q<=10 is the range within which Q is allowed to vary and the optimizer selects the best value of Q from that range. Q values higher than 10 could be used. However, experience of applying the algorithm to multiple datasets suggests that the optimum solutions found by the optimizer do not use all of the top-ten matching records.
In this study, the standard “Solver” optimizer in Microsoft Excel is used to conduct the optimization process. Specifically, it is the GRG (Generalized Reduced Gradient) algorithm option within the Solver function that is used. GRG applies a robust nonlinear-programming algorithm (Farzaneh-Gord, Khamforoush, Hashemi, & Pourkhadem, 2010). GRG is set up to “multi-start” (i.e., run multiple cases each with a population of 150) and to converge to a solution value of 0.0001, if possible, for the RMSE objective function. GRG can be run directly from an Excel worksheet or as part of Visual Basic for Applications (VBA) code in Excel. It is possible to use other fully-coded optimizers to achieve this, but doing it in Excel has advantages for mid-sized datasets, as explained later in this section.
The optimization process accepts the top-Q matches in the training subset for each data record in the tuning subset established by step 8 of TOB stage 1. However, in TOB stage two it re-evaluates the ∑VSE scores using Eq. (3) by varying W_{n} in each iteration of the optimizer, and the f_{q} scores using Eq. (4) by varying Q in each iteration of the optimizer.
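The stage 2 search can be imitated in Python with a deliberately simple stand-in optimizer. The sketch below is not the Excel Solver GRG algorithm used in this study: it is a seeded random search over the weights and Q. The data are synthetic, and the prediction formula reuses the assumed (1 - f_{q})/(Q - 1) weighting purely for illustration. All function names are hypothetical.

```python
import math
import random

def predict(inputs, matched, weights, q):
    """Predict one PODV from the stage-1 matches of a tuning record."""
    scored = sorted(
        (sum(w * (a - b) ** 2 for w, a, b in zip(weights, tr_inputs, inputs)), podv)
        for tr_inputs, podv in matched
    )[:q]
    k = len(scored)
    total = sum(score for score, _ in scored)
    if k < 2 or total == 0.0:
        return sum(podv for _, podv in scored) / k
    # Assumed contribution form: (1 - f_q) / (k - 1), so contributions sum to one.
    return sum((1.0 - score / total) / (k - 1) * podv for score, podv in scored)

def rmse(tuning, weights, q):
    squared = [(podv - predict(x, matched, weights, q)) ** 2 for x, podv, matched in tuning]
    return math.sqrt(sum(squared) / len(squared))

def random_search(tuning, n_inputs, q_values=tuple(range(2, 11)), trials=200, seed=0):
    """Return (rmse, weights, q) for the best candidate found."""
    rng = random.Random(seed)
    candidates = [([0.5] * n_inputs, q) for q in q_values]  # equal-weight starting points
    for _ in range(trials):
        candidates.append(([rng.random() for _ in range(n_inputs)], rng.choice(q_values)))
    return min(((rmse(tuning, w, q), w, q) for w, q in candidates), key=lambda c: c[0])

# Synthetic data: the PODV depends on the first input only; the second is a distraction.
data_rng = random.Random(1)

def make_record():
    x1, x2 = data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)
    return [x1, x2], 100.0 + 50.0 * x1

training = [make_record() for _ in range(40)]
tuning = []
for _ in range(10):
    inputs, podv = make_record()
    # Stand-in for stage 1: keep the ten nearest training records under equal weights.
    nearest = sorted(
        training, key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], inputs))
    )[:10]
    tuning.append((inputs, podv, nearest))

baseline = rmse(tuning, [0.5, 0.5], 10)
best_rmse, best_weights, best_q = random_search(tuning, n_inputs=2)
print(f"baseline RMSE={baseline:.3f}, optimized RMSE={best_rmse:.3f}, Q={best_q}")
```

Because the equal-weight, Q = 10 case is included among the candidates, the search can never return a worse RMSE than the stage 1 baseline. A gradient-based or evolutionary optimizer would normally find better solutions with fewer evaluations.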
Step 12. Evaluate and compare the RMSE and R^{2} values obtained by the optimum solution found by step 11. The statistical accuracy of the predictions derived from Step 11 typically demonstrates a significant improvement on the TOB stage 1 predictions (from step 10). Also, this step runs and evaluates sensitivity cases with different fixed values of Q (2 to 10). All but one of these sensitivity cases are suboptimal. However, comparing these sensitivity-case results helps to identify regions of the TOB network that might be prone to underfitting or overfitting.
Step 13. Apply the weights and Q values of the optimized learning network, tuned for the tuning and training subsets, to the independent testing subset. The RMSE and R^{2} values obtained for the testing subset should be close to those for the optimized tuning subset. The detailed prediction calculations for each data record (testing and tuning subsets) are transparently recorded and can be reviewed to interrogate the reason for prediction outliers, if any occur. This is a useful prediction-auditing attribute of the TOB and helps to provide deeper insight into the underlying dataset and optimally tuned network. It also generates more confidence in the reliable range for which meaningful predictions can be generated.
Step 14. Decide whether the level of accuracy achieved by the TOB is fit for purpose. If so, deploy it. If not, interrogate the prediction performance of the TOB by reviewing in detail the prediction calculations for each of the data records of the tuning and testing data subsets to establish the PODV value ranges for which the network lacks sufficient accuracy. This information can help to focus the network on viable PODV ranges and establish value ranges for which the dataset is too sparse. This information can also be useful as a benchmark for assessing the performance of other machine-learning algorithms applied to the same data set.
In summary, TOB Stage 1 involves constructing a network of initial record matches from a large training subset to the individual records of a much smaller tuning subset of data records. That first stage yields a provisional prediction for the dependent variable which can usually be significantly improved upon by the optimization applied in TOB stage 2.
TOB stage 1 involves standard matching and ranking algorithms between an unknown record and the multiple records in the larger training subset. That training subset should typically comprise more than about 70% of all the data records available. In order to obtain reliable predictions across the entire PODV-value range covered by the dataset, the records included in the tuning subset and the testing subset should be distributed across the full range displayed by the dataset. It is also appropriate for the data records with the minimum and maximum PODV values to be placed in the training subset. These requirements mean that the division of the data records between the data subsets is not conducted randomly, as that might lead to sparse data coverage in certain PODV-value ranges in the training and tuning subsets.
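One way to meet these requirements is a systematic split on the PODV-ranked records, sketched below in Python. The source does not specify the actual allocation scheme, so the rule used here (every eighth ranked record to tuning, every tenth to testing) and the names `systematic_split`, `tune_every` and `test_every` are assumptions for illustration only. The densities are synthetic values spanning the range reported in Table 1.

```python
def systematic_split(podv_values, tune_every=8, test_every=10):
    """Assign record indices to training, tuning and testing subsets.

    Records are ranked by PODV. Both extremes always go to training, every
    `tune_every`-th ranked record goes to tuning and every `test_every`-th to
    testing, so each subset spans the full PODV range.
    """
    order = sorted(range(len(podv_values)), key=lambda i: podv_values[i])
    subsets = {"training": [], "tuning": [], "testing": []}
    last = len(order) - 1
    for rank, idx in enumerate(order):
        if rank in (0, last):
            subsets["training"].append(idx)  # minimum and maximum PODV records
        elif rank % tune_every == 0:
            subsets["tuning"].append(idx)
        elif rank % test_every == 0:
            subsets["testing"].append(idx)
        else:
            subsets["training"].append(idx)
    return subsets

# Synthetic densities spanning 75.36 to 415.25 kg/m^3 (the range in Table 1).
densities = [75.36 + i * (415.25 - 75.36) / 99 for i in range(100)]
split = systematic_split(densities)
print({name: len(indices) for name, indices in split.items()})
# {'training': 81, 'tuning': 12, 'testing': 7}
```

With these illustrative settings about 81% of records fall in the training subset, which is in line with the "more than about 70%" guideline. The intervals can be adjusted as part of the sensitivity analysis mentioned in Step 6.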
The simple steps of TOB Stage 1 (steps 1 to 10) often generate predictions for the dependent variable of credible but suboptimal accuracy. This highlights which data records in the training subset should be the focus of more detailed analysis for each data record in the tuning subset. Stage 1 can often achieve impressive levels of accuracy from highly nonlinear input data distributions. TOB stage 2 (steps 11 and 12) applies optimization to refine and tune the predictions derived from TOB stage 1. A comparison of the prediction results from stage 1 and stage 2 can typically reveal the respective contributions of each stage to the accuracy of the final predictions derived. Once the optimized-tuning process is completed, and the optimum tuned values of Q and W_{n} are established, those values are then applied to generate predictions of the dependent variable for the data records of the testing subset (TOB stage 2: steps 13 and 14).
The TOB learning network can be applied using spreadsheets (e.g. Excel workbooks), which is a suitable approach for small to mid-sized data sets. It can also be set up in fully-coded formats or as a hybrid code-plus-spreadsheet configuration. The spreadsheet and hybrid alternatives have the attraction that the standard built-in spreadsheet optimizers can be exploited (e.g. the generalized reduced gradient, GRG, and evolutionary optimizers of Excel’s Solver optimization function). That approach enables the final steps of the TOB Stage 1 and the Stage 2 prediction calculations to be displayed as simple and easily audited formulas for each data record in the spreadsheet cells. For large datasets it is more efficient to code the TOB algorithm with suitable mathematical coding languages (e.g., Octave, R, Python, MatLab etc.).
To predict ρ from the compiled natural gas dataset (693 data records) evaluated in this study, a hybrid VBA-Excel spreadsheet configuration is used. The TOB subsets (training, tuning and testing) are initially displayed in Excel with some calculations conducted using spreadsheet formulas (e.g. statistical metrics for all variables). Visual Basic for Applications (VBA) coding is then used to normalize, rank, and match the data records of the tuning and training subsets (TOB stage 1). The VBA code places the top-ten ranked matches for each tuning subset data record into Excel sheet cells. This enables the final TOB Stage 2 optimization calculations to be conducted with Excel cell formulas, enabling the Solver optimizer(s) (Frontline Solvers. Standard Excel Solver, 2018) to be applied. This approach enhances the transparency of, and insight into, the dataset compared to the fully-coded method.
3. Application of the TOB Learning Network to Predict Gas Density
A TOB network is used to predict ρ from the dataset of experimental measurements performed on Qatar North Field natural gas samples (693 data records) published by Atilhan et al. (2012). The data records cover a temperature range of 250 to 450 K and a pressure range of 15 to 65 MPa. They also include compositional data for each data record in the form of mole fractions for the 12 components: methane, ethane, propane, 2-methylpropane, butane, 2-methylbutane, pentane, octane, toluene, methylcyclopentane, nitrogen and carbon dioxide. A value range and mean for each variable in the full 693-record data set is provided in Table 1.
Table 1. Statistical summary of the natural gas dataset (Atilhan, Aparicio, Karadas, Hall, & Alcade, 2012): data record values for the fourteen input variables, with gas density as the dependent variable, to which the TOB learning network (Wood, 2018) is applied.

Summary of Dataset Consisting of 693 Data Records

| Input Variable Number | Variable (mole fractions for #1 to #12) | Min | Max | Mean |
|---|---|---|---|---|
| 1 | Methane | 0.8034 | 0.9026 | 0.85197 |
| 2 | Ethane | 0.05189 | 0.05828 | 0.05515 |
| 3 | Propane | 0.01878 | 0.02106 | 0.01997 |
| 4 | 2-Methylpropane | 0.00384 | 0.00412 | 0.00399 |
| 5 | Butane | 0.00573 | 0.00641 | 0.006 |
| 6 | 2-Methylbutane | 0.00169 | 0.00214 | 0.0019 |
| 7 | Pentane | 0.0014 | 0.00162 | 0.0015 |
| 8 | Octane | 0.00145 | 0.00161 | 0.00153 |
| 9 | Toluene | 0.0009 | 0.0011 | 0.00097 |
| 10 | Methylcyclopentane | 0.00095 | 0.00106 | 0.00101 |
| 11 | Nitrogen | 0 | 0.06596 | 0.03364 |
| 12 | Carbon Dioxide | 0 | 0.0438 | 0.02237 |
| 13 | Pressure (MPa) | 15 | 65 | 40 |
| 14 | Temperature (K) | 250 | 450 | 350 |
| Dependent Variable: | Density (kg/m^{3}) | 75.36 | 415.25 | 244.928 |
The dataset is divided into training (532 data records; 77% of the complete dataset), tuning (90 data records; 13% of the complete dataset) and testing subsets (71 data records; 10% of the complete dataset) for detailed TOB network analysis.
The relationships between the key variables, P and T and ρ for the training subset are illustrated in Figures 2 A to D demonstrating the significant nonlinearity and irregularity in the relationships among these variables across the entire dataset.
Table 2 and Figures 3 and 4 show the results obtained from applying the TOB to this dataset: 1) up to step 10 (evenly weighted variable contributions to PODV prediction) for the configured tuning set; 2) up to step 12 for the optimized tuning set; 3) up to step 12 applying the optimized TOB settings to the testing subset.
Figure 2 (A to D). Pressure, temperature and density relationships in the training set used for the TOB network application.
Table 2. Gas density prediction performance of the TOB learning network applied to the 693-record data set, showing solutions with a range of variable weightings applied. The nine columns headed Q=10 to Q=2 are the sensitivity analysis cases with Q constrained to integers progressively from 10 to 2.

| Variable Description | Variable Number / Units | Pre-optimization Equal Weightings | Best Solution | Best Solution | Q=10 | Q=9 | Q=8 | Q=7 | Q=6 | Q=5 | Q=4 | Q=3 | Q=2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q constrained to | Integer constraints | | 2 to 10 | 2 to 10 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 |
| Q selected for solution | Integer # | | 3 | 3 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 |
| Prediction performance of optimum and constrained optimum solutions applied to the tuning subset (90 records: ~13.0% of total dataset) | | | | | | | | | | | | | |
| RMSE | kg/m^{3} | 6.8877 | 5.5995 | 5.5995 | 6.8638 | 6.7311 | 6.6134 | 6.6996 | 6.3217 | 6.3711 | 6.6230 | 5.5995 | 7.7695 |
| R^{2} | fraction | 0.9939 | 0.9954 | 0.9954 | 0.9940 | 0.9941 | 0.9942 | 0.9938 | 0.9942 | 0.9942 | 0.9936 | 0.9954 | 0.9907 |
| Weightings (0<=w<=1) applied to constrained optimum solutions for the tuning subset | | | | | | | | | | | | | |
| Temperature | #14 | 0.5 | 0.09888 | 0.65545 | 0.78770 | 0.66440 | 0.15029 | 0.79172 | 0.44836 | 0.67680 | 0.39388 | 0.46541 | 0.59351 |
| Pressure | #13 | 0.5 | 0.08734 | 0.57896 | 0.62375 | 0.55941 | 0.15310 | 0.85796 | 0.49286 | 0.69382 | 0.38816 | 0.41110 | 0.67403 |
| All other variables | #1 to #12 | 0.5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Ratio of T weight to P weight | | | 1.13211 | 1.13210 | 1.26284 | 1.18768 | 0.98163 | 0.92280 | 0.90971 | 0.97548 | 1.01475 | 1.13211 | 0.88055 |

Prediction performance of optimum solution variable weightings and Q value applied to the testing subset (71 records: ~10.2% of total dataset)

| Metric | Units | Value |
|---|---|---|
| RMSE | kg/m^{3} | 5.5023 |
| R^{2} | fraction | 0.9965 |

The algorithm was applied to this dataset using the combination of an Excel spreadsheet for steps 10 to 14 (enabling the use of Solver’s GRG and evolutionary optimization functions) and VBA code to handle the ranking, sorting, normalization, record matching, and selection (steps 1 to 8).
Figure 3. Predicted versus measured gas density for the tuning data set (90 records).
Figure 4. Predicted versus measured gas density for the testing data set (71 records).
The results reveal that the TOB algorithm can achieve very high levels of prediction accuracy (RMSE = 5.5 kg/m^{3}; R^{2} = 0.997 for the testing subset; Table 2). This accuracy is comparable to that achieved by ANN applied to the same data set (Choubineh, Khalafi, Kharrat, Bahreini, & Hosseini, 2017), which reported RMSE = 5.28 and R^{2} = 0.995. Both TOB and ANN provide ρ predictions superior to those of the published correlations applied to the same dataset (Azizi, Behbahani, & Isazadeh, 2010; Sanjari & Lay, 2012). RMSE and R^{2} achieved by Azizi et al.’s (2010) correlation were 59.18 and 0.7, respectively. Gas density predicted using that model (Azizi, Behbahani, & Isazadeh, 2010) for the low-density range is reasonable. However, that correlation significantly overestimates ρ in the higher-density range. Sanjari and Lay’s (2012) correlation model achieved a better gas density prediction performance (RMSE = 12.6; R^{2} = 0.97) than Azizi et al.’s (2010) correlation. Although Sanjari and Lay’s (2012) model estimates gas density values lower than 340 kg/m^{3} with reasonable precision, the values diverge greatly from the unit slope line (y = x) for values in the range of 340–450 kg/m^{3}, indicating the limitations of that model in that density range.
The proposed TOB model achieves a high level of prediction accuracy while also being able to display the exact prediction calculation for each record in each subset (i.e., it identifies which of the top-ranking matching records are involved and their fractional contributions to the prediction value). For this data set, the algorithm achieves most of this accuracy in steps 1 to 10 (i.e., TOB stage 1), reaching R^{2} = 0.9939 for the tuning subset with equal weighting applied to all 14 input variables, Q = 10, and no optimizer. Reviewing the prediction details of each record shows that the only variables impacting the prediction based on the high-ranking matches are T and P. The matching and ranking of the squared errors has removed the impact of the mole fractions of the individual gas components from the final optimized (TOB stage 2) prediction calculation. The mole fractions of the gas components do play an important role in selecting the top-10 ranking records in the training subset for each tuning subset data record in the credible TOB stage 1 provisional prediction. However, in TOB stage 2 the optimizer takes those top-10 matching records for each stage 1 prediction and refines the prediction by varying the weights it applies to P and T while applying zero weight to the mole fractions of the gas components. By doing so, it slightly improves on the TOB stage 1 prediction.
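The matching-and-contribution idea described above can be illustrated with a minimal sketch. It is not the authors' VBA/Excel implementation: the function name, the three-variable records, and in particular the inverse-squared-error rule used to turn matches into fractional contributions are assumptions made for illustration, and the paper's own contribution formula may differ.

```python
def tob_stage1_predict(query, training, weights, q):
    """Predict a target from the top-q best-matching training records.

    query    : normalized input values of the record to predict
    training : list of (inputs, target) pairs, inputs normalized the same way
    weights  : one non-negative weight per input variable
    Returns (prediction, contributions); contributions lists
    (training_index, fraction) so each calculation can be audited.
    """
    scored = []
    for idx, (inputs, target) in enumerate(training):
        # Weighted sum of squared differences over all input variables.
        err = sum(w * (a - b) ** 2 for w, a, b in zip(weights, query, inputs))
        scored.append((err, idx, target))
    top = sorted(scored)[:q]  # smallest squared error = best match

    # ASSUMPTION: contributions are inversely proportional to squared error.
    eps = 1e-12  # avoids division by zero for an exact match
    inverse = [1.0 / (err + eps) for err, _, _ in top]
    total = sum(inverse)
    fractions = [v / total for v in inverse]

    prediction = sum(f * target for f, (_, _, target) in zip(fractions, top))
    contributions = [(idx, f) for f, (_, idx, _) in zip(fractions, top)]
    return prediction, contributions


# Hypothetical normalized records [T, P, methane fraction] -> density (kg/m^3).
training = [
    ([0.0, 0.0, 0.9], 100.0),
    ([0.2, 0.1, 0.8], 150.0),
    ([0.8, 0.9, 0.7], 400.0),
    ([1.0, 1.0, 0.6], 450.0),
]
query = [0.05, 0.02, 0.88]

# Equal weights on every variable, as in the stage 1 provisional prediction.
prediction, contributions = tob_stage1_predict(query, training, [0.5, 0.5, 0.5], q=2)
print(prediction, contributions)

# Zero weight on the composition variable, mimicking the stage 2 outcome.
print(tob_stage1_predict(query, training, [0.5, 0.5, 0.0], q=2))
```

Because the function returns the indices and fractions of the matched records alongside the prediction, every predicted value can be traced back to the training records that produced it, which is the transparency property the text emphasizes.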
Table 2 highlights the results of optimization and sensitivity analysis obtained by varying the Q value from 10 to 2. The optimum Q value for this data set is 3 (yielding the minimum RMSE value). For values of Q below 3, the accuracy of the model is impaired slightly, which is suggestive of underfitting. The impact of varying Q on the predictions is caused by subtle changes in the weightings applied to variables T and P. As it happens, for this data set the optimum weightings for these two variables in gas density prediction are close in relative magnitude (e.g., 0.5:0.5). That is why the TOB stage 1 provisional prediction achieved such high prediction accuracy: it applied weightings of 0.5 to all 14 input variables.
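A Q-value sensitivity analysis of this kind can be sketched as a simple loop that recomputes the tuning-subset RMSE for each candidate Q and keeps the minimum. The sketch below is illustrative only: the predictor uses an assumed inverse-squared-error weighting of the top-Q matches, the data are hypothetical, the weights are held fixed rather than re-optimized for each Q, and the toy result is not expected to reproduce Table 2.

```python
import math


def predict(query, training, weights, q):
    """Weighted average of the q best-matching targets (ASSUMED inverse-error rule)."""
    scored = sorted(
        (sum(w * (a - b) ** 2 for w, a, b in zip(weights, query, x)), y)
        for x, y in training
    )[:q]
    inverse = [1.0 / (err + 1e-12) for err, _ in scored]
    return sum(i * y for i, (_, y) in zip(inverse, scored)) / sum(inverse)


def sweep_q(training, tuning, weights, q_values):
    """Return {q: RMSE over the tuning records} for each candidate q."""
    results = {}
    for q in q_values:
        squared = [(predict(x, training, weights, q) - y) ** 2 for x, y in tuning]
        results[q] = math.sqrt(sum(squared) / len(squared))
    return results


# Hypothetical one-variable data in which the target equals 100 * x.
training = [([i / 10], 10.0 * i) for i in range(11)]
tuning = [([0.23], 23.0), ([0.57], 57.0), ([0.81], 81.0)]

rmse_by_q = sweep_q(training, tuning, [1.0], range(2, 6))
best_q = min(rmse_by_q, key=rmse_by_q.get)  # Q giving the lowest tuning RMSE
print(rmse_by_q, best_q)
```

In the workflow described in the text, the variable weightings are also re-optimized for each Q, so a fuller version would wrap an optimizer call inside the loop.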
As shown in Table 2, the ratio of the weightings for T and P (w_{T}/w_{P}) varies for the optimum solutions associated with different Q values. For Q = 2 and for Q = 5 to 8, that weightings ratio is less than 1. For the other Q values up to 10, it is greater than 1. The optimum value of w_{T}/w_{P} is 1.13211 for Q = 3. This is very specific and useful information about how the TOB algorithm makes its optimum predictions for this data set. On the high-ranking matched records (the top 3 when Q = 3), it uses weights only for T and P, and it does so in a ratio of 1.13211 to achieve the optimum prediction accuracy. Such insight into the prediction calculation cannot be readily revealed by ANN, ANFIS, SVM and LSSVM algorithms. Future work is planned to compare the performance of TOB with these other machine-learning algorithms for the prediction of gas density and other complex oil and gas systems.
On the positive side, the TOB methodology provides transparency regarding the specific calculations involved in each prediction it makes, and it achieves credible levels of prediction accuracy. On the negative side, TOB cannot extrapolate its predictions beyond the minimum and maximum values of the dependent variable covered by data records in the training subset, which many other machine-learning algorithms can do. It also cannot achieve highly accurate predictions in sparsely populated regions of a training subset. This is not necessarily a bad limitation, as it inhibits the algorithm from overfitting sparse datasets, a criticism often leveled against other machine-learning algorithms. We believe that these attributes make the TOB algorithm a complementary addition to the suite of existing machine-learning algorithms and justify its use in conjunction with them to provide greater transparency to the prediction process.
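The extrapolation limit can be seen from a short derivation, under the assumption that each prediction is a weighted average of the densities of the top-Q matching training records with non-negative fractional contributions that sum to one. The symbols f_q (fractional contribution) and ρ_q (density of the q-th matching record) are notation introduced here for illustration.

```latex
\hat{\rho} = \sum_{q=1}^{Q} f_q \, \rho_q ,
\qquad f_q \ge 0 ,
\qquad \sum_{q=1}^{Q} f_q = 1
\quad \Longrightarrow \quad
\min_{q} \rho_q \;\le\; \hat{\rho} \;\le\; \max_{q} \rho_q
```

Under that assumption, every prediction lies between the smallest and largest densities among its matched records, and therefore within the range of densities present in the training subset.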
4. Conclusions
The Transparent Open-Box (TOB) learning network provides a valuable tool for evaluating and deriving predictions from complex, nonlinear natural-gas systems. It offers advantages over, and capabilities complementary to, the more-traditional machine-learning networks in that:
 · all its intermediate calculations and relationships are fully auditable and accessible;
 · relative weightings applied to variables in optimized solutions are clearly revealed;
 · it performs well with standard optimizers (e.g., Excel’s Solver options) and can also be linked easily to customized optimization algorithms;
 · varying its Q-factor values readily identifies underfitted versus overfitted solutions.
We recommend using the prediction accuracy achieved by the transparent open-box network algorithm as a performance benchmark when applying less-transparent machine-learning algorithms to specific datasets.
There are many natural gas datasets (e.g., PVT, drilling, well-log data, reservoir and source rock analysis) to which the TOB learning network could be readily applied to provide more enlightening analysis and predictions than the black boxes currently applied to them.