# Statistical concepts

An understanding of statistical concepts is important to many aspects of petroleum engineering, but especially reservoir modeling and simulation. The discussion below focuses on a range of statistical concepts that engineers may find valuable to understand. The focus here is classical statistics, but differences in the application for geostatistics are included.

## Measurement systems

A quantitative approach requires more than a headlong rush into the data, armed with a computer. Because conclusions from a quantitative study are based at least in part on inferences drawn from measurements, the geoscientist and reservoir engineer must be aware of the nature of the measurement systems with which the data are collected.

There are four measurement scales:

• Nominal
• Ordinal
• Interval
• Ratio

Each of these scales is more rigorously defined than the one before it. The nominal and ordinal scales classify observations into exclusive categories. The interval and ratio scales involve determinations of the magnitude of an observation and so are the ones we normally think of as “measurements.”[1] All four of these systems are used in reservoir descriptions.

### Nominal scale

The nominal scale classifies observations into mutually exclusive categories of equal rank, such as “red,” “green,” or “blue.” Symbols (e.g., A, B, C, or numbers) often are used, as well. In geostatistics, for example, when predicting lithofacies occurrence, we often code lithofacies as 1, 2, and 3 for sand, siltstone, and shale, respectively. Within this code, or scale, there is no connotation that 2 is “twice as much” as 1, or that 3 is “greater than” 2. Furthermore, a lithofacies value such as 2.5 has no meaning at all.

### Ordinal scale

In an ordinal scale, observations are ranked hierarchically. A classic example of an ordinal scale in the geosciences is the Mohs hardness scale. Although the ranking scale extends from one to ten, the step from 1 to 2 is not equal in magnitude to that from 9 to 10. Thus, the Mohs hardness scale is a nonlinear scale of mineral hardness. In the petroleum industry, too, kerogen types are based on an ordinal scale that reflects the stages of organic diagenesis.

### Interval scale

The interval scale is so named because the width of successive intervals remains constant. A common example of an interval scale is temperature. The increase in temperature between 10 and 20°C is the same as the increase between 110 and 120°C. An interval scale does not have a natural zero or a point where the magnitude is nonexistent, and so it is possible to have negative values; however, in the petroleum industry, some reservoir properties are based on an interval scale measured along continuums for which there are practical limits. It would be impossible, for example, to have negative porosity, permeability, or thickness, or porosity greater than 100%.

### Ratio scale

Ratios not only have equal increments between steps, but also have a zero point. Ratio scales are the highest form of measurement. All types of mathematical and statistical operations are performed with them. Many geologic measurements are done on a ratio scale because they have units of length, volume, mass, and so forth. A commonly used ratio in the petroleum industry is the net-to-gross ratio of pay and nonpay.

For most of the discussion on this page, the focus is on the analysis of interval and ratio data. Typically, no distinction is made between the two, and they may occur intermixed in the same problem. For example, a net-to-gross map shows ratio data, whereas porosity and permeability measurements are on an interval scale.

## Samples and sample populations

Statistical analysis is built around the concepts of “populations” and “samples” and implicitly assumes that the sampling is random and unbiased.

• A population is a well-defined set of elements (either finite or infinite), which commonly are measurements and observations made on items of a specific type (e.g., porosity or permeability).
• A sample is a subset of elements taken from the population.
• Furthermore, there are finite and infinite (or parent) populations. A finite population might consist of all the wells drilled in the Gulf of Mexico during the year 2001, for example, whereas the parent population would be all possible wells drilled in the Gulf of Mexico in the past, present, and future (albeit a practical impossibility).

Each reservoir is unique and completely deterministic. Thus, if all the information were available, there would be no uncertainty about the reservoir. Unfortunately, though, our sample data set offers us a sparse and incomplete picture of the reservoir. Furthermore, the sampling program (drilling wells) is highly biased, at least spatially, in that we generally do not drill wells randomly. Instead, we drill in locations that we believe are economically favorable, and thus anomalous. Because the entire reservoir will not be examined directly, we never will know the true population distribution functions of the reservoir properties.

Even when collected at the borehole, the data might not be representative because certain observations are purposely excluded, and this produces a statistical bias. Suppose, for example, we are interested in the pore volume (PV) of a particular reservoir unit for pay estimation. Typically, we use a threshold or a porosity cutoff when making the calculation, thus deliberately and optimistically biasing the true PV to a larger volume. If the lower-porosity rocks are not oil saturated, this might not be a bias, but without certainty of such, the PV estimate is considered biased. Both the statistical insufficiencies caused by sparse, irregular well spacing and the biases present in data acquisition reduce our ability to accurately and precisely define reservoir heterogeneity enough to ensure a surprise-free production history.
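As a rough illustration of how a cutoff shifts a summary statistic, the sketch below compares the mean porosity of all samples with the mean of only those that pass a cutoff. The porosity values and the 0.10 cutoff are hypothetical and are not taken from any data set discussed here.

```python
# Hypothetical core porosities (fraction) for one reservoir unit.
porosity = [0.03, 0.05, 0.08, 0.12, 0.15, 0.18, 0.21, 0.24]
cutoff = 0.10  # assumed porosity cutoff, for illustration only

# Samples that survive the cutoff.
retained = [p for p in porosity if p >= cutoff]

mean_all = sum(porosity) / len(porosity)
mean_retained = sum(retained) / len(retained)
excluded_fraction = 1 - len(retained) / len(porosity)

print(f"mean of all samples:      {mean_all:.4f}")
print(f"mean of retained samples: {mean_retained:.4f}")
print(f"fraction excluded:        {excluded_fraction:.3f}")
```

The retained-sample mean is higher than the all-sample mean, so applying it to the full unit volume would overstate PV unless the excluded rock truly contributes nothing.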

Sparse and biased sampling presents a major challenge. The biased sample population often is used as the conditioning data during construction of a geostatistical reservoir model. Thus, the assumptions made about the population distribution function influence the results; if the assumptions are incorrect, then the modeling results are highly suspect. With these limitations in mind, our task is to best estimate the reservoir properties while minimizing the effects of our uncertainty. To do this, we use a variety of statistical tools to understand and summarize the properties of the samples and make inferences about the entire reservoir.

## Exploratory data analysis (EDA)

Often, the goal of a project is to provide a general description and analysis of a data set, and this can be done using classic statistical tools in a process commonly known as exploratory data analysis (EDA). EDA is an important precursor to a geostatistical reservoir-characterization study, which may include interpolation or simulation and uncertainty assessment. Unfortunately, though, in many reservoir studies today (including routine mapping of attributes), EDA tends to be overlooked. It is absolutely necessary to understand the reservoir data fully, and doing so will be rewarded with much-improved results.

There is no single set of prescribed steps in EDA; one should follow one’s instincts in explaining the behavior of the data. By using various EDA tools, not only will you gain a clearer understanding of your data, but you also will discover possible sources of errors. Errors are easily overlooked, especially in large data sets and when computers are involved, because we tend to become detached from the data. A thorough EDA fosters an intimate knowledge of the data, so that suspicious results are more easily noticed.

A number of excellent textbooks offer a more thorough discussion of EDA;[1] [2] [3] [4] [5] only a brief review of the classic statistical methods it uses is given here. These methods generally fall under the following categories:

• Univariate data analysis
• Multivariate data analysis
• Normal-score transform

## Univariate data analysis

There are several ways to summarize a univariate (single-attribute) distribution. Often, simple descriptive statistics are computed, such as the sample mean and variance, and plotted on a corresponding histogram; however, such univariate statistics are very sensitive to extreme values (outliers) and, perhaps more importantly, do not provide any spatial information. Spatial information is the heart of a geostatistical study—a reef and a delta, for example, can have identical univariate statistical profiles, but the geographic distribution of their petrophysical properties will be completely different.

### Frequency tables and histograms

Histograms (frequency distributions) are graphic representations that are based on a frequency table that records how often data values fall within certain intervals or classes. It is common to use a constant class width for a histogram, so that the height of each bar is proportional to the number of values within that class. When data are ranked in ascending order, they can be represented as a cumulative frequency histogram, which shows the total number of values below certain cutoffs, rather than the total number of values in each class.
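A frequency table and its cumulative counterpart can be sketched in a few lines. The example below is a minimal illustration that assumes constant class widths; the porosity values, class limits, and function names are hypothetical.

```python
def frequency_table(values, lower, width, n_classes):
    """Count how many values fall in each constant-width class.

    Class k covers [lower + k*width, lower + (k+1)*width); the last
    class also includes its upper edge.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)
        if k == n_classes and v == lower + n_classes * width:
            k = n_classes - 1  # place the upper limit in the last class
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


def cumulative(counts):
    """Running total of the class counts (cumulative frequency)."""
    totals, running = [], 0
    for c in counts:
        running += c
        totals.append(running)
    return totals


# Hypothetical porosity values in percent (illustrative only).
porosity = [4.0, 7.5, 8.0, 11.0, 12.5, 13.0, 14.5, 17.0, 18.5, 20.0]
counts = frequency_table(porosity, lower=0.0, width=5.0, n_classes=4)
cum_counts = cumulative(counts)

print(counts)      # number of values in each 5%-wide class
print(cum_counts)  # number of values at or below each class upper limit
```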

### Summary statistics

The summary statistics are the univariate statistics as represented graphically by the histogram. They are grouped into four categories: measures of location, measures of spread, measures of shape, and the z-score statistic.

#### Measures of location

The measures of location provide information about where the various parts of the data distribution lie, and are represented by the following:

• Minimum—The smallest value.
• Maximum—The largest value.
• Mean—The arithmetic average of all data values. The mean is quite sensitive to outliers, and can be biased by a single, erratic value.
• Median—The midpoint of all observed data values, when arranged in ascending or descending order. Half the values are above the median, and half are below. This statistic represents the 50th percentile of the cumulative frequency histogram and generally is not affected by an occasional erratic data point.
• Mode—The most frequently occurring value in the data set. This value falls within the tallest bar on the histogram.
• Quartiles—Each of the quarter-points of all observed data values. Quartiles represent the 25th, 50th and 75th percentiles on the cumulative frequency histogram.
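The measures of location listed above can be computed with Python's standard `statistics` module. This is a minimal sketch with hypothetical permeability values; the `"inclusive"` quantile method is one of several conventions and is an assumption here.

```python
import statistics

# Hypothetical permeability values in md; the last value is an outlier.
perm = [12.0, 15.0, 15.0, 18.0, 22.0, 25.0, 30.0, 900.0]

minimum, maximum = min(perm), max(perm)
mean = statistics.mean(perm)
median = statistics.median(perm)
mode = statistics.mode(perm)

# Quartiles: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(perm, n=4, method="inclusive")

print(f"min={minimum}, max={maximum}")
print(f"mean={mean}, median={median}, mode={mode}")
print(f"quartiles={q1}, {q2}, {q3}")
```

The single erratic value pulls the mean far above the median, which is the sensitivity to outliers described in the list.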

#### Measures of spread

Measures of spread describe the variability of the data values, and are represented by variance, standard deviation, and interquartile range. Variance is the average squared difference of the observed values from the mean. Because variance involves squared differences, it is very sensitive to outliers. Standard deviation is the square root of the variance, and often is used instead of variance because its units are the same as those of the attribute being described. Interquartile range is the difference between the upper (75th percentile) and lower (25th percentile) quartiles. Because this measure does not use the mean as the center of distribution, it is less sensitive to outliers.
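The different sensitivity of these measures to outliers can be seen in a small sketch. The data are hypothetical, the variance uses the "average squared difference" form (division by n) described above, and the `"inclusive"` quartile convention is an assumption.

```python
import math
import statistics


def spread(values):
    """Return (variance, standard deviation, interquartile range)."""
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / n  # average squared difference
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return variance, math.sqrt(variance), q3 - q1


# Hypothetical permeability values in md.
clean = [12.0, 15.0, 15.0, 18.0, 22.0, 25.0, 30.0, 33.0]
with_outlier = clean[:-1] + [900.0]  # replace the largest value with an outlier

var_clean, std_clean, iqr_clean = spread(clean)
var_out, std_out, iqr_out = spread(with_outlier)

print(f"clean:        variance={var_clean:.2f}, std={std_clean:.2f}, IQR={iqr_clean:.2f}")
print(f"with outlier: variance={var_out:.2f}, std={std_out:.2f}, IQR={iqr_out:.2f}")
```

In this example the variance grows by several orders of magnitude when the outlier is introduced, while the interquartile range does not change.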

#### Measures of shape

All statistical data analyses require an assumption about the nature of the population probability distribution; however, to assume this blindly can be dangerous. To check whether the data are consistent with the assumed distribution, one of several numerical indicators can be used. One approach is to use the method of moments. Moment measures are defined in the same way as moments in physics: the mean can be defined as the first moment about the origin, the variance as the second moment about the mean, and so forth.

The Gaussian distribution is fully defined by its first two moments. All higher standardized moments are either zero or constant for every member of the Gaussian family. The third standardized moment is called “skewness” and the fourth “kurtosis.” The skewness, for example, is equal to the average cubed difference between the data values and the mean, divided by the cube of the standard deviation. In all Gaussian distributions, the skewness is zero and the kurtosis is 3.0.

Because skewness is a cubic function, the computed values may be negative or positive, and the sign indicates how the sample values depart from a Gaussian assumption. A positive value indicates that the distribution departs by being asymmetric about the mean and that it contains too many large values. The resultant histogram is asymmetric, with an elongated tail for the higher values. Positive skewness denotes that the center of gravity of the distribution (the mean) is above the 50th percentile (the median). Negative skewness indicates the reverse (i.e., that the center of gravity is below the 50th percentile).
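The moment-based definitions can be sketched as follows. This minimal example uses the division-by-n form for every moment, which is an assumption (some software applies small-sample corrections), and the two data sets are hypothetical.

```python
import math
import statistics


def skewness_kurtosis(values):
    """Moment-based skewness and kurtosis (no small-sample correction)."""
    n = len(values)
    m = sum(values) / n
    sigma = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    skew = sum((v - m) ** 3 for v in values) / n / sigma ** 3
    kurt = sum((v - m) ** 4 for v in values) / n / sigma ** 4
    return skew, kurt


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 2.0, 3.0, 4.0, 50.0]  # elongated tail of high values

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_right, kurt_right = skewness_kurtosis(right_tailed)

print(f"symmetric:    skewness={skew_sym:.3f}, kurtosis={kurt_sym:.3f}")
print(f"right-tailed: skewness={skew_right:.3f}, kurtosis={kurt_right:.3f}")
print(f"right-tailed mean={statistics.mean(right_tailed)}, "
      f"median={statistics.median(right_tailed)}")
```

For the right-tailed set the skewness is positive and the mean lies above the median, consistent with the interpretation given above.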

Often, a few large outliers will distort a statistical analysis, and frequently these represent errors in data entry; however, when there are a few such outliers among thousands of correct values, their existence cannot be surmised from the skewness value. Historically, data analysts have used the coefficient of variation (the ratio of σ to m) as a further check for the existence of outliers. A rule of thumb is that when the coefficient of variation exceeds the value of unity, the data should be checked for outliers and corrected, if necessary.
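The coefficient-of-variation check can be written as a short sketch. The data are hypothetical, and the standard deviation uses division by n, consistent with the variance definition given earlier.

```python
import math


def coefficient_of_variation(values):
    """Ratio of the standard deviation to the mean."""
    n = len(values)
    m = sum(values) / n
    sigma = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sigma / m


# Hypothetical permeability values in md.
clean = [12.0, 15.0, 15.0, 18.0, 22.0, 25.0, 30.0, 33.0]
suspect = clean[:-1] + [900.0]  # one possibly erroneous entry

cv_clean = coefficient_of_variation(clean)
cv_suspect = coefficient_of_variation(suspect)
needs_check = cv_suspect > 1.0  # rule of thumb from the text

print(f"CV clean={cv_clean:.3f}, CV suspect={cv_suspect:.3f}, check data: {needs_check}")
```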

#### Z-score statistic

The z-score statistic transforms the data values into units of standard deviation and rescales the histogram to a mean of zero and a variance of 1.0. The z-score is used to screen for outliers or spurious data values. Values whose absolute z-score exceeds a specified cutoff (e.g., 2.0 to 2.5 standard deviations) lie unusually far from the mean. Statistically, such data are outliers and should be investigated carefully to determine whether they are erroneous data values or whether they represent a local anomaly in the reservoir property.

The z-score rescaling does not transform the shape of the original data histogram. If the histogram is skewed before the rescaling, it retains that shape after rescaling. The x-axis of the rescaled data is in terms of ± standard deviation units about the mean of zero.
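A z-score screen can be sketched as below. The permeability values and the 2.5 cutoff are hypothetical choices for illustration, and the standard deviation uses division by n.

```python
import math


def z_scores(values):
    """Rescale values to zero mean and unit variance."""
    n = len(values)
    m = sum(values) / n
    sigma = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sigma for v in values]


# Hypothetical permeability values in md.
perm = [12.0, 15.0, 15.0, 18.0, 22.0, 25.0, 30.0, 900.0]
z = z_scores(perm)

cutoff = 2.5  # assumed screening threshold
flagged = [v for v, s in zip(perm, z) if abs(s) > cutoff]

print([round(s, 2) for s in z])
print("flagged for review:", flagged)
```

Note that in a sample this small, a single extreme value also inflates the standard deviation, which limits how large its own z-score can become; the cutoff should be chosen with the sample size in mind.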

## Multivariate data analysis

Univariate statistics deals with only one variable at a time, but frequently we measure many variables on each sample. Working with multiple variables requires the use of multivariate statistics. If we want to express the relationship between two variables (e.g., porosity and permeability), we do so through regression and correlation analysis. In a regression study, we estimate the relationship between two variables by expressing one as a linear (or nonlinear) function of the other. In correlation analysis, we estimate how strongly two variables vary together. It is not always obvious which one (regression or correlation) should be used in a given problem. Indeed, practitioners often confuse these methods and their application, as does much of the available statistics literature, so it is important to clearly distinguish these two methods from one another.[3] Looking at the purpose behind each will help make the distinction clear.

### Regression study

In regression analysis, the purpose is to describe the degree of dependency between two variables, X and Y, to predict Y (the dependent variable) on the basis of X (the independent variable). The general form of the equation is

$$Y = a + bX \qquad (1)$$

where a = the Y-intercept; b = the slope of the function; X = the independent variable whose units are those of the X variable; and Y = the dependent variable whose units are those of the Y variable. Generally, b is known as the regression coefficient, and the function is called the regression equation.[3]
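The text does not say how a and b are estimated; a common choice is ordinary least squares, which is assumed in the sketch below. The porosity and log-permeability pairs are hypothetical, and `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates of a and b in Y = a + bX."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx     # slope (regression coefficient)
    a = my - b * mx   # Y-intercept
    return a, b


# Hypothetical core data: porosity (percent) and log10 of permeability (md).
porosity = [5.0, 10.0, 15.0, 20.0, 25.0]
log_perm = [0.2, 0.7, 1.1, 1.6, 2.0]

a, b = fit_line(porosity, log_perm)
estimate_at_18 = a + b * 18.0  # a single-point estimate of Y at X = 18

print(f"a={a:.3f}, b={b:.3f}, estimate at 18% porosity={estimate_at_18:.3f}")
```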

### Correlation analysis

Correlation analysis is similar to regression, but is less rigorous and is used to determine generally whether variables are interdependent. Correlation analysis makes no distinction between dependent and independent variables, and one is not expressed as a function of the other. The correlation coefficient r is a statistic measuring the strength of the linear relation between the paired values of two variables. Its value lies between +1 (perfect, positive correlation) and –1 (perfect, inverse correlation). A value of zero indicates no linear correlation. The square of the correlation coefficient, r², known as R-squared (coefficient of determination), is a measure of the proportion of the variation of one variable that can be explained by the other.[3]
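A minimal sketch of the correlation coefficient and its square follows. It uses the standard product-moment (Pearson) form, and the porosity and acoustic-impedance values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation coefficient between paired values."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical well porosity (percent) and acoustic impedance (arbitrary units).
porosity = [1.0, 2.0, 3.0, 4.0, 5.0]
impedance = [10.0, 8.0, 6.5, 4.0, 2.0]

r = pearson_r(porosity, impedance)
r_squared = r ** 2

print(f"r={r:.4f}, r-squared={r_squared:.4f}")
```

Swapping the two arguments gives the same value, which reflects the statement that correlation makes no distinction between dependent and independent variables.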

In its classical form, regression analysis is strictly used to estimate the value at a single point. In this respect, it often is used incorrectly in the petroleum industry. The misuse arises from failing to recognize that sample independence is a prerequisite for regression analysis.[3] It would be inappropriate to apply the regression equation spatially, when the data by their very nature are dependent. For example, the value at a given well can be highly correlated to a value in a nearby well. Indeed, implementing regression analysis spatially can lead to highly erroneous results.

For example, seismic attributes often are used to estimate reservoir properties in the interwell region on the basis of a correlation between a property measured at the well (e.g., porosity) and a seismic attribute (e.g., acoustic impedance). Let us say that during regression and correlation analyses for these properties, we find that there is a –0.83 correlation between well-derived porosity and seismic acoustic impedance. Because of this strong correlation, we proceed with deriving the regression equation—well porosity = a – b (seismic acoustic impedance)—to transform and then map our 3D seismic acoustic-impedance data into porosity, not recognizing that we have applied a point estimation method as a spatial estimator. Although the results may appear fine, b imparts a spatial linear bias (trend) in the estimates during the mapping process. This bias becomes apparent in an analysis of the residuals. Particularly unsettling is the misapplication of regression analysis to the mapping of permeability from porosity, a common practice. The topic of biasing is revisited in the Kriging Estimator page, where we find kriging to be the spatial extension of regression analysis.

### Covariance

The correlation coefficient represents a normalized covariance. Subtracting the mean from each measurement and dividing the result by the standard deviation normalizes the data. The transformed data then have a mean of zero and a standard deviation of unity. This data transformation restricts the range of the correlation coefficient to between –1 and +1. Sometimes it is more useful to use the raw (untransformed) data to calculate the relationship between two variables, in which case the range of the resulting statistic, the covariance, is unrestricted. The (untransformed) covariance formula is

$$\mathrm{Cov}_{x,y} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - m_x\right)\left(Y_i - m_y\right) \qquad (2)$$

where Xi and Yi = the measured values of variables X and Y, respectively, whose units are those of X and Y and whose i varies between the first and last measurements; mx and my = the sample means of X and Y, respectively; n = the number of X and Y data pairs; and Covx,y = the covariance of the variables X and Y.
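The sketch below computes the covariance from the quantities just defined and shows how it relates to the correlation coefficient. Division by n is assumed (some references divide by n – 1), and the data are hypothetical.

```python
import math


def covariance(x, y):
    """Average product of deviations from the sample means (division by n)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Covariance normalized by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical porosity (fraction) and log10 permeability (md) pairs.
porosity_frac = [0.08, 0.12, 0.15, 0.19, 0.23]
log_perm = [0.5, 1.1, 1.4, 2.0, 2.6]
porosity_pct = [100.0 * p for p in porosity_frac]  # same data, different units

cov_frac = covariance(porosity_frac, log_perm)
cov_pct = covariance(porosity_pct, log_perm)
r_frac = correlation(porosity_frac, log_perm)
r_pct = correlation(porosity_pct, log_perm)

print(f"covariance: {cov_frac:.5f} (fraction) vs {cov_pct:.3f} (percent)")
print(f"correlation: {r_frac:.4f} (fraction) vs {r_pct:.4f} (percent)")
```

Changing the units of porosity rescales the covariance by the same factor but leaves the correlation coefficient unchanged, which is what normalization means in practice.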

The covariance is greatly affected by extreme pairs (outliers). This statistic also forms the foundation for the spatial covariance, which measures spatial correlation, and for its alternative construct, the variogram, which measures spatial dissimilarity. Rather than computing the covariance between two properties, we compute a statistic on one property measured at different locations.

## Normal score transform

Many statistical techniques assume that the data have an underlying Gaussian (normal) distribution. Geologic data usually do not, though, and typically require a numerical transformation to achieve one. The transformed data are used for some geostatistical analyses and can be reconfigured to their original state in a back transform, if done correctly. Thus, it is a temporary state and is used for the convenience of satisfying the Gaussian assumption, when necessary. We can define zi as any raw data value (having any units or dimensions) at any location. If zi is transformed such that its distribution has a standard normal histogram of zero mean and unity variance, the transformed value is designated as yi. Such a transform is referred to as a normal-score transform, and the yi-values are called normal scores. The transform process is described in detail throughout the literature.[6] [7] [8]
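One common way to build a normal-score transform is by ranking the data and assigning each rank the matching quantile of the standard normal distribution. The sketch below follows that idea under stated assumptions: no tied values, a (rank + 0.5)/n cumulative probability, and a back transform by linear interpolation in the lookup table. It is not the implementation of any particular software package, and the data and function names are hypothetical.

```python
import bisect
from statistics import NormalDist


def normal_scores(z):
    """Rank-based normal-score transform (assumes no tied values)."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    y = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n          # assumed cumulative probability
        y[i] = std_normal.inv_cdf(p)  # Gaussian quantile for that probability
    return y


def back_transform(y_new, z, y):
    """Map normal scores back to data units by interpolating the (y, z) table."""
    table = sorted(zip(y, z))
    ys = [pair[0] for pair in table]
    zs = [pair[1] for pair in table]
    result = []
    for v in y_new:
        if v <= ys[0]:
            result.append(zs[0])   # clamp below the smallest score
        elif v >= ys[-1]:
            result.append(zs[-1])  # clamp above the largest score
        else:
            k = bisect.bisect_right(ys, v)
            t = (v - ys[k - 1]) / (ys[k] - ys[k - 1])
            result.append(zs[k - 1] + t * (zs[k] - zs[k - 1]))
    return result


# Hypothetical, strongly skewed permeability values in md.
perm = [3.0, 250.0, 12.0, 40.0, 7.0, 95.0]
scores = normal_scores(perm)
recovered = back_transform(scores, perm, scores)

print([round(s, 3) for s in scores])
print(recovered)
```

Back-transforming the scores of the original data returns the original values, and the transform preserves the rank order of the data while discarding the shape of the original histogram.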

The normal-score transform can transform any data-distribution shape into the Gaussian form. Once the data are transformed, the following are performed in the transformed space:

• Subsequent data analysis
• Modeling
• Interpolation
• Geostatistical simulation

As previously mentioned, the final step requires a back-transformation into the original data space; however, the normal-score transform is an ad hoc procedure and is not underlain by a full panoply of proofs and theorems. Using this transform is justified only insofar as the back-transformation will recover the original data. The transform process becomes more accurate as the original data-distribution approaches the Gaussian. Sensitivity analysis has shown that the transform is robust for a variety of unimodal distributions, even those that are very different from the Gaussian. Pathological distributions include those that may be polymodal, with null frequencies common in the intermodal values. Such distributions actually are mixtures of unimodal distributions, and each mode should be transformed independently.

Some practitioners are uncomfortable with the amount of data manipulation involved in normal-score transformations. Yet, most of us would not hesitate to perform a logarithm transform of permeability data, for example, because doing so makes it easier to investigate the relationship between permeability and porosity—its justification is simply its mathematical convenience. The same is true of a normal-score transformation. First, parametric geostatistical simulation assumes that the data have a Gaussian distribution because of the nature of the algorithms. Also, as the earlier discussion on the normal probability distribution pointed out, if we know the mean and the variance, we have a perfectly predictable model (the data histogram), which makes interpolation and simulation easier. As long as no significant data are lost in the back-transformation, the process is benign.

## Pros and cons of classical statistical measures

Table 1 provides information about the benefits and limitations of classical statistical measures.

## Nomenclature

• a = the Y-intercept
• b = the slope of the function
• Covx,y = covariance (untransformed) of variables X and Y
• mx = sample mean of X, units are those of the X variable
• my = sample mean of Y, units are those of the Y variable
• X = the independent variable whose units are those of the X variable
• Xi = the measured value of variable X, with i varying between the first and last measurements; units are those of the X variable
• yi = data value in transformed space at a specific location
• Y = the dependent variable whose units are those of the Y variable
• Yi = the measured value of variable Y, with i varying between the first and last measurements; units are those of the Y variable
• z = the regionalized variable (primary attribute)

## References

1. Davis, J.C. 1986. Statistics and Data Analysis in Geology, second edition. New York City: John Wiley & Sons.
2. Mendenhall, W. 1971. Introduction to Probability and Statistics. Belmont, California: Wadsworth Publishing Co.
3. Sokal, R.R. and Rohlf, F.J. 1969. Biometry. San Francisco, California: W.H. Freeman and Co.
4. Isaaks, E.H. and Srivastava, R.M. 1989. An Introduction to Applied Geostatistics. Oxford, UK: Oxford University Press.
5. Koch, G.S. Jr. and Link, R.F. 1981. Statistical Analysis of Geological Data, 850. New York City: Dover Publications, Inc.
6. Deutsch, C.V. and Journel, A.G. 1998. GSLIB: Geostatistical Software Library and User’s Guide, second edition. Oxford, UK: Oxford University Press.
7. Olea, R.A. 1991. Geostatistical Glossary and Multilingual Dictionary. Oxford, UK: Oxford University Press.
8. Deutsch, C.V. 2002. Geostatistical Reservoir Modeling. Oxford, UK: Oxford University Press.