
# Statistical concepts in risk analysis

Descriptive statistics should aid communication. As the name suggests, it is intended to develop and explain features of both data and probability distributions. We begin the discussion with data, perhaps visualized as collections of numbers expressing possible values of some set of variables; but it is common practice to extend the language to three common types of graphs used to relate variables to probability:

• Histograms
• Probability density functions
• Cumulative distributions

We habitually use the same words (mean, median, standard deviation, and so on) in the context of data as well as for these graphs. In so doing, we create an additional opportunity for miscommunication. Although we require only a few words and phrases from a lexicon of statistics and probability, it is essential that we use them carefully.

## Example data

There is an unspoken objective when we start with data: we imagine the data as a sample from some abstract population, and we wish to describe the population. We use simple algebraic formulas to obtain various statistics or descriptors of the data, in hopes of inferring what the underlying population (i.e., reality, nature, the truth) might look like. Consider, for example, the database in Table 1 for 26 shallow gas wells in a given field:

• Pay thickness
• Porosity
• Reservoir temperature
• Initial pressure
• Water saturation
• Estimated ultimate recovery (EUR)

We can use various functions in Excel to describe this set of data. The “underlying populations” in this case would refer to the corresponding data for all the wells we could drill in the field. What follows concentrates on the porosities as an example, but one may substitute any of the other five parameters.

## Measures of central tendency

Our first group of statistics helps us find “typical” values of the data, called measures of central tendency. Let us calculate the three common ones.

Mean

Sum the 26 values and divide by 26 (nicknames: arithmetic mean, expected value, average, arithmetic average). The Excel name is “AVERAGE.” The mean porosity is 0.127.

Median

Sort the 26 values in ascending order and take the average of the two middle values (the 13th and 14th numbers in the sorted list). For an odd number of data, the median is the middle value once sorted. The rule works regardless of repeated data values. The Excel function is “MEDIAN.” The median porosity is 0.120. The median is nicknamed P50. Note that P50 is not a probability: it is the value on the variable axis that corresponds to a cumulative probability of 50%.

Mode

Find the number that repeats most often. In case of a tie, report all tied values. The Excel function is “MODE.” Because Excel reports only one number in case of a tie, namely the one that appears first in the list as entered in a column or row, note that it reports the mode of the porosity data as 0.100 because that value appeared five times, even though the value 0.120 also appeared five times.

We therefore see that the mode is ambiguous. Rather than one number, there may be several. In Excel we get a different value by simply recording the data in a different order. This situation can be confusing. We seldom use the mode of data in a serious way because what we care about is the underlying population’s mode, and the mode of the sample data available is generally not a good estimator. Fitting a theoretical curve to data and then finding the (unique) mode of the fitted curve is a relatively unambiguous process, except for the possibility of competing curves with slightly different modes.

These three values—mean, median, and mode—are referred to as measures of central tendency. Each one’s reaction to changes in the data set determines when it is used. The mean is influenced by the extreme values, whereas the median and mode are not. Thus, one or more very large values cause the mean to drift toward those values. Changing the largest or smallest values does not affect the median (and seldom the mode) but would alter the mean. The mode and median are insensitive to data perturbations because they are based more on the rank, or order, of the numbers rather than the values themselves.
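As a minimal sketch of these three measures and their sensitivity to extreme values, the following Python snippet uses the standard `statistics` module. The porosity values are hypothetical stand-ins, not the Table 1 data.

```python
import statistics

# Hypothetical porosity values (fractions); NOT the Table 1 data.
porosity = [0.08, 0.10, 0.10, 0.12, 0.12, 0.13, 0.15, 0.19]

mean_value = statistics.mean(porosity)
median_value = statistics.median(porosity)   # average of the two middle values
modes = statistics.multimode(porosity)       # every tied mode, not just the first

# Replace the largest value with an extreme one and recompute.
perturbed = porosity[:-1] + [0.35]
mean_shift = statistics.mean(perturbed) - mean_value
median_shift = statistics.median(perturbed) - median_value

print(f"mean={mean_value:.4f} median={median_value:.3f} modes={modes}")
print(f"mean shift={mean_shift:.4f} median shift={median_shift:.4f}")
```

Unlike Excel's MODE, `multimode` reports all tied values. In this sketch only the mean moves when the largest value is changed.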

The median is often used to report salaries and house prices, allowing people to see where they fit relative to the “middle.” Newspapers report housing prices in major cities periodically. For instance, Table 2 appeared in the Houston Chronicle to compare prices in five major cities in Texas. The mean values are roughly 20% larger than the median, reflecting the influence of a relatively small number of very expensive houses.

## Measures of dispersion and symmetry

The next group of statistics describes how the data are dispersed or spread out from the “center.” That is, the degree of data dispersion advises us how well our chosen measure of central tendency does indeed represent the data and, by extension, how much we can trust it to describe the underlying population.

Population variance

The average of the squared deviations from the mean (in Excel, this is VARP):

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

Population standard deviation

Population standard deviation is the square root of population variance (in Excel, this is STDEVP).

Sample standard deviation

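Sample standard deviation is the square root of sample variance, which is computed like the population variance but with N – 1 in place of N in the denominator (in Excel, these are STDEV and VAR, respectively).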

All this variety is necessary because of the implicit objective of trying to describe the underlying population, not just the sample data. It can be shown that VAR and STDEV are better estimators of the actual population’s values of these statistics.

Of all these measures, STDEV is the most commonly used. What does it signify? The answer depends to some degree on the situation but, in general, the larger the standard deviation, the more the data are spread out. Consider the 26 wells in Table 1. Their descriptive statistics appear in Table 3. In particular, the STDEVs of porosity and EUR are respectively 0.033 and 494. But we never simply look at STDEV without referencing the mean of the same data. A better method of comparing dispersion is to look at the “coefficient of variation,” which, unlike the other measures, is dimensionless.

Coefficient of variation

The coefficient of variation = STDEV/mean. Thus, porosity and EUR are quite different in this regard; their respective coefficients of variation are 0.26 and 0.78. Temperature has an even smaller dispersion, with a coefficient of variation of 0.07.
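The dispersion measures above can be sketched with Python's standard `statistics` module. The data are hypothetical, and the Excel names in the comments are given only for orientation.

```python
import statistics

# Hypothetical porosity values (fractions); NOT the Table 1 data.
values = [0.08, 0.10, 0.10, 0.12, 0.12, 0.13, 0.15, 0.19]
n = len(values)

pop_var = statistics.pvariance(values)    # divides by N     (Excel VARP)
samp_var = statistics.variance(values)    # divides by N - 1 (Excel VAR)
pop_sd = statistics.pstdev(values)        # Excel STDEVP
samp_sd = statistics.stdev(values)        # Excel STDEV

# Coefficient of variation: dimensionless ratio of STDEV to mean.
cv = samp_sd / statistics.mean(values)

print(f"VARP={pop_var:.6f} VAR={samp_var:.6f}")
print(f"STDEVP={pop_sd:.4f} STDEV={samp_sd:.4f} CV={cv:.2f}")
```

Because the coefficient of variation is dimensionless, it can be compared across variables with different units, such as porosity and EUR.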

Skewness

Skewness is the next level of description for data. It measures the lack of symmetry in the data. While there are many formulas in the literature, the formula in Excel (the SKEW function) is

$$\mathrm{SKEW} = \frac{N}{(N-1)(N-2)}\sum_{i=1}^{N}\left(\frac{x_i - m}{s}\right)^3$$

where m is the mean and s is the sample standard deviation (STDEV) of the N data values. A symmetric data set would have a mean, m, and for each point x smaller than m, there would be one and only one point x′ larger than m with the property that x′ – m = m – x. Such a set would have SKEW = 0. Otherwise, a data set is skewed right or left depending on whether it includes some points much larger than the mean (positive, skewed right), or much smaller (negative, skewed left). To help us understand skewness, we must introduce some graphs.
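A short sketch of a skewness calculation follows. It assumes the formula documented for Excel's SKEW function (a sum of cubed standardized deviations scaled by N/((N – 1)(N – 2)), with the sample standard deviation). The function name and data sets are hypothetical.

```python
import statistics

def excel_style_skew(data):
    """Sample skewness using the formula documented for Excel's SKEW."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three data points are required")
    m = statistics.mean(data)
    s = statistics.stdev(data)            # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]        # one value far above the mean
left_skewed = [-x for x in right_skewed]  # mirror image

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    print(name, round(excel_style_skew(data), 3))
```

Mirroring a data set about zero flips the sign of its skewness, which matches the right/left convention described above.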

## Histograms and probability distributions

A histogram is formed by splitting the data into classes (also called bins or groups) of equal width, counting the number of data that fall into each class (the class frequency, which of course becomes a probability when divided by the total number of data), and building a column chart in which the classes determine the column widths and the frequency determines their heights. The porosity data in Table 1 yield the histogram in Fig. 1.
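The construction of a histogram can be sketched in a few lines of Python. This is an illustrative helper (the name `histogram` and the data are hypothetical), assuming equal-width classes that span the data from minimum to maximum.

```python
def histogram(data, n_classes):
    """Return class edges, class frequencies, and relative frequencies."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for x in data:
        # The maximum value is assigned to the last class.
        k = min(int((x - lo) / width), n_classes - 1)
        counts[k] += 1
    edges = [lo + i * width for i in range(n_classes + 1)]
    relative = [c / len(data) for c in counts]
    return edges, counts, relative

# Hypothetical porosity values (fractions); NOT the Table 1 data.
porosity = [0.08, 0.10, 0.10, 0.12, 0.12, 0.13, 0.15, 0.19]
edges, counts, relative = histogram(porosity, n_classes=4)
print(counts, [round(r, 3) for r in relative])
```

Dividing each class frequency by the number of data gives relative frequencies that sum to 1, which is what allows a histogram to be read as a probability graph.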

Three more histograms, generated with Monte Carlo simulation software, show the three cases of skewness:

• Symmetrical
• Right skewed
• Left skewed

See Figs. 2 through 4.

Whereas histograms arise from data, probability density functions are graphs of variables expressed as theoretical or idealized curves based on formulas. Four common density functions are the:

• Normal
• Log-normal
• Triangular
• Beta distributions

Figs. 5 through 8 show several of these curves.

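The formulas behind these curves often involve the exponential function. For example, the formula for a normal distribution with mean, μ, and standard deviation, σ, is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (10.1)$$

And the log-normal curve with mean, μ, and standard deviation, σ, has the formula

$$f(x) = \frac{1}{x\,\sigma_1\sqrt{2\pi}}\exp\left[-\frac{(\ln x-\mu_1)^2}{2\sigma_1^2}\right] \qquad (10.2)$$

where

$$\mu_1 = \ln\left(\frac{\mu^2}{\sqrt{\sigma^2+\mu^2}}\right) \qquad (10.3)$$

and

$$\sigma_1 = \sqrt{\ln\left(\frac{\sigma^2+\mu^2}{\mu^2}\right)} \qquad (10.4)$$

Here μ1 and σ1 are the mean and standard deviation of ln(x).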
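As a hedged illustration of the log-normal parameters, the sketch below converts a log-normal variable's own mean and standard deviation into the mean and standard deviation of its logarithm, then converts back as a consistency check. It relies on the standard log-normal moment relations. The function names and the example values (mean 100, standard deviation 40) are chosen only for illustration.

```python
import math

def lognormal_log_params(mean, sd):
    """Mean and standard deviation of ln(X) for a log-normal X."""
    mu1 = math.log(mean ** 2 / math.sqrt(sd ** 2 + mean ** 2))
    sigma1 = math.sqrt(math.log((sd ** 2 + mean ** 2) / mean ** 2))
    return mu1, sigma1

def lognormal_pdf(x, mean, sd):
    """Log-normal density written in terms of the mean and sd of X itself."""
    mu1, sigma1 = lognormal_log_params(mean, sd)
    z = (math.log(x) - mu1) / sigma1
    return math.exp(-0.5 * z * z) / (x * sigma1 * math.sqrt(2 * math.pi))

mu1, sigma1 = lognormal_log_params(100.0, 40.0)

# Round trip using the standard log-normal moment formulas.
recovered_mean = math.exp(mu1 + sigma1 ** 2 / 2)
recovered_sd = recovered_mean * math.sqrt(math.exp(sigma1 ** 2) - 1)
print(round(mu1, 4), round(sigma1, 4), recovered_mean, recovered_sd)
```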

The single rule for a probability density function (which is never negative) is that the area under the curve equals exactly 1.

To each density function y = f(x), there corresponds a cumulative distribution function y = F(x), obtained by integrating f from the lower end of its range up to x. Because f is never negative and the area under f is 1, the cumulative function increases monotonically from 0 to 1.
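The relationship between a density and its cumulative function can be checked numerically. The sketch below integrates a normal density with the trapezoidal rule. The parameter values are hypothetical, and the integration range of eight standard deviations on each side is an assumption that makes the neglected tails negligible.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=10_000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

mu, sigma = 0.127, 0.033                  # hypothetical porosity distribution
f = lambda x: normal_pdf(x, mu, sigma)
lower = mu - 8 * sigma                    # effectively "minus infinity"

total_area = integrate(f, lower, mu + 8 * sigma)   # should be close to 1
F_at_mean = integrate(f, lower, mu)                # F(mu), close to 0.5
print(round(total_area, 6), round(F_at_mean, 6))
```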

Figs. 5 through 8 also show the cumulative functions corresponding to the density functions. The variable, X, on the horizontal axis of a density function (or the associated cumulative graph) is called a random variable.

In practice, when we attempt to estimate a variable by assigning a range of possible values, we are in effect defining a random variable. Properly speaking, we have been discussing one of the two major classes of random variables—continuous ones. Shortly, we introduce the notion of a discrete random variable, for which we have:

• Histograms
• Density functions
• Cumulative curves
• The interpretations of the statistics we have defined

The reason to have both density functions and cumulative functions is that density functions help us identify the mode, the range, and the symmetry (or asymmetry) of a random variable. Cumulative functions help us determine the chance that the variable will or will not exceed some value or fall between two values, namely P(X < c), P(X > c), and P(c < X < d), where c and d are real numbers in the range of X. In practice, cumulative functions answer questions like: “What is the chance that the discovered reserves will exceed 100 million bbl?” “What is the chance of losing money on this investment [i.e., what is P(NPV < 0)]?” “How likely is it that the well will be drilled in less than 65 days?”
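These three kinds of probability can be read directly from a cumulative function. The sketch below uses `statistics.NormalDist` from the Python standard library. The NPV distribution (normal, mean 20, standard deviation 25, in $ million) is purely hypothetical.

```python
from statistics import NormalDist

# Hypothetical NPV distribution in $ million (an assumption for illustration).
npv = NormalDist(mu=20.0, sigma=25.0)

c, d = 0.0, 50.0
p_loss = npv.cdf(c)                    # P(X < c): chance of losing money
p_exceed = 1.0 - npv.cdf(d)            # P(X > d)
p_between = npv.cdf(d) - npv.cdf(c)    # P(c < X < d)

print(f"P(NPV < 0)      = {p_loss:.3f}")
print(f"P(NPV > 50)     = {p_exceed:.3f}")
print(f"P(0 < NPV < 50) = {p_between:.3f}")
```

The three probabilities sum to 1 because the intervals partition the range of X.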

## Curve fitting: the relationship between histograms and density functions

As mentioned earlier, the implied objective of data analysis is often to confirm and characterize an underlying distribution from which the given data could reasonably have been drawn. Today, we enjoy a choice of software that, when supplied a given histogram, fits various probability distributions (normal, log-normal, beta, triangular) to it. A common metric to judge the “goodness of fit” of these distributions to the histogram is the “chi-square” value, which is obtained by summing the normalized squared errors:

$$\chi^2 = \sum_i \frac{(h_i - y_i)^2}{y_i} \qquad (10.5)$$

where $h_i$ is the height of the histogram and $y_i$ is the height (y-value) of the fitted curve for the ith class. The curve that yields the minimum chi-square value is considered the best fit. Thus, we begin with data, construct a histogram, then find the best-fitting curve, and assume that this curve represents the population from which the data came.
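A minimal sketch of the chi-square comparison follows. It assumes that each squared error is normalized by the fitted-curve height, which is the usual chi-square convention. The histogram heights and the two candidate fits are hypothetical numbers, not output from any fitting package.

```python
def chi_square(hist_heights, curve_heights):
    """Sum of squared errors, each normalized by the fitted-curve height."""
    if len(hist_heights) != len(curve_heights):
        raise ValueError("both sequences must have one entry per class")
    return sum((h - y) ** 2 / y for h, y in zip(hist_heights, curve_heights))

# Hypothetical relative frequencies for four classes.
hist = [0.10, 0.30, 0.40, 0.20]

# Hypothetical fitted-curve heights at the class midpoints.
candidates = {
    "fit_a": [0.12, 0.28, 0.38, 0.22],
    "fit_b": [0.25, 0.25, 0.25, 0.25],
}

scores = {name: chi_square(hist, y) for name, y in candidates.items()}
best = min(scores, key=scores.get)     # smallest chi-square wins
print(scores, "best:", best)
```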

We do this because, when we build a Monte Carlo simulation model, we want to sample hundreds or thousands of values from this imputed population and then, in accordance with our model (i.e., a formula), combine them with samples of other variables from their parent populations. For instance, one way to estimate oil in place for an undrilled prospect is to:

1. Use analogous data for net volume, porosity, oil saturation, and formation volume factor
2. Fit a curve to each data set
3. Sample a single value for each of these four variables and combine them (the product of net volume, porosity, and oil saturation, divided by the formation volume factor).

This gives one possible value for the oil in place. We then repeat this process a thousand times and generate a histogram of our results to represent the possible oil in place. Now that we have these graphical interpretations, we should extend our definitions of mean, mode, median, and standard deviation to them.
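The sampling procedure just described can be sketched with Python's `random` module. Every distribution and parameter below is a hypothetical stand-in for a curve fitted to analog data. The sketch also assumes the usual volumetric convention of dividing by the formation volume factor and applies the 7,758 bbl/acre-ft conversion.

```python
import math
import random
import statistics

rng = random.Random(42)                 # fixed seed so the run is repeatable

def sample_oil_in_place(rng):
    """One Monte Carlo trial of volumetric oil in place, in STB."""
    net_volume = rng.lognormvariate(math.log(50_000.0), 0.5)  # acre-ft, hypothetical
    porosity = rng.triangular(0.10, 0.25, 0.18)               # (low, high, mode)
    oil_saturation = rng.triangular(0.55, 0.85, 0.70)
    fvf = rng.triangular(1.10, 1.40, 1.20)                    # RB/STB
    return 7758.0 * net_volume * porosity * oil_saturation / fvf

trials = [sample_oil_in_place(rng) for _ in range(1000)]

mean_ooip = statistics.mean(trials)
median_ooip = statistics.median(trials)
print(f"mean = {mean_ooip:,.0f} STB, median = {median_ooip:,.0f} STB")
```

With these inputs the mean exceeds the median, the right-skewed behavior expected of a product of variables.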

For histograms, although there are definitions that use the groupings, the best way to estimate the mean and median is simply to find those of the histogram’s original data. The mode of a histogram is generally defined to be the midpoint of the class having the highest frequency (the so-called modal class). In case of a tie, when the two classes are adjacent, we use the common boundary for the mode. When the two classes are not adjacent, we say the data or the histogram is bimodal. One can have a multimodal data set.

One problem with this definition of mode for a histogram is that it is a function of the number of classes. That is, if we rebuild the histogram with a different number of classes, the modal class will move, as will the mode. It turns out that when we fit a curve to a histogram (i.e., fit a curve to data via a histogram), the best-fitting curve gives us a relatively unambiguous value of a mode. In practice, although changing the number of classes could result in a different curve fit, the change tends to be small. Choosing another type of curve (say, beta rather than a triangular) would change the mode also. Nevertheless, this definition of mode (the one from the best curve fit) is adequate for most purposes.

Interpreting statistics for density functions

The mode of a density function is the value where the curve reaches its maximum. This definition is clear and useful. The median of a density function is the value that divides the area under the curve into two equal pieces. That is, the median, or P50, represents a value M for which a sample is equally likely to be less than M or greater than M.

The mean of a density function corresponds to the x coordinate of the centroid of the two-dimensional (2D) region bounded by the curve and the X-axis. This definition, while unambiguous, is hard to explain and not easy to implement. That is, two people might easily disagree on the location of the mean of a density function. Fig. 9 shows these three measures of central tendency on a log-normal density function.

Interpreting statistics for cumulative distributions

The only obvious statistic for a cumulative function is the median, which is where the curve crosses the horizontal grid line determined by the 0.5 on the vertical axis. The mode corresponds to the point of inflection, where the cumulative curve is steepest, but this point is hard to locate by eye. The mean has no simple graphical interpretation in this context.

Table 4 summarizes the interpretations of these central tendency measures for the four contexts: data, histograms, density functions, and cumulative curves. Table 5 shows the calculated average, standard deviation, and coefficient of variation for each of five data sets.

Kurtosis, a fourth-order statistic

Kurtosis is defined in terms of 4th powers of (x – m), continuing the progression that defines mean, standard deviation, and skewness. Although widely used by statisticians and used to some degree by geoscientists, this statistic, which measures peakedness, is not discussed here because it plays no active role in the risk analysis methods currently used in the oil/gas industry.

Percentiles and confidence intervals

The xth percentile is the value on the X- (value) axis corresponding to a cumulative probability of x% on the Y- (cumulative probability) axis. We denote it Px.

A C-percent confidence interval (also called a probability interval or certainty interval) is obtained by removing (100 – C)/2% from each end of the range of a distribution. Thus, an 80% confidence interval ranges from P10 to P90, and a 90% confidence interval ranges from P5 to P95. Some companies prefer one or the other of these confidence intervals as a practical range of possible outcomes when modeling an investment or when estimating reserves, cost, or time.
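Percentiles and an 80% interval can be estimated from simulated samples as sketched below, using `statistics.quantiles` from the Python standard library. The sampled distribution (log-normal with a median of 1) is hypothetical.

```python
import random
import statistics

rng = random.Random(7)
# Hypothetical simulation output: 10,000 log-normal samples with median 1.
samples = [rng.lognormvariate(0.0, 0.5) for _ in range(10_000)]

# With n=100, quantiles() returns 99 cut points; cuts[k - 1] estimates Pk.
cuts = statistics.quantiles(samples, n=100)
p10, p50, p90 = cuts[9], cuts[49], cuts[89]

# The 80% interval removes (100 - 80)/2 = 10% from each end.
interval_80 = (p10, p90)
inside = sum(p10 <= x <= p90 for x in samples) / len(samples)

print(f"P10={p10:.3f} P50={p50:.3f} P90={p90:.3f}")
print(f"fraction inside the 80% interval: {inside:.3f}")
```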

When to use a given distribution

One of the challenges to someone building a model is to decide which distribution to use to represent a given parameter. While there are only very few hard and fast rules (although some people are vocal in their support of particular distributions), the following list provides guidelines for using some common distributions.

Normal distributions

Normal distributions are often used to represent variables that are themselves sums (aggregations) or averages of other variables. (See central limit theorem discussion.) Four popular applications are:

• Field production, which is a sum of production from various wells.
• Reserves for a business unit, which are sums of reserves from various fields.
• Total cost, which is a sum of line-item costs.
• Average porosity over a given structure.

Normal distributions are also used to characterize:

• Errors in measurement (temperature and pressure).
• People’s heights.
• Time to complete simple activities.

Samples of normal distributions should inherit the symmetry of their parent, which provides a simple check on samples suspected to come from an underlying normal distribution: calculate the mean, median, and skew. The mean and median should be about the same; the skew should be approximately zero.

Log-normal distributions

The log-normal distribution is very popular in the oil/gas industry, partly because it arises in calculating resources and reserves. By definition, X is log-normal if Ln(X) is normal. It follows from the central limit theorem (discussed later) that products are approximately log-normal. If Y = X1 × X2 ×...× XN , then Ln (Y) = Ln (X1) + Ln (X2) +..., which, being a sum of distributions, is approximately normal, making Y approximately log-normal. Common examples of log-normal distributions include:

• Areas (of structures in a play).
• Volumes (of resources by taking products of volumes, porosity, saturation, etc.).
• Production rates (from Darcy’s equation).
• Time to reach pseudosteady state (a product formula involving permeability, compressibility, viscosity, distance, etc.).

Other examples of variables often modeled with log-normal distributions are:

• Permeability
• Time to complete complex tasks
• New home prices
• Annual incomes within a corporation
• Ratios of prices for a commodity in successive time periods

A simple test for log-normality for data is to take the logs of the data and see if they form a symmetric histogram. Bear in mind that log-normal distributions are always skewed right and have a natural range from 0 to infinity. In recent years, a modified (three parameter) log-normal has been introduced that can be skewed right or left, but this distribution has not yet become widely used.
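The log test can be sketched as follows. The example builds a right-skewed variable as a product of six independent factors (hypothetical uniform variables on 0.5 to 1.5), then compares the skewness of the raw values with that of their logarithms. The `skewness` helper is a plain moment-based measure, not a formal test of log-normality.

```python
import math
import random
import statistics

def skewness(data):
    """Moment-based skewness: mean cubed standardized deviation."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

rng = random.Random(1)

def product_sample(rng, n_factors=6):
    p = 1.0
    for _ in range(n_factors):
        p *= rng.uniform(0.5, 1.5)      # hypothetical positive factor
    return p

products = [product_sample(rng) for _ in range(5000)]
logs = [math.log(p) for p in products]

skew_raw = skewness(products)
skew_log = skewness(logs)
print(f"skew of products = {skew_raw:.2f}, skew of logs = {skew_log:.2f}")
```

In this sketch the raw products are strongly skewed right, while their logs are much closer to symmetric, which is the pattern the test looks for.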

Triangular distributions

Triangular distributions are widely used by people who simply want to describe a variable by its range and mode (minimum, maximum, and most likely values). Triangular distributions may be symmetric or skewed left or right, depending on the mode’s location; and the minimum and maximum have no (zero) chance of occurring.

Some argue that triangular distributions are artificial and do not appear in nature, but they are unambiguous, understandable, and easy to define when working with experts. Beyond that, however, triangular distributions have other advantages:

• Though “artificial,” they can nevertheless be quite accurate (remember, any distribution only imitates reality).
• When one proceeds to combine the triangular distributions for a number of variables, the results tend toward the normal or log-normal distributions, preferred by purists.
• The extra effort in defining more “natural” distributions for the input variables is largely wasted when the outcome does not clearly reflect the difference.

Discrete distributions

A continuous distribution has the property that for any two values, a and b, which may be sampled, the entire range between a and b is eligible for sampling as well. A discrete distribution, by contrast, is specified by a set of X-values, {x1, x2, x3,...} (which could be countably infinite), together with their corresponding probabilities, {p1, p2, p3,...}. The most used discrete distributions are the binomial distribution, the general discrete distribution, and the Poisson distribution.

Central limit theorem

Let Y = X1 + X2 +...+ Xn, and Z = Y/n, where X1, X2, ..., Xn are independent, identically distributed random variables, each with mean μ and standard deviation σ. Then, both Y and Z are approximately normally distributed, the respective means of Y and Z are nμ and μ, and the respective standard deviations are √n σ and σ/√n.

This approximation improves as n increases. Note that this says the coefficient of variation, the ratio of standard deviation to mean, shrinks by a factor of √n. Even if X1, X2,... Xn are not identical or independent, the result is still approximately true. Adding distributions results in a distribution that is approximately normal, even if the summands are not symmetric; the mean of Y equals the sum of the means of the Xi (exactly); and the standard deviation of Y is approximately 1/√n times the sum of the standard deviations of the Xi and, thus, the coefficient of variation diminishes.

When is the approximation poor? Two conditions retard this process: a few dominant distributions and/or strong correlation among two or more of the inputs. Some illustrations may help.

For instance, take 10 identical log-normal distributions, each having mean 100 and standard deviation 40 (thus, with coefficient of variation, CV, of 0.40). The sum of these distributions has mean 1,000 and standard deviation 131.4, so CV = 0.131, which is very close to 0.40/sqrt(10) or 0.127.
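This illustration can be reproduced approximately with a short simulation. The sketch below sums 10 independent log-normal variables, each with mean 100 and standard deviation 40. The seed and trial count are arbitrary choices. For independent summands the theoretical standard deviation of the sum is 40√10, or about 126.5, so any single run will differ somewhat from the figures quoted above.

```python
import math
import random
import statistics

def lognormal_log_params(mean, sd):
    """Mean and standard deviation of ln(X) for a log-normal X."""
    sigma1 = math.sqrt(math.log(1 + sd ** 2 / mean ** 2))
    mu1 = math.log(mean) - sigma1 ** 2 / 2
    return mu1, sigma1

rng = random.Random(2024)
mu1, sigma1 = lognormal_log_params(100.0, 40.0)

n_summands, n_trials = 10, 20_000
sums = [sum(rng.lognormvariate(mu1, sigma1) for _ in range(n_summands))
        for _ in range(n_trials)]

mean_sum = statistics.mean(sums)
sd_sum = statistics.stdev(sums)
cv_sum = sd_sum / mean_sum
print(f"mean={mean_sum:.1f} sd={sd_sum:.1f} CV={cv_sum:.3f} "
      f"(0.40/sqrt(10)={0.40 / math.sqrt(10):.3f})")
```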

On the other hand, if we replace three of the summands with more dominant distributions, say each having a mean of 1,000 and varying standard deviations of 250, 300, and 400, then the sum has a mean of 3,700 and standard deviation 560, yielding a CV of 0.15. As one might expect, the sum of standard deviations divided by sqrt(10) is 389—not very close to the actual standard deviation. It makes more sense to divide the sum by sqrt(3), acknowledging the dominance of three of the summands. As one can find by Monte Carlo simulation, however, even in this case, the sum is still reasonably symmetric. The practical implications of this theorem are numerous and noteworthy.

Total cost is a distribution with a much smaller uncertainty than the component line items. Adding the most likely costs for each line item often results in a value much too low to be used as a base estimate.
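The point about most likely costs can be illustrated with a small simulation. The sketch assumes 10 independent line items, each following the same right-skewed triangular distribution (minimum 80, most likely 100, maximum 200). These numbers are hypothetical.

```python
import random

rng = random.Random(11)

N_ITEMS = 10
LOW, MODE, HIGH = 80.0, 100.0, 200.0    # hypothetical right-skewed line item

sum_of_modes = N_ITEMS * MODE            # the "add the most likely values" estimate

totals = [sum(rng.triangular(LOW, HIGH, MODE) for _ in range(N_ITEMS))
          for _ in range(5000)]

mean_total = sum(totals) / len(totals)
chance_at_or_below = sum(t <= sum_of_modes for t in totals) / len(totals)

print(f"sum of most likely costs = {sum_of_modes:.0f}")
print(f"mean simulated total     = {mean_total:.0f}")
print(f"chance total <= sum of modes = {chance_at_or_below:.4f}")
```

Under these assumptions the simulated total almost never comes in at or below the sum of the most likely values, because each item's mean lies well above its mode.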

Business unit reserves have a relatively narrow range compared to field reserves. Average porosity, average saturations, average net pay for a given structure area tend to be best represented by normal distributions, not the log-normal distributions conventionally used.

## Laws of probability

Probability theory is the cousin of statistics. Courses in probability are generally offered in the mathematics department of universities, whereas courses in statistics may be offered in several departments, acknowledging the wide variety of applications.

Our interest in probability stems from the following items we must estimate:

• The probability of success of a geological prospect.
• The probability of the success of prospect B, once we know that prospect A was successful.
• The probabilities of various outcomes when we have a discovery (for example, the chance of the field being large, medium, or small in volume).

While much of our application of the laws of probability is with decision trees, the notion of a discrete variable requires that we define probability. For any event, A, we use the notation P(A) (read “the probability of A”) to indicate a number between 0 and 1 that represents how likely it is that A will occur. Lest this sound too abstract, consider these facts:

• A = the occurrence of two heads when we toss two fair coins (or toss one fair coin twice); P(A) = ¼.
• A = the occurrence of drawing a red jack from a poker deck; P(A) = 2/52.
• A = the occurrence some time next year of a tropical storm similar to the one in Houston in July 2001; P(A) = 1/500.
• A = the event that, in a group of 25 people, at least two of them share a birthday; P(A) = 0.57, or roughly 1/2.
• A = the event that the sun will not rise tomorrow; P(A) = 0.

The numbers come from different sources. Take the red jack example. There are 52 cards in a poker deck (excluding the jokers), two of which are red jacks. We simply take the ratio for the probability. Such a method is called a counting technique.

Similarly, when we toss two fair coins, we know that there are four outcomes, and we believe that they are “equally likely,” for that is indeed what we mean by a fair coin. The Houston storm of July 2001 recorded as much as 34 in. of rain in a two- or three-day span, flooded several sections of highway (enough to float dozens of tractor-trailers), and drove thousands of families from their homes. Meteorologists, who have methods of assessing such things, said it was a “500-year flood.”

Most believe that it is certain that the sun will rise tomorrow (the alternative is not clear) and would, therefore, assign a probability of 1.0 to its rising and a probability of 0.0 to its negation (one of the rules of probability).
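The counting-based probabilities above can be verified with a few lines of Python. The birthday calculation assumes 365 equally likely birthdays and ignores leap years. The function name is illustrative.

```python
from fractions import Fraction
from itertools import product

# Two fair coins: enumerate the four equally likely outcomes.
outcomes = list(product("HT", repeat=2))
two_heads = Fraction(sum(1 for o in outcomes if o == ("H", "H")), len(outcomes))

# Red jack from a 52-card deck: two favorable cards.
red_jack = Fraction(2, 52)

def shared_birthday(n_people, days=365):
    """P(at least two of n_people share a birthday), equally likely days."""
    p_all_distinct = 1.0
    for k in range(n_people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

p_birthday_25 = shared_birthday(25)
print(two_heads, red_jack, round(p_birthday_25, 4))
```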

Sometimes we can count, but often we must estimate. Geologists must estimate the chance that a source rock was available, the conditions were right to create hydrocarbons, the timing and migration path were right for the hydrocarbon to find its way to the reservoir trap, the reservoir was adequately sealed once the hydrocarbons got there, and the reservoir rock is of adequate permeability to allow the hydrocarbons to flow to a wellbore. This complex estimation is done daily with sophisticated models and experienced, highly educated people. We use experience and consensus and, in the end, admit that we are estimating probability.

Rules of probability

• Rule 1: 1 – P(A) = P(-A), where -A is the complement of A. Alternately, P(A) + P(-A) = 1.0. This rule says that either A happens or it doesn’t. Rule 1′: Let A1, A2, ..., An be exclusive and exhaustive events, meaning that exactly one of them will happen; then P(A1) + P(A2) +...+ P(An) = 1.
• Rule 2: For this rule, we need a new definition and new notation. We write P(A|B) and say the probability of A knowing B (or “if B” or “given B”) to mean the revised probability estimate for A when we assume B is true (i.e., B already happened). B is called the condition, and P(A|B) is called the conditional probability. We write P(A & B) and say the probability that both A and B happen. (This is called the joint probability.) P(A & B) = P(A|B) × P(B). We say A and B are independent if P(A|B) = P(A). Note that when A and B are independent, P(A & B) = P(A) × P(B). Using the fact that A & B means the same as B & A and, thus, interpreting Rule 2 as P(B & A) = P(B|A) × P(A), it follows that P(A|B) × P(B) = P(B|A) × P(A), from which we can deduce Rule 3.
• Rule 3: P(B|A) = [P(A|B) P(B)]/ P(A).
• Rule 4: This rule is often paired with Rule 3 and called Bayes’ Theorem. Given the n mutually exclusive and exhaustive events A1, A2,..., An, then P(B) = P(B & A1) + P(B & A2) + ... + P(B & An). An example application of Bayes’ Theorem appeared in Murtha.
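Rules 2 through 4 combine into a short calculation, sketched below. The scenario is hypothetical: a prospect has a prior chance of success of 0.30, and a seismic anomaly (event B) is observed with probability 0.80 when the prospect is a success and 0.20 when it is a failure. The function name and all numbers are illustrative and are not taken from the cited Murtha example.

```python
def bayes_posterior(prior, likelihood):
    """Return ({Ai: P(Ai|B)}, P(B)) from P(Ai) and P(B|Ai).

    The events Ai must be mutually exclusive and exhaustive.
    """
    # Rule 2: joint probabilities P(B & Ai) = P(B|Ai) * P(Ai)
    joint = {a: likelihood[a] * prior[a] for a in prior}
    # Rule 4: P(B) is the sum of the joint probabilities
    p_b = sum(joint.values())
    # Rule 3: P(Ai|B) = P(B|Ai) * P(Ai) / P(B)
    posterior = {a: joint[a] / p_b for a in joint}
    return posterior, p_b

prior = {"success": 0.30, "failure": 0.70}        # hypothetical P(Ai)
likelihood = {"success": 0.80, "failure": 0.20}   # hypothetical P(B|Ai)

posterior, p_anomaly = bayes_posterior(prior, likelihood)
print(f"P(anomaly) = {p_anomaly:.2f}")
print(f"P(success | anomaly) = {posterior['success']:.3f}")
```

In this sketch, observing the anomaly raises the estimated chance of success from 0.30 to about 0.63.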

## Nomenclature

• f(x) = a probability density function, the derivative of F(x), various units
• hi = height of histogram, various units
• x = random variable whose values are being observed, various units
• xi = ith of N observed values of a random variable, various units
• yi = height of fitted curve, various units
• μ = mean, various units
• σ = standard deviation, various units