Generalized Least Squares (GLS, also called least squares with prior information) is a tool for statistical inference that is widely used in geotomography and geophysical inversion, as well as in other areas of the physical sciences and engineering. One of the attractive features of GLS that makes it especially useful in the imaging of multidimensional fields (for example, density, velocity, viscosity) is its ability to implement, in a natural and versatile way, prior information about the behavior of the field. Widely used types of prior information include the field being smooth, as quantified by its low-order derivatives, having a specified power spectral density or autocovariance, and satisfying a specified partial differential equation (such as the geostrophic flow equation or the diffusion equation). The word “regularization” is sometimes used to describe the effect of prior information on the solution process.
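We review the Generalized Least Squares (GLS) method here in order to provide context and to establish nomenclature. In GLS, observations (or data) and prior information (or inferences) are combined to arrive at a best estimate of initially unknown model parameters (which might, for example, represent a field sampled on a regular grid). The data are assumed to satisfy the linear equation $\mathbf{G}\mathbf{m} = \mathbf{d}$, where $\mathbf{d}$ is a vector of $N$ data, $\mathbf{m}$ is a vector of $M$ model parameters, and $\mathbf{G}$ is a known “kernel” matrix associated with the data. Prior information is assumed to satisfy a linear equation $\mathbf{H}\mathbf{m} = \mathbf{h}$, where $\mathbf{h}$ is a vector of $K$ prior values and $\mathbf{H}$ is a kernel matrix associated with the prior information. GLS problems are assumed to be over-determined, with $N + K > M$. For observed data $\mathbf{d}^{obs}$, known prior information $\mathbf{h}^{pri}$ and a specified model $\mathbf{m}$, the prediction error is $\mathbf{e} = \mathbf{d}^{obs} - \mathbf{G}\mathbf{m}$ and the prior information error is $\mathbf{l} = \mathbf{h}^{pri} - \mathbf{H}\mathbf{m}$. These errors are assumed to be Normally distributed with zero mean and prior covariance $\mathbf{C}_d$ and $\mathbf{C}_h$, respectively. Then, the normalized errors $\mathbf{C}_d^{-1/2}\mathbf{e}$ and $\mathbf{C}_h^{-1/2}\mathbf{l}$ are independent and identically distributed Normal random variables with zero mean and unit variance. Bayes’ theorem can be used to show that the best estimate $\mathbf{m}^{est}$ of the solution is the one that minimizes the generalized error $\Phi(\mathbf{m}) = E + L$, with $E = \mathbf{e}^T\mathbf{C}_d^{-1}\mathbf{e}$ and $L = \mathbf{l}^T\mathbf{C}_h^{-1}\mathbf{l}$. The solution can be expressed in a variety of equivalent forms, among which is the widely used version:

$$\mathbf{m}^{est} = \left[\mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{G} + \mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{H}\right]^{-1}\left[\mathbf{G}^T\mathbf{C}_d^{-1}\mathbf{d}^{obs} + \mathbf{H}^T\mathbf{C}_h^{-1}\mathbf{h}^{pri}\right] \qquad (1)$$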
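As a minimal sketch of the GLS solution (1), the following Python function combines a data kernel and a prior-information kernel, each weighted by the inverse of its covariance. The function name `gls_solve` and the tiny test problem (estimating a mean from three observations plus one prior value) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def gls_solve(G, d_obs, Cd, H, h_pri, Ch):
    """GLS estimate: [G^T Cd^-1 G + H^T Ch^-1 H]^-1 [G^T Cd^-1 d_obs + H^T Ch^-1 h_pri].

    Cd and Ch are assumed symmetric positive definite.
    """
    CdiG = np.linalg.solve(Cd, G)        # Cd^-1 G, without forming Cd^-1
    ChiH = np.linalg.solve(Ch, H)        # Ch^-1 H
    A = G.T @ CdiG + H.T @ ChiH          # M x M normal matrix
    b = CdiG.T @ d_obs + ChiH.T @ h_pri  # right-hand side (uses symmetry of Cd, Ch)
    return np.linalg.solve(A, b)

# Hypothetical example: estimate a mean from three data and one prior value of zero.
G = np.ones((3, 1))
d_obs = np.array([1.0, 2.0, 3.0])
Cd = np.eye(3)
H = np.eye(1)
h_pri = np.zeros(1)
Ch = np.eye(1)
m_est = gls_solve(G, d_obs, Cd, H, h_pri, Ch)
print(m_est)
```

With unit variances, the prior value acts like a fourth observation, so the estimate is the sum of the data divided by four.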
The assumption of linear kernels $\mathbf{G}$ and $\mathbf{H}$ is a very restrictive one. In the well-studied nonlinear generalization, the products $\mathbf{G}\mathbf{m}$ and $\mathbf{H}\mathbf{m}$ are replaced with nonlinear vector functions of $\mathbf{m}$. Then, a common solution method is to linearize the data and prior information equations around a trial solution:
The solution is then found by iterative application of (1) to the linearized equations (2); that is, by the Gauss-Newton method. Alternatively, a gradient-descent method can be used that employs:
The latter approach is preferred for very large $M$, since the convergence rate of gradient descent is independent of the dimension of the problem, whereas the effort required to solve the $M \times M$ system (1) by a direct method scales as $M^3$.
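The Gauss-Newton iteration described above can be sketched as follows. This is an illustration under stated assumptions: the data function `g`, its Jacobian `jac`, and the test problem are hypothetical, and the prior information is kept linear for brevity. Each pass linearizes the data equation about the current trial solution and applies the GLS formula (1) to the residuals to obtain an update.

```python
import numpy as np

def gauss_newton_gls(g, jac, d_obs, Cd, H, h_pri, Ch, m0, n_iter=20):
    """Iterate the GLS formula on equations linearized about the trial solution."""
    m = np.asarray(m0, dtype=float).copy()
    Cdi = np.linalg.inv(Cd)
    Chi = np.linalg.inv(Ch)
    for _ in range(n_iter):
        Gp = jac(m)              # linearized data kernel at the trial solution
        dd = d_obs - g(m)        # data residual
        dh = h_pri - H @ m       # prior-information residual (prior kept linear)
        A = Gp.T @ Cdi @ Gp + H.T @ Chi @ H
        b = Gp.T @ Cdi @ dd + H.T @ Chi @ dh
        m = m + np.linalg.solve(A, b)   # update the trial solution
    return m

# Hypothetical nonlinear problem: d_i = exp(a_i * m) with one model parameter.
a = np.array([1.0, 2.0, 3.0])

def g(m):
    return np.exp(a * m[0])

def jac(m):
    return (a * np.exp(a * m[0])).reshape(-1, 1)

d_obs = g(np.array([0.5]))       # noise-free data generated with m = 0.5
m_est = gauss_newton_gls(g, jac, d_obs, np.eye(3),
                         np.eye(1), np.zeros(1), 1e8 * np.eye(1), [0.0])
print(m_est)
```

The very large prior variance makes the prior information weak, so the iteration is expected to recover the value used to generate the data.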
We now discuss issues related to the covariance matrices that appear in GLS. The data covariance $\mathbf{C}_d$ quantifies the uncertainty of the observations and the information covariance $\mathbf{C}_h$ quantifies the uncertainty of the prior information. Prior knowledge of the inherent accuracy of the measurement technique is needed to assign $\mathbf{C}_d$, and prior knowledge of the physically plausible solutions, perhaps stemming from an understanding of the underlying physics, is needed to assign $\mathbf{C}_h$. These assignments are often very subjective, especially when correlations are believed to occur (that is, when $\mathbf{C}_d$ and $\mathbf{C}_h$ have non-zero off-diagonal elements). For example, one geotomographic study reconstructs a two-dimensional field using a $\mathbf{C}_h$ that represents the autocovariance of the field and that is dependent upon a scale length q. The value of q is chosen on the basis of broad physical arguments that, while plausible, leave considerable room for subjectivity.
The matrices $\mathbf{C}_d$ and $\mathbf{C}_h$ together contain many more elements than the constraints imposed by the data and prior information. Consequently, insufficient information is available to uniquely solve for all the elements of $\mathbf{C}_d$ and $\mathbf{C}_h$. However, it sometimes may be possible to parameterize $\mathbf{C}_d$ and/or $\mathbf{C}_h$ in terms of a vector $\mathbf{q}$ of $J$ parameters, and ask whether an initial estimate of $\mathbf{q}$ can be improved. As long as $J$ is small enough, adequate information may be available to determine a best estimate $\mathbf{q}^{est}$. We refer to the process of determining $\mathbf{q}^{est}$ as “tuning”, since in typical practice it requires that the covariances be close to their true values.
As an example of a parametrized covariance, we consider the case where the model parameters $m_i$ represent a sampled version of a continuous function $m(x)$, where $x$ is an independent variable; that is, $m_i = m(x_i)$, with $x_i = i\,\Delta x$ and $\Delta x$ the sampling interval. The prior information that $m(x)$ is approximately oscillatory with wavenumber q can be modeled by:
In this case, $\mathbf{C}_h$ approximates the autocovariance of $m(x)$, which is assumed to be stationary. The goal of tuning is to provide a best estimate of q, as well as a best estimate of the model parameters. This problem is further developed in Example 4, below.
Although the GLS formulation is widely used in geotomography and geophysical imaging, the tuning of variance is typically implemented in a very limited fashion, through the use of trade-off curves. In this procedure, a scalar parameter q controls the relative size of $\mathbf{C}_d$ and $\mathbf{C}_h$, with the remaining structure of the covariances specified. The GLS problem is then solved for a suite of values of q, the functions $E(q)$ and $L(q)$ are tabulated, and the resulting trade-off curve is used to identify a solution that has acceptably low E and L. As we will show below, this ad hoc procedure is not a consistent extension of GLS, because it results in a different q than the one implied by Bayes’ principle. A more consistent approach is to apply Bayes’ theorem directly to estimate both the model parameters and the covariance parameters. Such an approach has been implemented in the context of ordinary least squares and the Markov chain Monte Carlo (MCMC) inversion method (which is a computationally intensive alternative to GLS). An important and novel result of this paper is a computationally efficient procedure for tuning GLS in a Bayes-consistent manner.
2. Bayesian Extension of GLS
The general process of using Bayes’ theorem to construct a posterior probability density function (p.d.f.) that depends on unknown parameters, and of estimating those parameters through the maximization of probability, is very well understood. In the current case, the p.d.f. has M model parameters and J covariance parameters, so the maximization process (implemented, say, with a gradient ascent method) must search an $(M+J)$-dimensional space. Our main purpose here is to show that the process can be organized in a way that makes use of the GLS solution (1) and thus reduces the dimensionality of the searched space to J.
The GLS solution (1) yields the $\mathbf{m}$ that minimizes the generalized error $\Phi(\mathbf{m})$, or equivalently, the $\mathbf{m}$ that maximizes the Normal posterior p.d.f.:
Here, Bayes’ theorem is used to relate the Normal posterior p.d.f. to the Normal likelihood and the Normal prior. When poorly known parameters $\mathbf{q}$ are added to the problem, they must be treated as additional random variables. Partitioning $\mathbf{q}$ into a part that appears in the likelihood and a part that appears in the prior, we have:
Here, we have assumed that and are not correlated with one another. The maximization with respect to the two variables can be performed as a sequence of two single-variable maximizations:
In the special case of the uniform prior, the maximization in (7a) is the GLS solution at fixed $\mathbf{q}$. For the Normal p.d.f.:
the maximization (7b) is equivalent to the minimization of an objective function $\Psi(\mathbf{q})$, defined as:
The quantity $\ln\det\mathbf{C}_d$ is best computed by finding the Cholesky decomposition $\mathbf{C}_d = \mathbf{D}\mathbf{D}^T$, the algorithm for which is implemented in many software environments, including MATLAB® and Python/linalg. Then, $\ln\det\mathbf{C}_d = 2\sum_i \ln D_{ii}$ (and similarly for $\mathbf{C}_h$). The nonlinear optimization problem of minimizing $\Psi$ can be implemented using a gradient descent method, provided that the derivative $\partial\Psi/\partial q_k$ can be calculated. In the next section, we derive analytic formulas for this and related derivatives.
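The Cholesky-based evaluation of the log-determinant can be sketched as below. The helper names are hypothetical, and the form of the objective function used here, $\Psi = E + L + \ln\det\mathbf{C}_d + \ln\det\mathbf{C}_h$ with additive constants dropped, is an assumption based on the negative logarithm of a Normal p.d.f.; it should be checked against equation (9) of the original before use.

```python
import numpy as np

def logdet_spd(C):
    """ln det C for a symmetric positive-definite C, via C = D D^T."""
    D = np.linalg.cholesky(C)                 # lower-triangular factor
    return 2.0 * np.sum(np.log(np.diag(D)))   # det C = (prod D_ii)^2

def objective(m, G, d_obs, Cd, H, h_pri, Ch):
    """Assumed form: Psi = E + L + ln det Cd + ln det Ch (constants dropped)."""
    e = d_obs - G @ m                         # prediction error
    l = h_pri - H @ m                         # prior-information error
    E = e @ np.linalg.solve(Cd, e)            # e^T Cd^-1 e
    L = l @ np.linalg.solve(Ch, l)            # l^T Ch^-1 l
    return E + L + logdet_spd(Cd) + logdet_spd(Ch)

# Small check of the log-determinant: det [[4, 1], [1, 3]] = 11.
C = np.array([[4.0, 1.0], [1.0, 3.0]])
ld = logdet_spd(C)

# Hypothetical evaluation of the objective for a mean-estimation problem.
psi_val = objective(np.array([1.5]), np.ones((3, 1)), np.array([1.0, 2.0, 3.0]),
                    np.eye(3), np.eye(1), np.zeros(1), np.eye(1))
print(ld, psi_val)
```

Summing logarithms of the diagonal of the Cholesky factor avoids the overflow or underflow that can occur when the determinant of a large matrix is formed directly.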
3. Solution Method and Formula for Derivatives
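The process of simultaneously estimating the covariance parameters and model parameters consists of six steps. First, the analytic forms of the covariance matrices $\mathbf{C}_d(\mathbf{q})$ and $\mathbf{C}_h(\mathbf{q})$ are specified, and their derivatives $\partial\mathbf{C}_d/\partial q_k$ and $\partial\mathbf{C}_h/\partial q_k$ are computed analytically. Second, an initial estimate $\mathbf{q}^{(0)}$ is identified. Third, the covariance matrices $\mathbf{C}_d(\mathbf{q})$ and $\mathbf{C}_h(\mathbf{q})$ are inserted into (1), yielding model parameters $\mathbf{m}(\mathbf{q})$. Fourth, using formulas developed below, the value of the derivative $\partial\Psi/\partial q_k$ is calculated at $\mathbf{q}$. Fifth, a gradient descent method employing $\partial\Psi/\partial q_k$ is used to iteratively perturb $\mathbf{q}$ towards the minimum of $\Psi$ at $\mathbf{q}^{est}$ (and, in the process, repeating steps three through five many times). Sixth, the estimated model parameters are computed as $\mathbf{m}^{est} = \mathbf{m}(\mathbf{q}^{est})$. This process is depicted in Figure 1.
Our derivation of $\partial\Psi/\partial q_k$ uses three matrix derivatives, those of $\mathbf{A}^{-1}$, $\mathbf{A}^{-1/2}$ and $\ln\det\mathbf{A}$, that may be unfamiliar to some readers, so we derive them here for completeness. Let $\mathbf{A}(q)$ be a square, invertible, differentiable matrix. Differentiating $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ yields $(\partial\mathbf{A}^{-1}/\partial q)\,\mathbf{A} + \mathbf{A}^{-1}\,(\partial\mathbf{A}/\partial q) = 0$, which can be rearranged into (their (36)):

$$\frac{\partial\mathbf{A}^{-1}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1} \qquad (10)$$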
Figure 1. Schematic depiction of solution process. (a) The GLS solution (red curve) is considered a function of the covariance parameters and its derivative (blue line) at a point is computed by analytic differentiation of GLS equation (1); (b) The objective function Ψ (colors) is considered a function of . The results of (a) are used to compute its gradient at the point . The gradient descent method is used to iteratively perturb this point anti-parallel to the gradient until it reaches the minimum of the objective function, resulting in the best-estimate . This value is then used to determine a best-estimate of the model parameters , as depicted in (a).
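Similarly, differentiating $\mathbf{A}^{-1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1}$ and applying (10) yields the Sylvester equation:

$$\frac{\partial\mathbf{A}^{-1/2}}{\partial q}\,\mathbf{A}^{-1/2} + \mathbf{A}^{-1/2}\,\frac{\partial\mathbf{A}^{-1/2}}{\partial q} = -\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\,\mathbf{A}^{-1} \qquad (11)$$

We have not been able to determine a source for this equation, but in all likelihood, it has been derived previously. In practice, (11) is not significantly harder to compute than (10), because efficient algorithms for solving Sylvester equations and for computing a symmetric (principal) square root are widely available and implemented in many software environments, including MATLAB® and Python/linalg. The derivative of $\det\mathbf{A}$ is derived starting with Jacobi’s formula:

$$\frac{\partial\det\mathbf{A}}{\partial q} = \operatorname{tr}\!\left(\operatorname{adj}(\mathbf{A})\,\frac{\partial\mathbf{A}}{\partial q}\right) = \det(\mathbf{A})\,\operatorname{tr}\!\left(\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\right) \qquad (12)$$

where $\operatorname{adj}(\mathbf{A})$ is the adjugate and $\operatorname{tr}(\cdot)$ is the trace, applying Laplace’s identity $\operatorname{adj}(\mathbf{A}) = \det(\mathbf{A})\,\mathbf{A}^{-1}$ and the rule $\operatorname{tr}(c\mathbf{M}) = c\,\operatorname{tr}(\mathbf{M})$ (where c is a scalar and $\mathbf{M}$ is a matrix). Finally, the determinant is moved to the left-hand side and the well-known relationship $\partial\ln f/\partial q = f^{-1}\,\partial f/\partial q$, for a differentiable function $f(q)$, is applied, yielding (their (38)):

$$\frac{\partial\ln\det\mathbf{A}}{\partial q} = \operatorname{tr}\!\left(\mathbf{A}^{-1}\,\frac{\partial\mathbf{A}}{\partial q}\right) \qquad (13)$$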
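The three matrix derivatives can be checked against centered finite differences with a short script. This is only a sketch: the one-parameter matrix family `A_of_q` is a hypothetical example chosen to be symmetric and positive definite, and the identities tested are the standard ones for the derivatives of $\mathbf{A}^{-1}$, $\mathbf{A}^{-1/2}$ (via a Sylvester equation) and $\ln\det\mathbf{A}$.

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

def A_of_q(q):
    # Hypothetical symmetric positive-definite matrix depending on a scalar q.
    return np.array([[2.0 + q, 0.5 * q], [0.5 * q, 1.0 + q * q]])

def dA_dq(q):
    return np.array([[1.0, 0.5], [0.5, 2.0 * q]])

def inv_sqrt(A):
    # Principal square root of the inverse; real part taken to drop round-off.
    return np.real(sqrtm(np.linalg.inv(A)))

q, step = 0.7, 1e-6
A, dA = A_of_q(q), dA_dq(q)
Ai = np.linalg.inv(A)

# Derivative of the inverse: -A^-1 (dA/dq) A^-1.
dAi_analytic = -Ai @ dA @ Ai
dAi_fd = (np.linalg.inv(A_of_q(q + step)) - np.linalg.inv(A_of_q(q - step))) / (2 * step)

# Derivative X of A^-1/2 from the Sylvester equation S X + X S = d(A^-1)/dq.
S = inv_sqrt(A)
dS_analytic = solve_sylvester(S, S, dAi_analytic)
dS_fd = (inv_sqrt(A_of_q(q + step)) - inv_sqrt(A_of_q(q - step))) / (2 * step)

# Derivative of ln det A: tr(A^-1 dA/dq).
dlogdet_analytic = np.trace(Ai @ dA)
dlogdet_fd = (np.linalg.slogdet(A_of_q(q + step))[1]
              - np.linalg.slogdet(A_of_q(q - step))[1]) / (2 * step)

print(np.max(np.abs(dAi_analytic - dAi_fd)),
      np.max(np.abs(dS_analytic - dS_fd)),
      abs(dlogdet_analytic - dlogdet_fd))
```

The printed differences should be small compared with the sizes of the derivatives themselves; the exact values depend on the finite-difference step.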
We begin the main derivation by considering the case in which the data covariance $\mathbf{C}_d$ depends on a parameter vector $\mathbf{q}$, and the information covariance $\mathbf{C}_h$ is constant. The derivative of the GLS solution can be found by applying the chain rule to (1):
Note that we have used (10). The derivatives of the normalized prediction error and of the total error $E$ are:
Here, the Sylvester equation arises from (11). An alternate way of differentiating E that does not require solving a Sylvester equation is:
The derivatives of the normalized error in prior information and of the total error $L$ are:
Finally, since , we have:
Note that we have applied (13).
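Finally, we consider the case in which the information covariance $\mathbf{C}_h$ depends on parameters $\mathbf{q}$, and $\mathbf{C}_d$ is constant. Since the data and prior information play completely symmetric roles in (1), the derivatives can be obtained by interchanging the roles of $\mathbf{G}$ and $\mathbf{H}$, $\mathbf{d}^{obs}$ and $\mathbf{h}^{pri}$, $\mathbf{C}_d$ and $\mathbf{C}_h$, $\mathbf{e}$ and $\mathbf{l}$, and E and L, in the equations above, yielding: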
These formulas have been checked numerically.
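A numerical check of this kind might look like the sketch below. It is not the author's code. The problem setup is hypothetical, the derivative of the solution is obtained by differentiating the normal equations of (1) and using the rule for the derivative of an inverse, and the derivative of the objective uses the fact that the GLS solution minimizes the generalized error, so that only the explicit dependence on q contributes. The objective is assumed to be $E + L + \ln\det\mathbf{C}_d$ plus terms that do not depend on q.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 2
x = np.linspace(0.0, 1.0, N)
G = np.column_stack([np.ones(N), x])            # straight-line data kernel
d_obs = G @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(N)
H, h_pri, Ch = np.eye(M), np.zeros(M), 10.0 * np.eye(M)

def Cd_of_q(q):
    # Hypothetical parameterization: variance grows linearly with x.
    return np.diag(1.0 + q * x)

dCd_dq = np.diag(x)                             # analytic derivative of Cd

def solve(q):
    Cdi = np.linalg.inv(Cd_of_q(q))
    Chi = np.linalg.inv(Ch)
    A = G.T @ Cdi @ G + H.T @ Chi @ H
    b = G.T @ Cdi @ d_obs + H.T @ Chi @ h_pri
    return np.linalg.solve(A, b), A, Cdi

def psi(q):
    m, _, Cdi = solve(q)
    e, l = d_obs - G @ m, h_pri - H @ m
    return e @ Cdi @ e + l @ np.linalg.solve(Ch, l) + np.linalg.slogdet(Cd_of_q(q))[1]

q, step = 0.5, 1e-6
m, A, Cdi = solve(q)
e = d_obs - G @ m

# dm/dq = -A^-1 G^T Cd^-1 (dCd/dq) Cd^-1 e
dm_analytic = -np.linalg.solve(A, G.T @ Cdi @ dCd_dq @ Cdi @ e)
dm_fd = (solve(q + step)[0] - solve(q - step)[0]) / (2 * step)

# dPsi/dq = -e^T Cd^-1 (dCd/dq) Cd^-1 e + tr(Cd^-1 dCd/dq)
dpsi_analytic = -e @ Cdi @ dCd_dq @ Cdi @ e + np.trace(Cdi @ dCd_dq)
dpsi_fd = (psi(q + step) - psi(q - step)) / (2 * step)

print(dm_analytic, dm_fd, dpsi_analytic, dpsi_fd)
```

Agreement between the analytic and finite-difference values gives some confidence that a gradient-descent routine built on the analytic derivative is moving in the right direction.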
4. Examples with Discussion
In the first example, we examine the simplistic case in which the parameter q represents an overall scaling of variance; that is and , with specified and . The solution is independent of q, as can be verified by substitution into (1). The parameter q can then be found by direct minimization of (9), which simplifies to:
Here, we have used the rule , valid for any matrix , and have defined and . The minimum occurs when:
This is a generalization of the well-known maximum likelihood estimate of the sample variance . As long as exists, the minimization in (21) is well-behaved and the overall scaling q is uniquely determined.
In the second example, we examine another simplistic case in which a parameter q represents the relative weighting of variance; that is and . We consider the problem of estimating the mean of data given observations and prior information (where and are vectors of zeros and ones, respectively), when , and . Applying (1), we find that . Then, the objective function is and its derivative is . The solution to is , as can be verified by direct substitution. Thus, the solution splits the difference between the observations and the prior values, and yields prior variances and that are equal. While simplistic, this problem illustrates that, at least in some cases, GLS is capable of uniquely determining the relative sizes of $\mathbf{C}_d$ and $\mathbf{C}_h$. Because trade-off curves, as defined in the Introduction, are based on the behavior of E and L, and not the complete objective function Ψ, the weighting parameter estimated from them in general will be different from the one that minimizes Ψ. Consequently, the trade-off curve procedure is not consistent with the Bayesian framework upon which GLS rests.
Our third example demonstrates the tuning of data covariance . In many cases, observational error increases during the course of an experiment, due to degradation of equipment or to worsening environmental conditions. The example demonstrates that the method is capable of accurately quantifying the fractional rate of increase p of the variance , which is assumed to vary with position . In our simulation, we consider synthetic data, evenly-spaced on the interval , which scatter around the curve (Figure 2). The covariance of the data is modeled as , where and is the Kronecker delta; that is, the data are uncorrelated and their variance increases linearly with x. The derivative of the covariance is . We have included prior information with and , which implements the notion that the model parameters are small. The corresponding covariance is chosen to be large, , indicating that this information is weak. The goal is to tune the rate of increase of variance and to arrive at a best-estimate of the two model parameters. The starting value is taken to be , which corresponds to uniform variance. It is successively improved by a gradient descent method that minimizes Ψ, yielding an estimated value .This estimate differs from the true value by about 1%. The estimated solution differs from by a few tenths of a percent, which may be significant in some applications.
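A tuning experiment in the spirit of this example can be sketched as follows. All numerical values (number of data, base variance, true rate of increase, the straight-line model) are hypothetical and are not those of the original simulation. For brevity, the analytic-gradient descent described in the text is replaced by SciPy's bounded scalar minimizer, and the objective is assumed to be $E + L + \ln\det\mathbf{C}_d$ plus constants. Because the data are uncorrelated, the covariance is stored as a vector of variances.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
N = 2000
x = np.linspace(0.0, 1.0, N)
G = np.column_stack([np.ones(N), x])        # two model parameters: intercept, slope
m_true = np.array([1.0, 2.0])
sigma0_sq, p_true = 0.01, 1.0               # hypothetical base variance and rate
var_true = sigma0_sq * (1.0 + p_true * x)   # variance increases linearly with x
d_obs = G @ m_true + np.sqrt(var_true) * rng.standard_normal(N)
ch_var = 1.0e4                              # weak prior that the parameters are small

def solve(p):
    """GLS solution for diagonal Cd(p), with H = I and h_pri = 0."""
    w = 1.0 / (sigma0_sq * (1.0 + p * x))   # diagonal of Cd^-1
    A = G.T @ (w[:, None] * G) + np.eye(2) / ch_var
    b = G.T @ (w * d_obs)
    return np.linalg.solve(A, b), w

def psi(p):
    """Assumed objective: E + L + ln det Cd (terms independent of p dropped)."""
    m, w = solve(p)
    e = d_obs - G @ m
    return np.sum(w * e * e) + (m @ m) / ch_var - np.sum(np.log(w))

p_start = 0.0                               # uniform variance
res = minimize_scalar(psi, bounds=(0.0, 10.0), method="bounded")
p_est = res.x
m_est = solve(p_est)[0]
print(p_est, m_est)
```

The estimate of the rate parameter scatters around the value used to generate the data, with a spread that shrinks as the number of data grows; the model parameters change only slightly between the starting and tuned covariances.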
Figure 2. Example of tuning . (a) Plot of synthetic data (red dots) and predicted data (green curve); (b) The starting value corresponds to uniform variance (black curve). The estimate corresponds to increasing variance (green curve); (c) Generalized error (black curve). The starting value (black circle) is successively improved (red circles) by a gradient descent method, yielding an estimate (green circle); (d) The gradient , computed using the formulas developed in the text; (e) The first model parameter , highlighting the initial value (black circle) and estimated value (green circle) (f) Same as (e), except for the second model parameter .
The fourth example demonstrates tuning of information covariance $\mathbf{C}_h$. In many instances, one may need to “reconstruct” or “interpolate” a function on the basis of unevenly and sparsely sampled data. In this case, prior information on the autocovariance of the function can enable a smooth interpolation. Furthermore, it can enforce a covariance structure that may be required, say, by the underlying physics of the problem. In our example, we suppose that the function is known to be oscillatory on physical grounds, but that the wavenumber of those oscillations is known only imprecisely. The goal is to tune prior knowledge of wavenumber to arrive at a best estimate of the reconstructed function. In our simulation, a total of model parameters are uniformly spaced on the interval and represent a sampled version of a continuous, sinusoidal function with wavenumber (Figure 3). Synthetic data with uncorrelated error with variance are available for randomly chosen points , where the index function aligns observations to model parameters in x. The data kernel is . The prior information is given in (4), with autocovariance and . The derivative is . An initial guess is improved using a gradient descent method, yielding an estimated value of that differs from by less than 0.01%. The reconstructed function is smooth and sinusoidal and the fit to the data is much improved.
Examples three and four were implemented in MATLAB® and executed in less than 5 s on a notebook computer. They confirm the flexibility, speed and effectiveness of the method. An ability to tune prior information on autocovariance may be of special utility in seismic exploration applications, where three-dimensional waveform datasets are routinely interpolated.
A limitation of this overall “parametric” approach is that the solution is dependent on the choice of parameterization, which must be guided by prior knowledge of the general properties of the covariance matrices in the particular problem being solved. In Example 3, we were able to recognize (say, by visually examining the data plotted in Figure 2(a)) that observational error increases with x and chose a $\mathbf{C}_d$ that matched this scenario. If, instead, the degree of correlation between successive data increased with x, this pattern might be less expected, more difficult to detect, and require a different parameterization.
Figure 3. Example of tuning $\mathbf{C}_h$. Sparsely-sampled synthetic data (red dots) are oscillatory. (a) A regularly-sampled version is created by imposing the oscillatory covariance . With the starting value , the reconstruction poorly fits the data (black curve). Tuning leads to a better fit (green curve with dots), as well as a precise estimate of wavenumber ; (b) Decrease in with iteration number during the gradient descent process.
Not every parameterization of (or ) is necessarily well-behaved. To avoid poor behavior, the parameterization must be chosen so its determinant does not have zeros at values of that will prevent the steepest descent process from converging to the global minimum. That this choice can be problematical is illustrated by the simple Toeplitz version of (with , ):
This form is useful for quantifying correlations within a stationary sequence of data. Yet, as is illustrated in Figure 4, the volume is crossed by many surfaces that correspond to surfaces of singular objective function Ψ. Their presence suggests that the steepest descent path between a starting value $\mathbf{q}^{(0)}$ and the global minimum at $\mathbf{q}^{est}$ may be very convoluted (if, indeed, such a path exists) unless $\mathbf{q}^{(0)}$ is very close to $\mathbf{q}^{est}$.
Figure 4. The function for the case given by (22). (a) The surface for and the other qs randomly assigned; (b) Same as (a), but with ; (c) Same as (a), but with ; (d) Perspective view of the surfaces in the volume. The positions of the three slices in (a), (b) and (c) are noted on the -axis (green arrows). A question posed in the text is whether, given an arbitrary point and the global minimum of the objective function, say at (and with both points satisfying ), a steepest-descent path necessarily exists between them.
Generalized Least Squares requires the assignment of two prior covariance matrices, the prior covariance of the data and the prior covariance of the prior information. Making these assignments is often a very subjective process. However, in cases in which the forms of these matrices can be anticipated up to a set of poorly known parameters, information contained within the data and prior information can be used to improve knowledge of them, a process we call “tuning”. Tuning can be achieved by minimizing an objective function that depends on both the generalized error and the determinants of the covariance matrices to arrive at a best estimate of the parameters. Analytic and computationally tractable formulas are derived for the derivative needed to implement the minimization via a gradient descent method. Furthermore, the problem is organized so that the minimization need be performed only over the space of covariance parameters, and not over the typically much larger space of model and covariance parameters. Although some care needs to be exercised as the covariance matrices are parametrized, the minimization is tractable and can lead to better estimates of the model parameters. An important outcome of this study is the recognition that the use of trade-off curves to determine the relative weighting of covariance (a practice ubiquitous in geophysical imaging) is not consistent with the underlying Bayesian framework of Generalized Least Squares. The strategy outlined here provides a consistent solution.
The author thanks Roger Creel for helpful discussion.
 Tarantola, A. and Valette, B. (1982) Generalized Non-Linear Inverse Problems Solved Using the Least Squares Criterion. Reviews of Geophysics and Space Physics, 20, 219-232.
 Abers, G. (1994) Three-Dimensional Inversion of Regional P and S Arrival Times in the East Aleutians and Sources of Subduction Zone Gravity Highs. Journal of Geophysical Research, 99, 4395-4412.
 Menke, W. (2005) Case Studies of Seismic Tomography and Earthquake Location in a Regional Context. Geophysical Monograph 157. American Geophysical Union, Washington DC.
 Nettles, M., and Dziewonski, A.M. (2008) Radially Anisotropic Shear Velocity Structure of the Upper Mantle Globally and Beneath North America. Journal of Geophysical Research, 113, B02303.
 Humphreys, E.D., Dueker, K.G., Schutt, D.L. and Smith, R.B. (2000) Beneath Yellowstone: Evaluating Plume and Nonplume Models Using Teleseismic Images of the Upper Mantle. GSA Today, 10, 1-7.
 Gillet, N., Schaeffer, N. and Jault, D. (2011) Rationale and Geophysical Evidence for Quasi-Geostrophic Rapid Dynamics within the Earth’s Outer Core. Physics of the Earth and Planetary Interiors, 187, 380-390.
 Zhao, S. (2013) Lithosphere Thickness and Mantle Viscosity Estimated from Joint Inversion of GPS and GRACE-Derived Radial Deformation and Gravity Rates in North America. Geophysical Journal International, 194, 1455-1472.
 Menke, W. and Eilon, Z. (2015) Relationship between Data Smoothing and the Regularization of Inverse Problems. Pure and Applied Geophysics, 172, 2711-2726.
 Snyman, J.A. and Wilke, D.N. (2018) Practical Mathematical Optimization—Basic Optimization Theory and Gradient-Based Algorithms. Springer Optimization and Its Applications, 2nd Edition, Springer, New York, 340 p.
 Zaroli, C., Sambridge, M., Lévêque, J.-J., Debayle, E. and Nolet, G. (2013) An Objective Rationale for the Choice of Regularization Parameter with Application to Global Multiple-Frequency S-Wave Tomography. Solid Earth, 4, 357-371.
 Malinverno, A. and Briggs, V.A. (2004) Expanded Uncertainty Quantification in Inverse Problems: Hierarchical Bayes and Empirical Bayes. Geophysics, 69, 877-1103.
 Schmidt, E. (1973) Cholesky Factorization and Matrix Inversion, National Oceanic and Atmospheric Administration Technical Report NOS-56. US Government Printing Office, Washington DC.