Theory#

What is the Principal Component Geostatistical Approach (PCGA) ?

If you are interested in the math behind pypcga, you’ve comed to the right place. The following is an extract summary coming from the PhD of Antoine COLLET. See chapter 4.4.1 and 4.4.2 in Collet [2024].

General “black-box” framework#

Although these three “black-box” or “derivative-free” inversion methods may at first appear to be unrelated, not least because their respective authors present them in this way and do not use the same notations, they are in fact very similar in that they minimize the same objective function and can be derived from the same approximation form of Newton’s method: Gauss-Newton iterations. But they differ in the computation of certain matrices and in the implementation.

Bayesian framework#

Considering the measurement equation

\[\mathbf{d}_{\mathrm{obs}} = \mathcal{F}(\mathbf{s}) + \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_{\mathrm{obs}}),\]

where \(\mathcal{F}\) designates the forward operator predictions masked in the data domain to match observations, the history matching problem consists in finding \(\mathbf{s}\) knowing \(\mathbf{d}_{\mathrm{obs}}\). This is equivalent to maximizing the probability function \(P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}})\) to obtain \(\widehat{\mathbf{s}}\) the maximum likelihood estimate (MLE). Using Bayes theorem, we can calculate the posterior probability density function (PDF) of an event that will occur based on another event that has already occurred with

\[\begin{split}\begin{aligned} \mathrm{Bayes' rule:} \quad P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}}) &= P(\mathbf{s}) \times \dfrac{P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})}{P(\mathbf{d}_{\mathrm{obs}})}, \\ & \propto \underbrace{P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})}_{\mathrm{likelyhood}} \times \underbrace{P(\mathbf{s})}_{\mathrm{prior}}, \end{aligned}\end{split}\]

where \(P(\mathbf{d}_{\mathrm{obs}})\) is neglected because it is always positive, it does not depend on \(\mathbf{s}\) and it is very difficult to compute. Since the measurement errors are assumed to be Gaussian, i.e., \(\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_{\mathrm{obs}})\), then \(P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s})\) is Gaussian, and noting that \(\mathbb{E}\left[\mathbf{d}_{\mathrm{obs}}|\mathbf{s}\right] = \mathcal{F}(\mathbf{s})\) and \(\mathrm{cov}(\mathcal{F}(\mathbf{s})) = \mathbf{C}_{\mathrm{obs}}\), it follows that

\[P(\mathbf{d}_{\mathrm{obs}}|\mathbf{s}) = \dfrac{1}{(2\pi)^{N_{\mathrm{obs}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{obs}})}}} \exp \left( - \dfrac{1}{2}\left(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s})\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s})\right) \right).\]

In addition, one always has some prior information about the plausibility of models. In the case of history matching (HM), this includes a geological prior model (hopefully stochastic) constructed from log, core and seismic data, as well as information about the depositional environment. A general assumption is that the updated parameter \(\mathbf{s}\) is a realization of a random \(N_{\mathrm{s}}-\mathrm{dimensional}\) vector \(\mathbf{S}\), which is (multivariate) Gaussian or multinormal with mean \(\mathbf{s}_{\mathrm{prior}}\) and autocovariance \(\mathbf{C}_{\mathrm{prior}}\). In other words, \(\mathbf{S} \sim \mathcal{N}(\mathbf{s}_{\mathrm{prior}}, \mathbf{C}_{\mathrm{prior}})\) and the probability density for \(\mathbf{S}\) is

\[P(\mathbf{s}) = \dfrac{1}{(2\pi)^{N_{\mathrm{s}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{prior}})}}} \exp \left( - \dfrac{1}{2}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right) \right).\]

Maximizing the product of the prior and the likelihood inserted into Bayes’ theorem yields

\[P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}}) = a \exp\left( -\mathcal{J} \right),\]

with

(1)#\[\begin{split}\begin{aligned} a &= \dfrac{1}{(2\pi)^{N_{\mathrm{obs}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{obs}})}}} \times \dfrac{1}{(2\pi)^{N_{\mathrm{s}}/2}} \dfrac{1}{\sqrt{\mathrm{det(\mathbf{C}_{\mathrm{prior}})}}}, \\ \mathcal{J}(\mathbf{s}) &= \underbrace{\frac{1}{2} \left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)}_{\mathrm{likelihood~(data misfit)}} \underbrace{+ \frac{1}{2} \left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s} - \mathbf{s}_{\mathrm{prior}}\right)}_{\mathrm{prior}}, \end{aligned}\end{split}\]

Maximizing \(P(\mathbf{s}|\mathbf{d}_{\mathrm{obs}})\), i.e., finding the MAP estimate (the mode of the posterior distribution), is the same as minimizing its negative logarithm, i.e., the negative log-likelihood \(\mathcal{J}\).

Gauss-Newton iterations#

The MAP or most likely values are obtained by minimizing \(\mathcal{J}\) from (1) with respect to the vector \(\mathbf{s}\). As explained earlier, based on the initial guess (or most recent “good solution”) \(\mathbf{s}_{\ell}\), a Newton-type iterative approach leads to a new solution \(\mathbf{s}_{\ell+1}\) according to

\[\mathbf{s}_{\ell+1} = \mathbf{s}_{\ell} - \gamma_{\ell} \left(\dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}^{2}}\right)^{-1} \dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}}.\]

Defining \(\mathbf{J}_{\ell}\), the \((N_{\mathrm{obs}} \times N_{\mathrm{s}})\) Jacobian matrix of \(\mathcal{F}\) at \(\mathbf{s}_{\ell}\) as

\[\begin{split}\mathbf{J}_{\ell} = \left.\dfrac{\partial \mathcal{F}}{\partial \mathbf{s}} \right|_{\mathbf{s}=\mathbf{s}_{\ell}} = \mathcal{F}(\mathbf{s}_{\ell})\nabla_{\mathrm{s}}^{\mathrm{T}} = \displaystyle \begin{bmatrix} \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{0}}{\partial s_{N_{\mathrm{s}}-1}} \\ \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{1}}{\partial s_{N_{\mathrm{s}}-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{0}} & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{1}} & \ldots & \dfrac{\partial \mathcal{F}(\mathbf{s}_{\ell})_{N_{\mathrm{obs}}-1}}{\partial s_{N_{\mathrm{s}}-1}} \end{bmatrix},\end{split}\]

with the gradient row vector operator \(\nabla_{\mathrm{s}}^{\mathrm{T}} = \begin{bmatrix} \dfrac{\partial.}{\partial s_{0}}, & \ldots & \dfrac{\partial.}{\partial s_{N_{\mathrm{s}} - 1}} \end{bmatrix}\) and the column vector \(\mathcal{F}(\mathbf{s}_{\ell})\), \(\mathcal{F}\) is linearized around \(\mathbf{s}_{\ell}\) with \(\mathcal{F}(\mathbf{s}_{\ell+1}) \approx \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell}(\mathbf{\mathbf{s}_{\ell+1}} - \mathbf{s}_{\ell})\). Then, the gradient of \(\mathcal{J}\) reads

(2)#\[\dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}} = \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell})- \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right).\]

The Hessian is approximated in the Gauss-Newton way, i.e., neglecting \(\nabla^{2}_{\mathrm{s}} \mathcal{F}(\mathbf{s}_{j})^{\mathrm{T}}\):

(3)#\[\begin{split}\begin{aligned} \dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial (\mathbf{s}_{\ell})^{2}} & = \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \nabla^{2}_{\mathbf{s}} \mathcal{F}(\mathbf{s}_{\ell})^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1}, \\ & \approx \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}. \end{aligned}\end{split}\]

A Gauss-Newton iteration is then defined as

(4)#\[\begin{split}\begin{aligned} \mathbf{s}_{j, \ell+1} & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \left(\dfrac{\partial^{2} \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}^{2}}\right)^{-1} \dfrac{\partial \mathcal{J} (\mathbf{s}_{\ell})}{\partial \mathbf{s}_{\ell}}, \\ & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right) + \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right)\Bigg), \\ & = \mathbf{s}_{\ell} &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \left(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\right), \\ & &- \gamma_{\ell} & \Bigg(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\Bigg)^{-1} \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\right). \end{aligned}\end{split}\]

Since inverting large covariance matrices is not practical, we use the two following matrix inversion lemmas [Golub and Van Loan, 1996, Petersen and Pedersen, 2008] for the second and third right-hand side terms of (4) respectively

(5)#\[\begin{split}\begin{aligned} & \displaystyle \left(\mathbf{A}^{-1}+\mathbf{U}\mathbf{C}^{-1}\mathbf{V}\right)^{-1} \mathbf{U} \mathbf{C}^{-1}=\mathbf{A}\mathbf{U}\left(\mathbf{C}+\mathbf{VAU}\right)^{-1}, \\ & \displaystyle \left(\mathbf{A}^{-1}+\mathbf{U}\mathbf{C}^{-1}\mathbf{V}\right)^{-1}=\mathbf{A}-\mathbf{A}\mathbf{U}\left(\mathbf{C}+\mathbf{VAU}\right)^{-1}\mathbf{VA}, \end{aligned}\end{split}\]

with \(\mathbf{A} = \mathbf{C}_{\mathrm{prior}}\), \(\mathbf{C} = \mathbf{C}_{\mathrm{obs}}\), \(\mathbf{U} = \mathbf{J}_{\ell}^{\mathrm{T}}\) and \(\mathbf{V} = \mathbf{J}_{\ell}\), which gives

(6)#\[\begin{split}\begin{aligned} \mathbf{s}_{\ell+1} & = \mathbf{s}_{\ell} - \gamma_{\ell} \Bigg[&&\mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}}\Big) \\ & && + \mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \mathbf{J}_{\ell} \Big(\mathbf{s}_{j,l} - \mathbf{s}_{\mathrm{prior}}\Big) \Bigg] \\ & = \mathbf{s}_{\ell} - \gamma_{\ell} \Bigg[ && \mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}} - \mathbf{J}_{\ell} \Big(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\Big)\Big) \Bigg]. \end{aligned}\end{split}\]

As previously stated, all implementations described below are derived from this last equation but differ in 1) the way \(\mathbf{J}_{\ell}\) is approximated, 2) the representation of \(\mathbf{C}_{\mathrm{prior}}\), 3) how matrix inversions are performed, and 4) how the step length \(\gamma\) is chosen.

Uncertainty quantification#

Minimizing \(\mathcal{J}\) allows to find the MAP \(\widehat{\mathbf{s}}\) i.e., the mean of the PDF but it does not answer the question of how to sample the full PDF (uncertainty quantification). A classic way relies on the linearization of the measurement operator (through the sensitivity matrix \(\mathbf{J}\)) which, under the assumptions made in Bayesian framework, yields a local Gaussian for the posterior PDF. This local Gaussian is specified by its mean (the MAP) and the posterior covariance matrix \(\mathbf{C}_{\mathrm{post}}\) which can be approximated by the inverse of the Hessian of the negative log-likelihood of the posterior PDF computed at the MAP estimate [Kitanidis, 1995, Lepine et al., 1999, Tarantola, 2005]. Considering iteration \(\ell\) at which \(\mathbf{s}_{\ell} = \widehat{\mathbf{s}}\),

\[\mathbf{C}_{\mathrm{ss}, \ell} \approx \left(\mathbf{H}_{\ell}\right)^{-1} \approx \left(\mathbf{J}_{\ell}^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1} \mathbf{J}_{\ell} + \mathbf{C}_{\mathrm{prior}}^{-1}\right)^{-1}.\]

Using the second lemma in (5), it gives

(7)#\[\mathbf{C}_{\mathrm{ss},\ell} \approx \mathbf{C}_{\mathrm{prior}} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \left(\mathbf{C}_{\mathrm{obs}} + \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \right)^{-1} \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}},\]

and using the forward operator linearization, the posterior covariance matrix on predictions is expressed following Lepine et al. [1999]

(8)#\[\mathbf{C}_{\mathrm{dd},\ell} \approx \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{ss},\ell} \mathbf{J}_{\ell}^{\mathbf{T}}.\]

For large-scale systems, computing and storing the approximation to \(\mathbf{C}_{\mathrm{ss}}\) is computationally infeasible because the prior covariance matrices arise from finely discretized fields and certain covariance kernels are dense [Saibaba and Kitanidis, 2015]. In addition, computing the dense measurement operator requires solving many forward PDE problems, which can be computationally intractable. Note also that when a quasi-Newton optimization is used such as L-BFGS-B, the BFGS approximation may not converge to the true Hessian matrix [Ren-pu and Powell, 1983], hence, this approximation can not be used as a posterior covariance matrix. To ovecome these issues, uncertainty analysis and optimization can be conducted using randomized sampling.

The Geostatistical Approach (GA)#

In the 90’s, Kitanidis [1995], Stanford University, has developed a form of the previous bayesian framework in which \(\gamma = 1\) and \(\mathbf{s}_{\mathrm{prior}}\) is replaced by \(\mathbf{X}\boldsymbol{\beta}\) to represent the potential trend in \(\mathbf{s}\), with \(\mathbf{X}\) a \(N_{\mathrm{s}} \times p\) polynomial matrix, and \(\boldsymbol{\beta}\) a vector of \(p\) drift coefficients. He referred it as geostatistical inverse approach, with the following objective function to minimize

(9)#\[\mathcal{J}(\mathbf{s}, \boldsymbol{\beta}) = \underbrace{\frac{1}{2} \left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{obs}}^{-1}\left(\mathcal{F}(\mathbf{s}) - \mathbf{d}_{\mathrm{obs}}\right)}_{\mathrm{likelihood (data misfit)}} \underbrace{+ \frac{1}{2} \left(\mathbf{s} - \mathbf{X}\boldsymbol{\beta}\right)^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s} - \mathbf{X}\boldsymbol{\beta}\right)}_{\mathrm{prior}}.\]

The particularity is that both \(\mathbf{s}\) and \(\boldsymbol{\beta}\) are updated at the same time. The Gauss-Newton iteration in (6) becomes

(10)#\[\begin{split}\begin{aligned} \mathbf{s}_{\ell+1} & = \mathbf{s}_{\mathrm{prior}} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathcal{F}(\mathbf{s}_{\ell}) - \mathbf{d}_{\mathrm{obs}} - \mathbf{J}_{\ell} \Big(\mathbf{s}_{\ell} - \mathbf{s}_{\mathrm{prior}}\Big)\Big), \\ & = \mathbf{X}\boldsymbol{\beta}_{\ell+1} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell} \mathbf{s}_{\ell} - \mathbf{J}_{\ell} \mathbf{X}\boldsymbol{\beta}_{\ell+1} \Big), \\ & = \mathbf{X}\boldsymbol{\beta}_{\ell+1} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} \boldsymbol{\xi}_{\ell+1}, \end{aligned}\end{split}\]

with

(11)#\[\begin{split}\begin{split} \boldsymbol{\xi}_{\ell+1} = \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg)^{-1} \Big(\mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell} \mathbf{s}_{\ell} - \mathbf{J}_{\ell} \mathbf{X}\boldsymbol{\beta}_{\ell+1} \Big), \\ \implies \Bigg(\mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\Bigg) \boldsymbol{\xi}_{\ell+1} + \mathbf{J}_{\ell} \mathbf{X}\boldsymbol{\beta}_{\ell+1} = \mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell} \mathbf{s}_{\ell}. \end{split}\end{split}\]

A local minimum of \(\mathcal{J}\) is found when the derivative cancels which yields

\[\dfrac{\partial \mathcal{J}}{\partial \boldsymbol{\beta}_{\ell+1}} = \mathrm{X}^{\mathrm{T}} \mathbf{C}_{\mathrm{prior}}^{-1} \left(\mathbf{s}_{\ell+1} - \mathbf{X}\boldsymbol{\beta}_{\ell+1}\right) = \mathbf{0}.\]

Remarking that (10) can be written \(\mathbf{C}_{\mathrm{prior}}^{-1}\left(\mathbf{s}_{\ell+1} - \mathbf{X}\boldsymbol{\beta}_{\ell+1}\right) = \mathbf{J}_{\ell}^{\mathrm{T}} \boldsymbol{\xi}_{\ell+1}\), injecting the previous equality yields a second equation

(12)#\[\mathbf{X}^{\mathrm{T}} \mathbf{J}_{\ell}^{\mathrm{T}} \boldsymbol{\xi}_{\ell+1} = \mathbf{0}.\]

Gathering (11) and (12), one obtains a linear system of \(N_{\mathrm{s}} + p\) equations:

(13)#\[\begin{split}\begin{bmatrix} \mathbf{J}_{\ell} \mathbf{C}_{\mathrm{prior}} \mathbf{J}_{\ell}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{J}_{\ell}\mathbf{X}\\ (\mathbf{J}_{\ell}\mathbf{X})^{\mathrm{T}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \boldsymbol{\xi}_{\ell+1}\\ \boldsymbol{\beta}_{\ell+1} \end{bmatrix} = \begin{bmatrix} \mathbf{d}_{\mathrm{obs}} - \mathcal{F}(\mathbf{s}_{\ell}) + \mathbf{J}_{\ell}\mathbf{s}_{\ell}\\ \mathbf{0} \end{bmatrix}\end{split}\]

This can be easily solved for small-to-moderate-scale inverse problems, i.e., up to \(N_{\mathrm{s}} \approx 10^{4}\) [Lee and Kitanidis, 2014]. This is why this approach has been widely used in subsurface inverse problems to estimate unknown parameter fields and corresponding uncertainty from noisy observations citep[among others]{snodgrassGeostatisticalApproachContaminant1997,michalakApplicationGeostatisticalInverse2004,zaniniGeostatisticalInversingLargecontrast2009}. However, computational challenges arise for larger systems:

The construction of the Jacobian or sensitivity matrix, \(\mathbf{J}\), requires a large number of forward model runs increasing linearly with the number of unknowns, \(N_{\mathrm{s}}\), and the number of observations, \(N_{\mathrm{obs}}\).
\(\mathbf{J}\) becomes very large and difficult to handle in terms of memory, and the matrix products of \(\mathbf{J}\), i.e., \(\mathbf{Js}\), \(\mathbf{JX}\), \(\mathbf{JC}_{\mathrm{prior}}\), and \(\mathbf{JC}_{\mathrm{prior}}\mathbf{J}^{\mathrm{T}}\) become very expensive to compute.

This is the reason why Kitanidis and Lee [2014] have developed a parametrized version, namely the Principal Component Geostatistical Approach (PCGA).

The Principal Component Geostatistical Approach (PCGA)#

The Principal Component Geostatistical Approach is presented as a computationally efficient algorithm for geostatistical inversion based on compression of covariance matrices and Jacobian-free evaluation of sensitivity [Kitanidis and Lee, 2014]. It falls in the parametrization category of expansion techniques: a Karhunen-Loève expansion in which the total number of parameters is reduced by working on spaces of lower dimension. First, the Jacobian products, \(\mathbf{Js}\) and \(\mathbf{JX}\) are approximated by finite difference using a Taylor series expansion

(14)#\[\begin{split}\begin{split} \mathbf{Js} & \approx \dfrac{1}{\delta}\left[\mathcal{F}(\mathbf{s}+\delta s) - \mathcal{F}(\mathbf{s})\right] \to 2 \textrm{ forward runs because } \delta s\; \textrm{is a scalar},\\ \mathbf{JX}_{i} & \approx \dfrac{1}{\delta}\left[\mathcal{F}(\mathbf{s}+\delta \mathbf{X}_{i}) - \mathcal{F}(\mathbf{s})\right] \to p \textrm{ forward runs}. \end{split}\end{split}\]

Second, the core of the method relies on the compression of \(\mathbf{C}_{\mathrm{prior}}\) with a Karhunen–Loève transform (principal component) parametrization

(15)#\[\mathbf{C}_{\mathrm{prior}} = \mathbf{Z} \boldsymbol{\Lambda} \mathbf{Z}^{-1} \approx \mathbf{C}_{K} = \mathbf{Z}_{K} \boldsymbol{\Lambda}_{K} \mathbf{Z}_{K}^{-1},\]

with \(\mathbf{Z}\) a (\(N_{\mathrm{s}}\) times \(N_{\mathrm{s}}\)) matrix whose \(i^\mathrm{th}\) column is the \(i^\mathrm{th}\) eigenvector of \(\mathbf{C}_{\mathrm{prior}}\), and \(\boldsymbol{\Lambda}\) a diagonal matrix whose entries \(\boldsymbol{\Lambda}_{ii} = \lambda_{i}\) are the eigen values of \(\mathbf{C}_{\mathrm{prior}}\). Since \(\mathbf{C}_{\mathrm{prior}}\) is a real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen real and orthonormal. Thus \(\mathbf{C}_{\mathrm{prior}}\) can be decomposed as

(16)#\[\mathbf{C}_{\mathrm{prior}} \approx \mathbf{C}_{K} = \mathbf{Z}_{K} \boldsymbol{\Lambda}_{K} \mathbf{Z}_{K}^{\mathrm{T}} = \sum^{K}_{i=1} \mathbf{\xi}_{i} \mathbf{\xi}_{i}^{\mathrm{T}}, \quad \mathrm{with}\; \mathbf{\xi}_{i} = \sqrt{\lambda_{i}} \mathbf{Z}_{i}.\]

Practically, this means that instead of storing \(\mathbf{C}_{\mathrm{prior}}\), a (\(N_{\mathrm{s}} \times N_{\mathrm{s}}\)) matrix, the information is held by \(\mathbf{Z}_{K}\), a much smaller (\(K \times N_{\mathrm{s}}\)) matrix (\(\boldsymbol{\xi}_{i}\) is the \(i^{\mathrm{th}}\) column vector of \(\mathbf{Z}\)) that allows to compute \(\mathbf{C}_{K}\), a rank-K approximation of \(\mathbf{C}_{\mathrm{prior}}\). This compression coupled with finite difference approximation is used to approximate all heavy matrix products

(17)#\[\begin{split}\begin{split} \mathbf{JC}_{\mathrm{prior}} & \approx \mathbf{J} \mathbf{C}_{K} = \mathbf{JZ}_{K}\boldsymbol{\Lambda}_{K}\mathbf{Z}_{K}^{\mathrm{T}} = \mathbf{J}\sum^{K}_{i=1} \boldsymbol{\xi}_{i} \boldsymbol{\xi}_{i}^{\mathrm{T}} = \sum^{K}_{i=1} \mathbf{J}(\boldsymbol{\xi}_{i}) \boldsymbol{\xi}_{i}^{\mathrm{T}} \approx\sum^{K}_{i=1} \boldsymbol{\eta}_{i} \boldsymbol{\xi}_{i}^{\mathrm{T}}, \\ \mathbf{JC}_{\mathrm{prior}}\mathbf{J}^T & \approx \mathbf{JC}_{K}\mathbf{J}^T = \mathbf{JZ}_{K}\boldsymbol{\Lambda}_{K}\mathbf{Z}_{K}^{\mathrm{T}}\mathbf{J}^{\mathrm{T}} = \sum^{K}_{i=1} (\mathbf{J}\boldsymbol{\xi}_{i}) (\mathbf{J}\boldsymbol{\xi}_{i})^{\mathrm{T}} \approx \sum^{K}_{i=1} \boldsymbol{\eta}_{i} \boldsymbol{\eta}_{i}^{\mathrm{T}}, \\ & \textrm{with } \boldsymbol{\eta}_{i}= \mathbf{J}\boldsymbol{\xi}_{i} \approx \dfrac{1}{\delta}\left[\mathcal{F}(\mathbf{s}+\delta \boldsymbol{\xi}_{i}) - \mathcal{F}(\mathbf{s})\right] \to K \textrm{ forward runs}. \end{split}\end{split}\]

As a result, this method requires \(2+p+K\) runs of the forward problem in each iteration independently of \(N_{\mathrm{s}}\) and \(N_{\mathrm{obs}}\), where \(K\) is the number of principal components and can be much less than \(\mathbf{s}\) and \(\mathbf{d}\) for large-scale inverse problems. In addition, the computations involving large matrices (\(\mathbf{C}_{\mathrm{prior}}\), \(\mathbf{J}\)) can take advantage of fast linear algebra that allows fully parallelizable, fast matrix-vector multiplications:

Fast Fourier Transform (FFT) approach for regular grids [Nowak et al., 2003, Saibaba et al., 2013]
Hierarchical Matrices Approach [Saibaba et al., 2013] and fast Multipole Method (FMM) [Fong and Darve, 2009, Wang et al., 2021] for unstructured grids.
Stochastic Partial Differential Equation (SPDE) [Lindgren et al., 2022], for all grid cell types, with support for anisotropies and non-stationarity.

On the choice of k#

As detailed by Lee and Kitanidis [2014] and reference therein, the number of eigen pairs one should keep actually depends on the dimension and the smoothness of the covariance kernel. They propose to compute the relative error of the low-rank covariance matrix approximation as a ratio between the \(k^\mathrm{th}\) eigen value and the first one

(18)#\[\dfrac{\left\lvert \mathbf{C}_{\mathrm{prior}} - \mathbf{C}_{K}\right\rvert}{\left\lvert \mathbf{C}_{\mathrm{prior}}\right\rvert} = \dfrac{\lambda_{K+1}}{\lambda_{1}}.\]

However, this approach is very sensitive to how fast the eigenvalues decay, and the choice for a small error does not guarantee how much of the variance is explained. Instead, we preferred to choose \(K\) to reach a given level of cumulative explained variance, typically 80%. The cumulative explained variance can be easily computed as the sum of the \(K\) first eigen values divided by the sum of all eigen values. The sum of all eigen values is simply the trace of the matrix, i.e., the sum of its diagonal entries, i.e., the marginal variances of the estimated parameters emph{a priori}

(19)#\[\Sigma_{\sigma}(K) = \dfrac{\displaystyle\sum_{i=1}^{K} \lambda_{i}}{\displaystyle\sum_{i=1}^{N_{\mathrm{s}}} \mu \lambda_{i}} = \dfrac{\displaystyle\sum_{i=1}^{K} \lambda_{i}}{\mathrm{tr}(\mathbf{C}_{\mathrm{prior}})} = \dfrac{\displaystyle\sum_{i=1}^{K} \lambda_{i}}{\displaystyle\sum_{i=1}^{N_{\mathrm{s}}} \mathbf{C}_{\mathrm{prior},ii}} = \dfrac{\displaystyle\sum_{i=1}^{K} \lambda_{i}}{\displaystyle\sum_{i=1}^{N_{\mathrm{s}}} \sigma_{i}^{2}}.\]

As stated by Lee and Kitanidis [2014], in practice, one may start the eigen decomposition with large \(K\), e.g., \(K \sim 100\), then choose a value for \(K\) that satisfies a reasonable approximation accuracy.

Posterior uncertainty#

In this subsection, \(\mathbf{J}\) denotes \(\mathbf{J}|_{\widehat{\mathbf{s}}}\). To avoid the construction and inversion of \(\left(\mathbf{C}_{\mathrm{obs}} + \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} \right)^{-1}\) from (7), Kitanidis [1995] remarks that the posterior covariance matrix approximation can be written as

\[\mathbf{C}_{\mathrm{post}} = \mathbf{C}_{\mathrm{prior}} - \mathbf{XM} - \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} \boldsymbol{\Lambda}^{\mathrm{T}}.\]

See 3.3. in Saibaba and Kitanidis [2015] for the derivation. Taking advantage of the fact that \(\boldsymbol{\Lambda}^{\mathrm{T}}\) and \(\mathbf{M}\) which are of size (\(N_{\mathrm{s}} \times N_{\mathrm{obs}}\)) and (\(N_{p} \times N_{\mathrm{obs}}\)) respectively, are also found solving a linear system of the form \(\mathbf{Ax}=\mathbf{b}\) with the same \(\mathbf{A}\) matrix than in (13). Hence the same efficient system resolution may be used with no extra forward run required in

(20)#\[\begin{split}\begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{J} \mathbf{X}\\ (\mathbf{J} \mathbf{X})^{\mathrm{T}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Lambda}^{\mathrm{T}}\\ \mathbf{M} \end{bmatrix} = \begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \\ \mathbf{X}^{\mathbf{T}} \end{bmatrix}.\end{split}\]

This is equivalent to computing

(21)#\[\begin{split}\mathbf{C}_{\mathrm{post}} = \mathbf{C}_{\mathrm{prior}} - \begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \\ \mathbf{X}^{\mathbf{T}} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{J} \mathbf{X}\\ (\mathbf{J} \mathbf{X})^{\mathrm{T}} & \mathbf{0} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \\ \mathbf{X}^{\mathbf{T}} \end{bmatrix}.\end{split}\]

Note that the diagonal entries of the posterior covariance matrix \(\mathbf{C}_{\mathrm{post}}\) can be computed without building the full matrix explicitly by solving \(N_{\mathrm{s}}\) times the system

(22)#\[\begin{split}\mathbf{C}_{\mathrm{post},ii} = \mathbf{C}_{\mathrm{prior}, ii} - \begin{bmatrix} \left(\mathbf{J}\mathbf{C}_{\mathrm{prior}}\right)_{i}\\ \mathbf{X}^{\mathrm{T}}_{i} \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{JX}\\ (\mathbf{JX})^{\mathrm{T}} & \mathbf{0} \end{bmatrix}^{-1} \begin{bmatrix} \left(\mathbf{J}\mathbf{C}_{\mathrm{prior}}\right)_{i}\\ \mathbf{X}^{\mathrm{T}}_{i} \end{bmatrix},\end{split}\]

where \(\mathbf{C}_{\mathrm{post}, ii}\) is the \(i^\mathrm{th}\) diagonal element of \(\mathbf{C}_{\mathrm{post}}\), \(\mathbf{C}_{\mathrm{prior},ii}\) is the \(i\mathrm{th}\) diagonal entry or the prior variance of \(i^\mathrm{th}\) parameter, \(\left(\mathbf{J}\mathbf{C}_{\mathrm{prior}}\right)_{i}\) is the \(i^\mathrm{th}\) column of \(\mathbf{J}\mathbf{C}_{\mathrm{prior}}\), and \(\mathbf{X}_{i}^{\mathrm{T}}\) is the \(i\mathrm{th}\) column of \(\mathbf{X}^{\mathrm{T}}\).

An alternative also introduced by Kitanidis [1995] is to rely on the Randomized Maximum Likelihood (RML) idea. In the case of GA and PCGA however, the sampling is performed emph{a posteriori} which means that no extra forward operator \(\mathcal{F}\) run is needed to sample from the PDF as we used the linearized operator \(\mathbf{J}\) instead. The conditional realizations, i.e., samples from the posterior distribution are computed from the perturbed objective function after convergence to \(\widehat{\mathbf{s}}\) with \(\mathbf{s}_{\mathrm{prior}} = \mathbf{X}\boldsymbol{\beta}\) and are expressed as

\[\widehat{\mathbf{s}}_{j} = \boldsymbol{\zeta}_{j} + \mathbf{X}\mathbf{\beta}_{j} + \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} \boldsymbol{\xi}_{j},\]

where \(\boldsymbol{\zeta}_{j}\) are unconditional realization of the prior distribution, i.e., \(\boldsymbol{\zeta} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{C}_{\mathrm{prior}}\right)\), and \(\{\boldsymbol{\xi}_{j},\boldsymbol{\beta}_{j}\}\) are found solving

(23)#\[\begin{split}\begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{J}\mathbf{X}\\ (\mathbf{J}\mathbf{X})^{\mathrm{T}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \boldsymbol{\xi}_{j}\\ \boldsymbol{\beta}_{j} \end{bmatrix} = \begin{bmatrix} \mathbf{d}_{\mathrm{uc},j} - \mathcal{F}(\widehat{\mathbf{s}}) + \mathbf{J}\left(\widehat{\mathbf{s}}- \boldsymbol{\zeta}_{j}\right)\\ \mathbf{0} \end{bmatrix}.\end{split}\]

After an ensemble of \(N_{e}\) conditional realizations \(\widehat{\mathbf{s}}_{j}\) have been computed, associated realizations of the dependent variables \(\mathcal{F}(\widehat{\mathbf{s}}_{j}) = \mathbf{d}_{j}\) are generated by solving the forward problem.

Other fast computation methods have been developed but are not reviewed here [Saibaba and Kitanidis, 2015, Saibaba et al., 2013].

System resolution#

Several methods can be used to solve the systems in (13), (20), (21), (22), (23). In all cases, the matrix to be inverted is identical. The first approach is to use a direct solver. In this case, a Cholesky factorization can be used. Indeed, the matrix

\[\begin{split}\begin{bmatrix} \mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}} & \mathbf{JX}\\ (\mathbf{JX})^{\mathrm{T}} & \mathbf{0} \end{bmatrix}\end{split}\]

can be factorized using the fact that \(\mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\) is symmetric positive. Recall the \(\mathbf{LEL}^{\mathrm{T}}\) factorization for a block matrix

\[\begin{split}\begin{aligned} \mathbf{A} = \mathbf{L}_\mathrm{A}\mathbf{E}\mathbf{L}_{\mathrm{A}}^{\mathrm{T}} &= \begin{pmatrix} \mathbf{L}_{11} & 0 \\ \mathbf{L}_{21} & \mathbf{L}_{22} \end{pmatrix} \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} \end{pmatrix} \begin{pmatrix} \mathbf{L}_{11}^{\mathrm{T}} & \mathbf{L}_{21}^{\mathrm{T}} \\ \mathbf{0} & \mathbf{L}_{22}^{\mathrm{T}} \end{pmatrix}, \\ &= \begin{pmatrix} \mathbf{L}_{11} \mathbf{L}_{11}^{\mathrm{T}} & \mathbf{L}_{11} \mathbf{L}_{21}^{\mathrm{T}} \\ \mathbf{L}_{21} \mathbf{L}_{11}^{\mathrm{T}} & \mathbf{L}_{21} \mathbf{L}_{21}^{\mathrm{T}} - \mathbf{L}_{22} \mathbf{L}_{22}^{\mathrm{T}} \end{pmatrix}^{\mathrm{T}}, \end{aligned}\end{split}\]

by substitution in three successive steps, we can determine that:

\(\mathbf{L}_{11}\) is the lower triangle of the Cholesky factorization of \(\mathbf{J} \mathbf{C}_{\mathrm{prior}} \mathbf{J}^{\mathrm{T}} + \mathbf{C}_{\mathrm{obs}}\), of size (\(N_{\mathrm{obs}} \times N_{\mathrm{obs}}\));
\(\mathbf{L}_{21} = \left(\mathbf{L}_{11}^{-1}\mathbf{JX}\right)^{\mathrm{T}}\), of size (\(p \times N_{\mathrm{obs}}\));
\(\mathbf{L}_{22}\) is the lower triangle of the Cholesky factorization of \(\mathbf{L}_{21}\mathbf{L}_{21}^{\mathrm{T}}\), of size (\(p \times p\)).

The resulting inversion is much more efficient and accurate. This is an improvement that we implemented in PyPCGA. For large-scale system with \(N_{\mathrm{obs}} > 100\), the system can also be resolved in a matrix-free fashion using iterative Krylov space solvers as described in Saibaba et al. [2013], Saibaba and Kitanidis [2012]. The later never requires the explicit entries of the matrix but only rely on matrix-vector products involving the matrix or its transpose. In addition, a fast and exact preconditioner for PCGA have been developed [Lee et al., 2016] to deal with very large \(N_{\mathrm{obs}}\).