In general, models could be classified as   :
1) Process specific models (Causality Process) are based on conservation of laws of nature e.g.: Shallow water modeling, Navier Stokes Equation.
2) Data specific models (correlation based) are based on developed experimental models e.g.: Time Series models, Machine Learning, Neural Networks. etc.
Atmospheric and oceans models are process specific models which Navier stokes equations are the core of the solver to predict, simulate and estimate system states. Process specific models have different classifications from different perspectives, time, space, and structure of the model, in addition to another classification which is deterministic and stochastic.
It is assumed that the state of dynamical system evolves according to first order nonlinear equation
where is the current state of the system; M(.) is the mapping function from the Current state X at time K to the next state X at K + 1
・ If the mapping function M(.) doesn’t depend on the time index K then it is called a time invariant or autonomous system
・ If the M(.) varies with time index k that is , then it is called time-varying system (dynamic system)
・ If for M non-singular matrix, then it is called time invariant linear system.
・ If matrix M varies with time, that is , then it is called time varying linear or non-autonomous system.
・ In the special cases when M(.) is an identity map, that is , then it is called static system
The mentioned above is the deterministic case; the Randomness in the model can enter in three ways:
(A) Random initial conditions (B) Random forcing (C) Random Coefficients
So, a random or a stochastic model is given by:
where the random sequences , denotes to the external forcing; typically captures uncertainties in the model including model error.
So, the classes of models that are used in the data assimilation could be classified as:
(1) Deterministic-Static (3) Stochastic-Static
(2) Deterministic-Dynamic (4) Stochastic-Dynamic
The paper is organized as follow: Section 2 states the mathematical foundation for data assimilations which describes linear and nonlinear least square in addition to weighted least square and Recursive Least square. Section 3 introduces the deterministic static model linear and nonlinear cases. Section 4 shows stochastic static model: Both linear and nonlinear cases. Section 5 describes deterministic dynamic linear case and the recursive case. Section 6 explains stochastic dynamic linear, nonlinear, reduced and hybrid filters.
2. Mathematical Background
Mainly all the data assimilation techniques are based on least square estimation, so classification of the estimation problems based on different criteria. The Estimation problems can be underdetermined problem when number of observations (m) are less than number of state (n) (m < n). And can be over determined when number of observations (m) is larger than number of states (n) (m > n)  . Also the Estimation problem can be classified according to the mapping function from the state space to observation space which can be linear or nonlinear. and nonlinearity should be handled
Other type of classification is offline or online problems. Where offline problems if the observations are known priori. In other words we have historical set of data and we treat with them. While Online/Sequential problems are computing a new estimate of the unknown state X as a function of the latest estimate and the current observation. So, online formulation is most useful in real time applications. The last type of classifications is Strong and Weak Constraints, The Strong Constraint is occurred when the estimation is performed under the perfect model assumption. While in case of allowing for errors in the model dynamics, it is considered Weak Constraint 
Since, Data Assimilation based on Least Square Approach     . First we introduce linear version of least square then will move to nonlinear version. After that weighted version for both linear and nonlinear will be introduced.
2.1. Linear Least Square Method
Given the observation vector Z and the mapping function/Interpolation matrix H (Full Rank) find the unknown state vector X.
Then the error Vector which represents the difference between the Observations Z and the required estimated state X can be represented as follow
The Previous term is called also error term or Innovation term. So we need to find the best estimate that minimizes the error. So the problem is converted from Estimation Problem to Optimization Problem. One of the basic classifications for problem is over determined Problem (m > n) or under determined problem (m < n) and we consider first over determined case (m > n). To measure The vector e using The Euclidean norm and the fact of Minimization of norm is equal to minimization of square norm. The cost function will be as follow
And minimization of X is Obtained under the following condition . This lead to
And this equation was known as normal equation method. In case of over determined problem m > n. While in case under determined case (m < n). The estimation for X will be as follow.
And this undetermined problem where m number of observation is less than n, number of unknowns is known in geophysical (physical process physical properties of the earth and its surrounding space Environment) because of the cost of collection of the observation is high. And in case uniquely determined case (m = n) which mean that the error e is zero. The estimated state vector X will be
As we can see from the above three formulation that minimization for the cost function ends to Solving system of linear equation, and solving the linear system (A = Xb) of equations can be done using:
o Direct methods (Cholesky decomposition, Q-R decomposition, Singular Value decomposition, …)  .
o Iterative methods (Jacobi Method, Gauss-Seidel Method, Successive Over-Relaxation method, …)  .
2.2. Non Linear Least Square Method
The problem here will be: Given set of observation Z and knowing the function form h which is nonlinear function. Find the state vector X as shown
And the Innovation term will be
And the cost function will be
And so that, the idea here is extension for the linear case by replacing the nonlinear term h(x) with its Linear approximation of Taylor series expansion around an Operating Point let it and in this case it is called first order approximation for nonlinear least square as follow
where is the Jacobin matrix of h which is an matrix given by:
And by substitution of first order approximation of Taylor series expansion in the cost function and giving it name and the index one refer for first order
And simplifying the notation by define
And by comparing the previous equation by the linear version.
You will find that every Z was replaced by g(x) and every H is replaced by also every X is replaced by
And if the gradient is ; we will obtain
And this is iterative approach and given an initial value for and solving the above equation using direct methods or iterative methods then iterate again until < prescribed threshold
And by similarity the second order algorithm for non Linear Least Square will be same steps except that Taylor series expansion will be full Quadratic approximation
where the is the hussian of and by substituting this second order expansion in the nonlinear cost function and giving it name since the index 2 refer to second order approximation and Putting the gradient equal zero you will get the following
And this is iterative approach and given an initial value for and solving the above equation using direct methods or iterative methods then iterate again until < prescribed threshold. All the above methods were for deterministic least square where
2.3. The Weighted Least Square Method
If there additive random noise was V presented. Which mean the observation is noisy with as follow
・ Mean E(V) = zero which mean that the instrument is well calibrated Unbiased and If it is ≠ zero which mean that there is bias (for example under or over reading).
・ Covariance , The covariance matrix for the instrument. which it is property for the instrumentation.
So, for the linear form of Least Square, the cost function will be for linear form will be
And the non Linear form
Also by the same methodology that used above, the best estimate form for Stochastic linear least square after equality for the Gradient to zero will be
And for the stochastic nonlinear least square first order approximation will be
And for the stochastic nonlinear least square Second order approximation will be
where for simiplyifing the notation.
2.4. The Recursive Least Square Estimation Approach (Offline/Online Approach)
All the above analysis was assumed that the number m of observations is fixed. which mean by another term it is offline version of least square. So if we don’t know number observation m and they are arrive sequentially in time. There’re two ways to add this new observations:
・ The first approach is to solve the system of linear equation repeatedly after arrival of every new observation. But this approach is very expensive from Computational point of view
・ The second approach is to formulate the following problem which is based on Knowing the Optimal estimate based on the m observations. we need to compute based on observation. In more clear words we need to calculate to reach to formulation that can compute the new estimate function of the old estimate plus sequantional term. And this approach is called situational or recursive framework
It was known as mentioned above that the optimal linear least square estimate for is . Let be the new observation then the can be expanded in the form of matrix-vector relation as:
So, the Innovation will be:
So, the Cost function will be:
So, by taking the gradient and equal it to zero . We can get the following
The cost of computation for the second term of this equation is less than solving the all system again.
The basic building block for understanding the Data assimilation based on Least Square approach was introduced.
3. Deterministic-Static Models
In Atmospheric science, when we want to assimilate observation data to the model at time step, Two source of information are available, one of them after mapping the observation Z to X and the other source is the Prior information.
The Linear/Non Linear Case
The formulation for the problem, we need the best estimate for X given the two source of information
・ The first source is the given the Observation Z and mapping function H, where the innovation term was
・ The Second source is given the Prior or Background information where the innovation term is
So the cost function for this case will
For linear Case
For Nonlinear case
4. Stochastic-Static Models
The formulation for the problem will be the same as in Deterministic/Static problem, we need the best estimate for X given the two source of information
・ The first source is the given the Observation Z and mapping function H, where the innovation term was and the observation has noise with its covariance R
・ The Second source is given the Prior or Background information where the innovation term is , and the Background has its Covariance B
So the cost function for this case will
For linear Case:
For Nonlinear case:
The Dynamical models can be classified as:
where is the state of the dynamical system, so if the initial condition is known, so computing the is a forward problem
It is assumed that is not known. And Estimating based on noisy indirect information is the inverse problem that it is required to solve.
The observations also can be classified as:
And assume the is White noise, has zero mean and Known covariance matrix R, which depend on the nature and the type of instruments used
So, formulation for statement of problem is: given set of noisy observations and the model equations, it is required to estimate the initial
Condition that give best fit between the background states and noisy observations
To conclude there are four different types of problems:
1) Linear Model-Linear observation
2) Linear model-non Linear observation
3) Nonlinear model-linear observation
4) Non linear model-Non linear observation
We will consider only one case only the which is the simplest formulation with both model and observations are linear and the other cases could be checked in   .
The Definition of the cost function which is weighted sum of the squared errors is given as:
For linear case
For nonlinear case
Depend on the whether the observations are linear or not linear. And the goal is to minimize the J(X) w.r.t
In case there background will be included for linear case
And for nonlinear case
There are two approaches for minimization of this cost functions
5.1. Deterministic-Dynamic Linear Case
5.1.1. Linear Case-Method of Elimination
This method is mainly based on substitute the following equation in the cost function J(X) then, get the following equation
Then we get the gradient for the
and it is quadratic in (46)
So, the gradient is which leads to
But, this approach is not practical, since it involves matrix-matrix products in the computation of A and b, so there is need to another way
5.1.2. Linear Case-Lagrangian Multipliers Formulation
Define Lagrangian L:
And the Necessary conditions for the minimum
And substitute it in the last two equations
So, as shown that the formulation for calculate is backword relation. we can iterate backward starting from to Compute . And this technique is known as backword adjoint dynamics. Then after getting substitute in and get the gradient then use any minimization algorithm to get , The 4-D Var Algorithm (First order adjoint method) can be summarized as follow:
1) Start with an arbitrary , and compute the model solution using
2) Given the observation where
3) Set → and Solve to find
4) Compute the gradient
5) Use this gradient in minimization algorithm to find the optimal by repeating the steps 1 through 4 until convergence.
5.2. Recursive Least Squares Formulation of 4D Var (Online Approach)
In the previous part of 4Dvar section the solution for off-line 4Dvar problem of assimilating given set of observations in deterministic-dynamic model using classical least square method.
Now there is need to develop an online or recursive method for computing the estimate of the state as new observations arrive. which mean we need to compute the in terms of and the new observation .
Consider linear deterministic dynamical system without model noise
where the initial condition is random variable with and . and the observations for are given as
where is full rank and is the observation vector noise with the following known properties
So, the objective function that
Since, our goal to find an optimal that minimize the , it is needed to express and
In tem of the corresponding N values and . So,
Hence, the trajectory of the model starting from
Substitute for and into
And differentiate w.r.t twice to get the gradient and Hessian. Then setting the gradient to zero and simplifying the notation
By induction the minimization for of is given by
The goal of the recursive framework is to express as function of and . this calls when expressing , , , in terms of , , ,
So using equations from (61) to (64) and and the following equation
Then N + 1 formula in terms of N can be get
By combining Equations (70) and (76) into Equation (68) and defining after combination
The right hand side of this equation is sum of two terms the first one is
So adding and subtracting this term is equal to
is called kalman gain matrix and
combine it with Equation (77). The desired recursive expression will be gained
6. Stochastic-Dynamic Model
This type of data assimilation problems is same as deterministic/dynamic problems, except that, this type introduces an additional term in the forecast equation which is noise vector that associated with the model (i.e. model error).
For Stochastic-dynamic model we can divide the filters to Linear and Nonlinear filters Linear filter has evolution function in the model and the mapping function is linear while in Nonlinear filters those two functions are nonlinear
6.1. Linear Filters
Kalman Linear Filter
Kalman filter approach was first introduced at reference  
This section will show how model and observation with error will be presented then will formulate the algorithm:
A1-Dynamic model: it will be assumed linear, non autonomous, dynamical system that evolves according to
where is nonsingular system matrix that varies with time K and denote to model error. It is assumed that and satisfy the following conditions (A) is random variable with known mean vector and known covariance matrix (B) The model erro is unbiased, mean for all k, and temporally uncorrelated (white noise) when if and otherwise (C) The model error and the initial state is uncorrelated for all k
B1-Observations: The observation is the observation at time k and related to via
where represent the time varying measurement system and represent the measurement noise with the following properties (A) has mean zero (B) is temporary uncorrelated: if while otherwise . (C) is uncorrelated with the initial state and the model error which mean for all and for all K and j
C1-statement of the filtering problem: Given that evolves according to equation 81 and set of observations. Our goal to find an estimate that minimize the mean square error
And the following is the summary of the Kalman filter procedure (Covariance Form)
is random with mean and
Model Forecast ,
The computation of the covariance matrices and is the most time-consuming part since in many of the applications n > m. so that reduced order filters has been introduced  .
6.2. Non Linear Filters
6.2.1. First Order Filter/Extended Kalman Filter (EKF)
The extended kalman filter is an extension for kalman filter idea in case the evolution function in the model and the mapping function is non linear
For nonlinear model
For nonlinear observation
The main idea for First order filter/extended kalman filter is to expand the around
and around in first order taylor series expansion. when and are linear it reduces to kalman filter. The following is the summary steps for Extended Kalman filter
Data Assimilation Step
6.2.2. Second Order Filter
The second order filter is the same idea of the first order filter except that the expansion of around and around in Second order Taylor series expansion. And the following is summary of the second order nonlinear filter
Data Assimilation Step
6.3. Reduced Rank Filters
Ensemble Kalman Filter
The ensemble Kalman filter  originated from the merger of Kalman filter theory and Monte Carlo estimation methods. It was introduced the basic principles of linear and nonlinear filtering but these are not used in day by day operations at the national centers for weather predictions. Because of the cost of the updating the covariance matrix was very high.
So, there are mainly two ways to avoid the high cost of computing the covariance matrix
1) The first method was the Parallel computation which mainly dependent on
a) The Algorithm
b) The number of Processors
c) The topology of the interconnection of the network
d) How the tasks of the algorithm are mapped on the processor
2) The second method which became more popular which to compute low/ reduced rank approximation to the full rank covariance matrix. and most of low rank filters differs only in the way in which the approximation are derived. Excellent review on the Ensemble Kalman filter was introduced
Formulation of the problem
It is assumed that the model is nonlinear and observations are linear functions of the state
And it is assumed that
1) The initial conditions
2) The dynamic system noise wk is white Gaussian noise with
3) The observation noise is White noise with
4) , , are mutually uncorrelated
・ Create the initial ensemble
1) Create the ensemble of forecasts at time using the following
The members of the ensemble forecast at time are generated where
2) Compute and using
Data assimilation step
1) Create the ensemble of estimates at time using
2) Compute and using
The sample mean of the estimate at time is then given by
where for large N
All summaries, derivation and details could be checked in reference   . and full review on ensemble kalman filter for atmospheric data assimilation is inteoduced by P. L. Houtekamer and Fuqing Zhang, 2016  .
6.4. Hybrid Filters
3DVar uses static climate Background error while 4DVar uses implicit flow dependent information but still start with static background error. And since. The B-matrix affects the performance of the assimilation heavily  it is important to use a B-matrix that is a realistic representation of the actual forecast error covariance  . So many proposed hybrid filters were introduced. they are to use flow dependent background error in vartional data assimilation system by combining the 3Dvar climate background error covariance and error of the day from ensemble.
In Equation (37) replace the B by weighted sum of 3Dvar B and the ensemble covariance as follow  :
The Ensemble covariance is included in the 3DVAR cost function through
Augmentation of control variables  and the following formula is mathematically equivalent to  .
This is well known hybrid 3DVar-EnKF method. While 4DVar-EnKF method has the same idea if we substitute by Equation (43) by the same methodology in 3DVar-EnKF part. More advanced hybrid filters are highlighted in ref   .
This paper shows the mathematical perspective for the basic foundation of data assimilation modules starting from least square to advanced filters that used in data assimilation as journey. This work is the first of its type to summarize the mathematical perspective for data assimilation in extensive way and highlights both classical and advanced data assimilation methods. This paper could be used as reference to understand the mathematics behind data assimilation. It started by least square method and their different versions then explains on the classical method 3Dvar. 4DVar also is introduced. Advanced filters such kalman filter and its families were highlighted. The idea of hybrid filter was introduced finally.
For future work, detailed hybrid filters should be highlighted, since there are different hybrid filters structure were introduced. Generic case studies the evaluate performance of the different assimilation techniques.
 Evensen, G. (1994) Sequential Data Assimilation with a Nonlinear Quasi-Geostrophic Model Using Monte Carlo Methods to Forecast Error Statistics. Journal of Geophysical Research, 99, 10143-10162.
 Bannister, R.N. (2008) A Review of Forecast Error Covariance Statistics in Atmospheric Variational Data Assimilation. I: Characteristics and Measurements of Forecast Error Covariances. Quarterly Journal of the Royal Meteorological Society, 134, 1951-1970.
 Bannister, R.N. (2008) A Review of Forecast Error Covariance Statistics in Atmospheric Variational Data Assimilation. II: Modelling the Forecast Error Covariance Statistics. Quarterly Journal of the Royal Meteorological Society, 134, 1971-1996.
 Hamill, T.M. and Snyder, C. (2000) A Hybrid Ensemble Kalman Filter-3D Variational Analysis Scheme. Monthly Weather Review, 128, 2905-2919.
 Lorenc, A.C. (2003) The Potential of the Ensemble Kalman Filter for NWP: A Comparison with 4D-Var. Quarterly Journal of the Royal Meteorological Society, 129, 3183-3203.