The Use of Some of the Information Criterion in Determining the Best Model for Forecasting of Thalassemia Cases Depending on Iraqi Patient Data Using ARIMA Model


1. Introduction

Thalassemia is known in Arabic as the Mediterranean disease, since its range of spread includes the Middle East and the Arab Gulf states. Thalassemia, which is widespread in Iraq and other Middle Eastern countries, is one of the best-known genetic diseases, leading to severe anaemia and other long-term complications. It requires regular blood transfusions every three to four weeks, accompanied by pain for the patients and great suffering for their families. On top of this, continuous transfusion also leads to the accumulation of iron in vital organs of the body, such as the liver and heart, which causes serious complications. Cases of this disease are divided into two types. Thalassemia minor characterises people who carry the disease but live a normal life and do not complain of any symptoms. The second type, Thalassemia major, can occur in the offspring of two carriers of Thalassemia minor: there is a 25% likelihood that their children are not affected by the disease, a 25% likelihood that they suffer the severe form, and a 50% likelihood that they have Thalassemia minor [1]. Iraq witnesses a monthly birth rate of between 100 and 150 children with the disease, or approximately 2000 cases per year, and these numbers are disastrous compared with neighbouring countries. Thalassemia is also classified according to the site of the defect in the protein molecule of the red blood cells, giving α- and β-Thalassemia. β-Thalassemia is a common hereditary disorder caused by mutations in one or more of the β-globin gene loci that reduce β-globin production [2]. Recently, more than 200 different mutations affecting various levels of β-globin gene expression have been detected as causes of β-Thalassemia.
These mutations are not uniformly distributed but have a geographical and racial particularity, each region being characterised by the presence of a few common mutations and a variable number of rare ones [3]. Mutations such as nucleotide substitutions and/or frame shifts of the insertion/deletion kind have been reported to interfere with the transcription of the β-globin gene, with splicing, and with the translation of β-globin mRNA, causing either absence or reduction of β-globin chain synthesis [4]. One of the most important elements of building public health is the prevention of all diseases, including serious diseases such as Thalassemia, which causes a high percentage of deaths compared with other diseases, given the increasing number of people infected in recent times. This study was conducted in order to forecast the future prevalence of this disease, which has been increasing in all areas of Iraq, especially in the province of Maysan. The study was based on monthly data for Thalassemia patients over the period 2005-2015, treated as a time series and analysed for optimal modelling using the ARIMA methodology; the forecasts predict the numbers of people with this disease in subsequent periods, so that the necessary measures and precautions can be taken to reduce morbidity in the future. A time series typically consists of a set of observations of a variable taken at evenly spaced intervals of time [5]. The most widely known statistical models used over the last four decades for time-series forecasting belong to the Box-Jenkins method. However, the ARIMA model is a class of linear model and can therefore only capture linear features of a time series [6]. Many criteria for determining the order of a model have been proposed by researchers [7] [8] [9].
Different models of comparable precision can compete in time-series analysis to explain a given data set, and it is not easy to choose the best one. Several criteria have therefore been developed to compare models when selecting the model order. The importance of order selection follows from the fact that choosing an order lower than the true order leads to inconsistent parameter estimates, while choosing an order higher than the true order increases the model variance; the resulting loss of accuracy is due to the increased number of estimated parameters.

2. Method

To forecast the analysed time series we use modern methods such as ARIMA models, because they are among the models that can handle large time-series data sets and forecast future cases.

2.1. Models of Box & Jenkins

The pioneers in this area were Box and Jenkins, who popularised, in the early 1970s, an approach that combines the moving average and the autoregressive models. An ARMA (p, q) model is a combination of AR (p) and MA (q) models, and is suitable for univariate time-series modelling. In an AR (p) model, the future value of a variable is assumed to be a linear combination of p past observations and a random error, together with a constant term. Mathematically, the AR (p) model can be expressed as [10]:

${y}_{t}=c+{\displaystyle {\sum}_{i=1}^{p}{\phi}_{i}{y}_{t-i}}+{\epsilon}_{t}=c+{\phi}_{1}{y}_{t-1}+{\phi}_{2}{y}_{t-2}+\cdots +{\phi}_{p}{y}_{t-p}+{\epsilon}_{t}$ (1)

Here ${y}_{t}$ and ${\epsilon}_{t}$ are respectively the actual value and random error (or random shock) at time period t, ${\phi}_{i}\left(i=1,2\cdots p\right)$ are model parameters and c is a constant. The integer constant p is known as the order of the model. Sometimes the constant term is omitted for simplicity. Usually, for estimating parameters of an AR process using the given time series, the Yule-Walker equations are used. Just as an AR (p) model regresses against past values of the series, an MA (q) model uses past errors as the explanatory variables. The MA (q) model is given by [11]

${y}_{t}=\mu +{\displaystyle {\sum}_{j=1}^{q}{\theta}_{j}{\epsilon}_{t-j}}+{\epsilon}_{t}=\mu +{\theta}_{1}{\epsilon}_{t-1}+{\theta}_{2}{\epsilon}_{t-2}+\cdots +{\theta}_{q}{\epsilon}_{t-q}+{\epsilon}_{t}$ (2)

Here $\mu $ is the mean of the series, ${\theta}_{j}\left(j=1,2\cdots q\right)$ are the model parameters, and q is the order of the model. The random shocks are assumed to be a white noise process, i.e. a sequence of independent and identically distributed (i.i.d.) random variables with zero mean and a constant variance ${\sigma}^{2}$. Generally, the random shocks are assumed to follow a normal distribution. Thus, conceptually, a moving average model is a linear regression of the current observation of the time series against the random shocks of one or more prior observations. Fitting an MA model to a time series is more complicated than fitting an AR model because, in the former case, the random error terms are not foreseeable. Autoregressive (AR) and moving average (MA) models can be effectively combined to form a general and useful class of time-series models, known as ARMA models. Mathematically, an ARMA (p, q) model is represented as

${y}_{t}=c+{\epsilon}_{t}+{\displaystyle {\sum}_{i=1}^{p}{\phi}_{i}{y}_{t-i}}+{\displaystyle {\sum}_{j=1}^{q}{\theta}_{j}{\epsilon}_{t-j}}$ (3)

Usually ARMA models are manipulated using the lag operator notation. The lag or Backshift operator is defined as $L{y}_{t}={y}_{t-1}$ . Polynomials of lag operator or lag polynomials are used to represent ARMA models as follows [11] :

$AR\left(p\right)\text{model}:{\epsilon}_{t}=\phi \left(L\right){y}_{t}$ (4)

$MA\left(q\right)\text{model}:{y}_{t}=\theta \left(L\right){\epsilon}_{t}$ (5)

$ARMA\left(p,q\right)\text{model}:\phi \left(L\right){y}_{t}=\theta \left(L\right){\epsilon}_{t}$ (6)

Here $\phi \left(L\right)=1-{\displaystyle {\sum}_{i=1}^{p}{\phi}_{i}{L}^{i}}$ and $\theta \left(L\right)=1+{\displaystyle {\sum}_{j=1}^{q}{\theta}_{j}{L}^{j}}.$

An important property of an AR (p) process is invertibility, i.e. an AR (p) process can always be written in terms of an MA (∞) process. For an MA (q) process to be invertible, all the roots of the equation $\theta \left(L\right)=0$ must lie outside the unit circle. This condition is known as the invertibility condition for an MA process.
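
The invertibility condition above can be checked numerically. The following is a minimal sketch (the helper name `ma_is_invertible` is our own, not part of any named library) that finds the roots of $\theta \left(L\right)=0$ with NumPy and tests whether they all lie outside the unit circle:

```python
import numpy as np

def ma_is_invertible(theta):
    """Check invertibility of an MA(q) process with polynomial
    theta(L) = 1 + theta_1 L + ... + theta_q L^q.

    The process is invertible iff every root of theta(L) = 0
    has modulus strictly greater than 1.
    """
    # numpy.roots expects coefficients from the highest power down,
    # so reverse [theta_1, ..., theta_q] and append the constant 1.
    coeffs = [theta[-1 - i] for i in range(len(theta))] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

# MA(1) with theta_1 = 0.5: the root of 1 + 0.5L = 0 is L = -2 (|L| > 1), invertible
print(ma_is_invertible([0.5]))   # True
# MA(1) with theta_1 = 2.0: the root is L = -0.5 (inside the unit circle), not invertible
print(ma_is_invertible([2.0]))   # False
```
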

2.2. ARIMA Model

In statistics and econometrics, an autoregressive integrated moving average (ARIMA) model is the integrated extension of an autoregressive moving average (ARMA) model. These models are fitted to time-series data either to better understand the data or to forecast future points in the series. They are applied in cases where the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity [12]. Non-seasonal ARIMA models are generally denoted ARIMA (p, d, q), where the parameters p, d, and q are non-negative integers: p is the order of the autoregressive part, d is the degree of differencing, and q is the order of the moving-average part. ARIMA models form an important part of the Box-Jenkins approach to time-series modelling [13]. An ARIMA model can be seen as a chain of two models, the first of which is non-stationary:

${x}_{t}={\theta}_{0}+{\varphi}_{1}{x}_{t-1}+{\varphi}_{2}{x}_{t-2}+\cdots +{\varphi}_{p}{x}_{t-p}+{e}_{t}-{\theta}_{1}{e}_{t-1}-{\theta}_{2}{e}_{t-2}-\cdots -{\theta}_{q}{e}_{t-q}$ (7)

where ${x}_{t}$ and ${e}_{t}$ are the actual value and the random error at time t, respectively, and ${\varphi}_{i}\left(i=1,2,\cdots ,p\right)$ and ${\theta}_{j}\left(j=1,2,\cdots ,q\right)$ are model parameters. p and q are integers, often referred to as the orders of the autoregressive and moving-average polynomials, respectively.

2.3. Steps of ARIMA Methodology [14]

・ Analysis of the series: The first step in the modelling process is to check the stationarity of the time-series data.

・ Identification of the model: This step aims to detect periodicity and to identify the order of the autoregressive and moving-average terms, including any seasonal terms. This stage includes calculation of the estimated autocorrelation function (ACF) and the estimated partial autocorrelation function (PACF); these functions measure the statistical dependence between observations of the data.

・ Estimation of ARIMA parameters: The estimation of ARIMA parameters is achieved by the nonlinear least squares method. The values of the model coefficients are determined in relation to a particular criterion; one of these may be the maximum likelihood criterion. It can be shown that the likelihood function associated with a correct ARIMA model, used to determine the estimates of maximum likelihood of the parameters, contains all the useful information from the data series about the model's parameters.

・ Diagnostic checking: At this stage it is assumed that the errors form a stationary process and that the residuals are white noise (independent if the distribution is normal), normally distributed with stable mean and variance. The tests used to validate the model are based on the estimated residuals: it is checked whether the components of the residual vector are autocorrelated. If there is autocorrelation, the checked model is not correctly specified; the dependencies between the series components are specified incompletely, and we have to return to the model-identification step and try another model. Otherwise, the model is good and can be used to make predictions over a given time horizon.

・ Forecasting.

3. Criteria for Selection of the Rank of the Model

3.1. Bayesian Information Criterion (BIC) [8]

$BIC=n\mathrm{ln}{\stackrel{^}{\sigma}}_{a}^{2}-\left(n-M\right)\mathrm{ln}\left(1-M/n\right)+M\mathrm{ln}\left(n\right)+M\mathrm{ln}\left[\left({\stackrel{^}{\sigma}}_{Y}^{2}/{\stackrel{^}{\sigma}}_{a}^{2}-1\right)/M\right]$ (8)

where:

p: model rank,

n: the number of observations,

M: the number of parameters,

${\stackrel{^}{\sigma}}_{Y}^{2}$ : the estimated series variance,

${\stackrel{^}{\sigma}}_{a}^{2}$ : the estimated error variance,

${\stackrel{^}{\sigma}}_{a}^{2}={\displaystyle {\sum}_{t=1}^{n}{\left({y}_{t}-{\stackrel{^}{y}}_{t}\right)}^{2}}/\left(n-p\right)$ (9)

3.2. Akaike Information Criterion [7]

$AIC\left(M\right)=n\mathrm{ln}{\stackrel{^}{\sigma}}_{a}^{2}+2M$ (10)

or

$AIC\left(p,q\right)=\mathrm{ln}{\stackrel{^}{\sigma}}_{a}^{2}+2\left(p+q\right)/n$ (11)

where:

M = p + q; p, q: model rank; n: the number of observations; ${\stackrel{^}{\sigma}}_{a}^{2}$ : the estimated error variance.
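
As an illustration of criterion-based order selection, the sketch below implements Eq. (11) for pure AR candidates (q = 0), estimating the error variance of Eq. (9) by ordinary least squares. The helper names are our own and the data are simulated:

```python
import numpy as np

def ar_residual_variance(y, p):
    """Fit an AR(p) model by ordinary least squares and return the
    estimated error variance, RSS / (n - p), as in Eq. (9)."""
    n = len(y)
    if p == 0:
        resid = y - y.mean()
        return float(resid @ resid) / n
    # Design matrix: a constant plus the p lagged values y_{t-1}, ..., y_{t-p}
    X = np.column_stack([y[p - i - 1 : n - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(n - p), X])
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid) / (n - p)

def aic(y, p, q=0):
    """AIC(p, q) = ln(sigma_a^2) + 2 (p + q) / n, as in Eq. (11)."""
    return float(np.log(ar_residual_variance(y, p)) + 2 * (p + q) / len(y))

# Simulated AR(1) series: the criterion should strongly prefer p >= 1
# over the no-lag model, since the lag-1 dependence is real.
rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()
scores = {p: aic(y, p) for p in range(0, 6)}
best_p = min(scores, key=scores.get)
print(best_p, scores[best_p])
```
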

4. The Application of Data

This study is based on the time-series data provided by the hereditary blood disease centre in Iraq (Maysan) from people diagnosed with Thalassemia for the period 2005-2015.

4.1. Analysis of the Series

The first step in the modelling process is to check the stationarity of the time-series data. This is done by inspecting the graph of the data or the autocorrelation and partial autocorrelation functions [2]. The graph of the time series shows high rates of Thalassemia compared with previous years (Figure 1).

4.2. Stationarity

The first stage of ARIMA model building is to identify whether the variable being forecasted is stationary in the time series or not. With autocovariance functions we can define covariance stationarity, or weak stationarity; in the literature, stationarity usually means weak stationarity unless otherwise specified. The time series $\left({x}_{t},t\in Z\right)$, where Z is the set of integers, is said to be stationary if:

$\mathrm{var}\left({x}_{t}\right)<\infty \forall t\in Z$

$E\left({x}_{t}\right)=\mu \forall t\in Z$

${\gamma}_{x}\left(s,t\right)={\gamma}_{x}\left(s+h,t+h\right)\forall s,t,h\in Z$

The time plot of $\left\{{x}_{t}\right\}$ must have three features: finite variation, a constant first moment, and a second moment ${\gamma}_{x}\left(s,t\right)$ that depends only on $\left(t-s\right)$ and not on s or t individually.

Figure 1. Graph of the original series: the increase in the number of patients suffering from thalassemia, 2005-2015, in southern Iraq. The time series before differencing shows variability that appears to change with time; the mean and variance are therefore not constant, suggesting that the series is not stationary.

In light of the last point, we can rewrite the autocovariance function of a stationary process as

${\gamma}_{x}\left(h\right)=\text{Cov}\left({x}_{t},{x}_{t+h}\right)\text{ for }t,h\in Z$ (12)

Also, when ${x}_{t}$ is stationary we must have

${\gamma}_{x}\left(h\right)={\gamma}_{x}\left(-h\right)$ (13)

When $h=0$, ${\gamma}_{x}\left(0\right)=\mathrm{cov}\left({x}_{t},{x}_{t}\right)$ is the variance of ${x}_{t}$, so the autocorrelation function for a stationary time series ${x}_{t}$ is defined to be

${\rho}_{x}\left(h\right)={\gamma}_{x}\left(h\right)/{\gamma}_{x}\left(0\right)$ (14)
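
The sample versions of Eqs. (12)-(14) can be computed directly. The sketch below (the helper name `sample_acf` is our own) estimates the autocovariances of a simulated white-noise series and normalises by ${\gamma}_{x}\left(0\right)$:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho(h) = gamma(h) / gamma(0), Eqs. (12)-(14)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    gamma0 = float(np.sum((x - xbar) ** 2)) / n
    acf = [1.0]  # rho(0) = gamma(0) / gamma(0) = 1 by definition
    for h in range(1, max_lag + 1):
        gamma_h = float(np.sum((x[h:] - xbar) * (x[:-h] - xbar))) / n
        acf.append(gamma_h / gamma0)
    return np.array(acf)

# For white noise, the autocorrelations beyond lag 0 should be near zero;
# gamma(h) = gamma(-h) holds by construction of the estimator.
rng = np.random.default_rng(2)
noise = rng.normal(size=500)
rho = sample_acf(noise, 10)
print(rho[0], rho[1])
```
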

Figure 1 of the thalassemia data clearly shows that the data are not stationary (in fact, the series shows an increasing trend). The ARIMA model cannot be built until we make this series stationary. We first have to difference the time series d times to obtain a stationary series, giving an ARIMA (p, d, q) model with d as the order of differencing. Care should be taken in differencing, as over-differencing tends to increase the standard deviation rather than reduce it. The best approach is to start with the lowest order of differencing (first order, d = 1) and test the data for unit-root problems.

First difference: ${Z}_{t}={y}_{t}-{y}_{t-1}\text{ where }t=2,3,\cdots ,n$

Second difference: ${Z}_{t}=\left({y}_{t}-{y}_{t-1}\right)-\left({y}_{t-1}-{y}_{t-2}\right)\text{ where }t=3,4,\cdots ,n$

As a result we obtained a first-order differenced time series; Figure 2, below, is the line plot of the first-order differenced Thalassemia data. It can easily be inferred from the graph that the series appears to be stationary in both mean and variance. Moreover, the differenced time-series data were subjected to the Dickey-Fuller test to check for stationarity.
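
Differencing itself is a one-line operation. The sketch below (the helper name `difference` is our own) illustrates how a single difference removes a linear trend:

```python
import numpy as np

def difference(y, d=1):
    """Apply d-th order differencing: Z_t = y_t - y_{t-1}, repeated d times."""
    z = np.asarray(y, dtype=float)
    for _ in range(d):
        z = z[1:] - z[:-1]
    return z

# A linear trend is removed by a single difference: the differenced
# series is constant, hence stationary in its mean.
trend = np.arange(20, dtype=float)   # y_t = t, clearly non-stationary
z = difference(trend, d=1)
print(z[:5])                          # all ones
```
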

Using the augmented Dickey-Fuller (ADF) test [15]:

Our null hypothesis (H0) in the test is that the time-series data are non-stationary, while the alternative hypothesis (Ha) is that the series is stationary.

Figure 2. The time series after the first difference (d = 1) becomes stationary.

The hypothesis is then tested by differencing the data to order d and applying the ADF test to the differenced time series. First-order differencing (d = 1) means we generate a table of differences between the current and the immediately previous time, $\Delta {x}_{t}={x}_{t}-{x}_{t-1}$. The ADF test result, obtained using the Eviews program, is shown in Table 1.

We therefore reject H0 and conclude that the alternative hypothesis holds, i.e. the series is stationary in its mean and variance. Thus, there is no need for further differencing of the time series, and we adopt d = 1 for our ARIMA (p, d, q) model. This enables us to proceed to the next steps of ARIMA model development, i.e. finding suitable values of p for the AR part and q for the MA part. For that, we need to examine the correlogram and partial correlogram of the stationary (first-order differenced) time series. The Dickey-Fuller test is based on three simple regression equations for a first-order autoregressive process:

$\Delta {x}_{t}={\alpha}_{1}{x}_{t-1}+{e}_{t}$ (15)

$\Delta {x}_{t}={\alpha}_{0}+{\alpha}_{1}{x}_{t-1}+{e}_{t}$ (16)

$\Delta {x}_{t}={\alpha}_{0}+{\alpha}_{1}{x}_{t-1}+\beta t+{e}_{t}$ (17)

where:

$\Delta $ : The first difference factor, which: $\Delta {x}_{t}={x}_{t}-{x}_{t-1},$

${e}_{t}$ : White noise process.

Table 1. The augmented Dickey-Fuller test results.

4.3. Correlogram and Partial Correlogram

Figure 3, below, shows the correlogram (autocorrelation function, ACF) and partial correlogram for lags 1 to 20 of the first-order differenced time series of the Thalassemia patients (Figure 3). The correlogram shows that the autocorrelations and partial autocorrelations between lags 1 and 20 do not exceed the significance limits and tail off quickly to zero.

4.4. Selecting the Rank of Model [16]

It can be seen from Table 2 that the best model for this series is ARIMA (0, 1, 1), as it has the lowest values of the information criteria. This is perhaps the most commonly used model in forecasting: the exponential smoothing model. The general form of the model is:

${Z}_{t}-{Z}_{t-1}=c+{a}_{t}-\theta {a}_{t-1},\text{}\text{\hspace{0.17em}}\left|\theta \right|<1$ (18)

Following common practice, we shall assume c = 0. Since the model is invertible, the $\pi $ weights are ${\pi}_{i}={\theta}^{i-1}\left(1-\theta \right)$ for $i\ge 1$. Thus the fitted model is

${Z}_{t}-{Z}_{t-1}=0.05577677+{a}_{t}-0.98928205{a}_{t-1}$
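
The claim that the $\pi $ weights make this an exponential smoothing model can be checked numerically. A short sketch, using the estimated θ above and truncating the infinite sum, verifies that the weights decay geometrically and sum to one:

```python
import numpy as np

theta = 0.98928205        # MA(1) coefficient estimated for the series
i = np.arange(1, 2001)    # truncate the infinite sum at 2000 terms
pi = theta ** (i - 1) * (1 - theta)   # pi_i = theta^(i-1) (1 - theta)

# The pi weights decay geometrically and sum (up to truncation error)
# to 1, which is exactly the weighting scheme of simple exponential
# smoothing of the past observations.
print(pi[0], pi.sum())
```
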

Figure 3. Plot of the autocorrelation and partial autocorrelation functions of the differenced series.

Table 2. Results of the information criteria for various candidate models of the time series.

5. The Forecasting

After applying specific statistical programs (Eviews, gretl, Minitab) to our data, the results and charts show an increase in Thalassemia cases in the coming years, from 2016 to 2018, according to Figure 4, Figure 5 and Table 3, where the monthly number of patients will be between 7 and 11

Figure 4. Forecast of the expected number of patients with thalassemia for the years 2016-2018 using an ARIMA (0, 1, 1) model.

Figure 5. Forecast of the expected number of patients with thalassemia for the years 2016-2018 using an ARIMA (1, 1, 0) model.

Table 3. Yearly forecasts of the number of thalassemia patients for 2016-2018, with 95% confidence intervals (z(0.025) = 1.96).

Table 4. The values of the final prediction error criterion.

and these numbers are high compared with previous years; the highest forecast number of patients occurs in June 2018. Some months during the period 2005-2015 did not record any cases, while in the forecasting period 2016-2018 every month is expected to have some patients.

Final Prediction Error (FPE) [9]

The final prediction error (FPE) gives a good estimate of the prediction error for a model with n parameters (see Table 4):

$FPE={\sigma}_{r}^{2}\left(N+n+1\right)/\left(N-n-1\right)$ (19)

${\sigma}_{r}^{2}$ : the variance of the residuals,

N: the number of values in the estimation data set.

Table 4 shows that the ARIMA (0, 1, 1) model has the lowest value of the FPE criterion.

6. Conclusions

1) The ARIMA model was suitable for application to the Thalassemia data and for the analysis of other similar medical data.

2) By applying the ARIMA model, we were able to forecast future cases easily and accurately.

3) Cases of Thalassemia will increase in the coming years, which means that no serious efforts are currently being made to control or treat this disease in Iraq.

7. Recommendations

1) Conduct similar studies using data from other provinces of Iraq to help overcome the disease.

2) Perform pre-marriage tests to detect carriers and limit future incidence.

3) Bone marrow transplantation is the most suitable treatment for the disease, but it is so costly for patients that governmental support would be very helpful.

4) Care should be taken to use less painful and safer medical equipment during blood transfusions, to relieve pain and to prevent infection by bacterial and viral diseases.

References

[1] Pearson, H. (2003) Abnormalities of Hemoglobin Structure and Function.

[2] Mok, S., Imwong, M., Mackinnon, M.J., et al. (2011) Artemisinin Resistance in Plasmodium Falciparum Is Associated with an Altered Temporal Pattern of Transcription. BMC Genomics, 12, 391.

https://doi.org/10.1186/1471-2164-12-391

[3] Cao, A. and Galanello, R. (2010) Beta-Thalassemia. Genetics in Medicine, 12, 61-76.

https://doi.org/10.1097/GIM.0b013e3181cd68ed

[4] Lahiry, P., Al-Attar, S.A. and Hegele, R.A. (2008) Understanding Beta-Thalassemia with Focus on the Indian Subcontinent and the Middle East. The Open Hematology Journal, 2, 5-13

https://doi.org/10.2174/1874276900802010005

[5] Popescu, S. (1991) Practica modelarii si predictiei seriilor de timp (Practical Modeling and Prediction of Time Series). Technical Publishing, Bucharest.

[6] Zhang, G.P. (2003) Time Series Forecasting Using a Hybrid ARIMA and Neural Network Model. Neurocomputing, 50, 159-175.

https://doi.org/10.1016/S0925-2312(01)00702-0

[7] Kadilar, C. and Erdemir, C. (2003) Modification of the Akaike Information Criterion to Account for Seasonal Effects. Journal of Statistical Computation and Simulation, 73, 135-143.

https://doi.org/10.1080/00949650215731

[8] Mcquarrie, A.D.R. and Tsai, C. (1998) Regression and Time Series Model Selection. World Scientific Publishing Company, Singapore.

https://doi.org/10.1142/3573

[9] Khim, S. and Mahnendran, S. (2002) The Performance of AICC as an Order Selection Criterion in ARMA Time Series Models. Pertanika Journal of Science & Technology, 10, 25-33.

[10] Hipel, K.W. and McLeod, A.I. (1994) Time Series Modelling of Water Resources and Environmental Systems. Amsterdam, Elsevier.

[11] Cochrane, J.H. (1997) Time Series for Macroeconomics and Finance. Graduate School of Business, University of Chicago, Chicago.

[12] Meyer, Y. (1993) Wavelets, Algorithm & Applications. SIAM, Philadelphia.

[13] Chatfield, C. (1999) The Analysis of Time Series: Theory and Practice. Chapman and Hall, London.

[14] Box, G.E., et al. (2015) Time Series Analysis: Forecasting and Control. John Wiley & Sons.

[15] Dickey, D. and Fuller, W. (1979) Distribution of the Estimators for Autoregressive Time Series With a unit Root. Journal of the American Statistical Association, 74, 427-431.

https://doi.org/10.1080/01621459.1979.10482531

[16] Raicharoen, T., Lursinsap, C. and Sanguanbhoki, P. (2003) Application of Critical Support Vector Machine to Time Series Prediction. Proceedings of the International Symposium on Circuits and Systems, Bangkok, 25-28 May 2003, 741-744.

https://doi.org/10.1109/iscas.2003.1206419