Forecasting Alzheimer’s Disease Using Combination Model Based on Machine Learning


1. Introduction

Alzheimer’s Disease (AD) is a rapidly growing neurological problem all over the world. As a common form of dementia, it has neurological and behavioral repercussions. It is the sixth leading cause of death, and one person develops AD every 71 seconds. Moreover, the rate roughly doubles after age 65 [1].

Many machine learning methods have been developed to model, analyze and predict the relevant problems. For example, Chung-Chou H. Chang showed that smoking is an underlying triggering factor of AD [2]. C. Reitz also indicated that smoking increases the risk of dementia by exploring the impact of the APOE ε4 allele, sex and age on the association between smoking status and dementia [3]. Consequently, research that forecasts AD from lifestyle factors has great significance nowadays.

To obtain a more reliable result, we use three machine learning methods to build prediction models for Alzheimer’s disease: the BP neural network, SVM and random forest. Many previous papers have presented methods for predicting diseases by machine learning. For example, in 2011, I.A. Illan introduced a computer aided diagnosis (CAD) system for the early detection of AD that combines mask-based feature reduction techniques with a pasting-votes method for aggregating an ensemble of SVM classifiers applied to the relevant image components [5], which is defined as a hybrid scheme. As for random forest, in 2017, Ashwani Kumar generated classification models for different AD genes according to Mini-Mental State Examination scores and other vital parameters, in order to identify the expression levels of different proteins of the disorder that may determine the involvement of genes in various AD pathogenesis pathways [4]; this is a new decision tree to solve the puzzle of AD pathogenesis through a standard diagnosis scoring system. Similarly, by machine learning, we can determine the inner relationship between lifestyle and Alzheimer’s disease. Moreover, by comparing the three models and considering them jointly, we can build a new prediction model that maximizes their advantages while remaining accurate in predicting AD. All the work above is supported by Wuhan University of Technology.

2. Alzheimer’s Disease Forecasting Model

2.1. Data Pretreatment

First, we selected elderly people aged over 65 in Wuhan, China, as the sample, and surveyed them mainly via the internet. We also visited hospitals and nursing homes to collect enough paper questionnaires. To make the investigation reliable, we eliminated the samples with liver, kidney and cardiopulmonary disease. We obtained 1157 questionnaires, of which 1038 were valid. We then coded some of the factors as shown in Table 1.

2.2. Forecasting Based on Back Propagation Neural Network

Table 1. Assignment of the factors.

In a back propagation (BP) neural network, each layer takes inputs only from the previous layer and sends outputs only to the next layer. The network is characterized by self-learning and self-adaptability, and it performs well in computation-intensive fields. During learning, back propagation calculates the gradient of the loss function and uses gradient descent to adjust the weights of the neurons; the method is widely used in classification, evaluation, identification and forecasting. For disease forecasting, a BP neural network consists of an input layer (symptoms), hidden layers and an output layer (falling ill or not), which makes the definition of the index weights more objective. That is, the different impacts of different factors can be found objectively. Above all, a BP neural network is a simulation of the human nervous system that decreases subjectivity by finding the inner relations in the selected training set, so it can be properly applied to disease forecasting.

By training the BP neural network, we can forecast the disease according to the lifestyle of elderly people. In 1989, Hornik, Stinchcombe and White proved that a nonlinear neural network with three layers can approximate any continuous function to any precision as long as it possesses enough hidden nodes [6]. We assume that there are n neurons in the input layer, m neurons in the hidden layer, and one neuron in the output layer. For this problem, the twenty factors are treated as the inputs of the input layer. The well-trained BP neural network is then determined by the following steps [7].

・ Hidden layer stage: The output of all neurons in the hidden layer can be described as:

$ne{t}_{j}={\displaystyle \underset{i=0}{\overset{n}{\sum}}{v}_{ij}{x}_{i}},\text{}j=1,2,\cdots ,m$ (1)

${y}_{j}={f}_{H}\left(ne{t}_{j}\right),\text{}j=1,2,\cdots ,m$ (2)

Here ${v}_{ij}$ is the weight connecting the ith input ${x}_{i}$ to the jth hidden node, $ne{t}_{j}$ is the activation value of the jth node, ${y}_{j}$ is the output of the jth hidden node, and ${f}_{H}$ is the activation function of a node, usually a sigmoid function as follows:

${f}_{H}\left(x\right)=\frac{1}{1+{\text{e}}^{-x}}$ (3)

・ Output stage: The outputs of all neurons in the output layer are given as follows:

$O={f}_{O}\left({\displaystyle \underset{j=0}{\overset{m}{\sum}}{\omega}_{jk}{y}_{j}}\right)$ (4)

where ${f}_{O}$ is the activation function of the output layer, usually a linear function. All weights are initially assigned random values, and they are then modified by the delta rule according to the learning samples.
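As a rough illustration of Equations (1)-(4), the forward pass can be sketched in Python as below. The function names, the weight layout and the toy weights are assumptions made only for this sketch; they are not the MATLAB toolbox configuration used in this paper. In particular, treating the index-0 terms of the sums as bias terms with ${x}_{0}=1$ and ${y}_{0}=1$ is an assumption about how the sums starting at 0 are meant.

```python
import math

def sigmoid(x):
    """Hidden-layer activation f_H from Equation (3)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, v, w):
    """Forward pass of a one-hidden-layer network with a single linear output.

    x: list of n input values (without the bias term)
    v: v[i][j] is the weight from input i to hidden node j; row 0 is the bias row (assumed)
    w: w[j] is the weight from hidden node j to the output; w[0] is the bias weight (assumed)
    """
    x_ext = [1.0] + list(x)              # x_0 = 1 acts as the bias input (assumption)
    m = len(v[0])
    # Equations (1) and (2): net_j = sum_i v_ij * x_i, y_j = f_H(net_j)
    y = [sigmoid(sum(v[i][j] * x_ext[i] for i in range(len(x_ext))))
         for j in range(m)]
    y_ext = [1.0] + y                    # y_0 = 1 acts as the output bias (assumption)
    # Equation (4) with a linear output activation f_O(z) = z
    return sum(w[j] * y_ext[j] for j in range(len(y_ext)))

# Toy example: 2 inputs, 2 hidden nodes, all hidden weights set to zero
v_toy = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_toy = [0.1, 0.5, 0.5]
output = forward([0.3, 0.7], v_toy, w_toy)
```

With all hidden weights at zero, every hidden node outputs 0.5, so the toy output is determined by `w_toy` alone; this makes the sketch easy to check by hand before real weights are substituted.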

Then we set the output vector Y as follows:

$Y\in \begin{cases}\left(0.5,1\right], & \text{when the interviewee is forecast as an AD patient}\\ \left[0,0.5\right), & \text{otherwise}\end{cases}$ (5)

We randomly choose 730 of the 1038 interviewees as the training set, 154 as the validation set, and the remaining 154 as the testing set. With the training set, we adjust the weight and bias of each neuron to decrease the error until a satisfactory minimum is reached, while the validation set helps us limit overfitting. With the testing set, we examine the accuracy of the BP neural network we have built.
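A random 730/154/154 partition of this kind can be sketched as follows. The helper name `split_indices` and the fixed seed are assumptions for illustration; the paper does not state how the random selection was implemented.

```python
import random

def split_indices(n_total, n_train, n_val, seed=0):
    """Randomly partition range(n_total) into training, validation and testing index lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)   # seed is a hypothetical choice for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # everything that remains
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1038, 730, 154)
```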

We determine k, the number of neurons in the hidden layer, by trial and error. Using the empirical formula $k={\mathrm{log}}_{2}N$ , we take 8 as the initial value of k. After a number of experiments, we finally set the number of neurons in the hidden layer to 65.

Before constructing the neural network, we normalize the data with the min-max algorithm to eliminate the influence of dimension. The algorithm can be described as follows:

${{X}^{\prime}}_{ij}=\frac{{X}_{ij}-{X}_{i\mathrm{min}}}{{X}_{i\mathrm{max}}-{X}_{i\mathrm{min}}}$ (6)
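Equation (6) can be illustrated with a short Python sketch that scales every factor (column) to the interval [0, 1]. The function name and the toy data are hypothetical, and mapping a constant factor to 0 is an assumption added here to avoid division by zero; the paper does not discuss that case.

```python
def min_max_normalize(rows):
    """Scale each column (factor) of a list of sample rows to [0, 1] as in Equation (6)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0   # constant factor -> 0 (assumption)
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Toy data: three samples with two coded factors each
samples = [[65, 0], [70, 1], [85, 1]]
normalized = min_max_normalize(samples)
```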

The structure of the BP neural network is shown in Figure 1.

Finally, to judge whether an elderly person is going to develop AD, we set a standard: if the output Y is greater than 0.5, the person is forecast to be an AD patient; otherwise, the person is not. We use the neural network toolbox in MATLAB to compute the training regression, and the result is shown in Figure 2.

Figure 1. BP neural networks model structure.

From Figure 2, the fitted regression line between the network output Y and the target T is

$Y=0.86T+0.066$ (7)

The correlation coefficient R between the predicted value and the real value is 0.96132, indicating that the prediction ability of the neural network is good.
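For readers who want to reproduce such a coefficient outside MATLAB, a minimal sketch of the Pearson correlation between targets and network outputs is given below. The function name and the toy values are assumptions; the toy outputs are generated exactly on the line of Equation (7), so they are perfectly correlated and do not reproduce the reported R.

```python
import math

def pearson_r(targets, outputs):
    """Pearson correlation coefficient between target values and predicted values."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    return cov / math.sqrt(var_t * var_o)

# Toy targets; outputs lie exactly on Y = 0.86 T + 0.066 (illustrative only)
toy_targets = [0, 1, 0, 1]
toy_outputs = [0.86 * t + 0.066 for t in toy_targets]
r_toy = pearson_r(toy_targets, toy_outputs)
```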

Then we check the training performance of the neural network; the mean squared error is shown in Figure 3.

As the number of epochs increases during training, the weights of the neural network are updated more and more times, and the fitted curve tends increasingly toward overfitting. From Figure 3, the best validation performance is 0.02968 at epoch 3.
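Selecting the epoch with the lowest validation error, as read from Figure 3, can be sketched as follows. The function name and the list of errors are hypothetical values chosen for illustration and are not the measurements behind Figure 3; numbering epochs from 0 is also an assumption.

```python
def best_epoch(validation_errors):
    """Return (epoch, error) for the smallest validation error; epochs are numbered from 0."""
    epoch = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return epoch, validation_errors[epoch]

# Hypothetical validation MSE per epoch: falls at first, then rises as overfitting sets in
toy_validation_mse = [0.20, 0.09, 0.05, 0.03, 0.04, 0.06]
epoch, error = best_epoch(toy_validation_mse)
```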

In this way, we obtain the accuracy shown in Table 2.

From Table 2, the accuracy is satisfactory, so the neural network is well trained. By entering data on the early signs of elderly people, we can predict whether they will develop Alzheimer’s Disease.

Figure 2. Neural network training regression.

Table 2. The results of BP neural networks.

Figure 3. Neural network training performance.

2.3. Forecasting Based on Support Vector Machine

As a supervised learning model, the support vector machine (SVM) assigns new examples to one of two categories separated by a gap that is as wide as possible. For this problem, to avoid overfitting, we relax the constraint conditions by introducing the slack variables ${\xi}_{i}$ . The resulting optimization problem can be described as follows [8]:

$\underset{w}{\mathrm{min}}\frac{1}{2}{\Vert w\Vert}^{2}+C{\displaystyle \underset{i=1}{\overset{l}{\sum}}{\xi}_{i}}$ (8)

$s.t.\text{}\{\begin{array}{l}{y}_{i}\left(w\cdot {x}_{i}+b\right)\ge 1-{\xi}_{i}\\ {\xi}_{i}\ge 0,\text{}i=1,2,\cdots ,l\end{array}$ (9)

where $\xi ={\left({\xi}_{1},\cdots ,{\xi}_{l}\right)}^{\text{T}}$ , and the parameter C refers to the penalty coefficient of the wrongly classified cases.
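To make the roles of C and ${\xi}_{i}$ concrete, the sketch below evaluates objective (8) for a given hyperplane, taking each slack as the smallest value that satisfies constraint (9). The function name, the toy data and the label coding ${y}_{i}\in \{-1,+1\}$ are assumptions for illustration; the sketch evaluates the objective only and does not solve the optimization problem.

```python
def soft_margin_objective(w, b, xs, ys, C=1.0):
    """Return the value of objective (8) and the slack variables for a given (w, b)."""
    slacks = []
    for x, y in zip(xs, ys):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))   # smallest xi_i satisfying (9)
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Toy data with labels coded as +1 / -1 (assumption)
xs_toy = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
ys_toy = [1, 1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, xs_toy, ys_toy, C=1.0)
```

Only the second toy sample lies inside the margin, so it is the only one with a non-zero slack; a larger C would penalize that sample more heavily.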

2.3.1. The Best Number of Training Set Samples

Using convex hull theory, we determine that the data are linearly inseparable. Therefore, setting the penalty parameter C to 1, we use the radial basis function (RBF) kernel to find the best number of training samples, which we optimize via the grid-search method. The accuracy rises at the beginning and declines later as the number of samples grows, finally leveling out at around 87%. It reaches its peak when the number of samples is 730, which is therefore the optimum size of the training set.

2.3.2. The Best Kernel Function

By using the kernel trick, which implicitly maps the input factors into a high-dimensional feature space, SVMs can perform non-linear classification efficiently. In the previous section, we set the size of the training set to 730. Now we choose the appropriate kernel function by testing the accuracy on the training set and the testing set for four common kernel functions, including the linear kernel, the polynomial kernel and the radial basis function. After repeating the experiment 100 times, we obtain the diagrams shown in Figures 4-6.

The results are listed in Table 3.

From Table 3, RBF is the best kernel function, so we choose it as the kernel function, that is:

$K=\mathrm{exp}\left(-\gamma {\Vert x-{x}_{i}\Vert}^{2}\right)$ (10)

Figure 4. Error rate of linear kernel.

Figure 5. Error rate of cubic polynomial.

Figure 6. Error rate of RBF.

Table 3. Accuracy comparison of four different kernel functions.

In this function, the parameter γ represents the bandwidth of the kernel and has great influence on the classification results. To simplify the problem, we set γ to 0.5 based on experience.
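The kernel of Equation (10) with γ = 0.5 can be written directly in Python. The function name and the toy vectors are assumptions for illustration.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K = exp(-gamma * ||x - z||^2) from Equation (10)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points give the maximum value
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0])    # squared distance 2, so exp(-0.5 * 2)
```

A larger γ makes the kernel value fall off faster with distance, which is why the choice of γ strongly affects the classification results.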

2.3.3. Results of SVM

In the previous sections, we defined all the parameters used in SVM, so we can now derive the accuracy of the prediction model applied to AD. The prediction results are shown in Table 4.

From Table 4, the accuracy of SVM is similar to that of the BP neural network. We therefore propose another method: random forest.

2.4. Forecasting Based on Random Forest

Before introducing the random forest model, let us first discuss the decision tree model. A decision tree, represented by a set of nodes, branches and leaves, can generate rules for classification. For this problem, the leaf nodes indicate whether an elderly person is an AD patient, and the other nodes refer to living habits. To obtain a more convincing result, we use random forest classification (RFC).

Table 4. The Results of SVM.

RFC is a classification model composed of several decision trees, each of which gives one classification result. First, we randomly draw k samples from the original training set using the bootstrap. Then we build k decision trees based on the k samples and derive their classification results. Finally, we take a vote over the classification results. The flow diagram of the algorithm is shown in Figure 7.
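The bootstrap step can be sketched as below: each of the k training sets is drawn with replacement from the original training set. The function name, the seed and the toy data are assumptions for illustration, and drawing sets of the same size as the original set is the usual convention rather than something stated in this paper.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)            # hypothetical seed for reproducibility
training_set = list(range(10))     # stand-in for the real training samples
k = 3                              # number of trees in this toy example
bootstrap_sets = [bootstrap_sample(training_set, rng) for _ in range(k)]
```

Because sampling is done with replacement, each bootstrap set typically repeats some samples and omits others, which is what makes the k trees differ from one another.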

From Figure 7, random forest improves the forecasting ability by constructing different training sets and thereby enlarging the difference between the classification models. After k rounds, we get a series of classification models $\left\{{h}_{1}\left(X\right),{h}_{2}\left(X\right),\cdots ,{h}_{k}\left(X\right)\right\}$ , from which we construct a multi-classification model whose criterion function is

$H\left(x\right)=\mathrm{arg}\underset{Y}{\mathrm{max}}{\displaystyle \underset{i=1}{\overset{k}{\sum}}I\left({h}_{i}\left(x\right)=Y\right)}$ (11)

where $H\left(x\right)$ represents the multi-classification model, ${h}_{i}$ refers to the ith decision tree model, Y refers to the target class, and $I(\cdot )$ is the indicator function. According to the formula above, the classification result is decided by voting.
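The voting rule of Equation (11) amounts to counting how many trees predict each class and returning the most frequent one. The sketch below assumes the tree predictions are already available as a list; the function name and the toy votes are hypothetical, and ties are resolved in favor of the class that appears first, which is a choice made only for this sketch.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees (Equation (11))."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# Toy example: five trees, three of which predict class 1
toy_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(toy_votes)
```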

For this problem, we select 692 of the 1038 samples as the training set, and then adjust two parameters: mtry, the number of variables tried at each node, and ntree, the number of trees in the forest. We run controlled experiments to search for the best values of the two parameters. First, we adjust mtry with ntree fixed; the accuracy after 100 experiments is shown in Figure 8.

From Figure 8, the error decreases when mtry increases from 1 to 2 and becomes stable after 4, so we set $mtry=4$ .

Then we fix $mtry=1$ and adjust the value of ntree in the same way; the result is shown in Figure 9.

From Figure 9, the error decreases when ntree increases from 10 to 30 and then increases mildly. Finally, we get the best result at $ntree=80$ .

In summary, we set the size of the training set to 692, with $ntree=80$ and $mtry=4$ . The accuracy based on these parameters is 99.3%. The generalization error rate is plotted in Figure 10.

From Figure 10, the error rate fluctuates around 0.01, and the average accuracy is 99.1%, demonstrating the high classification accuracy and stability of the random forest model. We can also determine the importance of each factor from its contribution rate in the decision tree classification, in which education has a great effect on Alzheimer’s Disease. Moreover, factors such as exercise, chatting, smoking and eating habits also have a considerable effect on AD.

Figure 7. Algorithm flow diagram.

Figure 8. Accuracy caused by mtry.

3. Comparison between Three Models & Building Combination Forecasting Model

In the previous section, we predict Alzheimer’s Disease by three methods: BP neural networks, SVM and random forest classification. Now we compare the three models and build a combination forecasting model based on the three machine learning models.

3.1. Data Pretreatment of the Three Models

Figure 9. Accuracy Caused by ntree.

Figure 10. Generalization error rate.

The three models place different requirements on the data. For the BP neural network, the data should be normalized to eliminate the influence of dimension. Similarly, whether the data need to be normalized for SVM depends on its parameters. In contrast, the random forest model does not require any data pretreatment. Thus, from the aspect of data pretreatment, random forest is the simplest method.

3.2. Parameters of the Three Models

By analyzing the parameters of the models, we can assess their complexity and judge whether a model is easy to implement.

From Table 5, SVM needs four parameters to be defined, indicating that it needs more extra work. The BP neural network and random forest need only two parameters each, so they are easier to implement.

Table 5. Parameters of the three models.

3.3. Combination Forecasting Model Based on the Three Models

By comparing the three models, we can determine a final result. To improve classification accuracy, we combine the forecasting results of the three models to build a new model as follows.

First, we define the probability formula of each of the three classifiers. The probability formula of the BP neural network is

$P=\left[\begin{array}{cccc}{\text{e}}^{-\frac{{E}_{11}}{2{\sigma}^{2}}}& {\text{e}}^{-\frac{{E}_{12}}{2{\sigma}^{2}}}& \cdots & {\text{e}}^{-\frac{{E}_{1m}}{2{\sigma}^{2}}}\\ {\text{e}}^{-\frac{{E}_{21}}{2{\sigma}^{2}}}& {\text{e}}^{-\frac{{E}_{22}}{2{\sigma}^{2}}}& \cdots & {\text{e}}^{-\frac{{E}_{2m}}{2{\sigma}^{2}}}\\ \vdots & \vdots & \ddots & \vdots \\ {\text{e}}^{-\frac{{E}_{p1}}{2{\sigma}^{2}}}& {\text{e}}^{-\frac{{E}_{p2}}{2{\sigma}^{2}}}& \cdots & {\text{e}}^{-\frac{{E}_{pm}}{2{\sigma}^{2}}}\end{array}\right]=\left[\begin{array}{cccc}{P}_{11}& {P}_{12}& \cdots & {P}_{1m}\\ {P}_{21}& {P}_{22}& \cdots & {P}_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ {P}_{p1}& {P}_{p2}& \cdots & {P}_{pm}\end{array}\right]$ (12)

where ${E}_{ij}$ refers to the Euclidean distance between the ith sample and the jth sample, and P is the initial probability matrix obtained after the Gaussian activation of the neurons in the model layer. Summing the entries of P in blocks of k columns, one block per class, gives

$S=\left[\begin{array}{cccc}{\displaystyle \underset{l=1}{\overset{k}{\sum}}{P}_{1l}}& {\displaystyle \underset{l=k+1}{\overset{2k}{\sum}}{P}_{1l}}& \cdots & {\displaystyle \underset{l=m-k+1}{\overset{m}{\sum}}{P}_{1l}}\\ {\displaystyle \underset{l=1}{\overset{k}{\sum}}{P}_{2l}}& {\displaystyle \underset{l=k+1}{\overset{2k}{\sum}}{P}_{2l}}& \cdots & {\displaystyle \underset{l=m-k+1}{\overset{m}{\sum}}{P}_{2l}}\\ \vdots & \vdots & \ddots & \vdots \\ {\displaystyle \underset{l=1}{\overset{k}{\sum}}{P}_{pl}}& {\displaystyle \underset{l=k+1}{\overset{2k}{\sum}}{P}_{pl}}& \cdots & {\displaystyle \underset{l=m-k+1}{\overset{m}{\sum}}{P}_{pl}}\end{array}\right]=\left[\begin{array}{cccc}{S}_{11}& {S}_{12}& \cdots & {S}_{1c}\\ {S}_{21}& {S}_{22}& \cdots & {S}_{2c}\\ \vdots & \vdots & \ddots & \vdots \\ {S}_{p1}& {S}_{p2}& \cdots & {S}_{pc}\end{array}\right]$ (13)

where ${S}_{ij}$ , produced by the summation layer of the neural network, is the initial probability that the ith sample belongs to the jth class.

Normalizing over the c classes, we have

${p}_{ij}=\frac{{S}_{ij}}{{\displaystyle \underset{l=1}{\overset{c}{\sum}}{S}_{il}}}$ (14)
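For a single sample, the chain of Equations (12)-(14) can be sketched as follows: Gaussian activation of the distances, summation in blocks of k columns per class, and normalization. The function name, the default σ = 1 and the toy distances are assumptions for illustration; the exponent follows Equation (12) as written.

```python
import math

def class_probabilities(distances, k, sigma=1.0):
    """Class probabilities of one sample from its row of distances E_i1..E_im.

    Columns are assumed to be grouped in consecutive blocks of k per class.
    """
    p = [math.exp(-e / (2.0 * sigma ** 2)) for e in distances]       # Equation (12)
    s = [sum(p[start:start + k]) for start in range(0, len(p), k)]   # Equation (13)
    total = sum(s)
    return [value / total for value in s]                            # Equation (14)

# Toy rows with m = 4 columns and k = 2 columns per class (two classes)
probs_equal = class_probabilities([0.0, 0.0, 0.0, 0.0], k=2)
probs_near_first = class_probabilities([0.1, 0.2, 3.0, 4.0], k=2)
```

A sample that is close to the columns of the first class, as in the second toy row, receives a larger probability for that class.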

Similarly, the probability formula of SVM is

$Pr\left(y=1|x\right)\approx {P}_{A,B}\left(f\left(x\right)\right)\equiv \frac{1}{1+\mathrm{exp}\left(Af\left(x\right)+B\right)}$ (15)
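Equation (15) maps the SVM decision value $f\left(x\right)$ to a probability through a sigmoid with two parameters A and B, which are normally fitted to data. The sketch below only evaluates the formula; the function name and the values of A and B are hypothetical and are not the values fitted in this paper.

```python
import math

def platt_probability(decision_value, A, B):
    """Pr(y = 1 | x) from Equation (15) for a given SVM decision value f(x)."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

# Hypothetical parameters: a negative A makes larger decision values more probable
A_toy, B_toy = -1.0, 0.0
p_on_boundary = platt_probability(0.0, A_toy, B_toy)
p_positive_side = platt_probability(2.0, A_toy, B_toy)
```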

As for random forest, the formula is

${P}_{i0}=\frac{{n}_{i0}}{ntree}$ (16)

where ${n}_{i0}$ is the number of trees that classify the ith sample as class 0. The predicted class ${P}_{i}$ of the ith sample is decided as follows:

$\begin{array}{l}{P}_{i0}>{P}_{i1}\Rightarrow {P}_{i}=0\\ {P}_{i1}>{P}_{i0}\Rightarrow {P}_{i}=1\end{array}$ (17)

In the combination model, the probabilities are summed over the three models:

$\{\begin{array}{c}{P}_{i0}={\displaystyle \underset{j=1}{\overset{3}{\sum}}{p}_{ij0}}\\ {P}_{i\text{1}}={\displaystyle \underset{j=1}{\overset{3}{\sum}}{p}_{ij\text{1}}}\end{array}\text{}i=1,2,\cdots ,n$ (18)

where i indexes the samples, and j is the code of the model (if $j=1$ , the model is the BP neural network; if $j=2$ , the model is SVM; if $j=3$ , the model is random forest).

Moreover, ${P}_{i0}$ is the total probability of the ith sample to fall ill, ${P}_{i1}$ is the total probability of the ith sample to be healthy, and ${p}_{ij0}$ is the probability of the ith sample to fall ill under the jth algorithm.
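The combination rule of Equations (16)-(18) can be sketched for one sample as follows: each model contributes a probability for class 0 and class 1, the probabilities are summed per class, and the class with the larger total is chosen. The function names and the toy probabilities are hypothetical; assigning class 1 when the two totals are equal is an assumption, since Equation (17) does not cover ties.

```python
def forest_probability(n_votes_class0, ntree):
    """Equation (16): share of trees that classify the sample as class 0."""
    return n_votes_class0 / ntree

def combine(per_model_probs):
    """Combine (p_class0, p_class1) pairs from the three models for one sample.

    Returns the predicted class and the two totals of Equation (18).
    """
    total0 = sum(p0 for p0, _ in per_model_probs)
    total1 = sum(p1 for _, p1 in per_model_probs)
    label = 0 if total0 > total1 else 1    # Equation (17); ties go to class 1 (assumption)
    return label, total0, total1

# Toy probabilities for one sample: BP neural network, SVM, random forest
p_rf0 = forest_probability(60, 80)         # 60 of 80 trees vote for class 0
toy_probs = [(0.40, 0.60), (0.45, 0.55), (p_rf0, 1.0 - p_rf0)]
label, total0, total1 = combine(toy_probs)
```

In this toy case the BP neural network and SVM lean slightly toward class 1, but the confident random forest vote shifts the total toward class 0, which illustrates how the combination weighs confidence rather than counting one vote per model.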

Following the steps above, we apply the combined algorithm and evaluate its accuracy; the result is shown in Figure 11.

From Figure 11, the error rate of the combination model is around 0.7%, which is lower than that of any of the three machine learning methods used alone.

Figure 11. Error rate of combination model.

Table 6. Generalization errors of the models.

3.4. Accuracy of the Four Methods

In the previous section, we derived the error rate of every method. We now compare the accuracy of the three machine learning models and the combination forecasting model. By comparing the generalization error between the models, we can evaluate whether a model is accurate in forecasting AD. The generalization errors are shown in Table 6.

From Table 6, the combination forecasting model improves the accuracy significantly.

4. Conclusions and Suggestions

As shown in the previous sections, the three methods can all be applied well to predicting Alzheimer’s Disease. By adjusting the parameters, the classification results of the three methods are all improved. Among them, the random forest model is the best method, followed by SVM, and the BP neural network performs worst of the three. Moreover, by building a new combination forecasting model based on the three machine learning methods, the error rate decreases further. However, this paper mainly discusses accuracy without considering the complexity of the algorithms, which we plan to study in the future.

As for lifestyle, we suggest that elderly people exercise more to help avoid AD. Taking part in group activities is also a good idea. Above all, AD is an intractable problem that is very difficult to solve, so the best we can do is to try to avoid it.

Acknowledgements

This paper is financially supported by the National Students Innovation and Entrepreneurship Training Program, Wuhan University of Technology, China (No. 20171049714004).

References

[1] Alzheimer’s Association (AA) (2012) Alzheimer’s Disease Facts and Figures. Alzheimer’s & Dementia, 8, 131-168.

https://doi.org/10.1016/j.jalz.2012.02.001

[2] Chang, C.-C., et al. (2012) Smoking, Death, and Alzheimer Disease: A Case of Competing Risks. Alzheimer Disease and Associated Disorders, 26, 300-306.

https://doi.org/10.1097/WAD.0b013e3182420b6e

[3] Reitz, C., et al. (2007) Relation between Smoking and Risk of Dementia and Alzheimer Disease: The Rotterdam Study. Neurology, 69, 998-1005.

https://doi.org/10.1212/01.wnl.0000271395.29695.9a

[4] Kumar, A. and Singh, T.R. (2017) A New Decision Tree to Solve the Puzzle of Alzheimer’s Disease Pathogenesis through Standard Diagnosis Scoring System. Interdisciplinary Sciences: Computational Life Sciences, 9, 107-115.

https://doi.org/10.1007/s12539-016-0144-0

[5] Illan, I.A., Gorriz, J.M., et al. (2011) Computer Aided Diagnosis of Alzheimer’s Disease Using Component Based SVM. Applied Soft Computing, 11, 2376-2382.

https://doi.org/10.1016/j.asoc.2010.08.019

[6] Hornik, K., Stinchcombe, M. and White, H. (1989) Multilayer Feedforward Networks Are Universal Approximators. Neural Networks, 2, 359-366.

[7] Zhang, Y.D. (2009) Stock Market Prediction of S&P 500 via Combination of Improved BCO Approach and BP Neural Network. Expert Systems with Applications, 36, 8849-8854.

https://doi.org/10.1016/j.eswa.2008.11.028

[8] Peker, M. (2016) A Decision Support System to Improve Medical Diagnosis Using a Combination of k-Medoids Clustering Based Attribute Weighting and SVM. Journal of Medical Systems, 40, 116-132.