Alzheimer’s disease causes memory loss, and it is not a normal part of aging. Among the leading causes of death in the United States, it is reported to be the only one that cannot be prevented, cured, or even slowed. The 2018 Alzheimer’s Association report shows that deaths from Alzheimer’s disease have increased significantly, while deaths from other major causes in the United States have decreased. The bar chart in Figure 1 shows the percentage changes in the top causes of death between 2000 and 2015. As we can see, the number of deaths from heart disease, the number one cause of death in the United States, decreased by 11%; however, recorded deaths from Alzheimer’s disease increased by 123%.
Figure 1. Percentage changes in selected causes of death between 2000 and 2015. Source: 2018 Alzheimer’s Disease Facts and Figures.
Whereas 90% of cancer patients are aware of their diagnosis, only 45% of people with Alzheimer’s are aware of theirs. Thus, researchers and doctors are working to develop a diagnostic pattern for Alzheimer’s disease that helps detect the disease early, before symptoms worsen. Different types of tests, including neuropsychological tests, blood tests, cerebrospinal fluid analysis, and brain imaging, have been used to help understand and diagnose this severe disease. Neuropsychological tests assess brain function in a number of areas, including attention, problem-solving, memory, language, mood, and behavior. Commonly used test tools include the Mini-Mental State Examination (MMSE) and the Clinical Dementia Rating (CDR).
Brain imaging is used to detect brain changes caused by Alzheimer’s disease, that is, to detect the levels of plaques and tangles, the two types of abnormality in the brain associated with the presence of Alzheimer’s. Plaques are found between the dying cells in the brain and arise from the buildup of a protein called beta-amyloid; tangles are twisted fibers within the dying cells and arise from another protein called tau. Beta-amyloid and tau are protein fragments that the body normally produces, but in Alzheimer’s the proteins are abnormal.
Cerebrospinal fluid (CSF) analysis involves collecting the clear fluid that protects and surrounds the brain and spinal cord to determine the levels of beta-amyloid, total tau (T-tau), and phosphorylated tau (P-tau) proteins. Since CSF is in direct contact with the brain and spine, collecting a sample of the fluid can be a useful diagnostic tool for this neurodegenerative disease.
The primary goal of the present study is to develop the best statistical model to correctly predict Alzheimer’s patients from their demographic, CSF, laboratory, and brain imaging factors using a logistic regression model. This model will allow us to accurately evaluate the probability that a patient is diagnosed with Alzheimer’s disease. Moreover, we can rank the significant contributing risk factors based on their relative importance to the response. Hence, medical doctors can use our proposed data-driven model as a decision-support tool before starting any treatment.
2. The Data
In the present study, we used data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The primary goal of ADNI is to detect and track the progression of Alzheimer’s disease by combining clinical, imaging, genetic, and biological markers of participants to help researchers and doctors develop new treatments. For more information about ADNI, visit http://adni.loni.usc.edu.
Our data consist of 169 subjects ranging in age from 58 to 94 years. We have information about their demographic characteristics, neuropsychological tests, laboratory data, cerebrospinal fluid analysis, and brain imaging data. Figure 2 below gives a detailed overview of our data.
In the cerebrospinal fluid analysis, we have the concentrations of P-tau and beta-amyloid in picograms per milliliter (pg/mL). The laboratory data consist of the levels of vitamin B12 in nanograms per milliliter (ng/mL), thyroid stimulating hormone in milliunits per liter (mU/L), hemoglobin in grams per deciliter (g/dL), and cholesterol in milligrams per deciliter (mg/dL), all of which have been linked to Alzheimer’s disease.
Figure 2. Schematic diagram of the data.
The MRI scan data include measures of total brain volume, whole brain gray matter volume, whole brain white matter volume, and intracranial volume.
Our response in this analysis is the status of the participants as cognitively normal individuals (CN) or individuals with Alzheimer’s disease (AD), based on the SPARE-AD score (Spatial Pattern of Abnormalities for Recognition of Early AD). SPARE-AD is an imaging analysis of the spatial patterns of brain atrophy that distinguishes individuals with AD from CN individuals. Positive values indicate the presence of Alzheimer’s disease, and negative values indicate a normal pattern of brain structure.
Comparison of the Probability of Males and Females Being Diagnosed with Alzheimer’s Disease
Several studies have mentioned that women are more likely than men to be identified with Alzheimer’s disease. We proceed to investigate this issue by addressing the following question:
● Are males and females equally likely to be diagnosed with Alzheimer’s disease?
To answer this question, we used a hypothesis test to determine whether the difference between the two proportions is significant. That is, we tested the hypothesis $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$, where $p_1$ is the proportion of males with AD and $p_2$ is the proportion of females with AD. A p-value of 0.7951 indicates that, at the 5% level of significance, there is no statistically significant difference between the percentages of males and females diagnosed with Alzheimer’s disease.
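The test of two proportions described above can be sketched in a few lines. This is a minimal illustration that assumes a pooled two-sample z-test; the text does not say which variant of the test was used, and the counts below are hypothetical, not the ADNI data.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (not taken from the study).
z, p_value = two_proportion_z_test(x1=40, n1=90, x2=37, n2=79)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A p-value above 0.05 would lead to the same conclusion as in the text: no evidence of a difference between the two proportions at the 5% level.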
3. Statistical Method
For our analysis, we used multiple logistic regression to predict the status of the patients as CN or AD. Logistic regression is a method used to describe and explain the relationship between a binary response and the statistically significant risk factors. It can answer questions such as: do age, body weight, vitamin B12, cholesterol level, tau, and beta-amyloid proteins influence the probability of having Alzheimer’s disease?
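Mathematically, let $Y$ be the binary response, with possible outcomes 1 (“AD”) and 0 (“CN”). The distribution of $Y$ is specified by the probability $P(Y=1)=\pi$ of AD and $P(Y=0)=1-\pi$ of CN, where $E(Y)=\pi$ is the mean of $Y$. Let $\pi(x_i)=P(Y=1 \mid x_i)$ denote the probability of selecting an AD patient given the risk factors $x_i$. The logistic regression model has a linear form for the logit of this probability, defined as

$$\operatorname{logit}[\pi(x_i)] = \log\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},$$

where $\beta_j$ is the coefficient of the $j$th risk factor, $x_{ij}$ is the $i$th observed value of risk factor $j$, and $\pi(x_i)/(1-\pi(x_i))$ is the odds, which expresses the ratio of the probability of predicting an AD patient to the probability of CN.
The logistic regression model implies the following analytic form for the probability of selecting an AD patient given the risk factors:

$$\pi(x_i) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)}{1+\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)}.$$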
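The relationship between the linear predictor, the logit, and the predicted probability can be illustrated with a short sketch. The intercept, coefficients, and predictor values below are hypothetical placeholders chosen for illustration; they are not the fitted values from this study.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def predict_probability(intercept, coefficients, x):
    """Inverse logit of the linear predictor b0 + sum_j b_j * x_j."""
    eta = intercept + sum(b * v for b, v in zip(coefficients, x))
    # Numerically stable form of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1 + e)


# Hypothetical coefficients and one hypothetical observation.
prob = predict_probability(intercept=-1.0, coefficients=[0.5, -0.2], x=[3.0, 1.0])
print(f"P(AD | x) = {prob:.3f}, logit = {logit(prob):.3f}")
```

Applying `logit` to the predicted probability recovers the linear predictor, which is the sense in which the model is linear on the log-odds scale.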
4. Implementation of the Multiple Logistic Model
We partitioned our data set into training and testing sets containing 75% and 25% of the data, respectively. We started with the full logistic regression model, which includes all predictors and their possible interactions. Our logistic model with all independent variables and their possible interactions to predict whether the patient has Alzheimer’s disease is given by:

$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \sum_{j} \beta_j X_j,$$

where $P$ denotes the probability of selecting an AD patient, the $\beta_j$’s denote the coefficients, and the $X_j$’s are the risk factors and their possible interactions. We used a backward elimination algorithm that removes the term in the complex model with the largest p-value and stops when any further elimination leads to a poor fit. In addition, we used the minimum AIC (Akaike information criterion), which judges the quality of a model by how close its fitted values are to the true expected values; that is, we selected the statistical predictive model that minimizes

$$\mathrm{AIC} = -2\log L + 2k,$$

where $L$ is the maximized value of the likelihood and $k$ is the number of parameters in the model. Thus, our optimal data-driven statistical logistic model that predicts the patient’s condition with minimum AIC is given by:
Terms of the form (A × B) denote the interaction between A and B. As we can see from our proposed model, six risk factors and only two interaction terms contribute statistically significantly to the prediction of the patient’s condition, namely, phosphorylated tau protein (P-tau), beta-amyloid protein, thyroid stimulating hormone, vitamin B12, cholesterol, hemoglobin, and the interactions (cholesterol × hemoglobin) and (thyroid stimulating hormone × hemoglobin). Furthermore, age is not one of the significant risk factors in our optimal predictive model, which is consistent with the view that Alzheimer’s disease is not part of normal aging.
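The model selection step can be sketched as a greedy backward elimination driven by AIC. This is an illustrative variant, not the authors’ exact procedure: the text removes the term with the largest p-value, whereas the sketch below drops whichever term lowers AIC the most. The callable `fit_log_likelihood` and the toy log-likelihood are hypothetical stand-ins for an actual model fit.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_likelihood


def backward_eliminate(terms, fit_log_likelihood):
    """Drop one term at a time for as long as doing so lowers the AIC.

    fit_log_likelihood(terms) is a hypothetical callable that returns the
    maximized log-likelihood of the model containing the given terms.
    """
    current = list(terms)
    best = aic(fit_log_likelihood(current), len(current) + 1)  # +1 for intercept
    while current:
        candidates = []
        for term in current:
            reduced = [t for t in current if t != term]
            score = aic(fit_log_likelihood(reduced), len(reduced) + 1)
            candidates.append((score, reduced))
        candidate_aic, candidate_terms = min(candidates, key=lambda c: c[0])
        if candidate_aic >= best:
            break  # no single removal improves the AIC
        best, current = candidate_aic, candidate_terms
    return current, best


# Toy log-likelihood: each term adds a fixed, made-up amount of fit.
toy_gain = {"ptau": 10.0, "abeta": 8.0, "age": 0.2}


def toy_fit(terms):
    return -100.0 + sum(toy_gain[t] for t in terms)


selected, selected_aic = backward_eliminate(["ptau", "abeta", "age"], toy_fit)
print(selected, selected_aic)
```

In the toy example the weakly informative term is removed because the small loss in log-likelihood is outweighed by the penalty of 2 per parameter.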
The coefficients in the logistic regression indicate the change in the expected log odds for a one-unit change in $X_j$, holding all other predictors constant. Thus, the positive coefficient (0.170) of P-tau protein means that as the P-tau protein level increases, the odds of the participant being diagnosed with AD increase, holding all other variables constant. Alternatively, we can use the odds ratio $e^{0.170} \approx 1.185$: with all other predictors unchanged, every unit increase in the P-tau protein multiplies the odds of being an Alzheimer’s patient by a factor of about 1.185.
Similarly, the negative coefficient (−0.003) of beta-amyloid protein means that as the beta-amyloid protein level decreases, the odds of the participant being diagnosed with AD increase, holding all other variables constant. In terms of the odds ratio $e^{-0.003} \approx 0.997$, with all other predictors unchanged, every unit increase in the beta-amyloid protein multiplies the odds of being an Alzheimer’s patient by 0.997; equivalently, every unit decrease multiplies the odds by about $1/0.997 \approx 1.003$.
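The odds ratios follow directly from exponentiating the coefficients. The short check below uses only the two coefficients quoted in the text (0.170 for P-tau and −0.003 for beta-amyloid); the dictionary keys are illustrative labels.

```python
import math

# Coefficients quoted in the text for the two CSF proteins.
coefficients = {"P-tau": 0.170, "beta-amyloid": -0.003}

# Odds ratio per one-unit increase in each predictor.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: per-unit increase {ratio:.3f}, per-unit decrease {1 / ratio:.3f}")
# P-tau: per-unit increase 1.185, per-unit decrease 0.844
# beta-amyloid: per-unit increase 0.997, per-unit decrease 1.003
```

An odds ratio above 1 means the odds of AD rise as the predictor increases, and a ratio below 1 means they fall.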
To evaluate our optimal predictive model, we used classification accuracy, sensitivity, specificity, and area under the curve (AUC) on the testing data. The proportion of AD and CN participants correctly identified by the multiple logistic model is called “accuracy”. The proportion of actual Alzheimer’s patients who are correctly identified by our predictive model as having the disease is known as “sensitivity”, and the proportion of actual cognitively normal individuals who are correctly identified by the model is known as “specificity”. A perfect predictive model would be 100% sensitive (that is, it would predict everyone in the Alzheimer’s disease group as having Alzheimer’s) and 100% specific (that is, it would predict every normal individual as cognitively normal). For any test, however, there is usually a trade-off between these two measures, which can be explored graphically using the receiver operating characteristic (ROC) curve.
We used the confusion matrix of the testing data to get the values needed to assess the model. The confusion matrix is a classification table that describes how well our multiple logistic regression model does in distinguishing Alzheimer’s patients from cognitively normal individuals. Table 1 shows an illustration of the confusion matrix that we used to evaluate our proposed model on the test data. The four outcomes that make up the table are true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of Alzheimer’s patients correctly identified as sick, and TN is the number of normal individuals correctly classified as healthy. FP is the number of healthy people incorrectly identified as sick, and FN is the number of Alzheimer’s cases predicted incorrectly by our model as healthy individuals.
Table 1. The confusion matrix.
Using the confusion matrix, we computed the model accuracy, $(TP+TN)/(TP+TN+FP+FN)$. The model correctly predicts 78.26% of all Alzheimer’s disease cases (sensitivity $= TP/(TP+FN)$) and correctly identifies 83.33% of those who do not have Alzheimer’s disease (specificity $= TN/(TN+FP)$). A summary of our classification results is given in Table 2 below.
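The three measures can be computed directly from the four cells of a confusion matrix, as in the sketch below. The counts are hypothetical and chosen only for illustration; they are not the counts from Table 1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts for illustration only (not the study's test data).
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```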
Another method to evaluate our model graphically is the receiver operating characteristic (ROC). Each point on the ROC curve represents a (sensitivity, 1-specificity) pair corresponding to a different decision cut-off point. The area under the ROC curve (AUC) is a measure of how well the model can distinguish between two diagnostic groups. For our proposed model, the AUC value is 87.68% which implies that our model does well in discriminating between the two classes of the patient’s condition. Figure 3 represents the receiver operating characteristic curve with the corresponding AUC value. After a careful investigation of our results, we can conclude that our predictive model provides a good prediction of the patient’s condition.
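One way to see what the AUC measures is to compute it as the proportion of (AD, CN) pairs in which the AD case receives the higher predicted probability, with ties counted as one half. The sketch below uses this rank-based definition, which is equivalent to the area under the empirical ROC curve; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = AD, 0 = CN) and predicted probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.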
Table 2. Classification summary of the multiple logistic regression model.
Figure 3. The receiver operating characteristic curve.
After validating our proposed model, we need to rank the risk factors in terms of their importance to the Alzheimer’s diagnosis. We identified the relative importance of the risk factors by the absolute value of their standardized coefficients (weights) and by the pseudo partial correlation. For the standardized coefficients, a higher absolute value points to a greater strength of association with the Alzheimer’s diagnosis. The standardized weight is defined as:

$$\beta_i^{*} = \frac{\hat{\beta}_i \, s_i}{s},$$

where $\hat{\beta}_i$ is the estimated coefficient (weight) for predictor $i$, $s_i$ is the sample standard deviation for predictor $i$, and $s = \pi/\sqrt{3}$.
The pseudo partial correlation is given by:

$$r_i = \pm\sqrt{\frac{W_i - 2K}{-2LL_{(0)}}},$$

where $W_i$ is the Wald chi-square statistic for predictor $i$, $K$ is the degrees of freedom of predictor $i$, and $LL_{(0)}$ is the log-likelihood of the model with only the intercept term. The closer the value is to 1 or −1, the stronger the association between a predictor and the outcome.
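Both ranking measures are simple to compute once the model has been fitted. The sketch below assumes the common convention in which the standardized weight divides by $\pi/\sqrt{3}$, the standard deviation of the standard logistic distribution, and in which the pseudo partial correlation takes the sign of the coefficient and is set to zero when the Wald statistic is smaller than twice the degrees of freedom. These conventions, and all input values, are assumptions made for illustration.

```python
import math

# Standard deviation of the standard logistic distribution (assumed scaling).
LOGISTIC_SD = math.pi / math.sqrt(3)


def standardized_coefficient(beta, predictor_sd):
    """Standardized weight: beta * s_i / (pi / sqrt(3))."""
    return beta * predictor_sd / LOGISTIC_SD


def pseudo_partial_correlation(beta, wald_chi2, df, null_log_likelihood):
    """Signed sqrt((Wald - 2 * df) / (-2 * LL0)); zero if Wald < 2 * df."""
    numerator = wald_chi2 - 2 * df
    if numerator <= 0:
        return 0.0
    r = math.sqrt(numerator / (-2 * null_log_likelihood))
    return math.copysign(r, beta)


# Hypothetical inputs for a single predictor.
print(standardized_coefficient(beta=0.170, predictor_sd=12.0))
print(pseudo_partial_correlation(beta=-0.5, wald_chi2=10.0, df=1, null_log_likelihood=-50.0))
```

Ranking the predictors by the absolute value of either quantity gives an ordering of relative importance.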
The relative importance of the significantly contributing risk factors in our predictive model is presented in Table 3. As can be seen, the results of the two methods are consistent, and we found that P-tau protein is the most critical factor in diagnosing Alzheimer’s disease, followed by beta-amyloid. These two proteins have been extensively studied by the author. Also, the interaction (thyroid stimulating hormone × hemoglobin) is ranked as the third most significant predictor, ahead of the level of thyroid stimulating hormone alone and hemoglobin alone, which are ranked fourth and eighth, respectively.
Table 3. Relative importance of the risk factors.
Knowing the causes of a disease helps in finding the best way to cure it. While several top causes of death are decreasing, Alzheimer’s deaths are on the rise. Thus, in the present study, we developed a statistical predictive model using multiple logistic regression to predict Alzheimer’s disease patients, selecting the relevant risk factors by backward elimination. We found that six risk factors and only two interaction terms, namely, phosphorylated tau protein (P-tau), beta-amyloid protein, thyroid stimulating hormone, vitamin B12, cholesterol, hemoglobin, and the interactions (cholesterol × hemoglobin) and (thyroid stimulating hormone × hemoglobin), contributed significantly to Alzheimer’s disease.
We evaluated the quality of the proposed model by classification accuracy, sensitivity, specificity, and area under the curve, the results of which attested to the effectiveness of the model. Then, we examined the relationship between the response and the significantly contributing predictors and ranked them based on their standardized coefficients. Once defined and ranked, the statistically significant risk factors can be useful as a screening tool to discriminate Alzheimer’s disease patients from cognitively normal individuals.
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
 Alzheimer’s Association (2018) 2018 Alzheimer’s Disease Facts and Figures. Includes a Special Report on the Financial and Personal Benefits of Early Diagnosis. Alzheimer’s & Dementia, 14, 367-429.
 Davatzikos, C., Xu, F., An, Y., Fan, Y., and Resnick, S.M. (2009) Longitudinal Progression of Alzheimer’s-Like Patterns of Atrophy in Normal Older Adults: The Spare-AD Index. Brain, 132, 2026-2035.
 Davatzikos, C., Bhatt, P., Shaw, L.M., Batmanghelich, K.N. and Trojanowski, J.Q. (2011) Prediction of MCI to AD Conversion, via MRI, CSF Biomarkers, and Pattern Classification. Neurobiology of Aging, 32, 2322.e19-2322.e27.
 Chapman, R.M., et al. (2011) Women Have Farther to Fall: Gender Differences between Normal Elderly and Alzheimer’s Disease in Verbal Memory Engender Better Detection of Alzheimer’s Disease in Women. Journal of the International Neuropsychological Society, 17, 654-662.
 Thompson, D., Wi, M. and Health, A. (2009) LR, Ranking Predictors in Logistic Regression. Paper D10-2009, Assurant Health, West Michigan, 1-13.
 Bhatti, I.P., Lohano, H.D., Pirzado, Z.A. and Jafri, I.A. (2006) A Logistic Regression Analysis of the Ischemic Heart Disease Risk. Journal of Applied Sciences, 6, 785-788.