Cancer is one of a variety of diseases with a high mortality rate, and its incidence is increasing as environmental pollution worsens. Because the number of cases is rising rapidly, it can be regarded as an epidemic. The factor with the greatest impact on mortality is the stage of the cancer at the time of detection: if cancer is detected early, the survival rate may be improved. Current clinical methods often detect a tumor only after invasion has already taken place. Hence, diagnostic methods that can detect cancer development at a pre-malignant stage are needed.
A survey of the literature shows that much research on cancer diagnosis is based on body tissues such as skin, stomach and breast. Human serum also contains rich information that can provide important clues for cancer diagnosis and treatment. Compared with tissue and cell samples, serum samples are simple, fast and easy to collect and measure, and sampling causes little harm to patients. At present, serum spectra are measured by mass spectrometry, Raman spectroscopy, infrared spectroscopy, fluorescence spectroscopy and related techniques. Cancer is usually identified by observing and comparing differences in the spectra and spectral parameters (peak position, peak height, peak area, peak shape, etc.) between healthy serum and cancer serum. However, because this traditional approach is limited, complex and subjective, it is hard to obtain objective and accurate results. When the information in such high-throughput spectra has to be mined and studied systematically, chemoinformatics or chemometrics methods are helpful and necessary.
Raman scattering was discovered by C. V. Raman in studies of light scattering by water molecules. In recent years, Raman spectroscopy has developed significantly in many areas, such as cancer diagnosis, gem identification and agriculture, and it is being applied more and more in the diagnosis of cancer. Raman spectra are obtained by pointing a laser beam at a sample, which excites molecules in the sample, and observing the scattered light. Inelastic scattering produces wavenumber shifts in the scattered light that depend on the molecular structures present in the sample. Thus, Raman spectra hold useful information on the different chemical compounds. For example, when the difference between cancer cells and normal tissues was analyzed by Raman spectroscopy, it was found that the content of nucleic acid in cancer tissue was high, the content of most proteins was low, and the content of carbohydrate and lipid was also low. Because the Raman scattering of water is very weak while other biological substances give rich Raman information, Raman spectroscopy is well suited to the detection of serum, which contains a large amount of water. At present, studies of cancer by Raman spectroscopy combined with chemometrics methods focus mostly on human tissue, and there is little research on serum.
This study is based on serum Raman spectroscopy data of lung cancer patients and healthy people. Pattern recognition methods of chemometrics, including principal component analysis (PCA), non-negative matrix factorization (NMF), partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis (LDA) and uncorrelated linear discriminant analysis (ULDA), were used to distinguish lung cancer patients from healthy people. The results show that ULDA or LDA combined with multiple scatter correction (MSC) could accurately distinguish lung cancer patients from healthy people. This work provides a new approach to the identification of lung cancer, which has academic significance and promising clinical application value.
2. Theories and Algorithms
2.1. Multiple Scatter Correction (MSC)
Spectral pretreatment methods can be adopted to reduce the effects of collinearity, dimensionality, background, noise and baseline drift in spectra. Multiple scatter correction (MSC) is an efficient data-processing method in spectral analysis that can be used to eliminate scattering effects and enhance the spectral information related to the components.
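The correction described above can be illustrated with a minimal sketch. It assumes the common formulation of MSC (often called multiplicative scatter correction), in which each spectrum is regressed on a reference spectrum and then corrected by the fitted offset and slope. The choice of the mean spectrum as reference, the function name `msc` and the toy data are assumptions for illustration, not details taken from this study.

```python
import numpy as np


def msc(spectra, reference=None):
    """Scatter-correct spectra stored one per row.

    Each spectrum x is modelled as x ~ offset + slope * reference and
    returned as (x - offset) / slope.
    """
    spectra = np.asarray(spectra, dtype=float)
    # Assumption: the mean spectrum of the set is used as the reference.
    if reference is None:
        reference = spectra.mean(axis=0)
    reference = np.asarray(reference, dtype=float)

    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Ordinary least-squares line through the points (reference, x).
        slope, offset = np.polyfit(reference, x, deg=1)
        corrected[i] = (x - offset) / slope
    return corrected, reference


# Hypothetical toy data: one spectral shape with sample-specific offset and gain.
shape = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
toy_spectra = np.vstack([0.5 + 1.0 * shape, 2.0 + 3.0 * shape, -1.0 + 0.5 * shape])
toy_corrected, toy_reference = msc(toy_spectra)
```

In this toy set the rows differ only by offset and gain, so every corrected row coincides with the mean spectrum. Real spectra also differ in chemical content, and those differences remain after the correction.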
2.2. Principal Component Analysis (PCA) and Non-Negative Matrix Factorization (NMF)
Principal component analysis (PCA) is one of the most commonly used unsupervised recognition methods, meaning that it requires no prior knowledge about the group to which samples belong. PCA extracts the dominant patterns in a data matrix and represents them as new orthogonal variables (principal components); the patterns of similarity among the observations and among the variables can then be displayed as points in maps. PCA is often used to transform high-dimensional data into lower-dimensional data while retaining their main characteristics.
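As a hedged illustration of this projection, the sketch below computes PCA scores through a singular value decomposition of the column-centered data matrix. The function name `pca_scores` and the random demonstration matrix are hypothetical, and the sketch is not the program used in this study.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Return scores, loadings and explained-variance ratios of a PCA."""
    X = np.asarray(X, dtype=float)
    # Center each variable (column) before the decomposition.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates
    loadings = Vt[:n_components].T                   # orthonormal directions
    explained = s ** 2 / np.sum(s ** 2)              # variance fraction per PC
    return scores, loadings, explained[:n_components]


# Hypothetical data: 20 samples with 50 variables each.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 50))
demo_scores, demo_loadings, demo_ratio = pca_scores(demo)
```

Plotting the first two columns of `demo_scores` gives the kind of score map that is used to look for clustering of samples.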
Non-negative matrix factorization (NMF) is another unsupervised recognition method. It factorizes a large non-negative matrix into two smaller matrices that contain no negative elements. Non-negativity makes the matrices easier to inspect and is physically meaningful for data in real applications.
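The factorization can be sketched as follows. The sketch assumes the widely used multiplicative update rules for the squared-error objective. The source does not state which NMF algorithm was applied, so the update scheme, the function name `nmf` and the toy matrix are illustrative assumptions.

```python
import numpy as np


def nmf(V, rank=2, n_iter=500, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (n x m) as W (n x rank) @ H (rank x m)."""
    V = np.asarray(V, dtype=float)
    if np.any(V < 0):
        raise ValueError("NMF requires a non-negative input matrix")
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy matrix built from two non-negative factors.
rng = np.random.default_rng(0)
V = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

With samples as rows, the columns of `W` play the role of the factor values that are plotted against the sample index.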
2.3. Partial Least Squares-Discriminant Analysis (PLS-DA)
Partial least squares-discriminant analysis (PLS-DA) consists of a classical PLS regression in which the response variable is categorical and expresses class membership. It is a supervised method: it uses prior knowledge about the group to which each sample belongs. Instead of finding hyperplanes of maximum variance between the response and the independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space.
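A minimal sketch of PLS-DA is given below under stated assumptions: the class labels are dummy coded as [1, 0] and [0, 1], the latent factors are extracted one at a time with deflation of both blocks, and a sample is assigned to the class with the larger predicted response. The function names, the simulated data and the choice of two latent factors are hypothetical and do not reproduce the model of this study.

```python
import numpy as np


def plsda_fit(X, Y, n_components):
    """Fit a PLS regression of dummy-coded classes Y on X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    E, F = X - x_mean, Y - y_mean
    weights, x_loads, y_loads = [], [], []
    for _ in range(n_components):
        # Weight vector: X-direction with maximum covariance with Y.
        w = np.linalg.svd(E.T @ F, full_matrices=False)[0][:, 0]
        t = E @ w                      # latent scores
        tt = t @ t
        p = E.T @ t / tt               # X loadings
        q = F.T @ t / tt               # Y loadings
        E = E - np.outer(t, p)         # deflate both blocks
        F = F - np.outer(t, q)
        weights.append(w)
        x_loads.append(p)
        y_loads.append(q)
    W = np.column_stack(weights)
    P = np.column_stack(x_loads)
    Q = np.column_stack(y_loads)
    coef = W @ np.linalg.solve(P.T @ W, Q.T)   # regression coefficients
    return coef, x_mean, y_mean


def plsda_predict(X, coef, x_mean, y_mean):
    """Return predicted responses and the index of the winning class."""
    Y_hat = (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean
    return Y_hat, Y_hat.argmax(axis=1)


# Hypothetical data: two groups of 10 samples that differ by a mean shift.
rng = np.random.default_rng(0)
group_a = rng.normal(size=(10, 30)) + 1.5
group_b = rng.normal(size=(10, 30))
X = np.vstack([group_a, group_b])
Y = np.vstack([np.tile([1, 0], (10, 1)), np.tile([0, 1], (10, 1))])
coef, x_mean, y_mean = plsda_fit(X, Y, n_components=2)
Y_hat, labels = plsda_predict(X, coef, x_mean, y_mean)
accuracy = np.mean(labels == Y.argmax(axis=1))
```

The accuracy here is computed on the training samples only, which is optimistic. A separate prediction set is needed for an honest estimate.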
2.4. Linear Discriminant Analysis (LDA) and Uncorrelated Linear Discriminant Analysis (ULDA)
Linear discriminant analysis (LDA), also known as the Fisher linear discriminant (FLD), is a classical pattern recognition method that projects high-dimensional data into a low-dimensional space by maximizing the Fisher criterion. The samples are then classified according to their positions along the resulting discriminant directions.
Uncorrelated linear discriminant analysis (ULDA) is a feature extraction and dimension reduction algorithm based on LDA. It aims to maximize the separation between different classes of samples by extracting the uncorrelated discriminant vectors (UDVs) with the largest discriminant ability. Because the vectors obtained are uncorrelated, information redundancy is minimized. ULDA can obtain results that LDA cannot. ULDA was first put forward by Jin and others in the field of face recognition and has since been applied successfully to the analysis of metabolomics, proteomics and gene expression profiles. The specific steps of the ULDA algorithm are described in the literature.
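For the two-class case, the Fisher discriminant can be sketched as below. The sketch assumes that a pseudo-inverse of the within-class scatter matrix is acceptable when that matrix is singular (more variables than samples) and that the midpoint between the projected class means is used as the decision boundary. Both choices, as well as the function names and simulated data, are illustrative assumptions. With only two classes a single discriminant vector is obtained, so the sketch does not show the uncorrelatedness constraint that distinguishes ULDA.

```python
import numpy as np


def fisher_lda_fit(X, y):
    """Two-class Fisher discriminant: returns a unit vector w and a threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix summed over both classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Assumption: the pseudo-inverse stands in for the inverse if Sw is singular.
    w = np.linalg.pinv(Sw) @ (m1 - m0)
    w = w / np.linalg.norm(w)
    # Assumption: decision boundary halfway between the projected class means.
    threshold = 0.5 * (m0 @ w + m1 @ w)
    return w, threshold


def fisher_lda_predict(X, w, threshold):
    """Assign class 1 to samples whose projection exceeds the threshold."""
    return (np.asarray(X, dtype=float) @ w > threshold).astype(int)


# Hypothetical data: two groups of 20 samples with 5 variables each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 5)), rng.normal(size=(20, 5)) + 3.0])
y = np.array([0] * 20 + [1] * 20)
w, threshold = fisher_lda_fit(X, y)
predicted = fisher_lda_predict(X, w, threshold)
accuracy = np.mean(predicted == y)
```

Plotting `X @ w` against the sample index, with the threshold drawn as a horizontal line, gives a one-dimensional discriminant plot of the kind used to inspect class separation.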
3. Experiment and Data Processing
3.1. Samples and Experiments
Serum samples were provided by the 252 Hospital of the Chinese People's Liberation Army in Hebei Province, and came from 68 patients with lung cancer (stage I) and 29 healthy people. The age of the lung cancer group was (65.32 ± 10.13) years, including 40 males and 29 females. The age of the healthy group was (63.45 ± 8.32) years, including 15 males and 14 females. There was no statistically significant difference between the two groups in age or sex. On the day before blood collection, the subjects were asked not to eat greasy food or drink alcohol. The next morning, 4 ml of fasting venous blood was taken from each subject at the hospital and centrifuged at 5000 rpm for 10 minutes. The serum was then transferred into test tubes and stored in a refrigerator at −80˚C.
A HORIBA Jobin Yvon micro-confocal Raman spectrometer (LabRAM XploRA ONE) was used. The serum samples were thawed stepwise, being held at −40˚C, −20˚C, −4˚C and 24˚C (room temperature) for 30 minutes each. Then 200 μL of thawed serum was placed in a liquid quartz cuvette, and the cuvette was put horizontally on the stage. The laser wavelength, transmittance and grating were optimized one variable at a time to obtain the best experimental conditions: a laser wavelength of 532 nm, a transmittance of 50% and a grating of 600 gr/mm. Over the wavenumber range 100 - 5000 cm−1, each sample was scanned at three different angles, giving three spectra, and the average was recorded. Finally, the spectra were smoothed for further calculation.
3.2. Data Processing
The first step was to preprocess the spectral data. Different spectral preprocessing methods, including normalization, scaling, centering and MSC, were compared, and MSC was finally selected to process the original spectral data. The second step was to model the data: PLS-DA, ULDA and LDA models were built to discriminate the lung cancer group from the healthy group. 34 lung cancer samples and 14 healthy samples were used for modeling, and the remaining 34 lung cancer samples and 15 healthy samples were used for prediction.
Sensitivity and specificity are two statistical parameters used to measure the accuracy of a diagnostic model. Ideally, both should be close to 100%.
Sensitivity = true positive/(true positive + false negative) × 100%
Specificity = true negative/(true negative + false positive) × 100%
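The two formulas can be checked with a short calculation. The sketch below applies them to the prediction counts reported in Section 4 (for LDA, 30 of 34 patients and 12 of 15 healthy subjects classified correctly; for MSC-PLSDA, 18 of 34 and 5 of 15). The function names are hypothetical, and lung cancer is taken as the positive class.

```python
def sensitivity(true_positive, false_negative):
    """Percentage of patients that are classified as patients."""
    return 100.0 * true_positive / (true_positive + false_negative)


def specificity(true_negative, false_positive):
    """Percentage of healthy subjects that are classified as healthy."""
    return 100.0 * true_negative / (true_negative + false_positive)


# LDA on the prediction set: 30 of 34 patients, 12 of 15 healthy subjects.
lda_sens = round(sensitivity(30, 34 - 30), 2)
lda_spec = round(specificity(12, 15 - 12), 2)

# MSC-PLSDA on the prediction set: 18 of 34 patients, 5 of 15 healthy subjects.
plsda_sens = round(sensitivity(18, 34 - 18), 2)
plsda_spec = round(specificity(5, 15 - 5), 2)

print(lda_sens, lda_spec)      # 88.24 80.0
print(plsda_sens, plsda_spec)  # 52.94 33.33
```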
The spectral data were analyzed with self-compiled programs in Matlab 7.0.
4. Results and Discussions
Figure 1 shows the average Raman spectra of the lung cancer samples and the healthy samples. Each measured spectrum consisted of 1582 variables and their corresponding intensity values. The Raman spectra of lung cancer patients and healthy people have the same general shape, but the Raman intensities differ between the two panels. Another main difference is that relatively strong peaks attributed to beta-carotene appear around 1200 cm−1 and 1600 cm−1 in the spectra of healthy people, corresponding to the functional groups C-H, C-C and C=C. The spectra of lung cancer patients show relatively weak peaks at the corresponding wavenumbers, possibly because the beta-carotene content in the serum of patients is far lower than normal.
4.1. The Results of PCA and NMF Methods
The first two principal components (PC1 and PC2) explained most of the variance in the data set. The scores of the lung cancer and healthy groups are plotted against the sample index in Figure 2. The data points of the two groups are widely scattered and overlap, so it is hard to separate the groups. Using more principal components did not significantly improve the classification. The patients and healthy people therefore could not be distinguished by the MSC-PCA method.
Figure 1. Raman spectra of lung cancer examples (a) and healthy examples (b).
Figure 2. The classification results by MSC-PCA (a) and MSC-NMF (b).
The classification results for lung cancer patients and healthy people by MSC-NMF are also shown in Figure 2. The values of the non-negative factors NNF1 and NNF2 were plotted against the sample index. As can be seen in Figure 2, the results are similar to those of PCA, so the patients and healthy people could not be distinguished by the MSC-NMF method either.
In summary, when no prior knowledge is available about the group to which samples belong, the two common unsupervised recognition methods are not suitable for extracting diagnostic information from the data in this study. Supervised recognition methods were therefore adopted next to analyze and classify the data.
4.2. The Results of PLS-DA Method
In this study, the number of latent factors (nf) in the PLS-DA model was set to 13 after optimization over the range 5 - 15. Because PLS-DA is a supervised recognition method, the lung cancer group was labeled [1, 0] and the healthy group [0, 1] in advance. The prediction results of MSC-PLSDA are given in Figure 3. The samples were classified with 52.94% (18/34) sensitivity and 33.33% (5/15) specificity. Although the prediction results were more intuitive than those of PCA and NMF, the low sensitivity and specificity fall far short of the demands of clinical diagnosis.
PLS-DA is essentially a feature transformation in which the new variables are combinations of the original variables. Variables with large variance or high covariance may dominate the model even when they carry little or no information that helps discriminate the samples, which may cause optimal features to be lost in some cases. MSC-PLSDA therefore did not appear to be a suitable method for distinguishing the lung cancer and healthy samples from their Raman spectra.
4.3. The Results of LDA and ULDA Methods
At first, the data were not treated with MSC, and the classification results by LDA alone and ULDA alone are shown in Figure 4. Because only two categories (malignant and healthy) were used in this study, only one discriminant vector (DV) or uncorrelated discriminant vector (UDV) was obtained, which reduces the complexity of the model and improves its interpretability. Figure 4 shows that, although the cancer and healthy groups cannot be completely distinguished, the DV and UDV values on the vertical axis cluster clearly. Taking the dotted lines in the figures as the classification boundaries, the prediction sensitivity of LDA reached 88.24% (30/34) and the prediction specificity reached 80.00% (12/15). ULDA gave the same sensitivity of 88.24% and specificity of 80.00%.
Figure 3. The column diagram of MSC-PLSDA classification.
Figure 4. The classification results by LDA (a) and ULDA (b).
The basic idea of LDA is to project high-dimensional pattern samples onto the best discriminant vector space, so that the between-class scatter is maximized and the within-class scatter is minimized. In other words, the projected patterns in the new space have the maximum between-class distance and the minimum within-class distance, and therefore the best separability. ULDA is developed from the LDA algorithm and requires the column vectors of the transformation matrix to be uncorrelated; it can therefore reduce data redundancy after dimension reduction and at the same time mitigate the problem of insufficient samples. Probably for these reasons, the classification performance of LDA and ULDA far exceeds that of PCA, NMF and PLS-DA when they are applied to high-dimensional pattern samples.
The data were then pre-processed with MSC, and the classification results by MSC-LDA and MSC-ULDA are given in Figure 5. Figure 5 shows that the cancer group and the healthy group were completely distinguished and clustered, with 100% sensitivity and 100% specificity. Compared with LDA and ULDA alone, sensitivity and specificity were significantly improved, which indicates that MSC is a very effective preprocessing method for extracting disease-related information from serum Raman spectra. The results of the above methods are listed in Table 1. It can therefore be concluded that MSC-LDA and MSC-ULDA may be feasible methods for distinguishing serum samples of lung cancer patients from those of healthy people.
Figure 5. The classification results by MSC-LDA (a) and MSC-ULDA (b).
Table 1. The sensitivities and specificities of three methods.
Pattern recognition techniques were applied to study the Raman spectra of serum samples. The classification results show that ULDA or LDA combined with multiple scatter correction pretreatment could accurately distinguish lung cancer patients from healthy people. This study may provide a new way for early identification of lung cancer, which has academic significance and promising clinical application value.
This study was supported by the National Natural Science Foundation of China (No. 21305043), the Beijing Natural Science Foundation (No. 7142102) and the Fundamental Research Funds for the Central Universities (No. 2014ZD38).
The use of human serum in this study was approved by the local medical ethics review committee.