Microarray technology provides a high-throughput way to simultaneously investigate gene expression information in a whole genome level. In the field of cancer research, the genome-wide expression profiling of tumors has become an important tool to identify gene sets and signatures that can be used as clinical endpoints, such as survival and therapy response  . When we are contrasting expressions between different groups or conditions (i.e., the response ispolytumous), such important genes are described as differentially expressed (DE)  . To identify important genes, a statistical scheme is required that measures and captures evidence for a DE per gene. If the response consists of binary data, the DE is measured using a two-group comparison for which such statistical methods as t-statistics, the statistical analysis of microarray (SAM)  ), fold change, and B statistics have been proposed  . The p-value of the statistics is calculated to assess the significance of the DE genes. The p-value per gene is ranked in ascending order; however, selecting significant genes must be considered by multiple testing corrections, e.g., false discovery rate (FDR)  , to avoid type I errors. Even if significant DE genes are identified by the FDR procedure, the gene list may still include too many to apply a statistical test for a substantial number of probes through whole genomic locations. Such a long list of significant DE genes complicates capturing gene signatures that should provide the availability of robust clinical and pathological prognostic and predictive factors to guide patient decision-making and the selection of treatment options.
As one approach for this challenge, based on the residuals from the Autoregressive Conditional Heteroskedasticity (ARCH) models, the proposed rank order statistic for two-sample problems pertaining to empirical processes refines the significant DE gene list. The ARCH process was proposed by Engle  , and the model was developed in much research to investigate a daily return series from finance domains. The series indicate time-inhomogeneous fluctuations and sudden changes of variance called volatility in finance. Financial analysts have attempted more suitable time series modeling for estimating this volatility. Chandra and Taniguchi  proposed a rank-order statistics and the theory provided an idea for applying residuals from two classes of ARCH models to test the innovation distributions of two financial returns generated by such varied mechanisms as different countries and/or industries. Empirical residuals called “innovation” generally perturb systems behind data. Theories of innovation approaches to time series analysis have historically been closely related to the idea of predicting dynamic phenomena from time series observations. Wiener’s theory is a well-known example that deems prediction error to be a source of information for improving the predictions of future phenomena. In a sense, innovation is a more positive label than prediction error  . As we see in innovation distribution for ARCH processes, it resembles the sequential expression level based on the whole genomic location. For applying time indices of ARCH model to the genomic location, the time series mining has been practically used to DNA sequence data analysis  and microarray data analysis  . To investigate the data’s properties, we believe that innovation analysis is more effective than analysis just based on the original data. While the original idea in Chandra and Taniguchi  was based on squared residuals from an ARCH model, not-squared empirical residuals are also theoretically applicable, as introduced in Lee and Taniguchi  . In this article, we apply this idea to test DEs between two sample groups in microarray datasets that we assume to be generated by different biological conditions.
To investigate whether ARCH residuals can consistently refine a list of significant DE genes, we apply publicly available datasets called Affy947  for breast cancer research to compare significant gene signatures. As a statistical test for two-group comparisons, the estrogen receptor (ER) is applied in clinical outcomes to identify prognostic gene expression signatures. Estrogen is an important regulator of the development, the growth, and the differentiation of normal mammary glands. It is well documented that endogenous estrogen plays a major role in the development and progression of breast cancer. ER expression in breast tumors is frequently used to group breast cancer patients in clinical settings, both as a prognostic indicator and to predict the likelihood of response to treatment with antiestrogen  . If the cancer is ER+, hormone therapy using medication slows or stops the growth of breast cancer cells. If the cancer is ER-, then hormonal therapy is unlikely to succeed. Based on these two categorical factors for ER status, we applied our proposed statistical test to the expression levels for each genomic location. After identifying significant DE genes, biological enrichment analyses use the gene list and seek biological processes and interconnected pathways. These analyses support the consistency for refined gene lists obtained by ARCH residuals.
Denote the sample and the genomic location by and in microarray data . The samples for the microarray data are divided by two biological different groups, one group is for breast cancer tumors driven by ER + and another group is for breast cancer tumors driven by ER−. We apply the two-group comparison testing to identify significant different expression level between two groups of ER+ and ER− samples for each gene (genomic location). As the statistical test, we propose the rank order statistics for ARCH residual empirical process introduced in 2.1. For comparisons with the ARCH model’s performance, we consider applying the two-group comparison testing to original array data and applying the test to the residuals obtained by ordinal AR (autoregressive) model. The details about both methods are summarized in 2.2. For the obtained significant DE gene lists, biologists or medical scientists require further analysis for their biological interpretation to investigate the biological process or biological network. In this article, we apply GO (gene ontology) analysis shown in 2.3 and Pathway analysis shown in 2.4, which methods are generally used to investigate specific genes or relationships among gene groups.
2.1. The Rank Order Statistic for ARCH Residual Empirical Process
Suppose that a classes of ARCH (p) models is generated by the following equations
where is a sequence of i.i.d.(0,1) random variables with fourth-order cumulant , is an unknown parameter vector satisfying , , , , and is independent of , . Denote by the distribution function of and we assume that exists and is continuous on .
Suppose that another class of ARCH(p) models, independent of , is generated similarly by the equations
where is a sequence of i.i.d. (0,1) random variables with fourth-order cumulant , is an unknown parameter vector satisfying , , , , and is independent of , . The distribution function of is denoted by and we assume that exists and is continuous on . For (2.1.1) and (2.1.2), we assume that and for stationarity (see  ).
Now we are interested in the two-sample problem of testing
In this article, and correspond to the distribution for the expression data of samples driven by ER+ and ER−, individually.
For this testing problem, we consider a class of rank order statistics including, such as Wilcoxon’s two-sample test. The form is derived from the empirical residuals and . Because Lee and Taniguchi  developed the asymptotic theory for not squared empirical residuals, we may apply the results to and .
2.2. Two-Group Comparison for Microarray Data
To obtain the empirical residuals as mentioned in 2.1, the ARCH model is applied to a vector for the th sample, where is the total number of genomic locations in the microarray data. Assuming that the ER+ and ER− samples correspond to distributions and as shown in 2.1, orders and of the ARCH model are identified by model selection using the Akaike Information Criterion (AIC), where the model with the minimum AIC is defined as the best fit model  (see 1. in Figure 1). According to those responses, the empirical residuals are grouped as and . Wilcoxon statistic is applied as order statistic to those two groups for each genomic location , and the p-value is calculated (see 2. in Figure 1). The p-values obtained for all genes are adjusted for multiple testing corrections using false discovery rate (FDR)  (see 3. in Figure 1).
For comparisons with the ARCH model’s performance, the two-group comparison testing to original array data and applying the test to the residuals obtained by ordinal AR (autoregressive) model.AR model represents the current
Figure 1. Complete proposed algorithm.
value using the weighted average of the past values as , where , , and are the AR coefficient, the AR order, and the error terms. The AR model is widely applied in time series analysis and the signal processing of economics, engineering, and science. In this article, we apply it to the expression data for two ER+ and ER− groups. The AR order for the best fit model is identified by AIC. Empirical residuals for ER+ and for ER− are subtracted from the data by predictions.
These procedures are finally summarized as follows: 1) take the original microarray and the clinical data for ER from one study cohort; 2) apply the ARCH and AR models to the original data for each sample and identify the best fit model among the model candidates within 1 - 10 time lags; 3) subtract the residuals from the data by the prediction for the best fit model; 4) apply Wilcoxon statistic to the original data and to the empirical residuals by ARCH and AR; 5) list the p-values and identify the significant FDR (5%) corrected genes. 6) apply Gene Ontology analysis and pathway analysis (see the details in 2.3 and 2.4) for biological interpretation to the obtained gene list (see 4. in Figure 1).
The computational programs were done by the garchFit function (in “fGARCH”) for ARCH fitting, by the ar.ols function for AR fitting, the wilcox.test as a rank- sum test, and fdr.R for the FDR adjustment in the R package.
2.3. GO Analysis
To investigate the gene product attributes from the gene list, Gene Ontology (GO) analysis was performed to find specific gene sets that are statistically associated among several biological categories. GO is designed as a functional annotation database to capture the known relationships between biological terms and all the genes that are instances of those terms. It is widely used by many functional enrichment tools and is highly regarded both for its comprehensiveness and its unified approach for annotating genes in different species to the same basic set of underlying functions  . Many tools have been developed to explore, filter, and search the GO database. In our study, Gorilla  was used as a GO analysis tool. GOrilla is an efficient web-based interactive user interface that is based on a statistical framework called minimum hypergenometric (mHG) for enrichment analysis in ranked gene lists, which are naturally represented as functional genomic information. For each GO term, the method independently identifies the threshold at which the most significant enrichment is obtained. The significant mHG scores are accurately and tightly corrected for threshold multiple testing without time-consuming simulations  . The tool identifies enriched GO terms in ranked gene lists for background gene sets which are obtained by the whole genomic location of microarray data. GO consists of three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components, and molecular functions. The building blocks of GO are called terms, and the relationship among them can be described by a directed acyclic graph (DAG), which is a hierarchy where each gene product may be annotated to one or more terms in each ontology  . GOrilla requires a list of gene symbols as input data. The obtained significant Etrez gene lists by FDR correction are converted into gene symbols using a web-based database called SOURCE  , which was developed by the Genetics Department of Stanford University.
2.4. Pathway Analysis
As well as for GO analysis, the identified genes are mapped to the well-defined- biological pathways. Pathway analysis determines which pathways are overrepresented among genes that present significant variations. The difference from GO analysis is that pathway analysis includes interactions among a given set of genes. Several tools for pathway analysis have been published. In this study, we used a web-based analysis tool called REACTOME, which is a manually curated open-source open-data resource of human pathways and reactions  . REAC- TOME is a recent fast and sophisticated tool that has grown to include annotations for 7088 of the 20,774 protein-coding genes in the current Ensembl human genome assembly, 15,107 literature references, and 1421 small molecules organized into 6744 reactions collected in 1481 pathways  .
3. Simulation Study
To investigate the performance of our proposed algorithm, we performed a simulation study. We first prepared the clinical indicator like ER+ and ER−. The artificial indicator includes “1” for 50 samples and “0” for 50 samples. Next, we considered two types of artificial 1000-array and 100 samples: one array data (A) generating by normal distributions was set. The mean and variance values of the distribution were set as 1.0 to generate overall array data at once. In addition, the array data for the 201 - 400 array and the 601 - 600 array were replaced with the data generating different normal distribution with 1.8 mean and 10 variance; another array data (B) was generated by ARCH model. The model was applied to real array (DES data, see the detail in Section 4) and the parameters (mu: the intercept, omega: the constant coefficient of the variance equation, alpha: the coefficients of the variance equation, skew: the skewness of the data, shape: the shape parameter of the conditional distribution setting as 3) for the model was estimated for ER+ and ER−, respectively. We used these parameters and random number to generate the simulation data. For the computational programs, we conducted normrnd of MatlabÒ command to generate random variables by normal distributions for array data A, and conducted garchSim of the R package fGARCH for array data B. We iterated 100 times to generate the two array data sets. To 100 data sets for A and B, we applied two-group comparison for the original simulation data and the ARCH residuals of them and identify 5% FDR significant parts.
Due to the extensive usage of microarray technology, in recent years publicly available datasets have exploded  , including the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/)  and Array Express (https://www.ebi.ac.uk/arrayexpress/). In this study for breast cancer research, we used five different expression datasets, collectively called the Affy947 expression dataset  . These datasets, which all measure the Human Genome HG U133A Affymetrix arrays, are normalized using the same protocol and are assessable from GEO with the following identifiers: GSE6532 for the Loi et al. dataset  (Loi), GSE3494 for the Miller et al. dataset  (Mil), GSE7390 for the Desmedt et al. dataset  (Des), and GSE5327 for the Minn et al. dataset  (Min). The Chine et al. dataset  (Chin) is available from ArrayExpress. This pooled dataset was preprocessed and normalized, as described in Zhao et al.  . Microarray quality-quality-control assessment was carried out using the R AffyPLM package from the Bioconductor web site (http://www.bioconductor.org  ). The Relative Log Expression (RLE) and Normalized Unscaled Standard Errors (NUSE) tests were applied. Chip pseudo-images were produced to assess artifacts on the arrays that did not pass the preceding quality control tests. The selected arrays were normalized by a three- step procedure using the RMA expression measure algorithm (http://www.bioconductor.org  ): RMA background correction convolution, the median centering of each gene separately across arrays for each dataset, and the quantile normalization of all arrays. Gene mean centering effectively removes many dataset specific biases, allowing for effective integration of multiple datasets  . 22,268 is the total number of probes for these microarray data.
Against all probes that covered the whole genome, we use the probes that correspond to the intrinsic signatures that were obtained by classifying breast tumors into five molecular subtypes  . We extracted 777 probes from the whole 22 K probes for the microarray data-sets using the intrinsic annotation included in the R codes in Zhao et al.  . As the response contrasting expression between two groups, we used a hormone receptor called ER, which indicates whether a hormone drug works as well for treatment as a progesterone receptor and is critical to determine the prognosis and predictive factors. ERsare used for classifying breast tumors into ER-positive (ER+) and ER-negative (ER−) diseases. The two upper figures in Figure 2 present the mean of the microarray data by averaging all of the previously obtained samples  . The left and right plots correspond to a sample indicating ER+ and ER−. The data for ER− show more fluctuation than for ER+. The two lower figures illustrate histograms of the averaged data for ER+ and ER− and present sharper peakedness and heavier tails than the shape of an ordinary Gaussian distribution.
5. Results and Discussion
5.1. Simulation Data
For the simulation data and ARCH residuals, we summarized the average of the number of the identified 5% FDR significant parts and the number of the overlapped parts in Table 1. In the case of the simulation data generated by normal
Figure 2. Upper figures: mean of expression levels for ER+ (left) and ER− (right) across all Des samples. Lower figures: histograms of expression levels for ER+ (left) and ER− (right).
Table 1. Summary of the average for the identified significant parts.
distribution, the significant number for original series and ARCH residuals in array sets A and B was not differ. The parts identified in A were mostly same as ones in B. On the other hand, in the simulation data generated by ARCH model, the ARCH residuals identified more significant parts from the data than the original series. The number of significant parts for ARCH residuals was about 30% less than the number of significant parts for the original series. The overlapped number was less than the case A, however over 50% parts were covered.
5.2. Affy947 Expression Dataset
Based on the method explained in 3.2, the best fit AR and GARCH models were selected by AIC for each sample. The estimated orders of all the best fit models of all the studies are summarized in Supplementary Table 1. Figure 3 summarizes the ratio of the sample numbers for each selected order against the total number of samples. These figures suggest that the most often selected orders were one while ER+ samples tended to take more complicated models than for the ER− samples.
Using residuals obtained by the best fit ARCH and AR models and the original data, we applied Wilcoxon statistic to compare DEs between two groups divided by ER+ and ER−. The significant genomic locations were assessed by a FDR. The locations were mapped on Entrez gene IDs according to the Affy probes presented in the original microarray data and converted into gene symbols by SOURCE. The identified genes in the original data and the ARCH residual analyses are listed in Supplementary Table 2. Based on these gene lists, we investigated the overlapped significant genes for the original data with significant genes for the ARCH and AR residuals and summarized the results in Table 2. About 200 - 280 significant DE genes in the studies of Des, Mil, Min, and Chin were identified with FDR correction. For Loi, the significant genes were fewer than the other studies in all cases. Except for Loi, the number of significant genes for the ARCH residuals was reduced by about 20% - 35% less than the number of genes for the original data. The estimated genes for all the datasets (except for Loi’s case) overlapped 100% with the estimated genes for the original data. The number of significant genes for the AR residuals in the cases of Mil, Min, and Loi was less than the number of genes by ARCH and resulted in about a 20% - 35% reduction of significant genes for the original data except for Loi. The reduction rate was similar to the rate shown in the case B of our simulation study. Furthermore, these genes for the AR residuals did not completely 100% overlap with the genes for the original data unlike the case of the ARCH residuals. The results suggest that the real array data might be generated by a
Figure 3. Selected model’s orders. Five upper panels indicate GARCH results. Five lower panels indicate AR results. Vertical and horizontal axes indicate percentage of selected order toward total sample numbers and model’s order. Solid and broken lines correspond to ER− and ER+ samples.
similar structure as the ARCH process and empirical ARCH residuals might be more effective to specify important genes from a list of long genes than AR residuals.
To investigate the overlapping genes by the ARCH residuals with genes by the original data, the corresponding cytoband and gene symbols are summarized in Table 3. The total numbers of common genes by the original and ARCH residuals in four studies were 132 and 99. If we take into account Loi’s case, the total numbers of common genes across all studies for the original and ARCH residuals are 12 and 9. The genes obtained by the ARCH residuals were completely covered by the genes obtained by the original data. The results by the ARCH residuals covered several important genes for breast cancer, such as TP53 in the
Table 2. Summary for FDR 5% adjusted Entrez genes of five datasets. Values in parentheses indicate number of unique genes to avoid duplicate and multiple genes from obtained gene list. Percentages for overlapped with original indicate ratios for overlapped significant genes for ARCH or AR residuals with significant genes in original data.
chromosome 1q region, ERBB2 in the chromosome 17q region, and ESR1 in the chromosome 6q region, even if the number of identified Entrez genes was less than the number of identified genes from the original data.
Next, we performed GO enrichment analysis using significant DE gene lists for the original data and ARCH’s residual analyses in all studies. To correctly find the enriched GO terms for the associated genes, a background list was prepared of all the probes included in the original microarray data. The Entrez genes in the background list were converted into 13,177 gene symbols without any duplication by SOURCE. As the input gene lists to GOrilla, the numbers of summarized unique genes are shown in the parentheses of Table 2. All the associated GO terms for the original and ARCH residuals in all the studies are summarized in Supplementary Table 3. Since the estimated gene symbols in Loi’s case were less than half of the amount taken in other studies, few associated GO terms were identified in the biological process and cellular component and no GO terms in the molecular function. Also, significant DE genes for the ARCH residuals contributed to finding additional associated GO terms that did not appear in the GO terms for the original data, e.g., mammary gland epithelial cell proliferation for Des, a single-organism metabolic process for Des, an organonitrogen compound metabolic process for Mil and Min, and a single-organism developmental process for Min and Chin, all of which are related to meaningful biological associations like cellular differentiation, proliferation, and metabolic pathways in cancer cells  . Table 3 summarizes the common GO terms of the biological processes for Des, Mil, Min, and Chin and presented 13 terms for the original data. The terms for the ARCH residuals mostly overlapped with them except for Mil’s case. As shown in Supplementary Table 3, two terms in the molecular function and eight in the cellular components were commonly identified by the original data. The GO terms for the ARCH residuals covered them, and more terms were shown in the molecular function.
Furthermore, to investigate the consistency of the refined significant gene
Table 3. Identified differentially expressed genes (FDR 5%) and cytobands for ER status in original data and empirical ARCH residuals.
lists, we applied pathway analysis to the significant DE genes for the original and ARCH residuals listed in Table 4. All the identified pathways with Entities FDR (<1.0) and associated genes are summarized in Supplementary Table 4. In the pathway components shown in Supplementary Table 3, ERBB2 signaling, EGFR, cell-cycle, immune system, metabolic pathway, AKT signaling and Wnt pathway are well- known important breast cancer-signaling pathways  . We took them to be representative of important pathways and counted the number of identified pathways related to these components in the case of the original and ARCH residuals. The number and associated gene symbols are summarized in Table 5. The representative pathways were mostly covered by the significant DE genes for the ARCH residuals. This result supports that the refined gene lists obtained by the ARCH residuals generally captured the differentiating breast tumors based on ER status and did not overlook any important biological information by the limited DE gene lists for the ARCH residuals.
We applied a rank order statistic for an ARCH residual empirical process to refine significant DE genes by two-group comparison in microarray analysis. Our approach considered publicly available gene expression datasets and the clinical output for ER in addition to the simulation study. We compared the analysis performances by the ARCH residuals with the AR residuals and the ordinal original microarray data. While the genes for the AR residuals did not cover 100% of the genes for the original data analysis, the genes by the ARCH residuals were mostly 100% overlapped with the original data, and the gene lists were reduced about 30% from the gene lists obtained by the original data analysis. We confirmed the similar property for the 30% reduction in the simulation study. In GO enrichment and pathway analyses, the result by the ARCH residuals was mostly covered with associated biological terms obtained by the original data
Table 4. Common associated biological processes among Des, Mil, Min, and Chin for original and ARCH residuals.
Table 5. Identified important breast cancer-signaling pathways and associated gene symbols obtained from original data and ARCH residuals.
analysis and presented additional important GO terms in biological processes. These results suggest that data processing using ARCH residuals array data could contribute to refining significant DE genes that follow the required gene signatures and provide prognostic accuracy and guide clinical decisions.
Acknowledgements and Funding
The research by the second author was supported by Japanese Grant-in-Aids A23244011 (Taniguchi, M., Waseda Univ.).
The authors declare no competing interests.
Supplementary (see https://www.dropbox.com/sh/sotz8jufje73eg6/AACKHD-tXqB02h_rxlvVlXqsa?dl=0 )
Supplementary Table 1. Estimated order of all best fit models for each sample.
Supplementary Table 2. Significant DE Entrez genes and gene symbols identified by original data and ARCH residuals.
Supplementary Table 3. Associated GO terms for original and ARCH residuals.
Supplementary Table 4. Identified pathways and associated genes.