The odds ratio is commonly used to analyze the association between two factors that each have two categories. In epidemiological studies and clinical trials, these two factors are usually the exposure (treatment/intervention/risk) factor X and the outcome factor Y. The association between X and Y, however, may be modified or confounded by a third factor Z. For example, in a multi-center clinical trial, Z could be the center, with each center corresponding to a stratum of Z. Because heterogeneity of the odds ratios may call for different methods of analysis, researchers usually want to test whether the odds ratios are homogeneous across the strata of Z. Tests of this type are called tests for homogeneity of the odds ratios, or tests for null interaction.
A number of procedures have been developed for testing the homogeneity of odds ratios. They are usually divided into two classes: exact tests and asymptotic tests. Most of the asymptotic statistics were derived for “large-stratum” settings, where the sample size is large and the number of strata is small. Liang and Self developed two asymptotic score statistics for the “sparse-data” setting, where there are many cells with small counts and/or zeros. Asymptotic tests can perform poorly when some cells of the contingency tables have small counts, even when the total sample size is quite large. The empirical sizes of the asymptotic tests were shown to be conservative in the “small-stratum” setting. Jones et al. showed in their simulation study that, in the “sparse-data” setting, five asymptotic tests suffered in both empirical size and power. Zelen constructed an exact test for homogeneity that employs the ordering principle for a single 2 by 2 table; it was reexamined by Halperin et al. Algorithms for the Zelen exact test were developed later. Hirji et al. showed that their algorithm for the Zelen statistic was more versatile and more efficient in terms of accuracy and computer memory usage. Hirji et al. also constructed five other exact statistics from their asymptotic counterparts, including the score, likelihood ratio, Pearson χ² and mixture model χ² tests. The exact tests are computationally intensive, particularly when the cell counts are large. Reis, Hirji and Afifi examined the empirical power of the six asymptotic statistics and six exact tests mentioned above; their results showed that the power of both exact and asymptotic tests was low when sample sizes were small, even with relatively large heterogeneity among the odds ratios. Bagheri et al. compared three tests of homogeneity of odds ratios.
A recent study investigated profile likelihood tests for common risk ratios in meta-analysis.
The study presented in this paper compares a class of U-statistics with established tests such as the Zelen test and the Breslow-Day test.
Statistics of the form

$$U_n = \binom{n}{l}^{-1} \sum_{C} h(X_{i_1}, \ldots, X_{i_l})$$

are known as U-statistics, where $C$ is the set of all $l$-subsets $\{i_1, \ldots, i_l\}$ of $\{1, \ldots, n\}$; the sum is taken over all subsets in $C$; and h is the kernel function, which is symmetric in its arguments. In our study, we investigated the applicability of this class of statistics with l = 2 in testing homogeneity of odds ratios among 2 by 2 tables. U-statistics were first identified as minimum-variance unbiased estimators by Halmos and were named by Hoeffding, who demonstrated the asymptotic normality of this class of statistics in his seminal paper. Many statistics in common use are members of this class, including the sample mean, the sample variance and the sample covariance.
Table 1 shows a 2 by 2 table of a stratum k.
Assume that ak and bk are counts of independent binomial outcomes from nk trials with exposure and mk trials without exposure in stratum k, respectively; Nk is the total sample size of stratum k; and the tables are independent among strata. The commonly used estimate of the odds ratio of the kth stratum is $\hat\psi_k = \dfrac{a_k (m_k - b_k)}{b_k (n_k - a_k)}$. We want to test whether the odds ratios are homogeneous among all K strata (that is, all levels of the third variable Z), i.e., to test $H_0: \psi_1 = \psi_2 = \cdots = \psi_K$ against $H_a: \psi_i \neq \psi_j$ for at least one pair (i, j), where $i \neq j$ and $i, j = 1, \ldots, K$.
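Our study evaluated a class of weighted U-statistics

$$WU = \binom{K}{2}^{-1} \sum_{i<j} w_{ij}\, h(\hat\psi_i, \hat\psi_j)$$

as well as a class of unweighted U-statistics

$$U = \binom{K}{2}^{-1} \sum_{i<j} h(\hat\psi_i, \hat\psi_j),$$

where $\hat\psi_i$ and $\hat\psi_j$ are the estimates of the odds ratios in the ith and jth strata of Z, $h(\hat\psi_i, \hat\psi_j)$ is a function of $\hat\psi_i$ and $\hat\psi_j$, and $w_{ij}$ is the weight associated with $h(\hat\psi_i, \hat\psi_j)$. Based on the simulation results, we focus only on the following two statistics in this paper:
U3: $U_3 = \binom{K}{2}^{-1} \sum_{i<j} \left| \log\hat\psi_i - \log\hat\psi_j \right|$, (2)
WU3: $WU_3 = \binom{K}{2}^{-1} \sum_{i<j} w_{ij} \left| \log\hat\psi_i - \log\hat\psi_j \right|$. (3)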
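As a minimal sketch of how statistics of this kind can be computed, the Python functions below average the absolute pairwise differences of the log odds ratios over all pairs of strata. Dividing by the number of pairs is an assumption of this sketch, and the weight is passed in as a caller-supplied function `weight(i, j)` (a hypothetical name) rather than fixed to any particular form.

```python
from itertools import combinations


def u3(log_ors):
    """Unweighted statistic: mean absolute pairwise difference of log odds ratios.

    Normalizing by the number of pairs is an assumption of this sketch.
    """
    pairs = list(combinations(log_ors, 2))
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def wu3(log_ors, weight):
    """Weighted statistic: each pairwise distance is multiplied by weight(i, j).

    `weight` is a hypothetical caller-supplied function of the stratum indices.
    """
    index_pairs = list(combinations(range(len(log_ors)), 2))
    total = sum(weight(i, j) * abs(log_ors[i] - log_ors[j]) for i, j in index_pairs)
    return total / len(index_pairs)


if __name__ == "__main__":
    log_ors = [0.0, 1.0, 3.0]
    print(u3(log_ors))
    print(wu3(log_ors, lambda i, j: 1.0))  # unit weights reproduce u3
```

With unit weights the weighted version reduces to the unweighted one, which gives a quick consistency check for any weight function that is plugged in.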
The base of log in all formulas is e. The sampling distribution of the estimated odds ratio is highly skewed when the sample size is small or moderate. Because of this, we used the natural logarithm of the estimated odds ratio in U3 and WU3 to reduce the skewness. Because a large stratum offers a more accurate estimate of the odds ratio, a weight proportional to the stratum’s size was selected for each pairwise term in expression (3). It has the following form:
where . The log transform of the sample odds ratio has an asymptotic variance of a simple form, $\widehat{\mathrm{var}}(\log\hat\psi_k) = \dfrac{1}{a_k} + \dfrac{1}{n_k - a_k} + \dfrac{1}{b_k} + \dfrac{1}{m_k - b_k}$. If any cell count of a table equals zero, the odds ratio estimate and the variance from the above formula are undefined. In that case, we added 0.5 to each cell count of that table to obtain the amended estimators. In our simulation, we also studied other forms of U-statistics. Because of their relatively poor performance, we report only the results for U3 and WU3 and their comparisons with Zelen’s test and the Breslow-Day test in this paper. Detailed results are available upon request.
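The log odds ratio, its asymptotic variance and the 0.5 correction for zero cells can be sketched as follows. The function name and the argument order (a, b, n, m) are choices made for this illustration; the cell layout assumes ak events among nk exposed subjects and bk events among mk unexposed subjects.

```python
import math


def log_odds_ratio(a, b, n, m):
    """Return (log odds ratio, asymptotic variance) for one 2 by 2 table.

    a: events among n exposed subjects; b: events among m unexposed subjects.
    If any cell is zero, 0.5 is added to every cell of the table.
    """
    c, d = n - a, m - b  # non-events among exposed and unexposed
    if min(a, b, c, d) == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


if __name__ == "__main__":
    print(log_odds_ratio(2, 1, 4, 4))  # no zero cell, no correction
    print(log_odds_ratio(0, 1, 4, 4))  # zero cell, 0.5 added to all four cells
```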
Table 1. 2 × 2 table of X and Y at Z = k.
3. Simulation Study
A total of 10,000 data sets were simulated using the SAS subroutine RANBIN. Each data set contains a pre-specified number K of 2 by 2 tables. The cell counts ak and bk were independently generated from the binomial distributions Bin(nk, p1k) and Bin(mk, p0k), where pxk is the probability of Y = 1 when X = x (x = 1, 0) in the kth stratum. Each set of tables was simulated with given nk, mk, number of strata K and odds ratios. For a given odds ratio ψk and a binomial proportion p0k, p1k was calculated by solving:

$$\psi_k = \frac{p_{1k}\,(1 - p_{0k})}{p_{0k}\,(1 - p_{1k})}.$$
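Solving the odds ratio relation for p1k gives p1k = ψk p0k / (1 − p0k + ψk p0k). The sketch below applies this closed form and draws one table with Python's standard `random` module, used here as a stand-in for the SAS RANBIN routine; the function names are illustrative only.

```python
import random


def p1_from_odds_ratio(psi, p0):
    """Solve psi = p1 (1 - p0) / (p0 (1 - p1)) for p1."""
    return psi * p0 / (1.0 - p0 + psi * p0)


def simulate_table(n, m, p0, psi, rng):
    """Draw (a, b): binomial counts for the exposed and unexposed groups."""
    p1 = p1_from_odds_ratio(psi, p0)
    a = sum(rng.random() < p1 for _ in range(n))  # Bin(n, p1)
    b = sum(rng.random() < p0 for _ in range(m))  # Bin(m, p0)
    return a, b


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed for reproducibility
    print(p1_from_odds_ratio(2.0, 0.1))
    print(simulate_table(10, 10, 0.1, 2.0, rng))
```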
Following the previous simulation study by Reis, Hirji and Afifi, five factors that might influence the sizes and power of the test statistics were studied. These five factors are: 1) the sample size for X = 1 (nk); 2) the ratio of the sample sizes of X = 0 and X = 1 (mk:nk); 3) the probability of Y = 1 among X = 0 (p0k); 4) the odds ratio (OR) ψk; 5) the number of 2 × 2 tables, K (strata). To study the performance of the statistics, different values of these five factors (parameters) were chosen to simulate different data sets. In choosing the values of nk, mk:nk, p0k, ψk and K, we took the following into account: the situations encountered in practice, the characteristics of U-statistics and the design of the simulation study by Reis, Hirji and Afifi. When the effect of one factor was studied, the values of the other four factors were fixed.
We compared the performance of the U-statistics with the Breslow-Day statistic and Zelen’s exact test in our simulation study. A C++ program was written to calculate these statistics’ exact P-values, empirical sizes and power. The empirical size was calculated as the percentage of times that the test rejected the null hypothesis of a common odds ratio at a pre-specified α level among 10,000 tests simulated with the same odds ratio in all K tables. The empirical power was calculated as the percentage of times that a test rejected the null hypothesis of a common odds ratio at a pre-specified α level when the data were simulated under an alternative hypothesis. Because the U-statistics studied here are functions of the sum of the absolute distances between all possible pairs of estimated odds ratios on the log scale, a large value of a U-statistic indicates heterogeneity of the odds ratios.
Theoretically, under suitable conditions, $(T - E(T))/\sqrt{\mathrm{var}(T)}$ asymptotically follows the standard normal distribution N(0, 1), where T represents a U-statistic. In our study, the sample mean and the sample variance of the 10,000 statistics simulated under the null hypothesis were used to estimate E(T) and var(T). In an actual application, the mean and variance may be estimated as suggested in our application section. In the simulation, one could instead use the analytic expression for the variance. Doing so, however, would give a different numerical value of the variance for each realization, which is inconsistent with our assumption of a common variance. The effect of unequal variance is still under investigation.
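The standardization described above, together with the rejection-rate definition of empirical size and power, can be sketched as follows. The one-sided upper-tail P-value is an assumption of this sketch, chosen because large values of the statistic indicate heterogeneity; the function names are illustrative.

```python
from statistics import NormalDist, mean, variance


def standardized_p_value(t_obs, null_stats):
    """Standardize t_obs with the sample mean and variance of null statistics.

    Returns (z, p), where p is the upper-tail standard normal probability
    (one-sided test: an assumption of this sketch).
    """
    mu = mean(null_stats)
    sd = variance(null_stats) ** 0.5  # sample variance, divisor n - 1
    z = (t_obs - mu) / sd
    return z, 1.0 - NormalDist().cdf(z)


def rejection_rate(p_values, alpha=0.05):
    """Proportion of tests with p < alpha: empirical size under H0, power under Ha."""
    return sum(p < alpha for p in p_values) / len(p_values)


if __name__ == "__main__":
    null_stats = [1.0, 2.0, 3.0]  # toy values standing in for simulated null statistics
    print(standardized_p_value(2.0, null_stats))
    print(rejection_rate([0.01, 0.20, 0.04, 0.50]))
```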
4. Results of the Simulation Study
4.1. Empirical Sizes of the Tests
The five factors affected the test statistics differently. The empirical size of the Breslow-Day test improved (moved closer to the pre-specified α level) as the values of nk, mk:nk, p0k and the odds ratio increased, but diverged from the pre-specified α level when the number of strata increased. A weak trend was observed in which the empirical sizes of U3 and WU3 first moved closer to and then diverged from the pre-specified α level as the sample size nk increased (Figure 1). Similar results were observed for the ratio mk:nk: the empirical sizes improved when mk:nk increased to a moderate value (mk:nk equal to 2 or 3) and then diverged from the pre-specified significance level of 0.05 at higher values of mk:nk (Figure 2). The empirical sizes of the U-statistics increased as the value of the odds ratio increased. The probability p0k had the least impact on the empirical sizes of the two U-statistics.
The number of strata K had an apparent effect on the sizes of the U-statistics; their empirical sizes improved as K increased (Figure 3), even when the total sample size remained the same (Figure 4). The empirical size of the Breslow-Day test diverged from the nominal size of 5% as K increased. The empirical size of the Zelen test improved as K and the total sample size increased, but diverged from the nominal size of 5% when the number of strata increased without an increase in the total sample size.
4.2. Empirical Power of the Statistics
Seven settings of heterogeneous odds ratios were evaluated as alternative hypotheses in our study. In this article, however, we report only the empirical power for the scenario in which the alternative odds ratios followed the pattern 1, 2, 3, 7; that is, each of the odds ratios 1, 2, 3 and 7 was assigned to 25% of the tables generated under Ha. To show the effects of different factors on the test statistics, we also simulated the critical values based on these factors (Figures 5-11).
Figure 1. The effect of nk on empirical size, Sparse-strata setting, between strata balanced: (p0k = 0.1, K = 10, mk:nk = 1, OR = 2).
Figure 2. The effect of ratio mk:nk on empirical size, between strata balanced setting: (p0k = 0.05, K = 10, nk = 10, OR = 2).
Figure 3. The effect of the number of strata K on empirical size, between strata balanced and within strata balanced: (p0k = 0.05, nk = 10, mk:nk = 1, OR = 2).
Figure 4. The effect of the number of strata K on empirical size, between strata unbalanced and within strata balanced: (p0k = 0.1, nk = 10, mk:nk = 1, OR = 2).
Figure 5. The effect of nk on empirical power, between strata balanced: (p0k = 0.1, K = 10, mk:nk = 1, Ha7 shown in Table 1).
Figure 6. The effect of nk on empirical power, between strata unbalanced and within strata balanced setting: Two large strata with OR = 2, and 7, nk changed as shown in figure, Eight small strata with OR = 1, 2, 3, 7, 1, 2, 3, 7, nk = 5. (p0k = 0.05, K = 10, mk:nk = 1, Ha7 shown in Table 1).
Figure 7. The effect of ratio mk:nk on empirical power, between strata balanced setting: (p0k = 0.05, K = 10, nk = 10, OR = 2, Ha7 shown in Table 1).
Figure 8. The effect of the value of odds ratio on empirical power, between strata balanced and within strata balanced, (p0k = 0.05, nk = 10, mk:nk = 1, K = 10, Ha7 shown in Table 1).
Figure 9. The effect of the value of p0k on empirical power, between strata balanced and within strata balanced, (nk = 10, mk:nk = 1, K = 10, OR = 2, Ha7 shown in Table 1).
Figure 10. The effect of the number of strata K on empirical power, between strata balanced and within strata balanced: (p0k = 0.05; nk = 10; mk:nk = 1; OR = 2, Ha7 shown in Table 1).
Figure 11. The effect of the number of strata K on empirical power, between strata unbalanced and within strata balanced: the total sample size is kept unchanged while K changes. (p0k = 0.05; mk:nk = 1; ORi = ORj = 2. 4 small strata with nk = 5, and (K − 4) large strata with nk = 240/(2 × (K − 4)), Ha7 shown in Table 1).
The empirical power of all the statistics increased as nk increased (Figure 5, Figure 6). The ratio mk:nk had very little effect on the empirical power of U3, whereas the empirical power of the weighted U-statistic and the Breslow-Day test improved as mk:nk increased (Figure 7). Increasing the value of the odds ratio under the null hypothesis decreased the empirical power of all statistics studied (Figure 8). The empirical power increased as the value of p0k increased (Figure 9). When the number of strata K and the total sample size increased together, the power of the U-statistics and Zelen’s exact test improved (Figure 10). When the total sample size remained the same as K increased, the empirical sizes of the U-statistics improved, the empirical sizes of the Breslow-Day and Zelen statistics diverged from 5%, the empirical power of U3 remained almost unchanged, and the power of the other statistics decreased (Figure 11).
5. Application to Published Data
Generally, U3 and WU3 performed well in terms of both size and power. Their empirical sizes were stable under various situations, and they had relatively high power.
With the assumption that the odds ratios of K 2 × 2 tables are the same; and is normally distributed with variance equal to , we can derive the estimated expected value of U3 and WU3. They are:
The variance of U3 can also be expressed as:
The variance of WU3 can also be expressed as:
To estimate the , consider the Mantel-Haenszel estimator of the common odds ratio as a weighted average of the stratum odds ratios. Given the weight, we can solve the as a function of , which is:
And the value of can be calculated by the following formula:
where , , , .
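For reference, the standard Mantel-Haenszel estimator of the common odds ratio can be written as a weighted average of the stratum odds ratios with weights bk(nk − ak)/Nk. A minimal Python sketch follows, assuming each table is supplied as a tuple (a, b, n, m) with Nk = nk + mk; this tuple layout is a convention of the example only.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio for K independent 2 by 2 tables.

    tables: iterable of (a, b, n, m), where a of n exposed and b of m
    unexposed subjects have the outcome (layout assumed for this sketch).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, n, m in tables:
        total = n + m
        numerator += a * (m - b) / total    # a_k d_k / N_k
        denominator += b * (n - a) / total  # b_k c_k / N_k, the MH weight
    return numerator / denominator


if __name__ == "__main__":
    # Two strata with the same stratum odds ratio of 3
    print(mantel_haenszel_or([(2, 1, 4, 4), (2, 1, 4, 4)]))
```

The estimator is undefined when every stratum has a zero weight, so sparse data may call for a check on the denominator before dividing.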
To illustrate the application of these two U-statistics, we applied them, along with the Breslow-Day statistic and the Zelen statistic, to two published data sets: 1) alcohol consumption data (Table 2) from Statistical Methods in Cancer Research, volume 1, page 137, where they serve as an example introducing the Breslow-Day statistic; 2) new drug data (Table 4) from a study comparing a new drug with a control drug among 22 hospital sites. Because of the sparseness of the data, an asymptotic test (the Breslow-Day statistic) might not yield an accurate p-value. The test results for the homogeneity of odds ratios are presented in Table 3 and Table 5. For data set 1), according to the p-values of the test statistics, U3 rejected the null hypothesis, whereas the WU3, Breslow-Day and Zelen statistics did not reject it at the 0.05 level. For data set 2), Zelen and WU3 rejected the null hypothesis, whereas Breslow-Day and U3 did not reject the null hypothesis of no difference among odds ratios at the 0.05 level (Tables 2-5).
6. Discussion and Conclusions
To summarize the simulation study, we offer the following remarks. When the number of strata is not very small (K ≥ 6), the empirical sizes of U3 and WU3 were very stable under various situations and stayed very close to the nominal level of 0.05. In terms of size and power, U3 and WU3 performed better than the Breslow-Day statistic and Zelen’s exact test. Therefore, U3 and WU3 are considered better statistics for testing the homogeneity of odds ratios in this situation. U3 is recommended when the sample size is the same in each stratum, the number of strata is large and the sample size in each stratum is not large. Otherwise, WU3 is recommended.
Table 2. Alcohol consumption data.
Source: Statistical Methods in Cancer Research, volume 1, page 137.
Table 3. Homogeneity odds ratio test results for alcohol consumption data.
Table 4. New drug data.
Table 5. Homogeneity odds ratio test results for new drug data.
The Breslow-Day test is conservative in most situations; its empirical size is close to 0.05 when the sample size is large. When the sample size is small, the Breslow-Day test is not recommended, and it is never recommended for sparse data.
When the sample size is small and the number of strata is small, say less than 5, Zelen’s exact test is recommended.
In our application, the sample mean and the variance were estimated based on certain assumptions. The empirical power and size of U3 and WU3 would be highly dependent on how well the estimator of would be.
This work was partially supported by Cancer Prevention Research Institute of Texas (RP170668).
 Jones, M.P., O’Gorman, T.W., Lemke, J.H. and Woolson, R.F. (1989) A Monte Carlo Investigation of Homogeneity Tests of the Odds Ratio under Various Sample Size Configurations. Biometrics, 45, 171-181.
 Halperin, M., Ware, J.H., Byar, D.P., Mantel, N., Brown, C.C., Koziol, J., Gail, M. and Green, S.B. (1977) Testing for Interaction in an I × J × K Contingency Table. Biometrika, 64, 271-275.
 Pagano, M. and Tritchler, D. (1983) Algorithms for the Analysis of Several 2 × 2 Contingency Tables. SIAM Journal on Scientific and Statistical Computing, 4, 302-309.
 Thomas, D.G. and Gart, J.J. (1992) Improved and Extended Exact and Asymptotic Methods for the Combination of 2 × 2 Tables. Computers and Biomedical Research, 25, 75-84.
 Hirji, K.F., Vollset, S.E., Reis, I.M. and Afifi, A.A. (1996) Exact Tests for Interaction in Several 2 × 2 Tables. Journal of Computational and Graphical Statistics, 5, 209-224.
 Reis, I.M., Hirji, K.F. and Afifi, A.A. (1999) Exact and Asymptotic Tests for Homogeneity in Several 2 × 2 Tables. Statistics in Medicine, 18, 893-906.
 Bagheri, Z., Ayatollahi, S.M.T. and Jafari, P. (2011) Comparison of Three Tests of Homogeneity of Odds Ratios in Multicenter Trials with Unequal Sample Sizes within and among Centers. BMC Medical Research Methodology, 11, 58.
 Viwatwongkasem, C., Donjdee, K. and Poodphraw, T. (2018) Profile Likelihood Tests for Common Risk Ratios in Meta-Analysis Studies. Open Journal of Statistics, 8, 915-930.
 Gart, J.J. and Zweifel, J.R. (1967) On the Bias of Various Estimators of the Logit and Its Variance, with Application to Quantal Bioassay. Biometrika, 54, 181-187.