In a clinical trial with binary outcomes, the risk ratio (RR), as a measure of intervention effect, is defined as the ratio of the probabilities (risks) of having an adverse event in a treatment group and in a control group, $RR = p^T / p^C$. Let $x^T$ and $x^C$ be the numbers of events out of $n^T$ and $n^C$, the total numbers of persons (or the total person-times of exposure) in the treatment arm and the control arm, respectively. Then the maximum likelihood estimate of RR is obtained as
$$\widehat{RR} = \frac{x^T / n^T}{x^C / n^C}.$$
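As a small illustration of this estimate, the sketch below computes the ratio of the two observed risks. The function name and the counts are hypothetical and chosen only for the example.

```python
def risk_ratio(x_t, n_t, x_c, n_c):
    """Maximum likelihood estimate of RR: (x_T / n_T) / (x_C / n_C)."""
    if x_c == 0:
        raise ValueError("RR is undefined when the control arm has no events")
    return (x_t / n_t) / (x_c / n_c)

# Hypothetical counts: 30 events among 100 treated, 15 among 100 controls.
rr_hat = risk_ratio(x_t=30, n_t=100, x_c=15, n_c=100)
print(rr_hat)
```

With these counts the observed risk in the treatment arm is twice that in the control arm.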
A meta-analysis of study size k is a statistical approach that combines the results from k studies, conducted on the same topic and with similar methods, into a single summary result. In clinical trials, meta-analysis is an essential tool for better understanding how well treatments work. Two popular statistical models are the fixed effect model and the random effects model. Under the fixed effect model, we assume that all studies share a common effect size. That is, there is no heterogeneity between the studies: a single true effect size underlies all k independent trials, and the observed effect is determined by the common true effect plus the sampling error (within-study error). In contrast, under the random effects model, the true effect is not the same in all studies; we allow for a distribution of true effect sizes. It follows that the combined estimate is not an estimate of one value, but rather the mean of that distribution. Hence, there are two levels of error (within-study error and between-study error), and the observed effect is determined by the mean of all true effects plus the within-study error and the between-study error. In this sense, heterogeneity refers to true effect sizes that vary from study to study, and this heterogeneity can be incorporated into a random effects model. Alternatively, heterogeneity in the effect sizes from different studies may be explained by a set of covariates, such as study characteristics, type of treatment, average or aggregate characteristics of patients, or even publication bias; a meta-regression approach may then be used to account for the variation that such covariates induce among the heterogeneous effects.
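In symbols, the two models described above can be sketched as follows. The notation ($y_i$ for the observed effect in study i, $\theta$ for the common effect, $\mu$ for the mean of the true effects, $\varepsilon_i$ and $\zeta_i$ for the within-study and between-study errors, $\sigma_i^2$ and $\tau^2$ for their variances) is introduced here for illustration only, and the normal distributions are a common working assumption rather than part of the definition.

```latex
\begin{align*}
\text{Fixed effect:}   \quad & y_i = \theta + \varepsilon_i,
    & \varepsilon_i &\sim N(0, \sigma_i^2), \\
\text{Random effects:} \quad & y_i = \mu + \zeta_i + \varepsilon_i,
    & \zeta_i &\sim N(0, \tau^2), \quad \varepsilon_i \sim N(0, \sigma_i^2).
\end{align*}
```

Under this notation, homogeneity corresponds to $\tau^2 = 0$, in which case the random effects model reduces to the fixed effect model.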
Traditionally, before combining the effects of separate studies using either the fixed effect model (homogeneity) or the random effects model (heterogeneity), the conventional Cochran's Q test is used to test whether the treatment effects are homogeneous. Unfortunately, it is widely known that the standard Q test may be inaccurate for testing the null hypothesis of homogeneous effect sizes because of its low power. Kulinskaya and Dollinger and Boissel et al. stated that Cochran's Q test had low power in most situations, especially when the number of studies (k) was small. The works of Kulinskaya, Dollinger, and Bjørkestøl, Lipsitz et al., and Lui also confirmed the low power of Cochran's Q test. Low power of the Q test implies a low ability to detect heterogeneity when it actually exists (i.e. a low chance of rejecting the null hypothesis of homogeneous effects when the effects differ). A simple correction for the low power of the Q test is to use a larger significance level; Fleiss recommended a cut-off significance level of 0.1 rather than the usual 0.05, and this has become customary practice for Cochran's Q homogeneity test in meta-analysis. Increasing the power is equivalent to reducing the chance of a type II error, but this reduction comes at the cost of a higher chance of a type I error. Thus, when the low power problem is alleviated by using a 10% significance criterion, a new problem arises: the type I error is no longer maintained at the conventional level of significance.
Additionally, Shadish and Haddock stated that when the sample sizes in each study are very large, the null hypothesis of equal population effects may be rejected even if the individual effect estimates do not really differ much.
Profile likelihood estimation, as described by Ferrari et al. and Böhning et al., deals with the elimination of nuisance parameters. Generally, let the log-likelihood $\ell(\theta, \eta)$ depend on a vector of parameters of interest $\theta$ and a vector of nuisance parameters $\eta$. If $\hat{\eta}(\theta)$, as a function of $\theta$, is the solution such that $\ell(\theta, \hat{\eta}(\theta)) \ge \ell(\theta, \eta)$ for all $\eta$, then $\ell_P(\theta) = \ell(\theta, \hat{\eta}(\theta))$ is called the profile log-likelihood. The profile log-likelihood is not an ordinary log-likelihood, but the log-likelihood maximized over the nuisance parameters for given values of the parameter of interest. Consequently, the profile log-likelihood depends only on the parameter of interest.
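To make the idea concrete, the following sketch profiles out the baseline risk in a single two-arm study. It assumes a Poisson model in which the treatment count has mean $n^T \theta p^C$ and the control count has mean $n^C p^C$, with $\theta$ the risk ratio (parameter of interest) and $p^C$ the baseline risk (nuisance parameter); under that assumption the maximizing value of $p^C$ for fixed $\theta$ is $(x^T + x^C)/(n^T \theta + n^C)$. The function names and the data are hypothetical.

```python
import math

def loglik(theta, p_c, x_t, n_t, x_c, n_c):
    """Poisson log-likelihood (constants dropped) for one two-arm study.

    theta is the risk ratio (parameter of interest) and p_c the baseline
    risk (nuisance parameter); the treatment risk is theta * p_c.
    """
    mu_t = n_t * theta * p_c
    mu_c = n_c * p_c
    return x_t * math.log(mu_t) - mu_t + x_c * math.log(mu_c) - mu_c

def nuisance_hat(theta, x_t, n_t, x_c, n_c):
    """Value of p_c that maximizes the log-likelihood for a fixed theta."""
    return (x_t + x_c) / (n_t * theta + n_c)

def profile_loglik(theta, x_t, n_t, x_c, n_c):
    """Log-likelihood with the nuisance parameter maximized out."""
    p_hat = nuisance_hat(theta, x_t, n_t, x_c, n_c)
    return loglik(theta, p_hat, x_t, n_t, x_c, n_c)

# Hypothetical data for a single study.
data = dict(x_t=30, n_t=100, x_c=15, n_c=100)
grid = [0.5 + 0.01 * j for j in range(400)]  # theta from 0.5 to 4.49
theta_best = max(grid, key=lambda th: profile_loglik(th, **data))
print(round(theta_best, 2))
```

The grid search is only a device for the illustration; in this one-study case the profile log-likelihood peaks at the ordinary estimate $(x^T/n^T)/(x^C/n^C)$.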
Given the limitations of the Q test, namely low power and failure to maintain the type I error at the conventional level of significance, many researchers have proposed new or modified tests as alternatives. To address these limitations, our proposed tests modify the standard test of homogeneity by substituting the profile maximum likelihood estimates derived by Böhning et al. into the variance formula of the logarithm of the risk ratio, the effect measure of interest, over all k studies. A further comparator is the simply naive test based on the variance estimate from the conventional Poisson likelihood. Some numerical examples are given first. The next contribution is a comparison of the performance of these homogeneity tests in a simulation study, in terms of the type I error probability and the power, reported in a later section. Finally, the conclusion and discussion are presented.
2. Motivational Applications
Two examples of meta-analysis are presented to illustrate the implementation of the Q test and to show how the parameters of the simulation plan were set. Farquhar et al. conducted a meta-analysis of k = 7 studies to assess the 5-year follow-up of high-dose chemotherapy with autograft compared with conventional chemotherapy for poor-prognosis breast cancer. The outcome of treatment is event-free survival. The value of Cochran's Q statistic was 4.72. Since Q is approximately distributed as a chi-square statistic with k − 1 = 6 degrees of freedom (df) under the null hypothesis, the p-value is 0.58, so the null hypothesis of homogeneity of risk ratios across trials is not rejected. Additionally, the $I^2$ statistic, which describes the percentage of variation across studies that is due to heterogeneity, is 0%; consequently, a fixed effects model might be appropriate. There was thus no evidence of heterogeneity (Figure 1). In addition, there was no difference between the treatment and control groups on the binary events: the pooled estimate of RR under a fixed effects model was 1.01, with a 95% confidence interval (CI) of [0.97, 1.06], which covers the null value 1. The forest plot of the meta-analysis was created with the R package provided by Schwarzer et al., http://meta-analysis-with-r.org/.
Mottillo et al. considered data from a meta-analysis of 16 trials on the metabolic syndrome and cardiovascular risk. The value of Cochran's Q statistic was 6.12. The chi-square approximation with 15 degrees of freedom gives a p-value of 0.0003 for testing the null hypothesis of homogeneity. The $I^2$ heterogeneity index was 64%. The result provides evidence of heterogeneity across studies (Figure 2). Furthermore, there is a treatment effect on the binary outcomes, since the RR of 2.34 under a random effects model lies away from 1; the 95% CI is [2.02, 2.72], which does not cover 1.
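The p-value and heterogeneity index quoted for the first example can be checked with a few lines of code. The sketch below is not the R workflow used for the forest plots; it is a standard-library illustration that assumes the usual definition $I^2 = \max\{0, (Q - df)/Q\}$ and uses the closed-form tail probability of a chi-square variable with integer degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the finite sums that exist for integer degrees of freedom, so no
    third-party library is needed.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        terms = (half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * sum(terms)
    terms = (half ** (j - 0.5) / math.gamma(j + 0.5)
             for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)

def i_squared(q, df):
    """I^2 in percent, truncated at zero (assumed standard definition)."""
    return max(0.0, (q - df) / q) * 100.0

# First example: Q = 4.72 with k = 7 studies, hence df = 6.
p_first = chi2_sf(4.72, 6)
i2_first = i_squared(4.72, 6)
print(round(p_first, 2), i2_first)
```

For the first example this gives a p-value of about 0.58 and an $I^2$ of 0%, in line with the reported values. Applying the same relation to the second example, an $I^2$ of 64% with 15 degrees of freedom corresponds to a Q of roughly 41.7, for which `chi2_sf` returns a value on the order of 0.0003.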
3. Deriving Profile Likelihood Tests for Common Risk Ratio
The purposes of the study are 1) to derive profile likelihood tests for a null common risk ratio RR across k studies, which is equivalent to homogeneity of treatment effects over all k studies ($H_0: RR_1 = \dots = RR_k = RR$), by substituting the profile likelihood estimator into the formulas for the variance estimate of the logarithmic relative risk, $\widehat{Var}(\log \widehat{RR}_i)$, in the standard chi-square test; and 2) to compare the performance of the test statistics based on the profile likelihood method, under different formulas for the variance estimate of the logarithm of the risk ratio, with the conventional Cochran's Q test for testing a null common risk ratio RR across k studies ($H_0$) against the alternative $H_1$ (i.e. $RR_i$ has a specific distribution).
Figure 1. Forest plot of meta-analysis comparing high dose chemotherapy and autograft with the conventional chemotherapy.
Figure 2. Forest plot of meta-analysis of 16 trials about the metabolic syndrome and cardiovascular risk.
We follow the work and the notation of Böhning et al. and propose profile likelihood tests that modify the standard test for homogeneity by using different variance estimates for the logarithm of the risk ratio in the ith study.
3.1. Profile Likelihood Estimator under a Fixed Effect Point for a Common Risk Ratio across Studies
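The result of Böhning et al. under the profile likelihood concept provides a fixed-effect point estimate of RR for all k studies ($RR_1 = \dots = RR_k = RR$) as the solution of
$$\widehat{RR} = \frac{\sum_{i=1}^{k} x_i^T n_i^C / (n_i^C + \widehat{RR}\, n_i^T)}{\sum_{i=1}^{k} x_i^C n_i^T / (n_i^C + \widehat{RR}\, n_i^T)},$$
leading to the following iterative process for the profile maximum likelihood estimator (PMLE):
$$\widehat{RR}^{(m+1)} = \frac{\sum_{i=1}^{k} x_i^T n_i^C / (n_i^C + \widehat{RR}^{(m)} n_i^T)}{\sum_{i=1}^{k} x_i^C n_i^T / (n_i^C + \widehat{RR}^{(m)} n_i^T)}, \quad m = 0, 1, 2, \dots,$$
where $x_i^T$ and $x_i^C$ are the numbers of events in the treatment and control arms of clinical trial i, and $n_i^T$ and $n_i^C$ are the numbers of persons at risk or person-times.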
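A minimal sketch of such an iteration is given below. It assumes the fixed-point form that follows from a two-arm Poisson model with a common risk ratio, in which the update is the ratio of $\sum_i x_i^T n_i^C / (n_i^C + RR\, n_i^T)$ to $\sum_i x_i^C n_i^T / (n_i^C + RR\, n_i^T)$; the function name, starting value, tolerance and data are hypothetical choices made for the illustration.

```python
def pmle_common_rr(x_t, n_t, x_c, n_c, start=1.0, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the profile MLE of a common risk ratio.

    x_t, x_c: event counts per study in the treatment and control arms.
    n_t, n_c: persons at risk (or person-times) per study and arm.
    """
    theta = start
    for _ in range(max_iter):
        num = sum(xt * nc / (nc + theta * nt)
                  for xt, nt, nc in zip(x_t, n_t, n_c))
        den = sum(xc * nt / (nc + theta * nt)
                  for xc, nt, nc in zip(x_c, n_t, n_c))
        if den == 0:
            raise ValueError("no control events: the estimate is undefined")
        new_theta = num / den
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical data for k = 3 studies.
x_t, n_t = [12, 30, 45], [100, 150, 300]
x_c, n_c = [10, 14, 25], [100, 150, 300]
rr_pmle = pmle_common_rr(x_t, n_t, x_c, n_c)
print(round(rr_pmle, 3))
```

With the starting value 1, the first step of this update has the form of a Mantel-Haenszel type estimate, which is one reason 1 is a convenient default here.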
3.2. Some Tests Based on Various Formulas of Variance Estimate of Logarithmic RRi
We test the null hypothesis that the true relative risks ($RR_i$) are the same in all k centers/studies, $H_0: RR_i = RR$ for $i = 1, \dots, k$, versus the alternative that at least one of the effect sizes ($RR_i$) differs from the remainder. Equivalently, under the null the parameters of all centers to be combined are summarized by a single underlying population parameter, whereas under the alternative the parameters differ among centers and are treated as random with a specific distribution. Our proposed tests modify a standard test for homogeneity of the following form:
$$\chi^2 = \sum_{i=1}^{k} \frac{\left(\log \widehat{RR}_i - \log \widehat{RR}_{PMLE}\right)^2}{Var(\log \widehat{RR}_i)},$$
where k is the number of studies being combined, $\widehat{RR}_{PMLE}$ is the PMLE of a common RR, $\widehat{RR}_i$ is an estimate of $RR_i$ at the ith study, the two natural logarithm transformations, $\log \widehat{RR}_i$ and $\log \widehat{RR}_{PMLE}$, are needed to accommodate the asymmetric distribution, and $k - 1$ is the degrees of freedom of the test. As usual, the variance of the logarithm of the risk ratio at the ith study, $Var(\log \widehat{RR}_i)$, is replaced by one of its various estimates, $\widehat{Var}(\log \widehat{RR}_i)$, leading finally to several candidate tests.
1) Simply naive test (SIM), based on variance estimate at the ith study under Poisson likelihood by Delta method, is denoted as
where , , and .
2) Profile likelihood test (PL1), which has the same form as above but a different formula because the variance is estimated under the null hypothesis, is denoted as
where and .
3) Profile likelihood test (PL2), which is obtained by using a further formula for the variance estimate, is denoted as
where and are the results of Böhning et al.  under profile likelihood concept.
4) Cochran’s Q test as the weighted sum of squares is distributed as a chi-square statistic with k − 1 degrees of freedom, under the null of homogeneity of treatment effects across k studies, denoted as
where , , , , and .
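The general shape of these statistics can be sketched in code. The example below is an illustration under stated assumptions, not the exact formulas of the four tests: it uses the delta-method variance of a log rate ratio under a Poisson model, $1/x_i^T + 1/x_i^C$, both for a statistic centred at a supplied common log RR and for Cochran's Q with inverse-variance weights. The variance formulas specific to PL1 and PL2 are not reproduced, zero counts are not handled, and all names and data are hypothetical.

```python
import math

def log_rr_and_var(x_t, n_t, x_c, n_c):
    """Per-study log risk ratio and its delta-method variance under Poisson."""
    y = [math.log((xt / nt) / (xc / nc))
         for xt, nt, xc, nc in zip(x_t, n_t, x_c, n_c)]
    v = [1.0 / xt + 1.0 / xc for xt, xc in zip(x_t, x_c)]
    return y, v

def homogeneity_stat(y, v, log_common):
    """Sum of squared deviations from a common log RR, scaled by variances."""
    return sum((yi - log_common) ** 2 / vi for yi, vi in zip(y, v))

def cochran_q(y, v):
    """Cochran's Q with inverse-variance weights and the weighted mean."""
    w = [1.0 / vi for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, y))

# Hypothetical data for k = 3 studies.
x_t, n_t = [12, 30, 45], [100, 150, 300]
x_c, n_c = [10, 14, 25], [100, 150, 300]
y, v = log_rr_and_var(x_t, n_t, x_c, n_c)
q = cochran_q(y, v)
stat = homogeneity_stat(y, v, log_common=math.log(1.5))  # 1.5 is arbitrary
print(round(q, 3), round(stat, 3))
```

Each statistic would then be compared with a chi-square distribution with k − 1 degrees of freedom. Because the weighted mean minimizes the weighted sum of squares, Q is never larger than the statistic centred at any other value when the same variances are used.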
4. Monte Carlo Simulation
We perform two simulation plans. The first studies the type I error for testing a null common risk ratio, RR, over all k studies, in other words for testing the null hypothesis of homogeneity. The second compares the tests to find the one with the highest power, after all test statistics have been verified to lie within the same limit range of the empirical type I error.
4.1. Simulation Plan for Studying Type I Error
Parameter Setting: The parameter values follow the two motivating examples. Let the common relative risk (RR) be 1, 2, and 4. Baseline risks $p_i^C$ in the control arm for the ith center are generated from a uniform distribution whose range depends on the value of RR, chosen so that the corresponding treatment risk, $p_i^T = RR \cdot p_i^C$, is less than or equal to 0.9 for the ith center. The sample sizes $n_i^T$ and $n_i^C$ are drawn from Poisson distributions with means of 5, 10, 50, 100, 500, and 1000. The number of centers k is 4, 16, and 32.
Statistics: Poisson random variables $x_i^T$ and $x_i^C$ in the treatment and control arms for center i are generated with parameters $n_i^T p_i^T$ and $n_i^C p_i^C$, respectively. All candidate tests are then computed. The procedure is replicated 5000 times. From these replicates, the number of rejections of the null hypothesis is counted to obtain the actual (empirical) type I error.
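One replicate of such a data-generating step might look like the sketch below. It is an illustration only: the upper limit of the baseline risk (`max_baseline`), the guard against empty arms, the seed, and the hand-written Poisson sampler are assumptions made for the example, and only the standard library is used.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the small means used in this sketch; it is slow for large
    means and fails once exp(-mean) underflows.
    """
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate_null_replicate(rng, k, rr, mean_n, max_baseline):
    """One replicate of k two-arm studies sharing a common risk ratio rr."""
    studies = []
    for _ in range(k):
        n_t = max(1, poisson(rng, mean_n))    # assumed guard against empty arms
        n_c = max(1, poisson(rng, mean_n))
        p_c = rng.uniform(0.0, max_baseline)  # baseline risk in the control arm
        x_t = poisson(rng, n_t * rr * p_c)    # treatment events
        x_c = poisson(rng, n_c * p_c)         # control events
        studies.append((x_t, n_t, x_c, n_c))
    return studies

rng = random.Random(2024)  # arbitrary seed for reproducibility
replicate = simulate_null_replicate(rng, k=4, rr=2.0, mean_n=50,
                                    max_baseline=0.25)
for study in replicate:
    print(study)
```

Repeating this step many times, computing a test statistic on each replicate, and counting the rejections gives the empirical type I error described above.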
The type I error of each test is assessed by comparing the actual (estimated) type I error ($\hat{\alpha}$) with the nominal level of significance ($\alpha$). The departure of the estimated type I error from the nominal level of significance must not exceed a specified limit. In this study, the evaluation for two-sided tests is based on the Bradley limit, yielding the range $0.5\alpha \le \hat{\alpha} \le 1.5\alpha$. For example, at the 1% level of significance, $\hat{\alpha}$ must lie within [0.5%, 1.5%]; at the 5% level, within [2.5%, 7.5%]; and at the 10% level, within [5%, 15%].
If the empirical type I error lies within the range of the Bradley limit, then the statistical test is regarded as controlling the type I error.
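The criterion can be written as a one-line check. In the sketch below the function names and the rejection counts are hypothetical; the interval is the one given by the examples above, from half to one and a half times the nominal level.

```python
def empirical_rate(rejections, replicates):
    """Proportion of replicates in which the null hypothesis was rejected."""
    return rejections / replicates

def within_bradley_limit(empirical, nominal):
    """True when the empirical rate lies in [0.5 * nominal, 1.5 * nominal]."""
    return 0.5 * nominal <= empirical <= 1.5 * nominal

# Hypothetical rejection counts out of 5000 replicates at the 5% level.
ok_rate = empirical_rate(260, 5000)   # 0.052, inside [0.025, 0.075]
bad_rate = empirical_rate(410, 5000)  # 0.082, above 0.075
print(within_bradley_limit(ok_rate, 0.05))
print(within_bradley_limit(bad_rate, 0.05))
```

The first rate is accepted and the second is not, so only the first hypothetical test would be carried forward to a power comparison.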
4.2. Simulation Plan for Studying Power of Tests
Before the power of the test statistics is compared, all test statistics must satisfy the same limit range of the type I error rate under the null hypothesis. Power comparisons are reliable only among tests whose type I error rates have first been shown to lie within this common range.
Under the alternative hypothesis, the study-specific risk ratios $RR_i$ are assumed to follow a specific distribution around a mean RR of 1, 2, or 4. They are obtained by perturbing the mean with a term that is uniform over (−mm, mm) for a given mm = 0.2, 0.4, 0.6, where U is a uniform variate over (0, 1). Baseline risks are again generated from a uniform distribution over [0, 0.25]. Poisson random variables $x_i^T$ and $x_i^C$ are generated with parameters $n_i^T p_i^T$ and $n_i^C p_i^C$, respectively. The procedure is replicated 5000 times and the number of rejections of the null hypothesis is counted to obtain the empirical power.
Since it is impractical to present all of the extensive results from the simulation study, we illustrate only some instances, at the 0.05 level of significance, with common true relative risks of 1 and 2, for both equal and unequal sample sizes.
5.1. Equal Sample Sizes ($n^T = n^C$)
5.1.1. Studying Type I Errors
・ Table 1 shows that for small sample sizes, regardless of study size k, almost none of the tests can control the type I error.
・ For moderate to large study sizes (k = 16, 32) in combination with moderate to large sample sizes, the two proposed tests (PL1, PL2) maintain the type I error rate in almost all situations. Meanwhile, for moderate to large study sizes, the Q test appears to control the type I error only when sample sizes are large.
・ For a small number of centers (k = 4), the SIM test controls the type I error for some moderate and large sample sizes, and the Q test controls the type I error when the sample size is moderate.
・ In summary, when the study size is moderate to large (k = 16, 32), the two profile likelihood tests (PL1 and PL2) maintain the type I error rate well when sample sizes are moderate to large, whereas the Q test controls the type I error only when the sample size is quite large.
Table 1. Comparison of the empirical type I error rates of the four statistical tests at the 5% significance level, with equal mean sample sizes in the two arms.
Note: Bold values indicate that the test statistics can control type I error rate.
5.1.2. Comparing Powers of Tests
・ Power comparisons are made only among candidate tests that have first been shown to maintain the type I error within the same limit range.
・ Table 2 shows that the PL1 and PL2 tests have the highest power when the study size is moderate to large (k = 16, 32) and sample sizes are moderate, at every degree of variation (mm = 0.2, 0.4, 0.6), for RR = 1 and 2. In more detail, PL2 appears slightly better than PL1, with higher power.
・ When the study size is moderate to large (k = 16, 32) and the sample size is large, the Q test has the highest power at every degree of variation (mm = 0.2, 0.4, 0.6), for RR = 1 and 2.
・ When the number of studies is small (k = 4) in combination with large sample sizes, the best-performing test is the SIM test, since it is the only test that first meets the criterion of controlling the type I error.
5.2. Unequal Cases ($n^T \ne n^C$)
5.2.1. Studying Type I Errors
・ Table 3 indicates that for RR = 2 and moderate to large study sizes (k = 16, 32), three tests (PL1, PL2, Q) control the type I error when the sample sizes in both the treatment and control groups are moderate to large. The SIM test cannot control the type I error for any combination of sample sizes.
・ Table 4 focuses on small study sizes (k = 4). There, the SIM test appears to control the type I error when the sample size of at least one of the treatment groups is large. Both the PL1 and PL2 tests control the type I error when the sample size of one of the treatment groups is small. The Q test rarely controls the type I error at any sample size for small study sizes.
5.2.2. Studying Power of Tests
・ Table 5 indicates that for moderate to large study sizes (k = 16, 32) in combination with moderate sample sizes, the two proposed tests (PL1, PL2) perform best and are quite close to each other.
・ For moderate to large study sizes (k = 16, 32) in combination with a large sample size in at least one treatment arm, the Q test appears to perform best, with the highest power, followed by the PL2 and PL1 tests.
・ Additionally, when the sample sizes of both the treatment and control arms are quite small, regardless of study size (k), none of the four tests (SIM, PL1, PL2, Q) is reasonable, since almost all of them fail to control the type I error rate and their power is too low.
Table 2. Comparison of the power of the tests, after controlling the type I error at the 0.05 significance level, when the mean sample sizes in the treatment groups are equal.
Note: Bold values indicate that the test statistic first controlled the type I error rate.
Table 3. Comparison of the estimated type I error rates at the 5% significance level for moderate to large study sizes (k = 16, 32) with unequal sample sizes.
Note: Bold values indicate that the test statistic can control the type I error.
Table 4. Comparison of the actual type I error rates at the 5% significance level for small study size (k = 4) with unequal sample sizes.
Note: Bold values indicate that the test statistic can control the type I error.
Table 5. Comparison of the power of the tests at the 0.05 significance level for moderate to large study sizes (k = 16, 32) with unequal sample sizes at mm = 0.2.
Note: Bold values indicate that the test statistic first controlled the type I error.
In this study, we further focus on a comparison of the performance of four statistical tests: the simply naive test approach (SIM), the conventional null approach of profile likelihood (PL1), the full profile likelihood approach (PL2), and the conventional weighted sum of squares approach (Q). The main results are as follows.
・ No test could control the type I error rate for small sample sizes, regardless of study size k. The same result was found in the study of Mathes and Kuss, who stated that estimating between-study heterogeneity in a meta-analysis with small sample sizes is difficult.
・ The work of Willis and Riley also confirmed that the Q test is a good test when the study size is large (50 studies or more), but has low power for fewer studies.
・ Like other researchers, we have attempted to propose new or modified tests to bridge the gaps left by the limitations of the Q test. This paper shows how to construct two proposed tests (PL1, PL2) by substituting profile maximum likelihood estimates into different variance formulas to obtain modified standard chi-square tests of heterogeneity.
・ For moderate to large study sizes (k = 16, 32) in combination with moderate sample sizes, our profile likelihood tests (PL1 and PL2) outperform the Q test, with higher power after satisfying the same range of type I error limits.
・ The works of Bagheri, Ayatollahi and Jafari and of Viechtbauer also evaluated the influence of the number of centers (k) and the sample sizes on the type I error and the power for testing the null hypothesis of homogeneity in some situations. This suggests that investigators should continue their attempts to find new or modified tests.
・ In contrast, although the two proposed tests (PL1, PL2) perform well in the situations above, they cannot outperform the Q test when the number of studies is moderate to large (k = 16, 32) in combination with large sample sizes. Additionally, in unbalanced cases, for moderate to large study sizes combined with one moderate and one large sample size, the Q test performs best with the highest power, followed by the PL2 and PL1 tests.
In summary, the idea of substituting profile likelihood estimates into the variance formulas of the logarithm of the relative risk works well when the number of studies is moderate to large in combination with moderate sample sizes.
The two proposed tests (PL1, PL2), based on substituting profile maximum likelihood estimates into different variance formulas, perform best, with the highest power (after first satisfying the same range of type I error limits), in some situations, for example when the number of studies is moderate to large (k = 16, 32) in combination with moderate sample sizes. We therefore suggest using the two proposed tests in such practical situations.
In contrast, although the two proposed tests (PL1, PL2) achieve high power in the situations above, they cannot outperform the Q test when the number of studies is moderate to large (k = 16, 32) in combination with large sample sizes, in both balanced and unbalanced cases. We therefore suggest using the Q test in these situations. Further investigation is needed to find a new appropriate test that fills the remaining gaps left by the low power of the Q test.
The publication of this study was partially supported by the Faculty of Public Health, Mahidol University, Bangkok, Thailand.