Repeated binary testing, often referred to as a binary measurement system (BMS), is regularly used in quality control studies as a means of assessing the quality of the units produced. However, these inspection methods are highly dependent on the quality of the individual inspectors, thus making the inspection itself an integral part of the quality control process. Two aspects of evaluating the inspection process are repeatability and reproducibility. A process’s repeatability refers to how frequently a single inspector inspecting a single item will obtain the same result, while reproducibility refers to how often different inspectors inspecting the same item will reach the same conclusion. Estimating classification rates of a system has been considered by several authors.  considered various sampling plans to assess the qualities of a BMS.  found maximum likelihood estimators and method of moment estimators for the case of multiple raters, assuming fixed effects. When there are multiple inspectors, it may be of interest to determine which of several inspectors or inspection systems is performing best.
The model we consider here is a Bayesian version of  . Specifically, we consider a random effects model for multiple testers and multiple inspections per inspector from a Bayesian perspective. There are multiple advantages to using a Bayesian approach. For example, prior knowledge can be incorporated into the study through informative prior distributions. This knowledge can be obtained either from previous data or from expert opinion. Also, even in the absence of prior knowledge, when the likelihood asymptotically dominates the prior, interval estimates generated from the Bayesian paradigm are based largely on the likelihood, an approach that has been shown to be superior to other interval estimation methods  . Another advantage of the Bayesian paradigm is that, if the prior is sufficiently informative, assumptions required for identifiability can be relaxed. Thus, our Bayesian approach can be used in situations where the parameters of a likelihood function are not identifiable. The Bayesian estimators considered here have no known closed form and, thus, must be approximated. We use Markov chain Monte Carlo (MCMC) simulations to sample from the model’s posterior distribution and obtain parameter estimates.
The remainder of the paper is outlined as follows. In Section 2, we present the model and give identifiability assumptions. In Section 3, we describe a simulation study and present the simulation results for the Bayesian estimation. In Section 4, we apply our model to two parameter-ranking applications and two subset selection problems for multiple sites, and in Section 5, we perform an additional simulation to determine the effectiveness of our subset selection procedure. Finally, in Section 6, we provide several comments summarizing our results.
2. The Model
Assume that N items to be inspected are randomly sampled from the general population of items. Let the true quality state of the ith item be denoted by $T_i$, $i = 1, \ldots, N$, where $T_i = 1$ indicates a good item and $T_i = 0$ denotes an item that fails to meet the quality specifications. The symbol $\pi = P(T_i = 1)$ denotes the overall conforming rate. Because we assume that no gold standard is used and because $T_i$ is a latent variable, we also assume $T_i \sim \text{Bernoulli}(\pi)$.
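Repeated independent, fallible observations are then obtained by m different inspectors on the ith unit to indirectly assess its true state, where $i = 1, \ldots, N$. Let $Y_{ijk}$ denote the result of the kth inspection on the ith item by the jth inspector, where $Y_{ijk} = 1$ denotes a passed inspection, $Y_{ijk} = 0$ denotes a failed inspection, $j = 1, \ldots, m$, and $k = 1, \ldots, n$. For each item i and inspector j, we further define the conditional probabilities $\alpha_j = P(Y_{ijk} = 1 \mid T_i = 0)$ (false positive rate) and $\beta_j = P(Y_{ijk} = 0 \mid T_i = 1)$ (false negative rate) with respect to the true state of the item, $T_i$. Further, assume
$$Y_{ijk} \mid T_i \sim \text{Bernoulli}\big(T_i (1 - \beta_j) + (1 - T_i)\,\alpha_j\big). \tag{1}$$
Here, we initially assume that inspections are independent, given the true latent state of the ith part. This conditional independence assumption yields
$$P(Y_{i11}, \ldots, Y_{imn} \mid T_i) = \prod_{j=1}^{m} \prod_{k=1}^{n} P(Y_{ijk} \mid T_i). \tag{2}$$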
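To relax the assumption that all inspectors have the same probability of classifying correctly and to allow for other random heterogeneity, we consider the random effects model where the inspector-specific false positive rates $\alpha_j$ and false negative rates $\beta_j$ satisfy
$$\alpha_j \sim \text{Beta}(\mu_\alpha, \gamma_\alpha), \qquad \beta_j \sim \text{Beta}(\mu_\beta, \gamma_\beta), \tag{3}$$
where the Beta distribution with the usual shape parameters a and b has been reparameterized such that $\mu = a/(a + b)$ and $\gamma = a + b$. Thus, the reparameterized Beta probability density function (PDF) is
$$f(x \mid \mu, \gamma) = \frac{\Gamma(\gamma)}{\Gamma(\mu\gamma)\,\Gamma((1 - \mu)\gamma)}\, x^{\mu\gamma - 1} (1 - x)^{(1 - \mu)\gamma - 1}, \quad 0 < x < 1. \tag{4}$$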
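As a minimal sketch of this reparameterization, the following Python converts a mean and an equivalent sample size into the usual Beta shape parameters and draws inspector-level error rates. It assumes the common convention mean = a/(a + b) and size = a + b; the function names and numeric values are hypothetical and chosen only for illustration.

```python
import random

def beta_shapes(mean, size):
    """Convert (mean, size) to standard Beta shapes (a, b).

    Assumes mean = a / (a + b) and size = a + b (an assumed convention).
    """
    if not 0.0 < mean < 1.0 or size <= 0.0:
        raise ValueError("need 0 < mean < 1 and size > 0")
    return mean * size, (1.0 - mean) * size

def draw_rates(mean, size, m, rng):
    """Draw m inspector-specific error rates from the reparameterized Beta."""
    a, b = beta_shapes(mean, size)
    return [rng.betavariate(a, b) for _ in range(m)]

rng = random.Random(1)
a, b = beta_shapes(0.10, 10.0)   # mean 0.10 with size 10 corresponds to Beta(1, 9)
rates = draw_rates(0.10, 10.0, m=3, rng=rng)
```

Under this convention the size parameter can be read as an equivalent prior sample size, which is one reason the mean and size form is easy to interpret.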
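To complete the hierarchical model we require priors for the conforming rate $\pi$, the random-effects means $\mu_\alpha$ and $\mu_\beta$, and the random-effects precision parameters $\gamma_\alpha$ and $\gamma_\beta$. Specifically, we assume Beta priors for $\pi$, $\mu_\alpha$, and $\mu_\beta$. Finally, Gamma priors are used for $\gamma_\alpha$ and $\gamma_\beta$. In the absence of prior information, diffuse Beta priors (for example, Beta(1, 1)) can be used for $\pi$, $\mu_\alpha$, and $\mu_\beta$, and diffuse Gamma priors can be used for $\gamma_\alpha$ and $\gamma_\beta$.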
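We have chosen a Beta distribution to model the random effects because of its interpretability under this reparameterization. An often used alternative model structure is
$$\text{logit}(\alpha_j) \sim N(\eta, \sigma^2), \tag{5}$$
with an analogous structure for the false negative rates, where $\eta$ is generally given a normal prior and $\sigma$ is often given a half-t or half-Cauchy prior.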
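For the parameters $\pi$, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$, and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_m)$, the likelihood given the latent vector $\mathbf{T} = (T_1, \ldots, T_N)$ and the observed data matrix $\mathbf{Y}$ is
$$L(\pi, \boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{T}, \mathbf{Y}) = \prod_{i=1}^{N} \pi^{T_i} (1 - \pi)^{1 - T_i} \prod_{j=1}^{m} \left[(1 - \beta_j)^{s_{ij}} \beta_j^{\,n - s_{ij}}\right]^{T_i} \left[\alpha_j^{\,s_{ij}} (1 - \alpha_j)^{n - s_{ij}}\right]^{1 - T_i}, \tag{6}$$
where
$$s_{ij} = \sum_{k=1}^{n} Y_{ijk} \tag{7}$$
is the number of passes given to item i by inspector j.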
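To make the structure of this likelihood concrete, here is a hedged Python sketch of the complete-data log-likelihood. The coding (1 for a good item or a pass), the per-inspector pass counts, and all names are assumptions made for illustration; the constant binomial coefficients are omitted.

```python
import math

def complete_data_loglik(pi, alpha, beta, T, S, n):
    """Log-likelihood of (pi, alpha, beta) given latent states and pass counts.

    T[i] is 1 for a good item and 0 otherwise (assumed coding).
    S[i][j] is the number of passes inspector j gave item i out of n inspections.
    alpha[j] = P(pass | bad item), beta[j] = P(fail | good item) (assumed notation).
    """
    ll = 0.0
    for t_i, s_i in zip(T, S):
        # Contribution of the latent state itself.
        ll += math.log(pi) if t_i == 1 else math.log(1.0 - pi)
        for a_j, b_j, s in zip(alpha, beta, s_i):
            # Probability that inspector j passes this item, given its true state.
            p_pass = (1.0 - b_j) if t_i == 1 else a_j
            ll += s * math.log(p_pass) + (n - s) * math.log(1.0 - p_pass)
    return ll

# One good item that a single inspector passed on all 3 inspections.
ll = complete_data_loglik(0.9, [0.1], [0.2], T=[1], S=[[3]], n=3)
```

In an MCMC scheme the latent states would be sampled rather than known; this function only evaluates the likelihood for one fixed configuration.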
For the random effects model, the first assumption necessary for identifiability  is
The interpretation of (8) is that the overall expected probability of correctly classifying an item is greater than the chance of misclassifying it. This assumption is required due to the bimodal nature of the likelihood  .
The second identifiability assumption assures that there are enough degrees of freedom to estimate all model parameters. This assumption requires two things: that enough inspectors and inspections per inspector are available to estimate the status of each item, and that enough inspectors are available to estimate the inspectors’ random effects parameters. The second condition requires at least two inspectors while letting . A sufficient condition to meet the first requirement is that
In the present model, (9) is sufficient because additional inspections do not harm the model identifiability.
The third identifiability assumption is that both true negatives ( ) and true positives ( ) exist in the sample. This assumption is necessary because the absence of true negatives indicates one cannot estimate false negative rates.  have demonstrated that the absence of either true negatives or true positives essentially implies that there is enough data to estimate only half of the variables, namely , , and or , , and . We remark that the last two identifiability requirements can be omitted if one uses sufficiently informative priors on at least some parameters.
3. Ranking and Selecting Inspectors
Suppose we are interested in determining which inspector has the lowest overall error rate. Here, we have chosen to combine the false positive and false negative rates into a single positive likelihood ratio (LR), $LR_j = (1 - \beta_j)/\alpha_j$, where $\alpha_j$ and $\beta_j$ are the false positive and false negative rates of inspector j. The inspector with the highest likelihood ratio is deemed the best. The positive likelihood ratio may not always be the most appropriate combination of the error rates; it is simply the one we use here as an example. In some cases, the negative likelihood ratio, or even a weighted sum of $\alpha_j$ and $\beta_j$, may be more appropriate. The choice can be made on a case-by-case basis. We follow the method of  who have derived a decision-theoretic approach to partition parameters into two sets based on an ordering of the parameters of interest. Also,  extended their work to determine the largest Poisson rate when counts are subject to misclassification. Here we apply the method to partition a group of inspectors into a superior set, S, and an inferior set, $S^c$.
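As a small illustration of how the two error rates combine, the sketch below computes the positive and negative likelihood ratios using their standard definitions, treating a pass as the positive result. The function names and the example rates are hypothetical.

```python
def positive_likelihood_ratio(false_pos, false_neg):
    """LR+ = sensitivity / (1 - specificity) = (1 - FNR) / FPR; larger is better."""
    if not 0.0 < false_pos <= 1.0 or not 0.0 <= false_neg <= 1.0:
        raise ValueError("rates must be probabilities and false_pos must be positive")
    return (1.0 - false_neg) / false_pos

def negative_likelihood_ratio(false_pos, false_neg):
    """LR- = (1 - sensitivity) / specificity = FNR / (1 - FPR); smaller is better."""
    if not 0.0 <= false_pos < 1.0 or not 0.0 <= false_neg <= 1.0:
        raise ValueError("rates must be probabilities and false_pos must be below 1")
    return false_neg / (1.0 - false_pos)

# Hypothetical inspector with a 5% false positive and 10% false negative rate.
lr_pos = positive_likelihood_ratio(0.05, 0.10)   # roughly 18
lr_neg = negative_likelihood_ratio(0.05, 0.10)   # roughly 0.105
```

Because the false positive rate sits in the denominator, the positive likelihood ratio is very sensitive to small false positive rates, which is worth keeping in mind when choosing the ranking criterion.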
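In the creation of a best subset, there are m separate two-state decision problems. Each decision involves whether or not to place an inspector’s likelihood ratio in the superior set, S. Writing $LR_{(m)} = \max_l LR_l$ for the largest likelihood ratio, we assign the following constant loss functions:
$$L_1(LR_j) = \begin{cases} 0, & LR_j = LR_{(m)}, \\ c_1, & LR_j \neq LR_{(m)}, \end{cases} \qquad L_2(LR_j) = \begin{cases} c_2, & LR_j = LR_{(m)}, \\ 0, & LR_j \neq LR_{(m)}, \end{cases} \tag{10}$$
where $L_1$ and $L_2$ are the loss functions for the actions $a_1$ (place $LR_j$ in S) and $a_2$ (leave $LR_j$ out of S), respectively. To make a decision, only the ratio $c = c_2/c_1$ is required. These loss functions determine the decision criterion: take action $a_1$ and include $LR_j$ as a candidate for the largest parameter if $P(LR_j = LR_{(m)} \mid \mathbf{Y}) \geq 1/(c + 1)$. Here, generally, $c_2 > c_1$ because failing to place the best in S is the more serious error. Thus, c should be chosen larger than 1.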
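The decision threshold follows from comparing posterior expected losses. The derivation below is a sketch under assumed notation that may differ from the source: $p_j$ is the posterior probability that inspector j is best, $c_1$ is the loss for including an inspector who is not best, $c_2$ is the loss for excluding the best inspector, and $c = c_2/c_1$.

```latex
\begin{align*}
E[\text{loss} \mid \text{include } j] &= c_1\,(1 - p_j), \\
E[\text{loss} \mid \text{exclude } j] &= c_2\, p_j, \\
\text{include } j \iff c_1\,(1 - p_j) \le c_2\, p_j
  &\iff p_j \ge \frac{c_1}{c_1 + c_2} = \frac{1}{c + 1}.
\end{align*}
```

Only the ratio of the two losses enters the rule, so a larger c lowers the threshold and admits more inspectors to the superior set.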
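The probability that $LR_j$ is the best of the likelihood ratios is
$$p_j = P\big(LR_j = LR_{(m)} \mid \mathbf{Y}\big) = \int_{\{LR_j \geq LR_l \text{ for all } l\}} p(LR_1, \ldots, LR_m \mid \mathbf{Y}) \, dLR_1 \cdots dLR_m, \tag{11}$$
where $LR_{(m)} = \max_l LR_l$ and $p(LR_1, \ldots, LR_m \mid \mathbf{Y})$ is the marginal posterior of the likelihood ratios. MCMC methods are used to approximate (11) numerically. To accomplish this task, we generate a sample $\big(LR_1^{(b)}, \ldots, LR_m^{(b)}\big)$, for $b = 1, \ldots, B$, of size B from the posterior distribution, and then approximate the posterior probability that $LR_j$ is the best parameter by
$$\hat{p}_j = \frac{1}{B} \sum_{b=1}^{B} I\big(LR_j^{(b)} = LR_{(m)}^{(b)}\big), \tag{12}$$
where $I(\cdot)$ is the indicator function and $LR_{(m)}^{(b)} = \max_l LR_l^{(b)}$, and B is the Monte Carlo repetition size.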
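The parameter $LR_j$ is included in the superior set S if its estimated posterior probability of being the largest, $\hat{p}_j$, satisfies
$$\hat{p}_j \geq \frac{1}{c + 1}, \tag{13}$$
where c is the ratio of the loss for omitting the best inspector to the loss for including an inferior one.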
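The Monte Carlo approximation and the inclusion rule can be sketched in a few lines of Python. This is an illustration only: the lognormal draws below are a synthetic stand-in for real MCMC output, and the function names are hypothetical.

```python
import random

def prob_best(draws):
    """Estimate P(parameter j is largest) from posterior draws.

    draws is a list of B tuples, each holding one posterior draw of the m
    likelihood ratios. Ties (probability zero for continuous draws) go to
    the first maximum.
    """
    m = len(draws[0])
    counts = [0] * m
    for draw in draws:
        counts[max(range(m), key=lambda j: draw[j])] += 1
    return [count / len(draws) for count in counts]

def superior_set(probs, c):
    """Indices whose estimated probability of being best is at least 1 / (c + 1)."""
    threshold = 1.0 / (c + 1.0)
    return [j for j, p in enumerate(probs) if p >= threshold]

# Synthetic stand-in for MCMC output: three inspectors, B = 5000 draws.
rng = random.Random(0)
draws = [(rng.lognormvariate(3.0, 0.3),
          rng.lognormvariate(2.5, 0.3),
          rng.lognormvariate(2.0, 0.3)) for _ in range(5000)]
probs = prob_best(draws)
chosen = superior_set(probs, c=10)
```

Since exactly one parameter is largest in each draw, the estimated probabilities sum to one across inspectors.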
As an example we consider data from  on a sample of 38 prints produced by inkjet cartridges. Three inspectors analyzed each print 3 times. Only the total number of passes out of the 9 inspections was provided, so for illustrative purposes, for those parts that did not have 0 or 9 passes, we distributed the number of passes across the three inspectors in a way that best matched the frequentist parameter estimates provided in  . We assign Beta(1, 9) priors to both the mean false positive rate $\mu_\alpha$ and the mean false negative rate $\mu_\beta$ since both of these quantities are expected to be considerably below 0.50. Our expert was 95% certain that both misclassification rates were less than 0.40, and a Beta(1, 9) prior appropriately modeled the uncertainty. These distributions have prior 95% intervals of (0.003, 0.336) and an equivalent sample size of 10 observations and, therefore, would be considered mildly informative. A Beta(1, 1) prior is used for the conforming rate $\pi$, and Gamma(0.1, 0.1) priors are used for both random-effects precision parameters $\gamma_\alpha$ and $\gamma_\beta$. A burn-in of 10,000 iterations was used, and inferences were based on the 20,000 subsequent iterations. The posterior summaries for each model parameter are provided in Table 1.
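The stated prior interval can be checked directly, because a Beta(1, b) distribution has the closed-form CDF $1 - (1 - x)^b$. The short Python sketch below uses that fact; the function name is hypothetical.

```python
def beta_1_b_quantile(p, b):
    """Quantile of a Beta(1, b) distribution, whose CDF is 1 - (1 - x)**b."""
    return 1.0 - (1.0 - p) ** (1.0 / b)

lower = beta_1_b_quantile(0.025, 9)       # about 0.003
upper = beta_1_b_quantile(0.975, 9)       # about 0.336
prob_below_040 = 1.0 - (1.0 - 0.40) ** 9  # about 0.99
```

The equal-tailed 95% interval agrees with (0.003, 0.336). Under Beta(1, 9) the probability of a rate below 0.40 is about 0.99, so the prior is at least as concentrated as the elicited 95% statement.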
Table 1. Posterior summary for  example.
From Figure 1, we see that when the error rates are combined into the positive likelihood ratio, where a higher number is better, Inspector 1 has the highest LR overall. To apply the decision-theoretic procedure to determine whether any inspector is “best,” we compute the posterior probability of each likelihood ratio parameter being the largest. Here, a value of $c = 10$ implies that it is 10 times worse to leave the best inspector out of the superior set than to put an inferior inspector in it; the critical probability is then 1/(10 + 1) = 0.091. The posterior probabilities that Inspectors 1, 2, and 3 have the largest likelihood ratio are 0.891, 0.083, and 0.026, respectively. Thus, only Inspector 1 exceeds the 0.091 probability threshold and would be the only inspector placed in the superior set.
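The selection in this example can be reproduced from the reported probabilities alone. The sketch below applies the 1/(c + 1) threshold and also computes, as an illustrative sensitivity check not taken from the source, the smallest loss ratio at which a given inspector would enter the superior set. Function and variable names are hypothetical.

```python
def critical_probability(c):
    """Threshold 1 / (c + 1) for placing an inspector in the superior set."""
    return 1.0 / (c + 1.0)

def min_loss_ratio_to_include(p):
    """Smallest c for which probability p meets the threshold: p >= 1 / (c + 1)."""
    return 1.0 / p - 1.0

# Posterior probabilities of being best, as reported in the text.
posterior_prob_best = {1: 0.891, 2: 0.083, 3: 0.026}

threshold = critical_probability(10)
superior = [j for j, p in posterior_prob_best.items() if p >= threshold]
c_needed_for_2 = min_loss_ratio_to_include(posterior_prob_best[2])  # about 11.05
```

With c = 10 only Inspector 1 is selected. Inspector 2 falls just short of the threshold and would be admitted once c exceeds roughly 11, which shows how the choice of c controls the size of the superior set.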
5. A Simulation Study
We conducted a simulation study to determine the effectiveness of the subset selection procedure. We set the number of inspectors to be and the number of repeats to be . For , , , , and we generated a single set of ’s and ’s. The values for , and the corresponding likelihood ratios are presented in Table 2.
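For readers who want to run a similar study, the data-generating step can be sketched as follows. This is a minimal illustration, not the authors' code: the coding (1 for a good item or a pass) is an assumption, and the parameter values are hypothetical placeholders rather than the values used in the paper.

```python
import random

def simulate_inspections(N, n, pi, alpha, beta, rng):
    """Simulate latent states and inspection results.

    Returns (T, Y) where T[i] is 1 for a good item and Y[i][j][k] is 1 when
    inspector j passes item i on inspection k. alpha[j] and beta[j] are the
    false positive and false negative rates of inspector j (assumed coding).
    """
    T, Y = [], []
    for _ in range(N):
        t = 1 if rng.random() < pi else 0
        item = []
        for a_j, b_j in zip(alpha, beta):
            # A good item passes with probability 1 - beta_j, a bad one with alpha_j.
            p_pass = (1.0 - b_j) if t == 1 else a_j
            item.append([1 if rng.random() < p_pass else 0 for _ in range(n)])
        T.append(t)
        Y.append(item)
    return T, Y

rng = random.Random(2024)
T, Y = simulate_inspections(N=50, n=3, pi=0.8,
                            alpha=[0.05, 0.10, 0.02],   # hypothetical rates
                            beta=[0.10, 0.15, 0.05],    # hypothetical rates
                            rng=rng)
```

In a full replication, the inspector-level rates would themselves be drawn once from the random effects distribution, and only Y would be passed to the sampler since T is latent.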
The prior distributions used were
Figure 1. Posterior distributions of likelihood ratios.
Table 2. Misclassification parameters for simulation study.
Table 3. Simulation results for N = 50.
Table 4. Simulation results for N = 100.
Table 5. Simulation results for N = 200.
Thus, relatively non-informative priors were employed for all parameters. We considered sample sizes of N = 50, 100, and 200 and generated 1000 data sets for each sample size. We monitored the probability that each likelihood ratio is the largest and the 95% credible set of the rank for each likelihood ratio. These results are provided in Tables 3-5. For the decision theory problem we used  and, thus, also monitored whether the true “best” inspector was included in the superior set as well as the average size of the superior set. Because this paper focuses on the ranking and selection methods, those are the simulation results we report here. We also monitored posterior means and found that they were close to the true values, with small bias and interval coverage close to nominal for all parameters. The bias and coverage results are available upon request.
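One way to obtain credible sets for the ranks is to rank the likelihood ratios within each posterior draw and then summarize each inspector's ranks. The sketch below uses simple equal-tailed empirical quantiles, which is an assumed choice since the source does not specify how the sets were formed; the draws shown are toy values.

```python
import math

def rank_intervals(draws, level=0.95):
    """Equal-tailed intervals for each parameter's rank (1 = largest value)."""
    m = len(draws[0])
    ranks = [[] for _ in range(m)]
    for draw in draws:
        # Order the indices from largest to smallest value within this draw.
        order = sorted(range(m), key=lambda j: draw[j], reverse=True)
        for position, j in enumerate(order, start=1):
            ranks[j].append(position)
    tail = (1.0 - level) / 2.0
    intervals = []
    for r in ranks:
        r.sort()
        last = len(r) - 1
        lower = r[math.floor(tail * last)]
        upper = r[math.ceil((1.0 - tail) * last)]
        intervals.append((lower, upper))
    return intervals

# Toy draws for three inspectors, standing in for MCMC output.
draws = [(3.0, 2.0, 1.0), (2.5, 2.8, 0.9), (3.1, 1.9, 1.2), (2.9, 2.1, 1.0)]
intervals = rank_intervals(draws)
```

Because ranks are discrete, such intervals are typically conservative, and an interval such as (1, 2) indicates an inspector who is first or second in nearly every draw.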
For all simulations, Inspector 6, who was the “best” inspector, yielded the highest probability of having the largest likelihood ratio and, therefore, was correctly estimated to be the best inspector the most times. Also, the credible intervals on the ranks for Inspector 6 were consistently closest to the top rank. Conversely, Inspector 2, who was the “worst” inspector, produced the lowest probability of having the largest likelihood ratio and was correctly considered the worst inspector the most times. Inspector 2 also yielded credible intervals for the rank with the largest values, implying that this inspector was generally ranked last. Thus, both the ranking and selection procedures performed well.
For all three sample sizes considered, the probability of the “best” inspector being included in the superior set was greater than 0.9. The average size of the superior set was 2.8 for a sample size of 50, 2.4 for a sample size of 100, and 2.2 for a sample size of 200.
In this paper we have proposed a Bayesian random effects model for a binary measurement system. As shown in our real data example, combining the data with mildly informative priors yields an identifiable model in which inferences can be made on the overall classification rates along with comparisons of individual inspectors. Our simulation study shows that for moderate sample sizes, even when no prior information is available, the procedure works well, with the best inspector being included in the superior set more than 90% of the time.
The methods we have proposed could be extended to compare the overall defective rates and classification probabilities of manufacturing plants rather than of inspectors, as considered here. Extending the approach from binary to continuous measurements is also potentially of interest.