APM  Vol.9 No.3 , March 2019
Deconvolution of the Error Associated with Random Sampling
ABSTRACT
In this work, empirical models describing sampling error (Δ) are reported based upon analytical findings elicited from 3 common probability density functions (PDF): the Gaussian, representing any real-valued, randomly changing variable x of mean μ and standard deviation σ; the Poisson, representing counting data, i.e., any integral-valued entity's count of x (cells, clumps of cells or colony forming units, molecules, mutations, etc.) per tested volume, area, length of time, etc. with population mean of μ and $\sigma = \sqrt{\mu}$; and the binomial, representing the number of successful occurrences of something ($x^+$) out of n observations or sub-samplings. These data were generated in such a way as to simulate what should be observed in practice but avoid other forms of experimental error. Based upon analyses of $10^4$ Δ measurements, we show that the average Δ ($\bar{\Delta}$) is proportional to $\sigma_{\bar{x}}\mu^{-1}$ (Gaussian) or $(n\mu)^{-1/2}$ (Poisson & binomial). The average proportionality constants associated with these disparate populations were also nearly identical. However, since $\sigma = \sqrt{\mu}$ for any Poisson process, $(n\mu)^{-1/2} = \sigma_{\bar{x}}\mu^{-1}$. In a similar vein, we have empirically demonstrated that binomial-associated $\bar{\Delta}$ were also proportional to $\sigma_{\bar{x}}\mu^{-1}$. Furthermore, we established that, when all $\bar{\Delta}$ were plotted against either $(n\mu)^{-1/2}$ or $\sigma_{\bar{x}}\mu^{-1}$, there was only one relationship with a slope = A (0.767 ± 0.0990) and a near-zero intercept. This latter finding also argues that all $\bar{\Delta}$, regardless of parent PDF, are proportional to $\sigma_{\bar{x}}\mu^{-1}$, which is the coefficient of variation for a population of sample means ($CV[\bar{x}]$). Lastly, we establish that the proportionality constant A is equivalent to the coefficient of variation associated with Δ ($CV[\Delta_j]$) measurement and, therefore, $\bar{\Delta} = CV[\Delta_j] \times CV[\bar{x}]$. These results are noteworthy inasmuch as they provide a straightforward empirical link between stochastic sampling error and the aforementioned CVs.
Finally, we demonstrate that all attendant empirical measures of Δ are reasonably small when an environmental microbiome was well-sampled: n = 16 - 18 observations with μ ~ 3 isolates per observation. These colony counting results were supported by the fact that the two major isolates' relative abundance was reproducible in the four most probable composition observations from one common population.

1. Introduction

There are various analytical procedures for enumerating organisms in environmental samples which diverge in their experimental approach yet are mathematically inter-related. Thus, if V represents the sample volume and $V_e$ the volume occupied by a test entity of interest (e.g., colony forming units or CFUs), the probability that one particular $V_e$ will not contain this entity at concentration δ [1] is

$$\left(\frac{V/V_e - V\delta}{V/V_e}\right) = \left(1 - V_e\delta\right);$$

i.e., $V/V_e$ is the maximum possible number of entities in V and $V\delta$ is approximately the actual number of objects present.

Assuming that many $V_e$ aliquots have been combined to generate V, the probability that no organism will be contained in V is [1]

$$P^- = \left[1 - V_e\delta\right]^{V/V_e}$$

therefore

$$\ln[P^-] = \frac{V}{V_e}\ln\left[1 - V_e\delta\right].$$

Since

$$\ln[1 - \psi] \sim -\psi - \frac{\psi^2}{2} - \frac{\psi^3}{3} - \frac{\psi^4}{4} - \cdots$$

then, if $\psi = V_e\delta$,

$$\ln[P^-] \sim \frac{V}{V_e}\left(-V_e\delta - \frac{V_e^2\delta^2}{2} - \frac{V_e^3\delta^3}{3} - \frac{V_e^4\delta^4}{4} - \cdots\right) \sim -V\delta\left(1 + \frac{V_e\delta}{2} + \frac{V_e^2\delta^2}{3} + \frac{V_e^3\delta^3}{4} + \cdots\right).$$

For $V_e \to 0$ (e.g., E. coli [2] has a $V_e \sim 0.6\ \mu\text{m}^3 \sim 6 \times 10^{-13}$ mL),

$$\ln[P^-] \sim -V\delta$$

$$P^- = \exp[-V\delta] = \exp[-\mu]$$

therefore

$$P^+ = 1 - P^- = 1 - \exp[-V\delta] = 1 - \exp[-\mu]. \qquad (1)$$
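The limiting step that leads to Equation (1) can be checked numerically. The following is only a minimal sketch: the values chosen for V and δ are illustrative assumptions, and the function names are hypothetical.

```python
import math

def p_negative_exact(V, V_e, delta):
    """Probability that volume V holds no entity: [1 - V_e*delta]**(V/V_e)."""
    # log1p keeps precision when V_e*delta is extremely small
    return math.exp((V / V_e) * math.log1p(-V_e * delta))

def p_positive(V, delta):
    """Equation (1): P+ = 1 - exp(-V*delta)."""
    return 1.0 - math.exp(-V * delta)

V = 4.0        # mL tested (assumed for illustration)
delta = 0.35   # entities per mL (assumed for illustration)
for V_e in (1e-1, 1e-3, 6e-13):
    exact = p_negative_exact(V, V_e, delta)
    limit = math.exp(-V * delta)
    print(f"V_e = {V_e:g} mL: exact P- = {exact:.6f}, exp(-V*delta) = {limit:.6f}")
```

As $V_e$ shrinks toward the E. coli value, the exact expression and the exponential limit become indistinguishable at the printed precision.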

In certain circumstances it is only possible to determine an organism's δ by diluting the sample to such an extent that only a fraction of the n "technical" replicates tested are positive ($x^+$) for the presence of the entity, or microbe, in question [3] [4]. This technique is referred to as the "dilution method" [1] since it involves diluting a test sample's content to extinction ($\delta \to 0$). This enumeration protocol is also known as the most probable number (MPN) method and entails sampling from a liquid source, making serial dilutions from this, distributing an aliquot of each of these dilutions into separate receptacles, incubating these under suitable growth conditions, and observing if any growth has occurred based upon some organism-specific detection method [5] [6]. The MPN enumeration procedure is particularly useful when sampling from environmental sources, such as foods, since damaged cells frequently recover in liquid media [7].

For example, were one to obtain a food sample containing ~14 CFU of a particular organism per 50 g, the cells would typically be washed from the food matrix, concentrated to a few mL (e.g., via centrifugation), and brought up to some appropriate volume (say 40 mL = $V_{sample}$) with media [5]. From this, eight 4 mL (V) samples could be randomly selected and distributed into 8 separate receptacles (n = 8 with a dilution factor of 1; i.e., undiluted). Of the remaining 8 mL, 4 could be further diluted with 36 mL (40 mL total) liquid media, mixed and distributed into another set of 8 containers. This set of dilutions has a dilution factor of 0.1 relative to the original. With the remaining 8 mL from the 0.1 dilution, 4 mL could be diluted again with 36 mL media, mixed and distributed into yet another eight 4 mL replicates (dilution factor = 0.01). After incubation, the most likely number (Equation (2), below) of positive occurrences (e.g., presence of a specific gene [5]) observed would be $x^+$ = 6, 1, and 0 (out of n = 8 observations per dilution) for dilution factors of 1, 0.1, and 0.01, respectively, and the calculated MPN (±s) per 50 g sample would be 13.8 ± 5.56. Note the relatively large error term. For a 4-fold proportional (200 g, 160 mL $V_{sample}$) experiment with n = 32, the calculated MPN is 13.8 ± 2.78 per 50 g sample.
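The most likely numbers of positives quoted in this example can be reproduced from Equation (1). This is a minimal sketch; the function name is hypothetical, and it assumes that the ~14 CFU end up evenly suspended in the 40 mL volume.

```python
import math

def expected_positives(n, V, delta, dilution):
    """n * P+ rounded half-up to an integer, mimicking Excel's ROUND(n*P+, 0)."""
    p_pos = 1.0 - math.exp(-V * delta * dilution)
    return math.floor(n * p_pos + 0.5)

delta = 14 / 40.0   # entities per mL: ~14 CFU suspended in 40 mL (assumption)
counts = [expected_positives(8, 4.0, delta, d) for d in (1.0, 0.1, 0.01)]
print(counts)       # prints [6, 1, 0]
```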

For MPN-based organism detection and subsequent enumeration, the number of positive occurrences of growth in any j-th experiment out of n observations, $x_j^+ = \sum_{i=1}^{n}\theta_{ij}$ (θ = either 1 [presence] or 0 [absence]), can be estimated as

$$x^+ \sim nP^+ = n\left(1 - \exp[-V\delta]\right) \qquad (2)$$

whereupon $x^+$ is integral (=ROUND($nP^+$, 0) in Excel). The probability of observing $x^+$ successes out of n Bernoulli trials [8] each of volume V from a population of δ entities per V is

$$P_b = \frac{n!}{x^+!\,(n - x^+)!}\,(P^-)^{n - x^+}\,(P^+)^{x^+}$$

which is also known as the binomial PDF. Since $nP^+$ = the population average (real) [9] number of positive responses out of n tests ($\mu^+$), the above can also be written as

$$P_b = \frac{n!}{x^+!\,(n - x^+)!}\left(1 - \frac{\mu^+}{n}\right)^{n - x^+}\left(\frac{\mu^+}{n}\right)^{x^+}. \qquad (3)$$

The multiple dilution MPN calculation itself is determined by finding the value of δ at the maximum in the product of the $P_b$ values from all l-th dilutions ($\prod_l P_{b,l}$) and is easily achieved by adding the scaled sum of all dilutions' $\partial_\delta P_b \div P_b$ values to an initial guess for δ, i.e.,

$$\delta_{m+1} = \delta_m + \lambda_m \times \sum_l \left\{\partial_\delta P_{b,l} \div P_{b,l}\right\}_m = \delta_m + \lambda_m \times \sum_l \left\{\left(x_l^+ - n + \left(x_l^+ \div \left(\exp[V\delta_m 0.1^l] - 1\right)\right)\right) V\, 0.1^l\right\}$$

for any particular l-th one-to-ten dilution and m iterations (λ is a scaling function that changes monotonically with m), then solving for the MPN recursively [1] [4] [5] [10], which minimizes the summation.
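The maximum of the likelihood product sits where the summed $\partial_\delta P_b \div P_b$ terms vanish. The sketch below finds that root by plain bisection, which is a swapped-in technique and not the scaled iteration with $\lambda_m$ described above. Function names, bracket limits, and tolerance are assumptions made for illustration.

```python
import math

def score(delta, x_pos, n, V, dilutions):
    """Sum over dilutions of d(ln P_b)/d(delta)."""
    total = 0.0
    for x, d in zip(x_pos, dilutions):
        t = V * delta * d
        # x / (exp(t) - 1), written so that large t cannot overflow
        ratio = x * math.exp(-t) / (-math.expm1(-t))
        total += (x - n + ratio) * V * d
    return total

def mpn(x_pos, n, V, dilutions, lo=1e-9, hi=1e3, tol=1e-12):
    """Bisection on the score; assumes some x > 0 and some x < n."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid, x_pos, n, V, dilutions) > 0:
            lo = mid      # likelihood still rising: root lies above mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Single dilution: the root has the closed form ln[n / (n - x+)] / V
print(mpn([6], 8, 4.0, [1.0]), math.log(8 / 2) / 4.0)
```

Calling `mpn([6, 1, 0], 8, 4.0, [1.0, 0.1, 0.01]) * 40` scales the estimate for a three-dilution data set to entities per 40 mL suspension; the digits obtained depend on conventions (for example how $x^+$ is rounded) that this sketch assumes rather than takes from the text.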

At the limit n → ∞, Equation (3) simplifies to what is known as the Poisson PDF

$$P_P = \frac{\mu^x \exp[-\mu]}{x!}. \qquad (4)$$

Under these circumstances, x is the observed and μ is the population average number of counts in/on the tested volume, surface, chosen time period, etc. This PDF is applicable to all analytical systems involving, essentially, the counting of objects. However this PDF is applied, the most conspicuous aspect [11] [12] of any Poisson process is that the variance ($\sigma^2$ or second moment)

$$\sigma^2 = \sum_{x=0}^{\infty}(x - \mu)^2 P_P = \mu$$

equals the population mean (μ or first moment)

$$\mu = \sum_{x=0}^{\infty} x\,P_P.$$
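The equality of the first moment and the variance can be confirmed by summing the series directly. This is a small sketch with hypothetical function names; the truncation point `x_max` is an assumption that is ample for the μ values used in this work.

```python
import math

def poisson_pmf(x, mu):
    """Equation (4), evaluated in log space to avoid overflow at large x."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def poisson_moments(mu, x_max=200):
    """Truncated sums for the mean and the variance."""
    mean = sum(x * poisson_pmf(x, mu) for x in range(x_max))
    var = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in range(x_max))
    return mean, var

for mu in (1.0, 4.0, 16.0):
    mean, var = poisson_moments(mu)
    print(f"mu = {mu:g}: mean = {mean:.6f}, variance = {var:.6f}")
```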

The last probability density function utilized in this stochastic sampling exercise is also related to $P_b$, Equation (3). This is the Gaussian PDF, which we use to quantitatively examine the effects of n and σ (fixed μ) on the variability of sample means ($\bar{x}$) which have been created by randomly sampling from a population of real-valued variables (x; e.g., doubling time [13]) which are normally distributed as

$$P_G = \frac{\text{Area}}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]; \qquad (5)$$

in this relationship the Area term (~ $\Delta x \sum_{k=1}^{K} f_k$; for large K) is the approximate area under the fitting function f (frequently taken to be 1 since Δx is often = 1 and f is always ~1). There are several derivations of $P_G$ but none are as persuasive as the fact that this PDF is simple and has been experimentally shown to be the most likely probability distribution associated with most experimental observations [9] [12].

The original purpose of our sampling-related investigations [7] was to estimate a nominal value for n needed to achieve accurate most probable foodborne bacterial isolate enumeration, combined with 16S rDNA-based identification, for quantitative metagenomic purposes. The relationships were developed by examining the results of 6 × 6 colony counting (Poisson PDF) of highly diluted bacteria [14] [15] as a function of n and μ as well as by generating counts (x) derived from $P_P$ to simulate what occurred in the lab [15] [16] but which avoided other forms of experimentally based error [5]. We were able to establish that $n_{min} = n_{\mu 1} \div \sqrt[3]{\mu}$, where $n_{\mu 1}$ is the number of observations necessary to accurately enumerate a population average of 1 count per volume tested. Based mainly on colony counting experience, we estimate $n_{\mu 1}$ is somewhere in the range n ~ 20 - 30 observations.

Herein we model stochastic sampling errors associated with all the aforementioned PDFs and empirically demonstrate that the resultant mathematical models are, in part, a consequence of the "central limit theorem" [17] (CLT). In general, the CLT states that a distribution of sample means ($\bar{x}$), regardless of parent PDF, approaches a normal distribution analytically equivalent to $P_G$, Equation (5), with $x = \bar{x}$, $\mu = \mu_{\bar{x}}$, and with the $\sigma^2$ term = $\sigma_{\bar{x}}^2$ (= $\sigma^2 \div n$) as the number of separate n-samplings increases. We also have elaborated on empirical findings developed previously [5] [15] [16] for predicting errors associated with the random sampling of microorganisms as well as comparing the internal variations associated with the three different sampling error data types derived from the Gaussian, binomial (MPN), and Poisson relationships. Thus, new results have been created using the aforementioned probability distributions, Equations (2), (4), and (5), and have been highly replicated since each "experiment", comprising n (= 3, 6, 9, 12, or 24) observations, was repeated 100 times.

2. Materials and Methods

2.1. Poisson-Based Data: Equation (4), Figure 1

All counting data were created by multiplying Equation (4) by 360 in order to produce a large number of integral-valued repeats (=ROUND(360 $P_P$, 0)) for any particular count x: e.g., for μ = 1 particle per test volume, area, length of time, etc., there would be, most probably, 132 repeats of x = 0, 132 repeats of x = 1, 66 repeats of x = 2, 22 repeats of x = 3, 6 repeats of x = 4 and 1 repeat of x = 5 entities per test. From this pool of 360 counts for each μ, an n number of x values were randomly selected based upon random number tables created with Mathematica,

Table[i = Random[Integer, {1, 360}], {i, n}] (6)

which generates n random numbers between 1 and 360. Thus, 100 such random number sets were utilized for the twenty-five n (= 3, 6, 9, 12, 24) × μ (= 1, 2, 4, 8, 16) combinations. Briefly, each procedure involved arranging the aforementioned 360 x values (one set for each μ) in one column of a spreadsheet followed by filling in n adjacent columns with formulae which refer to the calculated x values but where each row's reference number was taken from the Mathematica-generated random number, Equation (6), next in sequence. MPN- and Gaussian-based data arrays were treated in an identical fashion.

Figure 1. (A) Relationship of average $\Delta_j$ ($\bar{\Delta}$) for Poisson-based data using Equation (7) (P-I: black symbols and curves) or Equation (8) (P-II: red symbols and curves) as a function of n (= 3, 6, 9, 12, 24) and various values for μ (= 1, 2, 4, 8, 16). Gauss-Newton least squares minimization-based curve-fitting [18] of data was performed [19] to fit to the equation $\bar{\Delta} = \Delta_n n^{\bar{a}}$ (averages for a are provided ± s; averaged across 5× μ). (B) Non-linear relationship of individual $\Delta_n$ values from (A) for P-I- and P-II-based data as a function of μ, whereupon curve-fitting of data was also performed using the algebraic form $\Delta_n = A\mu^a$ (values for A and a are provided ± ASE). (C) and (D) present linearized forms ($X = n^{-1/2}$ in (C) and $X = \mu^{-1/2}$ in (D)) of data reported in Figure 1(A) and Figure 1(B) based upon all values of a = −1/2. Slopes of the lines in Figure 1(C) and Figure 1(D) are equivalent to $\Delta_n$ and A, respectively.

The formula (P-I: normalized deviations of $s_j$ from $\sigma = \sqrt{\mu}$) for calculating our empirical measure of Poisson stochastic sampling error (Δ) was

$$\Delta_j = \frac{\left|\sqrt{\mu} - s_j\right|}{\mu} \qquad (7)$$

whereupon the $s_j$ term is the experimental standard deviation ($\sqrt{(n-1)^{-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$ or "=STDEV.S($x_{ij}$-array)" in Excel) for each j-th ($j = 1, 2, \ldots, J$; J = 100) experiment and i-th ($i = 1, 2, \ldots, n$) x. The average across 100 experiments, regardless of formulation, was symbolized as $\bar{\Delta}$ (= $J^{-1}\sum_{j=1}^{J}\Delta_j$ or "=AVERAGE($\Delta_j$-array)"). A second form for the Poisson-based measure of Δ was also calculated (P-II: normalized deviations of $\bar{x}_j$ from known μ) from these same data

$$\Delta_j = \frac{\left|\mu - \bar{x}_j\right|}{\mu}. \qquad (8)$$

Here $\bar{x}_j$ is the observed arithmetic mean for each j-th counting experiment.
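The pool construction and the two Poisson error measures can be sketched outside a spreadsheet. In the sketch below, Python's `random` module stands in for the Mathematica random number tables, the P-I measure is taken as $|\sqrt{\mu} - s_j| \div \mu$ and the P-II measure as $|\mu - \bar{x}_j| \div \mu$, and all function names and the seed are assumptions.

```python
import math
import random
import statistics

def poisson_pool(mu, total=360):
    """Each count x repeated ROUND(total * P_P(x), 0) times (half-up rounding)."""
    pool, x = [], 0
    while True:
        p = math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))
        repeats = math.floor(total * p + 0.5)
        if repeats == 0 and x > mu:
            break                      # upper tail has rounded away to zero
        pool.extend([x] * repeats)
        x += 1
    return pool                        # rounding may leave slightly fewer than `total`

def mean_delta(mu, n, experiments=100, seed=0):
    """Average P-I and P-II errors over repeated n-samplings of the pool."""
    rng = random.Random(seed)
    pool = poisson_pool(mu)
    d1, d2 = [], []
    for _ in range(experiments):
        xs = [rng.choice(pool) for _ in range(n)]   # sampling with replacement
        d1.append(abs(math.sqrt(mu) - statistics.stdev(xs)) / mu)   # P-I
        d2.append(abs(mu - statistics.fmean(xs)) / mu)              # P-II
    return statistics.fmean(d1), statistics.fmean(d2)

for n in (3, 6, 9, 12, 24):
    print(n, mean_delta(1, n))
```

Both averages should shrink as n grows; the exact figures depend on the seed, so no particular output is asserted here.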

2.2. MPN Experiments: Equation (1), Figure 2

All MPN data were created by multiplying Equation (1) by 360 to produce the number ("=ROUND(360 $P^+$, 0)") of positive responses (θ = 1) for any particular level of $V\delta$ (= μ); e.g., for μ = 0.1 entity per volume tested there would be 34 repeats of θ = 1 and 326 repeats of θ = 0. From such a column of 360 θ values (one column for each μ), n were randomly selected based upon Mathematica tables, Equation (6), and treated similarly to the Poisson data above. Thus, for each combination of n (= 3, 6, 9, 12, or 24) × μ (= 0.1, 0.2, 0.4, 0.8, 1.6), 100 random n-selections were performed.

Figure 2. (A) Relationship of average $\Delta_j$ ($\bar{\Delta}$) for MPN-based data using Equation (9) as a function of n (= 3, 6, 9, 12, or 24) and variable μ (= 0.1, 0.2, 0.4, 0.8, 1.6). Gauss-Newton least squares minimization-based curve-fitting [18] of data was performed [19] to fit the algebraic form $\bar{\Delta} = \Delta_n n^{\bar{a}}$ (averages for a are provided ± s; averaged across 5× μ) to these results. (B) Relationship of individual $\Delta_n$ values from (A) for MPN-based data as a function of μ, where curve-fitting of data was also performed to the algebraic form $\Delta_n = A\mu^a$ (values for A and a are provided ± ASE). (C) and (D) represent linearized forms ($X = n^{-1/2}$ in (C) and $X = \mu^{-1/2}$ in (D)) of data reported in Figure 2(A) and Figure 2(B) based upon the assumption that a = −1/2. Slopes of the lines in Figure 2(C) and Figure 2(D) are equivalent to $\Delta_n$ and A, respectively.

The formula for calculating our empirical measure of MPN sampling error was

$$\Delta_j = \frac{\left|nP^+ - \sum_{i=1}^{n}\theta_{ij}\right|}{nP^+} = \frac{\left|\mu^+ - x_j^+\right|}{\mu^+}; \qquad (9)$$

where θ = either a "1" (a positive occurrence) or a "0" (a negative occurrence). As before, the average $\Delta_j$ across J = 100 experiments (each of n observations) = $\bar{\Delta}$. The MPN value $\bar{x}_j^+ = \ln[n \div (n - x_j^+)]$ provides the average MPN or CFU per sample; it is a rearrangement of Equation (2).
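A minimal sketch of the MPN pool and of the error measure in Equation (9) follows. Function names are hypothetical, and returning infinity when every replicate is positive is a choice made here, not something stated in the text.

```python
import math

def mpn_pool(mu, total=360):
    """ROUND(total * P+, 0) ones followed by zeros, with P+ = 1 - exp(-mu)."""
    positives = math.floor(total * (1.0 - math.exp(-mu)) + 0.5)
    return [1] * positives + [0] * (total - positives)

def mpn_delta(theta, mu):
    """Equation (9): |n*P+ - sum(theta)| / (n*P+) for one n-sampling."""
    n = len(theta)
    mu_pos = n * (1.0 - math.exp(-mu))
    return abs(mu_pos - sum(theta)) / mu_pos

def mpn_per_sample(x_pos, n):
    """ln[n / (n - x+)]; not finite when all n replicates are positive."""
    if x_pos >= n:
        return math.inf
    return math.log(n / (n - x_pos))

pool = mpn_pool(0.1)
print(sum(pool), len(pool) - sum(pool))   # prints 34 326
```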

2.3. Gaussian-Based Data: Equation (5), Figure 3

All Gaussian PDF data were produced by multiplying Equation (5) (Δx = 1) by 360, producing an integral number of observations ("=ROUND(360 $P_G$, 0)") for each value of x as a function of μ (fixed at 20) and σ (= 1, 1.5, 2, 3, 4). For instance, for σ = 1 there would be 2 repeats of x = 17, 19 repeats of x = 18, 87 repeats of x = 19, 144 repeats of x = 20, 87 repeats of x = 21, 19 repeats of x = 22, and 2 repeats of x = 23. From this column of 360 values of x, n (= 3, 6, 9, 12, or 24) were randomly selected based upon Equation (6) and treated identically to the Poisson and MPN data sets. Thus, for each combination of n × σ, 100 n-based selections were performed.

Figure 3. (A) Relationship of average $\Delta_j$ ($\bar{\Delta}$) for Gaussian-based data using Equation (10) as a function of n with variable σ (= 1, 1.5, 2, 3, 4; μ = 20). Gauss-Newton least squares minimization-based curve-fitting [18] of data was performed [19] to fit the equation $\bar{\Delta} = \Delta_n n^{\bar{a}}$ (averages for a are provided ± s; averaged across 5× σ) to these results. (B) and (D) Relationship of individual $\Delta_n$ values from (A) and (C) for Gaussian-based data as a function of μ-normalized standard deviations ($X = \sigma \div \mu$). Linear regression-based fitting of data was performed to the algebraic form $A\sigma \div \mu$. Figure 3(C): linearized forms ($X = n^{-1/2}$) of data reported in Figure 3(A) based on a = −1/2. Slopes of the lines in Figure 3(C) are equivalent to $\Delta_n$ and plotted in Figure 3(D).

The formula for calculating our empirical measure of Gaussian sampling error, similar to Equation (7), was

$$\Delta_j = \frac{\left|\sigma - s_j\right|}{\mu}. \qquad (10)$$

As usual, the average $\Delta_j$ across J = 100 such sets of experiments each of n observations = $\bar{\Delta}$.
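The Gaussian pool can be rebuilt the same way. The sketch below assumes the usual normalization $1 \div (\sigma\sqrt{2\pi})$ with Area = 1 and Δx = 1, which is what reproduces the repeat counts listed for σ = 1; function names and the ±6σ scan range are assumptions.

```python
import math
import statistics

def gaussian_pool_counts(mu, sigma, total=360):
    """Map each integer x to ROUND(total * P_G(x), 0), dropping zero counts."""
    counts = {}
    lo, hi = math.floor(mu - 6 * sigma), math.ceil(mu + 6 * sigma)
    for x in range(lo, hi + 1):
        p = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        repeats = math.floor(total * p + 0.5)
        if repeats:
            counts[x] = repeats
    return counts

def gaussian_delta(xs, mu, sigma):
    """|sigma - s_j| / mu for one n-sampling xs."""
    return abs(sigma - statistics.stdev(xs)) / mu

print(gaussian_pool_counts(20, 1))
# prints {17: 2, 18: 19, 19: 87, 20: 144, 21: 87, 22: 19, 23: 2}
```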

2.4. Other Calculations

All curve-fitting was based upon a modified Gauss-Newton algorithm by least squares [18] minimization performed on a Microsoft Excel spreadsheet [19]; some of these results were fit to the algebraic form $f[X] = \text{constant} \times X^a$. However, certain MPN data ($\bar{x}^+$ and $x^+$) were also fit to a Gaussian (Equation (5): $P_G[\bar{x}^+]$ or $P_G[x^+]$) with Δx used as one of the parameters to be iteratively resolved (i.e., deconvolved). Where appropriate, confidence limits (CL) have been calculated using an approach applicable to any hypothetical fitting function $f_k = f[X_k; \pi_p]$: $k = 1, 2, \ldots, K$ rows of the observed X-Y data sets with up to P (typically ≤ 3) fitting parameters $\pi_p$ ($p = 1, 2, \ldots, P$). In this procedure we use the propagation of error method [9] [20] for estimating the standard error associated with each $f_k$ ($s_{f_k}$; illustrated below for P = 2 fitting parameters) data point

$$CL = t\, s_{f_k} = t\sqrt{s_{\pi_1}^2\left[\partial_{\pi_1} f_k\right]^2 + s_{\pi_2}^2\left[\partial_{\pi_2} f_k\right]^2 + 2 s_{\pi_1\pi_2}^2\, \partial_{\pi_1} f_k\, \partial_{\pi_2} f_k}$$

where, for any particular fitting parameter ω, $s_\omega = \sqrt{s_Y^2\left[Z^T Z\right]^{-1}_{\omega\omega}}$ = "asymptotic standard error" [19] (ASE; $s_Y^2$ = residual sum of squares ÷ [K − P]), and the $\partial_{\pi_p} f_k$ terms symbolize $\partial f_k / \partial \pi_p$. The above equation simplifies to

$$CL = t_{0.01}\, s_{f_k} = t_{0.01}\sqrt{s_Y^2\left(Z_k\left[Z^T Z\right]^{-1} Z_k^T\right)}.$$

In all the above relationships Z is the partial first derivative matrix of $f_k$ with respect to the parameters $\pi_1$ and $\pi_2$ (i.e., a 2-parameter fit) such that

$$Z = \begin{bmatrix} \partial_{\pi_1} f_1 & \partial_{\pi_2} f_1 \\ \partial_{\pi_1} f_2 & \partial_{\pi_2} f_2 \\ \vdots & \vdots \\ \partial_{\pi_1} f_K & \partial_{\pi_2} f_K \end{bmatrix},$$

$Z^T$ is the transpose of Z, $Z_k = \left[\partial_{\pi_1} f_k \;\; \partial_{\pi_2} f_k\right]$ (K row vectors), and $s_Y^2\left[Z^T Z\right]^{-1}$ is the variance-covariance matrix [21]. CL were not used for all results since they might have muddled analytical aspects of the compositions.
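For a 2-parameter power law $f = A X^a$, the simplified confidence-limit expression can be evaluated with an explicit 2 × 2 inverse. This sketch assumes the fitted A and a are already in hand and that the caller supplies the Student's t value; the function name is hypothetical, and it is not the spreadsheet procedure of [19].

```python
import math

def power_law_cl(X, Y, A, a, t):
    """t * sqrt(s_Y^2 * Z_k [Z^T Z]^-1 Z_k^T) for each row k of f = A * X**a."""
    K, P = len(X), 2
    # Row k of Z holds the partial derivatives (df/dA, df/da) at X_k
    Z = [(x ** a, A * x ** a * math.log(x)) for x in X]
    resid_ss = sum((y - A * x ** a) ** 2 for x, y in zip(X, Y))
    s_y2 = resid_ss / (K - P)
    # Z^T Z is a symmetric 2 x 2 matrix, inverted explicitly
    m11 = sum(z1 * z1 for z1, _ in Z)
    m12 = sum(z1 * z2 for z1, z2 in Z)
    m22 = sum(z2 * z2 for _, z2 in Z)
    det = m11 * m22 - m12 * m12
    i11, i12, i22 = m22 / det, -m12 / det, m11 / det
    limits = []
    for z1, z2 in Z:
        quad = z1 * z1 * i11 + 2 * z1 * z2 * i12 + z2 * z2 * i22
        limits.append(t * math.sqrt(s_y2 * max(quad, 0.0)))
    return limits

X = [3, 6, 9, 12, 24]
Y = [0.47, 0.32, 0.27, 0.22, 0.17]             # made-up values for illustration
print(power_law_cl(X, Y, A=0.8, a=-0.5, t=3.182))   # t assumed by the caller
```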

2.5. Microbiome Sampling Data

For the food microbiome sampling experiment ~25 g of commercial, pre-thawed (~15 min at room temperature), frozen vegetables were washed with a volume of phosphate buffered saline (PBS; 10 mM Na2HPO4 + 2 mM NaH2PO4 + 137 mM NaCl; pH 7.4 ± 0.2; Boston BioProducts, 159 Chestnut Street, Ashland, MA 01721) equivalent to double the mass of the sample. In order to assist in the detachment of plant tissue-bound cells, 0.075% [w/v] Tween-20 (Sigma-Aldrich, 3050 Spruce St., St. Louis, MO 63103) was added to the PBS and filter sterilized. All washing was performed in sanitized plastic zip-lock bags wherein the formerly frozen vegetables and buffer wash were gently agitated at 80 rpm for approximately 20 min and immediately passed through a 40 μm nylon filter (BD Falcon; Becton Dickinson Biosciences, Bedford, MA) to remove large particles.

Directly sampled washes (5 mL Control = Observation I [cultured at 30˚C] and III [cultured at 37˚C]) as well as hollow fiber microfilter-concentrated (each 5 mL sample was diluted to ~100 mL PBS + Tween, concentrated, then washed with another 100 mL buffer, and eluted with ~5 mL PBS + Tween = Observation II [cultured at 30˚C] and IV [cultured at 37˚C]) samples were collected and enumerated using the 6 × 6 drop plate method [14] but using 1:2 serial dilutions for colony selection on Brain Heart Infusion agar (BHI + 2% [w/v] agar). Briefly, this drop plate method involved loading 400 μL of each wash (either control or concentrated samples brought back to the control sample's original volume = 5 mL) filtrate into the first well (row A) of a 96-well microtiter plate. Two-fold serial dilutions were made by transferring 200 μL (multichannel pipette, Rainin, Emeryville, CA) from the first row (row A; dilution 0) into 200 μL of diluent (PBS) in the 2nd row (row B; dilution 1), mixing 10 times while continuously stirring, and repeating the process until five 1:2 dilutions were produced; pipette tips were changed between dilutions. Based on a previous analysis of 6 × 6 drop plate sampling error [15], we sampled n = 16 - 18 volumes of 7 μL from each of the 6 dilutions (dilutions 0 - 5; overall dilution factors of $0.5^0 = 1$ to $0.5^5 = 0.03125$) and drop-plated these onto BHI agar media using a multichannel pipette. After plating, the droplets were allowed to dry, and the plates were inverted and then incubated at two temperatures (either 30˚C or 37˚C; 3 plates for each temperature and treatment combination). Colonies were counted after 16 - 24 hours. Colony collection for our 16S rDNA bacterial identification protocol [7] involved selecting all colonies from dilution 2 ($0.5^2 = 0.25$ dilution; $\bar{x}$ = 2.79 ± 1.52 colonies per drop; ± s; the fact that $\sqrt{\bar{x}} = 1.67 \sim s$ might argue for an appropriately sampled population).

Each colony (n total) was carefully removed from the agar plate’s surface using a Rainin L20 tip, dispersed into 200 μL BHI in a 96-well plate and incubated at 30˚C for 16 - 24 hours. These cultures were restreaked onto solid media and incubated at 30˚C overnight. One colony from each of the original n plates was selected, suspended into 25 μL of Ultra PrepMan (Applied Biosystems, Foster City, CA) in a PCR tube and heated in a thermocycler at 99˚C for 15 min. Upon cooling, samples were centrifuged 10 min. to separate the DNA solution from the cell debris. A sample of supernatant was transferred to a new tube for the DNA amplification step (end-point PCR). Once the 16S rRNA “gene” amplification, sequencing reactions (EubA and EubB primers) and Sanger sequencing were performed, DNA sequences were edited, and contigs assembled using Sequencher software as explained in detail previously [7].

3. Results and Discussion

Figure 1 shows results related to averages of 100 $\Delta_j$ values ($\bar{\Delta}$) derived from Equations (7) (P-I, black data) or (8) (P-II, red data) as a function of n (Figure 1(A)) and μ (Figure 1(B)). The least squares curve-fitting results show that the Figure 1(A) data follow the general form $\bar{\Delta} = \Delta_n n^{\bar{a}}$ whereupon $\bar{a}$ (averaged across 5 n-based fits) = −0.556 ± 0.00986 (black data sets; ±s) or $\bar{a}$ = −0.529 ± 0.0387 (red data). These findings suggest that $\bar{\Delta}$ changes as the inverse square root of n for all values of μ. Figure 1(C) displays these same results on a linearized scale (X-axis = $n^{-1/2}$) whereupon the slopes ($\partial\bar{\Delta}/\partial[n^{-1/2}]$) ~ $\Delta_n$. Figure 1(B) illustrates that the $\Delta_n$ values derived from Figure 1(A) non-linear regression change as the inverse square root of μ: i.e., $\Delta_n = A\mu^a$ where a = −0.547 ± 0.0179 (black data) or −0.503 ± 0.0374 (red data); a ± ASE. Figure 1(D) shows Figure 1(B) results plotted on an appropriately linearized scale (X-axis = $\mu^{-1/2}$) as indicated by the above analysis whereupon the slope ($\partial\Delta_n/\partial[\mu^{-1/2}]$) ~ A. Combining results from Figure 1(A) and Figure 1(B) we see that $\bar{\Delta} \sim A(n\mu)^{-1/2}$. The average value for A was 0.804 ± 0.0460 (P-I & P-II curve-fitting results ±s).

Figure 2 displays MPN-based enumeration data, Equation (9), manipulated in a similar fashion to the above Poisson-based results, with a nearly identical outcome. The least squares curve-fitting shows that the data in Figure 2(A) once again follow the general form $\bar{\Delta} = \Delta_n n^{\bar{a}}$ with $\bar{a}$ = −0.554 ± 0.0499 (±s), which is the average a from 5× μ-based data sets. Figure 2(C) shows these same findings graphed on a linearized scale ($X = n^{-1/2}$) whereupon the slopes = $\Delta_n$. Figure 2(B) also shows that the $\Delta_n$ values, derived from Figure 2(A) non-linear regression, change as the inverse square root of μ: $\Delta_n = A\mu^a$ where a = −0.515 ± 0.0910 (±ASE). As previously observed, when these results are presented on a linearized scale ($X = \mu^{-1/2}$; Figure 2(D)) the slope is equivalent to the parameter A. Combining fitting results from Figure 2(A) and Figure 2(B) we again note that $\bar{\Delta} \sim A(n\mu)^{-1/2}$ (A = 0.807 ± 0.139; ± ASE).

Completely homologous relationships to the Poisson and MPN findings were also noted with Gaussian-based data (Figure 3), whereupon the least squares curve-fitting in Figure 3(A) shows that these data obey, again, the general form $\bar{\Delta} = \Delta_n n^{\bar{a}}$ whereupon $\bar{a}$ = −0.561 ± 0.0276 (±s; averaged across all σ since μ was fixed). Figure 3(C) has these same findings plotted on a linear scale ($X = n^{-1/2}$) where the slopes = $\Delta_n$. Figure 3(B) and Figure 3(D) also show that the $\Delta_n$ values derived from Figure 3(A) and Figure 3(C) regression change linearly with $\sigma \div \mu$: i.e., $\Delta_n = A\sigma \div \mu$ (A = 0.725 ± 0.0977; ± ASE). All Gaussian-based data fitting results combined indicate that $\bar{\Delta} = A\sigma n^{-1/2}\mu^{-1} = A\sigma_{\bar{x}}\mu^{-1} = A\,CV[\bar{x}]$ whereupon $CV[\bar{x}]$ is the coefficient of variation for a population of means associated with x.

3.1. Equivalence of Sampling Errors Associated with Any PDF

The counting results alluded to above (P-I, P-II, & MPN) are similar to those observed previously: [5] [15] [16] i.e., stochastic sampling errors associated with microbiological colony counting and MPN data are proportional to the inverse square root of n × μ. Also, the Poisson population-based results compare favorably with those obtained from actual colony counting experiments [14]. Thus, for all Poisson-based data (Figure 1)

$$\bar{\Delta} \propto \frac{1}{\sqrt{n\mu}} = \frac{1}{\sqrt{n}}\frac{\sqrt{\mu}}{\mu} = \frac{\sigma}{\sqrt{n}}\frac{1}{\mu} = \frac{\sigma_{\bar{x}}}{\mu} = CV[\bar{x}] \qquad (11)$$

because $\sigma = \sqrt{\mu}$. We have simplified the expression by utilizing the term $\sigma_{\bar{x}}$ [22] (= $\sigma \div \sqrt{n}$), which can be derived using the propagation of errors method [20]. Such nomenclature exemplifies the utilization of $P_G$, as an approximation for $P_P$, associated with a population of sample means ($\bar{x}$) of mean $\mu_{\bar{x}}$ and standard deviation $\sigma_{\bar{x}}$. However, for MPN results, does $\sigma \sim \sqrt{\mu}$ as an approximation? This question is addressed in detail (Figures 4-6).

Figure 4. (A) & (C) Frequency of observing each set of MPN-based calculated number of entities per sample tested ($\bar{x}^+ = \ln[n \div (n - x^+)]$; μ = 0.8 for (A) & (B); μ = 0.4 for (C) & (D)) fit to Equation (5) (i.e., $P_G[\bar{x}^+]$ as a function of $\bar{x}^+$). (B) and (D) show that $\sigma_{fit} \sim \sqrt{\mu_{fit}} \div \sqrt{n} \sim \sigma_{\bar{x}}$ (i.e., for MPN, $\sigma = \sqrt{\mu}$) for all modeled n-samplings. Error bars are = $t_{0.05}$ × the experimental (overall) $s_{\bar{x}} = \sqrt{EMS \div n}$.

Figure 5. (A) & (B) Frequency of observing each set of MPN-based number of positive counts $x^+$ tested: μ = 0.8 and n = 3, 6, 9, 12, 24; (A) data points [red] = frequency of observed $x^+$, (B) data points [blue] = calculated frequency of $x^+$ using Mathematica, $P_b[x^+] = \text{Table}\left[N\left[(e^{-\mu})^{n - x^+}(1 - e^{-\mu})^{x^+}\frac{n!}{(n - x^+)!\,x^+!}\right], \{x^+, 0, n + 1\}\right]$ from $P_b$, Equation (3), fit to a Gaussian probability distribution: e.g., $P_G[x^+]$, Equation (5). (C) Demonstrates that $\sigma_{fit}^+ \propto \sqrt{\mu_{fit}^+}$. Linear fit showing slope (A) and intercept (I) ± ASE. The non-linear fits were $\sigma_{fit}^+ = A(\mu_{fit}^+)^{0.550 \pm 0.0463}$. Best fit curves shown ± P = 0.05 CL.

Figure 6. Demonstration that $d\bar{\Delta}/d(\sigma_{\bar{x}}/\mu) \sim A \sim d\bar{\Delta}/d((n\mu)^{-1/2})$. All data are plotted ± P = 0.001 CL. (A) is related to P-I data (A = 0.741 ± 0.0203; ± ASE). (B) is related to P-II data (A = 0.827 ± 0.0133). (C) is related to MPN data (A = 0.861 ± 0.0273). (D) is related to Gaussian data (A = 0.637 ± 0.0280). All data are merged in (E): the slope of this relationship, which involves all three PDFs, is 0.767 ± 0.0990.

In Figure 4(A) and Figure 4(C), we have examined some of our MPN data (μ = 0.8 per sample in Figure 4(A) and μ = 0.4 per sample in Figure 4(C) at the various levels of n-sampling) by converting the total number of positive occurrences ($x_j^+$) in n observations to the most probable number of entities in the hypothetical sampled aliquot ($\bar{x}_j^+ = \ln[n / (n - x_j^+)]$) and curve-fitting the frequency of occurrence of each $\bar{x}_j^+$ to Gaussian PDFs (Equation (5); $P_G[\bar{x}^+]$). From these curve fits we extracted the parameters $\sigma_{fit}$ and $\mu_{fit}$. In Figure 4(B) and Figure 4(D) we show that the average $\sigma_{fit}\sqrt{n} \sim \sqrt{\mu_{fit}}$ (i.e., $\sigma_{fit} = \sqrt{\mu_{fit}} \div \sqrt{n} = \sigma_{\bar{x}}$) and, therefore, $\sigma = \sqrt{\mu}$. This finding indicates that Equation (11) can be applied to both Poisson and MPN results as a reasonable approximation. We have confirmed the MPN results in Figure 2 and Figure 4 by showing that the frequency distribution of $x^+$ which we have observed in these experiments closely follows Equation (3) (compare Figure 5(A) with Figure 5(B)), whereupon we establish that $\sigma_{fit}^+$, the standard deviation associated with the distribution of $x^+$ via the Gaussian approximation, was proportional to $\sqrt{\mu_{fit}^+}$ (Figure 5(C)) for both observed (red data) and calculated (blue data) $x^+$ with a proportionality constant numerically similar to A (= 0.735 ± 0.0543; ±ASE) alluded to above.

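The equality in Equation (11) is also visually confirmed by the results shown in Figure 6, where one can see that all values of $\bar{\Delta}$ closely follow the linear expression $\bar{\Delta} = AX$ (for $X = \sigma_{\bar{x}} \div \mu$ or $(n\mu)^{-1/2}$; A = 0.781 ± 0.0107; ±ASE) showing that

$$\frac{\partial\bar{\Delta}}{\partial\left[(n\mu)^{-1/2}\right]} = \frac{\partial\bar{\Delta}}{\partial\left[\sigma_{\bar{x}} \div \mu\right]}.$$

Since the combined data in Figure 6 are linear with a near-zero intercept (−0.0168 ± 0.00443), then

$$\frac{\bar{\Delta}}{(n\mu)^{-1/2}} = \frac{\bar{\Delta}}{\sigma_{\bar{x}} \div \mu}$$

therefore cross-multiplying gives

$$\bar{\Delta}\,\left(\sigma_{\bar{x}} \div \mu\right) = \bar{\Delta}\,(n\mu)^{-1/2}$$

and dividing both sides by $\bar{\Delta}$ produces the equality

$$\sigma_{\bar{x}} \div \mu = (n\mu)^{-1/2}.$$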

All sampling error-related findings are summarized in Figure 7.

3.2. Demonstration That $A = s_{\Delta_j} / \bar{\Delta} = CV[\Delta_j]$

Lastly, all these assertions are substantiated by the observation (Figure 8) that the standard deviations associated with all our sampling error measurements ($s_{\Delta_j}$) change linearly as a function of the 4 (P-I, P-II, MPN, Gaussian) sets of $\bar{\Delta}$ data with an average slope (i.e., average of the 4 $\partial s_{\Delta_j} / \partial\bar{\Delta}$ values = 0.716 ± 0.0739) equivalent to the various values for A in Figures 1-3, Figure 5 and Figure 6. In fact, the slope in Figure 8 defines the coefficient of variation in $\bar{\Delta}$ ($CV[\Delta_j]$) and, if equal to A, then

$$\frac{\partial s_{\Delta_j}}{\partial\bar{\Delta}} = \frac{\partial\bar{\Delta}}{\partial X} \qquad (12)$$

where X = either $(n\mu)^{-1/2}$ or $\sigma_{\bar{x}} \div \mu$. Since $s_{\Delta_j}$ in Figure 8 and $\bar{\Delta}$ in Figure 6 are linear functions with a near-zero intercept then, assuming Equation (12) is true,

$$\frac{s_{\Delta_j}}{\bar{\Delta}} = \frac{\bar{\Delta}}{X}.$$

Substituting $\bar{\Delta}$ with $AX$

$$\frac{s_{\Delta_j}}{AX} = \frac{AX}{X}$$

$$\frac{s_{\Delta_j}}{X} = A^2$$

$$s_{\Delta_j} = A^2 X = A(AX) = A\bar{\Delta}$$

and therefore

$$\frac{s_{\Delta_j}}{\bar{\Delta}} = CV[\Delta_j] = A.$$

The above equality establishes that the coefficient of variation associated with $\bar{\Delta}$ ($CV[\Delta_j]$) is equivalent to the proportionality constant A seen in Figures 1-3 and Figure 6. Thus sampling errors can be estimated from the relationship $\bar{\Delta} = CV[\Delta_j] \times CV[\bar{x}]$ whereupon $CV[\Delta_j] \sim 0.75$ for all PDFs we have tested.

Figure 7. Summary of curve-fitting results associated with each PDF and method for calculating empirical stochastic sampling error ($\bar{\Delta}$). Each constant of proportionality A is presented ± ASE. For binomial data (MPN) $\mu = V\delta$ (the population average number of entities in V) and $nP^+ = \mu^+$ (the population average number of positive responses out of n observations).
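The closing relationship lends itself to a quick planning calculation. In this sketch the default of 0.75 for $CV[\Delta_j]$ is the approximate value quoted above, the Poisson form uses $\sigma = \sqrt{\mu}$, and the function names are hypothetical.

```python
import math

def cv_of_mean_poisson(n, mu):
    """CV of the sample mean for counting data: (n*mu)**-0.5."""
    return 1.0 / math.sqrt(n * mu)

def cv_of_mean_gaussian(n, mu, sigma):
    """CV of the sample mean for real-valued data: sigma / (sqrt(n) * mu)."""
    return sigma / (math.sqrt(n) * mu)

def expected_sampling_error(cv_mean, cv_delta=0.75):
    """Average sampling error as the product CV[delta_j] * CV[x-bar]."""
    return cv_delta * cv_mean

# n = 16 drops averaging about 3 colonies each
print(round(expected_sampling_error(cv_of_mean_poisson(16, 3)), 4))
```

For a Poisson population the two CV helpers agree when `sigma` is set to `math.sqrt(mu)`.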

3.3. Minimized Errors Associated with a Well-Sampled Food Microbiome via Most Probable Composition [7]

Based upon these results, the estimation of $CV[\bar{x}]$ (i.e., $s_{\bar{x}} \div \bar{x}$) should be germane in determining if data have been appropriately sampled. Figure 9 illustrates that native aerobic bacteria surviving on commercially available frozen vegetables were sufficiently sampled using n = 16 - 18, inasmuch as the $CV[\bar{x}]$ values associated with the normalized colony counts (CFU g−1 averaged across all $\ell$ dilutions = $\bar{x}_{\ell}$ ÷ 0.007 mL per drop ÷ $0.5^{\ell}$ dilution factor × 57.2 mL total original sample volume ÷ 28.6 g total frozen vegetable mass) were appropriately small (ranging between ca. 2% and 4%). In a similar vein, it is pertinent that the observed ($s$) and calculated ($\sqrt{\bar{x}_{\ell}}$) standard deviations associated with the counts per drop were equivalent, since the average deviation ($|s - \sqrt{\bar{x}_{\ell}}|$) from ideality was only 15.7% ± 3.54% (± $s_{\bar{x}}$). Lastly, it is also significant that the dilution factors calculated from the ratios of average plate counts ($\bar{x}_{\ell} \div \bar{x}_{\ell-1}$) were very close to ½ (average 0.523 ± 0.0172), which also argues for a minimized Δ.

Figure 8. (A)-(D): Dependency of the standard deviation (plotted ± P = 0.05 confidence limits) derived from each experimental $\Delta_j$ array ($Y = s_{\Delta_j}$; $j = 1, 2, \ldots, 25$) on their averages ($X = \bar{\Delta}$): Figure 8(A) = P-I data (Spearman's coefficient of rank correlation [22]: $\rho_S = 0.996$; $P < 10^{-3}$); Figure 8(B) = P-II data ($\rho_S = 0.988$; $P < 10^{-3}$); Figure 8(C) = MPN data ($\rho_S = 0.979$; $P < 10^{-3}$); Figure 8(D) = Gaussian data ($\rho_S = 0.994$; $P < 10^{-3}$). The average slope associated with these 4 relationships = 0.716 ± 0.0739 (± s). All points (25 $\bar{\Delta}$ values per set) from (A) through (D) are combined in the bottom-most figure ($d s_{\Delta_j} / d\bar{\Delta} = 0.661 ± 0.0186$; ± ASE). The value $d s_{\Delta_j} / d\bar{\Delta}$ is equivalent to an experimental coefficient of variation for $\bar{\Delta}$ = $CV[\Delta_j]$.

Figure 9. Estimation of the stochastic sampling errors ($\bar{x}_{\ell} \div \bar{x}_{\ell-1}$ ~ calculated dilution factors; $s \sim \sqrt{\bar{x}_{\ell}}$; $CV[\bar{x}]$ for all counts ~4% across all dilutions $\ell$) associated with a well-sampled [15] (n = 16 - 18) Poisson population (native bacteria on frozen vegetables: 28.6 grams rinsed with 57.2 mL PBS + Tween 20). All the colonies in $\ell = 2$ (Control & grown at 30˚C = 55 colonies; Hollow Fiber Concentrated & grown at 30˚C = 49 colonies; Control & grown at 37˚C = 41 colonies; Hollow Fiber Concentrated & grown at 37˚C = 41 colonies) were collected and identified using 16S rDNA Sanger sequencing (EubA and EubB primers) as described previously [7]. Bacterial compositions were nearly identical for all samplings and treatment combinations.
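The colony-count normalization and the dilution-factor check described in the text for Figure 9 can be sketched as follows. This is an illustration under stated assumptions, not the authors' worksheet: the function names are hypothetical, the dilution index is assumed to enter as an exponent of 0.5 (a two-fold dilution series), and the example counts are invented for demonstration.

```python
DROP_VOLUME_ML = 0.007    # mL per drop (from the text)
DILUTION_BASE = 0.5       # assumed two-fold dilution per step
SAMPLE_VOLUME_ML = 57.2   # total original sample volume, mL (from the text)
SAMPLE_MASS_G = 28.6      # total frozen vegetable mass, g (from the text)


def cfu_per_gram(mean_count, dilution):
    """Normalize an average count per drop at a given dilution to CFU per gram."""
    per_ml = mean_count / DROP_VOLUME_ML / DILUTION_BASE ** dilution
    return per_ml * SAMPLE_VOLUME_ML / SAMPLE_MASS_G


def dilution_ratios(mean_counts):
    """Ratios of consecutive average counts; ideally each is close to 0.5."""
    return [later / earlier for earlier, later in zip(mean_counts, mean_counts[1:])]


# Invented average counts per drop at three consecutive dilutions.
example_means = [8.0, 4.0, 2.0]
print(dilution_ratios(example_means))
print(cfu_per_gram(3.5, 2))
```

If sampling error is small, the normalized CFU per gram should be nearly the same at every dilution, and the consecutive ratios should stay close to the nominal dilution factor.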

Across the 4 observational sets (I, II, III, and IV) depicted in Figure 9, the total number of collected colonies (from $\ell = 2$) was 55 (n = 16), 49 (n = 16), 42 (n = 17), and 41 (n = 18), respectively. Bacterial identifications for each of these colonies were based upon matching 1200 - 1400 base pair rDNA contigs against NCBI's GenBank database. The rRNA "gene" sequencing results for the 2 major isolates (making up 88.3% ± 3.28% of the total sampled colonies) show that the 4 sets of observed bacterial compositions were nearly identical (43.6% ± 8.05% Leuconostoc and 44.6% ± 13.3% Lactococcus; ±s) [23]. The remainder of the colonies was mainly Acinetobacter (3.74% ± 3.34%) and Streptococcus (4.17% ± 2.75%) with small amounts of diverse isolates (e.g., Staphylococcus, Arthrobacter, Sphingobacterium, Enterococcus, Kocuria, Raoultella, and Bacillus: averaging 1.49% ± 1.09% each). Such variability is expected for the relatively rare isolates (≤4%) due to errors associated with random sampling. The two major species sampled were relatively repeatable because of their abundance, adequate sampling, and very little treatment effect. The minor constituents would have to have been sampled 2.77 ± 0.647-fold more (n > 44) for an accuracy equivalent to that of the Leuconostoc and Lactococcus fractions, since the requisite number of samplings for the low count fractions, above, is proportional to the inverse cube root [5] [16] of the number of counts per sampled volume (~ $\sqrt[3]{\bar{x}_{major}} \div \sqrt[3]{\bar{x}_{minor}}$).
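The inverse-cube-root scaling in the preceding paragraph can be made concrete with a short calculation. The sketch below is illustrative only: the function name is hypothetical, the abundances 0.44 and 0.02 are round numbers chosen for demonstration rather than the measured fractions, and it assumes that counts per sampled volume scale with relative abundance.

```python
def fold_more_samplings(major_abundance, minor_abundance):
    """Fold-increase in n for a minor isolate, from the inverse cube root rule:
    n_minor / n_major = cbrt(x_major) / cbrt(x_minor)."""
    return (major_abundance / minor_abundance) ** (1.0 / 3.0)


# Illustrative abundances (assumed): a major isolate at 44% and a minor one at 2%.
fold = fold_more_samplings(0.44, 0.02)
n_required = 16 * fold  # starting from n = 16 observations
print(round(fold, 2), round(n_required, 1))

# The fold-increase reported in the text (2.77) applied to n = 16:
print(16 * 2.77)
```

The last line shows where the stated n > 44 comes from: 16 observations multiplied by the reported 2.77-fold increase.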

4. Summary

We have performed analyses associated with empirical stochastic sampling errors linked to data generated from 3 common probability density functions. We have used these to describe the limiting behavior of Δ by generating models which suggest a generalized, and facile, mathematical solution. Based upon all our experiments, the common algebraic solution, regardless of parent distribution, is that experimental sampling errors are proportional to $\sigma_{\bar{x}} \div \mu$. This generalized relationship is intuitively reasonable inasmuch as this is the CV for any population of sample means ($CV[\bar{x}]$) and describes how closely $\bar{x}$ values approach μ as n increases. The proportionality constant for all these findings was found to be mathematically related to $CV[\Delta_j]$ or $\partial s_{\Delta_j} / \partial \bar{\Delta}$, which is the coefficient of variation associated with the error measurement itself. Lastly, using estimates of these sampling-associated errors ($CV[\bar{x}] \sim s_{\bar{x}} \div \bar{x}$), we show that when a test microbiome was sufficiently sampled, several measures of stochastic sampling error were reasonably small for both counting and DNA sequence-based results.

Definitions

Indices = $i$ ($= 1, 2, \ldots, n$) observations per experiment; $j$ ($= 1, 2, \ldots, J = 100$) experiments with n observations each; $k$ ($= 1, 2, \ldots, K$) rows of X-Y values; $\ell$ ($= 1, 2, \ldots, L$) dilutions; $m$ ($= 1, 2, \ldots, M$) iterations; $p$ ($= 1, 2, \ldots, P$) parameters

$\Delta_j$ = jth experimental measure of sampling error out of J = 100 experiments: Equations (7)-(10).

$\bar{\Delta}$ = average sampling error in J = 100 observations of $\Delta_j$

A = proportionality constant associated with $\bar{\Delta}$ curve-fitting to n, μ (or σ)

$s_{\Delta_j}$ = standard deviation associated with $\Delta_j$ measurement; for this work there are 25 ($n \times \mu$, or $n \times \sigma$ for the Gaussian populations) such $s_{\Delta_j}$ for each PDF type (2 types of Poisson, MPN or binomial, Gaussian)

μ = for either Poisson PDF or MPN assays ($\mu = V\delta$), the population average number of biological entities, or other analytes, per test; for Gaussian PDF, the population's average of any real-valued, randomly changing variable

V = the sample volume to be tested

$V_e$ = volume of the biological entity, or other analyte, being tested

δ = concentration of the biological entity (count ÷ V) or other analyte

$\mu_+$ = population average number of positive growth responses (MPN) out of n observations; $\mu_+ = nP_+$

$\sigma_+$ = the standard deviation associated with the probability density of $x^+$; the Gaussian approximations for $\sigma_+$ are plotted in Figure 5(C) as a function of Gaussian best fits for $\mu_+$

$P_-$ = probability that $V_e$ will NOT contain the biological entity, or other analyte, being tested

$P_+$ = probability that $V_e$ will contain the biological entity, or other analyte, being tested; $P_+ = 1 - P_-$; Equation (1)

$\partial_X f[X] = \partial f[X] / \partial X$

$x_{ij}$ = for Poisson populations, the ith observation's number of counts per tested volume, surface area, etc. for each jth experiment; for Gaussian populations, any real-valued, randomly changing variable

$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$

$x_j^+$ = jth experiment's number of positive growth responses out of n observations; $x_j^+ = \sum_{i=1}^{n} \theta_{ij}$ where θ = 1 (positive) or 0 (negative)

$\bar{x}_j^+$ = jth experiment's number of positive counts in V volume; $\bar{x}_j^+ = \ln[n \div (n - x_j^+)]$; the x-bar symbol is used here because this relation contains a parameter, $x_j^+$, which is the result of a summation across all $\theta_{ij}$; it just isn't normalized to n

n = number of technical replicates in each jth experiment; for MPN, number of observations each of volume V; for Poisson populations we have found [15] that the minimal number of replicates per assay was $n_{calc} = n_{\mu 1}\,\mu^{-1/3}$, where $n_{\mu 1}$ is the number of replicates necessary to enumerate a population with μ = 1

σ = population standard deviation associated with μ

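$\sigma_{\bar{x}}$ = standard deviation of a population of sample means ($\bar{x}$); the formula for the $\sigma_{\bar{x}}$ statistic can be derived from the propagation of errors method [20] without covariance

$$\sigma_{\bar{x}} = \sqrt{\left(\frac{\partial \bar{x}}{\partial x_1}\right)^2 \sigma_{x_1}^2 + \left(\frac{\partial \bar{x}}{\partial x_2}\right)^2 \sigma_{x_2}^2 + \cdots + \left(\frac{\partial \bar{x}}{\partial x_n}\right)^2 \sigma_{x_n}^2} = \sqrt{\frac{n\sigma^2}{n^2}} = \frac{\sigma}{\sqrt{n}}$$

since

$$\frac{\partial \bar{x}}{\partial x_1} = \frac{\partial \bar{x}}{\partial x_2} = \cdots = \frac{\partial \bar{x}}{\partial x_n} = \frac{1}{n}$$

and

$$\sigma_{x_1}^2 = \sigma_{x_2}^2 = \cdots = \sigma_{x_n}^2 = \sigma^2.$$

$s_j$ = any jth experiment's estimation of population standard deviation

$s_{\bar{x}}$ = estimation of $\sigma_{\bar{x}}$ from a limited number of $\bar{x}_j$; $s_{\bar{x}} = s_j \div \sqrt{n}$

$CV[\bar{x}]$ = coefficient of variation for a population of means; $CV[\bar{x}] = \sigma_{\bar{x}} \div \mu_{\bar{x}} = \sigma_{\bar{x}} \div \mu$, estimated as $s_{\bar{x}} \div \bar{x}$

$CV[x]$ = coefficient of variation for any set of observations x; $CV[x] = \sigma \div \mu$, estimated as $s \div \bar{x}$

$CV[\Delta_j] = \partial s_{\Delta_j} / \partial \bar{\Delta} \sim s_{\Delta_j} \div \bar{\Delta}$ if the $s_{\Delta_j}$ vs. $\bar{\Delta}$ intercept ~ 0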

CLT = central limit theorem: the mean ($\mu_{\bar{x}}$) of a population of observed means ($\bar{x}$) will be approximately equal to the mean of the sampled population (μ) and the standard deviation of this population of means will be approximately equal to $\sigma_{\bar{x}}$; Equation (5) with $x = \bar{x}$, $\mu = \mu_{\bar{x}} = \mu$, and $\sigma = \sigma_{\bar{x}}$

PDF = probability density function or probability distribution function

$P_b$ = binomial PDF: Equation (3)

$P_P$ = Poisson PDF: Equation (4)

$P_G$ = Gaussian PDF: Equation (5)

CL = confidence limit = t-statistic × $s_{f_k}$ = $t\,s_{f_k}$

ASE = asymptotic standard error [19]; for any fitting parameter ω,

$ASE = s_{\omega} = \sqrt{s_Y^2\,[Z^T Z]^{-1}_{\omega\omega}}$; $s_Y^2$ = residual sum of squares ÷ (K − M) where M = the number of fitting parameters $\pi_p$ ($p = 1, 2, \ldots, P$)

$s_{f_k}$ = kth row standard error of fitting function $f_k$; $s_{f_k} = \sqrt{s_Y^2\,(Z_k [Z^T Z]^{-1} Z_k^T)}$

Z = partial first derivative matrix of $f_k$ with respect to associated fitting parameters $\pi_1, \pi_2, \ldots, \pi_P$

$Z^T$ = transposition of Z

$Z_k = [\,\partial_{\pi_1} f_k \;\; \partial_{\pi_2} f_k \;\; \cdots\,]$ for $f_k = f[X_k; \pi_p]$
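The ASE and $s_{f_k}$ definitions above can be illustrated for the simplest case, a straight line $f_k = \pi_1 X_k + \pi_2$, for which each row of Z is $[X_k, 1]$ and $[Z^T Z]^{-1}$ has a closed form. The sketch below is an assumption-laden example rather than the fitting procedure used in the paper: the function name and the data points are invented, and the residual variance divides by K minus the two fitted parameters.

```python
import math


def fit_line_with_ase(xs, ys):
    """Least-squares fit of Y = slope*X + intercept with asymptotic standard errors."""
    k = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = k * sxx - sx * sx                      # determinant of Z^T Z
    slope = (k * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    s_y2 = rss / (k - 2)                         # residual variance, 2 parameters
    # Closed-form [Z^T Z]^-1 for rows Z_k = [X_k, 1]
    inv = [[k / det, -sx / det], [-sx / det, sxx / det]]
    ase_slope = math.sqrt(s_y2 * inv[0][0])      # sqrt(s_Y^2 [Z^T Z]^-1_ww)
    ase_intercept = math.sqrt(s_y2 * inv[1][1])

    def s_f(x):
        """Standard error of the fitted function at X = x: sqrt(s_Y^2 Z_k inv Z_k^T)."""
        z = (x, 1.0)
        quad = sum(z[a] * inv[a][b] * z[b] for a in range(2) for b in range(2))
        return math.sqrt(s_y2 * quad)

    return slope, intercept, ase_slope, ase_intercept, s_f


# Invented data for demonstration only.
slope, intercept, ase_slope, ase_intercept, s_f = fit_line_with_ase(
    [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]
)
print(slope, intercept, ase_slope, ase_intercept, s_f(2.5))
```

A confidence limit at any X then follows the CL definition above: multiply `s_f(X)` by the appropriate t-statistic for K − 2 degrees of freedom.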

Cite this paper
Irwin, P. , He, Y. and Chen, C. (2019) Deconvolution of the Error Associated with Random Sampling. Advances in Pure Mathematics, 9, 205-227. doi: 10.4236/apm.2019.93010.
References
[1]   Halvorson, H.O. and Ziegler, N.R. (1933) Application of Statistics to Problems in Bacteriology. I. A Means of Determining Bacterial Population by the Dilution Method. Journal of Bacteriology, 25, 101-121.

[2]   Kubitschek, H.E. (1990) Cell Volume Increase in Escherichia coli after Shifts to Richer Media. Journal of Bacteriology, 172, 94-101.
https://doi.org/10.1128/jb.172.1.94-101.1990

[3]   Barkworth, H. and Irwin, J.O. (1938) Distribution of Coliform Organisms in Milk and the Accuracy of the Presumptive Coliform Test. Journal of Hygiene, 38, 446-457.
https://doi.org/10.1017/S0022172400011311

[4]   Best, D.J. (1990) Optimal Determination of Most Probable Numbers. International Journal of Food Microbiology, 11, 159-166.
https://doi.org/10.1016/0168-1605(90)90051-6

[5]   Irwin, P., Reed, S., Nguyen, L., Brewster, J. and He, Y. (2013) Non-Stochastic Sampling Error in Quantal Analyses for Campylobacter Species on Poultry Products. Analytical and Bioanalytical Chemistry, 405, 2353-2369.
https://doi.org/10.1007/s00216-012-6659-2

[6]   Irwin, P., Gehring, A., Tu, S.-I., Brewster, J., Fanelli, J. and Ehrenfeld, E. (2000) Minimum Detectable Level of Salmonellae Using a Binomial-Based Ice Nucleation Detection Assay. Journal of AOAC International, 83, 1087-1095.

[7]   Irwin, P.L., Nguyen, L.-H.T., Chen, C.-Y. and Paoli, G. (2008) Binding of Nontarget Microorganisms from Food Washes to Anti-Salmonella and anti-E. coli O157 Immunomagnetic Beads: Most Probable Composition of Background Eubacteria. Analytical and Bioanalytical Chemistry, 391, 525-536.
https://doi.org/10.1007/s00216-008-1959-2

[8]   de St. Groth, S.F. (1982) The Evaluation of Limiting Dilution Assays. Journal of Immunological Methods, 49, R11-R23.
https://doi.org/10.1016/0022-1759(82)90269-1

[9]   Bevington, P.R. and Robinson, D.K. (1992) Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill, Boston, 17-23 and 41-43.

[10]   Irwin, P., Fortis, L. and Tu, S.-I. (2001) A Simple Maximum Probability Resolution Algorithm for Most Probable Number Analysis Using Microsoft Excel. Journal of Rapid Methods and Automation in Microbiology, 9, 33-51.
https://doi.org/10.1111/j.1745-4581.2001.tb00226.x

[11]   Gosset, W.S. (1907) “Student” on the Error of Counting with a Haemocytometer. Biometrika, 5, 351-360.
https://doi.org/10.1093/biomet/5.3.351

[12]   Fisher, R.A. (1922) On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society, London, Series A, 222, 309-368.
https://doi.org/10.1098/rsta.1922.0009

[13]   Irwin, P.L., Nguyen, L.-H.T., Paoli, G.C. and Chen, C.-Y. (2010) Evidence for a Bimodal Distribution of Escherichia coli Doubling Times below a Threshold Initial Cell Concentration. BMC Microbiology, 10, 207.

[14]   Chen, C.-Y., Nace, G.W. and Irwin, P.L. (2003) A 6×6 Drop Plate Method for Simultaneous Colony Counting and MPN Enumeration of Campylobacter jejuni, Listeria monocytogenes, and Escherichia coli. Journal of Microbiological Methods, 55, 475-479.
https://doi.org/10.1016/S0167-7012(03)00194-5

[15]   Irwin, P.L., Nguyen, L.-H.T. and Chen, C.-Y. (2008) Binding of Nontarget Microorganisms from Food Washes to Anti-Salmonella and Anti-E. coli O157 Immunomagnetic Beads: Minimizing the Errors of Random Sampling in Extreme Dilute Systems. Analytical and Bioanalytical Chemistry, 391, 515-524.
https://doi.org/10.1007/s00216-008-1961-8

[16]   Irwin, P.L., Nguyen, L.-H.T. and Chen, C.-Y. (2010) The Relationship between Purely Stochastic Sampling Error and the Number of Technical Replicates Used to Estimate Concentration at an Extreme Dilution. Analytical and Bioanalytical Chemistry, 398, 895-903.
https://doi.org/10.1007/s00216-010-3967-2

[17]   Trotter, H.F. (1959) An Elementary Proof of the Central Limit Theorem. Archiv der Mathematik, 10, 226-234.
https://doi.org/10.1007/BF01240790

[18]   Hartley, H.O. (1961) The Modified Gauss-Newton Method for Fitting of Non-Linear Regression Functions by Least Squares. Technometrics, 3, 269-280.
https://doi.org/10.1080/00401706.1961.10489945

[19]   Irwin, P.L., Damert, W.C. and Doner, L.W. (1994) Curve Fitting in Nuclear Magnetic Resonance Spectroscopy: Illustrative Examples Using a Spreadsheet and Microcomputer. Concepts in Magnetic Resonance, 6, 57-67.
https://doi.org/10.1002/cmr.1820060105

[20]   Beers, Y. (1957) Introduction to the Theory of Error. Addison-Wesley Publishing Company, Inc., Reading, 29-30.

[21]   Salter, C. (2000) Error Analysis Using the Variance-Covariance Matrix. Journal of Chemical Education, 77, 1239-1243.
https://doi.org/10.1021/ed077p1239

[22]   Steel, R.G.D. and Torrie, J.H.D. (1960) Principles and Procedures of Statistics. McGraw-Hill, New York, 409.

[23]   Irwin, P., Capobianco, J., Nguyen, L., He, Y., Gehring, M., Gehring, A. and Chen, C.-Y. (2019) Bacterial Cell Recovery after Hollow Fiber Microfiltration Sample Concentration and Washing: Most Probable Bacterial Composition in Frozen Vegetables.

 
 