Working with Aggregate Data: An Excel Macro for Pairwise Comparison Using Z Test for Two Proportions

Show more

1. Introduction

Aggregate data refers to numerical or non-numerical information that is: 1) collected from multiple sources and/or on multiple measures, variables, or individuals; and 2) compiled into data summaries or summary reports, typically for the purposes of public reporting or statistical analysis―i.e., examining trends, making comparisons, or revealing information and insights that would not be observable when data elements are viewed in isolation [1] . Because the unit of analysis in aggregated data is no longer at the individual entity level, researchers must exercise care in trying to conduct correlational or inferential statistics to avoid spurious results.

Aggregate data might still yield important information by moving to the next higher unit of analysis that provides a grouping unifier. Various versions of Chi Square, time series and proportional analyses may still be performed on aggregate datasets where a proper unifier exists.

Proportional aggregate analysis is the focus of this paper. A method and an Excel VBA macro is demonstrated that compares and contrasts a spreadsheet (above row to row) for unique and non-repeating pairwise row comparisons. This procedure incorporates the familiar Z-Test for Two Proportions to test paired data for statistical significance at α ≤ 0.05 [2] ._{ }

2. Fundamental Principles

The VBA macro uses an “up one row”, “down one row” iteration that populates the variables for p_{i} and p_{j}. The built-in Z-test for proportions has a two-tailed null hypothesis of no statistically significant difference between two proportions, H_{0}: P_{i} = P_{j}. The alternate hypothesis is Ha: P_{i} ≠ P_{j}. There are three assumptions inherent in this procedure: 1) sampling independence; 2) sufficient size (≥5; the macro will reject if violated); and 3) randomness of selection. A pooled proportion is used to compute the standard error of the sampling distribution, using the individual proportions, p_{i} and p_{j} and the associated population for each, n_{i} and n_{j}. The test statistic is a Z-score which is the ratio of the absolute proportion difference divided by the standard error. Significance is determined as Z ≥ 1.96, the two-tailed critical value for a normal distribution.

3. An Illustration

For illustration purposes, a mock research question was created that asked if there were any statistically significant between-county differences in the proportion of registered voters for the Green Party within the state of Arizonain January 2017 [3] . After minor cleansing, the data were inserted into a blank Excel macro-enabled (pairwise.xlsm) spreadsheet which incorporates the pairwise macro described in this report. The order of insertion must be followed exactly (Group Name, Sample Size and Total) starting in cell “A1” which is required by the macro (Figure 1).

The goal for this mock research question was to determine if there were any statistically significant proportional differences of Green Party registered voters between compared counties. For example, is the proportion of Green Party registered voters in Apache County significantly different from the proportions of Green Party registered voters in other counties? How many matched pairings of county-county data would be significantly different? This information could be pursued to investigate trends and patterns.

Because of the requirement of the Z-test for proportional differences, the minimum number of registered voters per county was 5. Only one county, Greenlee,

Figure 1. The correct order of data insertion starting at Cell “A1”. Note: As of January 2017 per https://www.azsos.gov/elections/voter-registration-historical-election-data/voter-registration-counts

failed to meet the minimum sample size and all of its combinations were eliminated.

4. Results

Output begins in cell “F1” and continues for k(k − 1)/2 rows. For the fifteen rows illustrated, an output of 105 rows is generated (Table 1). The output grows exponentially and while the macro can accommodate very large datasets, there is a practical output limitation. For example, 50 rows of input would create 1225 matched pairs of unique data. The size of the input range is the researcher’s choice.

This exercise was primarily for illustration but it did use real data which produced real results. Of the 105 county-county combinations, 59 (56%) showed statistically significant differences. Questions need to be asked of the data so that the differences in Green Party registered voters could perhaps be explained. For those in the social or political sciences, these differences might be important to pursue.

5. The Macro Methodology

The VBA macro uses an “up one row”, “down one row” iteration that populates the upper row/lower row variables with their respective proportions, P_{1} and P_{2}. With these values, the null hypothesis (P_{1} = P_{2}) can be tested using the following standard proportion equations.

1) The pooled proportion:

$p=\frac{{p}_{i}\ast {n}_{i}+{p}_{j}\ast {n}_{j}}{{n}_{i}+{n}_{j}}$

Table 1. Results of pairwise comparison of between-county Z-test of proportions for green party registered voters.

Note: All comparisons using Greenlee county were rejected because of small sample size (less than or equal to 5).

where:

p = the pooled sample proportion,

p_{i} = first proportion,

p_{j} = second proportion,

n_{i} = population size associated with the first proportion,

n_{j} = population size associated with the second proportion.

2) The standard error of the weighted samples:

$s{e}_{pi-pj}=\sqrt{p\left(1-p\right)\ast \left[\left(\frac{1}{{n}_{i}}\right)+\left(\frac{1}{{n}_{j}}\right)\right]}$

where:

se_{pi-p}_{j} = the standard error,

p = the weighted estimate of two populations,

n_{i} = sample size associated with the first proportion,

n_{j} = sample size associated with the second proportion.

3) The determination of the Z-score:

$Z=\frac{\left|{p}_{i}-{p}_{j}\right|}{se}$

where:

Z = the Z-score,

p_{i} = first proportion,

p_{j} = second proportion,

se_{pi-p}_{j} = the standard error.

The null hypothesis is rejected if the Z-score exceeds 1.96, the two-tailed critical value that is associated with a p-value ≤ 0.05.

6. Conclusions

An Excel macro procedure has been demonstrated as a screening tool to reveal patterns within aggregate data. It creates unique within-column pairwise comparisons and tests the data for proportional statistical significance. This method could be applied where aggregated data is available that includes, as a minimum, the named group, a proportion or count of a desired variable and a total for each row. The exponential growth of the output as the number of rows (k) increases will be a practical limiting factor.

The Excel macro can be saved as an Excel macro file (*.xlsm) and various internet references can be accessed for instructions for using an Excel macro files as an add-in.

This macro is also available for download at http://www.viclandry.com/pairwise-comparison.html

The VBA Macro

Sub Pairwise()

Dim i As Integer

Dim j As Integer

Dim k As Integer

Dim lastrow As Long

Dim answer As Variant

Dim n1 As Variant

Dim n2 As Variant

Dim p As Variant

Dim p1 As Variant

Dim p2 As Variant

Dim z As Variant

Dim se As Variant

Dim r As Variant

MsgBox ("You must have HEADERS with category names in Column A; place data in Column B; place interval COUNTS in Column C")

lastrow = (Cells(Rows.Count, "A").End(xlUp).Row)-1

Range("f1").Value = "Compared Groups"

Range("f1").Offset(0, 1) = "Group 1"

Range("f1").Offset(0, 2).Value = "N1"

Range("f1").Offset(0, 3).Value = "P1"

Range("f1").Offset(0, 4).Value = "Group 2"

Range("f1").Offset(0, 5).Value = "N2"

Range("f1").Offset(0, 6).Value = "P2"

Range("f1").Offset(0, 7).Value = "Z-Score"

Range("f1").Offset(0, 8).Value = "Result"

For i = 1 To lastrow

For j = i + 1 To lastrow

k = k + 1

Range("f1").Offset(k, 0).Value = (Range("a1").Offset(i, 0).Value & " - " & Range("a1").Offset(j, 0).Value) 'first row header

p1 = Range("a1").Offset(i, 1).Value/Range("a1").Offset(i, 2).Value 'value for first proportion

p2 = Range("a1").Offset(j, 1).Value/Range("a1").Offset(j, 2).Value 'value for second proportion

r = (Abs(p1 - p2)) 'find absolute difference

n1 = Range("a1").Offset(i, 2).Value

n2 = Range("a1").Offset(j, 2).Value

p = ((p1 * n1) + (p2 * n2))/(n1 + n2)

se = Sqr((p * (1 - p)) * ((1/n1) + (1/n2)))

z = r/se

Range("f1").Offset(k, 1).Value = Range("a1").Offset(i, 1).Value 'first count

Range("f1").Offset(k, 2).Value = n1 'first total

Range("f1").Offset(k, 3).Value = Round(p1, 4) 'first proportion

Range("f1").Offset(k, 4).Value = Range("a1").Offset(j, 1).Value 'second count

Range("f1").Offset(k, 5).Value = n2 'second total

Range("f1").Offset(k, 6).Value = Round(p2, 4) 'second proportion

Range("f1").Offset(k, 7).Value = Round(z, 4) 'z score

If z > 1.96 Then

Range("f1").Offset(k, 8).Value = "Sig."

Else

Range("f1").Offset(k, 8).Value = "NS"

End If

If n1 * p1 < 6 Or n2 * p2 < 6 Then

Range("f1").Offset(k, 8).Value = "N<=5"

End If

Next j

Next i

Range("f1:m1").EntireColumn.AutoFit

End Sub

References

[1] Concepts, L. (2015) Aggregate Data Definition.

http://edglossary.org/aggregate-data/

[2] StatisticsLectures.com (2017) Z-Test for Proportions, Two Samples.

http://www.statisticslectures.com/topics/ztestproportions/

[3] Arizona Secretary of State (2017) Voter Registration Counts.

https://www.azsos.gov/elections/voter-registration-

historical-election-data/voter-registration-counts