Estimating total usage incidence from samples
I’m trying to use Excel to estimate total unique visitors and total 1, 2 and 3+ time users from three (non-exclusive) samples of varying size and %s. Then, to estimate the same values from all three samples combined – an example follows (the data are attached in a spreadsheet too)
#1:
Total count: 300,000
Sample available: 30,000
Sample Size % of Total: 10.0%
Similarly, groups 2 & 3 lay out the same way:
Group 2:
5,000 (total)
2,000 (sample)
40.0% (sample %)
Group 3:
40,000
10,000
25.0%
Total
345,000
42,000
12.2%
Next, I use individual names to count duplicates within each group to get numbers like the following:
Group 1:
Uniques: 24,000
Single Use Sample: 20,000
Dual Use Sample: 3,000
3+ Use Sample: 1,000
…using the same layout for other groups:
Group 2:
1,150 (uniques)
1,000 (1x)
100 (2x)
50 (3+x)
Group 3:
7,100
6,000
1,000
100
Adding across the groups, provides an incorrect “additive total” as follows:
“Additive Total”:
32,250
27,000
4,100
1,150
If I actually combine the raw, individual name datafrom all three groups, I might get true, deduped total numbers like:
30,900
25,000
4,400
1,500
You’ll notice that the increased incidence of multi-use persons (versus the “additive total”) when combining the three different groups as some of the same users exist across the different groups. This, in turn, reduces the one-time use people and uniques relative to the “additive total” figures.
I point this out because the next step, and ultimate objective, is to estimate the uniques and single, dual and 3+ use for each of the groups individually and in total: Here again, I can’t use straight-line math to get to a total as when the quantities of users increase, the uniques and single-use persons will be reduced versus a linearly-estimated total.
As an example: I can’t take 24,000 uniques in the Group 1 sample and divide by the 10% sample size to get an estimated 240,000 total uniques as many of the sampled uniques would show up again if we actually had the full roster of 300,000 Group 1 people; the real uniques value, from the full dataset might be 180,000 or 210,000.
Similarly, when combining the samples across groups, I can’t apply the 12% factor to the combined total sample uniques of 30,900 to get an estimated total of 253,821 as the multi-use counts would increase, decreasing the uniques and one-time use individuals.
So that was a long, but (hopefully) illustrative, way of getting to the question of “What’s an Excel command or formula that can be applied to the individual sample groups, and to the combined sample totals, to estimate the full data set values?"
It's worth noting that sample sizes might vary from 10-100% in the groups and across different sets of other groups of users that I need to compare to. The bottom line is I need estimates for this set of groups, and several other varied sets of groups so I can compare the absolute, estimated uniques, 1, 2 and 3+ numbers across the sets.
Thanks for your help!
Attachment | Size |
---|---|
Usage Incidence puzzle.xlsx | 11.49 KB |
Recent comments
5 years 42 weeks ago
6 years 28 weeks ago
6 years 40 weeks ago
6 years 42 weeks ago
6 years 43 weeks ago
6 years 49 weeks ago
7 years 5 weeks ago
7 years 5 weeks ago
7 years 5 weeks ago
7 years 5 weeks ago