Market size and Population can be confusing when we select "complete cases"

With some frequency we hear questions about the "Market Size" or "Population Estimates" in Crunch.

Explaining this is best done with an example, so let's use the one below.

[Image: the mini dataset, a table with the columns "id", "Q1", "Q2", and "Q3"]

Our mini dataset has 7 respondents, labeled by the "id" column. These 7 people answered 3 questions: "Q1", "Q2", and "Q3". They could only answer the questions with: "TRUE", "FALSE", or they could skip the question. If someone skipped a question we need to record something for them, so we record "NA".

Let's make a copy of our mini dataset and think through how we get a percentage.


How many people said "TRUE" to any of the 3 questions? Let's color their text Blue. The answer is 4.

How many people answered any of the 3 questions? Let's shade those rows Yellow. The answer is 6.

So what percentage answered "TRUE"? (4/6) = 67%
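
The same count can be sketched in a few lines of Python. The rows below are hypothetical: the original table is an image that is not reproduced here, so the values were invented only to match the counts in the text (7 respondents, 4 who said "TRUE", 6 who answered anything). The names `responses`, `answered_any`, and `said_true` are illustrative, not part of Crunch.

```python
# Hypothetical rows invented to match the counts in the text.
# Each tuple is (Q1, Q2, Q3); None stands for a skipped question ("NA").
responses = {
    1: (True, False, True),
    2: (False, True, False),
    3: (True, True, True),
    4: (False, False, False),
    5: (True, None, None),
    6: (None, False, None),
    7: (None, None, None),
}

# Denominator: respondents who answered at least one question (the Yellow rows).
answered_any = [rid for rid, row in responses.items()
                if any(v is not None for v in row)]

# Numerator: respondents who said TRUE to at least one question (the Blue text).
said_true = [rid for rid, row in responses.items()
             if any(v is True for v in row)]

percentage = len(said_true) / len(answered_any)
print(len(said_true), len(answered_any), f"{percentage:.0%}")  # prints: 4 6 67%
```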

Now let's make a second copy of our mini dataset. This time, what if we restrict our analysis to just the people who answered all 3 questions (aka "complete cases")?


Of the "complete cases", how many people said "TRUE" to any of the 3 questions? Let's color their text Red. The answer is 3.

How many people are "complete cases", meaning they answered all 3 questions? Let's shade those rows Green. The answer is 4.

So what percentage answered "TRUE"? (3/4) = 75%
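
As a rough sketch, the "complete cases" restriction amounts to filtering the rows before counting. The data is again hypothetical, invented only to match the counts in the text (4 complete cases, 3 of whom said "TRUE"), and the names `complete` and `said_true_complete` are illustrative rather than anything Crunch defines.

```python
# Hypothetical rows invented to match the counts in the text.
# Each tuple is (Q1, Q2, Q3); None stands for a skipped question ("NA").
responses = {
    1: (True, False, True),
    2: (False, True, False),
    3: (True, True, True),
    4: (False, False, False),
    5: (True, None, None),
    6: (None, False, None),
    7: (None, None, None),
}

# Keep only respondents with no skipped questions (the Green rows).
complete = {rid: row for rid, row in responses.items()
            if all(v is not None for v in row)}

# Among those, respondents who said TRUE to at least one question (the Red text).
said_true_complete = [rid for rid, row in complete.items()
                      if any(v is True for v in row)]

percentage_complete = len(said_true_complete) / len(complete)
print(len(said_true_complete), len(complete), f"{percentage_complete:.0%}")  # prints: 3 4 75%
```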

And here is where the results can be surprising! When we changed which rows count by restricting our analysis (in this case to "complete cases"), we got fewer people who said "TRUE" (3 is less than 4), and that's intuitive.

But we also restricted the overall number of people, and proportionally by even more: in this case from 6 down to 4. Because this number is the denominator in our percentage, the result can go up, which is the surprising part. In our example we restricted the number of rows, but the percentage went up from 67% to 75% because 3/4 is larger than 4/6.

Now, coming back to "Population Estimates". Crunch calculates these by multiplying the percentages we calculated above by the total population size for the dataset (e.g. the population size of the US or UK). For our example, let's pretend the US population is 250M. In the first table we calculated a percentage of 67%, and 67% multiplied by 250M is 167.5M people. In the second table (where we restricted the analysis) we calculated a percentage of 75%, and 75% multiplied by 250M is 187.5M people. More people! So restricting the analysis can raise the population estimate even though fewer respondents are counted.
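
The arithmetic can be checked with a short Python sketch. It only reproduces the multiplication described above, using the pretend population of 250M and the rounded percentages from the text. The variable names are illustrative and this is not Crunch's actual implementation.

```python
# Pretend US population from the example (not a real figure).
population = 250_000_000

pct_all = 0.67       # 4/6, rounded to 67% as in the text
pct_complete = 0.75  # 3/4, the "complete cases" percentage

# Population estimate = percentage * population size.
estimate_all = pct_all * population
estimate_complete = pct_complete * population

print(f"{estimate_all / 1e6:.1f}M")       # prints: 167.5M
print(f"{estimate_complete / 1e6:.1f}M")  # prints: 187.5M
```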