Sampling is the basis for all survey research and, correspondingly, for a lot of research in the social and biomedical sciences. However, how to sample is not always considered as carefully as the research question requires. In this post, I illustrate the problem.
A suggested sampling frame
Let’s say we want to investigate some quantity X in the entire population. We want a representative sample for a number of reasons (hint: CLT), and therefore we adopt the sampling frame suggested below:
- Every household has one and only one landline telephone.
- All landline telephone numbers are listed in the phone registry.
- Pick a number at random from the phone registry.
- Dial that number and, if the phone is picked up, ask to talk to the person in the household whose birthday is up next.
- Repeat until desired sample size is achieved.
Sounds pretty random, doesn’t it – at least under the assumptions mentioned? Take a moment to think about it.
The point is, of course, that this frame isn’t random in the sense we need: it does not give every person the same chance of being selected. You’re sampling every person with a probability inversely proportional to the number of persons in their household. But how bad can that be?
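Before answering, it may help to write the selection probability out. This is only a sketch: it assumes that every call is answered and that the next-birthday rule picks each household member with equal chance (both simplifications are mine), and the symbols $H$ (number of households), $N$ (number of persons) and $m_i$ (size of person $i$'s household) are my own notation.

```latex
P(\text{person } i \text{ is selected in one call})
  = \underbrace{\frac{1}{H}}_{\text{household is drawn}}
    \cdot
    \underbrace{\frac{1}{m_i}}_{\text{person is picked within household}}
  = \frac{1}{H\, m_i},
\qquad
\text{whereas a simple random sample of persons gives } \frac{1}{N} \text{ for everyone.}
```

Under these assumptions, someone living alone is four times as likely to end up in the sample as someone living in a four-person household.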
Let’s say that we’re hired by a municipality interested in knowing to what degree people use the public pools. In a municipality of 120,000 people, 57% of households contain one person, 29% two persons and 14% four persons.
Persons living in single-person households tend to be students, young adults and the elderly. They don’t frequent the pools that much: 50% of them don’t visit the pool at all, while the other half visit the pools only once a year.
Persons in two-person households are young couples and couples whose children have moved out. While the first category doesn’t really go to the pool, the second tends to be heavy users. 25% never go to the pool, 25% go once a year, 25% go five times a year and 25% go eight times a year.
Persons in the four-person household category are families; they use the public pools extensively. 25% of them go there five times a year, 25% go six times a year and 50% go eight times a year.
Across all persons in the municipality, this gives us a mean of 3.58 visits per year with a standard deviation of 3.25.
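The person-level figures can be checked with a few lines of Python. This is a minimal sketch that treats the stated shares as exact; the variable and function names are mine. The key step is converting household shares into person shares, since a four-person household contributes four persons to the population.

```python
from math import sqrt

# Share of households by household size (from the text)
household_share = {1: 0.57, 2: 0.29, 4: 0.14}

# Distribution of yearly pool visits per person, by household size (from the text)
visits = {
    1: {0: 0.50, 1: 0.50},
    2: {0: 0.25, 1: 0.25, 5: 0.25, 8: 0.25},
    4: {5: 0.25, 6: 0.25, 8: 0.50},
}

# Convert household shares into person shares: weight each household by its size
persons_per_household = sum(size * share for size, share in household_share.items())
person_share = {
    size: size * share / persons_per_household
    for size, share in household_share.items()
}


def mean_and_sd(size_weights):
    """Mean and standard deviation of visits when household sizes occur with size_weights."""
    mean = sum(
        w * p * v
        for size, w in size_weights.items()
        for v, p in visits[size].items()
    )
    second_moment = sum(
        w * p * v ** 2
        for size, w in size_weights.items()
        for v, p in visits[size].items()
    )
    return mean, sqrt(second_moment - mean ** 2)


pop_mean, pop_sd = mean_and_sd(person_share)
print(f"population mean = {pop_mean:.2f}, sd = {pop_sd:.2f}")
```

With exact shares the mean comes out at about 3.56 and the standard deviation at about 3.25. The small gap to the 3.58 quoted above is the size one would expect if that figure comes from a simulated finite population, which is my guess rather than something the numbers themselves confirm.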
Let’s draw a simple random sample of these people (n = 1000) and look at the results. We find a mean of 3.53 and a 95% CI of [3.33;3.73] – not bad.
However, if we sample according to the scheme above, we find a mean of 2.24 and a 95% CI of [2.06;2.42]. That is rather far from the true mean, which isn’t even contained within the CI.
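The comparison between the two schemes can be reproduced along the following lines. This is a sketch rather than the code behind the numbers above: it treats the population as effectively infinite (sampling with replacement), assumes every call is answered, uses a normal-approximation CI, and the seed and all function names are arbitrary choices of mine.

```python
import random
from math import sqrt

# Share of households by household size (from the text)
household_share = {1: 0.57, 2: 0.29, 4: 0.14}

# Distribution of yearly pool visits per person, by household size (from the text)
visits = {
    1: {0: 0.50, 1: 0.50},
    2: {0: 0.25, 1: 0.25, 5: 0.25, 8: 0.25},
    4: {5: 0.25, 6: 0.25, 8: 0.50},
}


def draw_sample(size_weights, n, rng):
    """Draw n persons whose household sizes occur proportionally to size_weights."""
    sizes = rng.choices(list(size_weights), weights=list(size_weights.values()), k=n)
    sample = []
    for size in sizes:
        dist = visits[size]
        sample.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return sample


def mean_and_ci(xs):
    """Sample mean and normal-approximation 95% confidence interval."""
    n = len(xs)
    mean = sum(xs) / n
    sd = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    half_width = 1.96 * sd / sqrt(n)
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(1)  # arbitrary seed, only for reproducibility

# Simple random sample of persons: household sizes appear in proportion to persons
person_weights = {size: size * share for size, share in household_share.items()}
srs_mean, srs_ci = mean_and_ci(draw_sample(person_weights, 1000, rng))

# Phone scheme: household sizes appear in proportion to households
phone_mean, phone_ci = mean_and_ci(draw_sample(household_share, 1000, rng))

print(f"simple random sample: mean = {srs_mean:.2f}, 95% CI = [{srs_ci[0]:.2f}; {srs_ci[1]:.2f}]")
print(f"phone scheme:         mean = {phone_mean:.2f}, 95% CI = [{phone_ci[0]:.2f}; {phone_ci[1]:.2f}]")
```

The exact output depends on the seed, but the gap is not a fluke of one draw. In expectation the phone scheme yields 0.57 · 0.5 + 0.29 · 3.5 + 0.14 · 6.75 = 2.245, which matches the 2.24 reported above: the scheme estimates the average over households, not over persons.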
In sum: sampling frames matter. And they matter more than you think.
Note: I’m making all these numbers up.