The Not-So-Random Sampling

After a very stimulating #s4pm chat (transcript here), I remembered a story that my first-year Epidemiology professor told the class when talking about “random” sampling, one that makes it clear to the layperson that random sampling isn’t really all that random. He works for a Naval unit here in San Diego, which is the source of his story. It went like this:

Imagine the line of cars waiting to enter a military base. At the gate, the guard has the duty to select cars “at random” and fully check them for illegal materials, weapons or explosives. Let’s say the guard searches every 4th car. That means the 4th car, 8th car, 12th car and so forth will be checked. While this may seem random to most drivers coming onto the base, there is a logic behind it: it’s every 4th car. The big issue here isn’t just the obvious lack of “randomness,” but the potential bias this method could introduce. If I as a driver knew the “every 4th car” system, and I noticed that I was one of the 4th cars, I could simply pull out of line and let the guy behind me go in front of me, thereby becoming the 5th car in line. I just bypassed the guard’s system, potentially driving onto the base without getting checked.
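To make the difference concrete, here is a minimal Python sketch comparing the guard’s fixed rule with a check that is decided independently for each car. The function names, the 25% rate and the 12-car line are my own assumptions for illustration, not anything the base actually uses.

```python
import random

def systematic_checks(num_cars, interval=4):
    """1-based positions searched when every `interval`-th car is stopped."""
    return [pos for pos in range(1, num_cars + 1) if pos % interval == 0]

def random_checks(num_cars, rate=0.25, rng=None):
    """1-based positions searched when each car is stopped independently
    with probability `rate` (hypothetical alternative to the fixed rule)."""
    rng = rng or random.Random()
    return [pos for pos in range(1, num_cars + 1) if rng.random() < rate]

# The fixed rule always gives the same positions, so a driver can plan around it.
print(systematic_checks(12))  # [4, 8, 12]

# The independent rule changes from run to run, so swapping places gains nothing.
print(random_checks(12))
```

With the second rule, the driver in 4th place has the same chance of being searched as the driver in 5th place, so there is no position worth trading for.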

Perhaps that example was too straightforward. Let’s examine a more complex situation of non-random randomness: let’s say a study wants to determine the average number of voters per household in the State of California. Random Digit Dialing (RDD) might be a way to do this (probably much easier than going door-to-door). Since California has a population of over 37 million, it would be virtually impossible to query every single household. And so, RDD it is! Here’s a strategy:

  1. Within California, we take a list of all area codes (although with 37 million people, it might take a while).
  2. Then let’s say we sample two area codes per county, with a probability proportional to the number of telephone numbers in that area code.
  3. Now that we have our two area codes, we sample 100 telephone numbers per area code to contact.
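The three steps above can be sketched in Python as follows. This is only an illustration under loud assumptions: the area codes and their telephone counts are made-up placeholders, every 7-digit local number is treated as dialable (real numbering rules are stricter), and drawing two distinct codes one after another only approximates “probability proportional to size.”

```python
import random

def sample_area_codes(code_counts, k=2, rng=random):
    """Step 2: draw k distinct area codes, weighting each draw by the
    number of telephone numbers in the code (sequential weighted draws)."""
    remaining = dict(code_counts)
    chosen = []
    for _ in range(min(k, len(remaining))):
        codes = list(remaining)
        weights = [remaining[c] for c in codes]
        pick = rng.choices(codes, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # no area code is picked twice
    return chosen

def sample_numbers(area_code, n=100, rng=random):
    """Step 3: draw n distinct 7-digit local numbers in one area code.
    Assumes every 7-digit combination is valid, which is a simplification."""
    local_numbers = rng.sample(range(10_000_000), n)
    return [f"{area_code}-{num:07d}" for num in local_numbers]

# Step 1: hypothetical list of area codes for one county, with made-up counts.
county_codes = {"111": 900_000, "222": 700_000, "333": 400_000}

for code in sample_area_codes(county_codes, k=2):
    numbers = sample_numbers(code, n=100)
    print(code, len(numbers), numbers[:3])
```

Every draw here really is random, but the rules around the draws (two codes per county, 100 numbers per code) decide in advance which households even have a chance of being called.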

While this endeavor is completely overwhelming, its real problem isn’t complexity; it’s the lack of randomness. We call it Random Digit Dialing, but we took steps to weed out all the other phone numbers until we were left with just 100 per sampled area code, and only 2 area codes per county. We set rules from the beginning of the study, potentially introducing bias that will skew our results. Even more important, our sample may not be representative of the population of California: a household with no phone can never be picked, and a household with several numbers is more likely to be picked than a household with one.

So there you have it…there is NO SUCH THING as random! We have tools set up to help us be as random as possible, but can we ever be truly random?
