Wednesday, July 25, 2007

PL 20/07: NTC – The idea of sampling

Filed under: statistics — plinius @ 10:30 am

I want to measure what is happening at my library. I want to know, say, about:

  1. the circulation of stock: which items are in heavy and which are in low demand
  2. the number of people who visit the library
  3. the ways they actually use it: study, recreation, IT, meeting friends, etc.

For the sake of the argument, I assume that the library works on a manual basis: no automated catalogue and no electronic counter at the entrance. But the idea of sampling is the same in the manual and the automated case.

It is clear that I cannot monitor and write down what is going on continuously. That would more or less double the total workload.

Statisticians – masters of the universe

Data collection is work – hard, disciplined work – and should be kept to a minimum. Statistical sampling is a technique for collecting data that minimizes the work, while providing the answers we need. When we sample, we take a selection from a larger total – using a particular (and strictly enforced) technique – and treat the sample as if it were the total.

The total is often called the universe or the population.

Perfect accuracy is seldom needed. It does not matter whether we had 13,415 or 13,671 visitors in 2006. But the difference between thirteen and fifteen thousand visitors matters. As a rule of thumb, I would say:

  • do not bother about a difference of 1-2 percent
  • a difference of 3-4 percent is small, but may be meaningful
  • a difference of 5-15 percent is real and interesting
  • anything more is very interesting
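
The rule of thumb can be written down as a small Python sketch. The function names are made up for illustration, and the exact cut-off points (3, 5 and 15 percent) are an assumption, since the list leaves the gaps between 2 and 3 and between 4 and 5 percent open.

```python
def percent_difference(old, new):
    """Size of the change from old to new, as a percentage of old."""
    return abs(new - old) / old * 100


def interpret(percent):
    """Map a percentage difference onto the rule-of-thumb categories.

    The cut-offs 3, 5 and 15 are assumptions filling the gaps in the list.
    """
    if percent < 3:
        return "do not bother"
    if percent < 5:
        return "small, but may be meaningful"
    if percent <= 15:
        return "real and interesting"
    return "very interesting"


# The two comparisons mentioned in the text.
for old, new in [(13_415, 13_671), (13_000, 15_000)]:
    diff = percent_difference(old, new)
    print(f"{old} -> {new}: {diff:.1f} percent, {interpret(diff)}")
```

With these cut-offs, the first pair differs by about 1.9 percent and falls in the "do not bother" group, while thirteen versus fifteen thousand differs by about 15.4 percent.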

We can usually get information that is good enough for practical decision-making from a sample of a few hundred items. The greater the sample, the greater the accuracy.

A very basic, and also very surprising, statistical rule is: The size of the original population hardly matters, as long as the sample is only a small part of it. Accuracy depends almost entirely on the size of the sample.
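
One way to see this rule at work is the standard textbook formula for the margin of error of a percentage estimated from a simple random sample. The Python sketch below is only an illustration: the function name is invented, the 95 percent level (z = 1.96) and the worst-case proportion of 50 percent are assumptions, and the formula is not taken from the course material.

```python
import math


def margin_of_error(sample_size, population_size=None, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    proportion=0.5 is the worst case; z=1.96 corresponds to a 95 percent level.
    If population_size is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None:
        correction = (population_size - sample_size) / (population_size - 1)
        standard_error *= math.sqrt(correction)
    return z * standard_error


# Same sample of 200, very different population sizes.
for population in (10_000, 100_000, 1_000_000):
    moe = 100 * margin_of_error(200, population)
    print(f"population {population:>9}: +/- {moe:.1f} percentage points")

# Same (very large) population, different sample sizes.
for sample in (100, 200, 400, 800):
    moe = 100 * margin_of_error(sample)
    print(f"sample {sample:>4}: +/- {moe:.1f} percentage points")
```

The first loop gives practically the same margin for all three populations, while the second shows the margin shrinking as the sample grows.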

First example: selecting books

Let me apply the idea of sampling to the book collection.

My library has, say, ten thousand books. I want to know how up-to-date my collection is, by looking at the year of publication.

Checking ten thousand cards and writing down ten thousand numbers does not appeal to me. I appeal to statistics and take a sample of – say – two hundred cards instead. This could, for obvious reasons, be called a two percent sample.

The big idea in statistical sampling lies in the way you go about selecting the sample from the total.

You should not pull 200 consecutive cards from the nearest drawer. Nor should you rummage around, taking one here and one there as the mood takes you. The sample should

  • come from the whole population
  • not depend on human choice

There are many ways of achieving this. The simplest is probably to take (look at) every fiftieth card and write down the year of publication.

The distribution of these two hundred numbers will provide a good approximation to the true distribution based on all ten thousand publication years.
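
The every-fiftieth-card procedure can be sketched in Python as follows. The catalogue here is invented random data, the function name is hypothetical, and choosing the first card at random among the first fifty is an added assumption (a common refinement, not something the procedure above requires).

```python
import random


def systematic_sample(items, sample_size, rng=random):
    """Take every k-th item, where k = len(items) // sample_size.

    The starting position is picked at random among the first k items
    (an assumption; the text simply says "every fiftieth card").
    """
    step = len(items) // sample_size
    start = rng.randrange(step)
    return items[start::step][:sample_size]


# Hypothetical catalogue: one publication year per card, 10,000 cards.
rng = random.Random(2007)
catalogue_years = [rng.randint(1950, 2006) for _ in range(10_000)]

sample = systematic_sample(catalogue_years, 200, rng)


def share_since_2000(years):
    """Percentage of the given publication years that are 2000 or later."""
    return 100 * sum(1 for year in years if year >= 2000) / len(years)


print(f"cards in sample: {len(sample)}")
print(f"published 2000 or later, sample:    {share_since_2000(sample):.1f} percent")
print(f"published 2000 or later, all cards: {share_since_2000(catalogue_years):.1f} percent")
```

Comparing the two percentages gives a feeling for how close a two percent sample comes to the full count.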

Second example: selecting days

My library is open – say – six days a week. We open at 9 am, take a break from noon till 2 pm, and open again from 2 till 6 pm. On Saturday, there is no afternoon session. The library is also closed for a total of four weeks during holidays.

This means that the library is open 6 * (52 – 4) = 288 days a year.

It is open for three hours on 48 Saturdays – which gives a total of 144 “Saturday hours”.
It is open for seven hours on 240 weekdays – giving a total of 1680 “weekday hours”.

The total number of hours is 1680 + 144 = 1824 hours per year.

I want to know the number of visitors we have in a year. I know, from experience, that library use tends to vary systematically during the day, during the week and during the year.

If I want to know the true number of visitors,

I cannot take my “best hour” – and multiply by 1824.
I cannot take my best day – and multiply by 288.
I cannot take my best week – and multiply by 48.
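
The opening-time arithmetic for this example is easy to check with a few lines of Python. The variable names are made up for illustration; the figures are the ones given in the description of the opening hours.

```python
WEEKS_OPEN = 52 - 4        # closed four weeks a year
DAYS_PER_WEEK = 6          # Monday to Saturday
WEEKDAY_HOURS = 3 + 4      # 9 am to noon, then 2 pm to 6 pm
SATURDAY_HOURS = 3         # morning session only

open_days = DAYS_PER_WEEK * WEEKS_OPEN
saturday_hours = SATURDAY_HOURS * WEEKS_OPEN
weekday_hours = WEEKDAY_HOURS * (DAYS_PER_WEEK - 1) * WEEKS_OPEN
total_hours = saturday_hours + weekday_hours

print("open days per year: ", open_days)
print("Saturday hours:     ", saturday_hours)
print("weekday hours:      ", weekday_hours)
print("total opening hours:", total_hours)
```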

I have to choose my sample from the “whole population” and in a proper “mechanical” way.

There are, as before, many ways of achieving this. The easiest is probably to select a small number of “counting days” throughout the year. On these days all visitors are counted.

You may, for instance, start with the first Monday in January and continue with the first Tuesday in February, the first Wednesday in March and so on.

This approach will give you 12 days, or two full six-day weeks, covering the whole year. Since the library is open 48 weeks a year, you find the total number of visitors by multiplying the observed number by 24.
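
The selection of counting days, and the scaling up to a full year, can be sketched in Python. The function names and the twelve visitor counts are invented for illustration, and the sketch assumes the library is actually open on each selected date (it ignores public holidays and the four closed weeks).

```python
from datetime import date


def counting_days(year):
    """First Monday in January, first Tuesday in February, and so on.

    The weekday cycles Monday to Saturday twice over the twelve months.
    """
    days = []
    for month in range(1, 13):
        target = (month - 1) % 6                  # 0 = Monday ... 5 = Saturday
        first_of_month = date(year, month, 1)
        offset = (target - first_of_month.weekday()) % 7
        days.append(date(year, month, 1 + offset))
    return days


def estimate_annual_visitors(daily_counts, open_days=288):
    """Scale the counted days up to all open days (288 / 12 = 24)."""
    return round(sum(daily_counts) * open_days / len(daily_counts))


for day in counting_days(2007):
    print(day.strftime("%A %d %B %Y"))

# Hypothetical visitor counts, one per counting day.
daily_counts = [41, 55, 48, 60, 52, 30, 38, 44, 57, 63, 50, 28]
print("estimated visitors per year:", estimate_annual_visitors(daily_counts))
```

Because the twelve days contain each day of the week exactly twice, the short Saturdays are represented in the right proportion, and multiplying by 24 is the same as multiplying by 288 / 12.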

Third example: selecting users

Concepts are important. If we want to study users, we must first decide the limits of the population.

For instance, does user mean:

  1. The people that visit the library on a regular basis?
  2. The people that have visited the library at least once during the last year?
  3. The people that have visited the library at least once during the last five years?
  4. The people that are registered as users?
  5. The people that are registered as users and have borrowed materials during the last year?
  6. The people that are registered as users and have borrowed materials during the last five years?

And so on.

In this example I define user as a person that is registered as a user. My population consists of a set of registration cards.

I want to check the social impact of the library by looking at the geographical distribution of the users. Where do these people live? Are there places we have “missed” – and where we might do some extra marketing?

Let us say we have 6,000 registered users. I choose 200 at random. The selection procedure follows the first example. Since 6,000/200 = 30, I may simply take every thirtieth card, write down the addresses and plot them on a local map.
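
Before plotting, the sampled addresses can be tallied by area. The Python sketch below assumes each registration card has been reduced to a district name; the districts and the cards are invented, and the random starting card is an added assumption.

```python
import random
from collections import Counter

# Hypothetical register: one district name per registration card.
rng = random.Random(1)
districts = ["Centre", "North", "South", "East", "West"]
cards = [rng.choice(districts) for _ in range(6_000)]

step = len(cards) // 200          # 6,000 / 200 = 30
start = rng.randrange(step)       # random first card (an assumption)
sample = cards[start::step]       # every thirtieth card

# Count how many sampled users live in each district.
tally = Counter(sample)
for district in districts:
    share = 100 * tally[district] / len(sample)
    print(f"{district:<7} {tally[district]:>3} users  ({share:.0f} percent of sample)")
```

A district with a much smaller share of the sample than of the local population would be a candidate for the extra marketing mentioned above.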



Complete teaching materials for Numbers that count – as a single file (Google Docs). – 18 pp.



