- the circulation of stock: which items are in heavy and which are in low demand
- the number of people who visit the library
- the ways they actually use it: study, recreation, IT, meeting friends, etc.
For the sake of the argument, I assume that the library works on a manual basis: no automated catalogue and no electronic counter at the entrance. But the idea of sampling is the same in the manual and the automated case.
It is clear that I cannot monitor and write down what is going on on a continuous basis. That would more or less double the total workload.
Statisticians – masters of the universe
Data collection is work – hard, disciplined work – and should be kept to a minimum. Statistical sampling is a technique for collecting data that minimizes the work, while providing the answers we need. When we sample, we take a selection from a larger total – using a particular (and strictly enforced) technique – and treat the sample as if it were the total.
The total is often called the universe or the population.
Perfect accuracy is seldom needed. It does not matter whether we had 13,415 or 13,671 visitors in 2006. But the difference between thirteen and fifteen thousand visitors matters. As a rule of thumb, I would say:
- do not bother about a difference of 1-2 percent
- a difference of 3-4 percent is small, but may be meaningful
- a difference of 5-15 percent is real and interesting
- anything more is very interesting
We can usually get information that is good enough for practical decision-making from a sample of a few hundred items. The greater the sample, the greater the accuracy.
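A very basic, and also very surprising, statistical rule is: The size of the original population does not matter. Accuracy depends only on the size of the sample. (Strictly speaking, this holds as long as the sample is only a small part of the population, which is the normal case.)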
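The link between sample size and accuracy can be made concrete with the standard margin-of-error formula for a share (a percentage) estimated from a simple random sample. The sketch below is an illustration only: the function name `margin_of_error`, the 95 percent confidence level (z = 1.96) and the worst-case share of 50 percent are assumptions chosen for the example.

```python
import math


def margin_of_error(sample_size, population_size=None, share=0.5, z=1.96):
    """Approximate margin of error for a share estimated from a simple random sample.

    share=0.5 is the worst case; z=1.96 corresponds to roughly 95 percent confidence.
    Both defaults are assumptions made for this illustration.
    """
    error = z * math.sqrt(share * (1 - share) / sample_size)
    if population_size is not None:
        # Finite population correction: only noticeable when the sample
        # is a large part of the population.
        error *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return error


# Same sample size, very different population sizes.
for population in (10_000, 100_000, 1_000_000):
    print(population, round(100 * margin_of_error(200, population), 1))

# Same (very large) population, growing sample sizes.
for sample in (100, 200, 400, 800):
    print(sample, round(100 * margin_of_error(sample), 1))
```

With a sample of 200, the margin comes out at about seven percentage points whether the population is ten thousand or a million. Making the sample four times as large cuts the margin in half.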
First example: selecting books
Let me apply the idea of sampling to the book collection.
My library has, say, ten thousand books. I want to know how up-to-date my collection is, by looking at the year of publication.
Checking ten thousand cards and writing down ten thousand numbers does not appeal to me. I appeal to statistics and take a sample of – say – two hundred cards instead. This could, for obvious reasons, be called a two percent sample.
The big idea in statistical sampling lies in the way you go about selecting the sample from the total.
You should not pull 200 consecutive cards from the nearest drawer. Nor should you rummage around, taking one here and one there as the mood takes you. The sample should
- come from the whole population
- not depend on human choice
There are many ways of achieving this. The simplest is probably to take (look at) every fiftieth card and write down the year of publication.
The distribution of these two hundred numbers will provide a good approximation to the true distribution based on all ten thousand publication years.
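For a library whose catalogue is available as a list, the "every fiftieth card" procedure can be sketched in a few lines. This is a minimal sketch under assumptions: the function name `systematic_sample` is made up for the example, the publication years are randomly generated stand-ins for a real catalogue, and the random starting point is a common refinement that the text above does not require.

```python
import random


def systematic_sample(items, sample_size, rng=random):
    """Take every k-th item, where k = len(items) // sample_size.

    The starting position is drawn at random within the first k items,
    so the selection does not depend on human choice.
    """
    step = len(items) // sample_size
    start = rng.randrange(step)
    return items[start::step][:sample_size]


# Hypothetical catalogue: 10,000 invented publication years.
catalogue_rng = random.Random(2006)
years = [catalogue_rng.randint(1950, 2006) for _ in range(10_000)]

sample = systematic_sample(years, 200, random.Random(1))

# Compare the share of recent books in the sample and in the full catalogue.
recent_in_sample = sum(year >= 2000 for year in sample) / len(sample)
recent_in_total = sum(year >= 2000 for year in years) / len(years)
print(f"sample: {recent_in_sample:.1%}  full catalogue: {recent_in_total:.1%}")
```

In a manual library the full comparison is of course not available. The point of the sketch is only that the two shares land close to each other.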
Second example: selecting days
My library is open – say – six days a week. We open at 9 am, take a break from noon till 2 pm, and open again from 2 till 6 pm. On Saturday, there is no afternoon session. The library is also closed for a total of four weeks during holidays.
This means that the library is open 6 * (52 – 4) = 288 days a year.
It is open for three hours on 48 Saturdays – which gives a total of 144 “Saturday hours”.
It is open for seven hours on 240 weekdays – giving a total of 1680 “weekday hours”.
The total number of hours is 1680 + 144 = 1824 hours per year.
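The opening-hours arithmetic is easy to check with a short script. The variable names are chosen for this example; the figures are the ones given above.

```python
# Opening schedule as described in the text.
OPEN_WEEKS = 52 - 4                   # closed four weeks a year
WEEKDAY_HOURS = (12 - 9) + (18 - 14)  # 9 am to noon, then 2 pm to 6 pm
SATURDAY_HOURS = 12 - 9               # morning session only

open_days = 6 * OPEN_WEEKS
weekday_hours = 5 * OPEN_WEEKS * WEEKDAY_HOURS
saturday_hours = 1 * OPEN_WEEKS * SATURDAY_HOURS
total_hours = weekday_hours + saturday_hours

print("open days per year:", open_days)
print("weekday hours:", weekday_hours)
print("Saturday hours:", saturday_hours)
print("total hours per year:", total_hours)
```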
I want to know the number of visitors we have in a year. I know, from experience, that library use tends to vary systematically during the day, during the week and during the year.
If I want to know the true number of visitors:
- I cannot take my “best hour” and multiply by 1824
- I cannot take my best day and multiply by 288
- I cannot take my best week and multiply by 48
I have to choose my sample from the “whole population” and in a proper “mechanical” way.
There are, as before, many ways of achieving this. The easiest is probably to select a small number of “counting days” throughout the year. On these days all visitors are counted.
You may, for instance, start with the first Monday in January – and continue with the first Tuesday in February, the first Wednesday in March and so on.
This approach will give you 12 days, or two full weeks, covering the whole year. Since the library stays open 48 weeks a year, you find the total number of visitors by multiplying the observed number by 24.
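The rotating schedule of counting days, and the scaling up to a full year, can be sketched as follows. The function names and the daily counts are hypothetical and only serve the illustration. The sketch also ignores the possibility that a counting day falls in a holiday closure, in which case another date would have to be picked.

```python
from datetime import date, timedelta


def first_weekday_of_month(year, month, weekday):
    """Return the first given weekday (0 = Monday ... 6 = Sunday) of a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)


def counting_days(year):
    """First Monday in January, first Tuesday in February, and so on.

    The weekday cycles through Monday..Saturday twice in twelve months.
    """
    return [first_weekday_of_month(year, month, (month - 1) % 6)
            for month in range(1, 13)]


def estimate_annual_visitors(daily_counts, open_weeks=48, open_days_per_week=6):
    """Scale the counted days up to a full year of open weeks."""
    weeks_observed = len(daily_counts) / open_days_per_week
    return sum(daily_counts) * open_weeks / weeks_observed


for day in counting_days(2007):
    print(day.strftime("%A %d %B %Y"))

# Invented visitor counts for the twelve counting days.
counts = [52, 47, 61, 40, 58, 33, 29, 35, 49, 55, 60, 38]
print("estimated visitors per year:", round(estimate_annual_visitors(counts)))
```

With twelve counting days the scaling factor works out to 48 / 2 = 24, as in the text.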
Third example: selecting users
Concepts are important. If we want to study users, we must first decide the limits of the population.
For instance, does user mean:
- The people that visit the library on a regular basis?
- The people that have visited the library at least once during the last year?
- The people that have visited the library at least once during the last five years?
- The people that are registered as users?
- The people that are registered as users and have borrowed materials during the last year?
- The people that are registered as users and have borrowed materials during the last five years?
And so on.
In this example I define user as a person that is registered as a user. My population consists of a set of registration cards.
I want to check the social impact of the library by looking at the geographical distribution of the users. Where do these people live? Are there places we have “missed” – and where we might do some extra marketing?
Let us say we have 6,000 registered users. I choose 200 at random. The selection procedure follows the first example. Since 6,000/200 = 30, I may simply take every thirtieth card, write down the addresses and plot them on a local map.
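If the addresses are grouped by district instead of plotted on a map, the tally can be sketched like this. The register, the district names and the numbers in it are all invented for the example; a real register would be read from the cards.

```python
import random
from collections import Counter

# Hypothetical register of 6,000 users, each represented by a district name.
register = ["North"] * 3000 + ["Centre"] * 2400 + ["South"] * 600

# Take every thirtieth card, starting at a random position among the first 30.
step = len(register) // 200
start = random.randrange(step)
sample = register[start::step]

# Count the sampled users per district and express the counts as shares.
tally = Counter(sample)
for district, count in tally.most_common():
    print(f"{district}: {count} of {len(sample)} ({count / len(sample):.0%})")
```

A district whose share in the sample is clearly below its share of the local population is a candidate for the extra marketing mentioned above.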
- PL 24/07: NTC – Listen to your public
- PL 23/07: NTC – What happens inside libraries?
- PL 22/07: NTC – Visitors and users
- PL 21/07: NTC – How to measure lending
- PL 19/07: NTC – Program
Complete teaching materials for Numbers that count – as a single file (Google Docs). – 18 pp.