Answer:

The sample standard deviation is the deviation using the sample mean, that is, the average of the samples that you have taken. The population standard deviation is the deviation using the population mean, which is the average over the whole population. In practice, you usually don't know and can't even find the true population mean, and so you use the sample mean and the sample standard deviation as estimators for the "true," i.e., population standard deviation. I will try to make this all clear (and will supply you with explicit formulae) with a simple example.

I will demonstrate in two ways why we end up with a formula that contains \((N-1)\) instead of \(N\) in the denominator. The first way is very sloppy and non-rigorous, but (I hope) rather intuitive. The second method, based on Bayesian marginalization, is more rigorous but may be harder to follow.

Either way, we will see that the change in the denominator from \(N\) to \((N-1)\) is an artifact of our ignorance: because we don't know the true average of the population, we need to average over all possible averages and the result of this fuzziness in our knowledge transforms \(N\) into \((N-1)\).

As a matter of notation and terminology, the variance of a sample is:

\[ \sigma^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n -\mu)^2, \]

where \(\mu\) is the true average over the whole population and \(N\) is the number of data points.

The positive square root \( +\sqrt{ \sigma^2} = \sigma\) of the variance is the standard deviation. So if you know the value of the standard deviation you also know the variance and vice-versa.

In what follows it will sometimes be convenient to use \(\sigma\) in a formula, and sometimes \(\sigma^2\). But I will usually refer only to the "standard deviation." So if you see \(\sigma^2\) then just take the positive square root.
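The definitions above can be seen in a few lines of code. This is a minimal sketch (the population values are made up for illustration) that computes the population mean, the variance, and the standard deviation exactly as defined:

```python
import math
import statistics

# A small, made-up "population" of values (for illustration only).
population = [2.0, 3.5, 1.8, 4.2, 2.9, 3.1]

mu = sum(population) / len(population)  # true population mean
variance = sum((x - mu) ** 2 for x in population) / len(population)
sigma = math.sqrt(variance)  # population standard deviation

# The standard library agrees: pvariance/pstdev divide by N, not N - 1.
assert math.isclose(variance, statistics.pvariance(population))
assert math.isclose(sigma, statistics.pstdev(population))
```

By contrast, `statistics.variance` and `statistics.stdev` divide by \(N-1\); that division by \(N-1\) is exactly what this answer sets out to explain.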

Weighing Cuttlefish

Suppose we are marine biologists and we are given the task of weighing a sample of cuttlefish. We take our boat out onto the ocean, put something that cuttlefish like to eat into our trap, and then wait for the cuttlefish. Suppose ten cuttlefish enter the trap. We lift the trap out of the water, carefully remove each cuttlefish, weigh it, and then throw it back. When we are done we have a set of ten data points: the weights of ten cuttlefish. We write:

\[ (\text{weight of cuttlefish number } 3) := x_3, \ (\text{weight of cuttlefish number } n) := x_n, \ \text{etc.} \]

\[ \{\text{Set of weights of the ten cuttlefish }\} := \{ x_n\}_{1}^{10} \]

The Sample Mean \(\bar{x}\) of these data is simply the average of these ten weights:

\[ \bar{x} = \frac{1}{10} \sum_{n=1}^{10} x_n \]

We also have the Sample Deviation \(S\) defined by:

\[ S := \sum_{n=1}^{10} ( x_n - \bar{x} )^2 \]
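With ten made-up weights standing in for the cuttlefish, \(\bar{x}\) and \(S\) follow directly from these two definitions:

```python
# Ten hypothetical cuttlefish weights in kilograms (made-up numbers).
weights = [2.1, 3.4, 2.8, 3.0, 2.5, 3.7, 2.9, 3.2, 2.6, 3.1]

# Sample mean: the average of the ten weights.
x_bar = sum(weights) / len(weights)

# Sample deviation S: the sum of squared deviations from the sample mean.
# Note that S is a plain sum; there is no 1/N factor in its definition.
S = sum((x - x_bar) ** 2 for x in weights)
```

A useful sanity check: the deviations \((x_n - \bar{x})\) always sum to zero, which is the fact used below to kill the cross term.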

Note that \(\bar{x}\) is not the true mean \(\mu\), i.e., the true average weight of all cuttlefish. In order to find \(\mu\) we would have to weigh every cuttlefish in the ocean. Even if we tried to do this, in the time it would take us to scour the ocean for cuttlefish, weigh them and throw them back, many new generations of cuttlefish would be born and die and so the true population would not be the one we had measured. Finding \(\mu\) exactly, by empirical means, is impossible.

If we did know \(\mu\) we could use this knowledge to find the variance (and hence the standard deviation) of our sample, which would then be:

\[ (\text{variance}) = (\text{standard deviation})^2 = \sigma^2 := \frac{1}{10} \sum_{n=1}^{10} ( x_n - \mu )^2 \]

But of course we don't know \(\mu\), so we ask the question: how can we use what we do know, namely \(\bar{x}\) and \(S\), to approximate \(\sigma\)?

To that end we write:

\[ (x_n - \mu) = (x_n - \bar{x} ) + (\bar{x} - \mu),\]

then

\[ (x_n - \mu)^2 = (x_n - \bar{x} )^2 + 2(x_n - \bar{x} ) (\bar{x} - \mu) + (\bar{x} - \mu)^2 .\]

Now we sum all these terms:

\[ \sum_{n= 1}^{10} (x_n - \mu)^2 = \sum_{n= 1}^{10}(x_n - \bar{x} )^2 + 2 (\bar{x} - \mu) \sum_{n= 1}^{10}(x_n - \bar{x} ) + 10 (\bar{x} - \mu)^2 .\]

The first term on the right is equal to \(S\); the second term can be written as:

\[ 2 (\bar{x} - \mu) \sum_{n= 1}^{10}(x_n - \bar{x} ) = 2 (\bar{x} - \mu) 10 \left(\left( \frac{1}{10} \sum_{n=1}^{10} x_n \right) - \bar{x} \right) = 0 \]

(from the definition of \(\bar{x}\))

Putting this all together we find:

\[ \sum_{n= 1}^{10} (x_n - \mu)^2 = S + 10 (\bar{x} - \mu)^2 \]
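This identity holds for every value of \(\mu\), not just the true population mean, because the cross term vanishes for any \(\mu\). A quick numerical check, using made-up weights:

```python
import math

# Hypothetical sample of ten weights (made-up numbers).
weights = [2.1, 3.4, 2.8, 3.0, 2.5, 3.7, 2.9, 3.2, 2.6, 3.1]
N = len(weights)
x_bar = sum(weights) / N
S = sum((x - x_bar) ** 2 for x in weights)

# The identity sum (x_n - mu)^2 = S + N (x_bar - mu)^2 holds for ANY mu.
for mu in [0.0, 2.5, 3.0, 10.0]:
    lhs = sum((x - mu) ** 2 for x in weights)
    rhs = S + N * (x_bar - mu) ** 2
    assert math.isclose(lhs, rhs)
```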

The term \( ( \bar{x} - \mu)\) can be written as:

\[ ( \bar{x} - \mu) = \frac{1}{10}\sum_{n=1}^{10} x_n - \mu = \frac{1}{10} \sum_{n=1}^{10} (x_n - \mu), \]

so that:

\[ ( \bar{x} - \mu)^2 = \frac{1}{10^2} \left( \sum_{n=1}^{10} (x_n -\mu)^2 + \sum_{n \neq m} (x_n - \mu)(x_m - \mu) \right) \]

The first term on the right is equal to:

\[ \frac{1}{10^2} \left( \sum_{n=1}^{10} (x_n -\mu)^2 \right) = \frac{1}{10} \sigma^2, \]

we will refer to the second term as \(G\):

\[ \sum_{n \neq m} (x_n - \mu)(x_m - \mu) := G .\]

So that the equation:

\[ \sum_{n= 1}^{10} (x_n - \mu)^2 = S + 10 (\bar{x} - \mu)^2, \]

becomes:

\[ \sum_{n= 1}^{10} (x_n - \mu)^2 = S + \sigma^2 + \frac{1}{10} G .\]

Or, in general, for a sample with \(N\) points:

\[ \sum_{n= 1}^{N} (x_n - \mu)^2 = S + N (\bar{x} - \mu)^2, \]

\[ \sum_{n= 1}^{N} (x_n - \mu)^2 = S + \sigma^2 + \frac{1}{N} G .\]

Now note that \(S\) is equal to zero only if all the \(x_n\) are equal. That is, only if all the cuttlefish weigh exactly the same. If any two cuttlefish vary in weight, then \(S\) is automatically greater than zero, and so is \(\sigma^2\).

If we assume that not all our cuttlefish weigh exactly the same, then both \(S\) and \(\sigma^2\) are greater than zero.

\(G\), on the other hand, contains terms that can be either greater than or less than zero. If we consider all possible values of \(\mu\), then "on average," \(G\) will be zero.

Therefore as a first approximation we set \(G =0\) and obtain:

\[ \sum_{n= 1}^{N} (x_n - \mu)^2 \approx S + \sigma^2 ,\]

so that,

\[ \frac{1}{N} \sum_{n= 1}^{N} (x_n - \mu)^2 \approx \frac{1}{N} (S + \sigma^2). \]

The left-hand side of this equation is \(\sigma^2\) and so we find:

\[ \sigma^2 \approx \frac{1}{N} (S + \sigma^2),\]

\[ N \sigma^2 \approx (S + \sigma^2),\]

\[ ( N-1) \sigma^2 \approx S,\]

\[ \sigma^2 \approx \frac{S}{N-1}.\]
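The claim that dividing \(S\) by \(N-1\) rather than \(N\) gives the better approximation to \(\sigma^2\) can be checked by simulation: draw many samples of size \(N\) from a population whose \(\mu\) and \(\sigma\) we fix ourselves, and compare the averages of \(S/N\) and \(S/(N-1)\) against the true \(\sigma^2\). A sketch (all parameter values are arbitrary choices):

```python
import random

random.seed(0)
mu_true, sigma_true = 3.0, 0.5   # population parameters we choose ourselves
N, trials = 10, 20000            # sample size and number of repeated samples

biased_sum = 0.0    # accumulates S / N
unbiased_sum = 0.0  # accumulates S / (N - 1)
for _ in range(trials):
    sample = [random.gauss(mu_true, sigma_true) for _ in range(N)]
    x_bar = sum(sample) / N
    S = sum((x - x_bar) ** 2 for x in sample)
    biased_sum += S / N
    unbiased_sum += S / (N - 1)

biased_avg = biased_sum / trials      # tends toward (N - 1)/N * sigma^2
unbiased_avg = unbiased_sum / trials  # tends toward sigma^2 itself
```

With these settings \(\sigma^2 = 0.25\); the average of \(S/N\) comes out near \(0.225 = \frac{9}{10}\sigma^2\), while the average of \(S/(N-1)\) comes out near \(0.25\).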

From what we know, namely the weights \(\{x_n\}\), the sample mean \(\bar{x}\), and the sample deviation \(S\), we can't find the true standard deviation \(\sigma\) exactly, but we can use \(S/(N-1)\) as an approximation to \(\sigma^2\). We refer to \(S/(N-1)\) as an estimator for \(\sigma^2\), and to its positive square root as an estimator for \(\sigma\).

Bayesian Marginalization

Now for a more rigorous but also more mathematically involved derivation of the \((N-1)\) term, I will give you a brief rundown of the Bayesian approach. For the full story, please consult David MacKay's wonderful book, "Information Theory, Inference, and Learning Algorithms."

In the Bayesian case, we will start with the assumption that the probability distribution is Gaussian. Contrary to popular belief, the reason for this is not that Gaussian distributions actually are ubiquitous in the "real world." In fact, Gaussian distributions have extremely thin tails, so they tend to underestimate the probability of rare events. The frequency of stock market crashes, for example, is notoriously underestimated by Gaussian distributions. The real reason that Gaussian distributions are used as first-approximation "best guesses" in Bayesian inference is that they have maximum entropy for a given mean and variance; that is, they represent our ignorance of the true probability distribution.

So any result we derive from a Gaussian will contain artifacts of our ignorance. This is a difficult and serious topic. For more information please consult E. T. Jaynes' posthumous masterpiece "Probability Theory: The Logic of Science," and MacKay's "Information Theory, Inference, and Learning Algorithms."

Let us return to our example with the cuttlefish. We have weighed \(N\) cuttlefish. We will start with the assumption that the probability distribution for obtaining these \(N\) weights, given \(\mu\) and \(\sigma\), is a Gaussian. We write this as:

\[ (\text{probability for obtaining the weights} \ \{ x_n\} \text{ given}\ \mu \text{ and} \ \sigma ) \]

\[ = P \left( \{x_n\} | \mu, \sigma \right) = \left(2\pi \sigma^2\right)^{-\frac{N}{2} } \exp{ \left( -\frac{\sum_{n=1}^{N}(x_n- \mu)^2}{2\sigma^2} \right) }. \]

We will use this probability to find the probability, \(P \left( \{x_n\} | \sigma \right)\), of obtaining the weights \(\{x_n\}\) given \(\sigma\). That is, we will marginalize (integrate) over all values of \(\mu\). Then we will find the value of \(\sigma\) that maximizes

\[ (\text{probability of obtaining the weights} \ \{ x_n\} \text{ given}\ \ \sigma ) = P \left( \{x_n\} | \sigma \right), \]

and hence find the most probable \(\sigma\).

Now recall the formula:

\[ \sum_{n= 1}^{N} (x_n - \mu)^2 = S + N (\bar{x} - \mu)^2. \]

For the Gaussian probability distribution that we are using for \( P \left( \{x_n\} | \mu, \sigma \right) \), this formula leads to the result:

\[ P \left( \{x_n\} | \mu, \sigma \right) = (2\pi \sigma^2)^{-\frac{N}{2} } \exp { \left( \frac{-S}{2\sigma^2} \right) } \exp{ \left( -\frac{N(\bar{x} - \mu)^2}{2\sigma^2} \right) } \]

We integrate this Gaussian distribution with respect to \(\mu\), assuming a flat (uniform) prior over \(\mu\), to find \( P \left( \{x_n\} | \sigma \right)\) :

\[ (2\pi \sigma^2)^{-\frac{N}{2} } \exp { \left( \frac{-S}{2\sigma^2} \right) } \int_{- \infty}^{\infty} \exp{ \left( -\frac{N(\bar{x} - \mu)^2}{2\sigma^2} \right) } d\mu \]

This is a strange-looking Gaussian, because usually \(\mu\), the population average, would be a constant and \(\bar{x}\) would be an integration variable, but here we are averaging over all possible averages, so \(\mu\) is the integration variable.

(Of course you might argue that no cuttlefish could have negative weight; this is a good point. So really we should think of the \(x_n\) and \(\mu\) as the logarithms of the weight in this case.)

The result of the integration is:

\[ P\left( \{x_n\} | \sigma \right) = \frac{\sigma}{\sqrt{N}} \sqrt{2\pi} \left(2\pi \sigma^{2} \right)^{-\frac{N}{2} } \exp { \left( \frac{-S}{2\sigma^2} \right) } \]

The information associated with a probability is minus the log of that probability:

\[ ( \text{information of} \ P ) := - \ln{P} . \]

Therefore,

\[ \left( \text{ information of} \ P\left( \{x_n\} | \sigma \right) \right) = -\ln{\sigma} + \frac{1}{2}\ln{N} - \frac{1}{2}\ln{2\pi} + N\ln{\sigma} + \frac{N}{2}\ln{2\pi} + \frac{S}{2\sigma^2}. \]

In order to find the "best," or most probable \(\sigma,\) we need to find a critical point of the information with respect to \(\sigma\).

That is, we need to find a point where a slight change in \(\sigma\) won't change the information. To this end, we differentiate the information with respect to \(\sigma\) and then set the result equal to zero.

\[ \frac{\partial(- \ln{P})}{\partial\sigma} = \frac{-1}{\sigma} + \frac{N}{\sigma} - \frac{S}{\sigma^3} = 0 .\]

Therefore:

\[ - 1 + N - \frac{S}{\sigma^2} = 0,\]

and,

\[ \frac{S}{\sigma^2} = N -1 ,\]

so that,

\[ \sigma^2 = \frac{S}{N-1}. \]
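The critical point can also be found numerically: write the information \(-\ln P(\{x_n\}|\sigma)\) as a function of \(\sigma\) and scan for its minimum, which should land at \(\sqrt{S/(N-1)}\). A sketch with made-up weights:

```python
import math

# Hypothetical sample of ten weights (made-up numbers).
weights = [2.1, 3.4, 2.8, 3.0, 2.5, 3.7, 2.9, 3.2, 2.6, 3.1]
N = len(weights)
x_bar = sum(weights) / N
S = sum((x - x_bar) ** 2 for x in weights)

def info(sigma):
    """-ln P({x_n} | sigma), term by term as in the text."""
    return (-math.log(sigma) + 0.5 * math.log(N) - 0.5 * math.log(2 * math.pi)
            + N * math.log(sigma) + (N / 2) * math.log(2 * math.pi)
            + S / (2 * sigma ** 2))

# Scan a grid of sigma values; the minimum of the information
# (= maximum of the probability) should sit at sqrt(S / (N - 1)).
grid = [0.01 * k for k in range(1, 500)]
best = min(grid, key=info)
predicted = math.sqrt(S / (N - 1))
assert abs(best - predicted) < 0.01
```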

So now we have an equality instead of an approximation, but once again, the \((N-1)\) term is an artifact of our ignorance. Here our ignorance is reflected both in the fact that we chose a Gaussian as our assumed initial probability distribution and in the fact that we then had to integrate over all possible values of \(\mu\).

References:

David MacKay, "Information Theory, Inference, and Learning Algorithms."

E. T. Jaynes, "Probability Theory: The Logic of Science."