I think most of the replies, here and on stack exchange, are answering slightly the wrong question.
It is fair to ask why the likelihoods are useful if they are so small, and it's not a good answer to talk about how they could be expressed as logs, or even to talk about the properties of continuous distributions.
I think the answer is:
Yes, individual likelihoods are so small, that yes even a MLE solution is extremely unlikely to be correct.
However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.
Much like how the average is unlikely to be the exact value of a new sample from the distribution, but it's a good way of describing what to expect. (And gets better if you augment it with some measure of dispersion, and so on). (If the distribution is very dispersed, then while the average is less useful as an idea of what to expect, it still minimises prediction error in some loss; but that's a different thing and I think less relevant here).
> It is fair to ask why the likelihoods are useful if they are so small
The way the question demonstrates "smallness" is wrong, however. They quote the product of the likelihoods of 50 randomly sampled values - 9.183016e-65 - as if the smallness of this value is significant or meant anything at all. Forget the issue of continuous sampling from a normal distribution, and just consider the simple discrete case of flipping a coin. The combined probability of any permutation of 50 flips is 0.5 ^ 50, a really small number. That's because the probability is, in fact, really small!
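To see this concretely in R (using the same setup as the question, 50 draws from N(5, 5); the exact numbers depend on the seed, so take these as illustrative):

    # Any specific sequence of 50 fair flips has probability (0.5)^50:
    0.5^50
    # ~ 8.9e-16: tiny, because there are 2^50 equally likely sequences

    # Same story for the 50 continuous draws in the question:
    set.seed(1)
    x <- rnorm(50, mean = 5, sd = 5)
    prod(dnorm(x, mean = 5, sd = 5))             # joint density: on the order of 1e-65
    sum(dnorm(x, mean = 5, sd = 5, log = TRUE))  # the same quantity, on the log scale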
Right - and so the more appropriate thing to do is not look at the raw likelihood of any one particular value but instead look at relative likelihoods to understand what values are more likely than other values.
For the discrete case, it seems that a better thing to do is consider the likelihood of getting that number of heads, rather than the likelihood of getting that exact sequence.
I am not sure how to handle the continuous case, however.
Of course you ignore irrelevant ordering of data points. That's not the issue.
The issue, for discrete or continuous (which are mathematically approximations of each other), is that the value at a point is less important than the integral over a range. That's why standard deviation is useful. The argmax is a convenient average over a weightable range of values. The larger your range, the greater the likelihood that the "truth" is in that range.
If you only need to be correct up to 1% tolerance, the likelihood of a range of values that have $SAMPLING_PRECISION tolerance is not important. Only the argmax is; it gives you the center of the range.
Yes - the most enlightening concept for me was "Highest Probability Density Interval" which basically always is clustered around the mean. But you can choose any interval which contains as much probability mass!
It's a fairly common "mistake" to assume that the MLE is useful as a point estimate and without considering covariance/spread/CI/HPDI/FIM/CRLB/Entropy/MI/KLD or some other measure of precision given the measurement set.
> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.
This may be true for low dimensions but doesn't generalise to high dimensions. Consider a 100-dimensional standard normal distribution for example. The MLE will still be at the origin, but most of the mass will live in a thin shell at a distance of roughly 10 units from the origin.
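A quick simulation sketch of this (my own toy check, nothing from the thread):

    # 10,000 draws from a 100-dimensional standard normal: the density peaks at the
    # origin, but the distance from the origin concentrates near sqrt(100) = 10.
    set.seed(42)
    d <- 100
    X <- matrix(rnorm(10000 * d), ncol = d)
    r <- sqrt(rowSums(X^2))
    summary(r)              # centred around 10, nowhere near 0
    mean(r > 9 & r < 11)    # the bulk of the draws sit in this thin shell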
However, TobyTheCamel's point is valid in that there are some parameter spaces where the MLE is going to be much less useful than others.
Even without having to go to high dimensions, if you've got a posterior that looks like a normal distribution, the MLE is going to tell you a lot, whereas if it's a multimodal distribution with a lot of mass scattered around, knowing the MLE is much less informative.
But this is a complex topic to address in general, so I'm trying to stick to what I see as the intuition behind the original question!
Concentration of mass is density. A shell is not dense.
If I am looking for a needle in a hyperhaystack, it's not important to know that it's more likely to be "somewhere on the huge hyperboundary" than "in the center hypercubic inch".
A lot of why large corporations fail to make products that people enjoy is tied up in this behavior, and in the fact that the mass is not independently distributed along each dimension: you end up with "continents of taste" that your centroid product sucks for equally.
This is similar to how they originally tried to build fighter jet seats for the average pilot, but it failed because it turned out there were no average pilots, so they had to make them adjustable.
And yet your parent comment was right in saying that it won't be true that "a lot of the probability mass - an amount that is not small - will be concentrated" in the center hypercubic inch.
> Yes, individual likelihoods are so small, that yes even a MLE solution is extremely unlikely to be correct.
Can you elaborate? An MLE is never going to come up with the exact parameters that produced the samples, but in the original example, as long as you know it's a normal distribution, MLE is probably going to come up with a mean between 4 and 6 and a SD within a similar range as well (I haven't calculated it, just eyeballing it) -- when the original parameters were 5 and 5.
I guess I don't know what you mean by "correct", but that's as correct as you can get, based on just 50 samples.
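For what it's worth, a quick R check of that eyeballing (assuming the question's setup of 50 draws from N(5, 5); a sketch, and the exact output varies with the seed):

    set.seed(123)
    x <- rnorm(50, mean = 5, sd = 5)
    mean(x)                       # MLE of the mean: typically within a point or two of 5
    sqrt(mean((x - mean(x))^2))   # MLE of the sd (divides by n): typically within about 1 of 5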
Right - I think this is what's at the heart of the original question.
I know they asked with a continuous example, but I don't interpret their question as limited to continuous cases, and I think it's easier to address using a discrete example, as we avoid the issue of each exact parameter having infinitesimal mass which occurs in a continuous setting.
Let's imagine the parameter we're trying to estimate is discrete and has, say, 500 different possible values.
Let's say the parameter can have the value of the integers between 1 and 500 and most of the mass is clustered in the middle between 230 and 270.
Given some data, it would actually be possible that MLE would come up with the exact value, say 250.
But maybe, given the data, a range of values between 240 and 260 is also very plausible, so the exact value 250 still has a fairly low probability.
The original poster is confused, because they are basically saying, well, if the actual probability is so low, why is this MLE stuff useful?
You are pointing out they should really frame things in terms of a range and not a point estimate. You are right; but I think their question is still legitimate, because often in practice we do not give a range, and just give the maximum likelihood estimate of the parameter. (And also, separately, in a discrete parameter setting, a specific parameter value could have substantial mass.)
So why is the MLE useful?
My answer would be, well, that's because for many posterior distributions, a lot of the probability mass will be near the MLE, if not exactly at it - so knowing the MLE is often useful, even if the probability of that exact value of the parameter is low.
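Here's a small sketch of that idea in R. The setup is my own (an integer-valued mean between 1 and 500, a known sd of 30, a flat prior), so treat the specific numbers as illustrative only:

    set.seed(7)
    theta <- 1:500                               # the 500 possible parameter values
    y <- rnorm(30, mean = 250, sd = 30)          # data generated from the "true" value 250
    loglik <- sapply(theta, function(m) sum(dnorm(y, m, 30, log = TRUE)))
    post <- exp(loglik - max(loglik))
    post <- post / sum(post)                     # flat prior, so this is the posterior
    theta[which.max(post)]                       # the MLE, somewhere near 250
    max(post)                                    # probability of that exact value: small
    sum(post[abs(theta - theta[which.max(post)]) <= 10])   # mass within +/- 10 of it: most of it

The exact value at the peak carries little probability on its own, but the neighbourhood around it carries most of it, which is the point above.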
I agree with your points, and that's why it's useful to compare an MLE to an alternative model via a likelihood ratio test, in which case one sees how much better the generative model performs as compared to the wrong model.
Similarly, AIC values do not make a lot of sense on an absolute scale but only relative to each other, as written in [1].
[1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33(2), 261-304.
> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.
This is a Bayesian point of view. The other answers are more frequentist, pointing out that likelihood at a parameter theta is NOT the probability of theta being the true parameter (given data). So we can't and don't interpret it like a probability.
That's not a Bayesian point of view. You can re-word it in terms of a confidence interval / coverage probability. It is true that in frequentist statistics parameters don't have probability distributions, but their estimators very much do. And one of the main properties of a good estimator is formulated in terms of convergence in probability to the true parameter value (consistency).
They are useful because the integral of the likelihoods is not infinitesimal.
The probability that your yard stick measures 1.0000000000000 yards is basically zero, but the probability that it's within one inch of that is close to one.
We generally prefer to use probability density functions with the property that most of the probability density is close to the maximum likelihood.
So, in the yard stick example, the yard stick lengths are probably Gaussian, so if you check enough lengths, you'll get a mean (== the length with the maximum likelihood) that approaches 1.00000000000 yards (you'd hope) with some small standard deviation (probably less than an inch).
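In R, with made-up numbers (say the lengths are normal with mean 36 inches and sd 0.2 inch; the sd is just an assumption for illustration):

    dnorm(36, mean = 36, sd = 0.2)       # density at exactly 36": about 2 per inch, not a probability
    pnorm(37, mean = 36, sd = 0.2) -
      pnorm(35, mean = 36, sd = 0.2)     # probability of being within one inch: essentially 1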
You flip a possibly-biased coin 20 times and get half heads, half tails, e.g. "THHHTTTTTHTHTTHHHTHH".
Under the model where the bias is 0.5—a fair coin—the probability of that sequence is (0.5)^20 or about one in a million. In fact, the probability of any sequence you could observe is one in a million.
Under the model where the bias is 0.4 the probability is (0.4)^10 × (0.6)^10 or about one in 1.6 million.
That is, the sequence we observed supplies about 1.5 times as much evidence in favor of bias = 0.5 as compared with bias = 0.4—this is likelihood.
Likelihood ratios are all that matter.
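The same numbers, computed directly (just a one-line check in R):

    p_fair   <- 0.5^20            # ~ 9.5e-07
    p_biased <- 0.4^10 * 0.6^10   # ~ 6.3e-07
    p_fair / p_biased             # ~ 1.5: only this ratio matters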
Morals:
- The more complex the event you're predicting (the rarer the typical observed result), the smaller the associated likelihoods will tend to be
- It's possible that every observed result has a tiny probability under every model you're considering
- Nonetheless it makes sense to use the ratios of these numbers to compare the models
- This has nothing to do with probability densities or logarithms, though the fact that we often work with densities does make absolute likelihood values depend on the choice of units
Added in edits:
- You could summarize the sequence with the number of heads or tails and then the likelihood values would be larger but the ratios would remain the same (it's a sufficient statistic). Similarly in the CrossValidated question one could summarize the data with the mean and sum of squares. But this doesn't work in general, e.g. if we have i.i.d. draws from a Cauchy distribution.
You have given a nice clean answer that does not make any errors (such as talking about the likelihood as a density in parameters, which of course it is not). Thanks for writing it down.
The only other thing worth adding to what you have written is that the likelihood is a product of N factors.
As such, it will essentially always diverge toward infinity (if the density factors are on average greater than 1) or collapse fast towards zero (if the factors are on average less than 1, as in your example and in OP).
So this very structure (arising from the IID observations) implies that no “stable” density will pop out. It’ll always blow up or down!
One way to stabilize things is to take (1/N) times the log of the likelihood. Then you will indeed converge to something familiar: E log p(x), which is minus the entropy.
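A quick numerical check of that, using standard normal data and the matching model (my own toy example):

    # (1/N) * log-likelihood of N i.i.d. N(0,1) draws approaches E[log p(X)],
    # which for the standard normal is -0.5*log(2*pi) - 0.5, i.e. minus the entropy.
    set.seed(3)
    x <- rnorm(1e5)
    mean(dnorm(x, log = TRUE))    # ~ -1.419
    -0.5 * log(2 * pi) - 0.5      # -1.4189...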
Not quite; the probability of n/2 successes in n trials is given by the Binomial(n, p) distribution, not p^n. p^n is correct for a single sequence, but there are many possible sequences that result in half heads, half tails, and so you have a factor of "n choose x", the so-called binomial coefficient.
> (0.4)^20 × (0.6)^20
and I think you mean (0.4)^10 × (0.6)^10, or more generally p^x * (1-p)^(n-x).
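For concreteness, plugging the numbers into R:

    0.5^20                              # one specific ordering: ~ 9.5e-07
    dbinom(10, size = 20, prob = 0.5)   # "10 heads out of 20, any order": ~ 0.176
    choose(20, 10) * 0.5^20             # same thing via the binomial coefficient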
1 in a million is the probability of correctly predicting a unique sequence of 20 coin flips, in the exact order. (E.g. first 10 flips heads, 2nd 10 flips tails, in that order - 1 in a million)
I'm surprised people are conflating the Binomial distribution with OP's statement. He is talking about one specific outcome of half heads/half tails (where order matters). There is exactly one way to get that outcome.
It is very strange that this is on the front page. The key thing is that the likelihood is the probability density of your data! I.e. if your probability density is a Gaussian N(0, 0.00001), then the likelihoods of data points next to the mean will be very large; if your PDF is N(0, 10000) they'll be very small.
Furthermore, the amount of data matters, since likelihoods are multiplied for each data point: if they were small to begin with, they'll get even smaller, and if they were large, they'll get larger.
I think there is some language misunderstanding going on. Likelihood function is not a probability density. Likelihood function evaluated for D is equal to probability density of D (by definition). In other words, f(x;theta) as a function of x is a probability density function. f(x;theta) as a function of theta is a likelihood function. But f(x;theta) for given x and theta is just a value, which, one can say, is both likelihood and probability density.
Right. But if you make the notation slightly more explicit, then the integral of L(data, params) over data is 1. This follows from the independence assumption.
So we ARE working with a probability function. Its output can be interpreted as probabilities. It's just that we're maximizing L = P(events | params) with respect to params.
The likelihood function is a function of params for a fixed value of data and it is not a probability function.
There is another function - a function of data for fixed params - which is a probability density. That doesn’t change the fact that the likelihood function isn’t.
The independence has nothing to do with the integral being 1, to be honest. You could write a model where the observations are not independent but the (multivariate) integral over their domain will still be 1.
If by “joint probability” you mean function(params, data) there is no joint probability here in general.
L(params, data) is constructed from a family of density functions p(data), one for each possible value of param. The integral of L(params, data) may be anything or diverge. You don't need any extra independence assumption either.
Or maybe you mean “joint probability” as p(data1, data2) when data is composed of two observations, for example. But you don’t need any independence assumption for that probability density to integrate to one! It necessarily does that - whether you can factorize it as p’(data1)p’’(data2) or not.
That's exactly the reason why frequentist approach sucks by the way;) Parameters are treated specially and there is no internal consistency - to have it you need to introduce priors...
A likelihood could also refer to data drawn from a discrete distribution; this wouldn't change much about how it's treated, and it would be a proper probability, not a probability density.
I'm surprised it's here as well. Of all the interesting questions on CV, I would not consider this one of them. I wonder if this was sent through the second-chance pool.
The weirdest “likelihood” conversation I ever had, the putative team lead didn’t want to change priorities to fix a bug because, “how often does that happen?”
My reply was, “it happens to every user, the first time they use the app.” And then something about how frequency has nothing to do with it. Every single user was going to encounter this bug. Even if they only used the app once.
I already had a toolbox full of conversations about how bad we are at statistics, but that one opened up a whole new avenue of things to worry about. One that was reinforced by later articles about the uselessness of p95 stats - particularly where 3% of your users are experiencing 100% outage.
But the one that is more apropos to the linked question, vs HN in general, is how people are bad at calculating the probability that “nothing bad happens” when there are fifty low probability things that can go wrong. Especially as the number of opportunities go up.
And the way that, if we do something risky and nothing bad happens, we estimate down the probability of future calamity instead of counting ourselves lucky and backing away.
I've seen this exact same fallacy happen several times throughout my career, which isn't even very long.
I think in many cases it boils down to some subtype not being identified and evaluated on its own. As in your case, it's especially impactful, and yet IME it's also usually where these kinds of things get improperly prioritized: when it's a user's first impression, or when it occurs in a way that forces the user to just sit and wait on the other end, because these are often "special" cases with different logic in your application code.
OTOH sometimes users try weird/wrong/adversarial shit and so their high failure rate is working as intended. But it pollutes your stats such that it can hide real issues with similar symptoms and skew distributions.
Yeah, and I'd also add that the total # of bugs in an application will always be greater than the total # of 'known' bugs. Tracking down and fixing the oddball bugs usually prevents a larger set of related issues from popping up later.
This is interesting. I wonder if there are principles from other types of engineering (civil, structural, aeronautical, etc.) which provide some additional thoughts around probabilities of failure and how to deal with them
Particularly where you have a lot of low probability bugs
The center of a normal distribution has high likelihood (e.g. 1000000) if the standard deviation is small or low likelihood if the standard deviation is large (e.g. 1/1000000.)
This effect is amplified when you are working with products of likelihoods. They can be infinitesimal or astronomical.
Giant likelihoods really surprised me the first time I experienced them but they’re not uncommon when you work with synthetic test data in high dimensions and/or small scales.
They still integrate to the same magnitude because the higher likelihood values are spread over shorter spans.
Another issue is that likelihoods associated with continuous distributions very often have units. You can’t meaningfully assign “magnitude” to a quantity with units. You can always change your unit system to make the likelihood of, say, a particular height in a population of humans to be arbitrarily large.
Another thing to note is that you're multiplying probabilities together. Since each probability is between 0 and 1, you're always shrinking the likelihood with each new data point. When you're doing this kind of analysis, the question you're asking is "given a model with these parameters, what's the probability I get exactly this sample?" When you phrase it that way, it becomes more apparent why the likelihood is so small.
Multiplying them together certainly magnifies the effect, but it would magnify it the other way if the likelihoods were larger than one. (Easy to get, just tweak the variance of the normal distributions to be smaller). Likelihoods are more like infinitesimal fractions of a probability, that need to be integrated over some set of events to get back a probability. In the case of the joint distribution of 50 Gaussian, you can think of the likelihood having "units" of epsilon^50.
For discrete distributions you indeed cannot, but for continuous distributions all you need is sufficiently small variance. Try for example a Gaussian with variance 1e-12
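For example, in R (variance 1e-12 means sd 1e-6):

    dnorm(0, mean = 0, sd = 1e-6)                    # ~ 398942: a huge density value
    integrate(dnorm, -1e-5, 1e-5, sd = 1e-6)$value   # ...yet it still integrates to ~1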
The value of a continuous probability density distribution at a specific point is pretty meaningless though; You have to talk about the integral between two values and that won't go above one.
Because it works well in practice. And to elaborate, usually when something works well in practice it's because it has multiple desirable properties - the one you "ask for", but also other ones you get for free.
In this case maximum likelihood approximates Bayesian estimation with a mostly reasonable prior. Furthermore, you could look at the convergence properties, which are good.
You could probably design some degenerate probability distribution that ml-estimation behaves really badly for, but those are not common in practice.
The question is misguided as stated. It's like asking why chemists care about density for measuring mass.
If you are looking at the likelihood of any particular outcome of a continuous random variable, then you do not understand how probability works.
The probability of any particular real number arising from a probability distribution on the real numbers is exactly 0. It's not an arbitrarily small epsilon greater than zero, it's actually zero. This definition is in fact required for probability to make sense mathematically.
You might ask questions like why does maximum likelihood work as an optimization criterion, but that's very different from asking why we care about likelihood at all.
The comments on the original question do a good job of cutting through this confusion.
I appreciate your response but I don't really agree. They say that likelihood can be multiplied by any scale factor or that it's only the comparative difference that matters, or we can make a little plot, but they don't actually explain why.
I can try to make an explanation from the Bayesian framework (but as I mentioned it's not the only relevant one)
Likelihood is P(measurement=measurement'|parameter=parameter'). This is a small value. Given a prior we can compute P(parameter=parameter'|measurement=measurement'). This is also small. But when we compute P(parameter'-k < parameter < parameter'+k | measurement=measurement'), all the smallness cancels; see the formulation of Bayes that reads
P(X_i | Y) = P(X_i) P(Y | X_i) / sum_j P(X_j) P(Y | X_j)
I'm obviously skipping a lot of steps here because I'm sketching an explanation rather than giving one.
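To make it concrete, here's a toy numeric version of that formula in R. The setup is mine (a flat prior over three candidate means for 50 normal draws, sd assumed known to be 5), so it's a sketch rather than the OP's code:

    set.seed(2)
    y <- rnorm(50, mean = 5, sd = 5)                      # data
    mu <- c(3, 5, 7)                                      # three candidate parameter values
    lik <- sapply(mu, function(m) prod(dnorm(y, m, 5)))   # each on the order of 1e-65
    lik / sum(lik)                                        # posterior probabilities: ordinary-sized

The tiny magnitudes appear in both the numerator and the denominator, so they cancel; only the ratios survive.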
> The probability of any particular real number arising from a probability distribution on the real numbers is exactly 0. It's not an arbitrarily small epsilon greater than zero, it's actually zero.
Nitpicking somewhat, but e.g. `max(1, uniform(0, 2))` has a very non-zero probability of evaluating to 1.
> You could probably design some degenerate probability distribution that ml-estimation behaves really badly for, but those are not common in practice.
Probability is the probability mass distributed over your data with fixed parameters, and likelihood is mass distributed over your model parameters with fixed data. The absolute most important thing to know about likelihood is that it is not a measure of probability, even though it looks a lot like probability.
If I look at coin flip data, I know the data comes from a coin flip, but any specific count of heads vs tails becomes less and less likely the more flips we do. So likelihood being small tells us nothing on its own.
The value of likelihood comes from the framework you use it in. If I wanted to make a best guess at what the balance of the coin is, then I could find the maximum of the likelihood over all coin balances to get the most representative version of my model. Similarly, I can compare two specific coin biases and determine which is more likely, but that alone can't tell me anything about the probability of the coin being biased.
Well, because any specific outcome from sampling a random distribution is indeed very unlikely.
In practice, that means that if you have an alternate "non-random" or "less random" explanation for the data, you'll be convinced that it's almost surely the correct one after just a few samples (via the obvious Bayesian decision framework, or just "common sense").
For example, imagine that you are rolling a die and 3 always comes up. On every roll, the likelihood of the die being a fair random die (as opposed to a loaded die) is divided by the number of sides, so with a 6-sided die, you'll usually be convinced that it's loaded after it gives the same result just 3-20 times (depending on your prior on it being loaded vs fair and your decision threshold).
Likewise, if only 1 and 2 come up you'll quickly be convinced that it's an unusual die that only has 1 and 2 symbols on the face.
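As a toy calculation (made-up prior odds of 100:1 in favour of a fair die, and a "loaded" model that always shows 3):

    rolls <- 5                    # five 3s in a row
    lik_fair   <- (1/6)^rolls
    lik_loaded <- 1^rolls
    100 * lik_fair / lik_loaded   # posterior odds for "fair": ~0.013

So even a strong prior in favour of the fair die gets overwhelmed after a handful of identical rolls.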
Or another way to look at it is that processes are usually not random at all (rather they are usually deterministic, but the initial state is unknown) so a random distribution is a very bad model, and having any information about the initial state at all will drastically increase the likelihood and thus make the model with information strongly preferred; the likelihood of the random model is so low because that model is very bad, even though it may be the best available.
Working on nlme models for work these days - it does become a bit of a headache when asking "how much better" the model with -2LL=8000 is than the model with -2LL=7995. Obviously one is better, "more likely given the data", but what if the better one used 2 more parameters and is hence more complex and might be overfitting the dataset? Well, then there are all these "heuristics" to look at: AIC, BIC, some sort of trick with a chi^2 distribution function. These are all just ways to penalize the objective function based on the # of parameters, but it's somewhat debatable which one to apply when, and I have read that some parameter estimation softwares don't even compute these values in exactly the same way.
I am not a statistician by training, I just apply "industry standard practices" in as reasonably intuitive a way as I can, but my impression has always been that if you wander far enough into the weeds you'll find that stats often becomes a debate between many approximations with different sorts of tradeoffs, and much of this is smoothed over by the fancy scientific software packages that get used by non-statistics-researchers. One of the most frustrating parts of my job is reproducing SAS output (an extensively used statistics product) using free R language tools, since a SAS license costs more than some sports cars... But what is SAS actually doing? It's never just taking a mean or pooling variance in the standard way you'd read about in an intermediate stats textbook; it's always doing some slight adjustment based on this or that approximation or heuristic.
This tangent may have been unrelated or irrelevant but I’ve long concluded that in practice statistics is far less solved than people might expect if they’ve never had to reproduce any of the numbers given to them by statistical analysis
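To make the chi-squared trick and the AIC penalty concrete for those exact numbers (a rough sketch only; it assumes the models are nested and the usual asymptotics hold):

    delta <- 8000 - 7995                        # improvement in -2 log-likelihood
    pchisq(delta, df = 2, lower.tail = FALSE)   # likelihood ratio test p-value: ~0.08
    delta - 2 * 2                               # improvement in AIC (penalty of 2 per extra parameter): 1

i.e. borderline by either criterion, which is exactly where the "which heuristic do I trust" headache starts.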
For infinite probability spaces, likelihood has no interpretable meaning akin to probability in the case of finite spaces. This can be a source of confusion.
> Thus, it appears to be very unlikely in a certain sense that these numbers came from the very distribution they were generated from.
“It appears to be very unlikely in a certain sense that this comment is written in English. Yes, this sequence of characters is much more likely in English than in French. But you can’t even fathom how unlikely it was to be ever written in English!”
> For example, what amount of effort is appropriate to prevent a one time event which kills you with say 1 in ten thousand times?
if you value being killed at a massively negative value, then 1/10,000 times that value is still a massively negative value, so the answer is "a huge amount of effort"
For some value of "massive", sure. But for any value of massive, it's 1/10,000th that value. Then you factor in the value derived from taking that risk, and there's your choice.
The reality is that I don't expend huge amounts of effort avoiding tail risks, and you don't either. You might for the ones you're explicitly aware of, but a risk is a risk whether or not you know you're taking it.
That's good, assuming the cost of harm is a cost in utility and not in money, otherwise it starts having issues. Make sure you have a good utility function.
For questions like these I sometimes prefer translating everything to thermodynamics.
In this case the question then becomes "Why are we looking at energy if, at ordinary temperatures, systems aren't anywhere near 0 energy?". The overly simplified answer is "Because bowling balls still roll down, even at room temperature."
Which is actually kind of interesting to think about. Why do bowling balls roll down when they have vastly more thermal energy than gravitational potential? To get an answer you have to invoke the second law of thermodynamics. Which is annoyingly a bit hard to really get to understand. In short it implies that energy likes to become more disorganised, so something like gravitational potential which is very organised will eventually devolve into heat which is (by definition) disorganised. So bowling balls roll down.
Yes. Statistics are fragile indicators well beyond the Central Limit Theorem's minimal sufficiency boundaries. They work pretty well when you have tons of data and run tons of repetitions, but for moderately sized data and repetitions you need very high certainty levels for statistics to help much.
You can play perfect blackjack and card count at a table with good rules and lose plenty because your advantage is small (< 2%) and your repetitions are limited.
Statistics get even worse when the probabilities are chained because the weakest estimator bounds the rest.
Essentially, if you always follow statistical advice you should do better than average, if you're lucky. There are better heuristics than statistics in most fields of human decision making.
As OP noticed likelihoods often do show up in a comparative context. In that context one is asking which thing or sequence is most likely to occur by chance relative to another, under an (over simplistic, sure) IID assumption. In practice, the ordering of such things is often (hand-waving, sure) robust enough that, given no other information than the marginals, it is useful. So I think OP almost answered his/her own question: they are often quite useful in a comparative context and with no additional information.
You work with probability density functions because the probability of observing any given value in a continuum is zero. Density functions may be reasonable to work with if they have some nice properties (continuity, unimodality, ...) The question and answers here seem to be from people that don't understand calculus.
I think the correct answer is that it is mostly bogus, but likelihood-based statistical methods mostly work for well-behaved distributions, especially the Gaussian.
Maximum likelihood estimation has some weird cases when the distribution is not "well behaved".
The actual probability is 0, but the probability density is not 0. Same reason why the probability that I pick 0.5 from a uniform distribution from 0 to 1 is 0, but the value of the probability density function of the distribution at 0.5 is 1.
I'll give the mathematical explanation. So if X is a continuous random variable, the probability that X takes on any particular value x is 0, i.e. P(X = x) = 0. However, it still makes sense to talk about P(X < x) --- this is clearly not 0. For example, suppose X is a random variable of the uniform distribution from 0 to 1. P(X = 0.5) = 0, clearly, but P(X < 0.5) = 0.5, clearly. (There's a 50% chance that X takes on a value less than 0.5). We can talk about P(X < x) as a function of x---in the case of the uniform distribution, P(X < x) = x. (There's a 30% chance that X takes on a value less than 0.3, there's a 80% chance that X takes on a value less than 0.8, etc.) This is called the cumulative distribution function---it tells us the cumulative probability (accumulating from -infinity to x). The probability density function is the rate of change---the derivative---of the cumulative distribution function. At a particular x, how "quickly" is the cumulative distribution function increasing at that point? That is the question that the probability density function answers, if that makes sense.
In the case of the cumulative distribution function of the uniform distribution from 0 to 1, since the derivative of x is 1, the probability distribution function is 1 from 0 to 1 and 0 elsewhere. This makes sense; the probability P(X < x) isn't increasing faster at one point than any other---with the exception of x outside of 0 and 1 having a probability density value of 0, since e.g. P(X < 2) is 100% and increasing the value of x=2 does not change this (it's still 100% because X only takes on values within [0,1]) .
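In R this is just (using the built-in uniform functions):

    punif(0.5)   # 0.5 -- the CDF P(X < 0.5)
    dunif(0.5)   # 1   -- the PDF, i.e. the rate of change of the CDF at 0.5
    punif(2)     # 1   -- already certain
    dunif(2)     # 0   -- the CDF has stopped increasing out here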
That's interesting and intuitive for a uniform distribution. What does it then mean on a non-uniform distribution for a value to be very small? Is there some interpretation for that? The Stack Overflow post actually mentions values that are extremely close to zero.
So, just to be sure, even for a uniform distribution, the values can be small. Consider the uniform distribution from 0 to 10^100. The CDF for this distribution is P(X < x) = x/10^100. The derivative of this (the PDF) is p(x) = 1/10^100. At any particular point, p(x) is 1/10^100. But this is true for any x (again, unless it is outside the range [0, 10^100]), which makes sense because the "speed" with which the probability is increasing is constant regardless of the x. Why are these values smaller than for the uniform distribution on [0,1]? It's because the probability increases much more slowly per unit of x on the uniform distribution from [0, 10^100] than it is on the uniform distribution from [0, 1]. P(X < 0) to P(X < 1) for Uniform(0, 10^100) only increases the probability by 1/10^100, while it increases the probability by 1 for Uniform(0, 1).
So PDFs can have small values regardless of whether they are uniform or not. What a small PDF at a point x indicates is that the CDF is increasing very "slowly" at that x. I'll emphasize this point - PDF values are not probabilities. They are rates of change of the CDF.
For some further understanding of the stack overflow post, let's consider Uniform(0, 2). The PDF is p(x) = 1/2. Suppose the author of the stack overflow post drew 50 samples from this distribution. Regardless of what those 50 samples were, the value he would have gotten would have been (1/2)^50 = 1/(2^50), something on the order of 10^-16. Why is this so small?
(I'll give a rather loose and informal explanation here, but I can be more formal if you'd like, if this doesn't make sense.) Think back to Uniform(0, 1) vs. Uniform(0, 10^100). Recall that the probability that a particular x falls in [0, 1] for the former distribution is the same as the probability that a particular x falls in [0, 10^100]---i.e. 1 (100%). In the case of the latter distribution, that 1 has had to be "spread out" across a larger space, which should give some intuition as to why the PDF is low---for a particular unit in space that we "travel", since the probability has been spread out so thinly across the space, the CDF isn't increasing that much, i.e. the PDF isn't that high.
When we're looking at PDF values when we're looking at the space of possibilities covered by 50 samples, it's going to be a lot "larger" than the space covered by 1 sample (over one sample, the space is [0,2], covering 2 units of space. over two samples, the space is the square [0,2] x [0,2], with an area of 4. over 50 samples, the space is the hypercube [0,2]^50, with a 50-dimensional volume of 2^50---a huge space.) But the total probability is still 1, so it's going to be "spread out" very thinly across this larger space, hence much smaller values. And so, the probability we accumulate as we move across this space per unit is going to be very low, hence a low likelihood value.
So when we draw many samples from a distribution, the likelihood of these samples is going to be very small (mostly---there might be spikes where they're high).
I've spoken a little loosely and informally, but hopefully this makes sense.
I just don't quite understand why more samples mean that the "space" gets higher dimensional and consequently less dense. Aren't the samples just estimating the underlying PDF, such that more samples shouldn't decrease the magnitude of the PDF? So if he drew those samples from Uniform(0, 2), shouldn't the resulting PDF simply approximate a value of 1/2=0.5 everywhere? I'm probably misunderstanding something basic here.
Consider a coin flip. 50% chance heads, 50% chance tails. This distribution is called Bernoulli, specifically Bernoulli(0.5). If we sample from this distribution, we get 1 (representing heads) with a 50% probability, or 0 (representing tails) with a 50% probability.
Now consider taking two samples, and calculating the likelihood of those two samples. Suppose we draw two samples from this distribution, HT (heads followed by tails). What is the probability that we got exactly these two samples from the distribution? Trivially, it's 0.5 * 0.5 = 0.25. Notice how this isn't the same as the probability of drawing any single sample (the probability of drawing any particular sample, that is, either heads or tails, is just 0.5).
The distribution representing the probability of a single sample of a coin flip lies in {0, 1}. You can think of this as a single-dimensional table, [0.5, 0.5], where each element represents the probability of the sample taking on the index of that element. (the probability of the sample taking on the value 0, which represents tails, is the 0th element of this array, 0.5. Similarly for 1, which represents heads).
Now think of the distribution for two samples. There are no longer two possibilities, but four - {0, 1} x {0, 1} = {(0, 0), (0, 1), (1, 0), (1, 1)} = {TT, TH, HT, HH}. We think of this as not a one-dimensional table but a two-dimensional table:
                            first sample = 0 (T)    first sample = 1 (H)
    second sample = 0 (T)           0.25                    0.25
    second sample = 1 (H)           0.25                    0.25
Here, the element at row i and column j represents the probability that the first sample takes on value j and the second sample takes on value i.
For three samples, the distribution becomes three-dimensional, with the space of possibilities being {0,1}^3 = {(0, 0, 0), (0, 0, 1), (0, 1, 0), ...}.
For any of these tables, each element represents the probability that a sample takes on the corresponding value at the element's position. So, clearly, if we add up all the values in a table, no matter how many dimensions, it must sum up to 1. There is a 100% chance that a sequence of n samples takes on some value, after all.
What you're saying about drawing multiple samples approximating the underlying PDF is still true here (though we are not talking about the PDF in the discrete case, but rather the PMF - probability mass function - since each element in this table is actually a probability, not merely a measure of density). If you draw N samples from this distribution and plot it on a histogram (one bar for the number of heads you draw divided by N, one bar for the number of tails you drew divided by N), then this will approximate the underlying PMF, namely [0.5, 0.5]. But that is separate from the fact that the probability of drawing a particular sequence of N samples becomes smaller and smaller as N increases. For N = 2, the probability of drawing any two particular samples (TT, TH, HT, or HH) is (1/2)^2. In general, it is (1/2)^N. One way to think about why it is (1/2)^N is that the distribution for N samples lies on the space {0, 1}^N, whose size is 2^N. The total probability, which is always 1 (no matter how large N is, it's still true that there's a 100% chance that a sequence of N samples is some sequence), needs to be distributed across a space of size 2^N. Every possibility is equally likely, so it's evenly distributed, so the probability is 1 / 2^N = (1/2)^N.
The same idea roughly applies in the continuous case, but importantly, in the continuous case, we're no longer talking about raw probabilities for a particular sample (the probability of drawing the value 0.3 from the distribution Uniform(0, 1) is exactly 0), but we're talking about probability density values. The same principle still applies though - if we "sum" (integrate) up all the PDF values for a distribution, since the PDF is the derivative of the CDF, by the fundamental theorem of calculus, we still should get 1. (The PDF is p(x) = d/dx P(X < x). Integrating both sides across all possible values X can take on, we will get integral from min possible value of x to max possible value of x of p(x) dx = P(X < max possible value of x) - P(X < min possible value of x) = 1 - 0 = 1). This total probability, 1, needs to be distributed across some space. The bigger the space is, the less densely it's going to be distributed, which is reflected in the lower value of the PDF.
To be sure, the space getting higher dimensional doesn't necessarily mean the PDF must be less dense. Consider Uniform(0, 0.5). When looking at the likelihood of two samples being from Uniform(0, 0.5), the probability is spread across [0, 0.5] x [0, 0.5], whose area is 0.5 * 0.5 = 0.25. Since the area is less than 1, the probability is actually *more* dense---specifically, the PDF is 4 at any point in [0, 0.5] x [0, 0.5], but for just one sample, the PDF is 2 at any point in [0, 0.5].
Whether the probability gets less or more dense in the higher dimensional space representing the likelihood of multiple samples from the same distribution depends on the volume of the domain. For Uniform(0, 2), the space of two samples is [0,2] x [0,2], whose area is 2 * 2 = 4---this is larger than it is for just one sample, since the space for just one sample is [0,2], which covers 2 units of space. Accordingly, for just one sample, the PDF is 0.5, while for two samples, the PDF is 0.25. The larger this space, the less densely the probability is concentrated in this space, and vice versa. If we're thinking about uniform distributions, notice the space gets bigger for Uniform(0, L) if L > 1, and gets smaller for L < 1, since powers of L (representing the size of higher dimensional spaces, e.g. L^2 represents the size of [0, L] x [0, L], the space on which the PDF for two samples from the distribution must lie) get smaller if L < 1 but get bigger if L > 1.
For the stack overflow post, the distribution in question is Gaussian, which has positive density on (-infinity, infinity), which you can think of as being more than large enough for the size of higher-dimensional spaces to be increasing, hence causing the PDF values to become smaller and smaller.
I hope I didn't make the problem more confusing by saying all that. If you're still confused, I can try to clear up things further, or I can point you to a better resource if you'd prefer that.
Everyone seems to be missing the point here. The SO post says:
> As we can see, even from the correct distribution, the likelihood is very, very small. Thus, it appears to be very unlikely in a certain sense that these numbers came from the very distribution they were generated from.
The person who asked the question is simply confused between likelihoods and posterior probabilities. The likelihood of d values from a Normal Distribution is defined to be the probability of sampling those d values given the parameters of the Normal. It is not the probability that those numbers came from that Normal Distribution. To answer the latter question, you need to say what other possibilities you're considering (perhaps some other parameter values) and use Bayes Rule. The other answers mention that ratios will be involved, but the way to see why ratios are involved is to look at Bayes Rule.
Statistician here. There's a deep idea called the likelihood principle https://en.wikipedia.org/wiki/Likelihood_principle that says all the information we can get from the data about model parameters is contained in the likelihood function.
We're talking about the whole likelihood surface here, not just the single point that's the maximum likelihood estimator. The MLE is a method for choosing a valid point estimator from the likelihood function; it has some good properties, like being consistent (if you have enough data it converges to the truth) and asymptotically efficient (it converges to the smallest possible variance), so long as some criteria are met.
But the MLE is not the only choice; for any given model, other procedures can be admissible estimators https://en.wikipedia.org/wiki/Admissible_decision_rule - it's just they also have to be procedures based on the likelihood function. In other words, your procedure doesn't have to be "take the likelihood function and find its maximum" but it has to be "take the likelihood function and... do something sensible with it."
So the MLE is popular in the frequentist world where you have to make the decision rules using the likelihood directly; in the Bayesian world, you take the likelihood and combine it with a prior, to make an actual probability distribution. Then you get things like MAP (mode of the posterior) or the Bayes estimate (expectation of the posterior) - alternatives to MLE that still use the likelihood surface.
Of course this all works only if the underlying probabilistic model is literally true. So, in the machine learning world where the models are judged on usefulness and not expected to reflect mathematical reality, you're allowed to do things inconsistent with the likelihood principle, like regularization tricks. In some physics situations (astronomical imaging comes to mind) where the probability model really is governed by the rules of nature, sticking to the likelihood principle actually matters.
As to the question of being small, well, the likelihood is the probability (density) of the exact data you observe given parameters. Let's say you know the true parameter (the mean and standard deviation) and you observe a thousand draws from a normal distribution. Of course the probability of observing the very same pattern of a thousand values again is overwhelmingly unlikely. But if the mean was way different, that pattern would be proportionally even more unlikely.
We should only care about relative probabilities. What's the probability that the universe evolved in exactly such a way that your cat will have exactly this fur pattern? Astronomically small. What's the probability that the universe evolved in such a way, and some of that fur ends up on your furniture? Another unimaginably small number. But what's the probability that, in a universe where you and your cat exist as you are, his fur will get everywhere? That's pretty much a certainty.
In a continuous distribution the probability of any number on that distribution being generated is effectively zero. If R was generating the true probabilities it should give you zero for every single number.
Think about it. That distribution is continuous over an infinite amount of numbers. If you select any number the chances of that number being generated will be essentially zero. According to the theory there is no possibility for any number on the distribution to be generated. This is correct.
Yet when you use the random number generator you get a number even though that number technically is impossible to exist due to zero probability. Does this mean there is a flaw in the theory when applied to the number generated?
Yes it does. The theory is an approximation of what's going on itself. No random number generator in a computer is selecting a number from a truly continuous set of numbers. It is selecting it from a finite set of numbers from all available numbers in a floating point specification.
Even if it's not a computer when you select a random number by intuition from a continuous distribution you are not doing it randomly.
Think about it. Pick a random number between 0 and 1. I pick 0.343445434. This selection is far from random. It is biased because there is an infinite amount of significant figures yet I arbitrarily don't go past a certain amount. I cut off at 9 sigfigs and bias towards a cutoff like that because picking a random number with say 6000 sigfigs is just too inconvenient. You really need to account for infinite sigfigs for the number to be truly random which is impossible.
So even when you pick numbers randomly you are actually picking from a finite set.
In fact I can't think of anything in reality that can truly be accurately described with a continuous distribution. Nothing is in the end truly continuous. Or maybe it does exist, but if it does exist how can we even confirm it? We can't verify anything in reality to a level of infinite sig figs.
If R was accurately calculating likelihood it should give you zero for each number. And the random number generator should not even be able to exist, as how do you even create a pool of infinite possibilities to select from? Likely R is giving some probability over a small interval of numbers.
That's where the practicality of the continuous distribution makes sense when you measure the probability of a range of values. You get a solid number in this case.
Anyway the above explanation is probably too deep. A more practical way of thinking about this is like this:
It is unlikely for any one person to win the lottery. Yet someone always wins. The probability of someone winning is 100 percent. The probability of a specific someone winning is 1 over the total number of people playing.
Improbable events in the universe happen all the time because that's all that's available. It's highly improbable for any one person to win the lottery, but if someone has to win, then there is a 100 percent chance that an arbitrary improbable event will occur.
This is more easily seen in a uniform discrete distribution rather than the normal continuous distribution.
In the case of the normal distribution it is confusing. In a normal distribution it is far more likely for an improbable event to occur than it is for the single most probable event to occur.
Think of it like this. I have a raffle. There are 2 billion participants. Each person has one ticket in the bag, except me. I have 100,000 tickets in the bag, so I am the most likely person to win.
But it is still far more likely for someone other than me to win, even though I am the most likely individual to win. An arbitrary improbable event is more likely to occur than the single most probable event.
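The arithmetic, rounding to the numbers above:

    total <- 2e9 + 1e5          # roughly two billion single tickets, plus 100,000 of mine
    1e5 / total                 # my chance of winning: ~ 5e-05
    1 / total                   # any other single person's chance: ~ 5e-10
    (total - 1e5) / total       # chance that someone other than me wins: ~ 0.99995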
> If R was accurately calculating likelihood it should give you zero for each number.
You have some good points but this is false. The probability of any point for a continuous distribution is indeed zero. That doesn't mean that the density at this point is also zero.