It is very strange that this is on a main page. The key thing is that the likelihood is the probability density of your data! I.e. if your probability density is a Gaussian N(0, 0.00001), then the likelihoods of data points next to the mean will be very large; if your PDF is N(0, 10000), they'll be very small.
Furthermore, the amount of data matters, since the per-datapoint likelihoods get multiplied together: if they were small to begin with, the product will be even smaller, and if they were large it will be larger.
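A quick numerical sketch of that point (my own illustration, not from the comment; it reads the second parameter of N(·,·) as a variance and uses numpy/scipy):

```python
import numpy as np
from scipy.stats import norm

x = 0.001  # a data point right next to the mean

# Tight Gaussian: the density at x is far above 1 (it's a density, not a probability).
print(norm(loc=0, scale=np.sqrt(0.00001)).pdf(x))   # ~ 120

# Wide Gaussian: the density at the same x is tiny.
print(norm(loc=0, scale=np.sqrt(10000)).pdf(x))      # ~ 0.004

# With more data the per-point values multiply, so one usually adds logs instead.
data = np.random.default_rng(0).normal(0.0, 1.0, size=100)
print(norm(loc=0, scale=1).logpdf(data).sum())        # log-likelihood of the whole sample
```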
I think there is some language misunderstanding going on. The likelihood function is not a probability density. The likelihood function evaluated at D is equal to the probability density of D (by definition). In other words, f(x; theta) as a function of x is a probability density function, while f(x; theta) as a function of theta is a likelihood function. But f(x; theta) for a given x and theta is just a value, which, one can say, is both a likelihood and a probability density.
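To make that concrete, here is a small sketch using an Exponential(rate = theta) model, f(x; theta) = theta·exp(-theta·x) (my own example, not from the comment): read as a function of x it integrates to 1, read as a function of theta it generally does not.

```python
import numpy as np

def f(x, theta):
    # density / likelihood of one observation under Exponential(rate=theta)
    return theta * np.exp(-theta * x)

grid = np.linspace(0.0, 50.0, 200_001)

# As a function of x with theta fixed: a probability density, integrates to 1.
print(np.trapz(f(grid, theta=2.0), grid))        # ~ 1.0

# As a function of theta with x fixed: the likelihood; its integral is 1/x**2 here, not 1.
print(np.trapz(f(x=0.5, theta=grid), grid))      # ~ 4.0

# A single evaluation is just a number; call it a likelihood or a density value, as you like.
print(f(0.5, 2.0))                               # ~ 0.736
```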
Right. But if you make the notation slightly more explicit, then the integral of L(data, params) over data is 1. This follows from the independence assumption.
So we ARE working with a probability function. Its output can be interpreted as probabilities. It's just that we're maximizing L = P(events | params) with respect to params.
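A minimal numeric check of the integral part of that claim (my own sketch, assuming two i.i.d. N(mu, 1) observations with mu held fixed; not from the comment):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu = 0.7  # parameters held fixed

def L(x2, x1):
    # likelihood of the two-point dataset (x1, x2): product of the per-point densities
    return norm.pdf(x1, loc=mu) * norm.pdf(x2, loc=mu)

# Integrate over the data space (x1, x2); the box is wide enough to hold essentially all the mass.
val, _ = integrate.dblquad(L, -12, 12, -12, 12)
print(val)  # ~ 1.0
```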
The likelihood function is a function of params for a fixed value of data and it is not a probability function.
There is another function - a function of data for fixed params - which is a probability density. That doesn’t change the fact that the likelihood function isn’t.
The independence has nothing to do with the integral being 1, to be honest. You could write a model where the observations are not independent, but the (multivariate) integral over their domain would still be 1.
If by “joint probability” you mean a function of (params, data), there is no joint probability here in general.
L(params, data) is constructed from a family of density functions p(data), one for each possible value of params. The integral of L(params, data) over params may be anything, or diverge. You don’t need any extra independence assumption either.
Or maybe you mean “joint probability” as p(data1, data2) when data is composed of two observations, for example. But you don’t need any independence assumption for that probability density to integrate to one! It necessarily does that - whether you can factorize it as p’(data1)p’’(data2) or not.
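For instance (my own sketch, not from the comment), a correlated bivariate normal does not factorize as p’(data1)p’’(data2), yet its joint density still integrates to 1:

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal

# Bivariate normal with correlation 0.8: the two observations are clearly not independent.
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]])

# Integrate the joint density over a box wide enough to hold essentially all the mass.
val, _ = integrate.dblquad(lambda y, x: mvn.pdf([x, y]), -12, 12, -12, 12)
print(val)  # ~ 1.0, with no independence assumption anywhere
```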
That's exactly the reason why the frequentist approach sucks, by the way ;) Parameters are treated specially and there is no internal consistency - to have it you need to introduce priors...
A likelihood could refer to data drawn from a discrete distribution, though. That wouldn't change much about how it's treated, but it would be a proper probability, not a probability density.
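A minimal sketch of the discrete case (my own example with Bernoulli coin flips, not from the comment): every factor in the likelihood is a probability mass, i.e. a genuine probability, and maximization works exactly the same way.

```python
import numpy as np
from scipy.stats import bernoulli

data = np.array([1, 0, 1, 1, 0, 1])  # observed coin flips

def likelihood(p):
    # product of P(x_i | p); each factor is p or (1 - p), a true probability in [0, 1]
    return np.prod(bernoulli.pmf(data, p))

# Grid-maximize over p: the MLE is the sample mean, 4/6.
ps = np.linspace(0.001, 0.999, 999)
print(ps[np.argmax([likelihood(p) for p in ps])])  # ~ 0.667
```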
I'm surprised it's here as well. Of all the interesting questions on CV, I would not consider this one of them. I wonder if this was sent through the second-chance pool.