1.5 - Maximum Likelihood Estimation

One of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is \(\lambda\). In the binomial, the parameter of interest is \(\pi\) (since \(n\) is typically fixed and known).

The likelihood function is essentially the distribution of a random variable (or the joint distribution of all values if a sample of the random variable is obtained), viewed as a function of the parameter(s). The reason for viewing it this way is that the data values will be observed and can be substituted in, and the value of the unknown parameter that maximizes this likelihood function can then be found. The intuition is that this maximizing value is the one that makes our observed data most probable.

Bernoulli and Binomial Likelihoods

Consider a random sample of \(n\) Bernoulli random variables, \(X_1,\ldots,X_n\), each with PMF

\(f(x)=\pi^x(1-\pi)^{1-x}\qquad x=0,1\)

The likelihood function is the joint distribution of these sample values, which, by independence, we can write as

\(\ell(\pi)=f(x_1,\ldots,x_n;\pi)=\prod_{i=1}^{n}\pi^{x_i}(1-\pi)^{1-x_i}=\pi^{\sum_i x_i}(1-\pi)^{n-\sum_i x_i}\)

We interpret \(\ell(\pi)\) as the probability of observing \(X_1,\ldots,X_n\) as a function of \(\pi\), and the maximum likelihood estimate (MLE) of \(\pi\) is the value of \(\pi\) that maximizes this probability function. Equivalently, \(L(\pi)=\log\ell(\pi)\) is maximized at the same value and can be used interchangeably; more often than not, the log-likelihood function is easier to work with.

You may have noticed that the likelihood function for the sample of Bernoulli random variables depends only on their sum, which we can write as \(Y=\sum_i X_i\). Since \(Y\) has a binomial distribution with \(n\) trials and success probability \(\pi\), we can write its log-likelihood function as

\(\displaystyle L(\pi) = \log\left[{n\choose y} \pi^y(1 - \pi)^{n-y}\right]\)

The only difference between this log-likelihood function and the one for the Bernoulli sample is the presence of the binomial coefficient \({n\choose y}\). But since that doesn't depend on \(\pi\), it has no influence on the MLE and may be neglected.
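Expanding the log makes this explicit; the binomial coefficient enters only as an additive constant:

\(\displaystyle L(\pi) = \log {n\choose y} + y\log\pi + (n-y)\log(1-\pi)\)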

With a little calculus (taking the derivative with respect to \(\pi\) and setting it to zero), we can show that the value of \(\pi\) that maximizes the likelihood (and log-likelihood) function is \(Y/n\), which we denote as the MLE \(\hat{\pi}\). Not surprisingly, this is the familiar sample proportion of successes, which intuitively makes sense as a good estimate of the population proportion.
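For reference, here is a sketch of that calculus step (with the constant term dropped):

\(\displaystyle \frac{dL}{d\pi} = \frac{y}{\pi} - \frac{n-y}{1-\pi} = 0 \quad\Longrightarrow\quad \hat{\pi} = \frac{y}{n}\)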

Example: Binomial Example 1

If in our earlier binomial sample of 20 smartphone users, we observe 8 that use Android, the MLE for \(\pi\) is then \(8/20=0.4\). The plot below illustrates this maximizing value for both the likelihood and log-likelihood functions. The "dbinom" function is the PMF for the binomial distribution.

likeli.plot = function(y, n) {
  L = function(p) dbinom(y, n, p)                        # likelihood of y successes in n trials
  mle = optimize(L, interval=c(0, 1), maximum=TRUE)$maximum
  p = (1:100)/100
  par(mfrow=c(2, 1))
  plot(p, L(p), type='l')                                # likelihood
  abline(v=mle)
  plot(p, log(L(p)), type='l')                           # log-likelihood
  abline(v=mle)
  mle
}
likeli.plot(8, 20)
Figure 1.7: Likelihood and log-likelihood plots for \(y=8\) and \(n=20\)

Example: Binomial Example 2

We know that the likelihood function achieves its maximum value at the MLE, but how is the sample size related to its shape? Suppose that we observe \(X = 1\) from a binomial distribution with \(n = 4\) and unknown \(\pi\). The MLE is then \(\hat{\pi}=1/4=0.25\), and the graph of this function looks like this.

Figure 1.8: Likelihood plot for \(n=4\) and \(\hat{\pi}=0.25\)
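Dropping the constant term, the log-likelihood in this case is

\(L(\pi) = \log\pi + 3\log(1-\pi)\)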

Here is the program for creating this plot in SAS.

data for_plot;
  do x=0.01 to 0.8 by 0.01;
    y=log(x)+3*log(1-x);   *the log-likelihood function;
    output;
  end;
run;

/* plot options */
goptions reset=all colors=(black);
symbol1 i=spline line=1;
axis1 order=(0 to 1.0 by 0.2);

proc gplot data=for_plot;
  plot y*x / haxis=axis1;
run;
quit;

Now suppose that we observe \(X = 10\) from a binomial distribution with \(n = 40\). The MLE is again \(\hat{\pi}=10/40=0.25\), but the log-likelihood is now narrower:

Figure 1.9: Likelihood plot for \(n=40\) and \(\hat{\pi}=0.25\)

Finally, suppose that we observe \(X = 100\) from a binomial with \(n = 400\). The MLE is still \(\hat{\pi}=100/400=0.25\), but the log-likelihood is narrower still:

Figure 1.10: Likelihood plot for \(n=400\) and \(\hat{\pi}=0.25\)

As \(n\) gets larger, we observe that \(L(\pi)\) becomes more sharply peaked around the MLE \(\hat{\pi}\), which indicates that the true parameter lies close to \(\hat{\pi}\). If the log-likelihood is highly peaked, that is, if it drops sharply as we move away from the MLE, then the evidence is strong that \(\pi\) is near the MLE. A flatter log-likelihood, on the other hand, means that more values are plausible.
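To see this narrowing directly, here is a minimal R sketch (in the spirit of the likeli.plot function above, not part of the original examples) that overlays the three log-likelihood curves, each shifted so that its maximum value is zero:

# shifted log-likelihoods for the three samples, all with MLE 0.25
p = seq(0.01, 0.8, by=0.01)
loglik = function(y, n, p) dbinom(y, n, p, log=TRUE)
plot(p, loglik(1, 4, p) - loglik(1, 4, 0.25), type='l',
     ylim=c(-10, 0), ylab='shifted log-likelihood')
lines(p, loglik(10, 40, p) - loglik(10, 40, 0.25), lty=2)
lines(p, loglik(100, 400, p) - loglik(100, 400, 0.25), lty=3)
abline(v=0.25)
legend('bottomright', legend=c('n=4', 'n=40', 'n=400'), lty=1:3)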

Poisson Likelihood

Suppose that \(X = (X_1, X_2, \dots, X_n)\) are iid observations from a Poisson distribution with unknown parameter \(\lambda\). The likelihood function is

\(\ell(\lambda) =\prod\limits_{i=1}^{n} f\left(x_{i} ; \lambda\right) =\prod\limits_{i=1}^{n} \dfrac{\lambda^{x_{i}} e^{-\lambda}}{x_{i} !} =\dfrac{\lambda^{\sum_i x_{i}} e^{-n \lambda}}{x_{1} ! x_{2} ! \cdots x_{n} !}\)

The corresponding log-likelihood function is

\(L(\lambda)=\sum\limits_{i=1}^{n} x_i\log\lambda-n\lambda-\sum\limits_{i=1}^{n} \log(x_i!)\)

The MLE for \(\lambda\) can then be found by maximizing either of these with respect to \(\lambda\). Setting the first derivative equal to 0 gives the solution:

\(\dfrac{dL}{d\lambda}=\dfrac{\sum_i x_i}{\lambda}-n=0 \quad\Longrightarrow\quad \hat{\lambda}=\sum\limits_{i=1}^{n} \dfrac{x_i}{n}\).

Thus, for a Poisson sample, the MLE for \(\lambda\) is just the sample mean.
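As a quick numerical check (a sketch, not part of the original text), we can simulate a Poisson sample in R and verify that the value of \(\lambda\) maximizing the log-likelihood agrees with the sample mean:

set.seed(1)
x = rpois(50, lambda=3)      # simulated sample; lambda = 3 is an arbitrary choice
loglik = function(lambda) sum(dpois(x, lambda, log=TRUE))
optimize(loglik, interval=c(0.01, 10), maximum=TRUE)$maximum
mean(x)                      # should agree up to numerical tolerance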