More on Mixture Models, Maximum Likelihood, and Direct Search Methods

This post describes some issues that I encountered while trying to calculate likelihood values using the direct search approach. Direct search methods calculate the likelihood value for each specified parameter combination and compare that value to the values of all other specified combinations. This approach has the advantage of simplicity, but it requires careful consideration of the parameter values for which likelihood values are to be calculated and compared.

Consider a model with three variables. A likelihood value can be calculated for any particular combination of parameter values. By systematically varying the values of these three parameters, a three-dimensional picture of likelihood values can be created that shows how the likelihood responds to changes in parameter values. Local peaks or optima in likelihood values may exist in addition to the maximum likelihood value found at the combination of parameter values that constitutes the maximum likelihood estimates for those parameters. Direct search methods may avoid being fooled by local optima in likelihood value under certain conditions.

Two conditions must be satisfied to avoid settling at local optima. First, likelihood values should be calculated at sufficiently at narrow interval of the values for each parameter. Narrow intervals ensure that the calculation of likelihood values does not skip over parameter values that are at or close to the maximum likelihood estimates. In addition, the range of values explored for each parameter must be sufficiently broad to encompass the maximum likelihood values.

The direct search approach often requires that a balance be struck between precision and power. Choosing narrow search intervals and a broad range of values over which to search provides greater opportunities to find the maximum likelihood estimates or values close to them. The cost of employing narrow search intervals and a broad range of values is computing time. Narrowing the interval or broadening the range increases the number of parameter value combinations for which likelihood values must be computed, slowing the process of finding the maximum likelihood estimates. Searching over too broad a range of parameter values also risks other potential problems.

My work modeled fish vertebrae size-frequency data as a mixture of two lognormal distributions. For my data, the direct search method would sometimes find that the maximum likelihood estimates included a distribution with a very small log standard deviation (less than 0.05). Such estimates occurred when one distribution represented a small proportion of the mixture distribution (less than 0.25). For convenience of reference, I have termed these fitted distributions as “vestigial” distributions; they are part of the mixture distribution but don’t add much information to it.

These results seem analogous to the “overfitting” that occurs when a model with an unnecessarily large number of parameters is fit to the data. Models with large numbers of parameters will likely fit the current data very well but will be unable to fit future data sets accurately. Some of the model terms are likely to be accommodating noise in the data and not the process of interest. Similarly, cases where the maximum likelihood method results produce a vestigial distribution may indicate that the vestigial distribution is primarily accommodating some noise in the data. The resulting estimates for the mixture distribution do fit the data. My suspicion, however, is that this mixture distribution would fit another sample from the same population poorly.

Results that included these vestigial distributions also seem unrealistic. Under what circumstances would one portion of the mixture distribution have such a small log standard deviation, so all the individuals from that distribution are essentially of the same size? Such occurrences could possibly result if the data was not generated by independent processes. This circumstance could apply to my fish vertebrae data if I was I not successful in screening out multiple vertebrae from the same individual fish from my data set.

The vestigial distributions were most likely to be generated when modeling data sets with relatively small sample sizes. This pattern suggests that they do derive from noise in the data. I would otherwise have a hard time explaining variability in the occurrence of these vestigial distributions. For most of my assemblages, both components of the mixture distributions had relatively large log standard deviations.

I thus constrained the direct search to only consider distributions with a log standard deviation greater than 0.07. With this constraint in place, the mixture distribution model produced results that seemed reasonable and realistic. In most cases, the mixture models produced a significantly better fit to the data than a single lognormal distribution. The exceptions occurred, as might be expected, in the case of assemblages with relatively small samples sizes.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2010.