## Posts Tagged ‘maximum likelihood’

### Comparing the Fit of Distributions to Experimental Flake Size Data I

January 28, 2014

To determine whether the power law distribution provides an appropriate model for flake-size distributions, I obtained experimentally-generated lithic assemblages and fit a number of common distributions to the data. The initial experiments comprised the reduction of cores to produce flakes that could be further shaped into a tool. I used the maximum likelihood method to find the optimal parameter values for each distribution. The following table shows some initial results of this work, focusing on the flake-size data generated during a single core reduction episode.

Results of Fitting Various Distributions to a Single Experimentally-Generated Flake Assemblage

The models selected for comparison comprise a list of commonly-used distributions for modeling heavy-tailed data. Other model-fitting approaches, for example, found the Weibull distribution to fit flake-size distributions. While the maximum likelihood method provides a means by which to fit each of these models to the same data, this approach must be supplemented with some method for comparing the results.

One approach to model comparison, the Akaike Information Criterion (AIC), comes from information theory. The information-theoretic approach quantifies the expected distance between a particular model and the “true” model. The distance reflects the loss of information resulting from use of the focal model, which naturally fails to capture all of the variability in the data. The best model among a series of candidate models is the model that minimizes the distance or information loss. Obviously, the true model is unknown, but the distance can be estimated using some clever math. The derivation of AIC is beyond my powers to explain in any additional detail. After much hairy math, involving various approximations and simplifications and matrix algebra, a very simple formula emerges. AIC quantifies the distance using each model’s likelihood and a penalty term. AIC for a particular model can be defined as follows:

AIC = -2L + 2k, where L is the log-likelihood and k refers to the number of parameters in the model.

For small samples, a corrected version of AIC – termed AICc – is sometimes used:

AICc = AIC + 2k(k+1)/(nk-1)

The best-fitting model among a series of candidates will have the lowest value of AIC or AICc. Unlike other approaches, the information-theoretic approach can simultaneously compare any set of models. Likelihood ratio tests, another common model-comparision technique, are limited to pairwise comparisons of nested models; models are nested when the more complex model can be reduced to the simpler model by setting some parameters to a particular value, such as zero. The models compared in this example are not all nested, since the lognormal – for example – cannot be reduced to any of the other models by setting a parameter to a particular value.

The AIC and AICc for the power law distribution are both lower than the values for any of the other modeled distributions. The power law distribution thus fits this experimentally-produced flake size data better than other common distributions. These preliminary results support the work of Brown (2001) and others.

Note that the best-fitting model among all candidates may still provide a poor fit to the data. Thus, the power law distribution could still provide a poor fit to the data. A couple options exist for evaluating model fit. A simple approach would be to plot the data versus the theoretical distribution. There is also a way to measure the fit of the model to the data, which I will detail in a subsequent post.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2014

### Power Law Distributions and Model-fitting Approaches

November 18, 2013

A power law distribution has a heavy tail. The following simulation results depict the form of a power law distribution. In this example, the value of $\alpha$ is 3.6, the minimum value is 8.9, and the sample size is 75. The histogram shows the values that resulted from the simulation run, while the red line shows the theoretical distribution. Note that a few simulation values are much, much larger than the rest. To preview some results to be presented later, the distribution of debitage size in lithic reduction experiments looks similar. Such similarity is suggestive but not definitive.

As discussed previously, linear regression analysis has sometimes been used to evaluate the fit of the power law distribution to data and to estimate the value of the exponent, $\alpha$. This technique produces biased estimates, which the next simulation results illustrate. In the simulation, a random number generator produced a sample of numbers drawn from a power law distribution at a particular value of $\alpha$. I then analyzed this artificial data set using the linear regression approach described by Brown (2001) and using maximum likelihood estimation through a direct search of the parameter space. For a simple power law distribution, the maximimum likelihood estimate could also be found analytically. I used a direct search approach, however, in anticipation of using this approach for more complex mixture models. I repeated the random number generation and analysis 35 separate times for several different combinations of $\alpha$ and sample sizes. The following histograms show the estimates for $\alpha$ using the linear regression analysis and maximum likelihood estimation to find that value. In this particular case, $\alpha$ was set to 3.6 in the random number generator, the minimum value was set to 8.9, and the sample size was 500.

The histograms clearly suggest that the maximum likelihood estimates center closely around the true value of alpha, while the regression analyses skew to a lower value. The simulations that I performed at other parameter values and sample sizes displayed similar results. Other, more extensive simulation work also supports these impressions (Clauset et al. [2009] provides these results as part of a detailed, comprehensive technical discussion). Consequently, I used maximum likelihood estimation to fit probability distributions to data in the subsequent analyses.

References cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

Clauset, Aaron; Cosma Rohilla Shalizi; and Mark E. J. Newman
2009. Power-law distributions in empirical data. SIAM Review 51(4): 661-703.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

### Identification of Lithic Reduction Strategies from Mixed Assemblages

November 11, 2013

This post is the first in a series that will try to characterize lithic debitage assemblages formed from more than one reduction strategy. The primary goals are to estimate the proportions of the various reduction strategies represented within these mixed assemblages and to quantify the uncertainty of these estimates. I plan to use mixture models and the method of maximum likelihood to identify the distinct components of such assemblages.

Brown (2001) suggests that the distribution of debitage size follows a power law. Power law distributions have the following probability density function:

$f(x\vert \alpha) = C*x^{\alpha}$,

where C is a constant that normalizes the distribution, so the density integrates to one. The value of C thus depends entirely on the exponent $\alpha$.

Based on analysis of experimentally-produced assemblages, Brown further suggests that the exponent, $\alpha$, of these power law distributions varies among different reduction strategies. Thus, different reduction strategies produce distinctive debitage size distributions. This result could be very powerful, allowing reduction strategies from a wide variety of contexts to be characterized and distinguished. The technique used by Brown to estimate the value of the exponent, however, has some technical flaws.

Brown (2001) fits a linear regression to the relationship between the log of flake size grade and the log of the cumulative count of flakes in each size grade. In its favor, this approach seemingly reduces the effects of small sample sizes and can be easily replicated. The regression approach, on the other hand, also produces biased estimates of the exponent and does not allow the fit of the power law model to be compared to other probability density functions.

Maximum likelihood estimates, using data on the size of each piece of debitage, produce more reliable estimates of the exponent of a power law. Maximum likelihood estimates can also be readily compared among different distributions fit to the data, to evaluate whether a power law is the best model to describe debitage size distributions. The next post will illustrate the use of the linear regression approach and the maximum likelihood approach on simulated data drawn from a power law distribution.

Reference cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

### More on Mixture Models, Maximum Likelihood, and Direct Search Methods

January 15, 2010

This post describes some issues that I encountered while trying to calculate likelihood values using the direct search approach. Direct search methods calculate the likelihood value for each specified parameter combination and compare that value to the values of all other specified combinations. This approach has the advantage of simplicity, but it requires careful consideration of the parameter values for which likelihood values are to be calculated and compared.

Consider a model with three variables. A likelihood value can be calculated for any particular combination of parameter values. By systematically varying the values of these three parameters, a three-dimensional picture of likelihood values can be created that shows how the likelihood responds to changes in parameter values. Local peaks or optima in likelihood values may exist in addition to the maximum likelihood value found at the combination of parameter values that constitutes the maximum likelihood estimates for those parameters. Direct search methods may avoid being fooled by local optima in likelihood value under certain conditions.

Two conditions must be satisfied to avoid settling at local optima. First, likelihood values should be calculated at sufficiently at narrow interval of the values for each parameter. Narrow intervals ensure that the calculation of likelihood values does not skip over parameter values that are at or close to the maximum likelihood estimates. In addition, the range of values explored for each parameter must be sufficiently broad to encompass the maximum likelihood values.

The direct search approach often requires that a balance be struck between precision and power. Choosing narrow search intervals and a broad range of values over which to search provides greater opportunities to find the maximum likelihood estimates or values close to them. The cost of employing narrow search intervals and a broad range of values is computing time. Narrowing the interval or broadening the range increases the number of parameter value combinations for which likelihood values must be computed, slowing the process of finding the maximum likelihood estimates. Searching over too broad a range of parameter values also risks other potential problems.

My work modeled fish vertebrae size-frequency data as a mixture of two lognormal distributions. For my data, the direct search method would sometimes find that the maximum likelihood estimates included a distribution with a very small log standard deviation (less than 0.05). Such estimates occurred when one distribution represented a small proportion of the mixture distribution (less than 0.25). For convenience of reference, I have termed these fitted distributions as “vestigial” distributions; they are part of the mixture distribution but don’t add much information to it.

These results seem analogous to the “overfitting” that occurs when a model with an unnecessarily large number of parameters is fit to the data. Models with large numbers of parameters will likely fit the current data very well but will be unable to fit future data sets accurately. Some of the model terms are likely to be accommodating noise in the data and not the process of interest. Similarly, cases where the maximum likelihood method results produce a vestigial distribution may indicate that the vestigial distribution is primarily accommodating some noise in the data. The resulting estimates for the mixture distribution do fit the data. My suspicion, however, is that this mixture distribution would fit another sample from the same population poorly.

Results that included these vestigial distributions also seem unrealistic. Under what circumstances would one portion of the mixture distribution have such a small log standard deviation, so all the individuals from that distribution are essentially of the same size? Such occurrences could possibly result if the data was not generated by independent processes. This circumstance could apply to my fish vertebrae data if I was I not successful in screening out multiple vertebrae from the same individual fish from my data set.

The vestigial distributions were most likely to be generated when modeling data sets with relatively small sample sizes. This pattern suggests that they do derive from noise in the data. I would otherwise have a hard time explaining variability in the occurrence of these vestigial distributions. For most of my assemblages, both components of the mixture distributions had relatively large log standard deviations.

I thus constrained the direct search to only consider distributions with a log standard deviation greater than 0.07. With this constraint in place, the mixture distribution model produced results that seemed reasonable and realistic. In most cases, the mixture models produced a significantly better fit to the data than a single lognormal distribution. The exceptions occurred, as might be expected, in the case of assemblages with relatively small samples sizes.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2010.

### Mixture Models, Maximum Likelihood Methods, and Confidence Intervals

December 19, 2009

In an earlier post, I noted that the parameter estimates for a mixture model supplied by maximum likelihood methods were only part of the story. A measure of the precision of those estimates is also needed. Confidence intervals provide this measure. This post details a method for determining them.

Packages for the analysis of mixture models in R like mixdist generate confidence intervals automatically. The direct search approach, however, has proven to be more reliable for the data sets that I have been examining. In a direct search, the likelihood is calculated for each combination of parameter values over a plausible range of those parameter values.

The likelihood value of each combination is calculated by looping over a sequence of parameter values for each parameter. The interval between the values of a parameter used in the calculations should be relatively small. When small intervals are used, however, the number of combinations of parameter values for which likelihood values must be calculated increases rapidly. Direct search of the parameter space may not be practical for some applications. The direct search approach requires that a balance be struck between precision and manageability.

I have provided some R code that can be used, with some additional work, to generate confidence intervals for parameters of a simple mixture model. The mixture model assumes that the data comprises two lognormal distributions. Confidence intervals for the proportion of cases in the first model can be determined from the results produced by the code as written; the code can be modified to generate results relevant to other variables. The code follows:

### More Thoughts on Mixture Models and Maximum Likelihood Methods

November 11, 2009

In my previous post on this topic, I discussed two techniques for finding the combination of mixture distribution parameters that have the lowest log likelihood, direct search and the mixdist package for R. I suggested that direct search of the parameter space allowed the effects of outliers in my data to be identified more clearly. I have done additional work since that time, comparing direct search and the mixdist package. As a result of this work, I have concluded that direct search is much more effective at finding the optimal combination of parameter values.

Mixdist returned parameter values that consistently produced higher log likelihoods than I found using direct search. The differences were substantial. I can not fully explain the observed differences, but the differences were also consistent among all of my data sets.

Direct search of the parameter space is obviously not the most convenient approach. My model involved only two lognormal distributions with a total of five parameters. Direct search of the optimal parameters for more complex mixture models may not be feasible, as the number of parameter value combinations that need to be searched is too large. For simple mixture models, however, the direct search may be preferable.

The following code is the very simple program that I wrote for R to find the lowest log likelihood and corresponding parameter values. The program was written specifically for a mixture model of two lognormal distributions. The parameter value search space included the range of likely values of log mean of fish vertebra height and the log standard deviation of fish vertebrae height in both of the two distributions. I am not a professional programmer, as you will see, so any suggestions for improving and extending the code will be greatly appreciated.

#vcdata is a list of the data for which the parameter estimates and likelihood will be calculated

#pvec provides the sequence of values for the proportion of vertebrae in the mode of smaller (net-caught) fish over which the program loops.

pvec=seq(0.10, 0.99, by=0.015)

#mean1vec provides the sequence of values for the log mean of vertebra height in the mode of smaller (net-caught) fish over which the program loops.

mean1vec = seq(0.80, 1.20, by = 0.03)

#sd1vec provides the sequence of values for the log standard deviation of vertebra height in the mode of smaller (net-caught) fish over which the program loops.

sd1vec=seq(0.02, 0.44, by=0.03)

#mean2vec provides the sequence of values for the log mean of vertebra height in the mode of larger (hook- or spear-caught) fish over which the program loops.

mean2vec=seq(1.25, 1.70, by=0.03)

#sd2vec provides the sequence of values for the log standard deviation of vertebra height in the mode of larger (hook- or spear-caught) fish over which the program loops.

sd2vec=seq(0.02, 0.44, by=0.03)

#loglike is the negative log likelihood, which is calculated for each combination of parameter values.

loglike=100000

#p stores the parameter value for the proportion of vertebrae in the mode of smaller (net-caught) fish.

p=0

#mean1 stores the parameter value for the log mean of vertebra height in the mode of smaller (net-caught) fish.

mean1=0

#sd1 stores the parameter value for the log standard deviation of vertebra height in the mode of smaller (net-caught) fish.

sd1=0

#mean2 stores the parameter value for the log mean of vertebra height in the mode of larger (hook- or spear-caught) fish.

mean2=0

#sd2 stores the parameter value for the log standard deviation of vertebra height in the mode of larger (hook- or spear-caught) fish.

sd2=0

#the result data frame returns the value of the log likelihood at a particular combination of parameter values

result<-data.frame(loglike, p, mean1, sd1, mean2, sd2)

#the finalresult data frame stores the log likelihood and parameter values for the combination of parameter values that returns a log likelihood that is smaller than all other log likelihoods generated.

#looping over the combination of parameter values

finalresult<-data.frame(loglike, p, mean1, sd1, mean2, sd2)

for (j in 1:length(mean1vec)) {

for (k in 1:length(pvec)) {

for (m in 1:length(sd1vec)) {

for (n in 1:length(mean2vec)){

for (q in 1:length(sd2vec)){

#the following function returns the negative log likelihood value at a particular combination of parameter values

L= -sum(log(pvec[k]*(dlnorm(vcdata, meanlog=mean1vec[j], sdlog=sd1vec[m]))+(1-pvec[k])*(dlnorm(vcdata, meanlog=mean2vec[n], sdlog=sd2vec[q]))))

result$loglike=L result$p=pvec[k]

result$mean1=mean1vec[j] result$sd1=sd1vec[m]

result$mean2=mean2vec[n] result$sd2=sd2vec[q]

#the following comparison stores the lowest log likelihood and the corresponding combination of parameter values seen up to this point

if (L<finalresult\$loglike) finalresult=result

}

}

}

}

}

finalresult

When I employed the direct search, I ran it twice for each data set. The first time, I looped over a wider range of parameter values. The sequence of parameter values searched within each variable was spaced sufficiently far apart so the direct search would not bog down. The one exception was the proportion of fish vertebrae in each mode. For this variable, the step-size between the values of proportion for which I calculated the log likelihood was fairly small from the start. Experience with my data sets showed that this variable had the biggest effect on the log likelihood.

The second time that I ran the direct search for each data set, I focused on a narrower range of parameter values. The range of parameter values that I searched centered around the values found in the initial run. The sequence of values searched within each variable was spaced closer together in the second run.

Additional code, of course, needs to be written in order to determine the standard errors of the parameter estimates.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.