Posts Tagged ‘mixture models’

Power Law Distributions and Model-fitting Approaches

November 18, 2013

A power law distribution has a heavy tail. The following simulation results depict the form of a power law distribution. In this example, the value of \alpha is 3.6, the minimum value is 8.9, and the sample size is 75. The histogram shows the values that resulted from the simulation run, while the red line shows the theoretical distribution. Note that a few simulation values are much, much larger than the rest. To preview some results to be presented later, the distribution of debitage size in lithic reduction experiments looks similar. Such similarity is suggestive but not definitive.

[Figure: histogram of the simulated power law sample ( \alpha = 3.6, minimum value 8.9, sample size 75) with the theoretical distribution shown in red]
As discussed previously, linear regression analysis has sometimes been used to evaluate the fit of the power law distribution to data and to estimate the value of the exponent, \alpha . This technique produces biased estimates, which the next simulation results illustrate. In the simulation, a random number generator produced a sample of numbers drawn from a power law distribution at a particular value of \alpha . I then analyzed this artificial data set using the linear regression approach described by Brown (2001) and using maximum likelihood estimation through a direct search of the parameter space. For a simple power law distribution, the maximum likelihood estimate could also be found analytically. I used a direct search, however, in anticipation of applying it to more complex mixture models. I repeated the random number generation and analysis 35 separate times for several different combinations of \alpha and sample size. The following histograms show the estimates of \alpha produced by the linear regression analysis and by maximum likelihood estimation. In this particular case, \alpha was set to 3.6 in the random number generator, the minimum value was set to 8.9, and the sample size was 500.

[Figure: histograms of the \alpha estimates from the linear regression analysis and from maximum likelihood estimation (true \alpha = 3.6, minimum value 8.9, sample size 500)]
The histograms show that the maximum likelihood estimates center closely on the true value of \alpha , while the regression estimates are skewed toward lower values. The simulations that I performed at other parameter values and sample sizes produced similar results. Other, more extensive simulation work also supports these impressions (Clauset et al. [2009] provide such results as part of a detailed, comprehensive technical discussion). Consequently, I used maximum likelihood estimation to fit probability distributions to data in the subsequent analyses.
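
A minimal R sketch of one such simulation replicate (not the exact code behind the figures) is shown below. The sample is drawn by inverse-transform sampling, the regression estimate follows the log-log cumulative-rank logic of Brown (2001) but uses raw sizes rather than size grades, and the maximum likelihood estimate uses the closed-form expression in Clauset et al. (2009) in place of a direct search.

# sketch of one simulation replicate: draw a power law sample, then estimate
# alpha by log-log regression and by maximum likelihood
set.seed(1)
alpha <- 3.6   # true exponent
xmin <- 8.9    # minimum value
n <- 500       # sample size

# inverse-transform sampling for a continuous power law with density
# proportional to x^(-alpha) on x >= xmin
x <- xmin * runif(n)^(-1 / (alpha - 1))

# regression estimate: slope of log rank against log size
# (Brown 2001 uses counts within size grades; raw sizes are used here for brevity)
srt <- sort(x, decreasing = TRUE)
fit <- lm(log(seq_along(srt)) ~ log(srt))
alpha_reg <- 1 - unname(coef(fit)[2])

# maximum likelihood estimate (closed form for a simple power law)
alpha_mle <- 1 + n / sum(log(x / xmin))

c(regression = alpha_reg, mle = alpha_mle)

Repeating this block many times and collecting alpha_reg and alpha_mle yields histograms of the kind shown above.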

References cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman
2009 Power-law Distributions in Empirical Data. SIAM Review 51(4): 661-703.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

Identification of Lithic Reduction Strategies from Mixed Assemblages

November 11, 2013

This post is the first in a series that will try to characterize lithic debitage assemblages formed from more than one reduction strategy. The primary goals are to estimate the proportions of the various reduction strategies represented within these mixed assemblages and to quantify the uncertainty of these estimates. I plan to use mixture models and the method of maximum likelihood to identify the distinct components of such assemblages.


Brown (2001) suggests that the distribution of debitage size follows a power law. Power law distributions have the following probability density function:

f(x \vert \alpha) = Cx^{-\alpha} ,

where C is a constant that normalizes the distribution, so the density integrates to one over all sizes at or above the smallest size considered, x_{min} . The value of C thus depends on the exponent \alpha and on x_{min} .

Based on analysis of experimentally produced assemblages, Brown further suggests that the exponent, \alpha , of these power law distributions varies among reduction strategies. Different reduction strategies thus produce distinctive debitage size distributions. This result could be very powerful, allowing reduction strategies from a wide variety of contexts to be characterized and distinguished. The technique that Brown used to estimate the value of the exponent, however, has some technical flaws.

Brown (2001) fits a linear regression to the relationship between the log of flake size grade and the log of the cumulative count of flakes in each size grade. In its favor, this approach seemingly reduces the effects of small sample sizes and can be easily replicated. The regression approach, however, produces biased estimates of the exponent and does not allow the fit of the power law model to be compared with the fit of other probability density functions.

Maximum likelihood estimation, using the size of each piece of debitage, produces more reliable estimates of the exponent of a power law. Likelihood values can also be readily compared among different distributions fit to the data, to evaluate whether a power law is the best model for debitage size distributions. The next post will illustrate the linear regression approach and the maximum likelihood approach on simulated data drawn from a power law distribution.
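
As an aside that goes slightly beyond Brown's presentation: for a simple power law with a known minimum size, the maximum likelihood estimate of the exponent has a convenient closed form,

\hat{\alpha} = 1 + n\left[\sum_{i=1}^{n} \ln \frac{x_i}{x_{min}}\right]^{-1} ,

where the x_i are the n observed flake sizes at or above the minimum size x_{min} . Mixture models of the kind considered in this series generally lack such closed-form estimates and must be fit by numerical search.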

Reference cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

More on Mixture Models, Maximum Likelihood, and Direct Search Methods

January 15, 2010

This post describes some issues that I encountered while trying to calculate likelihood values using the direct search approach. Direct search methods calculate the likelihood value for each specified parameter combination and compare that value to the values of all other specified combinations. This approach has the advantage of simplicity, but it requires careful consideration of the parameter values for which likelihood values are to be calculated and compared.

Consider a model with three parameters. A likelihood value can be calculated for any particular combination of parameter values. By systematically varying the values of these three parameters, the likelihood can be mapped across the three-dimensional parameter space, showing how it responds to changes in parameter values. Local peaks, or local optima, in likelihood may exist in addition to the global maximum, which occurs at the combination of parameter values that constitutes the maximum likelihood estimates for those parameters. Under certain conditions, direct search methods can avoid being fooled by such local optima.

Two conditions must be satisfied to avoid settling at a local optimum. First, likelihood values should be calculated at sufficiently narrow intervals of the values for each parameter. Narrow intervals ensure that the calculation of likelihood values does not skip over parameter values that are at or close to the maximum likelihood estimates. In addition, the range of values explored for each parameter must be broad enough to encompass the maximum likelihood estimates.

The direct search approach often requires that a balance be struck between precision and computing effort. Choosing narrow search intervals and a broad range of values over which to search provides greater opportunities to find the maximum likelihood estimates or values close to them. The cost of narrow intervals and a broad range is computing time: narrowing the interval or broadening the range increases the number of parameter value combinations for which likelihood values must be computed, slowing the search for the maximum likelihood estimates. Searching over too broad a range of parameter values also risks other potential problems, as the following example shows.
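
As a rough illustration of that trade-off, with purely hypothetical grid settings for a five-parameter, two-component mixture, the number of likelihood evaluations is simply the product of the number of values searched for each parameter:

# hypothetical search grids for a two-component mixture (five parameters)
p_grid <- seq(0.10, 0.99, by = 0.015)    # mixing proportion
mean_grid <- seq(0.80, 1.70, by = 0.03)  # log means of the two components
sd_grid <- seq(0.02, 0.44, by = 0.03)    # log standard deviations of the two components

# total number of parameter combinations to evaluate
length(p_grid) * length(mean_grid)^2 * length(sd_grid)^2
# halving every step size multiplies this count by roughly 2^5 = 32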

My work modeled fish vertebrae size-frequency data as a mixture of two lognormal distributions. For my data, the direct search method would sometimes find that the maximum likelihood estimates included a component distribution with a very small log standard deviation (less than 0.05). Such estimates occurred when one of the component distributions represented a small proportion of the mixture (less than 0.25). For convenience of reference, I have termed these fitted components “vestigial” distributions; they are part of the mixture distribution but do not add much information to it.

These results seem analogous to the “overfitting” that occurs when a model with an unnecessarily large number of parameters is fit to data. Models with large numbers of parameters will likely fit the current data very well but will be unable to fit future data sets accurately, because some of the model terms accommodate noise in the data rather than the process of interest. Similarly, cases where maximum likelihood estimation produces a vestigial distribution may indicate that the vestigial distribution is primarily accommodating noise in the data. The resulting mixture distribution does fit the data. My suspicion, however, is that it would fit another sample from the same population poorly.

Results that include these vestigial distributions also seem unrealistic. Under what circumstances would one portion of the mixture distribution have such a small log standard deviation that all the individuals from that distribution are essentially the same size? Such a result could arise if the observations in the data set were not independent. This circumstance could apply to my fish vertebrae data if I were not successful in screening out multiple vertebrae from the same individual fish.

The vestigial distributions were most likely to be generated when modeling data sets with relatively small sample sizes. This pattern suggests that they do derive from noise in the data. I would otherwise have a hard time explaining variability in the occurrence of these vestigial distributions. For most of my assemblages, both components of the mixture distributions had relatively large log standard deviations.

I thus constrained the direct search to consider only distributions with a log standard deviation greater than 0.07. With this constraint in place, the mixture distribution model produced results that seemed reasonable and realistic. In most cases, the mixture models produced a significantly better fit to the data than a single lognormal distribution. The exceptions occurred, as might be expected, for assemblages with relatively small sample sizes.
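
A sketch of how the constraint and the single-lognormal comparison might look in R follows, reusing the vcdata vector and the finalresult object from the grid-search program in the post below; the chi-square approximation to the likelihood-ratio test is only a rough guide for mixture models.

# restrict the log standard deviation grids so no vestigial component
# (log sd at or below the 0.07 cutoff) can be selected by the direct search
sd1vec <- seq(0.08, 0.44, by = 0.03)
sd2vec <- seq(0.08, 0.44, by = 0.03)

# fit a single lognormal by maximum likelihood for comparison
mu_hat <- mean(log(vcdata))
sigma_hat <- sqrt(mean((log(vcdata) - mu_hat)^2))
nll_single <- -sum(dlnorm(vcdata, meanlog = mu_hat, sdlog = sigma_hat, log = TRUE))

# likelihood-ratio comparison against the mixture's minimized negative log
# likelihood (finalresult$loglike); the mixture has three more free parameters
lr_stat <- 2 * (nll_single - finalresult$loglike)
pchisq(lr_stat, df = 3, lower.tail = FALSE)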

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2010.

More Thoughts on Mixture Models and Maximum Likelihood Methods

November 11, 2009

In my previous post on this topic, I discussed two techniques for finding the combination of mixture distribution parameters that has the lowest negative log likelihood: direct search and the mixdist package for R. I suggested that direct search of the parameter space allowed the effects of outliers in my data to be identified more clearly. I have since done additional work comparing direct search and the mixdist package. As a result of this work, I have concluded that direct search is much more effective at finding the optimal combination of parameter values.

Mixdist returned parameter values that consistently produced higher negative log likelihoods, and thus poorer fits, than those I found using direct search. The differences were substantial. I cannot fully explain them, but they were consistent across all of my data sets.

Direct search of the parameter space is obviously not the most convenient approach. My model involved only two lognormal distributions with a total of five parameters. Direct search for more complex mixture models may not be feasible, as the number of parameter value combinations that would need to be searched becomes too large. For simple mixture models, however, direct search may be preferable.

The following code is the very simple program that I wrote in R to find the lowest negative log likelihood and the corresponding parameter values. The program was written specifically for a mixture model of two lognormal distributions. The search space covered the likely range of the mixing proportion and of the log mean and log standard deviation of fish vertebra height in each of the two distributions. I am not a professional programmer, as you will see, so any suggestions for improving and extending the code will be greatly appreciated.

# vcdata is the vector of data for which the parameter estimates and likelihood will be calculated
vcdata <- c(…your data here…)

# sequences of parameter values over which the program loops:
# pvec: proportion of vertebrae in the mode of smaller (net-caught) fish
pvec <- seq(0.10, 0.99, by = 0.015)
# mean1vec, sd1vec: log mean and log standard deviation of vertebra height in the mode of smaller (net-caught) fish
mean1vec <- seq(0.80, 1.20, by = 0.03)
sd1vec <- seq(0.02, 0.44, by = 0.03)
# mean2vec, sd2vec: log mean and log standard deviation of vertebra height in the mode of larger (hook- or spear-caught) fish
mean2vec <- seq(1.25, 1.70, by = 0.03)
sd2vec <- seq(0.02, 0.44, by = 0.03)

# loglike holds the negative log likelihood; it starts at a large value so the
# first combination of parameter values evaluated always replaces it
loglike <- 100000
# p, mean1, sd1, mean2, sd2 store the parameter values for the current combination
p <- 0
mean1 <- 0
sd1 <- 0
mean2 <- 0
sd2 <- 0

# result holds the negative log likelihood and parameter values for the current combination;
# finalresult stores the combination with the smallest negative log likelihood seen so far
result <- data.frame(loglike, p, mean1, sd1, mean2, sd2)
finalresult <- data.frame(loglike, p, mean1, sd1, mean2, sd2)

# loop over every combination of parameter values
for (j in 1:length(mean1vec)) {
  for (k in 1:length(pvec)) {
    for (m in 1:length(sd1vec)) {
      for (n in 1:length(mean2vec)) {
        for (q in 1:length(sd2vec)) {
          # negative log likelihood of the two-component lognormal mixture
          # at this combination of parameter values
          L <- -sum(log(pvec[k] * dlnorm(vcdata, meanlog = mean1vec[j], sdlog = sd1vec[m]) +
                        (1 - pvec[k]) * dlnorm(vcdata, meanlog = mean2vec[n], sdlog = sd2vec[q])))
          result$loglike <- L
          result$p <- pvec[k]
          result$mean1 <- mean1vec[j]
          result$sd1 <- sd1vec[m]
          result$mean2 <- mean2vec[n]
          result$sd2 <- sd2vec[q]
          # keep the lowest negative log likelihood and the corresponding parameter values
          if (L < finalresult$loglike) finalresult <- result
        }
      }
    }
  }
}

finalresult

When I employed the direct search, I ran it twice for each data set. The first time, I looped over a wider range of parameter values. The sequence of values searched within each parameter was spaced sufficiently far apart that the direct search would not bog down. The one exception was the proportion of fish vertebrae in each mode; for this parameter, the step size was fairly small from the start, because experience with my data sets showed that it had the biggest effect on the log likelihood.

The second time that I ran the direct search for each data set, I focused on a narrower range of parameter values. The range of parameter values that I searched centered around the values found in the initial run. The sequence of values searched within each variable was spaced closer together in the second run.
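
A refined second pass might be set up along these lines, re-centering each grid on the first-pass estimates stored in finalresult (the ranges and step sizes shown are illustrative):

# second pass: narrower ranges centered on the first-pass estimates, with smaller steps
pvec <- seq(max(0.01, finalresult$p - 0.03), min(0.99, finalresult$p + 0.03), by = 0.005)
mean1vec <- seq(finalresult$mean1 - 0.03, finalresult$mean1 + 0.03, by = 0.01)
sd1vec <- seq(max(0.01, finalresult$sd1 - 0.03), finalresult$sd1 + 0.03, by = 0.01)
mean2vec <- seq(finalresult$mean2 - 0.03, finalresult$mean2 + 0.03, by = 0.01)
sd2vec <- seq(max(0.01, finalresult$sd2 - 0.03), finalresult$sd2 + 0.03, by = 0.01)
# then rerun the loops above with these finer grids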

Additional code, of course, needs to be written in order to determine the standard errors of the parameter estimates.
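
One possible approach, which goes slightly beyond the pure direct search described above, is to pass the same negative log likelihood to optim(), started at the grid-search estimates, and take approximate standard errors from the inverse of the numerically estimated Hessian:

# negative log likelihood as a function of the parameter vector
# theta = c(p, mean1, sd1, mean2, sd2)
nll <- function(theta, x) {
  -sum(log(theta[1] * dlnorm(x, meanlog = theta[2], sdlog = theta[3]) +
           (1 - theta[1]) * dlnorm(x, meanlog = theta[4], sdlog = theta[5])))
}

start <- with(finalresult, c(p, mean1, sd1, mean2, sd2))
fit <- optim(start, nll, x = vcdata, hessian = TRUE,
             method = "L-BFGS-B",
             lower = c(0.01, -Inf, 0.01, -Inf, 0.01),
             upper = c(0.99, Inf, Inf, Inf, Inf))

# approximate standard errors from the inverse Hessian of the negative log
# likelihood at the optimum
sqrt(diag(solve(fit$hessian)))

Profile likelihoods or a bootstrap would be alternatives if this normal approximation seems questionable.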

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.