Archive for the ‘Math and Statistics’ Category

Power Law Distributions and Model-fitting Approaches

November 18, 2013

A power law distribution has a heavy right tail. The following simulation results depict the form of such a distribution. In this example, the value of \alpha is 3.6, the minimum value is 8.9, and the sample size is 75. The histogram shows the values produced by the simulation run, while the red line shows the theoretical distribution. Note that a few simulated values are much, much larger than the rest. To preview some results to be presented later, the distribution of debitage size in lithic reduction experiments looks similar. Such similarity is suggestive but not definitive.

[Figure: histogram of the simulated power-law sample with the theoretical distribution shown in red]
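A simulation along these lines can be reproduced with a short R sketch. The code below assumes the standard continuous power-law form discussed by Clauset et al. (2009) and uses the parameter values quoted above; the variable names are mine, not part of the original analysis.

#sketch: draw a power-law sample by inverse-transform sampling
#parameter values follow the text: alpha = 3.6, xmin = 8.9, n = 75
alpha <- 3.6
xmin <- 8.9
n <- 75
u <- runif(n)
x <- xmin * (1 - u)^(-1 / (alpha - 1))
hist(x, breaks = 30, main = "Simulated power-law sample")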
As discussed previously, linear regression analysis has sometimes been used to evaluate the fit of the power law distribution to data and to estimate the value of the exponent, \alpha . This technique produces biased estimates, as the next simulation results illustrate. In the simulation, a random number generator produced a sample of numbers drawn from a power law distribution with a particular value of \alpha . I then analyzed this artificial data set using the linear regression approach described by Brown (2001) and using maximum likelihood estimation through a direct search of the parameter space. For a simple power law distribution, the maximum likelihood estimate could also be found analytically. I used a direct search approach, however, in anticipation of applying it to more complex mixture models. I repeated the random number generation and analysis 35 separate times for several combinations of \alpha and sample size. The following histograms show the estimates of \alpha obtained from the linear regression analysis and from maximum likelihood estimation. In this particular case, \alpha was set to 3.6 in the random number generator, the minimum value was set to 8.9, and the sample size was 500.

[Figure: histograms of the \alpha estimates from the linear regression and maximum likelihood analyses]
The histograms show that the maximum likelihood estimates center closely around the true value of \alpha , while the regression estimates skew toward a lower value. The simulations that I performed at other parameter values and sample sizes displayed similar results. Other, more extensive simulation work also supports these impressions (Clauset et al. [2009] present such results as part of a detailed, comprehensive technical discussion). Consequently, I used maximum likelihood estimation to fit probability distributions to data in the subsequent analyses.
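As an illustration of the two estimators on a single simulated sample, the following sketch compares the closed-form maximum likelihood estimate for a simple power law with a least-squares fit to the log-log histogram. The regression version shown here is a generic one and may differ in detail from the procedure described by Brown (2001); all object names are illustrative.

#sketch: compare regression and maximum likelihood estimates of alpha
#on one simulated sample (alpha = 3.6, xmin = 8.9, n = 500, as above)
alpha <- 3.6
xmin <- 8.9
n <- 500
x <- xmin * (1 - runif(n))^(-1 / (alpha - 1))

#closed-form maximum likelihood estimate for a simple power law
alpha_mle <- 1 + n / sum(log(x / xmin))

#regression estimate: slope of log counts against log bin midpoints
h <- hist(x, breaks = 30, plot = FALSE)
keep <- h$counts > 0
alpha_reg <- -coef(lm(log(h$counts[keep]) ~ log(h$mids[keep])))[[2]]

c(mle = alpha_mle, regression = alpha_reg)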

References cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

Clauset, Aaron; Cosma Rohilla Shalizi; and Mark E. J. Newman
2009 Power-law Distributions in Empirical Data. SIAM Review 51(4): 661-703.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013


On Monument Volume IV

April 29, 2013

This post evaluates burial mound volume, fitting various probability models to the data. As noted previously, the exponential distribution seems like an appropriate model to fit to the mound volume data. This model is not the only possibility, of course, so I will also consider an alternative, the gamma distribution. The exponential distribution is a special case of the gamma distribution.

The gamma probability density function (pdf) is:

f(x\vert \alpha , \lambda) = \frac{\lambda ^{\alpha} x^{\alpha - 1}e^{- \lambda x}}{\Gamma (\alpha)} ,

where:

\alpha is the shape parameter,

\lambda is the rate parameter,

and \Gamma is the gamma function.

The gamma function is defined as:

\Gamma (\alpha) = \int_{0}^{\infty} t^{\alpha -1} e^{-t} dt

Depending on the parameter values, the graph of the gamma pdf can take a wide variety of shapes, including forms that resemble the bell-shaped curve of the normal distribution. The following illustration shows some of the possible variation.

[Figure: examples of gamma probability density functions for different parameter values]
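A plot of this kind can be generated with a few lines of R; the shape and rate values below are arbitrary choices for illustration rather than the values used in the figure.

#sketch: gamma densities for several shape/rate combinations
curve(dgamma(x, shape = 1, rate = 0.5), from = 0, to = 15, ylab = "density")
curve(dgamma(x, shape = 2, rate = 0.5), add = TRUE, lty = 2)
curve(dgamma(x, shape = 9, rate = 1), add = TRUE, lty = 3)
legend("topright", legend = c("shape 1", "shape 2", "shape 9"), lty = 1:3)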

To evaluate the relationship between mound volume and mound condition (plowed versus whole) under the gamma and exponential distributions, I analyzed model fit using the maximum likelihood method. The following R code details the analysis.

library(bbmle)

#gamma model, with Condition as a covariate on the rate parameter
mdvol_g.mle <- mle2(Mound.Volume ~ dgamma(shape = shape, rate = gvar),
    start = list(shape = 1, gvar = 1/mean(Allmds$Mound.Volume)),
    data = Allmds, parameters = list(gvar ~ Condition))
mdvol_g.mle

#exponential model, with Condition as a covariate on the rate parameter
mdvol_e_cov.mle <- mle2(Mound.Volume ~ dexp(rate = avar),
    start = list(avar = 1/mean(Allmds$Mound.Volume)),
    data = Allmds, parameters = list(avar ~ Condition))
mdvol_e_cov.mle

#exponential model without the covariate
mdvol_e.mle <- mle2(Mound.Volume ~ dexp(rate = bvar),
    start = list(bvar = 1/mean(Allmds$Mound.Volume)),
    data = Allmds)
mdvol_e.mle

In this code, Allmds refers to an R data frame containing the variables Mound.Volume and Condition. The code uses the maximum likelihood method to evaluate the fit of the gamma and exponential distributions to the data and to estimate parameter values. I performed the analysis three times. In the first analysis, I fit the gamma distribution, using Condition as a covariate. In the second and third analyses, I fit the exponential distribution to the data, once with the covariate Condition and once without it.

The models are “nested”. The gamma distribution can be reduced to the exponential distribution by setting the gamma’s shape parameter to one, and the exponential model without the covariate is a simplified version of the model with the covariate. Nested models can be compared using a likelihood-ratio test (the anova function in R) to see whether the more complex model gives a significantly better fit to the data, justifying the extra complexity. A sketch of these comparisons appears below, and the two tables that follow show the results of the analysis.
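The following sketch shows one way to run these comparisons, assuming the three mle2 fits created above; anova and AICtab are functions in the bbmle package.

#likelihood-ratio tests of the nested models
anova(mdvol_e.mle, mdvol_e_cov.mle)   #does the Condition covariate improve the exponential fit?
anova(mdvol_e_cov.mle, mdvol_g.mle)   #does the gamma improve on the exponential?

#AIC comparison as a cross-check
AICtab(mdvol_e.mle, mdvol_e_cov.mle, mdvol_g.mle)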

[Table: model-fitting results]

[Table: model-comparison results]
The initial results suggest that the exponential distribution with the covariate provides a significantly better fit to the data than the simpler model without the covariate. The gamma distribution does not provide a significantly better fit. Notice that the gamma’s shape parameter is estimated to be one, which reduces the gamma to the exponential distribution.

From this preliminary analysis, I offer the following conclusions. The exponential distribution appears to be an appropriate model for mound volume. In addition, plowed mounds may be distinctly smaller than whole mounds, contradicting my initial hypothesis. In subsequent posts, I will consider some archaeological implications and address some additional considerations that may help to explain these results.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013.

A Useful Resource for Understanding Combinatorics and Probability Theory

January 19, 2011

Combinatorial principles can be difficult to understand from modern textbooks. The more elegant the explanation, the more it is couched in abstract mathematical language. A book that I picked up a long time ago has helped me to bridge the gap between those elegant, abstract textbook explanations and a concrete understanding of them. The book Lady Luck by Warren Weaver provides a relatively rigorous explanation of probability in plain, clear language. I have turned to this book many times to supplement my textbooks when the textbook explanation of a concept proved elusive.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2011.

More on Mixture Models, Maximum Likelihood, and Direct Search Methods

January 15, 2010

This post describes some issues that I encountered while trying to calculate likelihood values using the direct search approach. Direct search methods calculate the likelihood value for each specified parameter combination and compare that value to the values of all other specified combinations. This approach has the advantage of simplicity, but it requires careful consideration of the parameter values for which likelihood values are to be calculated and compared.

Consider a model with three parameters. A likelihood value can be calculated for any particular combination of parameter values. By systematically varying the values of these three parameters, a picture of the likelihood surface over the three-dimensional parameter space can be created, showing how the likelihood responds to changes in parameter values. Local peaks or optima in likelihood values may exist in addition to the global maximum found at the combination of parameter values that constitutes the maximum likelihood estimates. Direct search methods can avoid being fooled by these local optima under certain conditions.

Two conditions must be satisfied to avoid settling at a local optimum. First, likelihood values should be calculated at sufficiently narrow intervals of the values of each parameter. Narrow intervals ensure that the calculation of likelihood values does not skip over parameter values that are at or close to the maximum likelihood estimates. In addition, the range of values explored for each parameter must be sufficiently broad to encompass the maximum likelihood estimates.

The direct search approach often requires that a balance be struck between precision and computational effort. Choosing narrow search intervals and a broad range of values over which to search provides greater opportunities to find the maximum likelihood estimates or values close to them. The cost of employing narrow search intervals and a broad range of values is computing time: narrowing the interval or broadening the range increases the number of parameter value combinations for which likelihood values must be computed, slowing the search. Searching over too broad a range of parameter values also risks other potential problems.

My work modeled fish vertebra size-frequency data as a mixture of two lognormal distributions. For my data, the direct search method would sometimes find that the maximum likelihood estimates included a component distribution with a very small log standard deviation (less than 0.05). Such estimates occurred when that distribution represented a small proportion of the mixture (less than 0.25). For convenience of reference, I term these fitted distributions “vestigial” distributions; they are part of the mixture distribution but do not add much information to it.

These results seem analogous to the “overfitting” that occurs when a model with an unnecessarily large number of parameters is fit to the data. Models with large numbers of parameters will likely fit the current data very well but will be unable to fit future data sets accurately: some of the model terms end up accommodating noise in the data rather than the process of interest. Similarly, cases where maximum likelihood estimation produces a vestigial distribution may indicate that the vestigial distribution is primarily accommodating noise in the data. The resulting estimates for the mixture distribution do fit the data. My suspicion, however, is that such a mixture distribution would fit another sample from the same population poorly.

Results that included these vestigial distributions also seem unrealistic. Under what circumstances would one portion of the mixture distribution have such a small log standard deviation that all the individuals from that distribution are essentially the same size? Such occurrences could result if the data were not generated by independent processes. This circumstance could apply to my fish vertebra data if I were not successful in screening out multiple vertebrae from the same individual fish.

The vestigial distributions were most likely to be generated when modeling data sets with relatively small sample sizes. This pattern suggests that they do derive from noise in the data. I would otherwise have a hard time explaining variability in the occurrence of these vestigial distributions. For most of my assemblages, both components of the mixture distributions had relatively large log standard deviations.

I thus constrained the direct search to consider only distributions with a log standard deviation greater than 0.07. With this constraint in place, the mixture distribution model produced results that seemed reasonable and realistic. In most cases, the mixture models produced a significantly better fit to the data than a single lognormal distribution. The exceptions occurred, as might be expected, in the assemblages with relatively small sample sizes.
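In terms of the direct-search code posted elsewhere on this blog, the constraint simply amounts to starting the standard-deviation sequences above the cutoff. A minimal sketch:

#sketch: restrict the search grid so that neither component's log
#standard deviation can fall below the 0.07 cutoff described above
sd1vec <- seq(0.08, 0.44, by = 0.03)
sd2vec <- seq(0.08, 0.44, by = 0.03)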

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2010.

Mixture Models, Maximum Likelihood Methods, and Confidence Intervals

December 19, 2009

In an earlier post, I noted that the parameter estimates for a mixture model supplied by maximum likelihood methods were only part of the story. A measure of the precision of those estimates is also needed. Confidence intervals provide this measure. This post details a method for determining them.

Packages for the analysis of mixture models in R, such as mixdist, generate confidence intervals automatically. The direct search approach, however, has proven more reliable for the data sets that I have been examining. In a direct search, the likelihood is calculated for each combination of parameter values over a plausible range of those values.

The likelihood value of each combination is calculated by looping over a sequence of parameter values for each parameter. The interval between the values of a parameter used in the calculations should be relatively small. When small intervals are used, however, the number of combinations of parameter values for which likelihood values must be calculated increases rapidly. Direct search of the parameter space may not be practical for some applications. The direct search approach requires that a balance be struck between precision and manageability.

I have provided some R code that can be used, with some additional work, to generate confidence intervals for the parameters of a simple mixture model. The mixture model assumes that the data comprise a mixture of two lognormal distributions. Confidence intervals for the proportion of cases in the first component can be determined from the results produced by the code as written; the code can be modified to generate results for the other parameters. The code follows:

(more…)
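The code itself is not reproduced here. As a stand-in, the following is a minimal sketch of one way to obtain such an interval: a profile-likelihood interval for the mixing proportion. It assumes the data vector vcdata and the parameter sequences (pvec, mean1vec, sd1vec, mean2vec, sd2vec) defined as in the direct-search code elsewhere on this page; the object and function names are illustrative and are not the original code.

#sketch: approximate 95% profile-likelihood interval for the mixing proportion p
grid4 <- expand.grid(mean1 = mean1vec, sd1 = sd1vec, mean2 = mean2vec, sd2 = sd2vec)

#negative log likelihood of the two-component lognormal mixture
negll <- function(p, mean1, sd1, mean2, sd2) {
  -sum(log(p * dlnorm(vcdata, meanlog = mean1, sdlog = sd1) +
           (1 - p) * dlnorm(vcdata, meanlog = mean2, sdlog = sd2)))
}

#profile over p: for each fixed p, minimize over the remaining parameters
profile_p <- sapply(pvec, function(p) {
  min(mapply(negll, p, grid4$mean1, grid4$sd1, grid4$mean2, grid4$sd2))
})

#keep the p values whose profile negative log likelihood lies within
#qchisq(0.95, 1)/2 of the overall minimum
keep <- profile_p <= min(profile_p) + qchisq(0.95, df = 1) / 2
range(pvec[keep])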

More Thoughts on Mixture Models and Maximum Likelihood Methods

November 11, 2009

In my previous post on this topic, I discussed two techniques for finding the combination of mixture distribution parameters that has the lowest negative log likelihood: direct search and the mixdist package for R. I suggested that a direct search of the parameter space allowed the effects of outliers in my data to be identified more clearly. I have done additional work since that time comparing direct search and the mixdist package. As a result of this work, I have concluded that direct search is much more effective at finding the optimal combination of parameter values.

Mixdist returned parameter values that consistently produced higher negative log likelihoods, that is, poorer fits, than the values I found through direct search. The differences were substantial. I cannot fully explain them, but they were consistent across all of my data sets.

Direct search of the parameter space is obviously not the most convenient approach. My model involved only two lognormal distributions with a total of five parameters. Direct search for more complex mixture models may not be feasible, because the number of parameter value combinations that must be searched becomes too large. For simple mixture models, however, direct search may be preferable.

The following code is the very simple program that I wrote in R to find the lowest negative log likelihood and the corresponding parameter values. The program was written specifically for a mixture model of two lognormal distributions. The parameter search space covered the range of likely values of the log mean and log standard deviation of fish vertebra height in each of the two distributions. I am not a professional programmer, as you will see, so any suggestions for improving and extending the code will be greatly appreciated.

#vcdata is a list of the data for which the parameter estimates and likelihood will be calculated

vcdata<-c(…your data here…)

#pvec provides the sequence of values for the proportion of vertebrae in the mode of smaller (net-caught) fish over which the program loops.

pvec=seq(0.10, 0.99, by=0.015)

#mean1vec provides the sequence of values for the log mean of vertebra height in the mode of smaller (net-caught) fish over which the program loops.

mean1vec = seq(0.80, 1.20, by = 0.03)

#sd1vec provides the sequence of values for the log standard deviation of vertebra height in the mode of smaller (net-caught) fish over which the program loops.

sd1vec=seq(0.02, 0.44, by=0.03)

#mean2vec provides the sequence of values for the log mean of vertebra height in the mode of larger (hook- or spear-caught) fish over which the program loops.

mean2vec=seq(1.25, 1.70, by=0.03)

#sd2vec provides the sequence of values for the log standard deviation of vertebra height in the mode of larger (hook- or spear-caught) fish over which the program loops.

sd2vec=seq(0.02, 0.44, by=0.03)

#loglike holds the negative log likelihood; it is initialized to a value larger than any negative log likelihood expected from the data.

loglike=100000

#p stores the parameter value for the proportion of vertebrae in the mode of smaller (net-caught) fish.

p=0

#mean1 stores the parameter value for the log mean of vertebra height in the mode of smaller (net-caught) fish.

mean1=0

#sd1 stores the parameter value for the log standard deviation of vertebra height in the mode of smaller (net-caught) fish.

sd1=0

#mean2 stores the parameter value for the log mean of vertebra height in the mode of larger (hook- or spear-caught) fish.

mean2=0

#sd2 stores the parameter value for the log standard deviation of vertebra height in the mode of larger (hook- or spear-caught) fish.

sd2=0

#the result data frame returns the value of the log likelihood at a particular combination of parameter values

result<-data.frame(loglike, p, mean1, sd1, mean2, sd2)

#the finalresult data frame stores the negative log likelihood and parameter values for the combination of parameter values that returns a negative log likelihood smaller than all others generated so far.

finalresult<-data.frame(loglike, p, mean1, sd1, mean2, sd2)

#loop over the combinations of parameter values

for (j in 1:length(mean1vec)) {
  for (k in 1:length(pvec)) {
    for (m in 1:length(sd1vec)) {
      for (n in 1:length(mean2vec)) {
        for (q in 1:length(sd2vec)) {

          #the following expression returns the negative log likelihood at a particular combination of parameter values
          L = -sum(log(pvec[k]*(dlnorm(vcdata, meanlog=mean1vec[j], sdlog=sd1vec[m])) + (1-pvec[k])*(dlnorm(vcdata, meanlog=mean2vec[n], sdlog=sd2vec[q]))))

          result$loglike = L
          result$p = pvec[k]
          result$mean1 = mean1vec[j]
          result$sd1 = sd1vec[m]
          result$mean2 = mean2vec[n]
          result$sd2 = sd2vec[q]

          #the following comparison keeps the lowest negative log likelihood and the corresponding combination of parameter values seen so far
          if (L < finalresult$loglike) finalresult = result

        }
      }
    }
  }
}

finalresult
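For anyone looking to streamline the nested loops, one possibility, offered as a sketch rather than a drop-in replacement, is to build the full parameter grid with expand.grid and compute the negative log likelihood over its rows. The grid can become large in memory when the sequences are finely spaced.

#sketch: the same direct search written without explicit nested loops
grid <- expand.grid(p = pvec, mean1 = mean1vec, sd1 = sd1vec, mean2 = mean2vec, sd2 = sd2vec)

grid$loglike <- mapply(
  function(p, m1, s1, m2, s2) {
    -sum(log(p * dlnorm(vcdata, meanlog = m1, sdlog = s1) +
             (1 - p) * dlnorm(vcdata, meanlog = m2, sdlog = s2)))
  },
  grid$p, grid$mean1, grid$sd1, grid$mean2, grid$sd2)

#row of the grid with the lowest negative log likelihood
grid[which.min(grid$loglike), ]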

When I employed the direct search, I ran it twice for each data set. The first time, I looped over a wider range of parameter values. The sequence of parameter values searched within each variable was spaced sufficiently far apart so the direct search would not bog down. The one exception was the proportion of fish vertebrae in each mode. For this variable, the step-size between the values of proportion for which I calculated the log likelihood was fairly small from the start. Experience with my data sets showed that this variable had the biggest effect on the log likelihood.

The second time that I ran the direct search for each data set, I focused on a narrower range of parameter values. The range of parameter values that I searched centered around the values found in the initial run. The sequence of values searched within each variable was spaced closer together in the second run.

Additional code, of course, needs to be written in order to determine the standard errors of the parameter estimates.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.

Mixture Models and Maximum Likelihood Methods

October 7, 2009

In this post, I will highlight some of the technical issues that I encountered while trying to model the variability in the sizes of fish vertebrae from a midden deposit. As described elsewhere, my goal was to distinguish fish caught by nets from fish caught by other gear. The use of these gear types should produce different distributions of fish size. Mixture models are appropriate for cases where variability in a characteristic results from the combination of two or more different distributions. I fit mixture models to the data using the maximum likelihood method. This approach is common in modern statistics but has not been widely employed within archaeology.

The maximum likelihood method addresses the question: "what are the parameter values that make the observed data most likely to occur?" The parameter estimates are the values that maximize the likelihood. The likelihood for a particular set of parameter values is calculated from the product of the probabilities of observing each case in the data given those values. In practice, the log likelihood is usually calculated because the log probabilities can be summed; calculating the product of many small numbers can be computationally more difficult than summing their logs. The best parameter estimates have the highest likelihood value or, equivalently, the lowest negative log likelihood value. Algorithms that search the parameter space are typically used to find those values.

To use this method, a probability distribution has to be selected that is appropriate for the variability in the data. A simple linear regression, for example, is essentially a maximum likelihood analysis that assumes the data are normally distributed with a mean of μ = a + b*x and a standard deviation of σ. In this example, the maximum likelihood analysis finds the values of a, b, and σ that best account for the variability in the data.
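As a minimal sketch of this idea, using the bbmle package that appears elsewhere on this site, the following code fits a simple linear regression by maximum likelihood; x and y stand in for hypothetical data vectors.

#sketch: simple linear regression fit by maximum likelihood with bbmle
#x and y are hypothetical data vectors standing in for real observations
library(bbmle)

xy <- data.frame(x = x, y = y)

lm_as_mle <- mle2(y ~ dnorm(mean = a + b * x, sd = sigma),
    start = list(a = 0, b = 0, sigma = sd(xy$y)),
    data = xy)

coef(lm_as_mle)   #compare with coef(lm(y ~ x, data = xy))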

The application of a mixture model to my data on fish bone size provides another example of this approach. The mixture model that I used assumes that the size-frequency distribution of fish in each assemblage resulted from the combination of two lognormal distributions. Such distributions are appropriate to the data for a couple of reasons. First, lognormal distributions can have long right tails, and the histograms of caudal vertebra height for my assemblages also have long right tails.

[Figure: histograms of caudal vertebra height (mm) by level from the midden deposit]

Second, consider the average size of the modern fish species that are also found at archaeological sites in the region. The average sizes of these species also display a long right tail, as the following histogram shows.

[Figure: average live weight of modern fish species in the study region]

This distribution is probably not lognormal. Nevertheless, smaller fish species are clearly more common than larger fish. The distribution of fish sizes from which fishers obtained individual fish was likely affected by the abundance of these fish species, habitat, climate, and other factors. Better data on the modern distribution of fish size for my study area is not available, so this discussion will have to be sufficient for now.

I used the mixdist package for R to find the maximum likelihood estimates of the parameter values, including the proportion of fish in each of the two modes, the mean size of fish in each mode, and the standard deviation of each mode. This package uses a combination of a Newton-type method and the EM algorithm to arrive at those estimates. I also searched the parameter space directly, writing a simple program in R to loop over the plausible range of values for my parameters and find the best estimates.

The direct search of the parameter space proved to be the more informative approach. I could examine the results to see how the likelihood varied with the parameter values. This examination showed that the likelihood did not vary smoothly with the parameter values: wildly different combinations of parameter values had very similar likelihoods.

The problem turned out to be the large fish at the extreme end of the distributions in my assemblages. Too many of these fish occurred in the samples for the models to readily converge on parameter estimates. These difficulties were largely hidden when I used the mixdist package to fit the mixture models. The very large fish probably belong to a third mode and may therefore have been obtained in a different manner from the techniques used to acquire the fish in the two smaller modes. Once I removed these large fish from the analysis, the likelihood varied smoothly with the parameter values.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.