Posts Tagged ‘statistics in archaeology’

Identification of Lithic Reduction Strategies from Mixed Assemblages

November 11, 2013

This post is the first in a series that will try to characterize lithic debitage assemblages formed from more than one reduction strategy. The primary goals are to estimate the proportions of the various reduction strategies represented within these mixed assemblages and to quantify the uncertainty of these estimates. I plan to use mixture models and the method of maximum likelihood to identify the distinct components of such assemblages.

Brown (2001) suggests that the distribution of debitage size follows a power law. Power law distributions have the following probability density function:

f(x \vert \alpha) = Cx^{\alpha} ,

where C is a constant that normalizes the distribution, so that the density integrates to one. The value of C thus depends on the exponent \alpha (and on the range of sizes over which the distribution is defined).

Based on analysis of experimentally-produced assemblages, Brown further suggests that the exponent, \alpha , of these power law distributions varies among different reduction strategies. Thus, different reduction strategies produce distinctive debitage size distributions. This result could be very powerful, allowing reduction strategies from a wide variety of contexts to be characterized and distinguished. The technique used by Brown to estimate the value of the exponent, however, has some technical flaws.

Brown (2001) fits a linear regression to the relationship between the log of flake size grade and the log of the cumulative count of flakes in each size grade. In its favor, this approach seemingly reduces the effects of small sample sizes and can be easily replicated. The regression approach, however, produces biased estimates of the exponent and does not allow the fit of the power law model to be compared with the fit of other probability density functions.

Maximum likelihood estimates, using data on the size of each piece of debitage, produce more reliable estimates of the exponent of a power law. Maximum likelihood estimates can also be readily compared among different distributions fit to the data, to evaluate whether a power law is the best model to describe debitage size distributions. The next post will illustrate the use of the linear regression approach and the maximum likelihood approach on simulated data drawn from a power law distribution.
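
As a preview of that comparison, here is a minimal sketch of the maximum likelihood side: simulated sizes are drawn from a continuous power law, and the exponent is recovered with the standard closed-form estimator. The simulation values are arbitrary, and the exponent is written here with a positive sign (corresponding to a negative exponent in the pdf above).

set.seed(42)
n <- 1000
xmin <- 1                                 # smallest size recovered (e.g., the finest screen size)
alpha <- 2.5                              # true exponent; density proportional to x^(-alpha) for x >= xmin
x <- xmin * runif(n)^(-1 / (alpha - 1))   # inverse-CDF draw from the power law

# closed-form maximum likelihood estimate of the exponent
alpha_hat <- 1 + n / sum(log(x / xmin))
alpha_hat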

Reference cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

More Thoughts on Mound Size Variability

June 29, 2013

This post begins to explore additional patterning in mound size, refining some of my earlier observations and offering some hypotheses for evaluation. Suppose mound-building groups occupied stable territories over the span of several generations or longer. Within the territory held by such groups, they built burial mounds. Many burial mounds within a given area may thus have been produced by the same group or lineage. Under these circumstances, burial mounds located in close proximity should be more likely to be the product of a single group or lineage. If the group traits that influenced mound volume were also relatively stable through time, burial mounds located near to each other should be similar in size. As an initial attempt to evaluate these claims, I looked at the relationship in mound size between mounds that were nearest neighbors and between randomly-paired mounds.

Recall that most mounds have been affected by modern plowing and other disturbances, but some mounds have been largely spared such damage. The museum records that I used characterized these undamaged mounds as “whole”. The museum records documented 287 whole mounds. To make sure that the comparisons were fair, I limited the sample of nearest neighbors to just those whole mounds that had another whole mound as their nearest neighbor. I eliminated duplicate pairings, so each pair of nearest neighbors was only considered once. The imposition of these constraints shrank the nearest neighbor sample size to 49. Finally, I ran a simple linear regression to evaluate the relationship between the sizes of the mounds in these nearest neighbor pairings. Because the distribution of mound volume can be modeled as an exponential distribution, I used the log of mound volume in the regression analysis. Without this transformation, any relationship in mound size between the nearest neighbors would be unlikely to be well approximated by a straight line.

I then sampled without replacement from the 287 whole mounds to obtain 49 randomly-selected pairs. As with the nearest neighbors, I performed a simple linear regression, using log volume. I repeated this procedure 500 times. The repeated sampling and analysis allowed me to develop a null hypothesis for the values of the regression coefficients.
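
A minimal sketch of this resampling procedure follows; whole_mds and logVolume are hypothetical names for a data frame of the 287 whole mounds and the log of their volumes.

set.seed(1)
null_coefs <- t(replicate(500, {
  idx <- sample(nrow(whole_mds), 2 * 49)        # 98 distinct mounds per replicate
  pair1 <- whole_mds$logVolume[idx[1:49]]
  pair2 <- whole_mds$logVolume[idx[50:98]]
  coef(lm(pair2 ~ pair1))                       # intercept and slope for the random pairs
}))
colnames(null_coefs) <- c("intercept", "slope")
hist(null_coefs[, "slope"])                     # null distribution of the slope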

I expected that the randomly-selected pairs would not have a meaningful relationship. The slope of the regression line should be close to zero for these samples. In contrast, the size of the nearest neighbor pairs should be positively correlated, so the slope of the regression line should be significantly larger than zero. The following two figures show the distribution of the regression coefficients, the intercept and slope, for the randomly-selected pairs.

[Figures: distributions of the intercept and slope estimates for the randomly-selected pairs]

Notice, in particular, that the distribution of the slope clusters near zero as predicted. This result indicates that the randomly-selected pairs do not have a meaningful relationship with each other, at least with respect to size.

These distributions contrast with the regression coefficients calculated for the nearest neighbors. The intercept is 0.90, and the slope is 0.75. These values lie completely beyond the range of values estimated for the randomly-selected pairs. This experiment shows that the sizes of nearest-neighbor mounds are significantly and positively correlated. The results lend some support to the notion that stable groups produced these mounds. At the very least, the results provide encouragement to further explore the relationship between mound size and mound spatial distribution. Such work should probably make use of the spatial analysis tools available in GIS programs.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013.

On Monument Volume IV

April 29, 2013

This post evaluates burial mound volume, fitting various probability models to the data. As noted previously, the exponential distribution seems like an appropriate model for the mound volume data. This model is not the only possibility, of course, so I will also consider an alternative, the gamma distribution. The exponential distribution is a special case of the gamma distribution.

The gamma probability density function (pdf) is:

f(x\vert \alpha , \lambda) = \frac{\lambda ^{\alpha} x^{\alpha - 1}e^{- \lambda x}}{\Gamma (\alpha)} ,

where:

\alpha is the shape parameter,

\lambda is the rate parameter,

and \Gamma is the gamma function.

The gamma function is defined as:

\Gamma (\alpha) = \int_{0}^{\infty} t^{\alpha -1} e^{-t} dt

Depending on the parameter values, the graph of the gamma pdf can take a wide variety of shapes, including forms that resemble the bell-shaped curve of the normal distribution. The following illustration shows some of the possible variation.

[Figure: gamma probability density functions for a range of parameter values]
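
Curves of this sort can be drawn directly in R; the parameter values below are arbitrary.

curve(dgamma(x, shape = 1, rate = 1), from = 0, to = 10, ylab = "density")   # exponential-like shape
curve(dgamma(x, shape = 2, rate = 1), add = TRUE, lty = 2)                   # skewed hump
curve(dgamma(x, shape = 9, rate = 2), add = TRUE, lty = 3)                   # nearly bell-shaped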

To evaluate the relationship between mound volume and mound condition (plowed and whole) under the gamma and exponential distributions, I analyzed model fit using the maximum likelihood method. The following R code details the analysis.

library(bbmle)

# Gamma model, with Condition as a covariate on the rate parameter
mdvol_g.mle <- mle2(Mound.Volume ~ dgamma(shape = shape, rate = gvar),
  start = list(shape = 1, gvar = 1/mean(Allmds$Mound.Volume)),
  data = Allmds, parameters = list(gvar ~ Condition))
mdvol_g.mle

# Exponential model, with Condition as a covariate on the rate parameter
mdvol_e_cov.mle <- mle2(Mound.Volume ~ dexp(rate = avar),
  start = list(avar = 1/mean(Allmds$Mound.Volume)),
  data = Allmds, parameters = list(avar ~ Condition))
mdvol_e_cov.mle

# Exponential model without the covariate
mdvol_e.mle <- mle2(Mound.Volume ~ dexp(rate = bvar),
  start = list(bvar = 1/mean(Allmds$Mound.Volume)), data = Allmds)
mdvol_e.mle

In this code, Allmds refers to an R data frame containing the variables Mound.Volume and Condition. The code uses the maximum likelihood method to estimate the parameter values and to evaluate the fit of the gamma and exponential distributions to the data. I performed the analysis three times. In the first analysis, I fit the gamma distribution, using Condition as a covariate. In the second and third analyses, I fit the exponential distribution to the data, once with the covariate Condition and once without the covariate.

The models are “nested”. The gamma distribution reduces to the exponential distribution when the gamma’s shape parameter is set to one, and the exponential model without the covariate is a simplified version of the model with the covariate. Nested models can be compared with likelihood ratio tests (in R, via the anova function applied to the fitted models) to see whether the more complex model gives a significantly better fit to the data, justifying the extra complexity. A sketch of these comparisons appears below; the two tables that follow show the results of the analysis.
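
Assuming the fits above, the comparisons can be run with bbmle’s anova method for mle2 objects, which reports the likelihood ratio tests:

anova(mdvol_e.mle, mdvol_e_cov.mle)   # does adding the Condition covariate improve the exponential fit?
anova(mdvol_e_cov.mle, mdvol_g.mle)   # does the gamma improve on the exponential with the covariate?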

[Tables: results of the maximum likelihood analysis and the model comparisons]

The initial results suggest that the exponential distribution with the covariate provides a significantly better fit to the data than the simpler model without the covariate. The gamma distribution does not provide a significantly better fit. Notice that the gamma’s shape parameter is estimated to be one, which reduces the gamma to the exponential distribution.

From this preliminary analysis, I offer the following conclusions. The exponential distribution appears to be an appropriate model for mound volume. In addition, plowed mounds may be distinctly smaller than whole mounds, contradicting my initial hypothesis. In subsequent posts, I will consider some archaeological implications and address some additional considerations that may help to explain these results.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013.

On Monument Volume III

April 20, 2013

For my study area, the distribution of burial mound volume for plowed and whole mounds looks similar. This distribution is also quite different from the normal distribution that characterizes so many traits in the natural world. The distribution of burial mound volume resembles the form of an exponential distribution. Exponential distributions have a peak at the extreme left end of the distribution and decline steadily and rapidly from that point. The exponential distribution has a single parameter, the rate, typically denoted by \lambda . The following function gives the probability density (sometimes called the pdf) of the exponential distribution.

f(x\vert \lambda) = \lambda e^{-\lambda x}

The pdf defines a curve. For a continuous distribution such as the exponential, the area under this curve between two values of x gives the probability that a sample falls within that interval. The following illustration depicts this relationship. In the illustration, the shaded area under the curve represents the probability of a given sample falling between the two values of x.

[Figure: exponential probability density function with the area between two values of x shaded]
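
In R, this area can be computed directly from the cumulative distribution function; the values below are arbitrary.

lambda <- 1/500
a <- 200; b <- 600
pexp(b, rate = lambda) - pexp(a, rate = lambda)        # probability that a sample falls between a and b
integrate(dexp, lower = a, upper = b, rate = lambda)   # the same area by numerical integration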

As a check on my intuition regarding the applicability of the exponential distribution, I generated a random sample of 2000 from an exponential distribution with a mean of 500. The following figure shows what such a distribution may look like. The simulation does not provide definitive proof, but it may nevertheless indicate whether a more rigorous analysis that employs the exponential distribution is worth pursuing.

[Figure: histogram of 2,000 values simulated from an exponential distribution with a mean of 500]
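
A simulation along these lines can be reproduced with a few lines of R (the seed and histogram settings are arbitrary):

set.seed(7)
sim_vol <- rexp(2000, rate = 1/500)   # 2,000 draws from an exponential distribution with mean 500
hist(sim_vol, breaks = 50, xlab = "simulated volume", main = "")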

At least superficially, the histogram of the simulation results resembles the histograms of mound volume shown in the previous post. This simulation did not produce the apparent outliers seen in the mound data, but the resemblance suggests that burial mound volume can be modeled with an exponential distribution. I thus modeled mound volume with an exponential distribution, using mound condition (plowed or whole) as a covariate. I performed this analysis in R with the bbmle package. In the next post, I’ll present the code and initial results.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013.

On Monument Volume II

April 10, 2013

The previous post suggested that mound shape could be modeled as a spherical cap. I then proposed that the shape of those mounds may change through time, due to weathering and repeated plowing by modern agricultural equipment, but mound volume might remain the same. As illustrated in the following figure, mounds might become shorter but wider as they are weathered and plowed. In the figure, A represents the original mound shape, while B reflects mound shape after weathering and plowing. The height, h, has decreased over time, while the radius, a, has increased.

[Figure: original mound shape (A) and mound shape after weathering and plowing (B), showing height h and radius a]
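
For reference, a spherical cap of height h and basal radius a has volume:

V = \frac{\pi h}{6} (3a^{2} + h^{2})

If the recorded mound width is taken as the diameter of the base, then a is half of that width.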

Other hypotheses are possible, but I will evaluate this scenario first.

I have compiled museum data on mound condition and mound size for all recorded mounds in my study region. The museum records characterize mound condition as either “whole” or “plowed”. The records did not disclose the basis for this characterization. These records also document mound height and width. For each mound, I calculated a volume, assuming that mound shape resembles a spherical cap. The following two histograms illustrate the distribution of mound volume for plowed mounds and for whole mounds.

[Figures: histograms of mound volume for plowed mounds and for whole mounds]

As you can see, the distributions of mound size for plowed and whole mounds look very similar. A few outliers may occur at the right tail of both distributions. These outliers represent unusually large mounds. The similarity of the histograms suggests that a single probability distribution could be used to model monument volume. The next post will evaluate monument volume more rigorously.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013.

More on Mixture Models, Maximum Likelihood, and Direct Search Methods

January 15, 2010

This post describes some issues that I encountered while trying to calculate likelihood values using the direct search approach. Direct search methods calculate the likelihood value for each specified parameter combination and compare that value to the values of all other specified combinations. This approach has the advantage of simplicity, but it requires careful consideration of the parameter values for which likelihood values are to be calculated and compared.

Consider a model with three parameters. A likelihood value can be calculated for any particular combination of parameter values. By systematically varying the three parameters, a picture of the likelihood surface over the parameter space can be built up, showing how the likelihood responds to changes in the parameter values. Local peaks or optima in the likelihood may exist in addition to the global maximum, which occurs at the combination of parameter values that constitutes the maximum likelihood estimates for those parameters. Under certain conditions, direct search methods can avoid being fooled by these local optima.

Two conditions must be satisfied to avoid settling at local optima. First, likelihood values should be calculated at sufficiently narrow intervals of the values of each parameter. Narrow intervals ensure that the calculation of likelihood values does not skip over parameter values that are at or close to the maximum likelihood estimates. In addition, the range of values explored for each parameter must be sufficiently broad to encompass the maximum likelihood estimates.

The direct search approach often requires that a balance be struck between the precision of the search and its computational cost. Choosing narrow search intervals and a broad range of values over which to search provides greater opportunities to find the maximum likelihood estimates or values close to them. The cost of employing narrow search intervals and a broad range of values is computing time. Narrowing the interval or broadening the range increases the number of parameter value combinations for which likelihood values must be computed, slowing the process of finding the maximum likelihood estimates. Searching over too broad a range of parameter values also risks other potential problems.

My work modeled fish vertebrae size-frequency data as a mixture of two lognormal distributions. For my data, the direct search method would sometimes find that the maximum likelihood estimates included a distribution with a very small log standard deviation (less than 0.05). Such estimates occurred when one distribution represented a small proportion of the mixture distribution (less than 0.25). For convenience of reference, I have termed these fitted distributions “vestigial” distributions; they are part of the mixture distribution but do not add much information to it.

These results seem analogous to the “overfitting” that occurs when a model with an unnecessarily large number of parameters is fit to the data. Models with large numbers of parameters will likely fit the current data very well but will be unable to fit future data sets accurately. Some of the model terms are likely to be accommodating noise in the data rather than the process of interest. Similarly, cases where the maximum likelihood results include a vestigial distribution may indicate that the vestigial distribution is primarily accommodating noise in the data. The resulting estimates for the mixture distribution do fit the data. My suspicion, however, is that this mixture distribution would fit another sample from the same population poorly.

Results that included these vestigial distributions also seem unrealistic. Under what circumstances would one portion of the mixture distribution have such a small log standard deviation that all the individuals from that distribution are essentially the same size? Such occurrences could result if the data were not generated by independent processes. This circumstance could apply to my fish vertebrae data if I were not successful in screening out multiple vertebrae from the same individual fish from my data set.

The vestigial distributions were most likely to be generated when modeling data sets with relatively small sample sizes. This pattern suggests that they do derive from noise in the data. I would otherwise have a hard time explaining variability in the occurrence of these vestigial distributions. For most of my assemblages, both components of the mixture distributions had relatively large log standard deviations.

I thus constrained the direct search to only consider distributions with a log standard deviation greater than 0.07 (one way to impose this constraint on the search grid is sketched below). With this constraint in place, the mixture distribution model produced results that seemed reasonable and realistic. In most cases, the mixture models produced a significantly better fit to the data than a single lognormal distribution. The exceptions occurred, as might be expected, in the case of assemblages with relatively small sample sizes.
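
In terms of the search grids used in the code in the November 2009 post below, one simple way to impose the cutoff is to start the standard deviation sequences above it:

sd1vec <- seq(0.08, 0.44, by = 0.03)   # log standard deviation values restricted to > 0.07
sd2vec <- seq(0.08, 0.44, by = 0.03)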

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2010.

Mixture Models, Maximum Likelihood Methods, and Confidence Intervals

December 19, 2009

In an earlier post, I noted that the parameter estimates for a mixture model supplied by maximum likelihood methods were only part of the story. A measure of the precision of those estimates is also needed. Confidence intervals provide this measure. This post details a method for determining them.

R packages for the analysis of mixture models, such as mixdist, generate confidence intervals automatically. The direct search approach, however, has proven more reliable for the data sets that I have been examining. In a direct search, the likelihood is calculated for each combination of parameter values over a plausible range of those values.

The likelihood value of each combination is calculated by looping over a sequence of parameter values for each parameter. The interval between the values of a parameter used in the calculations should be relatively small. When small intervals are used, however, the number of combinations of parameter values for which likelihood values must be calculated increases rapidly. Direct search of the parameter space may not be practical for some applications. The direct search approach requires that a balance be struck between precision and manageability.

I have provided some R code that can be used, with some additional work, to generate confidence intervals for the parameters of a simple mixture model. The mixture model assumes that the data are drawn from a mixture of two lognormal distributions. Confidence intervals for the proportion of cases in the first component can be determined from the results produced by the code as written; the code can be modified to generate results for the other parameters. The code follows:

(more…)
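
The full listing sits behind the link above. As a rough sketch of the general idea (not the code from that post), an approximate 95 percent confidence interval for the mixing proportion can be read off a direct-search grid by profiling: for each candidate proportion, minimize the negative log likelihood over the remaining parameters, and keep the proportions whose profile value lies within qchisq(0.95, 1)/2, roughly 1.92, of the overall minimum. Here vcdata is the data vector used in the search code in the post below, and the starting values are arbitrary but plausible.

nll <- function(theta, p, x) {
  m1 <- theta[1]; s1 <- theta[2]; m2 <- theta[3]; s2 <- theta[4]
  if (s1 <= 0 || s2 <= 0) return(1e10)   # keep the search away from invalid values
  val <- -sum(log(p * dlnorm(x, meanlog = m1, sdlog = s1) +
                  (1 - p) * dlnorm(x, meanlog = m2, sdlog = s2)))
  if (is.finite(val)) val else 1e10
}

pvec <- seq(0.10, 0.99, by = 0.015)
profile_nll <- sapply(pvec, function(p) {
  f <- function(theta) nll(theta, p, vcdata)
  optim(c(1.0, 0.2, 1.45, 0.2), f)$value   # profile: minimize over the other four parameters
})

# proportions whose profile negative log likelihood lies within qchisq(0.95, 1)/2 of the minimum
keep <- profile_nll <= min(profile_nll) + qchisq(0.95, 1) / 2
range(pvec[keep])   # approximate 95% confidence interval for the proportion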

More Thoughts on Mixture Models and Maximum Likelihood Methods

November 11, 2009

In my previous post on this topic, I discussed two techniques for finding the combination of mixture distribution parameters with the lowest negative log likelihood: direct search and the mixdist package for R. I suggested that a direct search of the parameter space allowed the effects of outliers in my data to be identified more clearly. I have done additional work since that time comparing direct search and the mixdist package, and I have concluded that direct search is much more effective at finding the optimal combination of parameter values.

Mixdist returned parameter values that consistently produced higher negative log likelihoods (that is, worse fits) than I found using direct search. The differences were substantial. I cannot fully explain them, but they were consistent across all of my data sets.

Direct search of the parameter space is obviously not the most convenient approach. My model involved only two lognormal distributions with a total of five parameters. Direct search of the optimal parameters for more complex mixture models may not be feasible, as the number of parameter value combinations that need to be searched is too large. For simple mixture models, however, the direct search may be preferable.

The following code is the very simple program that I wrote in R to find the lowest negative log likelihood and the corresponding parameter values. The program was written specifically for a mixture model of two lognormal distributions. The search space covered the likely range of the log mean and log standard deviation of fish vertebra height for each of the two distributions, along with the mixing proportion. I am not a professional programmer, as you will see, so any suggestions for improving and extending the code will be greatly appreciated.

# vcdata holds the data (fish vertebra heights) for which the parameter estimates
# and likelihood will be calculated
vcdata <- c(…your data here…)

# Sequences of candidate parameter values over which the program loops:
# pvec: proportion of vertebrae in the mode of smaller (net-caught) fish
pvec <- seq(0.10, 0.99, by = 0.015)
# mean1vec, sd1vec: log mean and log standard deviation of vertebra height in the
# mode of smaller (net-caught) fish
mean1vec <- seq(0.80, 1.20, by = 0.03)
sd1vec <- seq(0.02, 0.44, by = 0.03)
# mean2vec, sd2vec: log mean and log standard deviation of vertebra height in the
# mode of larger (hook- or spear-caught) fish
mean2vec <- seq(1.25, 1.70, by = 0.03)
sd2vec <- seq(0.02, 0.44, by = 0.03)

# loglike holds the negative log likelihood, initialized to a large value; p, mean1,
# sd1, mean2, and sd2 store the corresponding parameter values
loglike <- 100000
p <- 0
mean1 <- 0
sd1 <- 0
mean2 <- 0
sd2 <- 0

# result holds the negative log likelihood and parameter values for the current
# combination; finalresult stores the combination with the smallest negative log
# likelihood seen so far
result <- data.frame(loglike, p, mean1, sd1, mean2, sd2)
finalresult <- data.frame(loglike, p, mean1, sd1, mean2, sd2)

# loop over the combinations of parameter values
for (j in 1:length(mean1vec)) {
  for (k in 1:length(pvec)) {
    for (m in 1:length(sd1vec)) {
      for (n in 1:length(mean2vec)) {
        for (q in 1:length(sd2vec)) {

          # negative log likelihood of the two-lognormal mixture at this combination
          # of parameter values
          L <- -sum(log(pvec[k] * dlnorm(vcdata, meanlog = mean1vec[j], sdlog = sd1vec[m]) +
            (1 - pvec[k]) * dlnorm(vcdata, meanlog = mean2vec[n], sdlog = sd2vec[q])))

          result$loglike <- L
          result$p <- pvec[k]
          result$mean1 <- mean1vec[j]
          result$sd1 <- sd1vec[m]
          result$mean2 <- mean2vec[n]
          result$sd2 <- sd2vec[q]

          # keep the lowest negative log likelihood and the corresponding combination
          # of parameter values seen up to this point
          if (L < finalresult$loglike) finalresult <- result
        }
      }
    }
  }
}

finalresult

When I employed the direct search, I ran it twice for each data set. The first time, I looped over a wider range of parameter values. The sequence of parameter values searched within each variable was spaced sufficiently far apart so the direct search would not bog down. The one exception was the proportion of fish vertebrae in each mode. For this variable, the step-size between the values of proportion for which I calculated the log likelihood was fairly small from the start. Experience with my data sets showed that this variable had the biggest effect on the log likelihood.

The second time that I ran the direct search for each data set, I focused on a narrower range of parameter values. The range of parameter values that I searched centered around the values found in the initial run. The sequence of values searched within each variable was spaced closer together in the second run.
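
As an illustration (the ranges and step sizes here are arbitrary), the second pass can re-center each sequence on the first-pass estimates stored in finalresult and then rerun the nested loops above:

pvec     <- seq(max(finalresult$p - 0.03, 0.01), min(finalresult$p + 0.03, 0.99), by = 0.005)
mean1vec <- seq(finalresult$mean1 - 0.06, finalresult$mean1 + 0.06, by = 0.01)
sd1vec   <- seq(max(finalresult$sd1 - 0.06, 0.01), finalresult$sd1 + 0.06, by = 0.01)
mean2vec <- seq(finalresult$mean2 - 0.06, finalresult$mean2 + 0.06, by = 0.01)
sd2vec   <- seq(max(finalresult$sd2 - 0.06, 0.01), finalresult$sd2 + 0.06, by = 0.01)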

Additional code, of course, needs to be written in order to determine the standard errors of the parameter estimates.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.

Identifying and Explaining Intensification in Prehistoric Fishing Practices IX: Model Evaluation and Parameter Estimates

October 16, 2009

The previous post in this series introduced the use of mixture models to fit two lognormal distributions to my data on the frequencies of fish caudal vertebra sizes. I have also discussed some technical issues that I encountered while fitting the models. This post presents some results. The following table gives the estimated parameter values for the two lognormal distributions, including the proportion of the assemblage made up of fish from each distribution, the log means of the two distributions, and the log standard deviations of the two distributions.

[Table: Parameter Estimates for Mixture Models]

The log mean and log standard deviation parameters describe the distribution of caudal vertebra height for the fish bone in the modes of smaller (net-caught) and larger (hook- or spear-caught) fish. The estimates of all the parameters also have associated standard errors, but I am still calculating those errors. They will be reported in a later installment in the series.

A future post in the series will also present several theories by which these estimates might be interpreted. For now, I will make a few general observations. Note that the standard deviation of the distribution of smaller fish is consistently greater than the standard deviation of the distribution of larger fish. The standard deviation is scale-dependent, but I can offer an explanation for this observation without standardizing the standard deviations. As explained elsewhere, nets should capture both small and large fish. Small fish were likely more common than large fish. Thus, the distribution of net-caught fish should have a small mean and a relatively large standard deviation. The distribution of fish caught by hook or by spear should have a larger mean and a smaller standard deviation. Hooks and spears would not be likely to catch fish smaller than some threshold size. The estimates support these assumptions. The estimated proportion of net-caught fish is also consistently higher than the proportion of fish caught with other gear. Some variability exists, however, and this variability may be significant.

Having fit the models, I must now resolve another issue: whether these models provide an appropriate fit to the data. In particular, I need to evaluate whether a simpler model might also explain the observed patterns. In this case, a simpler model than the mixture of two lognormal distributions would be a single lognormal distribution. I fit such single lognormal distributions to the size data from each assemblage using the maximum likelihood method. The negative log likelihoods of the mixture models were consistently lower (showing that they fit the data better) than the negative log likelihoods of the single lognormal distribution models, but this result is not surprising.
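
For a single assemblage, the single-distribution comparison is straightforward, because the maximum likelihood estimates of a lognormal have a closed form; vcdata here stands for the vertebra heights of one assemblage, as in the search code in the post above.

mlog <- mean(log(vcdata))                    # ML estimate of the log mean
slog <- sqrt(mean((log(vcdata) - mlog)^2))   # ML estimate of the log standard deviation (n denominator)
nll_single <- -sum(dlnorm(vcdata, meanlog = mlog, sdlog = slog, log = TRUE))
nll_single                                   # negative log likelihood of the single lognormal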

Models with many parameters can generally be made to fit data better than models with fewer parameters. Models with fewer parameters, however, should generally be preferred to models with more parameters, following the principle that simpler explanations are better than more complex explanations. Models with many parameters may also be worse at predicting the variability in new data sets. In essence, more complex models may be finely tuned to match the particular, random factors that affected one data set. The next data set will have been affected by those random factors differently. Thus, a simpler model that does not try to “explain” random variation may often do better at predicting additional data. Such models focus on the deterministic factors that pattern variation. These observations motivate a formal test of whether the improved fit of the mixture model justifies its additional parameters.

The likelihood ratio test provides a way to compare “nested” models. Models are nested when more complex models can be reduced to simpler models by setting parameters to particular values. In the case of my fish data, I can reduce my mixture model of two lognormal distributions to a single lognormal distribution by setting p=1 or p=0. Recall that p is the proportion of fish in the assemblage that derive from the distribution of mainly smaller (and presumably net-caught) fish.

As the name implies, the likelihood ratio test compares the likelihood values of a complex model and a simpler model. By Wilks’s theorem, minus two times the log of the likelihood ratio (equivalently, twice the difference in log likelihoods) approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the models being compared. Using this result, I want to know whether the improvement in fit provided by the more complex model is large enough to justify the added complexity. A sketch of the calculation appears below; the table that follows it shows the results of this analysis for my mixture models and the corresponding single lognormal distribution models.
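
Continuing the sketch above, the test statistic is twice the difference in negative log likelihoods between the single lognormal and the mixture. With five parameters in the mixture and two in the single lognormal, the nominal degrees of freedom are three; finalresult$loglike is the mixture’s minimized negative log likelihood from the search code in the earlier post.

lr_stat <- 2 * (nll_single - finalresult$loglike)   # likelihood ratio statistic
pchisq(lr_stat, df = 3, lower.tail = FALSE)         # p-value with df = 5 - 2 parameters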

[Table: Likelihood Ratio Test Results for Mixture Model and Single Lognormal Distribution]

The results provide some support for the use of the mixture models on my data. Many of the p-values for my likelihood ratio tests exceed the arbitrary 0.05 value often employed in studies, although some are lower than this value. Notice that the p-values are generally lower when the sample size is higher. The following scatterplot illustrates this relationship.

[Figure: Sample size of fish bone and p-value for likelihood ratio tests]

P-values often reflect such sample size effects. In addition, no universal threshold exists at which a p-value can be said to be truly “significant”. For these reasons, I am comfortable applying the mixture models to all of my assemblages. The mixture models seem sufficiently better at explaining the variability among all of the assemblages to justify the added complexity. I intend to use the mixture models to determine the importance of net-caught fish in each assemblage.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.

Mixture Models and Maximum Likelihood Methods

October 7, 2009

In this post, I will highlight some of the technical issues that I encountered while trying to model the variability in the sizes of fish vertebrae from a midden deposit. As described elsewhere, my goal was to distinguish fish caught by nets from fish caught by other gear. The use of these gear types should produce different distributions of fish size. Mixture models are appropriate for cases where variability in a characteristic results from the combination of two or more different distributions. I fit mixture models to the data using the maximum likelihood method. This approach is common in modern statistics but has not been widely employed within archaeology.

The maximum likelihood method addresses the question: “what parameter values make the observed data most likely to occur?” The likelihood of a particular set of parameter values is the product of the probabilities of observing each case in the data, given those values. In practice, the log likelihood is usually calculated instead, because log probabilities can be summed; computing the product of many small numbers is numerically more troublesome than summing their logs. The best parameter estimates are those with the highest likelihood or, equivalently, the lowest negative log likelihood. Algorithms that search the parameter space are typically used to find those values.

To use this method, a probability distribution has to be selected that is appropriate for the variability in the data. A simple linear regression, for example, is essentially a maximum likelihood analysis that assumes the data are normally distributed with a mean of μ = a + bx and a standard deviation of σ. In this example, the maximum likelihood analysis finds the values of a, b, and σ that best account for the variability in the data.
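
A minimal sketch of that example, using simulated data with known values of a, b, and σ:

set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1.5)   # simulated data with a = 2, b = 0.5, sigma = 1.5

negll <- function(par, x, y) {
  a <- par[1]; b <- par[2]; sigma <- par[3]
  if (sigma <= 0) return(1e10)           # keep the search in the valid region
  -sum(dnorm(y, mean = a + b * x, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 1), negll, x = x, y = y)
fit$par          # maximum likelihood estimates of a, b, and sigma
coef(lm(y ~ x))  # least-squares estimates of a and b, for comparison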

The application of a mixture model to my data on fish bone size provides another example of this approach. The mixture model that I used assumes that the size-frequency distribution of fish in each assemblage results from the combination of two lognormal distributions. Such distributions are appropriate to the data for a couple of reasons. First, lognormal distributions can have long right tails, and the histograms of caudal vertebra height for my assemblages also have long right tails.

[Figure: Histograms of Caudal Vertebra Height (mm) by Level from the Midden Deposit]

Second, consider the average size of those modern fish species that are also found at archaeological sites in the region. A histogram of the average size of these fish also displays a long tail to the right as shown in the following histogram.

[Figure: Average live weight of modern fish species in the study region]

This distribution is probably not lognormal. Nevertheless, smaller fish species are clearly more common than larger fish. The distribution of fish sizes from which fishers obtained individual fish was likely affected by the abundance of these fish species, habitat, climate, and other factors. Better data on the modern distribution of fish size for my study area is not available, so this discussion will have to be sufficient for now.

I used the mixdist package for R to find the maximum likelihood estimates of the parameter values, including the proportion of fish in each of the two modes, the mean size of fish in each mode, and the standard deviation of each mode. This package combines the EM algorithm with a Newton-type method to arrive at those estimates. I also searched the parameter space directly, writing a simple program in R to loop over the plausible range of values for my parameters and find the best parameter estimates.

The direct search of the parameter space proved to be the most informative approach. I could examine the results to see how the likelihood varied with the parameter values. Examination of this variation showed that the likelihood was not varying smoothly with the parameter values: wildly different combinations of parameter values had very similar likelihoods.

The problem turned out to be the large fish at the extreme end of the distributions in my assemblages. Too many of these fish occurred in the samples for the models to readily converge on parameter estimates. These difficulties were largely hidden when I used the mixdist package to fit the mixture models. The very large fish probably belong to a third mode and may therefore have been obtained in a different manner from the techniques used to acquire the fish in the two smaller modes. Once I removed these large fish from the analysis, the likelihood varied smoothly with the parameter values.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.