## Posts Tagged ‘middle-range theory’

### Identification of Lithic Reduction Strategies from Mixed Assemblages

November 11, 2013

This post is the first in a series that will try to characterize lithic debitage assemblages formed from more than one reduction strategy. The primary goals are to estimate the proportions of the various reduction strategies represented within these mixed assemblages and to quantify the uncertainty of these estimates. I plan to use mixture models and the method of maximum likelihood to identify the distinct components of such assemblages.

Brown (2001) suggests that the distribution of debitage size follows a power law. Power law distributions have the following probability density function:

$f(x\vert \alpha) = Cx^{-\alpha}$,

where C is a constant that normalizes the distribution, so that the density integrates to one. For a given minimum size, the value of C thus depends on the exponent $\alpha$.

Based on analysis of experimentally-produced assemblages, Brown further suggests that the exponent, $\alpha$, of these power law distributions varies among different reduction strategies. Thus, different reduction strategies produce distinctive debitage size distributions. This result could be very powerful, allowing reduction strategies from a wide variety of contexts to be characterized and distinguished. The technique used by Brown to estimate the value of the exponent, however, has some technical flaws.

Brown (2001) fits a linear regression to the relationship between the log of flake size grade and the log of the cumulative count of flakes in each size grade. In its favor, this approach seemingly reduces the effects of small sample sizes and can be easily replicated. On the other hand, the regression approach produces biased estimates of the exponent and does not allow the fit of the power law model to be compared to other probability density functions.

Maximum likelihood estimates, using data on the size of each piece of debitage, produce more reliable estimates of the exponent of a power law. Maximum likelihood estimates can also be readily compared among different distributions fit to the data, to evaluate whether a power law is the best model to describe debitage size distributions. The next post will illustrate the use of the linear regression approach and the maximum likelihood approach on simulated data drawn from a power law distribution.
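As a rough sketch of the maximum likelihood idea (not Brown's method, and not necessarily the analysis in the next post), the following R code simulates sizes from a continuous power law and recovers the exponent. It assumes the density is proportional to $x^{-\alpha}$ for sizes at or above a known minimum, with $\alpha > 1$. The names `alpha_true` and `x_min` and the simulated data are hypothetical.

```r
# Minimal sketch: maximum likelihood estimate of a power-law exponent.
# Assumes a continuous power law, f(x) proportional to x^(-alpha), for x >= x_min.
set.seed(1)
alpha_true <- 2.5   # hypothetical exponent
x_min <- 1          # hypothetical smallest recorded size
n <- 5000

# Inverse-transform sampling from the power law
u <- runif(n)
x <- x_min * (1 - u)^(-1 / (alpha_true - 1))

# Closed-form maximum likelihood estimate and its approximate standard error
alpha_hat <- 1 + n / sum(log(x / x_min))
se_hat <- (alpha_hat - 1) / sqrt(n)
```

The estimate uses every individual size rather than binned counts, which is the main practical difference from the regression approach.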

Reference cited

Brown, Clifford T.
2001 The Fractal Dimensions of Lithic Reduction. Journal of Archaeological Science 28: 619-631.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2013

### On Monument Volume IV

April 29, 2013

This post evaluates burial mound volume, fitting various probability models to the data. As noted previously, the exponential distribution seems like an appropriate model to fit to the mound volume data. This model is not the only possibility, of course, so I will also consider an alternative, the gamma distribution. The exponential distribution is a simplified version of the gamma distribution.

The gamma probability density function (pdf) is:

$f(x\vert \alpha , \lambda) = \frac{\lambda ^{\alpha} x^{\alpha - 1}e^{- \lambda x}}{\Gamma (\alpha)}$,

where:

$\alpha$ is the shape parameter,

$\lambda$ is the rate parameter,

and $\Gamma$ is the gamma function.

The gamma function is defined as follows:

$\Gamma (\alpha) = \int_{0}^{\infty} t^{\alpha -1} e^{-t} dt$

Depending on the parameter values, the graph of the gamma pdf can take a wide variety of shapes, including forms that resemble the bell-shaped curve of the normal distribution. The following illustration shows some of the possible variation.
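A short sketch in R can make this variation concrete. The shape values below are arbitrary choices for illustration, not values estimated from the mound data.

```r
# Evaluate the gamma density for a few hypothetical shape values
x <- seq(0.1, 10, by = 0.1)
rate <- 1
shapes <- c(0.5, 1, 2, 5)
dens <- sapply(shapes, function(a) dgamma(x, shape = a, rate = rate))
colnames(dens) <- paste0("shape_", shapes)

# With shape = 1 the gamma density matches the exponential density
same_as_exp <- all.equal(dens[, "shape_1"], dexp(x, rate = rate))

# matplot(x, dens, type = "l")  # uncomment to draw the curves
```

Shapes below one pile density near zero, a shape of one gives the exponential curve, and larger shapes produce a hump that moves to the right.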

To evaluate the relationship between mound volume and mound condition (plowed and whole) under the gamma and exponential distributions, I analyzed model fit using the maximum likelihood method. The following R code details the analysis.

>library(bbmle)
>mdvol_g.mle=mle2(Allmds$Mound.Volume~dgamma(shape=shape, rate = gvar), start=list(shape = 1, gvar = 1/mean(Allmds$Mound.Volume)), data=Allmds, parameters = list(gvar~Allmds$Condition))
>mdvol_g.mle

>mdvol_e_cov.mle=mle2(Allmds$Mound.Volume~dexp(rate = avar), start=list(avar = 1/mean(Allmds$Mound.Volume)), data=Allmds, parameters = list(avar~Allmds$Condition))
>mdvol_e_cov.mle

>mdvol_e.mle= mle2(Allmds$Mound.Volume~dexp(rate = bvar), start=list(bvar = 1/mean(Allmds$Mound.Volume)), data=Allmds)
>mdvol_e.mle

In this code, Allmds refers to an R data frame containing the variables Mound.Volume and Condition. The code uses the maximum likelihood method to evaluate the fit of the gamma and exponential distributions to the data and to estimate parameter values. I performed the analysis three times. In the first analysis, I fit the gamma distribution, using Condition as a covariate. In the second and third analyses, I fit the exponential distribution to the data, once with the covariate Condition and once without the covariate.

The models are “nested”. The gamma distribution can be reduced to the exponential distribution by setting the gamma’s shape parameter to one. The exponential model without the covariate is a simplified version of the model with the covariate. Nested models can be compared using a likelihood ratio test (which the anova function reports for models fit with bbmle) to see whether the more complex model gives a significantly better fit to the data, justifying the extra complexity. The following two tables show the results of the analysis.
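The logic of comparing nested models can be sketched in base R without the mound data. The example below uses simulated volumes (hypothetical) and no covariate, so it illustrates the comparison of the gamma with the exponential rather than reproducing the analysis above.

```r
# Sketch of a likelihood ratio test for nested models, using simulated data
set.seed(2)
vol <- rexp(200, rate = 1 / 500)   # hypothetical mound volumes

# Gamma negative log likelihood; parameters are on the log scale to stay positive
nll_gamma <- function(par) {
  -sum(dgamma(vol, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
}
# Start the search at the exponential solution (shape = 1, rate = 1 / mean)
fit_gamma <- optim(c(0, log(1 / mean(vol))), nll_gamma, method = "BFGS")
shape_hat <- exp(fit_gamma$par[1])

# The exponential maximum likelihood estimate has a closed form: rate = 1 / mean
nll_exp <- -sum(dexp(vol, rate = 1 / mean(vol), log = TRUE))

# Twice the difference in log likelihood, compared to chi-square with 1 df
lrt_stat <- 2 * (nll_exp - fit_gamma$value)
p_value <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
```

Because the simulated data really are exponential, the estimated shape should land near one and the test should usually fail to favor the gamma.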

The initial results suggest that the exponential distribution with the covariate provides a significantly better fit to the data than the simpler model without the covariate. The gamma distribution does not provide a significantly better fit. Notice that the gamma’s shape parameter is estimated to be one, which reduces the gamma to the exponential distribution.

From this preliminary analysis, I offer the following conclusions. The exponential distribution appears to be an appropriate model for mound volume. In addition, plowed mounds may be distinctly smaller than whole mounds, contradicting my initial hypothesis. In subsequent posts, I will consider some archaeological implications and address some additional considerations that may help to explain these results.


### On Monument Volume III

April 20, 2013

For my study area, the distribution of burial mound volume for plowed and whole mounds looks similar. This distribution is also quite different from the normal distribution that characterizes so many traits in the natural world. The distribution of burial mound volume resembles the form of an exponential distribution. Exponential distributions have a peak at the extreme left end of the distribution and decline steadily and rapidly from that point. The exponential distribution has a single parameter, the rate, typically denoted by $\lambda$. The following function gives the probability density (sometimes called the pdf) of the exponential distribution.

$f(x\vert \lambda) = \lambda e^{-\lambda x}$

The pdf defines a curve. For a continuous distribution such as the exponential distribution, the area under this curve over an interval along the x-axis gives the probability that a sample takes on a value within that interval. The following illustration depicts these relationships. In the illustration, the shaded area under the curve represents the probability of a given sample falling between the two values of x.
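The relationship between area and probability can be checked numerically. In this sketch the rate and the two interval endpoints are hypothetical values chosen only for illustration.

```r
# Probability that an exponential variable falls between two values
rate <- 1 / 500          # hypothetical rate (a mean of 500)
lower <- 200
upper <- 800

# Area under the pdf by numerical integration
area <- integrate(dexp, lower = lower, upper = upper, rate = rate)$value

# The same probability from the cumulative distribution function
prob <- pexp(upper, rate = rate) - pexp(lower, rate = rate)
```

Both calculations give a probability of roughly 0.47 for this interval.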

As a check on my intuition regarding the applicability of the exponential distribution, I generated a random sample of 2000 from an exponential distribution with a mean of 500. The following figure shows what such a distribution may look like. The simulation does not provide definitive proof, but it may nevertheless indicate whether a more rigorous analysis that employs the exponential distribution is worth pursuing.
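A simulation of this kind takes only a few lines of R. The sketch below is one way to draw such a sample; the seed and the number of histogram breaks are arbitrary choices.

```r
# Draw 2000 values from an exponential distribution with a mean of 500
set.seed(3)
sim <- rexp(2000, rate = 1 / 500)   # the rate is the reciprocal of the mean

# hist(sim, breaks = 40)  # uncomment to view the histogram

# The maximum likelihood estimate of the rate is the reciprocal of the sample mean
rate_hat <- 1 / mean(sim)
```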

At least superficially, the histogram of the simulation results resembles the histograms of mound volume shown in the previous post. This simulation did not produce the apparent outliers seen in the mound data, but the resemblance suggests that burial mound volume can be modeled with an exponential distribution. I thus modeled mound volume with an exponential distribution, using mound condition (plowed or whole) as a covariate. I performed this analysis in R with the bbmle package. In the next post, I’ll present the code and initial results.


### Identifying and Explaining Intensification in Prehistoric Fishing Practices X: Quantifying the Relative Importance of Netted Fish

October 23, 2009

In the previous post in this series, I showed that mixture models of two lognormal distributions provide a reasonable fit to the size-frequency data for fish caudal vertebrae from my midden assemblages. I have interpreted these distributions as the result of the use of nets and other gear types. Nets should take smaller fish than other gear types, although both nets and other gear types may also take large fish. During the course of my model-fitting, I determined that a few vertebrae in each assemblage were too large to fit the mixture model. Such fish were excluded from the data used to fit the mixture models. These very large fish may be attributable to a different gear type from those used to take the other fish. The mixture models show that most fish, quantified in terms of minimum number of individuals, were caught by nets in each assemblage. Archeological analysis should not end, however, with an identification of the number of fish in an assemblage that were caught by nets or by other gear.

The overall contribution to the diet of fish caught by nets in comparison to fish caught by other gear is of particular interest. Smaller fish produce a lower return on the work invested in fishing, all other things being equal. Many small fish may have to be caught to provide the contribution to the diet that a single, large fish would provide. The proportion of “net-caught” or “hook/spear-caught” fish bone in an assemblage thus does not by itself accurately reflect that return.

This contribution can be determined by calculating the total live weight of fish represented by the modeled distributions of net-caught fish and fish caught with other gear. The total live weight of fish in an assemblage has a more obvious relationship to the potential dietary contribution of the fish than the count of those fish. The positive correlation between caudal vertebra height and fish live weight allows these amounts to be inferred, using a simple transformation of the data.

The total live weight of “net-caught” and “hook/spear-caught” fish can be calculated from the mixture model results. The mixture model provides parameters that can be employed to create an idealized size-frequency distribution for each population. These distributions are scaled using the inferred number of fish from each population and the live weight equation. Remember that the relationship between live weight of fish and caudal vertebrae height can be represented by the following equation:

$y=4.54x^{2.77}\,$

where the parameters were estimated from modern data.

The next equation illustrates the calculation of total live fish weight from one of the modeled distributions represented in an assemblage, where N is the inferred number of fish in the assemblage that belongs to that population and f(x, µ, σ) represents the lognormal probability density function:

$wt=\int_0^{14.2}(4.54x^{2.77})(N)f(x, \mu, \sigma)\,\mathrm{d}x \,$

The parameters N, µ, and σ are estimated from the maximum likelihood analysis of the mixture models.
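The integral can be evaluated numerically. The sketch below uses the live weight equation and the 14.2 mm upper limit from the text, but the values of `N`, `mu`, and `sigma` are hypothetical stand-ins for the mixture model estimates.

```r
# Hypothetical mixture-model estimates for one population in one assemblage
N <- 120        # inferred number of fish
mu <- 1.2       # log mean of caudal vertebra height
sigma <- 0.4    # log standard deviation of caudal vertebra height

# Live weight equation from the text
live_weight <- function(x) 4.54 * x^2.77

# Integrand: weight at height x times the expected number of fish at that height
integrand <- function(x) live_weight(x) * N * dlnorm(x, meanlog = mu, sdlog = sigma)

# Integrate from zero to the maximum observed caudal vertebra height (14.2 mm)
total_wt <- integrate(integrand, lower = 0, upper = 14.2)$value

# Check against the untruncated closed form, N * a * exp(b * mu + b^2 * sigma^2 / 2)
untruncated_wt <- N * 4.54 * exp(2.77 * mu + 2.77^2 * sigma^2 / 2)
```

The truncated integral should fall slightly below the untruncated value, since the upper limit cuts off only the far right tail of the distribution.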

The scaled distributions are integrated over the range of observed vertebra heights to obtain the total weight of fish. For this study, the equation was integrated from zero to the maximum observed caudal vertebra height among all of the assemblages, which was 14.2 mm. Remember that some fish vertebrae were so large that they were considered outliers and possibly part of a third mode, caught using a different technique than was used to catch fish from the other two distributions. The weight represented by these very large fish was calculated directly from the live weight (power law) equation. Once these mathematical operations have been completed for both of the populations that comprise the mixture distribution, assemblages can be compared for patterns in the amount of fish caught by net and by other gear. The following table provides the results.

Total weight of fish by distribution from each level

The table shows the weight of net-caught fish from the distribution of smaller fish, the weight of larger fish from the second distribution, and the weight of very large fish. The weight of net-caught fish was compared to the combined weight of other fish to calculate the proportion of net-caught fish in each assemblage. These results contrast with my earlier results based on the number of fish from these distributions. Net-caught fish are much less important by weight in all of the assemblages.

At this point, sufficient middle-level theory has been developed to consider the variation among these assemblages. High-level theory provides possible explanations for the patterns observed through the application of middle-level theory. In subsequent posts in this series, I will present and apply some formal high-level theories that may explain variation in the intensification of fishing.

© Scott Pletka and Mathematical Tools, Archaeological Problems, 2009.

### Identifying and Explaining Intensification in Prehistoric Fishing Practices IX: Model Evaluation and Parameter Estimates

October 16, 2009

The previous post in this series introduced the use of mixture models to fit two lognormal distributions to my data on the frequencies of fish caudal vertebra sizes. I have also discussed some technical issues that I encountered while fitting the models. This post presents some results. The following table gives the estimated parameter values for the two lognormal distributions, including the proportion of the assemblage comprised by fish from each distribution, the log means of the two distributions, and the log standard deviations of the two distributions.

Parameter Estimates for Mixture Models

The log mean and log standard deviation parameters describe the distribution of caudal vertebra height for the fish bone in the modes of smaller (net-caught) and larger (hook- or spear-caught) fish. The estimates of all the parameters also have associated standard errors, but I am still calculating those errors. They will be reported in a later installment in the series.

A future post in the series will also present several theories by which these estimates might be interpreted. For now, I will make a few general observations. Note that the standard deviation of the distribution of smaller fish is consistently greater than the standard deviation of the distribution of larger fish. The standard deviation is scale-dependent, but I can offer an explanation for this observation without standardizing the standard deviations. As explained elsewhere, nets should capture both small and large fish. Small fish were likely more common than large fish. Thus, the distribution of net-caught fish should have a small mean and a relatively large standard deviation. The distribution of fish caught by hook or by spear should have a larger mean and a smaller standard deviation. Hooks and spears would not be likely to catch fish smaller than some threshold size. The estimates support these assumptions. The estimated proportion of net-caught fish is also consistently higher than the proportion of fish caught with other gear. Some variability exists, however, and this variability may be significant.

Having fit the models, I must now resolve another issue: whether these models provide an appropriate fit to the data. In particular, I need to evaluate whether a simpler model might also explain the observed patterns. In this case, an example of a simpler model than the mixture model of two lognormal distributions might be a single lognormal distribution. I fit such single lognormal distributions to the size data from each assemblage using the maximum likelihood method. The negative log likelihoods of the mixture models were consistently lower (showing that they fit the data better) than the negative log likelihoods of the single lognormal distribution models, but this result is not surprising.

Models with many parameters can generally be made to fit data better than models with fewer parameters. Models with fewer parameters, however, should generally be preferred to models with more parameters, following the principle that simpler explanations are better than more complex explanations. Models with many parameters may also be worse at predicting the variability in new data sets. In essence, more complex models may be finely tuned to match the particular, random factors that affected one data set. The next data set will have been affected by those random factors differently. Thus, a simpler model that does not try to “explain” random variation may often do better at predicting additional data. Such models focus on the deterministic factors that pattern variation.

The likelihood ratio test provides a way to compare “nested” models. Models are nested when more complex models can be reduced to simpler models by setting parameters to particular values. In the case of my fish data, I can reduce my mixture model of two lognormal distributions to a single lognormal distribution by setting p=1 or p=0. Recall that p is the proportion of fish in the assemblage that derive from the distribution of mainly smaller (and presumably net-caught) fish.

As the name implies, the likelihood ratio test compares the likelihood values of a complex model and a simpler model. A theorem states that twice the difference in log likelihood between the two models (equivalently, minus two times the log of the likelihood ratio) approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the models being compared. Using this theorem, I want to know whether the improvement in fit offered by the more complex model is large enough to justify the added complexity. The following table shows the results of this analysis for my mixture models and the corresponding single lognormal distribution models.
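The test can be sketched in base R with simulated data. The vertebra heights, starting values, and parameter transformations below are hypothetical choices for illustration and are not the fitting procedure used for the midden assemblages.

```r
# Sketch: likelihood ratio test of a two-lognormal mixture against a single lognormal
set.seed(7)
# Simulated (hypothetical) vertebra heights from two populations
h <- c(rlnorm(300, meanlog = 0.8, sdlog = 0.35),
       rlnorm(100, meanlog = 1.8, sdlog = 0.20))

# Negative log likelihood of the mixture; p and the standard deviations are
# transformed so that the optimizer can search without constraints
nll_mix <- function(par) {
  p <- plogis(par[1])
  dens <- p * dlnorm(h, par[2], exp(par[4])) +
    (1 - p) * dlnorm(h, par[3], exp(par[5]))
  -sum(log(pmax(dens, .Machine$double.xmin)))
}
fit_mix <- optim(c(0, 0.5, 2, log(0.5), log(0.5)), nll_mix, method = "BFGS")

# The single lognormal has closed-form maximum likelihood estimates
m <- mean(log(h))
s <- sqrt(mean((log(h) - m)^2))
nll_single <- -sum(dlnorm(h, m, s, log = TRUE))

# Test statistic and p-value (5 parameters versus 2, so 3 degrees of freedom)
lrt_stat <- 2 * (nll_single - fit_mix$value)
p_value <- pchisq(lrt_stat, df = 3, lower.tail = FALSE)
```

One caution applies: because p=0 and p=1 lie on the boundary of the parameter space, the chi-square approximation is only rough for mixture models, so the p-value is best read as a guide.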

Likelihood Ratio Test Results for Mixture Model and Single Lognormal Distribution

The results provide some support for the use of the mixture models on my data. Many of the p-values for my likelihood ratio tests exceed the arbitrary 0.05 value often employed in studies, although some are lower than this value. Notice that the p-values are generally lower when the sample size is higher. The following scatterplot illustrates this relationship.

Sample size of fish bone and p-value for likelihood ratio tests

P-values often reflect such sample size effects. In addition, no universal threshold exists at which a p-value can be said to be truly “significant”. For these reasons, I am comfortable applying the mixture models to all of my assemblages. The mixture models seem sufficiently better at explaining the variability among all of the assemblages to justify the added complexity. I intend to use the mixture models to determine the importance of net-caught fish in each assemblage.


### Mixture Models and Maximum Likelihood Methods

October 7, 2009

In this post, I will highlight some of the technical issues that I encountered while trying to model the variability in the sizes of fish vertebrae from a midden deposit. As described elsewhere, my goal was to distinguish fish caught by nets from fish caught by other gear. The use of these gear types should produce different distributions of fish size. Mixture models are appropriate for cases where variability in a characteristic results from the combination of two or more different distributions. I fit mixture models to the data using the maximum likelihood method. This approach is common in modern statistics but has not been widely employed within archaeology.

The maximum likelihood method addresses the question: “what are the parameter values that make the observed data most likely to occur?” The parameter estimates can be determined from the corresponding likelihood value. The likelihood is calculated from the product of the probability of observing each case in the data given a particular set of parameter values. In practice the log likelihood is usually calculated because the log probabilities can be summed. Calculating the product of many small numbers can be computationally more difficult than summing the log of those numbers. The best parameter estimates have the highest likelihood value or the lowest negative log likelihood value. Algorithms that search the parameter space are typically used to determine those values.
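A small example shows why the log likelihood is preferred in practice. The sketch below uses simulated exponential data (hypothetical) and a one-dimensional search over the log of the rate.

```r
set.seed(4)
x <- rexp(2000, rate = 1 / 500)   # simulated data

# The raw likelihood (a product of many small densities) underflows to zero
lik <- prod(dexp(x, rate = 1 / 500))

# The log likelihood (a sum of log densities) remains finite
loglik <- sum(dexp(x, rate = 1 / 500, log = TRUE))

# Search for the rate that minimizes the negative log likelihood
nll <- function(log_rate) -sum(dexp(x, rate = exp(log_rate), log = TRUE))
fit <- optimize(nll, interval = c(-15, 0), tol = 1e-8)
rate_hat <- exp(fit$minimum)
```

For the exponential distribution the search should agree with the closed-form estimate, `1 / mean(x)`, which makes it a convenient check on the algorithm.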

To use this method, a probability distribution has to be selected that is appropriate for the variability in the data. A simple linear regression, for example, is essentially a maximum likelihood analysis which assumes that the response is normally distributed with a mean of μ = a+x*b and a standard deviation of σ. In this example, the maximum likelihood analysis finds the values of a, b, and σ that best account for the variability in the data.
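The equivalence can be checked with a short sketch. The data are simulated, and the true values of a, b, and σ are hypothetical.

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # hypothetical a = 2, b = 0.5, sigma = 1

# Negative log likelihood of the normal model; sigma is searched on the log scale
nll <- function(par) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
fit <- optim(c(0, 0, 0), nll, method = "BFGS")
mle_coef <- fit$par[1:2]
mle_sigma <- exp(fit$par[3])

# Ordinary least squares estimates for comparison
ols_coef <- coef(lm(y ~ x))
```

The two sets of estimates for a and b should agree closely. The maximum likelihood estimate of σ divides the residual sum of squares by n rather than n - 2, so it runs slightly below the residual standard error that `lm` reports.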

The application of a mixture model to my data on fish bone size provides another example of this approach. The mixture model that I used assumes that the size-frequency distribution of fish in each assemblage was a result of the combination of two lognormal distributions. Such distributions are appropriate to the data for a couple of reasons. First, those distributions can have long tails to the right, and the histograms of caudal vertebrae height for my assemblages also have long tails.

Histograms of Caudal Vertebra Height (mm) by Level from the Midden Deposit

Second, consider the average size of those modern fish species that are also found at archaeological sites in the region. A histogram of the average size of these fish also displays a long tail to the right as shown in the following histogram.

Average live weight of modern fish species in study region

This distribution is probably not lognormal. Nevertheless, smaller fish species are clearly more common than larger fish. The distribution of fish sizes from which fishers obtained individual fish was likely affected by the abundance of these fish species, habitat, climate, and other factors. Better data on the modern distribution of fish size for my study area is not available, so this discussion will have to be sufficient for now.

I used the mixdist package for R to find the maximum likelihood estimate of the parameter values, including the proportion of the fish in the two modes, the mean size of fish in each mode, and standard deviation of each mode. This package uses a special algorithm to arrive at those estimates. I also searched the parameter space directly, writing a simple program in R to loop over the plausible range of values for my parameters and find the best parameter estimates.

The direct search of the parameter space proved to be the most informative approach. I could examine the results to see how the likelihood varied with parameter values. Examination of this variation showed that the likelihood value was not varying smoothly with those parameter values. Wildly different combinations of parameter values had very similar likelihoods.
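A direct search of this kind can be sketched as a grid evaluation. To keep the example short it fits a single lognormal distribution to simulated heights (hypothetical), rather than the full mixture model, and the grid ranges are arbitrary.

```r
set.seed(5)
h <- rlnorm(300, meanlog = 1.2, sdlog = 0.4)   # hypothetical vertebra heights

# Grid of candidate parameter values
grid <- expand.grid(meanlog = seq(0.8, 1.6, by = 0.01),
                    sdlog = seq(0.2, 0.8, by = 0.01))

# Negative log likelihood at every grid point
grid$nll <- mapply(function(m, s) -sum(dlnorm(h, m, s, log = TRUE)),
                   grid$meanlog, grid$sdlog)

# Best grid point, plus the closed-form estimates for comparison
best <- grid[which.min(grid$nll), ]
m_hat <- mean(log(h))
s_hat <- sqrt(mean((log(h) - m_hat)^2))
```

Keeping the whole grid, rather than only the best point, is what allows the likelihood surface to be inspected for ridges or flat regions where very different parameter values fit almost equally well.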

The problem turned out to be the large fish at the extreme end of the distributions in my assemblages. Too many of these fish occurred in the samples for the models to readily converge on parameter estimates. These difficulties were largely hidden when I used the mixdist package to fit the mixture models. The very large fish probably belong to a third mode and may therefore have been obtained in a different manner from the techniques used to acquire the fish in the two smaller modes. Once I removed these large fish from the analysis, the likelihood varied smoothly with the parameter values.


### Identifying and Explaining Intensification in Prehistoric Fishing Practices VIII: Establishing the Number of Fish Caught by Nets or Other Gear through Mixture Models

October 2, 2009

The previous post in this series provided the middle-level theory needed to quantify the number and size of fish represented in an archeological assemblage. Recall that I am interested in determining how many of these fish were caught by nets and how many were caught by other gear. These gear types should differ in the size range of fish that they are likely to capture.

The next step is to look at histograms of the data, which show the number of fish bones that occur within a particular size interval. Modes, or peaks, in the fish size-frequency histograms should reflect the use of different fishing gear. The mode of smaller fish should represent fish taken by nets, and the mode of larger fish should reflect fish taken by hooks and line or by spears. The following histograms show data from my archeological site. As noted in an earlier post, the fish bone assemblages derive from different levels within a single site, where minimal mixing has occurred among the levels.

Histograms of Fish Caudal Vertebra Height by Level from an Archaeological Site

The histograms for some of the levels show distinct modes. In particular, two modes seem to be present in the 50-55 cm level, while three modes apparently occur in the 30-35 cm level. The other levels are harder to interpret.

Clearly, the identification of these modes is not straightforward. The lack of more clear-cut patterning likely results from a heavy reliance on nets. Nets may catch both large and small fish. Other gear like hook and line or spears is much more likely to catch large fish. Prehistoric fishers used spears tipped with large stone points or sharpened bone. They employed hooks made from shells. Hooks and line or spears may not be able to catch fish smaller than some threshold size. In assemblages where net-caught fish predominate, their prevalence may obscure any mode in the fish size distribution formed by fish caught with other gear.

Fortunately, statistical techniques exist which may help to distinguish separate populations which are mixed together in a single distribution. Finite mixture distributions model such situations. Such distributions can be analyzed using the mixdist package for R. This package allows the parameters of the contributing populations to be estimated, including the proportion of each population represented in the distribution and the mean vertebra size in each separate population. The following graph illustrates the application of a mixture model to data from the 50-55 cm level at the site.

Example of the Mixture Distribution Fit to Data from the 50-55 cm Level

For the mixture model, I fit two lognormal distributions to the data. The histogram depicts the original data. Note that the histogram interval differs from the interval used in the previous graph. The two dotted lines show the separate lognormal distributions fit to the data, and the black triangles identify the means of those distributions. The solid line shows the mixture model prediction that results from combining the two individual lognormal distributions. The gray bars at the bottom of the graphic show the deviations of the model from the observed distribution. The scale of the deviations is depicted in relative terms. This model appears to fit the data reasonably well.
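The solid line in such a graph is a weighted sum of the two component densities. The following sketch builds that sum in base R; the values of `p`, `mu`, and `sigma` are hypothetical and are not the estimates for the 50-55 cm level.

```r
# Hypothetical parameter values for a two-component lognormal mixture
p <- 0.7                  # proportion from the smaller-fish distribution
mu <- c(1.0, 1.9)         # log means of the two components
sigma <- c(0.35, 0.20)    # log standard deviations of the two components

# Mixture density: weighted sum of the two lognormal densities
mix_density <- function(x) {
  p * dlnorm(x, mu[1], sigma[1]) + (1 - p) * dlnorm(x, mu[2], sigma[2])
}

# The mixture is itself a probability density, so it integrates to one
total_area <- integrate(mix_density, lower = 0, upper = Inf)$value

# Means of the two components on the original measurement scale
component_means <- exp(mu + sigma^2 / 2)

# curve(mix_density, from = 0, to = 15)  # uncomment to draw the mixture
```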

I have also been working on a more rigorous analysis of the mixture models and their fit. This analysis is ongoing and has been plagued by some problems that I may have finally resolved. I will present some of the results and issues in the next post in this series.


### Identifying and Explaining Intensification in Prehistoric Fishing Practices VII: Estimating the Minimum Number of Individuals

September 28, 2009

In the previous post in this series, I demonstrated that caudal vertebra height can be used to estimate live fish weight. The use of vertebrae, however, introduces an additional issue of quantification that requires resolution. Bony fish have two main types of vertebra: abdominal and caudal. Predictable variation in form occurs among the abdominal and caudal vertebrae in the vertebral column of an individual fish. Despite this predictability, an undifferentiated pile of fish vertebrae in an archaeological collection is usually just separated into these main types because the variation can be subtle. Multiple specimens of a single vertebra type (like caudal vertebrae) from a particular species thus could be attributable to a single individual or to multiple individuals. Many statistical tests, however, require that each observation be independent of the others. This assumption is particularly critical for the analysis of size-frequency distributions. The assumption of independent observations would be violated if multiple bone specimens derived from the same individual. This violation could dramatically affect inferences regarding the shape of that distribution. Some method must be used to eliminate potentially redundant specimens.

Two criteria can be used to identify vertebrae from separate individuals. First, the size of vertebrae within an individual bony fish (excluding the length of the centrum) typically varies only a little. The vertebral centra of sharks, skates, and rays (elasmobranchs) seem to vary to a much greater extent within an individual. Subsequent analysis focused on bony fish for this reason. Second, each species of bony fish has a characteristic number of abdominal and caudal vertebrae, and this number varies modestly among individuals. A small number of caudal vertebrae within a narrow size range from a particular taxon, for example, may well have come from the same individual. Vertebrae from a particular taxon that span a large size range or that occur in large number within a small size range are likely to have derived from multiple fish.

In the sample of fish specimens, vertebra height typically varied less than 0.3 mm when comparing abdominal and caudal vertebrae. The size difference within individuals was not strongly correlated with the overall size of the caudal vertebra.

Size Difference Between Caudal and Abdominal Vertebrae and Caudal Vertebra Height

A simple linear regression returned estimates of -0.19 for the y-intercept and 0.13 for the slope of the line, while $r^2=0.49$ and p < 0.01. Two of the large vertebrae appear to be outliers, however, and may be unduly influencing these results. With these two cases removed, the simple linear regression estimates the y-intercept to be 0.05 and the slope to be 0.06, while $r^2=0.22$ and p = 0.01. This rule of thumb may therefore be generally applicable.

When the number of caudal vertebrae within a 0.3-mm size interval exceeds the typical number for that taxon, more than one individual from that size range may be represented. Additional work should be undertaken, using a larger sample of fish, to confirm and refine this observation. In the interim, the foregoing principles and observations can be used to calculate the minimum number of individuals represented in an archaeological assemblage and to estimate the size of each fish.
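One possible reading of this rule can be sketched as a simple counting procedure. The heights, the typical vertebra count `n_caudal`, and the use of fixed (rather than sliding) 0.3-mm intervals are all assumptions made for illustration, not the procedure actually applied to the assemblages.

```r
# Hypothetical caudal vertebra heights (mm) for one taxon
heights <- c(rep(3.1, 16), 3.5, 3.55, 6.1)
n_caudal <- 14     # hypothetical typical number of caudal vertebrae for the taxon
bin_width <- 0.3   # size interval (mm) taken from the rule of thumb

# Count vertebrae in each fixed 0.3-mm size interval
bin <- floor(heights / bin_width)
counts <- as.vector(table(bin))

# Each interval requires at least ceiling(count / n_caudal) individual fish
mni_by_bin <- ceiling(counts / n_caudal)
mni <- sum(mni_by_bin)
```

In this example the sixteen vertebrae near 3.1 mm exceed the typical count and therefore require two fish, while the other two intervals each require one, for a minimum of four individuals.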


### Identifying and Explaining Intensification in Prehistoric Fishing Practices VI: Modeling the Fish Live Weight-Caudal Vertebra Size Relationship

September 23, 2009

In the previous post in this series, I suggested that existing theory could provide a model appropriate to the observed relationship between live fish weight and vertebra size. This relationship appears to be an instance of allometric scaling. Many animal species exhibit a power-law relationship between the scale of particular traits and overall body size. Such relationships take the following form. Let y = body size and x = the size of a particular trait. Then:

$y=ax^b\!$

where a and b are constants, the parameters to be estimated from the data.

Note that a log transform of this relationship would result in the following equation:

$\log(y) = \log(a) + b\log(x)\,$.

This equation produces a straight line with the y-intercept at log(a) and a slope of b. The log-log transform of the data illustrated in the previous post showed this kind of line, which is a necessary (but not sufficient) condition for demonstrating a power-law relationship.

For my sample of fish, I evaluated the relationship between fish live weight and caudal vertebra size using a linear regression analysis of the log-transformed data. This technique identifies the values of log(a) and b that minimize the sum of squared deviations between the observed and predicted values of log(y). The estimates for the sample are a=4.54 and b=2.77. The 95 percent confidence interval for a ranges from 3.05 to 6.76, while the 95 percent confidence interval for b ranges from 2.49 to 3.05. For this model, $r^2=0.93$, and the p-values for both parameters are less than 0.001. The low p-values indicate that neither the intercept, log(a), nor the slope, b, is plausibly zero.
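As a rough sketch of this kind of fit, the following uses ordinary least squares on the logs of both variables and then back-transforms the intercept. The function name `fit_power_law` and the data are hypothetical: the heights and weights are made-up, noise-free values generated from $y = 3x^{2.5}$, not the fish sample analyzed here.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log(y) versus log(x).

    Returns (a, b, r_squared), with r_squared measured on the log scale.
    """
    log_x = [math.log(value) for value in x]
    log_y = [math.log(value) for value in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    syy = sum((ly - mean_y) ** 2 for ly in log_y)
    b = sxy / sxx                 # slope of the log-log line
    log_a = mean_y - b * mean_x   # intercept of the log-log line
    ss_res = sum((ly - (log_a + b * lx)) ** 2 for lx, ly in zip(log_x, log_y))
    r_squared = 1.0 - ss_res / syy
    return math.exp(log_a), b, r_squared


# Made-up, noise-free data generated from y = 3 * x**2.5.
heights = [2.0, 4.0, 6.0, 8.0, 10.0]
weights = [3.0 * h ** 2.5 for h in heights]
a_hat, b_hat, r2 = fit_power_law(heights, weights)
print(a_hat, b_hat, r2)
```

With noise-free input, the fit recovers a and b up to floating-point error. Real measurements scatter around the line, which is why the confidence intervals reported above matter.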

The following plot illustrates the fit of this model to the data. In the plot, the dashed lines show the prediction interval around the model. This interval depicts the range within which 95 percent of live weights for new samples would be expected to fall for a given value of caudal vertebra height.
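The post does not state how the interval was computed. A standard choice for a simple linear regression, assumed here for illustration, works on the log scale. Write $u = \log(x)$ and $v = \log(y)$, let n be the number of fish, and let $\hat{v}_0 = \log(a) + b\,u_0$ be the predicted value at a new log vertebra height $u_0$. The 95 percent prediction interval is then:

$\hat{v}_0 \pm t_{0.975,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(u_0 - \bar{u})^2}{\sum_{i=1}^{n}(u_i - \bar{u})^2}}$

where $s = \sqrt{\sum_{i=1}^{n}(v_i - \hat{v}_i)^2/(n-2)}$ is the residual standard error and $t_{0.975,\,n-2}$ is the 97.5th percentile of Student's t distribution with n-2 degrees of freedom. Exponentiating the two endpoints gives the corresponding interval for live weight on the original scale.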

Log of Caudal Vertebra Height and Log of Live Fish Weight

The sample of fish for this analysis should be expanded. Nevertheless, it supports the common-sense notion that fish bone size reflects the overall size of fish. The data that corroborate this middle-level theory are not comprehensive, but they provide sufficient justification to proceed. Caudal vertebra height will be used as a measure of fish size.
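To show how the fitted relationship could be applied, the sketch below plugs the estimates reported above (a=4.54, b=2.77) into $y=ax^b$. The function name `predict_live_weight` and the example heights are hypothetical, and the units are assumed to be those of the original measurements, which this post does not restate.

```python
def predict_live_weight(vertebra_height, a=4.54, b=2.77):
    """Point estimate of live weight from caudal vertebra height via y = a * x**b.

    The defaults are the estimates reported in this post; the units are those
    of the original measurements (an assumption, not restated here).
    """
    if vertebra_height <= 0:
        raise ValueError("vertebra height must be positive")
    return a * vertebra_height ** b


# Hypothetical vertebra heights, for illustration only.
for height in (2.0, 4.0, 8.0):
    print(height, round(predict_live_weight(height), 1))
```

Under this model, doubling the vertebra height multiplies the predicted live weight by $2^{2.77}$, or roughly 6.8. Because the model was fit on the log scale, the back-transformed value is better read as a typical (median) weight than as a mean.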


### Identifying and Explaining Intensification in Prehistoric Fishing Practices V: Quantifying the Relationship between Fish Size and Fish Bone Size

September 19, 2009

The previous post in this series established that a positive relationship exists between the live weight of fish and caudal vertebra height, providing support for the use of vertebra size as an index of overall fish size. I will attempt to quantify this relationship more precisely. Many different models could be chosen, but how should we select the most appropriate one?

The model needs to be appropriate for the structure of the data. The following graph shows a plot of the data with a linear model superimposed over the data points. The graph also depicts the deviation between each data point and the model with vertical lines. Notice that these deviations seem to get larger as the size of the fish gets larger. Linear models assume, among other things, that the variation around the fitted line remains constant across the range of the data. A linear model may therefore not be appropriate.

Fish Live Weight and Vertebra Height Scatterplot with Linear Model

A transformation of the data may help. Taking the log of both the live weight and the vertebra height produces more consistent variation. The next graph shows a linear model applied to the log of the data.

Log Transform of Live Fish Weight and Vertebra Height

The deviations from the model are much more consistent. This model now seems reasonably appropriate to the structure of the transformed data in the sense that it doesn’t appear to violate the model assumptions. Those assumptions include normally-distributed variation and constant variation. Ideally, however, I’d like to fit a model that has an easier interpretation. Is there any theoretical basis for applying a particular type of model to these data? As it turns out, the answer is “yes”, and I will discuss this model in the next post in the series.
