Tables of counts and statistical summaries are quick and compact ways to begin exploring a new data set, but they are limited. To really visualize data distributions and the relationships among variables, we need charts and graphs. R has extensive graphic capabilities including the standard pie charts, bar charts, histograms, box-and-whisker plots, and scatter plots. Newer methods such as dot plots, kernel density plots, violin plots, and ways of representing three or more variables simultaneously are also available.
R has several graphics systems. One is the base graphics system that is included in R. With this system, graphs and charts are created using functions that write directly to a graphics output device that can be a window on your computer screen or a file. Graphs can be built up using multiple functions that can be stored as R script files, allowing you to regenerate the graph whenever needed (Murrell, 2011). A second graphics system focuses on interactive graphs and 3D graphs using the rgl package. We will use rgl to create an interactive 3-D plot. A third system, lattice, sets up the graph as a grid object (just like data frames or tables are created as objects) and then that object is printed to a graphics device (Sarkar, 2008). Lattice rolls all of the steps you would use with base graphics into a single function call. It is more difficult to learn, but it provides a very easy way to make multiple graphs showing different samples or subgroups of the data set all at once. Finally, a relatively new system that tries to combine the best of base and lattice graphics with a consistent command structure is ggplot2 (Wickham, 2009). We will not try to cover lattice or ggplot2. One of the advantages of R is the large community of people using R and adding new functions. Graphics is a perfect example of this productivity.
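As a minimal sketch of this workflow, assuming a hypothetical data frame DartPoints with a numeric Length column, the same few base graphics calls can draw to the screen or, wrapped between png() and dev.off(), write to a file, and they can be saved in a script to regenerate the figure later:

# Build up a simple base graphics plot (hypothetical data frame and column names)
hist(DartPoints$Length, main = "Dart Point Lengths", xlab = "Length (mm)")
abline(v = mean(DartPoints$Length), lty = 2)   # add a reference line at the mean

# The same commands can write to a file instead of the screen
png("length_hist.png", width = 600, height = 400)
hist(DartPoints$Length, main = "Dart Point Lengths", xlab = "Length (mm)")
abline(v = mean(DartPoints$Length), lty = 2)
dev.off()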
There are a number of sources available that describe how to produce graphs and charts that are clear, informative, and easy to interpret. They range from graphs that can be produced simply using a pencil and paper (Tukey, 1977) to graphs that communicate information clearly based on experiments on human perception to see what kinds of depictions are correctly interpreted (Cleveland, 1993, 1994) to graphs that retain an aesthetic quality while communicating effectively (Tufte, 1983, 1990).
Archaeological assemblages are collections of artifacts that have been assigned to different groups. The boundaries defining an assemblage can be a whole site, the part of the site excavated, a feature within a site (e.g., pit, grave, house), or an arbitrary unit defined in terms of horizontal and vertical space (Level 7 of unit N302E200). A description of an assemblage includes how its boundaries are defined and how many of each kind of archaeological material was present within those boundaries.
One of the challenges in analyzing archaeological assemblages is that the factors that control the counts are usually not controlled by the archaeologist. The boundaries of the assemblage usually do not represent a consistent amount of time from one assemblage to another. Whether time is measured in years or person-years, we cannot assume that the amount of time is the same between assemblages except in rare circumstances such as graves and shipwrecks. Artifact composition, whether the result of a geological event (obsidian) or a behavioral event (ceramics), does not include the same level of uncertainty. Artifact composition is expressed using a measure that standardizes abundance (e.g., percent, per mil, ppm, ppb). Artifact assemblages may be similarly standardized in terms of percentage of the whole assemblage or of only those items under analysis (e.g., ceramics, but not lithics; faunal material, but not botanical material; lithic artifacts, but not lithic debitage) or in terms of density (items per volume), but usually we are not certain that volume means the same thing across the various assemblages under consideration.
This means the analysis of artifact assemblages is similar to, but different from, the analysis of artifact composition and of species composition in ecological communities. While it makes sense to borrow from the approaches used by both, it is important to recognize the differences. In comparison with ecological communities, artifact types are less clear-cut than species. The assemblage represents an accumulation of discarded material rather than the observation of living individuals present at a particular point in time, and in that sense archaeological assemblages are more similar to fossil communities. Instead of the niches occupied by biological species, artifacts occupy a space defined by human interaction with the physical environment and with social networks.
In the preceding chapters we have explored some of the ways quantitative analysis of archaeological data can help us to understand the past. In particular, we have used the R Project for Statistical Computing. Since R is a collaborative project, new packages have been added or updated as you were reading this book. By now, you should have the ability to evaluate statistical hypotheses, explore your data for new, interesting patterns, and find ways to communicate those patterns to others.
Analyzing data begins with a research design that should inform decisions about how to collect the data, how much to collect, and how to analyze the data collected. Analysis of the data involves trying many different approaches. It is not a matter of clicking on a few menus and scanning generically produced output. Do not assume that using quantitative methods on your data will be any easier than the process of excavating or analyzing the data was. Ask interesting questions of your data: where it came from, how it got there, what it was used for, what it means. Then find the methods that get you closer to answering those questions. You will need multiple methods, not one, and if they lead to conflicting conclusions, you will need to wrestle with that fact.
Quantitative methods help us make some progress in answering the big questions I posed in Chapter 1. But you may have to learn a new way to use those methods. Traditional inferential statistics encourages you to think in terms of collecting statistically significant results and to measure your progress in those terms. In working with archaeological data, I have tried to show that engaging with data is a process and that a simple tabulation of hypothesis tests is insufficient. As soon as you have data, you should begin looking at it, checking for errors, checking for unexpected distributions and patterns, and adjusting your research accordingly. Evaluate your findings in terms of multiple lines of information. For example, shape can be compared to spatial or temporal distribution. Site distribution can be compared to assemblage variability, and so on.
In Chapter 8, we used one categorical variable to define groups and examined differences in a numeric variable between the groups. The grouping variable was treated as an explanatory or independent variable while the numeric variable was the response or dependent variable. This relationship was indicated by the fact that the explanatory variable was on the right-hand side of the formula and the response variable was on the left side of the formula (for the Snodgrass example, Area~Segment). This terminology suggests some kind of causal relationship between the variables.
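As a brief, hedged sketch of the formula interface, assuming a data frame named Snodgrass with a numeric Area column and a grouping factor Segment as in the example above:

# Response (Area) on the left of ~, explanatory grouping variable (Segment) on the right
boxplot(Area ~ Segment, data = Snodgrass)       # compare the distributions by group
oneway.test(Area ~ Segment, data = Snodgrass)   # test for differences in group means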
Statistical tests do not tell us which one is which or even if the relationship is causal, but they can draw our attention to the fact that when one variable changes, another one changes as well. Another way of looking at this is to say that one variable allows us to make predictions (or projections or forecasts) about another variable. This chapter focuses on methods we can use to identify variables that vary with one another. In some cases, it is clear which variable is the explanatory variable and which is the response. In other cases, the distinction is irrelevant because the statistic is symmetrical. In looking for relations between variables, we usually have one of two goals. Either we are looking for a measure of association that tells us how strongly two variables co-vary (e.g., correlation) or we are looking for a way to use one variable to make predictions about another variable (e.g., regression).
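As a small illustration of the two goals with made-up numeric vectors:

# Hypothetical numeric variables
x <- c(12, 15, 19, 22, 27, 31, 36)
y <- c(30, 34, 45, 50, 61, 70, 79)

cor(x, y)          # measure of association: how strongly x and y co-vary
fit <- lm(y ~ x)   # regression: use x to predict y
coef(fit)          # intercept and slope of the prediction line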
CATEGORICAL DATA
When the two variables are categorical, we generally use the Chi-square test to determine if the two variables are independent or associated. If they are independent, knowing the value of one variable does not improve our ability to guess the value of the second variable. Coin tosses are independent. If I toss a coin and the result is heads, it does not improve my ability to predict a second toss of the coin. This concept of independence underlies the Chi-square test and represents the null hypothesis.
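A small illustration with a made-up table of counts (hypothetical pottery types by site area); chisq.test() evaluates the null hypothesis that the two variables are independent:

# Hypothetical counts: two pottery types by two site areas
counts <- matrix(c(35, 15,
                   20, 30), nrow = 2, byrow = TRUE,
                 dimnames = list(Type = c("A", "B"), Area = c("North", "South")))
chisq.test(counts)   # a small p-value suggests the variables are associated, not independent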
Cluster analysis includes a number of techniques for combining observations into groups based on one or more variables. Clustering is unsupervised classification since we do not have any information about how many groups there are or how they should be defined. The groups can be formed in five ways. First, we can start with all of the objects and divide them into two groups and then we can subdivide each of those groups. For example, we can separate the ceramics that have shell temper from those that have sand temper. We could divide each of those groups by the shape of the pot by separating jars from bowls. Then we could look at each of those groups and divide them by decorative techniques, such as painting, cord-marking, or incising. This is a common way of approaching artifact typology for ceramics. A second way of forming groups is to use a “type specimen.” One or more artifacts are used to identify distinct types. Then artifacts are placed with the type specimen they most closely resemble. New projectile points are classified by comparing them to established descriptions of existing types. If the specimen does not match any of the known types, a new one can be defined. Third, we can divide the specimens into groups so that each group is relatively homogeneous. This approach is similar to the type specimen approach, but we do not identify type specimens in advance, although we do have to decide how many groups seem reasonable. Fourth, we could start with all of the objects and find the two that are most similar. Then we find the next most similar pair, and so on. This process includes adding a third specimen to an existing pair and combining two pairs into a larger group. The process continues until there is only one group. This approach requires very careful operational definitions of what “similar” means and how it is to be measured. Until computers became widely available, archaeologists rarely used this approach. Finally, we could use a distance measure (rather than a single variable) to divide the collection into more and more groups.
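The fourth (agglomerative) and third (partitioning into a chosen number of relatively homogeneous groups) approaches are the ones most often computed in R; a minimal sketch with hypothetical measurements:

# Hypothetical numeric measurements on six specimens
pts <- data.frame(Length = c(42, 45, 60, 62, 38, 70),
                  Width  = c(18, 20, 25, 27, 17, 30))

# Agglomerative clustering: join the two most similar specimens, then keep merging
d <- dist(scale(pts))                 # distance matrix on standardized variables
hc <- hclust(d, method = "average")
plot(hc)                              # dendrogram showing the merging sequence

# Partitioning: divide the specimens into a chosen number of groups
km <- kmeans(scale(pts), centers = 2)
km$cluster                            # group assignment for each specimen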
R is an open source programming language that includes functions for managing and analyzing data (R Core Team, 2016). It includes extensive graphical and statistical capabilities that are continually being expanded. In addition to the basic installation, there are numerous packages contributed by R users worldwide that provide additional functions. R is freely available worldwide and compiled versions are available for Windows®, Mac OS X®, and several versions of Linux (Debian®, Redhat®, Suse®, and Ubuntu®). In this chapter, you will learn how to install R, type commands, and get help.
FIRST STEPS USING R
The main R Project website is www.r-project.org/ (Figure 1). The main page provides access to information about R and to extensive documentation. Select CRAN under “Download” on the left side of the page. At the top of the page, click on the 0-Cloud link, https://cloud.r-project.org/. That will take you to a secure server near your location to download R. Download the version for your computer's operating system (Windows, OS X, or Linux). The download link for Windows is straightforward. The link for Linux requires you to select which flavor of Linux you are using and then provides instructions for installing the software. The link for Mac OS X is a bit more complicated. Assuming you have a recent version of OS X (10.9 or later), follow the instructions and install XQuartz first and then download the .pkg file to install R.
You interact with R by typing commands and executing them. There are several ways to do this. We will start with the basic interface and then describe some other options at the end of the chapter. When you start R by clicking on the icon, you will see the R Console window (Figure 2). The window includes information on what version of R you are using, how to get help, how to quit, and some basic instructions on how to cite R in publications. Below that is a line beginning with “>”, which is R's way of saying it is waiting for you to type a command. In Windows®, the menu bar at the top of the window includes File, Edit, Misc, Packages, Windows, and Help tabs; the OS X® menus are slightly different, and the tabs do not appear in the Linux terminal window.
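For example, typing a command at the “>” prompt and pressing Enter causes R to evaluate it and print the result:

> 2 + 3              # R evaluates the expression and prints the answer
[1] 5
> x <- c(4, 8, 15)   # assign a vector of numbers to the name x
> mean(x)            # call a function on it
[1] 9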
Principal components analysis and factor analysis are related, but distinct, methods for analyzing the structure of multivariate data. In Chapter 11, we identified linear combinations of variables that would separate known groups. If we suspect there are groups in the data or that the variables are related to one another, principal components may help us to identify those patterns.
Principal components analysis looks for a way to reduce the dimensionality of the data. If we have only two variables (e.g., length and width), we can display their relationship with a simple scatterplot of length against width that indicates if the variables are correlated with one another or if there are distinct clusters of observations. If there are three variables (add thickness, for example), we can display a 3-D representation that can be rotated to look for patterns in the data. Beyond three variables, there is no simple way to display the data. One approach is to find a way of projecting the data into a smaller number of dimensions, just as a shadow is a two-dimensional projection of a three-dimensional object. That projection will result in the loss of information, so we would like to find the projection that captures as much detail as possible. Principal components analysis does this by identifying the direction of maximum covariance (or correlation) in the data. The first component identifies that direction. The second component finds the next largest direction of covariance (or correlation) subject to the constraint that it must be orthogonal to (at a right angle to, or uncorrelated with) the first dimension. If the data are highly correlated, we may be able to accurately summarize many variables in a few dimensions. Principal components analysis tells us about the structure of the data and provides us with a way of displaying the observations in a reduced number of dimensions. Sometimes that can help us to see clustering in the data that might indicate distinct artifact types (Christenson and Read, 1977; Read, 2007).
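A minimal sketch using prcomp() on hypothetical measurements; setting scale. = TRUE analyzes the correlation rather than the covariance matrix:

# Hypothetical measurements on a set of artifacts
arts <- data.frame(Length    = c(42, 45, 60, 62, 38, 70, 55),
                   Width     = c(18, 20, 25, 27, 17, 30, 24),
                   Thickness = c(6, 7, 9, 10, 6, 11, 8))

pca <- prcomp(arts, scale. = TRUE)   # components of the correlation matrix
summary(pca)                         # proportion of variance captured by each component
biplot(pca)                          # observations and variables in the first two dimensions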
While principal components focus primarily on summarizing multivariate data, factor analysis has a different goal. Factor analysis assumes that the data are the product of unobserved (or latent) factors. The correlations between variables provide the evidence of these latent factors.
In Chapter 10, we expanded on linear regression by using more than one explanatory variable on the right-hand side of the formula. In this chapter, we will expand on t-tests and analysis of variance from Chapter 8 by adding more than one response variable on the left-hand side of the formula. Hotelling's T² test is a multivariate expansion of the t-test, and multivariate analysis of variance (MANOVA) is a multivariate expansion of analysis of variance. In many cases, we have multiple measures of artifact shape or composition, and running t-tests separately on each variable creates multiple comparisons problems. Also, the tests are not really independent if the variables are correlated with one another, as they often are. Hotelling's T² and MANOVA provide an overall test of the difference between the groups based on all of the numeric variables. The tests of significance are on linear combinations of the variables rather than on the original separate variables.
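A hedged sketch using manova() on made-up data with two groups and two measurements; with only two groups, the Hotelling–Lawley test statistic is equivalent to Hotelling's T²:

# Hypothetical data: two measurements on artifacts from two groups
dat <- data.frame(Group  = factor(rep(c("A", "B"), each = 5)),
                  Length = c(40, 42, 39, 43, 41, 50, 52, 49, 51, 53),
                  Width  = c(18, 19, 17, 20, 18, 23, 24, 22, 25, 23))

fit <- manova(cbind(Length, Width) ~ Group, data = dat)
summary(fit, test = "Hotelling-Lawley")   # overall test on both variables at once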
Discriminant analysis involves a similar process in that we are looking for linear combinations of variables that allow us to predict a categorical variable. The most common archaeological application is in compositional analysis where we are trying to characterize different sources (geological sources or manufacturing sources) on the basis of molecular or elemental composition. Discriminant analysis includes two separate but related analyses. One is the description of differences between groups (descriptive discriminant analysis) and the second involves predicting to what group an observation belongs (predictive discriminant analysis; Huberty and Olejnik, 2006).
Descriptive discriminant analysis is based on multivariate analysis of variance. Instead of a single numeric dependent (response) variable, we have several variables. To test for differences between groups, we compute linear combinations of the original variables and then test for significant differences between the linear combinations. A linear combination is like a multiple regression equation in the sense that each variable is multiplied by a value and summed to produce a new value that summarizes variability in the original variables. Descriptive discriminant analysis is also described as canonical discriminant analysis and the linear components are referred to as canonical variates. The method is used to visualize the similarities and differences between groups in two or three dimensions.
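Both the descriptive and predictive uses can be sketched with lda() in the MASS package; the data frame, sources, and element names here are hypothetical:

library(MASS)

# Hypothetical compositional data with known sources
comp <- data.frame(Source = factor(rep(c("Quarry1", "Quarry2"), each = 6)),
                   Rb = c(120, 125, 118, 130, 122, 127, 180, 175, 185, 178, 182, 176),
                   Sr = c(60, 62, 58, 65, 61, 63, 40, 42, 38, 41, 39, 43))

fit <- lda(Source ~ Rb + Sr, data = comp)   # linear combination(s) separating the groups
fit$scaling                                 # coefficients of the discriminant function
predict(fit)$class                          # predictive use: assign each observation to a source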
A table is simply a two-dimensional presentation of data or a summary of the data. We use tables to inspect the original data for errors or problems such as missing entries. We used tables to present condensed summaries of data values in Chapter 3 (e.g., numSummary()). Those summaries involved computing summary statistics by a categorical variable to see how the groups differed from one another. We can also use tables to see how categorical variables covary.
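A small illustration of cross-tabulating two categorical variables with made-up data:

# Hypothetical categorical data on a set of sherds
sherds <- data.frame(Temper = c("Shell", "Sand", "Shell", "Shell", "Sand", "Sand"),
                     Form   = c("Jar",  "Bowl", "Jar",   "Bowl",  "Jar",  "Bowl"))

table(sherds$Temper, sherds$Form)       # cross-tabulation of counts
xtabs(~ Temper + Form, data = sherds)   # the same table using a formula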
Nominal or categorical data play a large role in archaeological research. At the regional level, sites are the categories and we are interested in the number of different types of artifacts (also a category) found in each site. The same applies at the site level where the artifact categories are distributed across excavation units. Within sites, different kinds of features are present and features contain different types of artifacts. At the artifact level, some properties of artifacts are represented by categories. Because of this, the same data are often represented in different ways for different purposes. That is not a problem unless the statistical procedures we are using expect a format different from the one we are currently using. In Chapter 3, we created tables of descriptive statistics. In this chapter we are concerned with tables in which the cell entries consist of counts of objects.
R distinguishes between tables and data frames and some functions will work with one but not the other. Data frames have columns that represent different types of data (e.g., character strings, factors, numbers), but tables in R represent numeric data only. In fact, R tables are a kind of matrix. Before constructing tables, we will briefly describe how R encodes categorical data using factors.
FACTORS IN R
Factors are a way of storing categorical information in R. If you have coded a variable into a set of categories, you have the choice of storing the information as a character or factor vector. A factor stores each category as an integer and the category labels are stored as levels. If you import your data into a data frame, R will automatically convert character vectors into factors unless you use the argument stringsAsFactors=FALSE.
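A brief sketch of how factors store categories (the temper values are made up; note that from R 4.0 onward the default for stringsAsFactors is FALSE, so character columns are no longer converted automatically):

# Character vector of temper categories
temper <- c("Shell", "Sand", "Shell", "Grog", "Sand")

f <- factor(temper)   # convert to a factor
levels(f)             # category labels: "Grog" "Sand" "Shell"
as.integer(f)         # the underlying integer codes

# Keep character columns as characters when building a data frame
dat <- data.frame(temper, stringsAsFactors = FALSE)
str(dat)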
When we plotted the dart point lengths, some types seemed similar to one another and some seemed different. Under most circumstances, we assume that the data we are analyzing are a sample of a larger population to which we do not have access. When we computed the mean and the standard deviation of length for different point types, we were computing sample statistics, values that characterized the distribution of values in the sample. If we increase the sample by adding more points, the statistics will change somewhat. If we could collect all of the points of a particular type and compute the mean and the standard deviation, those values would be parameters. They would represent the entire population of that point type.
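As a small illustration with made-up lengths and types, the sample statistics can be computed by group:

# Hypothetical dart point lengths and types
lengths <- c(45, 50, 48, 60, 62, 58, 55)
types   <- factor(c("Pedernales", "Pedernales", "Pedernales",
                    "Ensor", "Ensor", "Ensor", "Ensor"))

tapply(lengths, types, mean)   # sample mean length for each type
tapply(lengths, types, sd)     # sample standard deviation for each type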
Of course, we cannot hope to find all of the points of a particular type, or all of the pots of a particular type. We never have more than a sample to work with, but we would like to estimate the population parameters on the basis of a sample. Since two samples contain only estimates of the population, it also makes sense to wonder if the two samples are part of the same population or if they come from two different populations. It is only when we are looking at a part of the whole that we have to consider if the statistics computed from the sample are representative of the population as a whole. One of the goals of inferential statistics is to formalize the concepts of similar and dissimilar in terms of probability and this leads to the concepts of confidence intervals and hypothesis testing.
Confidence intervals provide a range around a sample statistic such that, if we drew many samples, a certain percentage of the confidence intervals computed from those samples would include the population value. Hypothesis testing allows us to assign a probability to the possibility that two samples were drawn from the same population (or from populations with the same parameters).
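A minimal illustration with two made-up samples; t.test() reports both a confidence interval for the difference in means and a p-value for the hypothesis that the samples come from populations with the same mean:

# Two hypothetical samples of measurements
a <- c(45, 50, 48, 52, 47, 49)
b <- c(55, 58, 54, 60, 57, 56)

t.test(a, b)   # prints the 95% confidence interval and the p-value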
Classical inferential statistics often depends upon the normal or Gaussian distribution to determine those probabilities. These methods are generally referred to as parametric statistics.
Seriation involves finding a one-dimensional ordering of multivariate data (Marquardt, 1978). In archaeology, it is usually expected that the ordering will reflect chronological change, but the methods cannot guarantee that the ordering will be chronological. Seriation has a long history in archaeology, beginning with Sir Flinders Petrie (1899), who was attempting to order 900 graves chronologically. The idea has provoked the interest of a number of mathematicians over the last century, including David Kendall (1963) and W. S. Robinson (1951), because it involves interesting problems in combinatorial mathematics. Ecologists share an interest in finding one-dimensional orderings of ecological communities that match environmental gradients, although they refer to the process as ordination rather than seriation.
The usual organization of data for seriation is a data frame where the columns represent artifact types (whether present/absent, or percentages) and the rows represent assemblages (graves, houses, sites, stratigraphic layers within sites, etc.). Before the widespread use of computers, seriation involved shuffling the rows of the data set to concentrate the values in each column into as few contiguous rows as possible. Ford (1962) proposed an approach to seriation of assemblages with types represented as percentages that involved shuffling rows to form “battleship curves.” In 1951, Robinson proposed an alternative approach that involved the construction of a similarity matrix. The rows and columns of the matrix are shuffled until the “best” solution is reached based on criteria that Robinson proposed. As computers became available, programs were written to implement both types of seriation. More recently, multivariate methods including multidimensional scaling, principal components, and correspondence analysis have been applied to seriation. These methods often represent the seriation as a parabola, which is referred to in the archaeological, statistical, and ecological literature as a “horseshoe” (Kendall, 1971). Kendall noted that the horseshoe results from the fact that distance measures generally have a maximum distance such that we cannot resolve the relative distances of objects beyond the maximum distance (a horizon effect). Assemblages that do not share any types are on this horizon. Unwrapping the horseshoe is necessary to produce a one-dimensional ordering.
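As a hedged sketch of the correspondence analysis approach, using corresp() from the MASS package on a made-up table of type counts by assemblage; the first-axis row scores suggest a one-dimensional ordering:

library(MASS)

# Hypothetical counts of three pottery types in four assemblages
counts <- matrix(c(20,  5,  0,
                   10, 15,  2,
                    2, 12, 10,
                    0,  4, 18), nrow = 4, byrow = TRUE,
                 dimnames = list(Assemblage = c("A", "B", "C", "D"),
                                 Type = c("T1", "T2", "T3")))

ca <- corresp(counts, nf = 2)   # correspondence analysis
ca$rscore[, 1]                  # first-axis scores for the assemblages
order(ca$rscore[, 1])           # row order suggested by the seriation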
Archaeology is the study of human culture and behavior through its material evidence. Although archaeology sometimes works with the material evidence of contemporary societies (ethnoarchaeology) or historical societies (historical archaeology and classical archaeology), for most of our past, the archaeological record is the only source of information. What we can learn about that past must come from surviving artifacts and modifications of the earth's surface produced by human activity. Fortunately, people tend to be messy.
Our basic sources of evidence consist of artifacts, waste products produced during the manufacture of artifacts or their use, food waste, ground disturbances including pits and mounds, constructions that enclose spaces such as buildings and walls, and the physical remains of people themselves. Study of this evidence includes identification of the raw materials used, what modifications occurred to make the object useful, and the physical shape and dimensions of the final product. Wear and breakage of the object and its repair are also examined.
In addition to its life history, each object has a context. It was discovered in a particular part of a site, in a particular site in a region, occupied by humans at a particular time. Together these make up the three dimensions that Albert Spaulding referred to as the “dimensions of archaeology” (Spaulding, 1960).
Our discovery and analysis of archaeological evidence is directed toward the broad goal of understanding our past. The range of questions archaeologists are attempting to answer about the past is substantial. Broadly they could be grouped into a number of big questions:
1. How did our ancestors come to develop a radically new way of living that involved changes in locomotion (bipedalism), increasing use of tools, the formation of social groups unlike any other living primate, and increases in cranial capacity? Quantitative methods are used to identify sources of raw material for stone tools to determine how far they were transported. They are also used to classify stone tools, to compare the kinds of tools and the kinds of animals found at different sites, and to look for correlations between the distributions of stone tools and animal bones.