The designs we have discussed to this point represent techniques of experimental control that reduce error variance in a research study (Kirk, 1995). It is also possible to reduce error variance through statistical control, using analysis of covariance (ANCOVA). Statistical control is used when we know how participants stand (i.e., we know their score) on an additional variable that was not part of, or could not readily be incorporated into, the experimental design. This additional variable is brought into the data analysis by treating it as a covariate. Although covariates can be categorical or quantitative, we will limit our discussion and data analysis procedures to quantitatively measured covariates.
The covariate represents a potential source of variance that has not been experimentally controlled but could covary with the dependent variable and could therefore confound our interpretation of the results. By identifying that variable as a covariate and collecting measures of it on the study participants, it is possible to “remove” or “neutralize” its effect on the dependent variable (by statistical procedures described in this chapter) prior to analyzing the effects of the independent variable on the dependent variable.
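To make the procedure concrete, here is a minimal ANCOVA sketch in Python using statsmodels; the data, variable names, and group labels are invented for illustration, and a real analysis would of course check ANCOVA's assumptions first.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented data: a two-group study with a quantitative covariate (pretest).
df = pd.DataFrame({
    "group":   ["control"] * 4 + ["treatment"] * 4,
    "pretest": [10, 12, 11, 13, 10, 14, 12, 13],
    "score":   [20, 24, 22, 25, 26, 31, 28, 30],
})

# The covariate enters the model alongside the factor, so the group
# effect is evaluated after the covariate's variance is accounted for.
model = smf.ols("score ~ pretest + C(group)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

In the resulting table, the test of group is performed after the variance associated with the covariate has been removed, which is exactly the "statistical control" described above.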
A SIMPLE ILLUSTRATION OF COVARIANCE
Consider a test of math skills that we might administer to schoolchildren in a particular grade. We wish to compare girls and boys in their math problem-solving skills.
In between-subjects designs, participants contribute a single value of the dependent variable, and different participants are represented under each level of the independent variable. The situation is different for within-subjects designs. Here, we register a value of the dependent variable in each and every condition of the study for every one of the participants or cases in the study. That is, each participant is measured under each level of the independent variable.
Within-subjects designs are still univariate studies in that there is only one dependent variable in the design. As is true for between-subjects designs, the dependent variable is operationalized in a particular manner. What differentiates between-subjects and within-subjects designs is the number of such measurements in the data file that are associated with each participant: In between-subjects designs, researchers record just one data point (measurement) for each participant; in within-subjects designs, researchers record as many measurements for each participant as there are conditions in the study.
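As a concrete (invented) sketch of the two data layouts just described, a between-subjects file carries one score per participant, whereas a within-subjects file carries one score per participant per condition:

```python
import pandas as pd

# Between-subjects: each participant appears once, under one condition.
between = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "condition":   ["A", "A", "B", "B"],
    "score":       [12, 15, 20, 18],
})

# Within-subjects: each participant contributes a score in every condition.
within = pd.DataFrame({
    "participant": [1, 2],
    "score_A":     [12, 15],   # measured under condition A...
    "score_B":     [20, 18],   # ...and again under condition B
})
```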
Within-subjects designs can theoretically contain any number of independent variables combined in a factorial manner. Practically, because all participants must contribute data under each level of each independent variable, the logistics of the data collection procedures can become quite challenging with several independent variables in the design. Also, there is the possibility that subjects could be affected or changed by exposure to one level of an independent variable, a situation known as carry-over effects, such that it may not be appropriate to expose them to more than that single level; the likelihood of eliminating all of the possible carry-over effects diminishes with each additional independent variable included in the design (we will discuss carry-over effects more in Section 10.4).
The present text is an exploration of univariate methodology where the effects of one or more independent variables are assessed on a single dependent variable. Such univariate designs are ubiquitous in the social, behavioral, and biological science literature. We have chosen, in this book, to focus our efforts on analysis of variance (ANOVA). Issues concerning multivariate methodology, including multiple regression analysis, are not covered in the present text as a result of space limitations, but they are addressed in a companion text (see Meyers, Gamst, & Guarino, 2006).
This book owes both a conceptual and computational debt to early ANOVA pioneers, beginning with the seminal work of Fisher (1925, 1935), who focused on solving agricultural problems with experimental methods. Fisher's early work was adapted to other fields, including the social and behavioral sciences, and in doing so moved from childhood to early adolescence with the work of Baxter (1940, 1941), Crutchfield (1938), Garrett and Zubin (1943), Lindquist (1940), Snedecor (1934), and Yates (1937). By the 1950s, ANOVA procedures were well established within most social and behavioral sciences (e.g., Cochran & Cox, 1957; Lindquist, 1953; Scheffé, 1959).
Beginning in the early 1960s, ANOVA procedures were further delineated and popularized by Winer (1962, 1971) and Winer, Brown, and Michels (1991). These works, while sometimes challenging to read, were considered the “gold standard” by many ANOVA practitioners.
The term analysis of variance is very descriptive of the process we use in the statistical treatment of the data. In a general sense, to analyze something is to examine the individual elements of a whole. In ANOVA, we start with a totality and break it apart – partition it – into portions that we then examine separately.
The totality that we break apart in ANOVA, as you might suspect from the name of the procedure, is the variance of the scores on the dependent variable. As you recall from Chapter 2, variance is equal to the sum of squares – the sum of the squared deviations from the mean – divided by the degrees of freedom. In an ANOVA design, it is the variance of the total set of scores that is being partitioned. The various ANOVA designs are different precisely because they allow this total variance to be partitioned in different ways.
ANOVA is a statistical procedure that allows us to partition (divide) the total variance measured in the study into its sources or component parts. This total measured variance is the variance of the scores that we have obtained when participants were measured on the dependent variable. Therefore, whenever we talk about the variance associated with a given partition or whenever we talk about the total variance, we are always referring to the variance of the dependent variable.
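This partition can be verified numerically. The sketch below uses made-up scores for a one-way design with three groups and confirms that the total sum of squares equals the between-groups plus within-groups sums of squares:

```python
import numpy as np

# Invented scores for three groups.
groups = [np.array([3., 5., 4.]), np.array([6., 8., 7.]), np.array([9., 11., 10.])]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total variability of the dependent variable around the grand mean.
ss_total = ((all_scores - grand_mean) ** 2).sum()
# Variability of group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of scores around their own group means (error variance).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The partition: SS_total = SS_between + SS_within
assert np.isclose(ss_total, ss_between + ss_within)
print(ss_total, ss_between, ss_within)   # 60.0 54.0 6.0
```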
A two-way within-subjects design is one in which the levels of two within-subjects variables are combined factorially. Recall from Chapter 11 that within-subjects independent variables can be either time related or not. There are three possible combinations of types of within-subjects variables that can be represented in a two-way design: (a) both can be time independent, (b) one can be time independent and the other time based, and (c) both can be time based. This latter combination is not often found in the research literature; the other two are more common, and we provide illustrations of them here.
Both variables time independent: Students can be asked to indicate if short bursts of music are either rock or classical (music variable) under either silent or noisy conditions (noise variable). In such a situation, participants would be exposed randomly to the various combinations of the independent variables. The dependent variable could be speed of recognition of the music.
Recall the major steps in inverted index construction (a toy code sketch follows the list):
Collect the documents to be indexed.
Tokenize the text.
Do linguistic preprocessing of tokens.
Index the documents that each term occurs in.
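Here is the promised toy sketch of those four steps. The documents and the whitespace tokenizer are illustrative; real systems use far more careful tokenization and preprocessing, as this chapter discusses:

```python
from collections import defaultdict

# Step 1: collect the documents to be indexed (an invented toy collection).
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(set)
for doc_id, text in docs.items():
    tokens = text.split()                 # step 2: tokenize (naively, on whitespace)
    terms = [t.lower() for t in tokens]   # step 3: minimal linguistic preprocessing
    for term in terms:                    # step 4: record each term's documents
        index[term].add(doc_id)

print(sorted(index["sales"]))   # -> [1, 2, 3]
```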
In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.
Document delineation and character sequence decoding
Obtaining the character sequence in a document
Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters.
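As a minimal illustration of this decoding step (the UTF-8 assumption here is ours; in general the encoding must be detected or supplied):

```python
# Bytes as they might be stored in a file or received from a web server.
raw = b"Caf\xc3\xa9 society"
# Decode the byte sequence into a linear sequence of characters.
text = raw.decode("utf-8")   # -> "Café society"
```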
The meaning of the term information retrieval (IR) can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I'm sorry, I can only look up your order if you can give me your order ID”).
Information retrieval can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data that does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
On page 113, we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1, we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2, we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. Although latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval (IR), it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 343). Understanding its full potential remains an area of active research.
Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.
Linear algebra review
We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative.
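As a toy illustration of the matrix decomposition machinery this chapter builds on, the sketch below computes a rank-k approximation to a small term-document matrix via the singular value decomposition; the matrix and the choice k = 2 are invented:

```python
import numpy as np

# 3 terms x 4 documents (illustrative counts).
C = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
# Best rank-k approximation to C in the Frobenius norm.
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(C_k, 2))
```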
Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore and computer and chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
If your standing query is just multicore and computer and chip, you will tend to miss many relevant new articles which use other terms such as multicore processors. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore or multi-core) and (chip or processor or microprocessor).
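A standing query of this kind is easy to sketch in code. The fragment below, with invented documents and a hand-written matcher standing in for a real Boolean engine with stemming, shows the refined query being run over newly arrived documents:

```python
def matches(doc_terms):
    # (multicore OR multi-core) AND (chip OR processor OR microprocessor)
    return (("multicore" in doc_terms or "multi-core" in doc_terms)
            and ("chip" in doc_terms or "processor" in doc_terms
                 or "microprocessor" in doc_terms))

new_docs = ["New multicore chip designs announced",
            "New recipes for summer"]
hits = [d for d in new_docs if matches(set(d.lower().split()))]
print(hits)   # -> ['New multicore chip designs announced']
```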
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to.
In Chapters 1 and 2, we developed the ideas underlying inverted indexes for handling Boolean and proximity queries. Here, we develop techniques that are robust to typographical errors in the query, as well as alternative spellings. In Section 3.1, we develop data structures that help the search for terms in the vocabulary in an inverted index. In Section 3.2, we study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all the five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* seeks documents containing any of the terms automatic, automation, and automated.
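A naive way to answer such a query is to translate it into a regular expression and scan the entire vocabulary. This linear scan is exactly what the index structures developed later in the chapter avoid, but it makes the semantics of * concrete; the vocabulary below is invented:

```python
import re

def wildcard_match(query, vocabulary):
    # Each * becomes .* in an anchored regular expression.
    pattern = re.compile("^" + ".*".join(re.escape(p) for p in query.split("*")) + "$")
    return [t for t in vocabulary if pattern.match(t)]

vocab = ["automatic", "automation", "automated", "automobile", "autumn"]
print(wildcard_match("automat*", vocab))
# -> ['automatic', 'automation', 'automated']
```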
We then turn to other forms of imprecisely posed queries, focusing on spelling errors in Section 3.3. Users make spelling errors either by accident, or because the term they are searching for (e.g., Herman) has no unambiguous spelling in the collection. We detail a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms. Finally, in Section 3.4 we study a method for seeking vocabulary terms that are phonetically close to the query term(s).
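Many of these correction techniques rest on a notion of distance between a misspelled term and candidate corrections. A standard choice is Levenshtein edit distance, sketched below with the usual dynamic program (the uniform cost model is one of several variants in use):

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i   # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j   # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("herman", "hermann"))   # -> 1
```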
Web crawling is the process by which we gather pages from the Web to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19, we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter, we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.
The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
Features a crawler must provide
We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
Thus far, we have dealt with indexes that support Boolean queries: A document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter, we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata, such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
In Section 6.3, we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring and related topics.
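As a preview of where these ideas lead, here is a toy end-to-end sketch: tf-idf weights are computed for a tiny invented collection, and each document is scored against a query by cosine similarity (the logarithmic weighting used here is one of several variants Section 6.4 surveys):

```python
import math
from collections import Counter

docs = ["car insurance auto insurance",
        "best car prices",
        "auto repair shop"]
tokenized = [d.split() for d in docs]
N = len(docs)

# Document frequency and inverse document frequency per term.
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log10(N / df[t]) for t in df}

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: (1 + math.log10(tf[t])) * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf_vector("car insurance".split())
for i, toks in enumerate(tokenized):
    print(i, round(cosine(q, tfidf_vector(toks)), 3))
```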
Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 237): Our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the two-dimensional (2D) plane.
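K-means, developed later in this chapter, is the classic flat clustering algorithm for exactly this setting. The sketch below, with invented 2D points and K = 3, alternates between assigning points to their nearest centroid and recomputing centroids:

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = np.array([[0., 0.], [0.5, 0.], [5., 5.], [5.5, 5.], [10., 0.], [10.5, 0.]])
labels, centroids = kmeans(pts, k=3)
print(labels)
```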
Improving classifier effectiveness has been an area of intensive machine-learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: It is a vector-space–based machine-learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1), and then extend the model in Section 15.2 to nonseparable data, multiclass problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3: What sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine-learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4).
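As a closing, minimal illustration (a sketch using scikit-learn, not a full implementation of the machinery this chapter develops), a linear SVM text classifier can be trained and applied in a few lines; the documents and class labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["cheap car insurance deals", "buy auto insurance now",
              "faculty meeting agenda", "lecture notes posted online"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)
clf = LinearSVC().fit(X, labels)   # learns a maximum-margin linear boundary

print(clf.predict(vectorizer.transform(["car insurance online"])))
```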