The designs we have discussed to this point represent techniques of experimental control that reduce error variance in a research study (Kirk, 1995). It is also possible to reduce error variance through statistical control, using analysis of covariance (ANCOVA). Statistical control is used when we know how participants stand (i.e., we know their score) on an additional variable that was not part of, or could not readily be incorporated into, the experimental design. This additional variable is brought into the data analysis by treating it as a covariate. Although covariates can be categorical or quantitative, we will limit our discussion and data analysis procedures to quantitatively measured covariates.
The covariate represents a potential source of variance that has not been experimentally controlled but could covary with the dependent variable and could therefore confound our interpretation of the results. By identifying that variable as a covariate and collecting measures of it on the study participants, it is possible to “remove” or “neutralize” its effect on the dependent variable (by statistical procedures described in this chapter) prior to analyzing the effects of the independent variable on the dependent variable.
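To make the procedure concrete, here is a minimal ANCOVA sketch in Python using statsmodels; the data, variable names, and group labels are invented for illustration, and a real analysis would of course check ANCOVA's assumptions first.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented data: a two-group study with a quantitative covariate (pretest).
df = pd.DataFrame({
    "group":   ["control"] * 4 + ["treatment"] * 4,
    "pretest": [10, 12, 11, 13, 10, 14, 12, 13],
    "score":   [20, 24, 22, 25, 26, 31, 28, 30],
})

# The covariate enters the model alongside the factor, so the group
# effect is evaluated after the covariate's variance is accounted for.
model = smf.ols("score ~ pretest + C(group)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

In the resulting table, the test of group is performed after the variance associated with the covariate has been removed, which is exactly the "statistical control" described above.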
A SIMPLE ILLUSTRATION OF COVARIANCE
Consider a test of math skills that we might administer to schoolchildren in a particular grade. We wish to compare girls and boys in their math problem-solving skills.
In between-subjects designs, participants contribute a single value of the dependent variable, and different participants are represented under each level of the independent variable. The situation is different for within-subjects designs. Here, we register a value of the dependent variable in each and every condition of the study for every one of the participants or cases in the study. That is, each participant is measured under each level of the independent variable.
Within-subjects designs are still univariate studies in that there is only one dependent variable in the design. As is true for between-subjects designs, the dependent variable is operationalized in a particular manner. What differentiates between-subjects and within-subjects designs is the number of such measurements in the data file that are associated with each participant: In between-subjects designs, researchers record just one data point (measurement) for each participant; in within-subjects designs, researchers record as many measurements for each participant as there are conditions in the study.
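As a concrete (invented) sketch of the two data layouts just described, a between-subjects file carries one score per participant, whereas a within-subjects file carries one score per participant per condition:

```python
import pandas as pd

# Between-subjects: each participant appears once, under one condition.
between = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "condition":   ["A", "A", "B", "B"],
    "score":       [12, 15, 20, 18],
})

# Within-subjects: each participant contributes a score in every condition.
within = pd.DataFrame({
    "participant": [1, 2],
    "score_A":     [12, 15],   # measured under condition A...
    "score_B":     [20, 18],   # ...and again under condition B
})
```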
Within-subjects designs can theoretically contain any number of independent variables combined in a factorial manner. Practically, because all participants must contribute data under each level of each independent variable, the logistics of the data collection procedures can become quite challenging with several independent variables in the design. Also, there is the possibility that subjects could be affected or changed by exposure to one level of an independent variable, a situation known as carry-over effects, such that it may not be appropriate to expose them to more than that single level; the likelihood of eliminating all of the possible carry-over effects diminishes with each additional independent variable included in the design (we will discuss carry-over effects more in Section 10.4).
The present text is an exploration of univariate methodology where the effects of one or more independent variables are assessed on a single dependent variable. Such univariate designs are ubiquitous in the social, behavioral, and biological science literature. We have chosen, in this book, to focus our efforts on analysis of variance (ANOVA). Issues concerning multivariate methodology, including multiple regression analysis, are not covered in the present text as a result of space limitations, but they are addressed in a companion text (see Meyers, Gamst, & Guarino, 2006).
This book owes both a conceptual and computational debt to early ANOVA pioneers, beginning with the seminal work of Fisher (1925, 1935), who focused on solving agricultural problems with experimental methods. Fisher's early work was adapted to other fields, including the social and behavioral sciences, and in doing so moved from childhood to early adolescence with the work of Baxter (1940, 1941), Crutchfield (1938), Garrett and Zubin (1943), Lindquist (1940), Snedecor (1934), and Yates (1937). By the 1950s, ANOVA procedures were well established within most social and behavioral sciences (e.g., Cochran & Cox, 1957; Lindquist, 1953; Scheffé, 1959).
Beginning in the early 1960s, ANOVA procedures were further delineated and popularized by Winer (1962, 1971) and Winer, Brown, and Michels (1991). These works, while sometimes challenging to read, were considered the “gold standard” by many ANOVA practitioners.
The term analysis of variance is very descriptive of the process we use in the statistical treatment of the data. In a general sense, to analyze something is to examine the individual elements of a whole. In ANOVA, we start with a totality and break it apart – partition it – into portions that we then examine separately.
The totality that we break apart in ANOVA, as you might suspect from the name of the procedure, is the variance of the scores on the dependent variable. As you recall from Chapter 2, variance is equal to the sum of squares – the sum of the squared deviations from the mean – divided by the degrees of freedom. In an ANOVA design, it is the variance of the total set of scores that is being partitioned. The various ANOVA designs are different precisely because they allow this total variance to be partitioned in different ways.
ANOVA is a statistical procedure that allows us to partition (divide) the total variance measured in the study into its sources or component parts. This total measured variance is the variance of the scores that we have obtained when participants were measured on the dependent variable. Therefore, whenever we talk about the variance associated with a given partition or whenever we talk about the total variance, we are always referring to the variance of the dependent variable.
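This partition can be verified numerically. The sketch below uses made-up scores for a one-way design with three groups and confirms that the total sum of squares equals the between-groups plus within-groups sums of squares:

```python
import numpy as np

# Invented scores for three groups.
groups = [np.array([3., 5., 4.]), np.array([6., 8., 7.]), np.array([9., 11., 10.])]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total variability of the dependent variable around the grand mean.
ss_total = ((all_scores - grand_mean) ** 2).sum()
# Variability of group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of scores around their own group means (error variance).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The partition: SS_total = SS_between + SS_within
assert np.isclose(ss_total, ss_between + ss_within)
print(ss_total, ss_between, ss_within)   # 60.0 54.0 6.0
```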
A two-way within-subjects design is one in which the levels of two within-subjects variables are combined factorially. Recall from Chapter 11 that within-subjects independent variables can be either time related or not. There are three possible combinations of types of within-subjects variables that can be represented in a two-way design: (a) both can be time independent, (b) one can be time independent and the other time based, and (c) both can be time based. This latter combination is not often found in the research literature; the other two are more common, and we provide illustrations of them here.
Both variables time independent: Students can be asked to indicate if short bursts of music are either rock or classical (music variable) under either silent or noisy conditions (noise variable). In such a situation, participants would be exposed randomly to the various combinations of the independent variables. The dependent variable could be speed of recognition of the music.
Recall the major steps in inverted index construction (a toy code sketch follows the list):
Collect the documents to be indexed.
Tokenize the text.
Do linguistic preprocessing of tokens.
Index the documents that each term occurs in.
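Here is the promised toy sketch of those four steps. The documents and the whitespace tokenizer are illustrative; real systems use far more careful tokenization and preprocessing, as this chapter discusses:

```python
from collections import defaultdict

# Step 1: collect the documents to be indexed (an invented toy collection).
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(set)
for doc_id, text in docs.items():
    tokens = text.split()                 # step 2: tokenize (naively, on whitespace)
    terms = [t.lower() for t in tokens]   # step 3: minimal linguistic preprocessing
    for term in terms:                    # step 4: record each term's documents
        index[term].add(doc_id)

print(sorted(index["sales"]))   # -> [1, 2, 3]
```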
In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms that a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.
Document delineation and character sequence decoding
Obtaining the character sequence in a document
Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters.
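As a minimal illustration of this decoding step (the UTF-8 assumption here is ours; in general the encoding must be detected or supplied):

```python
# Bytes as they might be stored in a file or received from a web server.
raw = b"Caf\xc3\xa9 society"
# Decode the byte sequence into a linear sequence of characters.
text = raw.decode("utf-8")   # -> "Café society"
```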
The meaning of the term information retrieval (IR) can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: “I'm sorry, I can only look up your order if you can give me your order ID”).
Information retrieval can also cover other kinds of data and information problems beyond that specified in the core definition above. The term “unstructured data” refers to data that does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records.
On page 113, we introduced the notion of a term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1, we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2, we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. Although latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval (IR), it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6, page 343). Understanding its full potential remains an area of active research.
Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.
Linear algebra review
We briefly review some necessary background in linear algebra. Let C be an M × N matrix with real-valued entries; for a term-document matrix, all entries are in fact non-negative.
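As a toy illustration of the matrix decomposition machinery this chapter builds on, the sketch below computes a rank-k approximation to a small term-document matrix via the singular value decomposition; the matrix and the choice k = 2 are invented:

```python
import numpy as np

# 3 terms x 4 documents (illustrative counts).
C = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
# Best rank-k approximation to C in the Frobenius norm.
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(C_k, 2))
```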
Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine. However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore and computer and chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
If your standing query is just multicore and computer and chip, you will tend to miss many relevant new articles which use other terms such as multicore processors. To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore or multi-core) and (chip or processor or microprocessor).
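A standing query of this kind is easy to sketch in code. The fragment below, with invented documents and a hand-written matcher standing in for a real Boolean engine with stemming, shows the refined query being run over newly arrived documents:

```python
def matches(doc_terms):
    # (multicore OR multi-core) AND (chip OR processor OR microprocessor)
    return (("multicore" in doc_terms or "multi-core" in doc_terms)
            and ("chip" in doc_terms or "processor" in doc_terms
                 or "microprocessor" in doc_terms))

new_docs = ["New multicore chip designs announced",
            "New recipes for summer"]
hits = [d for d in new_docs if matches(set(d.lower().split()))]
print(hits)   # -> ['New multicore chip designs announced']
```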
To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to.
In Chapters 1 and 2, we developed the ideas underlying inverted indexes for handling Boolean and proximity queries. Here, we develop techniques that are robust to typographical errors in the query, as well as alternative spellings. In Section 3.1, we develop data structures that help the search for terms in the vocabulary in an inverted index. In Section 3.2, we study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all the five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* seeks documents containing any of the terms automatic, automation, and automated.
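A naive way to answer such a query is to translate it into a regular expression and scan the entire vocabulary. This linear scan is exactly what the index structures developed later in the chapter avoid, but it makes the semantics of * concrete; the vocabulary below is invented:

```python
import re

def wildcard_match(query, vocabulary):
    # Each * becomes .* in an anchored regular expression.
    pattern = re.compile("^" + ".*".join(re.escape(p) for p in query.split("*")) + "$")
    return [t for t in vocabulary if pattern.match(t)]

vocab = ["automatic", "automation", "automated", "automobile", "autumn"]
print(wildcard_match("automat*", vocab))
# -> ['automatic', 'automation', 'automated']
```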
We then turn to other forms of imprecisely posed queries, focusing on spelling errors in Section 3.3. Users make spelling errors either by accident, or because the term they are searching for (e.g., Herman) has no unambiguous spelling in the collection. We detail a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms. Finally, in Section 3.4 we study a method for seeking vocabulary terms that are phonetically close to the query term(s).
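Many of these correction techniques rest on a notion of distance between a misspelled term and candidate corrections. A standard choice is Levenshtein edit distance, sketched below with the usual dynamic program (the uniform cost model is one of several variants in use):

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i   # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j   # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("herman", "hermann"))   # -> 1
```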
Web crawling is the process by which we gather pages from the Web to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19, we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter, we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as web crawler; it is sometimes referred to as a spider.
The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling from the student project scale to substantial research projects. We begin (Section 20.1.1) by listing desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
Features a crawler must provide
We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.
Thus far, we have dealt with indexes that support Boolean queries: A document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter, we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata, such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
In Section 6.3, we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring and related topics.
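As a preview of where these ideas lead, here is a toy end-to-end sketch: tf-idf weights are computed for a tiny invented collection, and each document is scored against a query by cosine similarity (the logarithmic weighting used here is one of several variants Section 6.4 surveys):

```python
import math
from collections import Counter

docs = ["car insurance auto insurance",
        "best car prices",
        "auto repair shop"]
tokenized = [d.split() for d in docs]
N = len(docs)

# Document frequency and inverse document frequency per term.
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log10(N / df[t]) for t in df}

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: (1 + math.log10(tf[t])) * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf_vector("car insurance".split())
for i, toks in enumerate(tokenized):
    print(i, round(cosine(q, tfidf_vector(toks)), 3))
```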
Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.
Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 237): Our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the two-dimensional (2D) plane.
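K-means, developed later in this chapter, is the classic flat clustering algorithm for exactly this setting. The sketch below, with invented 2D points and K = 3, alternates between assigning points to their nearest centroid and recomputing centroids:

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = np.array([[0., 0.], [0.5, 0.], [5., 5.], [5.5, 5.], [10., 0.], [10.5, 0.]])
labels, centroids = kmeans(pts, k=3)
print(labels)
```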
Improving classifier effectiveness has been an area of intensive machine-learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: It is a vector-space–based machine-learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1), and then extend the model in Section 15.2 to nonseparable data, multiclass problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3: What sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine-learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4).
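As a closing, minimal illustration (a sketch using scikit-learn, not a full implementation of the machinery this chapter develops), a linear SVM text classifier can be trained and applied in a few lines; the documents and class labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["cheap car insurance deals", "buy auto insurance now",
              "faculty meeting agenda", "lecture notes posted online"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)
clf = LinearSVC().fit(X, labels)   # learns a maximum-margin linear boundary

print(clf.predict(vectorizer.transform(["car insurance online"])))
```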