In order to learn to work with R, you have to learn to speak its language, the S language, developed originally at Bell Laboratories (Becker et al., 1988). The grammar of this programming language is beautiful and easy to learn. It is important to master its basics, as this grammar is designed to guide you towards the appropriate way of thinking about your data and how you might want to carry out your analysis.
When you begin to use R on an Apple Macintosh or a Windows PC, you start R either through a menu guiding you to applications, or by clicking on R's icon. This launches a graphical user interface whose central part is a window with a prompt (>), where you type your commands. On Unix or Linux systems, the same window is obtained by opening a terminal and typing R at its prompt.
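A first session consists of nothing more than typing expressions at this prompt and reading off the results that R prints; a minimal example (the [1] prefixing each output line is R's way of numbering the elements of the result):

> 1 + 2
[1] 3
> x = c(1, 2, 3)     # store a vector of three numbers in x
> mean(x)
[1] 2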
The sequence of commands in a given R session and the objects created are stored in files named .Rhistory and .RData when you quit R and respond positively to the question of whether you want to save your workspace. If you do so, then your results will be available to you the next time you start up R. If you are using a graphical user interface, this .RData file will be located by default in the folder where R has been installed.
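From a terminal session, for instance, quitting looks like this; answering y writes the .RData and .Rhistory files (the dialog shown is R's standard prompt):

> q()
Save workspace image? [y/n/c]: y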
Chapter 1 introduced the data frame as the data structure for storing vectors of numbers as well as factors. Numerical vectors and factors represent in R what statisticians call random variables. A random variable is the outcome of an experiment. Here are some examples of experiments and their associated random variables (a short simulation sketch in R follows the list):
tossing a coin Tossing a coin will result in either “head” or “tail.” Hence, the toss of a coin is a random variable with two outcomes.
throwing a die In this case, we are dealing with a random variable with six possible outcomes, 1, 2, …, 6.
counting words We can count the frequencies with which words occur in a given corpus or text. Word frequency is a random variable with, as possible values, 1, 2, 3, …, N, with N the size of the corpus.
familiarity rating Participants are asked to indicate on a seven-point scale how frequently they think words are used. The ratings elicited for a given word will vary from participant to participant, and constitute a random variable.
lexical decision Participants are asked to indicate, by means of button presses, whether a word presented visually or auditorily is an existing word of the language. There are two outcomes, and hence two random variables, for this type of experiment: the accuracy of a response (with levels “correct” and “incorrect”) and the latency of the response (in milliseconds).
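Several of these experiments are easy to mimic at the R prompt with the sample() function; the sketch below simulates ten tosses of a coin and ten throws of a die (being random variables, the outcomes will differ from run to run):

> sample(c("head", "tail"), 10, replace = TRUE)   # ten coin tosses
> sample(1:6, 10, replace = TRUE)                 # ten throws of a die
> table(sample(1:6, 600, replace = TRUE))         # outcome frequencies for 600 throws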
Sections 4.3 and 4.4 introduced the basics of linear regression and analysis of covariance. This chapter begins with a recapitulation of the central concepts and ideas introduced in Chapter 4. It then broadens the horizon on linear regression in several ways. Section 6.2 discusses multiple linear regression and various analytical strategies for dealing with multiple predictors simultaneously. Section 6.3 introduces the generalized linear model, which extends the linear modeling approach to binary dependent variables (successes versus failures, correct versus incorrect responses, NP or PP realizations of the dative, etc.) and factors with ordered levels (e.g. low, mid, and high education level). (The VARBRUL program used widely in sociolinguistics implements the generalized linear model for binary variables.) Finally, section 6.4 outlines a method for dealing with breakpoints, and section 6.5 discusses the special care required for dealing with word frequency distributions.
Introduction
Consider again the ratings data set that we studied in Chapter 4. We are interested in whether the rated size (averaged over subjects) of the referents of 81 English nouns can be predicted from the subjective estimates of these words' familiarity and from the class of their referents (plant versus animal). We begin by fitting an analysis of covariance model with meanFamiliarity as a nonlinear numeric predictor and Class as a factorial predictor.
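In R, such a model is fit with lm(); the sketch below assumes a data frame named ratings with columns meanSizeRating, meanFamiliarity, and Class, and captures the nonlinearity with a quadratic term (the exact form of the formula is an assumption for illustration):

> ratings.lm = lm(meanSizeRating ~ meanFamiliarity + I(meanFamiliarity^2) + Class, data = ratings)
> summary(ratings.lm)   # coefficients, their standard errors, and p-values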
This book provides an introduction to the statistical analysis of quantitative data for researchers studying aspects of language and language processing. The statistical analysis of quantitative data is often seen as an onerous task that we would rather leave to others. Statistical packages tend to be used as a kind of oracle, from which you elicit a verdict as to whether you have one or more significant effects in your data. In order to elicit a response from the oracle, you have to click your way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p-values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle.
The approach to data analysis to which this book provides a guide is fundamentally different in several ways. First of all, we will make use of a radically different tool for doing statistics, the interactive programming environment known as R. R is an open source implementation of the (object-oriented) S language for statistical analysis originally developed at Bell Laboratories. It is the platform par excellence for research and development in computational statistics. It can be downloaded from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org or one of the many mirror sites. Learning to work with R is in many ways similar to learning a new language.
Many statistical tests exploit the properties of the probability distributions of random variables. This chapter provides an introduction to some of the most important probability distributions, and lays the groundwork for the statistical tests introduced in Chapter 4.
Distributions
When we count how often a word is used, or when we measure the duration of a vowel, we carry out a statistical experiment. The outcome of such a statistical experiment varies each time it is carried out. For instance, the frequency of a word (the outcome of a counting experiment) will vary from text to text and from corpus to corpus, and similarly the length of a given vowel (the outcome of a measuring experiment) will vary from syllable to syllable and from word to word. For a given random variable, some outcomes may be more likely than others. The probability distribution of a random variable specifies the likelihood of the different outcomes. Random variables fall into two important categories: random variables such as frequency counts are discrete (taking integer values), while random variables such as durational measurements are continuous (taking real values). We begin by introducing two discrete distributions.
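The distinction is easy to see at the R prompt: a discrete random variable such as a count can be simulated with rbinom(), and a continuous one such as a duration with rnorm() (the parameter values below are purely illustrative):

> rbinom(5, size = 100, prob = 0.1)   # five counts of successes in 100 trials: integers
> rnorm(5, mean = 275, sd = 25)       # five durations (in ms, say): real numbers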
Discrete distributions
The CELEX lexical database (Baayen et al., 1995) lists the frequencies of a large number of English words in a corpus of 18.6 million words. Table 3.1 provides these frequencies for four words, the high-frequency definite article the, the medium-frequency word president, and two low-frequency words, hare and harpsichord.
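Counts from corpora of different sizes are made comparable by converting them to relative frequencies, typically per million words; for an 18.6 million word corpus this is a one-line calculation in R (the count of 930 below is hypothetical, not taken from Table 3.1):

> 930 / 18600000 * 1000000   # relative frequency per million words
[1] 50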
The previous chapter introduced various techniques for analyzing data with one or two vectors. The remaining chapters of this book discuss various ways of dealing with data sets with more than two vectors. Data sets with many vectors are typically brought together in matrices. These matrices list the observations on the rows, with the vectors (column variables) specifying the different properties of the observations. Data sets like this are referred to as multivariate data.
This chapter discusses two approaches for discovering the structure in multivariate data sets. In one approach, we seek to find structure in the data in terms of groupings of observations. These techniques are unsupervised in the sense that we do not prescribe what groupings should be there. We discuss these techniques under the heading of clustering. In the other approach, we know what groups there are in theory, and the question is whether the data support these groups. This second group of techniques can be described as supervised, because they work with a grouping that is imposed on the data by the analyst. We will refer to these techniques as methods for classification.
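The two approaches correspond to different functions in R; as a minimal sketch using R's built-in iris data rather than one of this book's linguistic data sets, hclust() groups observations without being told about any classes, whereas lda() from the MASS package is handed the classes and asked to separate them:

> plot(hclust(dist(iris[, 1:4])))   # unsupervised: clusters from the measurements alone
> library(MASS)
> lda(Species ~ ., data = iris)     # supervised: the grouping Species is imposed by the analyst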
Clustering
Tables with measurements: principal components analysis
Words such as goodness and sharpness can be analyzed as consisting of a stem (good, sharp) and an affix, the suffix -ness. Some affixes are used in many words; -ness is an example.
Consider a study addressing how adding white noise affects the comprehension of words presented auditorily over headphones to a group of subjects, using auditory lexical decision latencies as a measure of the speed of lexical access. In such a study, the presence or absence of white noise would be the treatment factor, with two levels (noise versus no noise). In addition, we would need identifiers for the individual words (items), and identifiers for the individual participants (or subjects) in the experiment. The item and subject factors, however, differ from the treatment factor in that we would normally regard only the treatment factor as repeatable.
A factor is repeatable if the set of possible levels for that factor is fixed, and if, moreover, each of these levels can be repeated. In our example, the treatment factor is repeatable, because we can take any new acoustic signal and either add or not add a fixed amount of white noise. We would not normally regard the identifiers of items or subjects as repeatable. Items and subjects are sampled randomly from populations of words and participants, and replicating the experiment would involve selecting other words and other participants. For these new units, we would need new identifiers. In other words, we would be introducing new levels of the subject and item factors that had not been seen previously in the experiment.
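It is this distinction that motivates treating subjects and items as random effects in a mixed-effects model; a sketch with lmer() from the lme4 package, assuming a hypothetical data frame noise.df with reaction times RT, the treatment factor Noise, and identifiers Subject and Word:

> library(lme4)
> noise.lmer = lmer(RT ~ Noise + (1 | Subject) + (1 | Word), data = noise.df)   # random intercepts for subjects and items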
The logic underlying the statistical tests described in this book is simple. A statistical test produces a test statistic whose distribution is known. What we want to know is whether the test statistic has a value that is extreme, so extreme that it is unlikely to be attributable to chance. In the traditional terminology, we pit a null hypothesis, actually a straw man, that the test statistic does not have an extreme value, against an alternative hypothesis according to which its value is indeed extreme. Whether a test statistic has an extreme value is evaluated by calculating how far out it is in one of the tails of the distribution. Functions like pt(), pf(), and pchisq() tell us how far out we are in a tail by means of p-values, which assess what proportion of the population has even more extreme values. The smaller this proportion is, the more reason we have for surprise that our test statistic is as extreme as it actually is.
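For instance, for a t-statistic the two-tailed p-value is obtained by doubling the upper-tail probability; the value of the statistic and its degrees of freedom below are made up for illustration:

> 2 * (1 - pt(2.3, 40))   # two-tailed p-value for t = 2.3 on 40 degrees of freedom: roughly 0.027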
However, the fuzzy notion of what counts as extreme needs to be made more precise. It is generally assumed that a probability begins to count as extreme by the time it drops below 0.05. However, opinions differ with respect to how significance should be assessed.
One tradition holds that the researcher should begin by defining what counts as extreme, before gathering and analyzing data.
Portions of this chapter are taken from Introduction to Chemical Engineering Analysis by Russell and Denn (1972) and are used with permission.
In Chapter 2, a constitutive equation for reaction rate was introduced, and the experimental means of verifying it was discussed for some simple systems. The use of the verified reaction-rate expression in some introductory design problems was illustrated in Chapter 2. Chapter 3 expanded on the analysis of reactors presented in Chapter 2 by dealing with heat exchangers and showing how the analysis is carried out for systems with two control volumes. A constitutive rate expression for heat transfer was presented, and experiments to verify it were discussed.
This chapter considers the analysis of mass contactors, devices in which there are at least two phases and in which some species are transferred between the phases. The analysis will produce a set of equations for two control volumes just as it did for heat exchangers. The rate expression for mass transfer is similar to that for heat transfer; both have a term to account for the area between the two control volumes. In heat exchangers this area is determined by the geometry of the exchanger and is readily obtained. In a mass contactor this area is determined by multiphase fluid mechanics, and its estimation requires more effort. In mass contactors in which transfer occurs across a membrane the nominal area determination is readily done just as for heat exchangers, but the actual area for transfer may be less well defined.
Figure 1.2 presents the logic leading to technically feasible analysis and design. In this chapter we illustrate the design process that follows from the analysis of existing equipment, experiment, and the development of model equations capable of predicting equipment performance. Design requests can come in the form of memos, but an ongoing dialogue between those requesting a design and those carrying out the design helps to properly define the problem. This is difficult to illustrate in a textbook but we will try to give some sense of the process in the case studies presented here.
Technically feasible heat exchanger and mass contactor design procedures were outlined in Sections 3.5 and 4.5. In this chapter we present case studies to illustrate how one can proceed to a technically feasible design. Recall that such a design must satisfy only the design criteria, i.e., the volume of a reactor that will produce the required amount of product, the heat exchanger configuration that will meet the heat load needed with the utilities available, or the mass contactor that will transfer the required amount of material from one phase to another given the flow rate of the material to be processed. Even for relatively simple situations, design is always an iterative process and requires one to make decisions that cannot be verified until more information is available and additional calculations are made.
The coefficients of heat and mass transfer rate expressions depend on any fluid flows in the system. Our personal experience with “wind-chill” factors on chilly winter days, and with dissolving sugar or instant coffee in hot liquids by stirring, suggests that the rate of heat and mass transfer can be greatly increased with increasing wind speed or mixing rates. The technically feasible design of heat and mass transfer equipment requires calculating the transport coefficients and their variation with the fluid flows in the device, which depend intimately on the design of the device. For example, the area for heat transfer calculated for a tubular–tubular heat exchanger can be achieved by an infinite number of combinations of pipe diameters, lengths, and, for shell-and-tube exchangers, numbers of tubes. However, selecting a pipe diameter for a given volumetric flow rate sets the fluid velocity in the pipe and the type of flow (i.e., laminar versus turbulent), which in turn sets the overall heat transfer coefficient. This is why the design of heat and mass transfer equipment is often an iterative process. This chapter presents methods for estimating transport coefficients in systems with fluid motion.
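For instance, for flow in a circular pipe, the mean velocity follows directly from the volumetric flow rate and the pipe diameter, and the Reynolds number then fixes the flow regime (standard relations; the symbols here are generic rather than this book's notation):

\[ v = \frac{4q}{\pi D^{2}}, \qquad \mathrm{Re} = \frac{\rho v D}{\mu}, \]

with q the volumetric flow rate, D the pipe diameter, ρ the fluid density, and μ its viscosity; pipe flow is laminar for Re below about 2100 and fully turbulent above roughly 4000.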
The central hypothesis for flowing systems is that the friction, resistance to heat transfer, and resistance to mass transfer are predominately located in a thin boundary layer at the interface between the bulk flowing fluid and either another fluid (liquid or gas) or a solid surface.
In Chapter 3 we presented model equations for heat exchangers with our mixed–mixed, mixed–plug, and plug–plug classifications. All these fluid motions generally require some degree of turbulence, and all heat exchangers, except for those for which there is direct contact between phases, require a solid surface dividing the two control volumes of the exchanger. To predict the overall heat transfer coefficient, denoted as U in the analyses in Part I, we must be able to determine how U is affected by the turbulent eddies in the fluids and the physical properties of the fluids and how the rate of heat transfer depends on the conduction of heat through the solid surface of the exchanger.
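For a flat wall separating the two fluids, these effects combine as resistances in series; in the standard form (a sketch that neglects any difference between the inner and outer transfer areas):

\[ \frac{1}{U} = \frac{1}{h_{\text{inner}}} + \frac{\Delta x}{k} + \frac{1}{h_{\text{outer}}}, \]

where h_inner and h_outer are the local heat transfer coefficients on either side of the wall, Δx is the wall thickness, and k is the thermal conductivity of the wall material.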
We begin our study of conductive transport by considering the transfer of heat in a uniform solid such as that employed as the boundary between the two control volumes of any exchanger. This requires a Level III analysis and verification of a constitutive equation for conduction. This is followed by a complementary analysis of molecular diffusion through solids and stagnant fluids.
Experimental Determination of Thermal Conductivity k and Verification of Fourier's Constitutive Equation
Consider an experiment whereby the heat flow through the wall between the tank and the jacket in Figure 3.7 is measured. For the purposes of this analysis, we consider the heat transfer to be essentially one dimensional in the y direction, with the barrier essentially infinite in the z–x plane.
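The constitutive equation to be verified is Fourier's law, which for this one-dimensional geometry states that the heat flux is proportional to the temperature gradient (standard form; the symbols are generic rather than this book's notation):

\[ q_{y} = -k \, \frac{dT}{dy}, \]

so that at steady state the measured heat flow through a wall of thickness L and area a, with face temperatures T_1 and T_2, is Q = k a (T_1 − T_2)/L; the experiment verifies the linear dependence on the temperature difference and yields the thermal conductivity k.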