
Chapter 1 - Introduction

Published online by Cambridge University Press:  16 September 2025

Wolfgang Wiedermann, University of Missouri, Columbia
Alexander von Eye, Michigan State University

Summary

Chapter 1 starts by embedding the methods presented in this monograph into the rich landscape of statistical methods for causality research. Specifically, it starts by contrasting methods of causal inference with methods of causal structure learning (also known as causal discovery). While the former class of statistical methods can be considered well established across the developmental, psychological, and social sciences, the latter class has only recently received attention. The methods of direction of dependence presented here can be characterized as a confirmatory approach to probe hypothesized causal structures of variable relations. To introduce the reader to the line of thinking that is involved when using methods of direction of dependence, prototypical research questions that can be answered with the presented statistical tools are presented, and application areas that can benefit from taking a direction of dependence perspective in the analysis of research data are outlined. The methods of direction of dependence rely on higher moments of variables to discern causal structures from observational data. Thus, the chapter closes with an introductory discussion of moments of variables.

Information

Type: Chapter
In: Direction Dependence Analysis: Foundations and Statistical Methods, pp. 1–27
Publisher: Cambridge University Press
Print publication year: 2025

Chapter 1 Introduction

The first chapter of this book starts by embedding the methods presented in this monograph into the rich landscape of statistical methods for causality research. Specifically, it starts by contrasting methods of causal inference with methods of causal structure learning (also known as causal discovery). While the former class of statistical methods can be considered well established across the developmental, psychological, and social sciences, the latter class has received attention only recently. The methods of direction of dependence presented here can be characterized as a confirmatory approach to probe hypothesized causal structures of variable relations. To introduce the reader to the line of thinking that is involved when using methods of direction of dependence, we present prototypical research questions that can be answered with the presented statistical tools and outline application areas that can benefit from taking a direction of dependence perspective in the analysis of research data. The methods of direction of dependence rely on higher moments of variables to discern causal structures from observational data. Thus, the chapter closes with an introductory discussion of moments of variables.

***

Elucidating cause–effect relations of constructs of interest lies at the heart of empirical research across the developmental, psychological, educational, and social sciences. Causal models offer a mathematical representation of the cause–effect relations within a system of variables. Any statistical inquiry into cause–effect relations, however, rests on assumptions about the population and about the data-generating mechanism under study. In other words, assumption-free causality research does not exist, and the accuracy and quality of elucidated causal claims depend on how well empirical data agree with the assumptions made about them. The importance of causal assumptions (such as proper specification of the causal model and proper adjustment for confounding influences) is amplified even further when studying observational (nonexperimental) data. The statistical framework of Direction Dependence Analysis (DDA) presented in this book is designed to critically evaluate these causal assumptions when researchers rely on observational data.

1.1 Causal Inference and Causal Discovery

The methodological landscape of statistical methods of causality research is manifold and diverse and can, roughly, be classified into statistical methods for causal inference and methods for the purpose of causal discovery (Nogueira et al., 2022; Wang et al., 2024; Zanga et al., 2022). The former class – methods for causal inference – typically deals with questions concerning the magnitude of causal effects, that is, the quantification of changes in an outcome variable that are induced by changes in the variable system. Here, in particular, developments in causal graph theory (Greenland & Robins, 1986; Greenland et al., 1999; Pearl, 2009) have uncovered the methodological link between subject-matter theory and sound formal foundations of causality. Due to these developments, scholars are now equipped with tools to evaluate if and under what conditions causal effects can be identified, provided that the underlying causal structure of variables is a priori known. In addition, several statistical methods for causal inference, such as propensity score methods (Rosenbaum & Rubin, 1983), instrumental variable approaches (Angrist et al., 1996; Wright, 1928), and regression discontinuity designs (Thistlethwaite & Campbell, 1960), have become de rigueur in causality research, and each method can, under certain assumptions, provide a solution to the issue of estimating causal effects in observational data. These methods, however, are not the main focus of this book – excellent introductions to these methods are given by, for example, Morgan and Winship (2015), Imbens and Rubin (2015), or Cunningham (2021), to name a few.

Instead, the present monograph focuses on the latter class of methods – statistical approaches for causal discovery. These methods have been proposed for data situations in which a priori knowledge concerning the underlying causal structure of variables is limited and, thus, needs to be learned from the data. That is, instead of quantifying the magnitude of causal effects, the focus lies on discerning the direction of causation of variable relations. Methods of causal discovery have experienced rapid development, in particular, in the area of causal machine learning. Early approaches of causal learning include the IC algorithm (Verma & Pearl, 1990) and the PC algorithm (Spirtes et al., 2000), which focus on conditional independence structures of variables to discern Markov-equivalent causal structures (i.e., a set of plausible causal structures that have the same support by the data). To identify causal structures beyond Markov-equivalent classes, structural causal models have been suggested that, for example, focus on higher-order moments of variables (Cai et al., 2017, 2020; Dodge & Rousson, 2000, 2001; Hoyer et al., 2008; Hyvärinen & Smith, 2013; Shimizu & Kano, 2008; Shimizu et al., 2006; Wiedermann & Li, 2018; Wiedermann & von Eye, 2015c) and higher-order independence properties of competing causal models (Hoyer et al., 2009; Peters et al., 2014; Shimizu et al., 2011; Wiedermann & Li, 2020; Wiedermann et al., 2017). For excellent overviews of methods of causal machine learning see Peters et al. (2017), Guyon et al. (2019), or Shimizu (2022).

Methods of direction of dependence can be characterized as a subfield of causal machine learning. Instead of learning an entire network of variables, methods of direction of dependence are designed to integrate statistical principles of causal machine learning in the confirmatory process of testing and refining theories about the variables under study. In other words, while causal discovery algorithms are well-suited to discern causal structures in an exploratory multivariable research setting (where all considered variables are potentially eligible as either causes or effects), DDA (the focus of the present book) addresses an equally prevalent confirmatory bivariate research setting, that is, situations in which one is interested in validating a causal target model that explains the relation between two variables x and y against plausible alternatives, while potentially adjusting for covariates.

Methods of direction of dependence and methods for causal machine learning share that both rely on structural causal models (Pearl, 2009; Peters et al., 2017). In its generalized form, for p observed variables (x_i, i = 1, …, p), the structural causal model can be written as

$$x_i = f_i(\mathrm{pa}_i, e_i), \tag{1.1}$$

with pa_i being the set of variables that are the parents of x_i (i.e., x_i is caused by the set of variables in pa_i) and e_i denoting error components that are assumed to be jointly independent. For linear variable relations (the main focus of the methods presented in this book; nonlinear data situations are, however, discussed in Chapter 9), the model takes the form (cf. Shimizu, 2022)

$$x_i = \sum_{j \in \mathrm{pa}_i} \beta_{ij} x_j + e_i. \tag{1.2}$$

In words, each x_i is assumed to be generated as the linear sum of its parent variables and an additive error term. When β_ij ≠ 0, x_j is said to have a direct causal effect on x_i. Again, the error variables e_i (i = 1, …, p) are assumed to be jointly independent.
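To make Eqs. (1.1) and (1.2) concrete, the following minimal Python sketch (our own illustration; the coefficients, error distributions, and seed are arbitrary choices, not taken from the book) simulates a three-variable linear structural causal model with jointly independent, non-Gaussian errors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Jointly independent, non-Gaussian error components e_i (centered uniform).
e1, e2, e3 = rng.uniform(-1, 1, size=(3, n))

# Causal ordering x1 -> x2 -> x3, following Eq. (1.2): each variable is a
# weighted sum of its parents plus its own error term.
x1 = e1                 # pa(x1) is empty
x2 = 0.8 * x1 + e2      # pa(x2) = {x1}, beta = 0.8
x3 = 0.5 * x2 + e3      # pa(x3) = {x2}, beta = 0.5

# Each error is (up to sampling noise) uncorrelated with the parents of
# "its" variable, reflecting the joint independence assumption.
print(np.corrcoef(x1, e2)[0, 1], np.corrcoef(x2, e3)[0, 1])
```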

The model presented in Eq. (1.2) describes a multivariate causal network. In the confirmatory bivariate research setting (the perspective taken throughout this book), we consider the situation in which one is interested in probing the causal link between two focal variables (x and y) while adjusting for a separate set of covariates (z_j) which is assumed to be exogenous (covariate adjustment will be discussed in detail in Chapter 7). That is, the methods presented here are designed to address the situation of competing causal theories where the causal model x → y (in words, x causes y) is defensible under one theory and the competing model y → x (y causes x) is defensible under an alternative theory (while adjusting for potential covariates z_j). The two competing causal models of interest can be expressed as

$$y = \beta_{yx} x + \sum_j \beta_{yz_j} z_j + e_y, \tag{1.3}$$

and

$$x = \beta_{xy} y + \sum_j \beta_{xz_j} z_j + e_x, \tag{1.4}$$

where the model in Eq. (1.3) describes the causal target model and Eq. (1.4) represents the causal alternative model. In other words, instead of learning a multivariate causal network, one wishes to select which of the two candidate models better approximates the underlying mechanism that generated the x–y relation. The presented DDA framework is intended to provide an empirical basis for causal model selection. The following sections present examples of research questions that can be addressed with methods of direction of dependence.
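The following sketch (again our own illustration with arbitrary choices) shows why second-order information alone cannot separate Eqs. (1.3) and (1.4): when data are generated from the target model, ordinary least squares fits of both candidate models (here without covariates) produce identical R² values, both equal to the squared Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Data generated from the target model x -> y, with skewed x and error.
x = rng.exponential(1.0, n)
y = 0.6 * x + rng.exponential(1.0, n)

def r2(pred, out):
    """R-squared of a simple OLS regression of `out` on `pred` (with intercept)."""
    slope, intercept = np.polyfit(pred, out, 1)
    resid = out - (intercept + slope * pred)
    return 1.0 - resid.var() / out.var()

# Second-order information is symmetric in x and y:
print(r2(x, y), r2(y, x), np.corrcoef(x, y)[0, 1] ** 2)  # all three agree
```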

1.2 Research Questions that Can Be Answered with Direction Dependence Analysis

The present section discusses prototypical questions that can be answered with the framework of DDA. We do not claim that the presented list of sample questions is exhaustive. Various additional data scenarios may exist where methods of direction of dependence can provide valuable insights into the data-generating mechanisms at play. Rather, the presented sample questions have been selected to achieve two goals: first, to introduce the reader to the style of thinking that is involved when using DDA and, second, to provide the reader with a preview of the subsequent chapters of this book.

1.2.1 Distinguishing Between Cause and Effect

Does money make people happy, or do happier people make more money (Diener & Biswas-Diener, 2002)? Do violent video games cause consumers to be aggressive, or are people with aggressive tendencies more attracted to violent video games (Gentile et al., 2004)? Does social media use reduce the well-being of users, or are unhappy people more likely to engage in social media (Kross et al., 2013)? Does counterproductive work behavior lead to sleep disturbances or do people with poor sleep quality tend to be counterproductive in their work behavior (Shi et al., 2023)? Does smoking contribute to the development of depression, or are people with depressive symptoms more likely to engage in health-damaging behavior (Munafò & Araya, 2010)?

These are typical questions raised by researchers in the social sciences when relying on observational data to test their hypotheses (e.g., due to ethical or financial constraints to perform experiments). For each question, it is, in principle, conceivable that at least two causal explanations exist for a variable association. In other words, in each case the hypothesized causal mechanism is, in theory, reversible. That is, for two variables x and y, a causal model of the form x → y (x causes y) or the causally competing model y → x (y causes x) can be entertained to explain the link between x and y. Unfortunately, standard covariance-based methods to evaluate variable associations (such as correlation analysis and linear regression models) do not help one to distinguish between causally competing models (Chapter 2 introduces correlation and regression in their symmetric forms).

In contrast, methods of DDA have specifically been developed for the task of analyzing reversible data scenarios. Unlike standard methods of association, DDA makes use of higher than second moments (e.g., skewness, kurtosis, co-skewness, and co-kurtosis) to test causally competing models. Provided that certain data conditions are fulfilled, the framework of DDA enables one to probe which causal explanation (x → y or y → x) finds more support from the data. Three DDA components are available: (1) distributional properties of observed variables; (2) distributional properties of errors of competing models; and (3) independence properties of predictors and errors. Chapters 3, 4, and 5 introduce the three components in detail. Specifically, each chapter (a) presents test statistics compatible with principles of direction of dependence, (b) summarizes guidelines and decision rules for causal model selection, and (c) illustrates the application of DDA components using synthetic and real-world data examples.

1.2.2 Identifying the Presence of Hidden Confounders

When relying on observational (nonexperimental) data, confounders (i.e., common causes of x and y; sometimes called lurking or omitted variables) are likely to be present which affect the causal relation between focal variables. In the worst case, an observed association between two variables may be entirely attributable to the presence of a confounder. For example, common genetic factors may (in part) be responsible for depressive symptomology and health-damaging behavior (Tsuang et al., 2012), or parenting style may affect both violent video game exposure and aggression (Cote et al., 2021). Furthermore, such common causes may not be available to the researchers.

In observational studies, hidden confounders are omnipresent. Here, the main issue is that causal effect estimates are biased whenever common factors are present that are not statistically accounted for. Depending on the sign of the confounding effects, the bias introduced by confounders can go in either direction, that is, the causal effect estimate of interest can be smaller or larger than the true population effect. Thus, including confounders is crucial to guarantee consistent causal effect estimation. In practice, since researchers almost never have access to all influential common factors, the question is whether the available set of covariates is appropriate to sufficiently de-confound the relation between a focal predictor and a focal outcome. The methods presented in this book can, under certain data conditions, be used to answer this question. Here, the key ingredient for confounder detection is that unconsidered common factors create systematic dependence between predictors and model errors. Provided that the variables under study deviate from the Gaussian distribution, non-independence between predictors and errors can be detected. Chapter 6 discusses the identification of confounders in detail.
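The following sketch illustrates the core idea on simulated data (our own illustration; the correlations with nonlinear transforms below are crude stand-ins for the formal independence tests discussed in Chapter 6): under a hidden skewed confounder, the OLS residual is uncorrelated with the predictor by construction, yet it is not independent of the predictor.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

c = rng.exponential(1.0, n)             # hidden non-Gaussian confounder
x = 0.7 * c + rng.exponential(1.0, n)
y = 0.5 * x + 0.7 * c + rng.exponential(1.0, n)

slope, intercept = np.polyfit(x, y, 1)  # OLS of y on x; confounder ignored
e = y - (intercept + slope * x)

print(np.corrcoef(x, e)[0, 1])          # ~0: enforced by least squares
print(np.corrcoef(x**2, e)[0, 1])       # nonzero: dependence remains
print(np.corrcoef(x, e**2)[0, 1])       # nonzero as well
```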

The task of proper confounder control is further complicated by the fact that availability alone is not an appropriate criterion for covariate selection. Covariates can, for example, take the role of a collider variable (i.e., a variable that is caused by the predictor and outcome under study). Here, it is well known that adjusting for a collider variable induces bias in the causal effect estimate of interest (Elwert & Winship, 2014). Distinguishing eligible (confounders) and non-eligible covariates (colliders) is, thus, a critical element in causal modeling. Chapter 7 discusses covariate selection in the context of direction of dependence, shows how one can empirically identify eligible covariates, and integrates the components of DDA into a cohesive framework.

1.2.3 Identifying Data Sectors of Predominant Causal Models

Consider the development of brain and behavior. This development can be described by processes in a dynamic network of components (see, e.g., Fischer & van Geert, 2014) which are subject to change over time. These changes can involve the strength as well as the causal orientation of the connections between components. For example, let x be a child’s skill of adding numbers and y the child’s abstract understanding of addition. Since a certain level of addition skill is required to develop an abstract understanding of addition, in early developmental stages, x can be conceptualized as a causal precursor of y (or, in path notation, x → y). Gaining an abstract understanding of what addition means, however, will lead to improved addition skills. In other words, in later stages, a causal connection of the form y → x is established. The stage of development, thus, takes the role of a causal path modifying factor. Now, suppose one uses the child’s age as a proxy for the stage of development. Thus, conditioned on the age of the child, the causal path may either go from x to y or from y to x. The methods presented in Chapter 9 (i.e., extensions of DDA to conditional causal models) can help evaluate such structural changes by identifying data sectors for which x → y (or y → x) constitutes the predominant causal model.

Causal heterogeneity can come in many different facets. In addition to potential heterogeneity of the magnitude and the direction of causation, the influence of hidden common causes can also vary from subpopulation to subpopulation. For example, in studying the causal connection between a child’s skill in adding numbers and the child’s abstract understanding of addition, factors that affect both (e.g., skill acquisition in other areas of numerical cognition) can become more prominent in later stages. In addition, contextual factors (e.g., additional educational experiences that facilitate mathematics comprehension) can change over time. In other words, depending on the developmental stage of the child, a confounder may be more or less active. Such subtle differences in the causal structure are, however, important when one aims at identifying the most plausible explanation for an underlying data-generating mechanism – in particular, when information on confounding factors is not available to the researcher. Methods of conditional DDA, introduced in Chapter 9, can help to answer for whom and under what conditions one can expect hidden confounding to be most pronounced.

1.2.4 Identifying Mechanisms of Mediation

Statistical mediation analysis (see, e.g., Baron & Kenny, 1986; MacKinnon, 2008; Wiedermann & von Eye, 2015a; Wright, 1934) constitutes a valuable tool for testing causal pathways of variable relations. Here, in addition to a tentative predictor and a tentative outcome, a third variable (the mediator) is considered which is hypothesized to transmit the effect from the predictor to the outcome. In other words, one decomposes the effect of the predictor on the outcome into a direct and an indirect effect component. To interpret direct and indirect effects as causal, a series of assumptions must be met: First, the mediational structure must be correctly specified. For three variables (the predictor, the mediator, and the outcome), however, various acyclic models can be specified and, from a purely statistical perspective, each model fits the data equally well (known as Markov equivalence). Second, confounders must be properly accounted for. In other words, even when the mediation model is correctly specified with respect to the direction of causation, unconfoundedness assumptions are required to endow direct and indirect effects with causal meaning. The methods presented in this book are designed to critically evaluate both the absence of reverse causation and the absence of confounding biases. Chapter 9 outlines statistical principles and decision guidelines of direction of dependence in mediation models.

1.2.5 How Robust Are the Causal Claims?

The output of any mathematical model of empirical data depends to some degree on the characteristics of the input. Consequently, a critical examination of the stability and robustness of model results constitutes an indispensable element of any data analysis. In the context of direction of dependence, stability analysis can be used to address questions concerning the robustness of the discerned causal claims against sample composition. Furthermore, sensitivity analysis addresses questions concerning the robustness of DDA results against violated assumptions. Chapter 8 presents approaches to test the stability and sensitivity in the context of probing the causal direction of variable relations.

1.3 Basics of Distributions of Variables

The methods discussed in this monograph utilize variable information beyond second-order moments (means, variances, covariances) to make statements about causal effect directionality and hidden confounding. In other words, to apply methods of direction of dependence, we relax the assumption that variables (and errors) follow a Gaussian (normal) distribution. This section, therefore, introduces various concepts to determine the distributional shape of variables. That is, it deals with moments of variables.

Starting with a description of the different types of moments, definitions and interpretations of the first four moments quantifying the location, scale, and shape of variable distributions are presented. The section closes with a discussion of higher than fourth moments as well as joint higher moments. Artificial and real-world data examples are used to illustrate these concepts in action.

1.3.1 Types of Moments of Variables

In general, moments describe the distribution of probability mass. This distribution is determined by parameters that describe the location, scale, and shape of a distribution. Specifically, suppose that x is a random variable with density function p_x(x). The k-th moment of x around a constant (nonrandom) value c can be written as

$$E\left[(x - c)^k\right] = \int (x - c)^k\, p_x(x)\, dx, \tag{1.5}$$

with E denoting the so-called expected value operator. Depending on the constant c, one can distinguish between raw moments and central moments. Raw moments describe the moments of a variable around the origin of zero (i.e., c = 0), whereas central moments describe moments around the arithmetic mean of the distribution, i.e., c = E[x] = μ_x. Thus, the k-th central moment of a variable can be expressed as

$$m_k = E\left[(x - \mu_x)^k\right] = \int (x - \mu_x)^k\, p_x(x)\, dx. \tag{1.6}$$

Central moments have the advantage that, due to subtracting the mean μx, statements about the probability distribution of x are mean (location)-invariant. Suppose, for example, one wants to compare the variability of daily temperature for two seasons, winter and summer. Because, in summer, the earth is tilted toward the sun and, in winter, it is tilted away from the sun, one expects (on average) higher temperatures in the summer and colder temperatures in the winter. To compare the fluctuation in temperature for the two seasons one must, therefore, account for the average temperature. Here, central moments serve this purpose.

In addition to central moments around the mean, one can define so-called standardized moments. The k-th standardized moment is given by

$$\bar{m}_k = \frac{m_k}{\sigma_x^k} = E\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^k\right], \tag{1.7}$$

with m_k = E[(x − μ_x)^k] and σ_x^k denoting the standard deviation of x (i.e., σ_x = √(E[(x − μ_x)²]); for further details see Section 1.3.3) taken to the k-th power. Dividing (x − μ_x) by the standard deviation, i.e., (x − μ_x)/σ_x, leads to standardized scores of x. As a result, k-th standardized moments are both location (mean)- and scale (variance)-invariant, which allows statements about the distribution of the probability mass along the tails of the distribution, regardless of the location and the spread of the x scores. In other words, third, fourth, and higher than fourth standardized moments quantify the tailedness of the distribution while accounting for the center of the distribution (the mean) and the variation around the center (the variance). The following sections discuss the first four moments (mean, variance, skewness, and kurtosis) in detail.
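A short Python sketch (ours; the Gamma population is an arbitrary choice) may help fix the three types of moments. For k = 3, it computes the raw, central, and standardized moments of a skewed sample and checks the central moment against SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.gamma(2.0, 1.0, 50_000)   # a right-skewed variable

k = 3
raw = np.mean(x**k)                                      # raw moment (c = 0)
central = np.mean((x - x.mean())**k)                     # central moment, Eq. (1.6)
standardized = np.mean(((x - x.mean()) / x.std())**k)    # standardized, Eq. (1.7)

print(raw, central, standardized)
print(np.isclose(central, stats.moment(x, moment=k)))    # agrees with SciPy
# The population value of the third standardized moment of Gamma(shape=2)
# is 2 / sqrt(2) ~ 1.414, which the estimate approaches.
```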

1.3.2 Mean

The first raw moment, also known as the expectation of x, describes the center of mass of a probability distribution and is given by

$$E[x] = \mu_x = \int x\, p_x(x)\, dx. \tag{1.8}$$

In other words, the first raw moment describes the one point where the probability distribution is perfectly balanced. Alternatively, we can interpret the first moment as the distance of the center of the distribution from the origin of zero. The corresponding sample estimate is given by the well-known arithmetic mean equation

$$\hat{\mu}_x = \frac{1}{n} \sum_i x_i, \tag{1.9}$$

with i = 1, …, n where n denotes the sample size.

To illustrate, consider Galton’s (1886) classical height data, which were used to perform one of the first regression analyses in the history of statistics. The dataset contains height measures (in inches) from 928 (adult) children and the average heights of their two parents (“midparents”). The present illustration focuses only on the height of the (adult) children. The left panel of Figure 1.1 gives the density distribution of children’s heights. The vertical line gives the first raw moment, that is, the mean of children’s heights of 68.09 inches (172.95 cm). In other words, at a height of 68.09 inches, the probability distribution is perfectly balanced. Next, consider the right panel of Figure 1.1. Here, the mean was subtracted from the heights of the 928 children prior to visualization. That is, instead of x, we plot the quantities x − 68.09. Because subtracting a constant (here, the mean) constitutes a linear transformation, the overall shape of the distribution is not affected by this subtraction. However, this transformation systematically changes the center of mass of the distribution. The mean of the distribution is shifted to the origin of zero. In other words, the first central moment of (x − μ_x) is of lower interest because it always takes the value zero, that is,

$$E[x - \mu_x] = E[x] - E[\mu_x] = \mu_x - \mu_x = 0, \tag{1.10}$$

where we make use of the fact that the expected value of a constant equals the constant itself (E[μ_x] = μ_x). Section 1.3.3 focuses on the variance of a probability distribution.


Figure 1.1 Density of Galton’s data on the height of 928 children. Left panel: Density distribution of the original data. Right panel: Density distribution of mean-standardized scores.


1.3.3 Variance

The second central moment describes the variance of a probability distribution and measures the spread of the distribution about the mean. Formally, the variance can be expressed as

$$m_2 = \sigma_x^2 = E\left[(x - \mu_x)^2\right] = \int (x - \mu_x)^2\, p_x(x)\, dx. \tag{1.11}$$

In words, the variance is defined as the expected value of the squared deviation of x about the mean. Note that the variance is a measure of the spread of the distribution in squared units of the data. Taking the square root of the variance leads to the standard deviation of x,

$$\sigma_x = \sqrt{E\left[(x - \mu_x)^2\right]}, \tag{1.12}$$

which serves as a measure of the spread of the distribution in the measurement unit of x. For both measures it holds that larger values indicate larger variability of the scores about the center (mean) of the probability distribution. In contrast, a variance (standard deviation) of zero implies that all x scores take the same value. The corresponding sample estimates of the variance and the standard deviation are given by

$$\hat{\sigma}_x^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu}_x)^2, \qquad \hat{\sigma}_x = \sqrt{\frac{1}{n} \sum_i (x_i - \hat{\mu}_x)^2}. \tag{1.13}$$

Reconsider Galton’s height data of the 928 (adult) children. The left panel of Figure 1.2, again, gives the density distribution of the raw data. However, now, we focus on both the center and the spread of the distribution. Here, a variance of σ̂_x² = 6.34 and, thus, a standard deviation of σ̂_x = √6.34 = 2.52 are observed. In other words, on average, children’s heights deviate by 2.52 inches from the mean (visualized through horizontal lines in Figure 1.2). Next, we focus on the height distribution given in the right panel of Figure 1.2. Here, prior to visualization, data points have been standardized using (x − μ̂_x)/σ̂_x with μ̂_x = 68.09 and σ̂_x = 2.52. Thus, the resulting probability distribution again exhibits a mean of zero. In addition, the probability distribution shows a variance (and standard deviation) of one, which is a necessity due to the fact that

$$\bar{m}_2 = \frac{m_2}{\sigma_x^2} = \frac{E\left[(x - \mu_x)^2\right]}{\left(\sqrt{E\left[(x - \mu_x)^2\right]}\right)^2} = \frac{E\left[(x - \mu_x)^2\right]}{E\left[(x - \mu_x)^2\right]} = 1. \tag{1.14}$$

Figure 1.2 Density of Galton’s height data obtained from 928 (adult) children. Left panel: Density distribution of the raw data. Right panel: Density distribution of standardized scores. Vertical lines give the center (mean) of the distributions, horizontal lines indicate the spread (standard deviations) of the distributions.


Note, again, however, that the overall distributional shape is not altered by this standardization. Therefore, higher than second moments (e.g., skewness and kurtosis discussed in detail in Sections 1.3.4 and 1.3.5) are not affected by data standardization. Section 1.3.4 moves toward higher than second moments of probability distributions and focuses on the third moment as a measure of skewness.

1.3.4 Skewness

The third standardized moment serves as a measure of the skewness and quantifies the relative size of the two tails of a probability distribution. Formally, the skewness of x (γ_x) can be expressed as

$$\bar{m}_3 = \gamma_x = \frac{m_3}{m_2^{3/2}} = E\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^3\right], \tag{1.15}$$

and is commonly interpreted as a measure of the symmetry of the probability distribution. The measure in Eq. (1.15) can be interpreted as the expected value of a standardized variable taken to the third power. Because the cubic function preserves the sign of the differences, one obtains (1) a value of zero for symmetric distributions, that is, cases where, relative to the mean of the distribution, the two tails are equal in size, (2) a value larger than zero for right skewed (or positively skewed) distributions, and (3) a value smaller than zero for left (negatively) skewed distributions (note, however, that special cases exist in which a skewed distribution can result in a skewness of zero; cf. Dey et al., 2017). The sample estimate of the skewness is given by

$$\hat{\gamma}_x = \frac{\hat{m}_3}{\hat{m}_2^{3/2}} = \frac{\sum_i (x_i - \hat{\mu}_x)^3 / n}{\left(\sum_i (x_i - \hat{\mu}_x)^2 / n\right)^{3/2}}. \tag{1.16}$$

Note that additional multiplicative factors, resulting in the estimates $\hat{\gamma}_x \sqrt{n(n-1)}/(n-2)$ and $\hat{\gamma}_x \left((n-1)/n\right)^{3/2}$, can be imposed on Eq. (1.16) to adjust for biases when estimating the population value (Joanes & Gill, 1998; Wright & Herrington, 2011). As n approaches infinity (i.e., n → ∞), these alternative skewness estimates approach γ̂_x, given in Eq. (1.16).
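As a side note not from the book, SciPy’s skewness routine implements the first of these adjustments, which can be verified directly (the sample below is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(1.0, 200)     # a small, skewed sample
n = len(x)

g1 = stats.skew(x)                # plain estimate, as in Eq. (1.16)
G1 = stats.skew(x, bias=False)    # multiplied by sqrt(n(n-1))/(n-2)

print(np.isclose(G1, g1 * np.sqrt(n * (n - 1)) / (n - 2)))  # True
print(g1, G1)                     # the difference vanishes as n grows
```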

To illustrate, consider the two artificial distributions given in Figure 1.3. Here, the left panel gives an example of a right (positively) skewed distribution exhibiting a (population) skewness of γ = 1.07. Note that, according to Eq. (1.15), data points within one standard deviation from the mean (center) of the distribution can be expected to contribute little to the overall skewness measure. The reason for this is that standardized deviations smaller than one in absolute value (i.e., |x − μ_x|/σ_x < 1) will be even smaller when taken to the third power. In contrast, absolute standardized deviations larger than one (|x − μ_x|/σ_x > 1) tend to become larger when cubed. Therefore, data points outside one standard deviation contribute more to the overall skewness than data points within the range of one standard deviation. In the case of a right skewed distribution, the right tail of the distribution tends to be longer, resulting in a skewness larger than zero.


Figure 1.3 Artificial examples of a right (positively) skewed probability distribution (left panel) and a left (negatively) skewed probability distribution (right panel). Vertical lines give the center (mean) of the distributions, horizontal lines indicate the spread (standard deviations) of the distributions.


In contrast, consider the distribution presented in the right panel of Figure 1.3. Here, the overall skewness will be smaller than zero (specifically, the distribution exhibits a (population) skewness of γ = –1.07) because, relative to the center (mean) of the distribution, the left tail of the distribution is longer than the tail to the right. Thus, more observations outside the one standard deviation interval will produce large negative values, resulting in a negative skewness estimate.

Next, we reconsider Galton’s height data with respect to its distributional symmetry. Figure 1.4 again shows the distribution of standardized children’s heights. However, this time, we focus on the tails of the distribution with respect to the one standard deviation interval. Observations outside the one standard deviation interval are marked dark gray, observations within one standard deviation are marked light gray. Clearly, the majority of the sample, that is, ~69%, exhibits body heights within one standard deviation of the mean. The remaining 31% fall outside the one standard deviation interval. However, we now inspect the contribution of each observation within and outside one standard deviation to the estimated skewness (a similar approach has been carried out by Westfall (2014) in the context of the kurtosis, discussed in Section 1.3.5). For this purpose, we start by standardizing the observations using z_i = (x_i − μ̂_x)/σ̂_x = (x_i − 68.09)/2.52. Next, we separate the computation of the skewness for data below, within, and above one standard deviation, which results in

$$\hat{\gamma}_x^{\text{below}} = \frac{1}{928} \sum_{z_i \le -1} z_i^3 = -0.749, \qquad \hat{\gamma}_x^{\text{within}} = \frac{1}{928} \sum_{-1 < z_i < 1} z_i^3 = 0.019, \qquad \hat{\gamma}_x^{\text{above}} = \frac{1}{928} \sum_{z_i \ge 1} z_i^3 = 0.643. \tag{1.17}$$

Figure 1.4 Standardized density of Galton’s height data obtained from 928 (adult) children. Areas shaded in dark gray represent data outside one standard deviation of the mean.


Clearly, observations within one standard deviation contribute little to the overall skewness, whereas the contributions of observations below and above the one standard deviation interval are considerably larger. Furthermore, the contribution of observations below one standard deviation is slightly larger (in absolute value) than the contribution of observations above one standard deviation. We can, thus, expect that the left tail contributes more to the computation of the skewness than the right tail. In other words, we expect a skewness smaller than zero. A skewness estimate of γ̂ = −0.088 confirms this conjecture and we can conclude that the distribution of body heights is slightly left skewed. Note, however, that the skewness estimate does not reach statistical significance (D’Agostino z = −1.10, p = 0.272). Section 1.3.5 focuses on the fourth moment, the kurtosis, as a further measure to characterize the shape of probability distributions.
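Since Galton’s data are not reproduced here, the following sketch (our own) repeats the decomposition of Eq. (1.17) on an arbitrary simulated right-skewed sample; the three regional contributions necessarily sum to the overall skewness estimate:

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.gamma(4.0, 1.0, 5_000)    # a mildly right-skewed sample
z = (x - x.mean()) / x.std()      # standardized scores

# Contributions to the skewness by region, as in Eq. (1.17).
below  = np.sum(z[z <= -1] ** 3) / len(z)
within = np.sum(z[(z > -1) & (z < 1)] ** 3) / len(z)
above  = np.sum(z[z >= 1] ** 3) / len(z)

print(below, within, above)
# The regional contributions add up to the overall skewness estimate:
print(np.isclose(below + within + above, np.mean(z**3)))
```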

1.3.5 Kurtosis

Section 1.3.4 introduced the third standardized moment of a probability distribution (the skewness) as a measure that quantifies the relative size of the tails of the distribution. The fourth standardized moment of a probability distribution, as a measure of kurtosis, can be introduced in a similar fashion. However, in contrast to the third moment (quantifying the relative tailedness around the mean), the fourth moment quantifies the total tailedness of a distribution around the mean. Formally, the kurtosis (κ_x) can be defined as

$$\bar{m}_4 = \kappa_x = \frac{m_4}{m_2^2} = E\left[\left(\frac{x - \mu_x}{\sigma_x}\right)^4\right]. \tag{1.18}$$

In contrast to the skewness, the interpretation of kurtosis was, for decades, subject to considerable debate. Some authors suggested that kurtosis can be interpreted as a measure of the peakedness of a distribution (see, e.g., Cooligan, 2013; Katz et al., 2013; Lee et al., 2013), while others suggested that kurtosis is correctly interpreted as a measure of tail heaviness (e.g., Ali, 1974). That kurtosis measures both peakedness and tail heaviness of a distribution has been suggested by, for example, Ruppert (1987) and DeCarlo (1997). Alternative interpretations included kurtosis as a measure of bimodality (Darlington, 1970) and kurtosis as a measure to detect the presence of outliers (Livesey, 2007). Overall, however, agreement has been reached that the correct interpretation of kurtosis is that of a measure of tail heaviness or tail extremity only, and that kurtosis does not convey any useful information regarding the peakedness of a distribution (Westfall, 2014).

Related to the kurtosis given in Eq. (1.18) is the so-called excess-kurtosis

$$\delta_x = \kappa_x - 3, \tag{1.19}$$

where the additional “− 3” accounts for the fact that the Gaussian (normal) distribution exhibits a kurtosis of 3. In other words, the excess-kurtosis measures the tail extremity of a distribution relative to the Gaussian distribution. Here, following Pearson’s (1905) terminology, one can distinguish “platykurtic,” “mesokurtic,” and “leptokurtic” distributions. A platykurtic distribution exhibits a kurtosis that is smaller than the one observed for a Gaussian distribution, that is, an excess-kurtosis smaller than zero (δ_x < 0; also known as a sub-Gaussian distribution). A distribution is mesokurtic when the kurtosis equals 3 (or, equivalently, when the excess-kurtosis is zero), that is, when the tail heaviness of the distribution is similar to that of the Gaussian distribution. Lastly, a leptokurtic distribution exhibits a kurtosis larger than 3 (or, equivalently, an excess-kurtosis larger than zero; also known as a super-Gaussian distribution). Note, however, that platykurtic and leptokurtic distributions are not directly comparable. The reason for this is that the positive excess-kurtosis values of leptokurtic distributions are unbounded (i.e., in principle, the excess-kurtosis can reach infinity). In contrast, for platykurtic distributions, the minimum possible (excess-kurtosis) value is δ_x = −2, which is, for example, attained by the Bernoulli distribution with a probability of 50% (see, e.g., Hyvärinen et al., 2001). Furthermore, it is worth noting that the skewness and the (excess-)kurtosis are not independent of each other. For arbitrary distributions, the lower bound of the excess-kurtosis in relation to the skewness is given by δ_x ≥ γ_x² − 2 (see, e.g., Pearson, 1916; Teuscher & Guiard, 1995).
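Two of these claims are easy to check numerically; the sketch below (ours, with arbitrary samples) verifies the minimum excess-kurtosis of the fair Bernoulli distribution and the lower bound δ_x ≥ γ_x² − 2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# A fair Bernoulli variable attains the minimum excess-kurtosis of -2.
b = rng.integers(0, 2, 100_000)
print(stats.kurtosis(b))          # ~ -2 (SciPy returns the excess-kurtosis)

# Check the lower bound delta >= gamma^2 - 2 on a skewed sample.
x = rng.exponential(1.0, 100_000)
gamma, delta = stats.skew(x), stats.kurtosis(x)
print(delta >= gamma**2 - 2)      # True (population values: 6 >= 4 - 2)
```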

Next, we focus on two artificial examples of platy- and leptokurtic distributions given in Figure 1.5. The distribution in the left panel of Figure 1.5 exhibits a kurtosis of κ_x = 2, which is equivalent to an excess-kurtosis of δ_x = −1. Thus, relative to the Gaussian curve, the present distribution has lighter tails; hence, the distribution is platykurtic. In contrast, the right panel of Figure 1.5 gives an example of a leptokurtic distribution with a kurtosis of κ_x = 9 (or an excess-kurtosis of δ_x = 6), exhibiting heavier tails than those of a Gaussian distribution.


Figure 1.5 Artificial examples of a platykurtic (left panel) and leptokurtic (right panel) distribution. Vertical lines give the center (mean) of the distributions, horizontal lines indicate the spread (standard deviations) of the distributions.


Sample estimates of the kurtosis and the excess-kurtosis are available via

$$\hat{\kappa}_x = \frac{\hat{m}_4}{\hat{m}_2^2} = \frac{\sum_i (x_i - \hat{\mu}_x)^4 / n}{\left(\sum_i (x_i - \hat{\mu}_x)^2 / n\right)^2}, \qquad \hat{\delta}_x = \hat{\kappa}_x - 3. \tag{1.20}$$

Here, adjustments based on the sample size (n), e.g., using $(\hat{\delta}_x + 3)\left((n-1)/n\right)^2 - 3$, are recommended to account for biases in the population estimate (Wright & Herrington, 2011). However, as n → ∞, the adjusted excess-kurtosis estimate approaches δ̂_x.

To illustrate the sample computation of the (excess-)kurtosis, we, again, revisit Galton’s height data of (adult) children. Using Eq. (1.20), we obtain a kurtosis of κ̂ = 2.656, which corresponds to an excess-kurtosis of δ̂ = 2.656 − 3 = −0.344. In other words, the height distribution is platykurtic. Furthermore, according to the Anscombe–Glynn test (Anscombe & Glynn, 1983), we obtain a z-score of −2.499, which comes with a p-value of 0.012. It is therefore concluded that the distribution significantly deviates from the Gaussian distribution with respect to its kurtosis. Next, to emphasize the interpretation as a measure of tail heaviness, we focus on the relative contributions the tails of the distribution have on the computation of its kurtosis (see also Westfall, 2014). For this purpose, we use children’s heights in standardized form, z_i = (x_i − μ̂_x)/σ̂_x = (x_i − 68.09)/2.52 (see Figure 1.4), and separate the computation of the kurtosis by data within and outside one standard deviation of the mean. Here, we obtain

$$\hat{\kappa}_x^{\text{within}} = \frac{1}{928} \sum_{|z_i| < 1} z_i^4 = 0.102, \qquad \hat{\kappa}_x^{\text{outside}} = \frac{1}{928} \sum_{|z_i| \ge 1} z_i^4 = 2.548, \tag{1.21}$$

suggesting that about 2.548 / (0.102 + 2.548) × 100 = 96.2% of the kurtosis statistic is determined by observations outside one standard deviation of the mean, confirming that the kurtosis measures the extremity of the tails of the distribution.
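The same within/outside decomposition can be replicated on any sample; a brief sketch (our own, using an arbitrary heavy-tailed t distribution) in the spirit of Eq. (1.21):

```python
import numpy as np

rng = np.random.default_rng(13)
x = rng.standard_t(5, 10_000)     # heavy-tailed (leptokurtic) sample
z = (x - x.mean()) / x.std()

within  = np.sum(z[np.abs(z) < 1] ** 4) / len(z)
outside = np.sum(z[np.abs(z) >= 1] ** 4) / len(z)

# Share of the kurtosis statistic driven by the tails:
print(outside / (within + outside))   # typically well above 0.9
```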

1.3.6 Higher than Fourth Moments

Sections 1.3.4 and 1.3.5 introduced third and fourth higher moments to characterize the shape of probability distributions. Both higher moments come with clear interpretations and are useful to quantify the degree of non-Gaussianity. One natural question is whether higher than fourth moments (e.g., fifth, sixth, …, moments) exist and whether these higher moments can also contribute to the characterization of the shape of probability distributions.

From the definition of the k-th standardized moment, i.e., $\bar{m}_k = m_k / \sigma_x^k$ with $m_k = E[(x - \mu_x)^k]$, it follows that standardized moments can be taken to any higher powers (e.g., k = 5, 6, 7, …). Here, odd-powered standardized moments (e.g., using k = 5, 7, …) quantify the relative tailedness, whereas even-powered standardized moments (e.g., using k = 6, 8, …) quantify the total tailedness of a probability distribution. The fifth and sixth standardized moments, for example, are referred to as the hyper-skewness and the hyper-kurtosis (Bazavov et al., 2020), suggesting that third and fifth as well as fourth and sixth standardized moments quantify similar characteristics of a probability distribution. Such similarities exist for any even and odd higher powers. To visualize these similarities, consider Figures 1.6 and 1.7 (see also Gundersen, 2020).


Figure 1.6 Artificial right skewed distribution with the functions f(x) = x³ and f(x) = x⁴ superimposed.



Figure 1.7 Artificial right skewed distribution with the functions f(x, k) = x^k (k = 5, 6, 7, 8) superimposed.


Figure 1.6 shows an artificial positively skewed distribution with the functions f(x) = x³ and f(x) = x⁴ superimposed. In line with the properties of skewness and kurtosis described in Sections 1.3.4 and 1.3.5, values close to the center of the distribution can be expected to contribute little to the statistics; however, for larger negative deviations from the center, large negative values are observed for f(x) = x³ and large positive values are obtained for f(x) = x⁴. In contrast, for large positive deviations from the center, both f(x) = x³ and f(x) = x⁴ attain large positive values.

Next, consider Figure 1.7, which gives the same probability distribution with functions involving higher than fourth powers superimposed, specifically, f(x, k) = x^k with k = 5, 6, 7, and 8. Clearly, the selected odd powers (k = 5, 7) behave in a way similar to raising x to the third power, and even powers (k = 6, 8) behave in a way similar to raising x to the fourth power. In other words, higher than fourth standardized moments, in essence, replicate the information gathered by third and fourth standardized moments. Since it is easier to reliably estimate lower moments than higher moments from empirical data, third and fourth moments are preferred over higher than fourth moments.
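The estimation argument can be illustrated numerically. The following sketch (ours; the population and sample sizes are arbitrary choices) compares the sampling variability of the third and fifth standardized moments across repeated samples from the same skewed population:

```python
import numpy as np

rng = np.random.default_rng(21)

def std_moment(x, k):
    """k-th standardized moment of a sample, as in Eq. (1.7)."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**k)

# 2,000 replications of the 3rd and 5th standardized moments.
reps = np.array([[std_moment(rng.exponential(1.0, 500), k) for k in (3, 5)]
                 for _ in range(2_000)])
print(reps.std(axis=0))   # the 5th moment fluctuates far more than the 3rd
```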

1.3.7 Cumulants

So far, this chapter has focused on moments of variables to characterize their distributions. Moments, however, are not the only constants that can be used for this purpose. Cumulants constitute an alternative set of constants to characterize distributional features of variables, which are, from a theoretical perspective, sometimes more useful than moments (Chapter 4 will make use of cumulants to derive direction of dependence statistics based on error distributions of causally competing linear models). Formally, cumulants describe the coefficients of the Taylor series expansion of the second characteristic function (or cumulant generating function). The second characteristic function is given by the natural logarithm of the first characteristic function (or moment generating function), which itself is defined as the continuous Fourier transform of the probability density function p_x(x) (Hyvärinen et al., 2001; Stuart & Ord, 1994).

Let cum_k(x) be the k-th cumulant of x. When the expected value of x is nonzero, i.e., E[x] ≠ 0, the first four cumulants can be expressed as (see, e.g., Hyvärinen et al., 2001)

$$\begin{aligned} \mathrm{cum}_1(x) &= E[x] \\ \mathrm{cum}_2(x) &= E[x^2] - E[x]^2 \\ \mathrm{cum}_3(x) &= E[x^3] - 3E[x^2]E[x] + 2E[x]^3 \\ \mathrm{cum}_4(x) &= E[x^4] - 3E[x^2]^2 - 4E[x^3]E[x] + 12E[x^2]E[x]^2 - 6E[x]^4 \end{aligned} \tag{1.22}$$

which, for a zero-mean variable (i.e., E[x] = 0), reduce to

$$\begin{aligned} \mathrm{cum}_1(x) &= 0 \\ \mathrm{cum}_2(x) &= E[x^2] \\ \mathrm{cum}_3(x) &= E[x^3] \\ \mathrm{cum}_4(x) &= E[x^4] - 3E[x^2]^2. \end{aligned} \tag{1.23}$$

In other words, a simple correspondence between cumulants and central moments exists for the first three moments. That is, the first three cumulants equal the first three central moments. However, this simple relation no longer holds for the fourth moment (see also Dodge & Rousson, 1999).

As mentioned at the beginning of this section, cumulants have several advantageous properties. First, if x is a random variable and c is an additive constant, then the cumulants of x + c are given by

$$\mathrm{cum}_k(x + c) = \mathrm{cum}_k(x) \tag{1.24}$$

for k > 1. For k = 1, one obtains cum_1(x + c) = cum_1(x) + c. Second, if c is a multiplicative constant, then

$$\mathrm{cum}_k(cx) = c^k\, \mathrm{cum}_k(x). \tag{1.25}$$

Third, for p independent random variables x_1, x_2, …, x_p, it holds that

$$\mathrm{cum}_k(x_1 + x_2 + \cdots + x_p) = \mathrm{cum}_k(x_1) + \mathrm{cum}_k(x_2) + \cdots + \mathrm{cum}_k(x_p). \tag{1.26}$$

In words, the k-th cumulant of the sum of independent variables equals the sum of the k-th cumulants of the summands. Note that this additivity property also holds for the first three moments, but does not hold for the fourth moment (Dodge & Rousson, 1999).

The three properties listed in Eqs. (1.24), (1.25), and (1.26) have immediate consequences when working with cumulants in the context of linear models. To illustrate, consider the simple linear function y = a + bx + z, with x and z being independent continuous random variables and a and b denoting constants. Making use of Eqs. (1.24), (1.25), and (1.26) implies that the k-th cumulant of y (with k > 1) is given by

$$\mathrm{cum}_k(y) = \mathrm{cum}_k(a + bx + z) = \mathrm{cum}_k(bx) + \mathrm{cum}_k(z) = b^k\, \mathrm{cum}_k(x) + \mathrm{cum}_k(z), \tag{1.27}$$

that is, the k-th cumulant of y equals the sum of the weighted k-th cumulant of x and the k-th cumulant of z. These properties will play a vital role in the development of distributional measures of direction of dependence.
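Equation (1.27) can be verified numerically. The sketch below (our own illustration; all distributions and constants are arbitrary choices) uses scipy.stats.kstat, whose k-statistics are unbiased estimators of the first four cumulants:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 200_000

x = rng.exponential(1.0, n)       # non-Gaussian
z = rng.uniform(-1, 1, n)         # independent of x
a, b = 2.0, 1.5
y = a + b * x + z                 # the linear function from the text

# Check Eq. (1.27): cum_k(y) = b^k * cum_k(x) + cum_k(z) for k = 2, 3, 4.
for k in (2, 3, 4):
    lhs = stats.kstat(y, k)
    rhs = b**k * stats.kstat(x, k) + stats.kstat(z, k)
    print(k, round(lhs, 3), round(rhs, 3))   # approximately equal
```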

1.3.8 Joint Moments

So far, this chapter has focused on moments and cumulants of single random variables. Both sets of quantities, however, can be extended to the multivariate case. For simplicity, we focus on the bivariate case with the understanding that the presented concepts can be extended to more than two variables. Let x and y be two random variates. The bivariate central moment m_{kl} about the means μ_x and μ_y is defined as (cf. Stuart & Ord, 1994)

$$m_{kl} = E\left[(x - \mu_x)^k (y - \mu_y)^l\right] = \int (x - \mu_x)^k (y - \mu_y)^l\, dF, \tag{1.28}$$

where m_{k0} denotes the k-th moment of the marginal distribution of x and m_{0l} denotes the l-th moment of the marginal distribution of y. When k ≠ 0 and l ≠ 0, m_{kl} is known as the product moment of x and y. For example, when k = l = 1, the quantity m_{11} refers to the covariance of x and y,

$$m_{11} = \mathrm{cov}(x, y) = E\left[(x - \mu_x)(y - \mu_y)\right]. \tag{1.29}$$

When x and y have zero means and unit variances (which can be achieved through variable standardization), k = l = 1 leads to the Pearson correlation ρ_{xy} (for further details see Chapter 2),

$$\mathrm{cov}(x, y) = E[xy] = \rho_{xy}. \tag{1.30}$$

In a similar way, we can define higher-order cross-products, sometimes called higher-order covariances (Dodge & Rousson, 2001; Wiedermann et al., 2021),

$$m_{kl} = \mathrm{cov}(x, y)_{kl} = E\left[(x - \mu_x)^k (y - \mu_y)^l\right],\tag{1.31}$$

and higher-order correlations,

$$\mathrm{cor}(x, y)_{kl} = \frac{\mathrm{cov}(x, y)_{kl}}{\sigma_x^k \sigma_y^l}.\tag{1.32}$$

Using Eqs. (1.31) and (1.32) one can define higher-order co-moments. For example, using {k, l} = {1, 2} and {2, 1} one obtains two estimates of the co-skewness of x and y,

$$\mathrm{cor}(x, y)_{12} = \frac{\mathrm{cov}(x, y)_{12}}{\sigma_x \sigma_y^2}, \qquad \mathrm{cor}(x, y)_{21} = \frac{\mathrm{cov}(x, y)_{21}}{\sigma_x^2 \sigma_y}.\tag{1.33}$$

In a similar way, measures of the co-kurtosis of x and y can be defined when {k, l} = {1, 3}, {2, 2}, and {3, 1},

$$\mathrm{cor}(x, y)_{13} = \frac{\mathrm{cov}(x, y)_{13}}{\sigma_x \sigma_y^3}, \qquad \mathrm{cor}(x, y)_{22} = \frac{\mathrm{cov}(x, y)_{22}}{\sigma_x^2 \sigma_y^2}, \qquad \mathrm{cor}(x, y)_{31} = \frac{\mathrm{cov}(x, y)_{31}}{\sigma_x^3 \sigma_y}.\tag{1.34}$$

Co-skewness and co-kurtosis are natural extensions of the covariance to moments of order higher than two. Here, the co-skewness is a measure of distributional asymmetry that preserves the association between the two variables. Similarly, co-kurtosis is a measure of tailedness that preserves the x–y relation. Chapter 3 shows that these co-moment measures exhibit properties that are useful in determining the direction of causation in linear models.
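Because all of these quantities are plain expectations of centered powers, they can be estimated with a few lines of code. The helper below is our own sketch (the function name is illustrative) implementing Eqs. (1.31) and (1.32) for arbitrary powers k and l:

```python
import numpy as np

def higher_order_cor(x, y, k, l):
    """Estimate cor(x, y)_{kl} = E[(x - mu_x)^k (y - mu_y)^l] / (sigma_x^k sigma_y^l)."""
    dx, dy = x - x.mean(), y - y.mean()
    cov_kl = np.mean(dx**k * dy**l)               # higher-order covariance, Eq. (1.31)
    return cov_kl / (x.std()**k * y.std()**l)     # normalization, Eq. (1.32)

rng = np.random.default_rng(seed=4)
x = rng.exponential(size=50_000)
y = 0.6 * x + rng.exponential(size=50_000)

print(higher_order_cor(x, y, 1, 1))   # Pearson correlation, Eq. (1.30)
print(higher_order_cor(x, y, 1, 2),
      higher_order_cor(x, y, 2, 1))   # co-skewness estimates, Eq. (1.33)
print(higher_order_cor(x, y, 2, 2))   # one of the co-kurtosis measures, Eq. (1.34)
```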

To illustrate the application of higher co-moment measures, we once again revisit Galton’s height data. However, this time, we consider the heights of both (adult) children and their parents. In Galton’s (1886) original study, parents’ heights were averaged (forming midparent’s heights), with female heights being multiplied by a factor of 1.08 (for further details see Hanley, 2004). Figure 1.8 shows the scatterplot of the standardized children’s and midparent’s heights (data points were slightly jittered to avoid overlap). Here, the LOWESS fit suggests a slightly nonlinear relationship (see also Wachsmuth et al., 2003).


Figure 1.8 Scatterplot of standardized Galton’s height data with LOWESS line superimposed.


While Galton’s data have repeatedly been used to illustrate linear regression and correlation, the present re-analysis focuses on higher moment characteristics of the data using the higher-order correlation formula in Eq. (1.32). Recall that $\mathrm{cor}(x, y)_{11}$ leads to the standard Pearson correlation coefficient, and $\mathrm{cor}(x, y)_{k0}$ and $\mathrm{cor}(x, y)_{0l}$ can be used to describe higher moments of the marginal distributions. Table 1.1 summarizes the higher-order correlation estimates for several sets of power values. In addition, 95% bootstrap percentile confidence limits (based on 2,000 resamples) are reported for statistical inference.
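The bootstrap scheme behind these confidence limits is straightforward: resample cases with replacement, recompute the higher-order correlation in each resample, and take the 2.5th and 97.5th percentiles. A minimal sketch follows (our illustration; the simulated stand-in data merely mimic the structure of the Galton data, which would be loaded from an external source):

```python
import numpy as np

def hoc(x, y, k, l):
    """Higher-order correlation cor(x, y)_{kl}, cf. Eqs. (1.31)-(1.32)."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.mean(dx**k * dy**l) / (x.std()**k * y.std()**l)

def boot_ci(x, y, k, l, reps=2000, seed=5):
    """95% percentile bootstrap CI for cor(x, y)_{kl} based on `reps` case resamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(reps)
    for r in range(reps):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        stats[r] = hoc(x[idx], y[idx], k, l)
    return np.percentile(stats, [2.5, 97.5])

# Simulated stand-in for the 928 Galton cases (illustrative only):
rng = np.random.default_rng(seed=6)
x = rng.normal(size=928)                        # "children's heights" (standardized)
y = 0.46 * x + 0.89 * rng.normal(size=928)      # "midparent's heights"
print(hoc(x, y, 2, 2), boot_ci(x, y, 2, 2))
```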

Table 1.1 Higher-order correlation estimates $\mathrm{cor}(x, y)_{kl}$ for different power values (x = children’s height, y = midparent’s height) together with the 95% percentile bootstrap confidence intervals (based on 2,000 resamples).

Statistic               k   l   Estimate   95% CI lower   95% CI upper
Correlation             1   1     0.459      0.406          0.513
Skewness of child       3   0    -0.088     -0.189          0.025
Skewness of midparent   0   3    -0.035     -2.8E-17        2.9E-17
Co-skewness             1   2     0.074     -0.029          0.176
Co-skewness             2   1     0.095     -0.001          0.189
Kurtosis of child       4   0     2.656      2.507          2.825
Kurtosis of midparent   0   4     3.058      2.873          3.270
Co-kurtosis             1   3     1.537      1.333          1.722
Co-kurtosis             3   1     1.298      1.113          1.485
Co-kurtosis             2   2     1.516      1.367          1.660

First, using k = l = 1 gives a Pearson correlation estimate of 0.459 (95% CI = [0.406, 0.513]), suggesting that children’s and midparent’s heights are positively correlated. Next, to estimate the skewness of the marginal height distributions, we use k = 3 and l = 0 as well as k = 0 and l = 3, leading to skewness estimates for children’s and midparent’s heights. Both estimates are close to zero, suggesting that both distributions are symmetric. In addition, the two co-skewness estimates (i.e., using k = 2 and l = 1 as well as k = 1 and l = 2) are also close to zero. The last step focuses on the tailedness of the height distributions and estimates the kurtosis of the marginal distributions (with {k, l} = {4, 0} and {0, 4}) and the three co-kurtosis measures (with {k, l} = {3, 1}, {2, 2}, and {1, 3}). Results suggest that children’s heights tend to be platykurtic (with a kurtosis value significantly smaller than 3, which replicates results presented above) and midparent’s heights tend to be mesokurtic (with a kurtosis estimate close to 3). Finally, to interpret the three co-kurtosis estimates, it is helpful to consider the theoretical values one would expect if both variables were Gaussian. In this case, the co-kurtosis reduces to

$$\mathrm{cor}(x, y)_{22} = 1 + 2\rho_{xy}^2, \qquad \mathrm{cor}(x, y)_{13} = \mathrm{cor}(x, y)_{31} = 3\rho_{xy}.\tag{1.35}$$
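For readers interested in where these benchmarks come from, a brief derivation sketch (not spelled out in the surrounding text) uses Isserlis’ theorem, a standard result for zero-mean jointly Gaussian variables:

```latex
% Isserlis' theorem: E[x_1 x_2 x_3 x_4] = E[x_1 x_2]E[x_3 x_4]
%                  + E[x_1 x_3]E[x_2 x_4] + E[x_1 x_4]E[x_2 x_3].
% For standardized jointly Gaussian x and y (unit variances, E[xy] = rho_xy):
\begin{aligned}
\mathrm{cor}(x, y)_{22} &= E[x^2 y^2] = E[x^2]E[y^2] + 2E[xy]^2 = 1 + 2\rho_{xy}^2,\\
\mathrm{cor}(x, y)_{13} &= E[x y^3] = 3\,E[xy]\,E[y^2] = 3\rho_{xy}.
\end{aligned}
```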

In other words, under the assumption that children’s and midparent’s heights are Gaussian, one would expect $\mathrm{cor}(x, y)_{22} = 1 + 2(0.459)^2 = 1.421$ and $\mathrm{cor}(x, y)_{13} = \mathrm{cor}(x, y)_{31} = 3(0.459) = 1.377$. Considering the point estimate together with its 95% CI, $\mathrm{cor}(x, y)_{22} = 1.516$ (95% CI = [1.367, 1.660]) does not deviate significantly from the theoretical value of 1.421. In addition, the 95% CIs of $\mathrm{cor}(x, y)_{13}$ and $\mathrm{cor}(x, y)_{31}$ include the theoretical value of 1.377. While, in the present application, we used estimates of higher-order correlations to characterize marginal and joint height distributions, Chapters 3 and 4 will make use of higher-order correlations to construct test statistics that can be informative in discerning the direction of causation.
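As a quick arithmetic check of the benchmark values quoted above (numbers taken from Table 1.1):

```python
rho = 0.459
print(1 + 2 * rho**2)   # 1.4214..., Gaussian benchmark for cor(x, y)_{22}
print(3 * rho)          # 1.377,     Gaussian benchmark for cor(x, y)_{13} and cor(x, y)_{31}
# The observed cor(x, y)_{22} = 1.516 with 95% CI [1.367, 1.660] covers 1.421,
# and the CIs [1.333, 1.722] and [1.113, 1.485] both cover 1.377.
```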

1.4 Take-Home Messages

  • The toolbox of statistical methods for causality research can roughly be classified into methods for the pursuit of causal inference and methods for causal structure learning (also known as causal discovery). Causal inference methods are typically designed to deal with the identification of causal effects, that is, with quantifying the change in an outcome variable that is induced by changes in the exogenous variables. In contrast, methods of causal structure learning are designed to discern statements about the connectivity of variables, that is, to learn the underlying causal structure of variable relations.

  • Methods of Direction Dependence Analysis (DDA) constitute a class of causal discovery approaches that are designed to probe hypothesized causal models against plausible alternative causal structures. That is, in contrast to common methods of causal machine learning, where the elucidation of causal structures of multivariate networks is the focus of analysis (carrying an exploratory element), DDA is concerned with a confirmatory evaluation of hypothesized causal target models.

  • To test causally competing models, methods of DDA use information that goes beyond second-order moments of variables, that is, variable information that becomes available when the constructs under study deviate from the normal (Gaussian) distribution. Thus, higher moments of variables constitute one of the key elements in DDA.

  • Moments of variables describe the distribution of probability mass and characterize the location (center), scale (spread), and shape of variable distributions. Raw moments are taken about the origin (zero), whereas central moments are taken about the arithmetic mean of a variable. In addition to central moments, standardized moments can be used to make statements about the tails of a probability distribution while accounting for its location and its spread.

  • The first four moments describe the mean, the variance, the skewness, and the kurtosis of a probability distribution. The mean describes the center of mass (i.e., the one point where the probability distribution is perfectly balanced) and the variance describes the spread of a distribution around the mean. The third moment describes the degree of symmetry and serves as a measure of the relative tailedness of a distribution. Negative skewness values indicate that the distribution is left skewed; positive skewness values are indicative of a right skewed distribution. The fourth moment serves as a measure of the total tailedness of a distribution. Here, the Gaussian (normal) distribution can be used as an interpretational benchmark: A distribution is said to be (1) platykurtic when its tails are lighter than those of a Gaussian distribution, (2) mesokurtic when its tails are in line with the tailedness of a Gaussian distribution, and (3) leptokurtic when the distribution has heavier tails compared to the Gaussian distribution.

  • Cumulants constitute another set of constants to describe distributional features of variables, which are, from a theoretical perspective, sometimes more useful than moments. The first three cumulants equal the first three central moments. This simple relation, however, does not hold for the fourth moment.

  • Joint moments extend the concept of moments to the multivariate case. For two variables, the joint moment with k = l = 1 yields the covariance. Higher joint moments lead to estimates of the co-skewness and the co-kurtosis of variables, that is, measures of distributional symmetry and tail heaviness that preserve the relation between variables. Both marginal and joint higher moments contain valuable information in the process of discerning the direction of causation in variable relations.

