Data management concerns collecting, processing, analyzing, organizing, storing, and maintaining the data you collect for a research design. The focus in this chapter is on learning how to use Stata and apply data-management techniques to a provided dataset. No previous knowledge is required for the applications. The chapter goes through the basic operations for data management, including missing-value analysis and outlier analysis. It then covers descriptive statistics (univariate analysis) and bivariate analysis. Finally, it discusses how to merge and append datasets. This chapter is essential preparation for the applications, lab work, and mini case studies in the following chapters, since it familiarizes the reader with the software. Stata code is provided in the main text. For those who are interested in using Python or R instead, the corresponding code is provided on the online resources page (www.cambridge.org/mavruk).
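A minimal Python sketch of the operations the chapter covers (missing-value analysis, outlier screening, descriptive and bivariate statistics, merging, and appending), assuming hypothetical file and column names (returns.csv, firms.csv, returns_extra.csv, ret, volume, firm_id); the book itself works in Stata, with Python and R equivalents on the companion site.

import pandas as pd

# Hypothetical datasets; the chapter's own examples use a provided Stata dataset.
returns = pd.read_csv("returns.csv")
firms = pd.read_csv("firms.csv")

# Missing-value analysis: count missing observations per variable.
print(returns.isna().sum())

# Univariate descriptive statistics (mean, std, min, quartiles, max).
print(returns["ret"].describe())

# Simple outlier screen: flag observations more than three SDs from the mean.
z = (returns["ret"] - returns["ret"].mean()) / returns["ret"].std()
print(returns[z.abs() > 3])

# Bivariate analysis: correlation between two numeric variables.
print(returns[["ret", "volume"]].corr())

# Merge on a common identifier, then append another file with the same columns.
merged = returns.merge(firms, on="firm_id", how="left")
appended = pd.concat([merged, pd.read_csv("returns_extra.csv")], ignore_index=True)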
In the evaluation of experiments, the problem often arises of how to compare the predictive success of competing probabilistic theories. The quadratic scoring rule can be used for this purpose. Originally, this rule was proposed as an incentive-compatible elicitation method for probabilistic expert judgments. It is shown that, up to a positive linear transformation, the quadratic scoring rule is characterized by four desirable properties.
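For reference (and as an assumption about the rule's standard form, since the abstract does not spell it out), the quadratic scoring rule for a reported probability vector p = (p_1, ..., p_n) when outcome i obtains is commonly written as

\[
S(\mathbf{p}, i) \;=\; 2p_i \;-\; \sum_{j=1}^{n} p_j^{2},
\]

which is, up to a positive linear transformation, the negative Brier score \(-\sum_{j}(e_j - p_j)^2\), where e is the indicator vector of the realized outcome.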
This chapter introduces a novel dataset encompassing 731,810 witnesses across 74,077 House, Senate, and Joint standing committee hearings held between 1961 and 2018. The dataset includes comprehensive details such as witness names, organizational affiliations, hearing summaries, committee titles, dates, and bill numbers discussed. The chapter describes the meticulous construction process, emphasizing the extraction of key variables: witness affiliation, affiliation type, and gender. With eighteen categorized affiliation types and nine broader parent categories, this classification captures the diverse spectrum of external groups represented in congressional hearings. The chapter also provides rich descriptive statistics on hearings and witnesses over time and across committees.
Chapter 1 explores the link between the research process and theory and the role of statistics in scientific discovery. Discrete and continuous variables, the building blocks of methodology, take center stage, with clear and elaborate examples of their applicability to scales of measurement and measures of central tendency. Understanding statistics allows us to become better consumers of science and make better judgments and decisions about claims and facts allegedly supported by statistical results.
In Chapter 7, the author introduces both content analysis and basic statistical analysis to help evaluate the effectiveness of assessments. The focus of the chapter is on guidelines for creating and evaluating reading and listening inputs and selected response item types, particularly multiple-choice items that accompany these inputs. The author guides readers through detailed evaluations of reading passages and accompanying multiple-choice items that need major revisions. The author discusses generative artificial intelligence as an aid for drafting inputs and creating items and includes an appendix which guides readers through the use of ChatGPT for this purpose. The author also introduces test-level statistics, including minimum, maximum, range, mean, variance, standard deviation, skewness, and kurtosis. The author shows how to calculate these statistics for an actual grammar tense test and includes an appendix with detailed guidelines for conducting these analyses using Excel software.
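A brief Python sketch of the test-level statistics listed above, computed on hypothetical item scores; the chapter itself demonstrates these calculations for an actual grammar tense test using Excel.

import pandas as pd

# Hypothetical test scores for illustration only.
scores = pd.Series([12, 15, 18, 20, 20, 22, 25, 27, 30, 34])

stats = {
    "minimum": scores.min(),
    "maximum": scores.max(),
    "range": scores.max() - scores.min(),
    "mean": scores.mean(),
    "variance": scores.var(),   # sample variance (n - 1 denominator)
    "std dev": scores.std(),    # sample standard deviation
    "skewness": scores.skew(),  # asymmetry of the score distribution
    "kurtosis": scores.kurt(),  # excess kurtosis relative to the normal curve
}
for name, value in stats.items():
    print(f"{name}: {value:.3f}")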
This is the main methodology and first-results chapter. It opens with an introduction to the lexeme-based approach used for the investigation, contrasting it with previous, variationist approaches. The chapter then explains the data retrieval and screening processes and presents an overview of the data: the nearly 65,000 intensifier tokens found in the corpus, across the three main categories (maximizers, boosters, downtoners), and the descriptive results over time for the most frequent items. The word counts of the different sociopragmatic groups of speakers (divided by speakers’ role in the courtroom, gender, and social class) are introduced, as well as the diachronic distribution of intensifiers across genders and social classes. Results are presented within a descriptive-statistics framework, but the chapter also briefly introduces the regression model, the inferential, multivariate statistical method used in Chapters 8–11 to disentangle the complex interplay of speakers’ sociopragmatic variables in the use of intensifiers.
This chapter covers the two topics of descriptive statistics and the normal distribution. We first discuss the role of descriptive statistics and the measures of central tendency, variance, and standard deviation. We also provide examples of the kinds of graphs often used in descriptive statistics. We next discuss the normal distribution, its properties and its role in descriptive and inferential statistical analysis.
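For reference, the normal distribution discussed here has density

\[
f(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\]

with mean \( \mu \) and standard deviation \( \sigma \); roughly 68%, 95%, and 99.7% of observations fall within one, two, and three standard deviations of the mean, respectively.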
The chapter ‘#StatsWithCats’ presents statistical methods for interpreting and visualising cat-related online data. The selected sociolinguistic variables are the social media platform and the cat account type. The chapter uses frequencies and crosstabs to describe linguistic variation across four social media platforms and four cat account types. The selected linguistic variables refer to the choice between non-meowlogisms and meowlogisms on Facebook, Instagram, Twitter, and YouTube, as well as in collective, for-profit celebrity, working-for-cause, and individual cat accounts. Additionally, the chapter uses social network analysis to illustrate the networks in cat-related digital spaces.
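A minimal pandas sketch of the frequency-and-crosstab approach described here, using a handful of hypothetical observations coded for platform, account type, and lexical choice (meowlogism vs. non-meowlogism).

import pandas as pd

# Hypothetical coded observations for illustration only.
data = pd.DataFrame({
    "platform": ["Instagram", "Twitter", "Facebook", "YouTube", "Instagram", "Twitter"],
    "account_type": ["individual", "celebrity", "collective", "working-for-cause",
                     "individual", "collective"],
    "lexical_choice": ["meowlogism", "non-meowlogism", "meowlogism",
                       "meowlogism", "non-meowlogism", "meowlogism"],
})

# Simple frequencies of the linguistic variable.
print(data["lexical_choice"].value_counts())

# Crosstab of lexical choice by platform, shown as row percentages.
print(pd.crosstab(data["platform"], data["lexical_choice"], normalize="index"))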
We offer methods to analyze the “differentially private” Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias—including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of nonconfidential data but with appropriately larger standard errors.
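As an illustration of the general idea (not the paper's actual estimator, which is more general and engineered for datasets of this scale), the classical correction for attenuation bias applies when a regressor carries independent additive noise of known variance, as with calibrated differentially private noise:

import numpy as np

def corrected_slope(x_noisy, y, noise_var):
    """Bias-corrected OLS slope when x is observed with known additive noise."""
    naive_slope = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy, ddof=1)
    # Reliability ratio: share of the observed variance that is true signal.
    reliability = (np.var(x_noisy, ddof=1) - noise_var) / np.var(x_noisy, ddof=1)
    return naive_slope / reliability

# Simulated example with a known noise variance.
rng = np.random.default_rng(0)
x_true = rng.normal(size=10_000)
y = 2.0 * x_true + rng.normal(size=10_000)
noise_var = 0.5
x_noisy = x_true + rng.normal(scale=np.sqrt(noise_var), size=10_000)
print(corrected_slope(x_noisy, y, noise_var))  # close to 2.0; the naive slope is attenuated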
As data analytic methods in the managerial sciences become more sophisticated, the gap between the descriptive data typically presented in Table 1 and the analyses used to test the principal hypotheses advanced has become increasingly large. This contributes to several problems, including: (1) the increasing likelihood that analyses presented in published research will be performed and/or interpreted incorrectly, (2) an increasing reliance on statistical significance as the principal criterion for evaluating results, and (3) the increasing difficulty of describing our research and explaining our findings to non-specialists. A set of simple methods, based on the most elementary examination of descriptive statistics, is proposed for assessing whether hypotheses about interventions, moderator relationships, and mediation are plausible.
Stress is the most important proximal precipitant of depression, yet most large genome-wide association studies (GWAS) do not include stress as a variable. Here, we review how gene × environment (G × E) interaction might impede the discovery of genetic factors, discuss two examples of G × E interaction in depression and addiction, consider studies incorporating high-stress environments, and look ahead to upcoming waves of genome-wide environment interaction studies (GWEIS). We discuss recent studies showing that genetic distributions can be affected by social factors such as migrations and socioeconomic background. These distinctions are not just academic but have practical consequences. Owing to interaction with the environment, genetic predispositions to depression should not be viewed as unmodifiable destiny. Patients may differ genetically not just in their response to drugs, as in the now well-recognised field of pharmacogenetics, but also in how they react to stressful environments and how they are affected by behavioural therapies.
This chapter discusses two types of descriptive statistics: models of central tendency and models of variability. Models of central tendency describe the location of the middle of the distribution, and models of variability describe the degree that scores are spread out from one another. There are four models of central tendency in this chapter. Listed in ascending order of the complexity of their calculations, these are the mode, median, mean, and trimmed mean. There are also four principal models of variability discussed in this chapter: the range, interquartile range, standard deviation, and variance. For the latter two statistics, students are shown three possible formulas (sample standard deviation and variance, population standard deviation and variance, and population standard deviation and variance estimated from sample data), along with an explanation of when it is appropriate to use each formula. No statistical model of central tendency or variability tells you everything you may need to know about your data. Only by using multiple models in conjunction with each other can you have a thorough understanding of your data.
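A short Python sketch of the eight models named above, computed on hypothetical scores; the variance and standard deviation shown use the sample (n − 1) formulas, one of the three variants the chapter distinguishes.

import numpy as np
from scipy import stats

# Hypothetical scores; the final value (40) is a high outlier.
scores = np.array([2, 4, 4, 5, 7, 9, 12, 15, 18, 40])

# Models of central tendency.
mode = stats.mode(scores).mode
median = np.median(scores)
mean = scores.mean()
trimmed_mean = stats.trim_mean(scores, proportiontocut=0.1)  # drops the lowest and highest score

# Models of variability.
value_range = scores.max() - scores.min()
iqr = stats.iqr(scores)
sample_sd = scores.std(ddof=1)        # n - 1 denominator (sample formula)
sample_variance = scores.var(ddof=1)  # population formulas would use ddof=0

print(mode, median, mean, trimmed_mean, value_range, iqr, sample_sd, sample_variance)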
Chapter 3 describes in detail the data sources and research designs used throughout the book, including observational data sources, experiments on national samples of American citizens, and panel surveys tracking the same people over time. It also summarizes aggregate public opinion on key variables through time, including approval, confidence, trust, procedural perceptions, and broadly targeted and narrowly targeted Court-curbing. The chapter concludes that the Court’s “reservoir of goodwill” within the American public is not as deep or wide as many scholars suggest.
This chapter discusses Feature Engineering techniques that look holistically at the feature set, replacing or enhancing features based on their relation to the whole set of instances and features. Techniques such as normalization, scaling, dealing with outliers, and generating descriptive features are covered. Scaling and normalization are the most common: they involve finding the maximum and minimum of a feature and transforming its values so they lie in a given interval (e.g., [0, 1] or [−1, 1]). Discretization and binning involve, for example, analyzing an integer feature (any number from −1 trillion to +1 trillion), realizing that it only takes the values 0, 1, and 10, and simplifying it into a symbolic feature with three values (value0, value1, and value10). Generating descriptive features means gathering information about the shape of the data; the discussion centres on tables of counts (histograms) and general descriptive features such as maximum, minimum, and averages. Outlier detection and treatment refers to examining feature values across many instances and recognizing that some values lie very far from the rest.
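A minimal sketch of two of these techniques, min–max scaling into [0, 1] and recoding a sparse integer feature as a symbolic one, using hypothetical feature values:

import numpy as np

# Hypothetical integer feature that only takes the values 0, 1 and 10.
feature = np.array([0, 1, 10, 0, 10, 1, 0])

# Scaling/normalization: map values into [0, 1] using the observed min and max.
scaled = (feature - feature.min()) / (feature.max() - feature.min())

# Discretization/binning: recode the sparse integer feature as a three-valued
# symbolic feature (value0, value1, value10).
symbolic = np.array([f"value{v}" for v in feature])

# Descriptive features: simple summaries of the shape of the data.
print(scaled, symbolic, feature.min(), feature.max(), feature.mean())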
This reflection presents a discussion of some common measures of variability and how they are appropriately used in descriptive and inferential statistical analyses. We argue that confidence intervals (CIs), which incorporate these measures, serve as tools to assess both clinical and statistical significance.
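As one concrete instance (a standard textbook form, not drawn from the reflection itself), the t-based confidence interval for a mean combines a descriptive measure of variability, the sample standard deviation s, with the sample size n:

\[
\bar{x} \;\pm\; t_{\alpha/2,\, n-1}\,\frac{s}{\sqrt{n}} .
\]

An interval that excludes the null value indicates statistical significance, while its location and width can be judged against clinically meaningful effect sizes.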
This chapter presents a detailed example that applies the compensation analytics concepts developed in Chapter 6. The reader is assumed to be a compensation consultant charged with evaluating whether gender-based discrimination in pay is present in a public university system in the sciences. Section 7.1 walks through the analysis step-by-step, from formulating the business question, to acquiring and cleaning data, to analyzing the data and interpreting the results from voluminous statistical output in light of the business question. Section 7.2 covers exploratory data mining, causality, and experiments. Exploratory data mining covers situations in which the manager does not know in advance which relationships in the data will be of interest, in contrast to the example in section 7.1 in which a statistical model and specific measures could be constructed that were directly tailored to address the business question at hand. Section 7.2 covers the challenges associated with establishing causality in compensation research and how experiments can sometimes be designed to address those challenges. Randomization and some pitfalls associated with compensation experiments are also covered.
This chapter responds to the growing importance of business analytics on "big data" in managerial decision-making, by providing a comprehensive primer on analyzing compensation data. All aspects of compensation analytics are covered, starting with data acquisition, types of data, and formulation of a business question that can be informed by data analysis. A detailed, hands-on treatment of data cleaning is provided, equipping readers to prepare data for analysis by detecting and fixing data problems. Descriptive statistics are reviewed, and their utility in data cleaning explicated. Graphical methods are used in examples to detect and trim outliers. The basics of linear regression analysis are covered, with an emphasis on application and interpreting results in the context of the business question(s) posed. One section covers the question of whether or not the pay measure (as a dependent variable) should be transformed via a logarithm, and the implications of that choice for interpreting the results are explained. Precision of regression estimates is covered via an intuitive, non-technical treatment of standard errors. An appendix covers nonlinear relationships among variables.
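A brief sketch of the log-transformation choice discussed here, using statsmodels on simulated, hypothetical pay data; with log pay as the dependent variable, a coefficient b on a regressor is read (approximately, for small b) as a 100·b percent difference in pay rather than a dollar difference.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated, hypothetical pay data for illustration only.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "female": rng.integers(0, 2, n),
})
df["pay"] = 40_000 * np.exp(0.03 * df["experience"] - 0.05 * df["female"]
                            + rng.normal(scale=0.2, size=n))

# Levels: coefficients are dollar differences in pay.
levels = smf.ols("pay ~ experience + female", data=df).fit()

# Logs: coefficients are approximate proportional (percent) differences in pay.
logs = smf.ols("np.log(pay) ~ experience + female", data=df).fit()

print(levels.params, logs.params, logs.bse, sep="\n")  # .bse reports standard errors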