
3 - Basic Data Analysis in Stata
Taylan Mavruk, University of Gothenburg
Book:

Quantitative Research Methods in Corporate Finance

Published online:

20 March 2025

Print publication:

20 March 2025, pp. 21-39
- Chapter
Summary

Data management covers collecting, processing, analyzing, organizing, storing, and maintaining the data gathered for a research design. This chapter focuses on learning how to use Stata and on applying data-management techniques to a provided dataset. No previous knowledge is required for the applications. The chapter goes through the basic operations for data management, including missing-value analysis and outlier analysis. It then covers descriptive statistics (univariate analysis) and bivariate analysis, and ends by discussing how to merge and append datasets. Because it familiarizes the reader with the software, this chapter is a prerequisite for the applications, lab work, and mini case studies in the following chapters. Stata code is provided in the main text. For those who prefer Python or R, the corresponding code is provided on the online resources page (www.cambridge.org/mavruk).

Rasch's Multiplicative Poisson Model with Covariates
Haruhiko Ogasawara
Journal:

Psychometrika / Volume 61 / Issue 1 / March 1996

Published online by Cambridge University Press:

01 January 2025, pp. 73-92
- Article
As a multivariate model of the number of events, Rasch's multiplicative Poisson model is extended such that the parameters for individuals in the prior gamma distribution have continuous covariates. The parameters for individuals are integrated out and the hyperparameters in the prior distribution are estimated by a numerical method separately from difficulty parameters that are treated as fixed parameters or random variables. In addition, a method is presented for estimating parameters in Rasch's model with missing values.

Generalized Canonical Correlation Analysis of Matrices with Missing Rows: a Simulation Study
Michel van de Velden, Tammo H. A. Bijmolt
Journal:

Psychometrika / Volume 71 / Issue 2 / June 2006

Published online by Cambridge University Press:

01 January 2025, pp. 323-331
- Article
A method is presented for generalized canonical correlation analysis of two or more matrices with missing rows. The method is a combination of Carroll’s (1968) method and the missing data approach of the OVERALS technique (Van der Burg, 1988). In a simulation study we assess the performance of the method and compare it to an existing procedure called GENCOM, proposed by Green and Carroll (1988). We find that the proposed method outperforms the GENCOM algorithm both with respect to model fit and recovery of the true structure.

A Bayesian Approach Towards Missing Covariate Data in Multilevel Latent Regression Models
Christian Aßmann, Jean-Christoph Gaasch, Doris Stingl
Journal:

Psychometrika / Volume 88 / Issue 4 / December 2023

Published online by Cambridge University Press:

01 January 2025, pp. 1495-1528
- Article
The measurement of latent traits and investigation of relations between these and a potentially large set of explaining variables is typical in psychology, economics, and the social sciences. Corresponding analysis often relies on surveyed data from large-scale studies involving hierarchical structures and missing values in the set of considered covariates. This paper proposes a Bayesian estimation approach based on the device of data augmentation that addresses the handling of missing values in multilevel latent regression models. Population heterogeneity is modeled via multiple groups enriched with random intercepts. Bayesian estimation is implemented in terms of a Markov chain Monte Carlo sampling approach. To handle missing values, the sampling scheme is augmented to incorporate sampling from the full conditional distributions of missing values. We suggest modeling the full conditional distributions of missing values in terms of non-parametric classification and regression trees. This offers the possibility to consider information from latent quantities functioning as sufficient statistics. A simulation study reveals that this Bayesian approach provides valid inference and outperforms complete-case analysis and multiple imputation in terms of statistical efficiency and computation time involved. An empirical illustration using data on mathematical competencies demonstrates the usefulness of the suggested approach.

Performance and development of a thin stock market: the Stockholm Stock Exchange 1912–2017
Kristian Rydqvist, Rong Guo
Journal:

Financial History Review / Volume 28 / Issue 1 / April 2021

Published online by Cambridge University Press:

10 September 2020, pp. 26-44
- Article
We estimate historical stock returns for Swedish listed companies in a newly constructed data set of daily stock prices that spans more than 100 years. Stock returns exhibit all the familiar characteristics. The growth of the public sector depressed the stock market, and the process of globalization revitalized it. Banks played an important role in the early development of the stock market. There was little trading in the past, and we examine the effects on return measurement from missing data. Stock selection and the replacement of missing transaction prices through search back procedures or limit orders make little difference to a value-weighted stock price index, while ignoring the price effects of capital operations makes a big difference.

STATISTICAL ANALYSIS OF PROTEOMIC MASS SPECTROMETRY DATA FOR THE IDENTIFICATION OF BIOMARKERS AND DISEASE DIAGNOSIS
TYMAN E. STANFORD
Journal:

Bulletin of the Australian Mathematical Society / Volume 94 / Issue 2 / October 2016

Published online by Cambridge University Press:

16 August 2016, pp. 345-346

Print publication:

October 2016
- Article

Missing portion sizes in FFQ – alternatives to use of standard portions
Rasmus Køster-Rasmussen, Volkert Siersma, Thorhallur I Halldorsson, Niels de Fine Olivarius, Jan E Henriksen, Berit L Heitmann
Journal:

Public Health Nutrition / Volume 18 / Issue 11 / August 2015

Published online by Cambridge University Press:

10 November 2014, pp. 1914-1921
- Article
Objective
Standard portions or substitution of missing portion sizes with medians may generate bias when quantifying the dietary intake from FFQ. The present study compared four different methods to include portion sizes in FFQ.
Design
We evaluated three stochastic methods for imputation of portion sizes based on information about anthropometry, sex, physical activity and age. Energy intakes computed with standard portion sizes, defined as sex-specific medians (median), or with portion sizes estimated with multinomial logistic regression (MLR), ‘comparable categories’ (Coca) or k-nearest neighbours (KNN) were compared with a reference based on self-reported portion sizes (quantified by a photographic food atlas embedded in the FFQ).
Setting
The Danish Health Examination Survey 2007–2008.
Subjects
The study included 3728 adults with complete portion size data.
Results
Compared with the reference, the root-mean-square errors of the mean daily total energy intake (in kJ) computed with portion sizes estimated by the four methods were (men; women): median (1118; 1061), MLR (1060; 1051), Coca (1230; 1146), KNN (1281; 1181). The equivalent biases (mean error) were (in kJ): median (579; 469), MLR (248; 178), Coca (234; 188), KNN (−340; 218).
Conclusions
The methods MLR and Coca provided the best agreement with the reference. The stochastic methods allowed for estimation of meaningful portion sizes by conditioning on information about physiology and they were suitable for multiple imputation. We propose to use MLR or Coca to substitute missing portion size values or when portion sizes need to be included in FFQ without portion size data.

A Row-Wise Stacking of the Runoff Triangle: State Space Alternatives for IBNR Reserve Prediction
Rodrigo Atherino, Adrian Pizzinga, Cristiano Fernandes
Journal:

ASTIN Bulletin: The Journal of the IAA / Volume 40 / Issue 2 / November 2010

Published online by Cambridge University Press:

09 August 2013, pp. 917-946

Print publication:

November 2010
- Article
This work deals with prediction of IBNR reserve under a different data ordering of the non-cumulative runoff triangle. The rows of the triangle are stacked, resulting in a univariate time series with several missing values. Under this ordering, two approaches entirely based on state space models and the Kalman filter are developed, implemented with two real data sets, and compared with two well-established IBNR estimation methods: the chain ladder and an overdispersed Poisson regression model. The empirical results point to: (i) computational feasibility and efficiency; (ii) accuracy improvement for IBNR prediction; and (iii) flexibility regarding IBNR modeling possibilities.
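The row-wise stacking described in this abstract can be illustrated with a minimal sketch. The 4x4 triangle, its values, and the names `triangle` and `stack_rows` are hypothetical and not taken from the article; the sketch only shows how stacking the rows yields one series with missing values, and does not implement the state space models or the Kalman filter.

```python
# Hypothetical 4x4 incremental (non-cumulative) runoff triangle.
# Rows are accident years, columns are development years.
# None marks future cells that have not been observed yet.
triangle = [
    [100.0, 60.0, 30.0, 10.0],
    [110.0, 65.0, 35.0, None],
    [120.0, 70.0, None, None],
    [130.0, None, None, None],
]


def stack_rows(tri):
    """Concatenate the rows of the triangle into one univariate series."""
    return [cell for row in tri for cell in row]


series = stack_rows(triangle)

# Positions in the stacked series that a time series model would need to predict.
missing_positions = [i for i, value in enumerate(series) if value is None]
```

Under this ordering, an IBNR reserve estimate would be the sum of the predictions at `missing_positions`, whichever model produces them.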

Comparing methods for handling missing values in food-frequency questionnaires and proposing k nearest neighbours imputation: effects on dietary intake in the Norwegian Women and Cancer study (NOWAC)
Christine L Parr, Anette Hjartåker, Ida Scheel, Eiliv Lund, Petter Laake, Marit B Veierød
Journal:

Public Health Nutrition / Volume 11 / Issue 4 / April 2008

Published online by Cambridge University Press:

01 April 2008, pp. 361-370
- Article
Objective
To investigate item non-response in a postal food-frequency questionnaire (FFQ), and to assess the effect of substituting/imputing missing values on dietary intake levels in the Norwegian Women and Cancer study (NOWAC). We have adapted k nearest neighbours (KNN) imputation and, probably for the first time, applied it to FFQ data.
Design
Data from a recent reproducibility study were used. The FFQ was mailed twice (test–retest) about 3 months apart to the same subjects. Missing responses in the test FFQ were imputed using the null value (frequencies = null, amount = smallest), the sample mode, the sample median, KNN, and retest values.
Setting
A methodological substudy of NOWAC, a national population-based cohort.
Subjects
A random sample of 2000 women aged 46–75 years was drawn from the cohort in 2002 (response 75%). The imputation methods were compared for 1430 women who completed at least 50% of the test FFQ.
Results
We imputed 16% missing values in the overall test data matrix. Compared to null value imputation, the largest differences in estimated dietary intake were seen for KNN, and for food items with a high proportion of missing values. Imputation with retest values increased total energy intake, indicating that not all missing values are caused by respondents failing to specify no consumption, and that null value imputation may lead to underestimation and misclassification.
Conclusion
Missing values in FFQs present a methodological challenge. We encourage the application and evaluation of newer imputation methods, including KNN, which may reduce imputation errors and give more accurate intake estimates.
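As a rough illustration of the general idea behind k nearest neighbours imputation, the sketch below fills each missing item with the mean of that item among the k most similar respondents. It is a generic version under stated assumptions, not the procedure adapted in the article: the distance measure (root mean squared difference over jointly observed items), the use of the mean, the toy data, and the names `distance` and `knn_impute` are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Root mean squared difference over items observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows where each None is replaced by the mean of
    that item among the k nearest rows that observed it."""
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with item j observed and at
            # least one item in common with the incomplete row.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


# Toy frequency data (hypothetical): 4 respondents, 3 food items.
responses = [
    [1, 2, 0],
    [1, 2, None],
    [1, 3, 1],
    [5, 0, 4],
]
imputed = knn_impute(responses, k=2)
```

In this toy example the second respondent is closest to the first and third, so the missing item is filled with the mean of their answers rather than with a null value or a sample-wide median.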

Alternating projections and interpolation of stationary processes
Part of
- Inference from stochastic processes
- Stochastic processes
Mohsen Pourahmadi
Journal:

Journal of Applied Probability / Volume 29 / Issue 4 / December 1992

Published online by Cambridge University Press:

14 July 2016, pp. 921-931

Print publication:

December 1992
- Article
By using the alternating projection theorem of J. von Neumann, we obtain explicit formulae for the best linear interpolator and interpolation error of missing values of a stationary process. These are expressed in terms of multistep predictors and autoregressive parameters of the process. The key idea is to approximate the future by a finite-dimensional space.
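For orientation, the simplest textbook special case of this interpolation problem can be sketched in a few lines. The sketch assumes a zero-mean AR(1) process X_t = phi * X_{t-1} + e_t with |phi| < 1 and innovation variance sigma2, with a single missing value and all other values observed. In that case the best linear interpolator is phi / (1 + phi^2) times the sum of the two neighbouring values, and the interpolation error variance is sigma2 / (1 + phi^2). This is the classical closed form, not the alternating projection derivation of the article, and the function names are hypothetical.

```python
def ar1_interpolate(previous, following, phi):
    """Best linear interpolator of one missing value of a zero-mean AR(1)
    process, given its two neighbours (classical closed form)."""
    return phi / (1.0 + phi ** 2) * (previous + following)


def ar1_interpolation_error_variance(phi, sigma2=1.0):
    """Variance of the interpolation error for the same AR(1) setting."""
    return sigma2 / (1.0 + phi ** 2)


# Hypothetical values: neighbours 1.0 and 3.0, autoregressive parameter 0.5.
estimate = ar1_interpolate(1.0, 3.0, phi=0.5)
error_variance = ar1_interpolation_error_variance(0.5, sigma2=1.0)
```

The error variance is smaller than the one-step prediction error variance sigma2 because the interpolator uses information from both sides of the gap.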
